Comments (19)
Situation: \ufffd
exists because it shouldn't ever appear in a correctly-encoded file on purpose
"We need to put a raw \ufffd
in a file so we can test it"
Situation: \ufffd
exists in a correctly-encoded file on purpose
from typescript.
Just write '\ufffd'
instead? Putting the signifier that says "this file was corrupted" into a file, causing a tool for humans to say "this file looks corrupted", seems like correct behavior.
from typescript.
I always thought the entire point of the 0xFFFD
codepoint was that it's only supposed to be generated by a text decoding failure and has no reason to ever appear literally in the source text. Like Ryan says it's specifically designated as an indicator of corrupted text.
from typescript.
Really the thing we need is for Buffer.toString
to say when it inserted a replacement character instead of directly reading one. Without that information there's not a lot we can do that isn't expensive.
from typescript.
Since we have a Buffer in sys.readFile
(thanks to us dealing with BOMs and so on), we can technically write:
const buf = fs.readFileSync(p);
const asText = buf.toString("utf8");
const badDecode = Buffer.compare(buf, Buffer.from(asText, "utf8")) === 0;
We could do that conditionally based on the presence of U+FFFD, but then we don't have any way to bubble that information up (because readFile returns just a string, and the point would be to not error on parse because the lower-level thing said it was okay).
from typescript.
"exists because it shouldn't ever appear in a correctly-encoded file on purpose"
I don't think this is really the case. There are many algorithms in the html spec, and markdown spec, that do things that produce the replacement character.
It's like saying NaN shouldn't exist. Of course you don't want NaN normally but it has to exist as a concept. And that also means that you can check for it in code. And in tests. You have to be able to talk about it. The Unicode spec and Wikipedia and html and markdown need to be able to talk about the character.
Also, no other tool does what TS just started doing.
I'd think it's better to check for control characters instead of a replacement character. Perhaps before that toString
from typescript.
It's like saying NaN shouldn't exist. Of course you don't want NaN normally but it has to exist as a concept. And that also means that you can check for it in code. And in tests. You have to be able to talk about it. The Unicode spec and Wikipedia and html and markdown need to be able to talk about the character.
NaN is actually a really good analogy, but maybe not for the reason you think: there's a big difference between testing for a computation that produces NaNs, as a way to detect bugs, vs. literally writing
let x = NaN;
The minute you assume the latter is allowed to happen in the wild, for any reason, then the former test becomes useless as an error-detection mechanism (this is probably the rationale behind why direct tests against NaN don't work, come to think of it; it discourages writing them literally for any reason). Yes, x
is equal to NaN
, but it was done on purpose in this case and doesn't indicate an invalid computation. Which is pretty much exactly like the problem we're having here.
Now, that having been said, I will admit that testing for binary files by looking for a Unicode decode failure is kind of hacky. But alas, design constraints (specifically, the test is done at a time when all the compiler has access to is the UTF-8 decoded text from the source file).
from typescript.
Buffer#toString
resulting in �
does not mean that Buffer#toString
is the only function that is ever allowed to result in �
per Unicode.
Markdown for example, which has to deal with potentially malicious authors, has to make documents safe. So, the function markdown(input)
also produces �
.
Here are some more practical examples:
- 15 hits in WHATWG, https://github.com/search?q=org%3Awhatwg+%22%EF%BF%BD%22&type=code
- 16 hits in my personal code: https://github.com/search?q=%22%EF%BF%BD%22+user%3Awooorm&type=code
Even if there were never �
s in input, I also argue that Buffer#toString
(implying Buffer#toString('utf8')
) resulting in �
does not mean that a file is binary. Whether a buffer is valid UTF-8 is not the same as whether a buffer is binary or not.
Looking on npm for “is binary” and checking out the code, yields:
- https://github.com/gjtorikian/isBinaryFile/blob/main/src/index.ts
- https://github.com/bevry/istextorbinary/blob/master/source/index.ts
- https://github.com/sindresorhus/file-type
In the TS code base, there is access to the buffer. And UTF-8 is enforced already. So the bytes can be checked? Not too complex? https://github.com/wayfind/is-utf8/blob/master/is-utf8.js
from typescript.
(this is probably the rationale behind why direct tests against NaN don't work, come to think of it; it discourages writing them literally for any reason).
Off-topic, but NaN compares false to NaN because there are multiple bit patterns that can represent it and never equal is much better than sometimes equal.
from typescript.
Yes, I'm aware that's the theory - but I tend to suspect there's a more pragmatic rationale behind it too 😅 I don't think many people are pulling NaNs apart to inspect their low-level bit patterns, and if they are that's well outside the purview of the IEEE float spec, so from the perspective of normal FP operations the existence of multiple representations isn't even observable.
from typescript.
¯\_(ツ)_/¯ I guess it can be two things
from typescript.
In the TS code base, there is access to the buffer. And UTF-8 is enforced already. So the bytes can be checked? Not too complex?
Everything can be fixed; the question is how much slower everyone's tsc should be in order for a handful of test files to not need to use escape sequences.
from typescript.
Indeed! But then I’d move one more step back: why are people passing 700mb video files through typescript? How many folks are doing that 😅
The patch for that already only looks at the first 256 characters. For every file.
There is apparently code for BOMs too.
I’d wager that looking at 256 first bytes with something like https://github.com/gjtorikian/isBinaryFile/blob/main/src/index.ts, particularly when modified to actually do what the goal is (bail quickly when not UTF-8), will be so slow.
from typescript.
Looking more into the original issue, with the knowledge that many binary file formats have a “BOM-like” signature in the first few bytes: #21136 (comment)
from typescript.
0x47 is "G". I don't know why someone seeing an error when the letter G appears at two random offsets would be less surprised to see "file is binary" than you are when looking at �.
from typescript.
Right. You still can use those uncommon 2 bytes to go into a slightly slower path tho?
from typescript.
We're also not just trying to detect MPEG transport streams. There are other tools, e.g. Expo, which are/were emitting arbitrary binary files into .js file extensions and causing slowdowns because we tried to parse a giant garbage file.
from typescript.
Interesting! I dunno, detecting “arbitrary binary files” made by various tools is just going to be complex I think.
Right now TS throws. I assume that on giant binary files TS was also throwing.
Why not improve that crash with a better message, check whether something appeared binary, and suggest adding an ignore pattern?
Or also when files are like 1mb+, do a small byte check then?
Or when �
appears with this recent patch.
And also, as @jakebailey mentions, comparing the buffer: #57930 (comment).
I feel like there are a couple cases where a slightly more thorough check can be done. And I’m not 100% that that check is that slow.
from typescript.
PRs accepted
from typescript.
Related Issues (20)
- C++-style `const` modifier on class members HOT 2
- Package Import is not working from neither way "CommonJS" or "Module" HOT 4
- Type guard for child not transitive parent object HOT 3
- Error when compiled JavaScript initializes static properties in a class with a hard private method (`#`) that references a static property
- The ?: inference failed; no selection was made between ifTrue and ifFalse. HOT 1
- Add Support for Flow's new component syntax HOT 6
- Documentation: "Creating and Printing a TypeScript AST" example does not work after v4.9.5
- `satisfies` does not work on a variable called `type`
- API: Can't seem to extract JsDoc tags from declaration HOT 2
- Add option to detect and strip internal exports
- false error on generic type alias parameters HOT 5
- Unable to declare an interface that extends `Record<string, nonAnyType>` with additional support functions. HOT 7
- Add support for generic types on a index accessor HOT 2
- `getTextOfJSDocComment` is stripping `#` from JSDoc comment HOT 1
- Enable `allowJs` with `isolatedDeclarations` HOT 1
- JSDoc comment string with the keyword "@private" before import statement in JS file result in cryptic error TS1191 during compilation HOT 2
- `NoInfer` isn't erased and breaks type narrowing HOT 3
- function with overloading incorrectly defines generic variables HOT 3
- TS2590: Expression produces a union type that is too complex to represent, with simple file using Tuples HOT 6
- Debug Failure in transformClassLike in "ghost" after #56955 HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from typescript.