How do these programs distinguish between "text" and "binary" files?

Before we answer this question, let us first try to come up with a definition. Clearly, on a fundamental file-system level, every file is just a collection of bytes and could therefore be viewed as binary data. On the other hand, a distinction between "text" and "non-text" (hereafter: "binary") data seems helpful for programs like grep or diff, if only not to mess up the output of your terminal emulator. So maybe we can start by defining "text" data.

It seems reasonable to begin with an abstract notion of text as a sequence of Unicode code points. Examples of code points are characters like k, ä or א, as well as special symbols like %, ☢ or □.

To store a given text as a sequence of bytes, we need to choose an encoding. If we want to be able to represent the whole Unicode range, we typically choose UTF-8, sometimes UTF-16 or UTF-32. Historically, encodings which support just a part of today's Unicode are also important. The most prominent ones are US-ASCII and Latin-1 (ISO 8859-1), but there are many more. And all of these look different on a byte level.

Given just the contents of a file (not the history of how it was created), we can therefore try the following definition: a file is called a "text file" if its content consists of an encoded sequence of Unicode code points.

There are two practical problems with this definition. First, we would need a list of all possible encodings. Second, in order to test if the contents of a file are encoded in a given encoding, we would have to decode the whole contents of the file and see if it succeeds¹.

It turns out that there is a much faster way to distinguish between text and binary files, but it comes at the cost of precision. To see how this works, let's go back to our two candidate files and explore their byte-level content. I am using hexyl as a hex viewer, but you can also use hexdump -C.

Note that both files contain bytes within and outside of the ASCII range (00…7f). The four bytes f0 9f 8c 8d in the message file, for example, are the UTF-8 encoded version of the Unicode code point U+1F30D (🌍). On the other hand, the bytes 50 4e 47 at the beginning of the white image are a simple ASCII-encoded version of the characters PNG². So clearly, looking at bytes outside the ASCII range cannot be used as a method to detect "binary" files.

However, there is a difference between the two files: the image file contains a lot of NULL bytes (00), while the short text message does not. It turns out that this can be turned into a simple heuristic method to detect binary files, since a lot of encoded text data does not contain any NULL bytes, even though it might be legal for it to do so (UTF-16-encoded ASCII text, for example, contains a NULL byte in every other position). Nevertheless, this heuristic approach is very useful: I have written a small library in Rust which uses a slightly refined version of this method to quickly determine whether a given file contains "binary" or "text" data.

In fact, the NULL-byte check is exactly what diff and grep use to detect "binary" files. The following macro is included in diff's source code (src/io.c):
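```c
/* Reproduced from memory of GNU diffutils (src/io.c); consult the
 * diffutils sources for the exact form.  A buffer counts as "binary"
 * if it contains at least one NUL byte among its first `size` bytes. */
#define binary_file_p(buf, size) (memchr (buf, 0, size) != 0)
```

The entire test is a single `memchr` call: scan `size` bytes of `buf` for a zero byte and classify the data as binary on the first hit.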
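To make the heuristic concrete, here is a minimal sketch of the same NULL-byte check in Rust. This is not the actual API of the library mentioned above, and the file names are placeholders for the two candidate files; it simply inspects the first kilobyte of a file:

```rust
use std::fs::File;
use std::io::{self, Read};

/// Minimal sketch of the NULL-byte heuristic (not the library's real
/// API): a file is classified as "binary" if any of its first 1024
/// bytes is a NULL byte.
fn is_binary(path: &str) -> io::Result<bool> {
    let mut buf = Vec::new();
    // Reading only an initial chunk keeps the check fast even for
    // very large files; this is where the speed advantage over full
    // decoding comes from.
    File::open(path)?.take(1024).read_to_end(&mut buf)?;
    Ok(buf.contains(&0))
}

fn main() -> io::Result<()> {
    // Placeholder names for the two candidate files.
    for path in ["message.txt", "white.png"] {
        println!("{}: binary = {}", path, is_binary(path)?);
    }
    Ok(())
}
```

Note that UTF-16-encoded text would be misclassified as binary by this check, which is one reason a real implementation needs to refine it.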
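For comparison, the "precise" approach from the definition above (decode the contents and see if it succeeds) looks roughly like this for the single encoding UTF-8; a complete solution would have to repeat the test for every candidate encoding:

```rust
use std::{fs, io, str};

/// The "precise" test for one specific encoding: read the whole file
/// and check whether its bytes form valid UTF-8.
fn is_valid_utf8(path: &str) -> io::Result<bool> {
    let bytes = fs::read(path)?;
    Ok(str::from_utf8(&bytes).is_ok())
}

fn main() -> io::Result<()> {
    // "message.txt" is a placeholder for the text candidate file.
    println!("valid UTF-8: {}", is_valid_utf8("message.txt")?);
    Ok(())
}
```

This is both slower (the whole file has to be read and decoded) and still incomplete (it covers only one encoding), which is exactly the problem the NULL-byte heuristic sidesteps.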
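Finally, a quick way to convince yourself that encodings really do look different on a byte level, and to reproduce the f0 9f 8c 8d sequence from the hex dump, is to print the UTF-8 bytes of a few code points directly:

```rust
fn main() {
    // "ä" (U+00E4) is a single byte in Latin-1 (e4) but two bytes in
    // UTF-8; the emoji (U+1F30D) is a single code point encoded as
    // four UTF-8 bytes.
    for s in ["ä", "🌍"] {
        let hex: Vec<String> = s.bytes().map(|b| format!("{:02x}", b)).collect();
        println!("{} -> {}", s, hex.join(" "));
    }
}
```

Running this prints `c3 a4` for ä and `f0 9f 8c 8d` for 🌍, matching the bytes observed in the message file.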