#+TITLE: Handling binary documents with ASCII-compatible markup
#+OPTIONS: ^:{}

Python 3 has =bytes=, which are sequences of 8-bit integers, and =str=, sequences of Unicode code-points. To go from one to the other, you need to encode or decode by giving an explicit encoding.

There are many protocols where the markup is ASCII, even if the data is in some other encoding that you don't know. If you know that other encoding is ASCII-compatible, it is useful to be able to parse, split, etc. the markup while passing the payload through untouched.

An initial search on the Internet brought up [[http://www.catb.org/esr/faqs/practical-python-porting/#_what_does_work][an article by Eric S. Raymond]] that touches on this and suggests decoding the data as ISO-8859-1, handling it as =str=, and then recovering the payload losslessly by re-encoding it. The first 256 codepoints of Unicode are exactly the ISO-8859-1 codepoints (a bit more on that further down). As a result, the following round-trip is lossless:

#+BEGIN_SRC
>>> by = bytes(range(0x100))
>>> by2 = bytes(str(by, encoding="iso-8859-1"), encoding="iso-8859-1")
>>> by == by2
True
#+END_SRC

A few days later I came across [[https://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html][that article by Alyssa Coghlan]], which mentions the same idea, and also the existence of the [[https://peps.python.org/pep-0383/][~errors="surrogateescape"~ (PEP 383)]] error handler (also: [[https://docs.python.org/3/library/codecs.html][codecs in the Python documentation]]), which is designed to allow exactly what I needed:

#+BEGIN_SRC
>>> by = bytes(range(0x100))
>>> by2 = bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")
>>> by == by2
True
#+END_SRC

Alyssa Coghlan has some discussion about the merits of each approach; I can't really say that functionally they have any meaningful difference. As she points out, if you do anything but re-encode the non-ASCII parts using the same codec, you risk getting Mojibake (or, if you're lucky, a =UnicodeDecodeError= rather than silently produced garbage).

Performance-wise, let's see:

#+BEGIN_SRC
>>> import timeit
>>> timeit.timeit("""bytes(str(by, encoding="ISO-8859-1"), encoding="ISO-8859-1")""", setup="import random; by = random.randbytes(10_000)")
0.8885893229962676
>>> timeit.timeit("""bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")""", setup="import random; by = random.randbytes(10_000)")
125.00223343299876
#+END_SRC

That's... a very large difference. ESR's article points out that ISO-8859-1 has some properties that make it so efficient (it maps bytes 0x80--0xff to Unicode code-points of the same numeric value, so there is no translation cost, and the in-memory representation is more efficient). Trying increasing sizes:

#+BEGIN_SRC
>>> for size in 10, 100, 1_000, 2_000, 5_000, 10_000, 20_000:
...     by = random.randbytes(size)
...     duration = timeit.timeit("""bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")""", globals={"by": by}, number=100_000)
...     print(f"{size}\t{duration}")
...
10	0.0650910490003298
100	0.1047916579991579
1000	0.5472217770002317
2000	1.5103355319952243
5000	5.779067411000142
10000	12.497241530996689
20000	25.78209423399676
#+END_SRC

That seems to grow faster than O(n): the duration seems to go ×3 when the size goes ×2. Is it growing like O(n^{1.5})?
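To put a rough number on that guess, one could fit the exponent from the measurements above. This is just my own quick check, not something from either article; a minimal sketch, using the surrogateescape timings quoted above (rounded), and skipping the two smallest sizes, which look dominated by fixed overhead:

#+BEGIN_SRC
# Least-squares fit of log(duration) against log(size); the slope is the
# apparent growth exponent.  Durations are the surrogateescape measurements
# quoted above (rounded), for sizes 1_000 .. 20_000.
import math

sizes = [1_000, 2_000, 5_000, 10_000, 20_000]
durations = [0.547, 1.510, 5.779, 12.497, 25.782]

xs = [math.log(n) for n in sizes]
ys = [math.log(d) for d in durations]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"fitted exponent ~ {slope:.2f}")
#+END_SRC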
In contrast, the ISO-8859-1 method seems to have O(n) complexity:

#+BEGIN_SRC
>>> for size in 10, 100, 1_000, 2_000, 5_000, 10_000, 20_000, 50_000, 100_000:
...     by = random.randbytes(size)
...     duration = timeit.timeit("""bytes(str(by, encoding="iso-8859-1"), encoding="iso-8859-1")""", globals={"by": by}, number=100_000)
...     print(f"{size}\t{duration}")
...
10	0.05453772499458864
100	0.037617702000716235
1000	0.05454556500626495
2000	0.05654650100041181
5000	0.06352802200126462
10000	0.0898260960020707
20000	0.23981017799815163
50000	0.4997737009980483
100000	0.9646763860000647
#+END_SRC

--------

By design of Unicode, the first 256 codepoints are the same as those of ISO 8859-1, and the conversion between Unicode and ISO 8859-1 is implemented in C in CPython. That's why the technique described above works losslessly, and is so much faster than any mapping to surrogate escapes (even if I still don't understand why that latter approach is so slow).

I recently learned (between =$WORK= making me investigate character encoding issues, and [[../stuff/char/char.html#ISO-8859-1][making a character table explorer]] as a JavaScript learning exercise) about the C1 control characters. [[https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C1_controls][Wikipedia has a write-up]]; in short, in 1973, during their work on 8-bit encodings for languages beyond English, ECMA and ISO came up with an extra series of control characters in the =0x80--0x9F= range, which would allow things like terminal control (the =0x9B= /Control Sequence Introducer/ for ANSI escape sequences), switching to a different encoding (e.g. SS2/SS3, used by EUC-JP), etc. However, it would still be necessary to represent those in a 7-bit environment, so each of them could also be written as =0x1B= ␛ followed by an ASCII char (e.g. /Control Sequence Introducer/ could be spelled =0x1B [=).

Both of the following print =HI= in reverse video with xfce4-terminal and libvte-2.91 v0.70.6, but the second example doesn't work in XTerm version 379. I guess XTerm doesn't support (or dropped support for) the C1-controls spelling of the ANSI escape sequences.

#+BEGIN_SRC
$ printf "\\u001b[7m HI \\u001b[m Normal\n"
HI Normal
$ printf "\\u009B7m HI \\u009Bm Normal\n"
HI Normal
#+END_SRC

Anyway, those C1 controls were almost never used, but they are reserved in the ISO 8859-X encodings. Windows-1252, presumably noticing that lack of use, assigned printable glyphs to those positions. [[https://encoding.spec.whatwg.org/#names-and-labels][HTML5 aliases ISO 8859-1 to Windows-1252]], I guess because it wouldn't make sense to use control characters in an HTML document, so if those bytes appear, it must be because the author actually meant Windows-1252 (and if they don't appear, ISO 8859-1 and Windows-1252 are identical outside the range of the C1 controls).
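That last point can be checked quickly from a Python REPL (my own check, not from the spec), using CPython's built-in codecs: the two encodings only disagree on the C1 range.

#+BEGIN_SRC
>>> b"\x93Hi\x94".decode("windows-1252")   # C1 bytes: curly quotes in Windows-1252…
'“Hi”'
>>> b"\x93Hi\x94".decode("iso-8859-1")     # …but C1 control characters in ISO 8859-1
'\x93Hi\x94'
>>> all(bytes([b]).decode("windows-1252") == bytes([b]).decode("iso-8859-1")
...     for b in range(0x00, 0x80))
True
>>> all(bytes([b]).decode("windows-1252") == bytes([b]).decode("iso-8859-1")
...     for b in range(0xA0, 0x100))
True
#+END_SRC

(Note that Python's own cp1252 codec leaves five of the C1 positions unassigned and raises =UnicodeDecodeError= on them, whereas the WHATWG Windows-1252 mapping decodes them back to the C1 controls.)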