#+TITLE: Handling binary documents with ASCII-compatible markup
#+OPTIONS: ^:{}

Python 3 has =bytes=, which are sequences of 8-bit integers, and =str=, sequences of Unicode code-points. To go from one to the other, you need to encode or decode by giving an explicit encoding.

There are many protocols where the markup is ASCII, even if the data is in some other encoding that you don't know. If you know that other encoding is ASCII-compatible, it is useful to be able to parse, split, etc. the markup while passing the payload through untouched.

An initial search on the Internet brought up [[http://www.catb.org/esr/faqs/practical-python-porting/#_what_does_work][an article by Eric S. Raymond]] that touches on this and suggests decoding the data as ISO-8859-1, handling it as =str=, and then recovering the payload losslessly by re-encoding it. The first 256 codepoints of Unicode are exactly the ISO-8859-1 codepoints (a bit more on that further down). As a result, the following round-trip is lossless:

#+BEGIN_SRC
>>> by = bytes(range(0x100))
>>> by2 = bytes(str(by, encoding="iso-8859-1"), encoding="iso-8859-1")
>>> by == by2
True
#+END_SRC

A few days later I came across [[https://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html][that article by Alyssa Coghlan]], which mentions the same idea, and also the existence of the [[https://peps.python.org/pep-0383/][~errors="surrogateescape"~ (PEP 383)]] error handler (also: [[https://docs.python.org/3/library/codecs.html][codecs in the Python documentation]]), which is designed to allow exactly what I needed:

#+BEGIN_SRC
>>> by = bytes(range(0x100))
>>> by2 = bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")
>>> by == by2
True
#+END_SRC

Alyssa Coghlan has some discussion about the merits of each approach; I can't really say that functionally they have any meaningful difference. As she points out, if you do anything but re-encode the non-ASCII parts using the same codec, you risk getting Mojibake (or, if you're lucky, a =UnicodeDecodeError= rather than silently produced garbage).

Performance-wise, let's see:

#+BEGIN_SRC
>>> import timeit
>>> timeit.timeit("""bytes(str(by, encoding="ISO-8859-1"), encoding="ISO-8859-1")""", setup="import random; by = random.randbytes(10_000)")
0.8885893229962676
>>> timeit.timeit("""bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")""", setup="import random; by = random.randbytes(10_000)")
125.00223343299876
#+END_SRC

That's... a very large difference. ESR's article points out that ISO-8859-1 has some properties that make it so efficient (it maps bytes 0x80--0xff to Unicode code-points of the same numeric value, so there is no translation cost, and the in-memory representation is more efficient). Trying increasing sizes:

#+BEGIN_SRC
>>> for size in 10, 100, 1_000, 2_000, 5_000, 10_000, 20_000:
...     by = random.randbytes(size)
...     duration = timeit.timeit("""bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")""", globals={"by": by}, number=100_000)
...     print(f"{size}\t{duration}")
...
10	0.0650910490003298
100	0.1047916579991579
1000	0.5472217770002317
2000	1.5103355319952243
5000	5.779067411000142
10000	12.497241530996689
20000	25.78209423399676
#+END_SRC

That seems to grow faster than O(n): the duration seems to go ×3 when the size goes ×2. Is it growing like O(n^{1.5})?
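To put a rough number on that guess, one could fit the exponent from the measurements above. This is just my own quick check, not something from either article; a minimal sketch, using the surrogateescape timings quoted above (rounded), and skipping the two smallest sizes, which look dominated by fixed overhead:

#+BEGIN_SRC
# Least-squares fit of log(duration) against log(size); the slope is the
# apparent growth exponent.  Durations are the surrogateescape measurements
# quoted above (rounded), for sizes 1_000 .. 20_000.
import math

sizes = [1_000, 2_000, 5_000, 10_000, 20_000]
durations = [0.547, 1.510, 5.779, 12.497, 25.782]

xs = [math.log(n) for n in sizes]
ys = [math.log(d) for d in durations]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"fitted exponent ~ {slope:.2f}")
#+END_SRC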
In contrast, the ISO-8859-1 method seems to have O(n) complexity:

#+BEGIN_SRC
>>> for size in 10, 100, 1_000, 2_000, 5_000, 10_000, 20_000, 50_000, 100_000:
...     by = random.randbytes(size)
...     duration = timeit.timeit("""bytes(str(by, encoding="iso-8859-1"), encoding="iso-8859-1")""", globals={"by": by}, number=100_000)
...     print(f"{size}\t{duration}")
...
10	0.05453772499458864
100	0.037617702000716235
1000	0.05454556500626495
2000	0.05654650100041181
5000	0.06352802200126462
10000	0.0898260960020707
20000	0.23981017799815163
50000	0.4997737009980483
100000	0.9646763860000647
#+END_SRC

--------

By design of Unicode, the first 256 codepoints are the same as those of ISO 8859-1, and the conversion between Unicode and ISO 8859-1 is implemented in C in CPython. That's why the technique described above works losslessly, and is so much faster than any mapping to surrogate escapes (even if I still don't understand why that latter approach is so slow).

I recently learned (between =$WORK= making me investigate character encoding issues, and [[../stuff/char/char.html#ISO-8859-1][making a character table explorer]] as a JavaScript learning exercise) about the C1 control characters. [[https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C1_controls][Wikipedia has a write-up]]; in short, in 1973, during their work on 8-bit encodings for languages beyond English, ECMA and ISO came up with an extra series of control characters in the =0x80--0x9F= range, which would allow things like terminal control (the =0x9B= /Control Sequence Introducer/ for ANSI escape sequences), switching to a different encoding (e.g. SS2/SS3, used by EUC-JP), etc. However, it would still be necessary to represent those in a 7-bit environment, so each of them could also be written as =0x1B= ␛ followed by an ASCII char (e.g. /Control Sequence Introducer/ could be spelled =0x1B [=).

Both of the following print =HI= in reverse video with xfce4-terminal and libvte-2.91 v0.70.6, but the second example doesn't work in XTerm version 379. I guess XTerm doesn't support (or dropped support for) the C1-controls spelling of the ANSI escape sequences.

#+BEGIN_SRC
$ printf "\\u001b[7m HI \\u001b[m Normal\n"
HI Normal
$ printf "\\u009B7m HI \\u009Bm Normal\n"
HI Normal
#+END_SRC

Anyway, those C1 controls were almost never used, but they are reserved in the ISO 8859-X encodings. Windows-1252, presumably noticing that lack of use, assigned printable glyphs to those positions. [[https://encoding.spec.whatwg.org/#names-and-labels][HTML5 aliases ISO 8859-1 to Windows-1252]], I guess because it wouldn't make sense to use control characters in an HTML document, so if those bytes appear, it must be because the author actually meant Windows-1252 (and if they don't appear, ISO 8859-1 and Windows-1252 are identical outside the range of the C1 controls).
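That last point can be checked quickly from a Python REPL (my own check, not from the spec), using CPython's built-in codecs: the two encodings only disagree on the C1 range.

#+BEGIN_SRC
>>> b"\x93Hi\x94".decode("windows-1252")   # C1 bytes: curly quotes in Windows-1252…
'“Hi”'
>>> b"\x93Hi\x94".decode("iso-8859-1")     # …but C1 control characters in ISO 8859-1
'\x93Hi\x94'
>>> all(bytes([b]).decode("windows-1252") == bytes([b]).decode("iso-8859-1")
...     for b in range(0x00, 0x80))
True
>>> all(bytes([b]).decode("windows-1252") == bytes([b]).decode("iso-8859-1")
...     for b in range(0xA0, 0x100))
True
#+END_SRC

(Note that Python's own cp1252 codec leaves five of the C1 positions unassigned and raises =UnicodeDecodeError= on them, whereas the WHATWG Windows-1252 mapping decodes them back to the C1 controls.)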