Handling binary documents with ASCII-compatible markup
Python 3 has bytes, which are sequences of 8-bit integers, and str, which are
sequences of Unicode code points. To go from one to the other, you need to
encode or decode, giving an explicit encoding. There are many protocols
where the markup is ASCII, even if the data is in some other encoding that you
don't know. If you know that other encoding is ASCII-compatible, it is useful
to be able to parse, split, etc. the markup while passing the payload through
untouched.
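For instance (a made-up message, just to illustrate the problem), a strict
decode fails as soon as it hits a payload byte that doesn't fit the chosen
encoding:

>>> message = b"Subject: caf\xe9\r\n\r\n..."  # ASCII markup, payload encoding unknown
>>> str(message, encoding="utf-8")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 12: invalid continuation byte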
An initial search on the Internet brought up an article by Eric S. Raymond
that touches on this, and suggests decoding the data as ISO-8859-1, handling
it as str, and then losslessly recovering the payload by re-encoding it. The
first 256 code points of Unicode are exactly the ISO-8859-1 code points (a bit
more on that further down). As a result, the following round-trips losslessly:
>>> by = bytes(range(0x100))
>>> by2 = bytes(str(by, encoding="iso-8859-1"), encoding="iso-8859-1")
>>> by == by2
True
A few days later I came across an article by Alyssa Coghlan, which mentions
the same idea, and also the existence of the errors="surrogateescape"
(PEP 383) error handler (also: codecs in the Python documentation), which is
designed to allow exactly what I needed:
>>> by = bytes(range(0x100))
>>> by2 = bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")
>>> by == by2
True
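That makes the intended usage work end to end: parse and split on the ASCII
markup as str, then re-encode the payload bytes unchanged. A small sketch,
with a made-up MIME-like message:

>>> raw = b"Content-Type: application/octet-stream\r\n\r\ncaf\xe9"
>>> text = str(raw, encoding="ascii", errors="surrogateescape")
>>> headers, _, payload = text.partition("\r\n\r\n")
>>> headers
'Content-Type: application/octet-stream'
>>> bytes(payload, encoding="ascii", errors="surrogateescape")
b'caf\xe9'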
Alyssa Coghlan offers some discussion of the merits of each approach; I can't
really say that functionally they have any meaningful difference. As she
points out, if you do anything but re-encode the non-ASCII parts using the
same codec, you risk getting mojibake (or, if you're lucky, a
UnicodeEncodeError rather than silently produced garbage).
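Both failure modes are easy to reproduce; here 0xE9 stands in for a byte the
author meant in some encoding other than Latin-1:

>>> str(b"caf\xe9", encoding="iso-8859-1").encode("utf-8")  # silently not the original bytes
b'caf\xc3\xa9'
>>> str(b"caf\xe9", encoding="ascii", errors="surrogateescape").encode("utf-8")
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 3: surrogates not allowed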
Performance-wise, let's see:
>>> import timeit
>>> timeit.timeit("""bytes(str(by, encoding="ISO-8859-1"), encoding="ISO-8859-1")""",
...               setup="import random; by = random.randbytes(10_000)")
0.8885893229962676
>>> timeit.timeit("""bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")""",
...               setup="import random; by = random.randbytes(10_000)")
125.00223343299876
That's… a very large difference. ESR's article points out that ISO-8859-1 has some properties that make it efficient: it maps bytes 0x80–0xFF to Unicode code points of the same numeric value, so there is no translation cost, and the in-memory representation is more efficient. Trying increasing sizes:
>>> for size in 10, 100, 1_000, 2_000, 5_000, 10_000, 20_000:
...     by = random.randbytes(size)
...     duration = timeit.timeit(
...         """bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")""",
...         globals={"by": by}, number=100_000)
...     print(f"{size}\t{duration}")
...
10	0.0650910490003298
100	0.1047916579991579
1000	0.5472217770002317
2000	1.5103355319952243
5000	5.779067411000142
10000	12.497241530996689
20000	25.78209423399676
That seems to grow faster than O(n): the duration roughly triples when the size doubles. Is it growing like O(n^1.5)?
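As a rough sanity check of that exponent, from the 1 000 → 2 000 step in the
measurements above:

>>> from math import log2
>>> round(log2(1.5103355319952243 / 0.5472217770002317), 2)
1.46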
In contrast, the ISO-8859-1 method seems to have O(n) complexity:
>>> for size in 10, 100, 1_000, 2_000, 5_000, 10_000, 20_000, 50_000, 100_000:
...     by = random.randbytes(size)
...     duration = timeit.timeit(
...         """bytes(str(by, encoding="iso-8859-1"), encoding="iso-8859-1")""",
...         globals={"by": by}, number=100_000)
...     print(f"{size}\t{duration}")
...
10	0.05453772499458864
100	0.037617702000716235
1000	0.05454556500626495
2000	0.05654650100041181
5000	0.06352802200126462
10000	0.0898260960020707
20000	0.23981017799815163
50000	0.4997737009980483
100000	0.9646763860000647
By design of Unicode, the first 256 code points are the same as those of ISO 8859-1, and the conversion between Unicode and ISO 8859-1 is implemented in C in CPython. That's why the technique described above works losslessly, and why it is so much faster than the mapping to surrogate escapes (even if I still don't understand why the latter approach is that slow).
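The compact in-memory representation (PEP 393) can be seen with
sys.getsizeof: a str holding only Latin-1 characters is stored one byte per
character, while the surrogate-escaped version needs two. The exact figures
below are from a 64-bit CPython and will vary between builds:

>>> import sys
>>> sys.getsizeof(str(bytes(range(0x100)), encoding="iso-8859-1"))
329
>>> sys.getsizeof(str(bytes(range(0x100)), encoding="ascii", errors="surrogateescape"))
586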
I recently learned (between $WORK making me investigate character-encoding
issues, and making a character table explorer as a JavaScript learning
exercise) about the C1 control characters. Wikipedia has a write-up; in short,
in 1973, during their work on 8-bit encodings for languages beyond English,
ECMA and ISO came up with an extra series of control characters in the
0x80–0x9F range, which would allow things like terminal control (the 0x9B
Control Sequence Introducer for ANSI escape sequences), switching to a
different encoding (eg. SS2/SS3, used by EUC-JP), etc. However, it would still
be necessary to represent those in a 7-bit environment, so each of them can
also be written as 0x1B ␛ followed by an ASCII character (eg. the Control
Sequence Introducer can be spelled 0x1B [). Both of the following print HI
in reverse video with xfce4-terminal and libvte-2.91 v0.70.6, but the second
example doesn't work in XTerm version 379. I guess XTerm doesn't support (or
dropped support for) the C1 spelling of the ANSI escape sequences.
$ printf "\\u001b[7m HI \\u001b[m Normal\n"
 HI  Normal
$ printf "\\u009B7m HI \\u009Bm Normal\n"
 HI  Normal
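For terminals that only understand the 7-bit spellings, the mapping is
mechanical: each C1 control in 0x80–0x9F is equivalent to 0x1B followed by
the same byte minus 0x40. A quick sketch of a downgrade filter (the function
is mine, not from either article):

>>> def c1_to_7bit(data: bytes) -> bytes:
...     """Rewrite 8-bit C1 controls (0x80-0x9F) as their 7-bit ESC spellings."""
...     out = bytearray()
...     for b in data:
...         if 0x80 <= b <= 0x9F:
...             out += bytes((0x1B, b - 0x40))  # e.g. 0x9B CSI -> ESC [
...         else:
...             out.append(b)
...     return bytes(out)
...
>>> c1_to_7bit(b"\x9b7m HI \x9bm Normal\n")
b'\x1b[7m HI \x1b[m Normal\n'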
Anyway, those C1 controls were almost never used, but they are reserved in the ISO 8859-X encodings. Windows-1252, presumably noticing that lack of use, assigned printable glyphs to those positions. HTML5 aliases ISO 8859-1 to Windows-1252, I guess because it wouldn't make sense to use control characters in an HTML document, so if those bytes appear, it must be because the author actually meant Windows-1252 (and if they didn't, it makes no difference, since ISO 8859-1 and Windows-1252 are identical outside the range of the C1 controls).
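Python's codecs show the difference directly: the same bytes decode to C1
controls under ISO 8859-1 and to printable punctuation under Windows-1252,
and the few positions Windows-1252 leaves unassigned (such as 0x81) are
rejected outright:

>>> b"\x93quotes\x94".decode("iso-8859-1")
'\x93quotes\x94'
>>> b"\x93quotes\x94".decode("windows-1252")
'“quotes”'
>>> b"\x81".decode("windows-1252")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>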