Handling binary documents with ASCII-compatible markup
Python 3 has bytes, which are sequences of 8-bit integers, and str, which are sequences of Unicode code points. To go from one to the other, you need to encode or decode with an explicit encoding. There are many protocols where the markup is ASCII, even though the payload is in some other encoding that you don't know. If you know that the other encoding is ASCII-compatible, it is useful to be able to parse, split, etc. the markup, while passing the payload through untouched.
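For instance, here is a minimal sketch of the situation I have in mind (the message and header name are made up, not taken from any particular protocol):

# A made-up, HTTP-like message: the header markup is plain ASCII, but the
# body is in some unknown ASCII-compatible encoding (here it happens to be
# Latin-1, but the code handling the markup must not need to know that).
message = b"Content-Type: text/plain\r\n\r\ncaf\xe9 au lait"

# Decoding the whole message with a fixed codec is not an option:
try:
    message.decode("ascii")
except UnicodeDecodeError:
    pass  # the body byte 0xe9 is not ASCII

# Yet we would like to use str-based tools (split, regexes, parsing
# libraries) on the markup, and pass the body bytes through unchanged.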
An initial search on the Internet brought up an article by Eric S. Raymond that touches on this, and suggests decoding the data as ISO-8859-1, handling it as str, and then recovering the payload losslessly by re-encoding it. ISO-8859-1 has the property that each byte of the input maps to exactly one code point, and that all 256 possible byte values are mapped to a code point. As a result, the following round-trip is lossless:
>>> by = bytes(range(0x100))
>>> by2 = bytes(str(by, encoding="iso-8859-1"), encoding="iso-8859-1")
>>> by == by2
True
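As a sketch of how this can be used in practice (reusing the made-up message from above; this is just an illustration of the idea, not code from ESR's article):

message = b"Content-Type: text/plain\r\n\r\ncaf\xe9 au lait"

# Decode everything as ISO-8859-1: this never fails, whatever the bytes are.
text = message.decode("iso-8859-1")
header, _, body = text.partition("\r\n\r\n")

# The ASCII markup can be handled with ordinary str operations...
assert header.split(":", 1)[0] == "Content-Type"
# ...and the payload is recovered byte-for-byte by re-encoding.
assert body.encode("iso-8859-1") == b"caf\xe9 au lait"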
A few days later I came across an article by Alyssa Coghlan, which mentions the same idea, and also the existence of the errors="surrogateescape" error handler (PEP 383; see also codecs in the Python documentation), which is designed to allow exactly what I needed:
>>> by = bytes(range(0x100))
>>> by2 = bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")
>>> by == by2
True
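Under the hood, the error handler maps each undecodable byte to a lone surrogate in the U+DC80–U+DCFF range (PEP 383), and maps it back to the original byte on encoding; a small illustration:

# 0xe9 is not valid ASCII, so it is smuggled through as U+DCE9 (0xDC00 + 0xE9).
text = b"caf\xe9".decode("ascii", errors="surrogateescape")
assert text == "caf\udce9"
# The same handler turns the lone surrogate back into the original byte.
assert text.encode("ascii", errors="surrogateescape") == b"caf\xe9"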
Alyssa Coghlan has some discussion about the merits of each approach; I can't really say that functionally they have any meaningful difference. As she points out, if you do anything but re-encode the non-ASCII parts using the same codec, you risk getting mojibake (or, if you're lucky, a UnicodeEncodeError rather than silently produced garbage).
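For instance (reusing the same sample bytes as above), re-encoding with a different codec either garbles the payload or refuses outright:

payload = b"caf\xe9"

# ISO-8859-1 in, UTF-8 out: no exception, but the payload silently changes.
garbled = payload.decode("iso-8859-1").encode("utf-8")
assert garbled == b"caf\xc3\xa9"   # mojibake: not the bytes we started with

# With surrogateescape, a mismatched re-encode refuses instead of lying:
text = payload.decode("ascii", errors="surrogateescape")
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    pass  # lone surrogates cannot be encoded without the error handler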
Performance-wise, let's see:
>>> import timeit
>>> timeit.timeit("""bytes(str(by, encoding="ISO-8859-1"), encoding="ISO-8859-1")""", setup="import random; by = random.randbytes(10_000)")
0.8885893229962676
>>> timeit.timeit("""bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")""", setup="import random; by = random.randbytes(10_000)")
125.00223343299876
That's… a very large difference. ESR's article points out that ISO-8859-1 has some properties that make it efficient: it maps bytes 0x80–0xff to Unicode code points of the same numeric value, so there is no translation cost, and the in-memory representation is more efficient. Trying increasing sizes:
>>> import random
>>> for size in 10, 100, 1_000, 2_000, 5_000, 10_000, 20_000:
...     by = random.randbytes(size)
...     duration = timeit.timeit("""bytes(str(by, encoding="ascii", errors="surrogateescape"), encoding="ascii", errors="surrogateescape")""", globals={"by": by}, number=100_000)
...     print(f"{size}\t{duration}")
...
10	0.0650910490003298
100	0.1047916579991579
1000	0.5472217770002317
2000	1.5103355319952243
5000	5.779067411000142
10000	12.497241530996689
20000	25.78209423399676
That seems to grow faster than O(n); the duration roughly triples when the size doubles. Is it growing like O(n^1.5)?
In contrast, the ISO-8859-1 method seems to have O(n) complexity:
>>> for size in 10, 100, 1_000, 2_000, 5_000, 10_000, 20_000, 50_000, 100_000:
...     by = random.randbytes(size)
...     duration = timeit.timeit("""bytes(str(by, encoding="iso-8859-1"), encoding="iso-8859-1")""", globals={"by": by}, number=100_000)
...     print(f"{size}\t{duration}")
...
10	0.05453772499458864
100	0.037617702000716235
1000	0.05454556500626495
2000	0.05654650100041181
5000	0.06352802200126462
10000	0.0898260960020707
20000	0.23981017799815163
50000	0.4997737009980483
100000	0.9646763860000647