[tex-k] tex-k Digest, Vol 189, Issue 11

Sat Nov 21 17:28:19 CET 2020

Wolfgang Helbig wrote:

>| This rules out UTF-8, which is ASCII for
>| characters 0..127 and 16 bit codes above 127.

The second part of this statement is incorrect.  UTF-8 is a variable-length encoding that converts any Unicode code point (at most 21 bits) into a 1-, 2-, 3-, or 4-byte sequence.  If the high bit of the first byte in the sequence is not set, then it's a 1-byte "sequence" representing the 7 bits of ASCII, from 0 to 127.  Otherwise, a character (code point) is 2, 3, or 4 bytes long, depending on where the code point lies in the full Unicode range (ignoring grapheme clusters).
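
To make the byte counts concrete, here is a minimal Python sketch, not taken from any TeX source.  The function names utf8_length and sequence_length are my own, purely for illustration.  The first maps a code point to the number of bytes UTF-8 uses for it; the second recovers that same count from the first byte of a sequence alone, assuming the input is well-formed UTF-8.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses to encode one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates have no UTF-8 encoding")
    if code_point < 0x80:        # U+0000..U+007F: plain ASCII
        return 1
    if code_point < 0x800:       # U+0080..U+07FF
        return 2
    if code_point < 0x10000:     # U+0800..U+FFFF
        return 3
    return 4                     # U+10000..U+10FFFF


def sequence_length(first_byte: int) -> int:
    """Length of a UTF-8 sequence, judged from its first byte only."""
    if first_byte < 0x80:            # high bit clear: 1-byte ASCII
        return 1
    if 0xC2 <= first_byte <= 0xDF:   # 110xxxxx (0xC0, 0xC1 are never valid)
        return 2
    if 0xE0 <= first_byte <= 0xEF:   # 1110xxxx
        return 3
    if 0xF0 <= first_byte <= 0xF4:   # 11110xxx, capped at U+10FFFF
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


if __name__ == "__main__":
    # Compare both functions against Python's own encoder.
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        encoded = chr(cp).encode("utf-8")
        print(f"U+{cp:04X}", utf8_length(cp), sequence_length(encoded[0]), len(encoded))
```

Nothing above 127 is a fixed 16-bit code: the three counts printed on each line agree with one another, and they grow from 1 to 4 as the code point grows.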

Doug McKenna