[luatex] Behavior of slnunicode.utf8.match().

Manuel Pégourié-Gonnard mpg at elzevir.fr
Tue Aug 9 19:08:14 CEST 2011


On 09/08/2011 14:08, Paul Isambert wrote:
>> http://tug.org/pipermail/luatex/2010-March/thread.html#1242
> 
> Well, I don't know much on the subject -- and I don't have the courage to read
> the entire thread, but the behavior does seem strange to me, so you won't feel
> less alone :)
> 
As far as I remember, the conclusion of the thread was that yes, this is strange
and hardly consistent (or at least you need some education about what kind of
consistency to expect, see below), but it's a design principle of slnunicode,
mainly for simplicity/performance reasons.

My personal, certainly flawed, recollection of the design principle is that for
length and counting, the unit is always the byte, whereas for the rest the
working unit is the (possibly multibyte) character.
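To make the byte side of this concrete, here is a small sketch in plain Lua (no slnunicode involved) that dumps the bytes of "éî". The helper name hex_bytes is made up for the illustration, and the string is written with decimal escapes so that it does not depend on the encoding of the source file:

```lua
local s = "\195\169\195\174"  -- "éî" encoded as UTF-8

-- Hypothetical helper: list the bytes of str as two-digit hex strings.
local function hex_bytes(str)
  local out = {}
  for i = 1, #str do
    out[i] = string.format("%02X", string.byte(str, i))
  end
  return out
end

print(#s)                               -- length in bytes, not characters: 4
print(table.concat(hex_bytes(s), " "))  -- C3 A9 C3 AE
```

So the two characters occupy bytes 1-2 and 3-4, and any byte offset of 2 or 4 points into the middle of a character.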

This is somewhat consistent with your tests:

% Returns "é" (two bytes):
tex.print(slnunicode.utf8.match("éî", ".", 1))

Start at byte one, which is the beginning of a two-byte sequence, everything's fine.

% Returns invalid (one-byte) character:
tex.print(slnunicode.utf8.match("éî", ".", 2))

Start at byte two, which is in the middle of a character, expect breakage.

% Returns "î" (two bytes):
tex.print(slnunicode.utf8.match("éî", ".", 3))

Third byte is the first of the second character, fine.

% Returns invalid (one-byte) character:
tex.print(slnunicode.utf8.match("éî", ".", 4))

Same as above with 2: byte four is the second byte of "î", so again we start in
the middle of a character.
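
If you would rather think in characters than in bytes, one way out is to convert a character index into a byte offset before passing it as the third argument. A minimal sketch in plain Lua, assuming the string is valid UTF-8; char_to_byte_offset is a name I made up, not something slnunicode provides:

```lua
-- Hypothetical helper (not part of slnunicode): byte offset at which the
-- n-th UTF-8 character of s starts, or nil if s has fewer than n characters.
-- Assumes s is valid UTF-8.
local function char_to_byte_offset(s, n)
  local count = 0
  for i = 1, #s do
    local b = string.byte(s, i)
    -- Continuation bytes are 0x80..0xBF; any other byte starts a character.
    if b < 0x80 or b >= 0xC0 then
      count = count + 1
      if count == n then return i end
    end
  end
  return nil
end

local s = "\195\169\195\174"      -- "éî" as UTF-8 bytes
print(char_to_byte_offset(s, 1))  -- 1
print(char_to_byte_offset(s, 2))  -- 3
```

For "éî" this gives offsets 1 and 3, which are exactly the two start positions that behaved well in the tests above.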

Manuel.

