[luatex] Behavior of slnunicode.utf8.match().

Mon Aug 8 09:26:17 CEST 2011

Hello all,

The manual says slnunicode.utf8.match() is normally Unicode-aware, unless one
uses the empty capture. Yet I have stumbled on the following strange behavior
(assuming the file is encoded in UTF-8):

\directlua{
% Returns "é" (two bytes):
tex.print(slnunicode.utf8.match("éî", ".", 1))

% Returns invalid (one-byte) character:
tex.print(slnunicode.utf8.match("éî", ".", 2))

% Returns "î" (two bytes):
tex.print(slnunicode.utf8.match("éî", ".", 3))

% Returns invalid (one-byte) character:
tex.print(slnunicode.utf8.match("éî", ".", 4))
}

I'd expect the second call to return "î", but it looks like the function
counts its starting position in bytes (not in UTF-8 characters), yet returns a
whole UTF-8 character (i.e. more than one byte) if it can do so. So calls 2 and
4 return the second byte of "é" and "î" respectively, while calls 1 and 3
return the complete characters that start at those byte positions.
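
If the third argument really is a byte offset, as the calls above suggest, one
possible workaround is to convert a character index into a byte offset first.
Here is a minimal sketch in plain Lua; char_to_byte is a hypothetical helper of
my own, not part of slnunicode, and it assumes the string is valid UTF-8. It is
meant for a separate .lua file rather than \directlua, where # and backslashes
are treated specially by TeX.

```lua
-- Hypothetical helper (not part of slnunicode): return the byte offset at
-- which the n-th UTF-8 character of s starts, or nil if s is too short.
-- Assumes s is valid UTF-8.
local function char_to_byte(s, n)
  local count = 0
  for i = 1, #s do
    local b = s:byte(i)
    -- Continuation bytes have the form 10xxxxxx (128..191); any other
    -- byte starts a new character.
    if b < 128 or b >= 192 then
      count = count + 1
      if count == n then
        return i
      end
    end
  end
  return nil
end

local s = "\195\169\195\174"  -- the UTF-8 bytes of "éî"
print(char_to_byte(s, 1))     --> 1
print(char_to_byte(s, 2))     --> 3
```

Under that assumption, passing char_to_byte(s, 2) as the third argument of
slnunicode.utf8.match() should give "î", matching call 3 above.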

Is this a bug or have I misunderstood something? (I can't test slnunicode
independently for the moment.)

Best,
Paul