[luatex] Behavior of slnunicode.utf8.match().

Patrick Gundlach patrick at gundla.ch
Wed Aug 10 09:38:23 CEST 2011


Hi Paul,

> I see; not consistent to me, but at least it explains some things ... except
> "slnunicode.sub("éî", 2, 2)" returns "î" and not the second byte of "é", so
> obviously there are exceptions. (Or did I get something wrong again?)

No, no exceptions. It is just that "." means "bytes" and everything else means "characters".


IMO the idea is the following:

* Lua strings can deal with binary data. In the functions find, match, gmatch and gsub, the "." in a pattern stands for "one byte"; everything else (like %d) is a character class and stands for characters (or digits or punctuation or ...). With this, you can parse any file format out there rather easily.

* slnunicode is a drop-in replacement for the string functions. The proof for me is that the function names and the arguments are identical.

* Because it is a drop-in replacement, it has to behave exactly like the regular string functions, especially regarding binary data. That means "." is one byte and %x is a character class, but one that takes the UTF-8 byte sequence into account.

* All functions that do not use patterns (i.e. len, lower, reverse, sub, upper) deal with UTF-8 byte sequences. The functions that use patterns (find, match, gmatch and gsub) distinguish between UTF-8 byte sequences (character classes) and single bytes (the dot). I don't know about byte and char, but I guess these belong to the former group; I don't want to look this up now.
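
To make the byte/character distinction concrete, here is a minimal sketch using only the plain Lua string library (no slnunicode calls, so it says nothing new about slnunicode itself). The strings are written as decimal byte escapes so the source encoding does not matter, and utf8_len is a hypothetical helper made up for this illustration.

```lua
-- Plain Lua `string` library only; slnunicode is deliberately not used.
local e_acute = "\195\169"   -- UTF-8 bytes of "é" (0xC3 0xA9)
local i_circ  = "\195\174"   -- UTF-8 bytes of "î" (0xC3 0xAE)
local s = e_acute .. i_circ

-- The regular string functions see bytes.
local byte_len = #s                     -- number of bytes, not characters
local second   = string.sub(s, 2, 2)    -- second byte of "é"
local dot      = string.match(s, ".")   -- "." matches the first byte only

-- Hypothetical helper (an assumption for this sketch): count characters
-- by counting every byte that is not a UTF-8 continuation byte (128..191).
local function utf8_len(str)
  local n = 0
  for i = 1, #str do
    local b = string.byte(str, i)
    if b < 128 or b >= 192 then
      n = n + 1
    end
  end
  return n
end

local char_len = utf8_len(s)

print(byte_len, string.byte(second), string.byte(dot), char_len)
```

With the plain string library, sub(s, 2, 2) gives the second byte of "é", which is exactly what Paul expected; a character-aware sub has to return "î" instead.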

And I don't think replacing slnunicode with something else is really necessary: if it isn't broken, don't fix it.

But I'll shut up now unless asked again.

Patrick
