[luatex] problem with slnunicode's find

Stephan Hennig mailing_list at arcor.de
Tue Mar 2 19:33:59 CET 2010


On 02.03.2010 17:18, luigi scarso wrote:
> On Tue, Mar 2, 2010 at 4:39 PM, Stephan Hennig<mailing_list at arcor.de>  wrote:
>> On 02.03.2010 14:41, luigi scarso wrote:
>>>
>>> I believe 7 is ok, because in utf8 Äabcde is 7 octets long
>>> and unittest.c says
>>>   NOTE: find positions are in bytes for all ctypes!
>>
>> Logicians might be satisfied with broken behaviour as long as it's
>> documented.
> I believe that it's not broken behaviour, it's only a mix of two
> different points of view:
> "abstract" (or "sign" or "glyph" or "character"), where we see Ä as a "unit"
> and "implementation", where Ä in utf8 is two octets.

Yes, that's why I call it "broken".  Switching point of view within the
unicode.utf8 functions doesn't seem like good design to me.  I cannot see
why it would be sensible to regard the length of Ä as one (character) in
len and two (octets) in find.  After all, we already have functions
that return byte positions in a string: string.find and
unicode.ascii.find.  Why not drop unicode.utf8.find altogether?  That would
be a clear design.  (Only beaten by a find function that regards Ä as having
the same length as len does.  There are use cases for such a find function.)
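
To make the difference concrete, here is a minimal sketch in plain Lua of
what a character-based find could look like.  It assumes valid UTF-8 input
and plain-text search (no patterns).  The names byte_to_char_pos and
utf8_char_find are hypothetical and are not part of slnunicode; the sketch
only wraps the byte-based string.find and converts its results.

```lua
-- Hypothetical helper, not part of slnunicode: convert a byte position in
-- a valid UTF-8 string into a character position by counting every byte
-- up to that position that is not a continuation byte (0x80..0xBF).
local function byte_to_char_pos(s, byte_pos)
  local chars = 0
  for i = 1, byte_pos do
    local b = string.byte(s, i)
    if b < 0x80 or b > 0xBF then
      chars = chars + 1
    end
  end
  return chars
end

-- Hypothetical character-based find built on the byte-based string.find.
-- Assumptions: plain-text search only, and init is still a byte position.
local function utf8_char_find(s, needle, init)
  local first, last = string.find(s, needle, init or 1, true)
  if not first then
    return nil
  end
  return byte_to_char_pos(s, first), byte_to_char_pos(s, last)
end

local s = "\195\132abcde"            -- "Äabcde" written as explicit UTF-8 bytes

print(string.len(s))                 -- 7    (octets)
print(string.find(s, "e", 1, true))  -- 7 7  (byte positions)
print(utf8_char_find(s, "e"))        -- 6 6  (character positions)
```

With such a function, the position reported for the last character agrees
with the character count that a character-based len reports for the same
string, which is the consistency asked for above.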


>> But I'm not a logician, so I cannot agree. :)
> To be honest I'm not comfortable with regex and unicode.
>
> Perl can help here, but, just to see an example
>
> #>  perl  -e '$str = "Äabcde"; print length($str),"\n" ;' ;
> 7
> #>  perl  -e 'use utf8; $str = "Äabcde"; print length($str),"\n" ;' ;
> 6

Same with string.len and unicode.utf8.len in Lua.  You made me curious.
Is there a find function in Perl?  What values does it return?

Best regards,
Stephan Hennig

