[luatex] problem with slnunicode's find

Tue Mar 2 17:18:35 CET 2010

On Tue, Mar 2, 2010 at 4:39 PM, Stephan Hennig <mailing_list at arcor.de> wrote:
> On 02.03.2010 14:41, luigi scarso wrote:
>>
>> On Tue, Mar 2, 2010 at 2:01 PM, Stephan Hennig<mailing_list at arcor.de>
>>  wrote:
>>>
>>> The output of
>>>
>>>  str = "abcde"
>>>  print(unicode.utf8.match(str, "()e"))
>>>  str = "Äabcde"
>>>  print(unicode.utf8.match(str, "()e"))
>>>
>>> is 5 and 7.  The second one is obviously wrong.
>>
>> I believe 7 is ok, because in utf8 Äabcde is 7 octets long
>> and  unittest.c says
>>  NOTE: find positions are in bytes for all ctypes!
>
> Logicians might be satisfied with broken behaviour as long as it's
> documented.
I don't believe it is broken behaviour; it is only a mix of two
different points of view:
the "abstract" one (or "sign", "glyph" or "character"), where we see Ä as a single unit,
and the "implementation" one, where Ä in utf8 is two octets.
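
To make the two points of view concrete, here is a minimal sketch in plain
Lua that turns a byte position (what find reports) into a character
position (what one expects when seeing Ä as a unit). byte_to_char_pos is a
hypothetical helper of mine, not something slnunicode provides, and it
assumes the string is valid utf8.

```lua
-- Hypothetical helper (not part of slnunicode): convert a 1-based byte
-- position in a UTF-8 string into a 1-based character position by
-- counting the bytes that start a character, up to that position.
-- Assumes s is valid UTF-8.
local function byte_to_char_pos(s, byte_pos)
  local chars = 0
  for i = 1, math.min(byte_pos, #s) do
    local b = s:byte(i)
    -- Continuation bytes have the form 10xxxxxx (0x80..0xBF);
    -- every other byte begins a new character.
    if b < 0x80 or b >= 0xC0 then
      chars = chars + 1
    end
  end
  return chars
end

-- "Äabcde", with Ä written as its two UTF-8 octets C3 84
local str = "\195\132abcde"
local byte_pos = string.find(str, "e", 1, true)  -- plain, byte-wise search
print(byte_pos, byte_to_char_pos(str, byte_pos))
```

With the string above, the byte-wise search reports 7 and the helper maps
that to 6, so both numbers are "right" from their own point of view.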

> But I'm not a logician, so I cannot agree. :)
To be honest I'm not comfortable with regex and unicode.

Perl can help here, just to see an example:

#> perl  -e '$str = "Äabcde"; print length($str),"\n" ;' ;
7
#> perl  -e 'use utf8; $str = "Äabcde"; print length($str),"\n" ;' ;
6

#> perl -v
This is perl, v5.10.0 built for i586-linux-thread-multi

Of course there are other libs, like
http://site.icu-project.org/
http://www.pcre.org/pcre.txt

and of course luatex could become bigger and slower with them.

A solution could be dynamic loading, so that one can choose at runtime
which module to use, but we must ensure that the same shared lib
is available on all systems, and this is not easy.

-- 
luigi