[luatex] problem with slnunicode's find
luigi scarso
luigi.scarso at gmail.com
Tue Mar 2 17:18:35 CET 2010
On Tue, Mar 2, 2010 at 4:39 PM, Stephan Hennig <mailing_list at arcor.de> wrote:
> On 02.03.2010 14:41, luigi scarso wrote:
>>
>> On Tue, Mar 2, 2010 at 2:01 PM, Stephan Hennig<mailing_list at arcor.de>
>> wrote:
>>>
>>> The output of
>>>
>>> str = "abcde"
>>> print(unicode.utf8.match(str, "()e"))
>>> str = "Äabcde"
>>> print(unicode.utf8.match(str, "()e"))
>>>
>>> is 5 and 7. The second one is obviously wrong.
>>
>> I believe 7 is OK, because in UTF-8 Äabcde is 7 octets long
>> and unittest.c says
>> NOTE: find positions are in bytes for all ctypes!
>
> Logicians might be satisfied with broken behaviour as long as it's
> documented.
I believe it's not broken behaviour, only a mix of two
different points of view:
the "abstract" one (or "sign", "glyph" or "character"), where we see Ä as one unit,
and the "implementation" one, where Ä in UTF-8 is two octets.
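
To make the two views concrete, here is a minimal sketch in plain Lua
that maps a byte position (what find reports) to a character position.
byte_to_char_pos is a hypothetical helper, not part of slnunicode, and
it assumes the input is valid UTF-8.

```lua
-- Hypothetical helper (not part of slnunicode): convert a 1-based byte
-- position in a valid UTF-8 string to a 1-based character position.
function byte_to_char_pos(s, byte_pos)
  local count = 0
  for i = 1, byte_pos do
    local b = s:byte(i)
    if b == nil then break end        -- byte_pos is past the end of s
    -- UTF-8 continuation bytes are 128..191; every other byte starts
    -- a new character.
    if b < 128 or b >= 192 then
      count = count + 1
    end
  end
  return count
end

-- "Äabcde", with Ä written as its two UTF-8 bytes so that the example
-- does not depend on the encoding of this file.
local str = "\195\132abcde"
local byte_pos = string.find(str, "e", 1, true)   -- plain byte-wise find
print(byte_pos, byte_to_char_pos(str, byte_pos))
```

The byte-wise position is the "implementation" view; the converted
value is the "abstract" one.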
> But I'm not a logician, so I cannot agree. :)
To be honest, I'm not comfortable with regexes and Unicode.
Perl can help here; just to show an example:
#> perl -e '$str = "Äabcde"; print length($str),"\n" ;' ;
7
#> perl -e 'use utf8; $str = "Äabcde"; print length($str),"\n" ;' ;
6
#> perl -v
This is perl, v5.10.0 built for i586-linux-thread-multi
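
The same two lengths can be obtained in plain Lua. This is only a
sketch: utf8_char_count is a hypothetical name, and it assumes the
string is valid UTF-8.

```lua
-- Hypothetical helper: count characters in a valid UTF-8 string by
-- counting every byte that is not a continuation byte (128..191).
function utf8_char_count(s)
  local _, n = string.gsub(s, "[^\128-\191]", "")
  return n
end

-- "Äabcde", with Ä written as its two UTF-8 bytes.
local str = "\195\132abcde"
print(#str)                   -- length in bytes, like Perl without "use utf8"
print(utf8_char_count(str))   -- length in characters, like Perl with "use utf8"
```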
Of course there are other libs, like
http://site.icu-project.org/
http://www.pcre.org/pcre.txt
and of course luatex can become bigger and slower.
A solution could be dynamic loading, so one can choose at runtime
which module to use, but we must ensure that the same shared lib
is available on all systems, and this is not easy.
--
luigi