[luatex] problem with slnunicode's find
Jonathan Fine
jfine at pytex.org
Thu Mar 4 09:15:31 CET 2010
Stephan Hennig wrote:
> Hi,
>
> I have trouble getting the position of a character in a UTF-8 string
> with slnunicode. The attached Lua script reads two UTF-8 encoded (I
> think) strings, 'äb' and 'öäb', from a file and outputs their length and
> the position of the last character 'b'. (UTF-8 characters are scrambled
> in the output, because this is on a Windows console. But that shouldn't
> matter, should it?)
>
> > >texlua slnunicode-find.lua
> > line = äb
> > len(line) = 2
> > character 'b' at position 3
> >
> > line = ├Â├ñb
> > len(line) = 3
> > character 'b' at position 5
>
> I would expect the positions of 'b' to be 2 and 3, respectively, as those
> are the lengths of the strings as returned by unicode.utf8.len.
Stephan: Is this what you want (except of course in Lua)?
$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:58:18)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> data = 'äb', 'öäb', u'äb', u'öäb'
>>> data
('\xc3\xa4b', '\xc3\xb6\xc3\xa4b', u'\xe4b', u'\xf6\xe4b')
>>> for s in data: print repr(s), s, len(s), s.index('b')
...
'\xc3\xa4b' äb 3 2
'\xc3\xb6\xc3\xa4b' öäb 5 4
u'\xe4b' äb 2 1
u'\xf6\xe4b' öäb 3 2
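In the above we have the two strings, first as 8-bit (UTF-8 encoded) byte
strings and then as unicode strings. Python indices are 0-based, so the
byte strings give byte offsets 2 and 4 for 'b' (positions 3 and 5 when
counting from 1, as in your output), while the unicode strings give
character offsets 1 and 2 (positions 2 and 3, which is what you expected).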
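If character positions are what you are after, here is a minimal sketch of
one way to turn a byte offset into a character position. It is written for
Python 3 (not the Python 2 session above), and char_position is just a
hypothetical helper name for illustration. It says nothing about how
slnunicode works internally; it only decodes the bytes that precede the
match and counts the resulting characters.

```python
def char_position(data: bytes, needle: bytes) -> int:
    """Return the 1-based character position of needle in UTF-8 data.

    Returns -1 if needle does not occur. Assumes data and needle are
    both valid UTF-8, so a match always starts on a character boundary.
    """
    byte_offset = data.find(needle)  # 0-based byte offset, or -1
    if byte_offset < 0:
        return -1
    # The number of characters before the match is the length of the
    # decoded prefix; add 1 to get a 1-based position as in Lua.
    return len(data[:byte_offset].decode("utf-8")) + 1


if __name__ == "__main__":
    for text in ("äb", "öäb"):
        raw = text.encode("utf-8")
        # 1-based byte position versus 1-based character position of 'b'
        print(len(raw), raw.find(b"b") + 1, char_position(raw, b"b"))
```

For 'äb' the byte position of 'b' is 3 and the character position is 2; for
'öäb' they are 5 and 3. The same idea (count the characters in the prefix
that ends just before the byte offset) should carry over to Lua, though I
have not tried it there.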
--
Jonathan
> However,
> unicode.utf8.find seems to have another notion of the length of a
> string. To correct these values manually (apparently they are byte
> positions) one would need to know how many of the characters preceding
> 'b' are multiple bytes long. Actually, I thought that is what slnunicode
> is made for.
>
> What is the preferred way to get the position of a character in a UTF-8
> string, given that the string contains only 'letters'?
>
> Best regards,
> Stephan Hennig
>
>
>> >texlua -v
>> This is LuaTeX, Version beta-0.40.6-2009110118 (Web2C 2009) luatex.web
>> >= v14240
>