[luatex] problem with slnunicode's find

Thu Mar 4 09:15:31 CET 2010

Stephan Hennig wrote:
> Hi,
> 
> I have trouble getting the position of a character in a UTF-8 string 
> with slnunicode.  The attached Lua script reads two UTF-8 encoded (I 
> think) strings, 'äb' and 'öäb', from a file and outputs their length and 
> the position of the last character 'b'.  (UTF-8 characters are scrambled 
> in the output, because this is on a Windows console.  But that shouldn't 
> matter, should it?)
> 
>  > >texlua slnunicode-find.lua
>  > line = ├ñb
>  > len(line) = 2
>  > character 'b' at position 3
>  >
>  > line = ├Â├ñb
>  > len(line) = 3
>  > character 'b' at position 5
> 
> I would expect the positions of 'b' to be 2 and 3, respectively, since 
> those are the lengths of the strings as returned by unicode.utf8.len.

Stephan: Is this what you want (except of course in Lua)?

$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:58:18)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> data = 'äb', 'öäb', u'äb', u'öäb'
 >>> data
('\xc3\xa4b', '\xc3\xb6\xc3\xa4b', u'\xe4b', u'\xf6\xe4b')
 >>> for s in data: print repr(s), s, len(s), s.index('b')
...
'\xc3\xa4b' äb 3 2
'\xc3\xb6\xc3\xa4b' öäb 5 4
u'\xe4b' äb 2 1
u'\xf6\xe4b' öäb 3 2

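In the above we have the two strings, first in 8-bit form (UTF-8 encoded 
bytes) and then in unicode.  For the 8-bit strings, len and index count 
bytes; for the unicode strings, they count characters.  Python indexes 
from 0, so add 1 to each index to compare with the Lua positions.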
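
If the character position is what you want, one way to get it from a 
byte offset is to decode the bytes before the match and count the 
resulting characters.  Here is a minimal sketch in Python 3 (not the 
Python 2.6 session above, and not slnunicode).  The name char_position 
is made up for illustration, and it returns 1-based positions so the 
numbers line up with Lua's.

```python
def char_position(data, needle):
    """Return the 1-based character position of needle in UTF-8 bytes.

    Hypothetical helper for illustration only; raises ValueError if
    needle does not occur in data.
    """
    byte_index = data.index(needle)             # 0-based byte offset
    prefix = data[:byte_index].decode("utf-8")  # characters before the match
    return len(prefix) + 1                      # convert to 1-based position


for text in ("\u00e4b", "\u00f6\u00e4b"):       # 'äb' and 'öäb'
    raw = text.encode("utf-8")
    # character count, 1-based byte position, 1-based character position
    print(len(text), raw.index(b"b") + 1, char_position(raw, b"b"))
```

This assumes the needle is itself a complete UTF-8 sequence, so the 
match always starts on a character boundary and the prefix decodes 
cleanly.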

-- 
Jonathan

> However, unicode.utf8.find seems to have a different notion of the 
> length of a string.  To correct these values manually (they are 
> apparently byte positions), one would need to know how many of the 
> characters preceding 'b' are multiple bytes long.  Actually, I thought 
> that is what slnunicode is made for.
> 
> What is the preferred way to get the position of a character in a UTF-8 
> string, given that the string contains only 'letters'?
> 
> Best regards,
> Stephan Hennig
> 
> 
>> >texlua -v
>> This is LuaTeX, Version beta-0.40.6-2009110118 (Web2C 2009) luatex.web 
>> >= v14240
>