[luatex] problem with slnunicode's find

Manuel Pégourié-Gonnard mpg at elzevir.fr
Wed Mar 3 10:19:25 CET 2010


luigi scarso wrote:
>> this discussion is IMO whether unicode.* libraries are a replacement for string or not.
> Hm.
> A difficult question.
> 
IMO not. The comments state that unicode.ascii and unicode.latin1 are
locale-independent replacements for string, but don't say anything about
unicode.utf8, and that's probably for a reason. But as Taco said, this would be
best discussed with the selene developers.

> Have we found a bug in unicode.utf8.find or it's correct but we
> disagree about its behavior ?

This question has been answered many times: the fact that unicode.utf8.find
returns positions in bytes (as opposed to characters) is a design decision, and
the function behaves precisely as the doc says on this point:

-- NOTE: find positions are in bytes for all ctypes!
-- use ascii.sub to cut found ranges!
-- this is a) faster b) more reliable

> If  we disagree, what is the expected behavior ?

People who disagree would like the counts to be characters, not bytes.
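
To illustrate the difference, here is a small sketch in plain Lua. It uses only
the standard string library (which is byte-based) and a made-up sample string,
so it says nothing about slnunicode's internals. In UTF-8, "é" takes two bytes,
so byte offsets and character offsets diverge after it:

```lua
-- "é x" written with explicit UTF-8 bytes: é = 0xC3 0xA9 (two bytes)
local s = "\195\169 x"

-- string.find works on bytes: "x" is the 4th byte of s
local byte_pos = string.find(s, "x", 1, true)

-- Count the characters up to that byte by dropping UTF-8 continuation
-- bytes (0x80-0xBF) and measuring what is left.
local prefix = string.sub(s, 1, byte_pos)
local char_pos = #(string.gsub(prefix, "[\128-\191]", ""))

print(byte_pos, char_pos)  --> 4    3
```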

> Can we implement an acceptable  wrapper  ?
> 
Yes, a proper wrapper has already been given by Patrick [1] and quoted by
myself. Here it is again, now in the form of a function:

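function find_utf8_chars(str, pat)
    -- unicode.utf8.find returns byte positions, or nil if there is no match
    local a, b = unicode.utf8.find(str, pat)
    if not a then return nil end
    -- convert each byte position to a character count
    a = unicode.utf8.len(string.sub(str, 1, a))
    b = unicode.utf8.len(string.sub(str, 1, b))
    return a, b
end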

Note that this is not a proper full version of find (arguments 3 and 4 are not
supported, and no captures are returned). However, it does answer Stephan's
original question.
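
For what it's worth, a fuller variant could look roughly like the sketch below.
It is only an illustration under stated assumptions: the names
find_chars_sketch, utf8_len and char_to_byte are hypothetical, and it is built
on the standard string.find so that it runs in plain Lua, which means patterns
are still matched bytewise. In LuaTeX one would presumably substitute
unicode.utf8.find and unicode.utf8.len. It takes init as a character position
(positive values only), passes plain through, and returns captures; position
captures "()" would still be byte offsets.

```lua
local unpack = table.unpack or unpack  -- Lua 5.2+ / Lua 5.1

-- Count UTF-8 characters by dropping continuation bytes (0x80-0xBF).
local function utf8_len(s)
    return #(string.gsub(s, "[\128-\191]", ""))
end

-- Byte offset at which the n-th character (1-based, n >= 1) starts.
local function char_to_byte(s, n)
    local pos = 1
    for _ = 2, n do
        -- skip one lead (or ASCII) byte plus its continuation bytes
        local _, e = string.find(s, "^[^\128-\191][\128-\191]*", pos)
        if not e then return #s + 1 end
        pos = e + 1
    end
    return pos
end

-- Hypothetical wrapper: like find, but init and the two returned
-- positions are counted in characters instead of bytes.
function find_chars_sketch(s, pat, init, plain)
    local r = { string.find(s, pat, char_to_byte(s, init or 1), plain) }
    if not r[1] then return nil end
    r[1] = utf8_len(string.sub(s, 1, r[1]))
    r[2] = utf8_len(string.sub(s, 1, r[2]))
    return unpack(r)
end

-- "été x" with explicit UTF-8 bytes; "t" is the 2nd character (3rd byte)
print(find_chars_sketch("\195\169t\195\169 x", "(t)"))  --> 2    2    t
```

Negative init values and position captures are left out on purpose, to keep the
sketch short.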

Manuel.

[1] http://tug.org/pipermail/luatex/2010-March/001262.html



