[luatex] on some special glyph nodes

Paul Isambert zappathustra at free.fr
Sun Mar 24 00:00:57 CET 2013


Stephan Hennig <mailing_list at arcor.de> wrote:
> On 23.03.2013 16:24, Paul Isambert wrote:
> > Stephan Hennig <mailing_list at arcor.de> wrote:
> >>
> >> under what circumstances can glyph nodes still happen to represent
> >> ambiguous code points? Is there some input for LuaTeX that
> >> generates such glyphs other than via node list manipulation?
> >>
> >> under what circumstances can glyph nodes represent spacing
> >> characters?
> >
> > I'm not sure I've understood the questions properly, but I'll make a
> > tentative answer anyway.
>
> Thank you for taking the time.  To give some context, I'm trying to
> separate words in node lists and to strip punctuation from words.
>
> Concerning ambiguous characters, do I have to expect and deal with
> ambiguous characters in node lists?  Given the conversion recommended by
> Unicode standard, the naive assumption is there won't be ambiguous
> characters any more at the node list level.  Is that reasoning justified?

LuaTeX by itself doesn't do anything to characters, so apostrophes
remain apostrophes; it's the font package that handles
substitutions, so you have to do what you want to do after the glyphs
have been processed. But if you're using a simple TFM font, you'll see
a glyph node with char field 0x27 even though the glyph is a proper
quotation mark, so you'll have no way to detect that (and that can
happen with OT fonts too, if the substitution is done directly in the
font when it is loaded rather than when nodes are processed).

> I'm only speaking of ASCII characters that can have various meanings and
> are considered ambiguous by the Unicode standard.  Separating
> apostrophes from English single closing quotes, which are both
> represented by RIGHT SINGLE QUOTATION MARK 0x2019, is another story.
> The former should be retained and don't represent a word separator,
> while the latter should be stripped from words.  Handling this needs
> looking ahead at the next glyph nodes. :(  But anyway, 0x2019 is
> considered unambiguous.
>
> Concerning space characters, do I have to deal with these in LuaTeX to
> recognise word boundaries?  I'm beginning to realise that the answer is
> yes.  Probably not because LuaTeX will insert them, but because they can
> already be present in the input file.
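
The look-ahead described above can be sketched roughly as follows. This
is a standalone Python model, not LuaTeX code: `chars` stands for the
char fields of consecutive glyph nodes, and the function name and the
rule itself are assumptions made for illustration. The rule is
deliberately naive: U+2019 between two letters is taken to be an
apostrophe, anything else a closing quote.

```python
from typing import List

RSQUO = 0x2019  # RIGHT SINGLE QUOTATION MARK


def classify_rsquo(chars: List[int], i: int) -> str:
    """Classify the U+2019 at position i as 'apostrophe' or 'closing-quote'.

    chars is a list of code points, as they might be read from the char
    fields of a run of glyph nodes (hypothetical input format).
    """
    if chars[i] != RSQUO:
        raise ValueError("position i does not hold U+2019")
    # Look one glyph back and one glyph ahead.
    prev_is_letter = i > 0 and chr(chars[i - 1]).isalpha()
    next_is_letter = i + 1 < len(chars) and chr(chars[i + 1]).isalpha()
    if prev_is_letter and next_is_letter:
        return "apostrophe"
    return "closing-quote"
```

This rule gets word-final apostrophes (as in a plural possessive) and
word-initial ones wrong; telling those apart from closing quotes needs
more context than one glyph on either side.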

At the node level, I suppose a word boundary will generally be either
a normal space (a glue node), possibly preceded by a penalty, or a kern
that isn't an interglyph (font) kern. Of course, none of those is
unambiguous.
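
A rough model of that heuristic, written in Python so that it runs
outside LuaTeX. The `Node` class and its fields (`kind`, `char`,
`font_kern`) are made up for illustration and are not LuaTeX's node
API. Glue ends a word, a penalty on its own does not, and only kerns
that do not come from the font count as boundaries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    kind: str                    # "glyph", "glue", "penalty" or "kern"
    char: Optional[int] = None   # code point, for glyph nodes only
    font_kern: bool = False      # True for interglyph kerns from the font


def split_words(nodes: List[Node]) -> List[str]:
    """Split a simplified node list into words."""
    words: List[str] = []
    current: List[str] = []
    for n in nodes:
        if n.kind == "glyph":
            current.append(chr(n.char))
        elif n.kind == "kern" and n.font_kern:
            continue  # interglyph kern: still inside the same word
        elif n.kind == "penalty":
            continue  # a penalty alone is not treated as a boundary
        else:
            # glue, or a kern that does not come from the font
            if current:
                words.append("".join(current))
                current = []
    if current:
        words.append("".join(current))
    return words
```

For instance, the glyphs "A" and "V" separated by a font kern stay one
word, while glue or an explicit kern starts a new one.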

> > As far as I can tell, there are three places in LuaTeX where such
> > substitutions can take place:
> >
> > 1. When manipulating the input lines fed to TeX, in either
> >    open_read_file or process_input_buffer. There characters can be
> >    easily converted, and 0x27 can be turned to 0x2d.
>
> I guess you mean 0x27 => 0x2019 (APOSTROPHE => RIGHT SINGLE QUOTATION
> MARK) or 0x2d => 0x2010 (HYPHEN-MINUS => HYPHEN).

Yes, I was careless.
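
With the corrected code points, the per-line conversion of point 1 can
be modelled like this. It is a plain Python function standing in for
whatever would be registered in a callback such as
process_input_buffer; the names `AMBIGUOUS_TO_UNICODE` and
`convert_line` are hypothetical.

```python
# Map the two ambiguous ASCII characters discussed above to their
# unambiguous Unicode counterparts.
AMBIGUOUS_TO_UNICODE = {
    0x27: 0x2019,  # APOSTROPHE   -> RIGHT SINGLE QUOTATION MARK
    0x2D: 0x2010,  # HYPHEN-MINUS -> HYPHEN
}


def convert_line(line: str) -> str:
    """Return the input line with ambiguous characters replaced."""
    return line.translate(AMBIGUOUS_TO_UNICODE)
```

A blind conversion like this also rewrites hyphen-minus where TeX
reads it as a minus sign (as in `\hskip-1pt`), so a real filter would
have to be more selective.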

> > 2. When TeX processes tokens; for instance, a non-breaking space
> >    character can be made active and \let to (the usual definition of)
> >    "~" (e.g. so the source looks as little TeXish as possible).
>
> Oh, I need to have a closer look at how LuaTeX processes the different
> space characters in input.  I wonder if making NO-BREAK SPACE 0xa0
> active is really necessary.

In that respect LuaTeX doesn't differ from TeX: it's above all a
matter of catcodes. Characters with a catcode other than 10 are not
spaces, and characters with catcode 10 are normalised to character
0x20 (ASCII space).
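
A toy model of that rule, in Python. The default catcode table here is
an assumption (only U+0020 is a space, letters are 11, everything else
12), and the model ignores TeX's skipping of repeated or line-initial
spaces; `tokenise` and `catcode_of` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

SPACE, LETTER, OTHER = 10, 11, 12


def catcode_of(cp: int, overrides: Dict[int, int]) -> int:
    """Look up a catcode in a simplified, assumed default table."""
    if cp in overrides:
        return overrides[cp]
    if cp == 0x20:
        return SPACE
    if chr(cp).isalpha():
        return LETTER
    return OTHER


def tokenise(line: str,
             overrides: Optional[Dict[int, int]] = None) -> List[Tuple[int, int]]:
    """Turn a line into (code point, catcode) pairs.

    Any character with catcode 10 becomes the token (0x20, 10),
    whatever its original code point was.
    """
    overrides = overrides or {}
    tokens = []
    for ch in line:
        cat = catcode_of(ord(ch), overrides)
        if cat == SPACE:
            tokens.append((0x20, SPACE))
        else:
            tokens.append((ord(ch), cat))
    return tokens
```

Under these assumptions, NO-BREAK SPACE 0xa0 comes through as an
ordinary character unless its catcode is changed; giving it catcode 10
in `overrides` turns it into a plain 0x20 space token.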

> > 3. Directly on nodes, which is the OpenType way to implement things:
> >    thus a node with character 0x27 will be turned into a node with
> >    character 0x2d; the font system (e.g. luaotfload) normally does
> >    that. TFM fonts are simpler yet: the right quotation mark is
> >    (generally) in position 0x27, there's no need for a substitution
> >    (you can implement that too with OTF fonts). In my opinion, this is
> >    probably the best place, for such substitutions as the apostrophe
> >    at least, since it can be font-dependent (and you may want some
> >    fonts, e.g. for code, to keep the real apostrophe).
>
> My question is whether 0x27 can occur in node lists at all when
> processing usual LuaTeX documents.

Yes, but it depends on when you're looking at the node list, as
mentioned above. And it depends on how much the font reflects
Unicode.

> > Another way, not related to TeX, is to input the relevant characters
> > directly. With a good editor, it is simple to input a quotation mark
> > when typing an apostrophe (which isn't readily available on a keyboard).
>
> Yeah, I have to do some tests with the range of space characters
> provided by Unicode.

Honestly, I'm not sure it's necessary; TeX's \hskip and \kern are much
more powerful than space characters, which must be processed anyway.

Best,
Paul

