[luatex] ActualText attribute for hyphenated words

Paul Isambert zappathustra at free.fr
Fri Feb 3 11:20:55 CET 2012

Patrick Gundlach <patrick at gundla.ch> wrote:
> Hello Till,
> (just for the record: this comes from a discussion on tex.sx: http://tex.stackexchange.com/q/43033/243 )
> > Is it possible/desirable to let the LuaTeX PDF generator automatically tag words which are hyphenated at the end of line with a matching /ActualText attribute (so that the sequence of glyphs "hyphen- ation", for example, is internally represented as the sequence of characters 'hyphenation')? That would make sense from a linguistic viewpoint because the display of a text in a PDF is strictly presentational and may differ from its lexical and grammatical structure. It would also ensure that you can search for and find words in a LuaTeX-generated PDF with almost any viewer.
> This might be achieved by using LuaTeX's ability to modify a node list after line breaking.

Building on this idea, see code below.

I suppose it will fail miserably in many cases, and it should be extended
to handle non-ASCII characters; also, although Acrobat, Evince and Xpdf
now all find hyphenated words (the latter two could not do that before),
they don't highlight them properly. Finally, this will work only in
viewers that implement /ActualText, which is perhaps not the case
with Till's previewer.

In the meantime, I've discovered something very nice: Acrobat for
Debian doesn't lock the document, so you can keep it open while
recompiling. That was not possible under Windows!


local HBOX = node.id"hlist"
local DISC = node.id"disc"
local GLYF = node.id"glyph"
local GLUE = node.id"glue"
local KERN = node.id"kern"

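-- Collect the characters of the glyph run starting at n, walking in
-- direction dir ("prev" or "next"). Font kerns (subtype 0) are skipped.
-- Returns the text and the outermost glyph reached.
local function collect (n, dir)
  local text = ""
  local limit
  while n and
        (n.id == GLYF or n.id == KERN and n.subtype == 0) do
    if n.id == GLYF then
      -- string.char only accepts 0..255, hence the ASCII limitation
      local c = string.char(n.char)
      text = dir == "prev" and (c .. text) or (text .. c)
      limit = n
    end
    n = n[dir]
  end
  return text, limit
end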

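-- Wrap each word hyphenated across two lines in a /Span with /ActualText.
callback.register("post_linebreak_filter", function (head)
  for line in node.traverse_id(HBOX, head) do
    local last = node.slide(line.head)
    if last and last.id == GLUE and last.subtype == 9 then -- \rightskip
      last = last.prev
    end
    if last and last.id == DISC then
      -- Find the next line, skipping penalties, glue, etc.
      local nextline = line.next
      while nextline and nextline.id ~= HBOX do
        nextline = nextline.next
      end
      if nextline then
        -- last.prev is taken to be the hyphen glyph, so start one node earlier.
        local prevnode = last.prev and last.prev.prev
        local prevtext, l1 = collect(prevnode, "prev")
        local n = nextline.head
        if n and n.id == GLUE and n.subtype == 8 then -- \leftskip
          n = n.next
        end
        local nexttext, l2 = collect(n, "next")
        if l1 and l2 then
          local lit1 = node.new("whatsit", "pdf_literal")
          local lit2 = node.new("whatsit", "pdf_literal")
          lit1.mode, lit2.mode = 2, 2
          lit1.data = "/Span << /ActualText (" .. prevtext .. nexttext .. ") >> BDC"
          lit2.data = "EMC"
          -- insert_before may change the head of the line
          line.head = node.insert_before(line.head, l1, lit1)
          node.insert_after(nextline.head, l2, lit2)
        end
      end
    end
  end
  return head
end)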
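
On the non-ASCII point raised above: one possible route, sketched here
under assumptions and not tested inside LuaTeX, is to write /ActualText
as a hexadecimal PDF string in UTF-16BE with a leading FEFF byte order
mark, which the PDF specification allows for text strings. The helper
name to_pdf_hex is made up for this example, and collect would have to
return a table of code points (the n.char values) instead of a byte
string built with string.char.

```lua
-- Hypothetical helper (not part of LuaTeX): encode a list of Unicode
-- code points as a PDF hex string in UTF-16BE, prefixed with the BOM.
local function to_pdf_hex (codepoints)
  local parts = { "FEFF" }
  for _, cp in ipairs(codepoints) do
    if cp < 0x10000 then
      -- Code points in the Basic Multilingual Plane: one 16-bit unit.
      parts[#parts + 1] = string.format("%04X", cp)
    else
      -- Code points above the BMP need a surrogate pair.
      local v = cp - 0x10000
      local hi = 0xD800 + math.floor(v / 0x400)
      local lo = 0xDC00 + v % 0x400
      parts[#parts + 1] = string.format("%04X%04X", hi, lo)
    end
  end
  return "<" .. table.concat(parts) .. ">"
end

print(to_pdf_hex({ 0x68, 0xE9 }))  --> <FEFF006800E9>  ("h" followed by "é")
```

The result would stand in for the parenthesised string, for instance
"/Span << /ActualText " .. to_pdf_hex(codepoints) .. " >> BDC". A side
benefit of the hex form is that parentheses and backslashes in the text
need no escaping. Whether a given viewer honours it is a separate matter.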
