[luatex] ActualText attribute for hyphenated words

Paul Isambert zappathustra at free.fr
Fri Feb 3 11:20:55 CET 2012


Patrick Gundlach <patrick at gundla.ch> wrote:
> 
> Hello Till,
> 
> (just for the record: this comes from a discussion on tex.sx: http://tex.stackexchange.com/q/43033/243 )
> 
> > Is it possible/desirable to let the LuaTeX PDF generator automatically tag words which are hyphenated at the end of line with a matching /ActualText attribute (so that the sequence of glyphs "hyphen- ation", for example, is internally represented as the sequence of characters 'hyphenation')? That would make sense from a linguistic viewpoint because the display of a text in a PDF is strictly presentational and may differ from its lexical and grammatical structure. It would also ensure that you can search for and find words in a LuaTeX-generated PDF with almost any viewer.
> 
> This might be achieved by using LuaTeX's ability to modify a node list after line breaking.

Building on this idea, see the code below.

I suppose it will fail miserably in many cases, and it should be extended
to handle non-ASCII characters. Also, although Acrobat, Evince and Xpdf
now all find hyphenated words (the latter two could not do that before),
they don't highlight them properly. Finally, this will work only in
viewers that implement /ActualText, which may not be the case for
Till's previewer.
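
One of the likely failure cases is easy to pin down: collect in the code
below gathers every adjacent glyph, so punctuation such as an opening
parenthesis ends up inside the literal string ( ... ) and can unbalance
it. A minimal sketch of an escaping step, assuming a hypothetical helper
pdf_escape that is not part of the code in this post:

```lua
-- Hypothetical helper (an assumption, not in the original code):
-- backslash-escape the characters that are special inside a PDF
-- literal string, i.e. "\", "(" and ")".
local function pdf_escape(s)
  -- The extra parentheses drop gsub's second return value (the match count).
  return (s:gsub("[\\()]", "\\%0"))
end
```

It would be applied to the joined text before the BDC operator is built,
for instance pdf_escape(prevtext .. nexttext).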

In the meantime, I've discovered something very nice: Acrobat for
Debian doesn't lock the document, so you can keep it open and recompile
at the same time. That was not possible under Windows!

Best,
Paul


local HBOX = node.id"hlist"
local DISC = node.id"disc"
local GLYF = node.id"glyph"
local GLUE = node.id"glue"
local KERN = node.id"kern"

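-- Collect the characters of the glyph run starting at n and walking in
-- direction dir ("prev" or "next"); font kerns (subtype 0) are skipped.
-- Returns the text and the outermost glyph reached.
local function collect (n, dir)
  local text = ""
  local limit
  while n and
        (n.id == GLYF or (n.id == KERN and n.subtype == 0)) do
    if n.id == GLYF then
      -- string.char only accepts 0..255, so this is ASCII/Latin-1 only.
      local c = string.char(n.char)
      text = dir == "prev" and (c .. text) or (text .. c)
      limit = n
    end
    n = n[dir]
  end
  return text, limit
end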

callback.register("post_linebreak_filter",
function (head)
  for line in node.traverse_id(HBOX, head) do
    -- Last node of the line; step back over \rightskip (glue subtype 9).
    local last = line.head and node.slide(line.head)
    if last and last.id == GLUE and last.subtype == 9 then
      last = last.prev
    end
    if last and last.id == DISC then
      local nextline = line.next
      while nextline do
        if nextline.id == HBOX then
          break
        else
          nextline = nextline.next
        end
      end
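      if nextline then
        -- Skip last.prev (the node just before the discretionary,
        -- presumably the hyphen glyph) and collect the word's first half.
        local hyphen = last.prev
        local prevtext, l1 = collect(hyphen and hyphen.prev, "prev")
        local n = nextline.head
        -- Skip \leftskip (glue subtype 8) at the start of the next line.
        if n and n.id == GLUE and n.subtype == 8 then
          n = n.next
        end
        local nexttext, l2 = collect(n, "next")
        -- Only tag the word if glyphs were found on both lines.
        if l1 and l2 then
          local lit1 = node.new("whatsit", "pdf_literal")
          local lit2 = node.new("whatsit", "pdf_literal")
          lit1.mode, lit2.mode = 2, 2  -- 2 = direct
          lit1.data = "/Span << /ActualText (" .. prevtext .. nexttext .. ") >> BDC"
          lit2.data = "EMC"
          -- insert_before returns the new head, which changes when l1
          -- is the first node of the line.
          line.head = node.insert_before(line.head, l1, lit1)
          node.insert_after(nextline.head, l2, lit2)
        end
      end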
    end
  end
  return head
end)
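
For the non-ASCII extension mentioned above, one option is to write
/ActualText as a hex string in UTF-16BE with a leading byte-order mark,
which PDF text strings allow. A minimal sketch in plain Lua, assuming
collect is changed to gather the n.char values in a table instead of
calling string.char, and assuming each n.char is a Unicode code point
(which does not hold for ligatures or some legacy font encodings). The
name pdf_hex_string is hypothetical:

```lua
-- Hypothetical helper (not part of the original code): encode a list of
-- Unicode code points as a PDF hex string, UTF-16BE with a leading BOM.
local function pdf_hex_string(codepoints)
  local out = { "FEFF" }  -- byte-order mark marks the string as UTF-16BE
  for _, cp in ipairs(codepoints) do
    if cp < 0x10000 then
      out[#out + 1] = string.format("%04X", cp)
    else
      -- Outside the Basic Multilingual Plane: emit a surrogate pair.
      local v = cp - 0x10000
      local hi = 0xD800 + math.floor(v / 0x400)
      local lo = 0xDC00 + v % 0x400
      out[#out + 1] = string.format("%04X%04X", hi, lo)
    end
  end
  return "<" .. table.concat(out) .. ">"
end
```

With a table cps of collected code points (again a hypothetical name),
the literal would become
"/Span << /ActualText " .. pdf_hex_string(cps) .. " >> BDC".
A hex string contains only hexadecimal digits, so no characters in it
need escaping.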


