[tex-k] Easy-to-fix efficiency bug in kpathsea?

Thu Jan 4 18:16:18 CET 2018

I believe I've found a significant efficiency bug in kpathsea (source files circa 2017), which could be a win to fix, although I can't tell because I don't really understand the real-world run-time patterns/statistics of kpathsea in all its uses.  Someone with the ability to test the current code can probably verify whether this is a tempest in a teapot or not.

There are two subroutines in the file "pathsearch.c".  One is dir_list_search() and the other the more general dir_list_search_list().  Each is governed by a flag, |all|, that says whether to stop processing at a first match, or whether to continue until all matches are found.  This efficiency bug affects the latter situation of finding all matches.

Each of these routines contains a loop that scans down a linked list of directory paths, using them to build candidate paths to files that may or may not be included in an array of matches, depending on various criteria such as existence or readability.

Under the assumption that the caller's linked list of directory paths is going to be re-used several times, the first traversal attempts to optimize the linked list's orderer.  The optimization measure (calling str_llist_float()) possibly moves the current item of the linked list towards the top of the list, to be inserted just after any previously moved items.  This reduces the number of candidate path failures in future first-match-only searches.  Near as I can think it through, though, moving the items around doesn't help in future all-match searches, but there may be something I don't understand on that front.

The str_llist_float() adjustment to the order of list items is easy because they are kept in a linked list.  Snip the item from its current position, insert it after the last previously moved item near the head of the linked list, and ensure all the pointers are properly updated, certainly the "next" pointer of the item moved.

After any possible movement of |elt| towards the start of the list, the end of the body of each loop is reached, and the iteration step in the for( ; ; ) statement is executed.  It looks like this:

    elt = STR_LLIST_NEXT (*elt)

So the for loop thinks it is moving on to the next item in the linked last after the current item |elt| that may or may not have just been moved.  But if |elt| just got moved, the scanning for(;;) loop is now jumping *backwards* in the list to scan over unsuccessfully matched list items a second time until it finally gets back to the item in the list that succeeded the last moved item, at which point the loop moves on towards the list's end checking the rest of the unexamined paths.  And if it "optimizes" another successful match towards the head of the list, the loop re-scans all the previous unsuccessful path matches a third time, etc.

In other words, this optimization measure is helpful for later first-match-only calls using the same list.  But while scanning this list looking for all matches, it actually is a pessimization measure that causes everything over which each item just moved to be re-scanned (unsuccessfully) again for every successful item moved towards the list's start.

So when finding all matches, if the list is long, and the item moved backwards a long way, and there are several successful matches somewhere in the list, and this happens regularly in pre-optimized lists (all of which is implied by having implemented str_llist_float() in the first place), then there's a whole lot of unnecessary duplicative list traversal and candidate path building and testing going on.  Worst case, it turns the loop into an O(n^2) operation, rather than the O(n) operation it believes itself to be.

The fix seems to me to be pretty easy.  There needs to be a local variable that, inside the loop body, records STR_LLIST_NEXT (*elt) somewhere before calling str_llist_float().  That  local variable is then used in the increment step of the for loop, instead of STR_LLIST_NEXT (*elt).

As usual, I haven't been able to convince myself that I'm misunderstanding something, and would welcome anyone setting me straight..  But my gut says this is a classic coding mistake that I can readily imagine myself making when iterating over a set of items at the same time as mucking with their order mid-loop.  I'm willing to bet that this optimization measure, str_llist_float(), was added a while after the original loop was designed.  Then, when everything continued to work and to deliver the exact same results (because re-scanning previously unsuccessful matches doesn't affect any final result), no regression test noticed and attention moved elsewhere.

Hope this helps.  Onward, into the fog ...

- Doug McKenna