[XeTeX] Line-breaking algorithms in XeTeX

John Was john.was at ntlworld.com
Mon Apr 27 15:43:02 CEST 2009


Dear Pander

Thanks for that - I'll try to get my head round it next week some time, 
though I'm not an ideal beta-tester for technicalities like this.  I'm sure 
others will give you good feedback, though.  It looks very impressive!

Best


John



----- Original Message ----- 
From: "Pander" <pander at users.sourceforge.net>
To: "Unicode-based TeX for Mac OS X and other platforms" <xetex at tug.org>
Sent: Monday, April 27, 2009 2:08 PM
Subject: Re: [XeTeX] Line-breaking algorithms in XeTeX


> Here is the script. Example output is below, where F=font, R=right
> margin, L=left margin, B=bottom margin, T=top margin, Over=overfull,
> Under=underfull, etc. The error is calculated as:
>  math.sqrt((over * over) + (under * under) + (hyphen_percent *
> hyphen_percent) + (pages * pages))
>
> I think in the next version I will omit the pages from the error
> calculation.
>
> Please test the script and let me know what to improve.
>
> F=FreeSerif R=0.700 L=0.467 B=0.700 T=0.467   Over:3 Under:31 Pages:98
> HyphenExcept:8.229 (674/8191)   Error:103.159
> F=Gentium Basic R=0.700 L=0.467 B=0.700 T=0.467   Over:18 Under:33
> Pages:100 HyphenExcept:8.229 (674/8191)   Error:107.148
> F=Gentium R=0.700 L=0.467 B=0.700 T=0.467   Over:18 Under:33 Pages:100
> HyphenExcept:8.229 (674/8191)   Error:107.148
> F=Gentium Book Basic R=0.700 L=0.467 B=0.700 T=0.467   Over:27 Under:31
> Pages:100 HyphenExcept:8.229 (674/8191)   Error:108.433
> F=FreeSerif R=0.700 L=0.467 B=1.000 T=0.667   Over:3 Under:41 Pages:102
> HyphenExcept:8.229 (674/8191)   Error:110.280
> F=Gentium Basic R=0.700 L=0.467 B=1.000 T=0.667   Over:18 Under:42
> Pages:102 HyphenExcept:8.229 (674/8191)   Error:112.070
> F=Gentium R=0.700 L=0.467 B=1.000 T=0.667   Over:18 Under:42 Pages:102
> HyphenExcept:8.229 (674/8191)   Error:112.070
> F=FreeSerif R=1.000 L=0.667 B=0.700 T=0.467   Over:17 Under:40 Pages:104
> HyphenExcept:8.229 (674/8191)   Error:113.016
> F=Gentium R=1.000 L=0.667 B=0.700 T=0.467   Over:31 Under:39 Pages:104
> HyphenExcept:8.229 (674/8191)   Error:115.610
> F=Gentium Basic R=1.000 L=0.667 B=0.700 T=0.467   Over:36 Under:38
> Pages:104 HyphenExcept:8.229 (674/8191)   Error:116.721
> F=Gentium Book Basic R=0.700 L=0.467 B=1.000 T=0.667   Over:27 Under:47
> Pages:110 HyphenExcept:8.229 (674/8191)   Error:122.905
> F=Gentium Book Basic R=1.000 L=0.667 B=0.700 T=0.467   Over:46 Under:39
> Pages:108 HyphenExcept:8.229 (674/8191)   Error:123.971
> F=FreeSerif R=1.000 L=0.667 B=1.000 T=0.667   Over:17 Under:46 Pages:114
> HyphenExcept:8.229 (674/8191)   Error:124.373
> F=FreeSerif R=0.700 L=0.467 B=1.300 T=0.867   Over:3 Under:46 Pages:116
> HyphenExcept:8.229 (674/8191)   Error:125.095
> F=FreeSerif R=1.300 L=0.867 B=0.700 T=0.467   Over:29 Under:44 Pages:114
> HyphenExcept:8.229 (674/8191)   Error:125.860
> F=Gentium R=1.000 L=0.667 B=1.000 T=0.667   Over:31 Under:45 Pages:114
> HyphenExcept:8.229 (674/8191)   Error:126.687
> F=Gentium R=0.700 L=0.467 B=1.300 T=0.867   Over:18 Under:47 Pages:118
> HyphenExcept:8.229 (674/8191)   Error:128.548
> F=Gentium Basic R=0.700 L=0.467 B=1.300 T=0.867   Over:18 Under:48
> Pages:118 HyphenExcept:8.229 (674/8191)   Error:128.917
> F=Gentium Basic R=1.000 L=0.667 B=1.000 T=0.667   Over:36 Under:46
> Pages:116 HyphenExcept:8.229 (674/8191)   Error:130.137
> F=FreeSerif R=1.000 L=0.667 B=1.300 T=0.867   Over:17 Under:47 Pages:122
> HyphenExcept:8.229 (674/8191)   Error:132.097
> F=Gentium Book Basic R=0.700 L=0.467 B=1.300 T=0.867   Over:27 Under:48
> Pages:120 HyphenExcept:8.229 (674/8191)   Error:132.290
> F=FreeSerif R=1.300 L=0.867 B=1.000 T=0.667   Over:29 Under:47 Pages:122
> HyphenExcept:8.229 (674/8191)   Error:134.170
> F=Gentium R=1.000 L=0.667 B=1.300 T=0.867   Over:31 Under:48 Pages:122
> HyphenExcept:8.229 (674/8191)   Error:134.969
> F=Gentium Book Basic R=1.000 L=0.667 B=1.000 T=0.667   Over:46 Under:46
> Pages:120 HyphenExcept:8.229 (674/8191)   Error:136.747
> F=Gentium Basic R=1.000 L=0.667 B=1.300 T=0.867   Over:36 Under:46
> Pages:124 HyphenExcept:8.229 (674/8191)   Error:137.316
> F=Gentium Book Basic R=1.000 L=0.667 B=1.300 T=0.867   Over:46 Under:51
> Pages:124 HyphenExcept:8.229 (674/8191)   Error:141.988
> F=FreeSerif R=1.300 L=0.867 B=1.300 T=0.867   Over:29 Under:56 Pages:130
> HyphenExcept:8.229 (674/8191)   Error:144.723
> F=Gentium R=1.300 L=0.867 B=0.700 T=0.467   Over:104 Under:47 Pages:116
> HyphenExcept:8.229 (674/8191)   Error:162.938
> F=Gentium Basic R=1.300 L=0.867 B=0.700 T=0.467   Over:107 Under:48
> Pages:116 HyphenExcept:8.229 (674/8191)   Error:165.157
> F=Gentium R=1.300 L=0.867 B=1.000 T=0.667   Over:104 Under:48 Pages:124
> HyphenExcept:8.229 (674/8191)   Error:169.008
> F=Gentium Basic R=1.300 L=0.867 B=1.000 T=0.667   Over:107 Under:50
> Pages:124 HyphenExcept:8.229 (674/8191)   Error:171.443
> F=Gentium R=1.300 L=0.867 B=1.300 T=0.867   Over:104 Under:54 Pages:130
> HyphenExcept:8.229 (674/8191)   Error:175.213
> F=Gentium Basic R=1.300 L=0.867 B=1.300 T=0.867   Over:107 Under:55
> Pages:130 HyphenExcept:8.229 (674/8191)   Error:177.318
> F=Gentium Book Basic R=1.300 L=0.867 B=0.700 T=0.467   Over:136 Under:48
> Pages:118 HyphenExcept:8.229 (674/8191)   Error:186.525
> F=Gentium Book Basic R=1.300 L=0.867 B=1.000 T=0.667   Over:136 Under:51
> Pages:124 HyphenExcept:8.229 (674/8191)   Error:191.156
> F=Gentium Book Basic R=1.300 L=0.867 B=1.300 T=0.867   Over:136 Under:60
> Pages:134 HyphenExcept:8.229 (674/8191)   Error:200.299
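
The Error column can be reproduced from the other columns. The following is a minimal sketch in Python 3 (the quoted script itself is Python 2), assuming the formula quoted above. The function name `layout_error` is illustrative and does not appear in the script.

```python
import math


def layout_error(over, under, hyphen_exceptions, hyphen_total, pages):
    """Euclidean norm of the four measures, following the quoted formula."""
    # Hyphenation exceptions enter as a percentage of the total, not a raw count.
    hyphen_percent = 100.0 * hyphen_exceptions / hyphen_total
    return math.sqrt(over ** 2 + under ** 2 + hyphen_percent ** 2 + pages ** 2)


# First row of the output above: FreeSerif, R=0.700, B=0.700.
best = layout_error(over=3, under=31, hyphen_exceptions=674,
                    hyphen_total=8191, pages=98)
print("%.3f" % best)  # 103.159
```

Since Pages is close to 100 in every run while the other terms are much smaller, the page count contributes most of each score, which seems consistent with the plan to drop it from the calculation.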
>
>
>
> Pander wrote:
>> John Was wrote:
>>> Dear All
>>>
>>> Since starting to use (plain) XeTeX I've noticed something strange with
>>> the paragraphing/line-breaking mechanism which has never happened during
>>> the ten years or so during which I have used traditional TeX.  It is
>>> cropping up in the fourth issue of a periodical that I have set with
>>> XeTeX, so I'm pretty sure that it's not a random fluke.
>>>
>>> (1) I sometimes get an overfull rule (i.e. rectangular box) at the
>>> right-hand side which will disappear when I either (a) attach the word
>>> causing the problem to the next word with ~, forcing it over (I
>>> sometimes have to put the word in an \hbox{} as well); or (b) when I
>>> increase the line-count by giving \looseness1 for the paragraph.  In the
>>> past, plain TeX would always make such decisions for itself and never
>>> generate an overfull rule when it could find a way to justify the
>>> paragraph without doing so.  This happens most frequently in the reviews
>>> section of the periodical, where  \looseness is set to -1 by default to
>>> save as much space as possible:  but until I started to use XeTeX, it
>>> was always the case that if the paragraph could not lose a line, then
>>> the negative looseness was ignored and the paragraph was set
>>> successfully with normal looseness  (i.e. \looseness = 0).  It was never
>>> (I think) the case that a tight looseness which generated an overfull
>>> box would get through and need manual intervention from me.  So has
>>> something altered in the way XeTeX is handling the line-breaks, giving
>>> priority to the looseness command even at the expense of generating an
>>> overfull rule, and even when zero looseness would cause that error to
>>> disappear?
>>>
>>> (2) This is even more puzzling (and more of a nuisance).  For the
>>> purpose of sending contributors proofs of their reviews I start each
>>> review on a new page so that they don't also receive the tops and tails
>>> of adjacent reviews, but while initially typesetting I have the reviews
>>> running on consecutively, as they will do in the final published
>>> version.  There is a switch at the end of each review which generates a
>>> \vfill \eject when \ifseparatereviews is true, otherwise it just
>>> produces a \vskip: there is no other difference.  Yet I sometimes get
>>> overfull rules showing up (at random points) when the reviews are
>>> separated out, even though the same paragraph typeset without error
>>> while the reviews were set to run on continuously.  The problem almost
>>> (but not entirely) disappears if I double the \hfuzz when the
>>> \ifseparatereviews switch is true, but that is no more than a quick fix
>>> to prevent authors receiving proofs with worrying blobs at the
>>> right-hand side.  This seems incomprehensible, but as it has happened
>>> with four out of four periodical issues I can't be imagining it - and
>>> the commands are precisely the same as the ones I used when the
>>> periodical was typeset using traditional plain TeX, with no new
>>> parameters such as alteration to \spaceskip or anything else that might
>>> cause this to happen.
>>>
>>> (1) and (2) seem likely to be part of the same problem (though not
>>> necessarily so).  Any ideas, or at least insight into what XeTeX is
>>> doing that old plain TeX didn't?
>>>
>>> Thanks
>>>
>>>
>>> John
>>
>> Hi all,
>>
>> Slightly related is something I have made. Sometimes you have some
>> freedom of choice in the font and in the dimensions of the margins of
>> the work you are about to make. Each selection will produce a different
>> number of:
>> - overfull boxes
>> - underfull boxes
>> - hyphenation exceptions
>>
>> I have made a Python script that, via exhaustive enumeration, will find
>> the settings that minimise the number of occurrences of the items
>> listed above. Using those optimal settings could be a smarter starting
>> point for fixing widows, orphans and hyphenation exceptions.
>>
>> If anyone is interested in this script, please contact me.
>>
>> Regards,
>>
>> Pander
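
The exhaustive enumeration described above amounts to walking a small grid of fonts and margins. A minimal Python 3 sketch of that grid follows, using the fonts and margin range from the script quoted later in this message. The names `outer` and `candidates` are illustrative, and the typesetting and scoring of each candidate are left out.

```python
import itertools

# Values taken from the quoted script; the variable names are illustrative.
fonts = ("Gentium", "Gentium Basic", "Gentium Book Basic", "FreeSerif")
min_margin, max_margin, steps = 0.7, 1.3, 2
step_size = (max_margin - min_margin) / steps

# Outer margins (right, bottom) in inches; the opposite margins (left, top)
# are two thirds of them, as in the script.
outer = [min_margin + step_size * i for i in range(steps + 1)]

candidates = []
for font, right, bottom in itertools.product(fonts, outer, outer):
    candidates.append({
        "font": font,
        "right": "%.3f" % right,
        "left": "%.3f" % (right * 2.0 / 3.0),
        "bottom": "%.3f" % bottom,
        "top": "%.3f" % (bottom * 2.0 / 3.0),
    })

print(len(candidates))  # 36
print(candidates[0])
```

Four fonts and three steps for each of the two margin pairs give 36 candidates, one per row of the example output.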
>>
>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> XeTeX mailing list
>>> postmaster at tug.org
>>> http://tug.org/mailman/listinfo/xetex
>>
>
>


--------------------------------------------------------------------------------


> #!/usr/bin/python
> # -*- coding: utf-8 -*-
> #
> # name:          find-optimum
> # description:   Find margins and font for an optimal minimum of
> #                overfulls, underfulls and hyphenation exceptions, for an
> #                easy start when fixing widows, orphans and hyphenation
> #                exceptions.
> # license:       GPL
> # version date   author                                  comments
> # 0.1 2009-04-27 Pander <pander at users.sourceforge.net>   initial version
>
> import os
> import math
>
> if __name__ == '__main__':
>    # safety: comment out the next two lines once you have read the script
>    print ('Make absolutely sure you know what you are doing. Please read '
>           'the script before executing it. No warranty or guarantee applies.')
>    exit(0)
>
>    # create a template
>    if not os.path.exists('document.tex'):
>        print 'Error, document.tex does not exist.'
>        exit(1)
>    documentfile = open('document.tex', 'r')
>    documenttemplatefile = open('document.tex.template', 'w')
>    foundFont = False
>    foundMargins = False
>    for line in documentfile:
>        if (line.find('%') == -1
>                and line.find('\\setmainfont[Mapping=tex-text]{') != -1
>                and line.find('}') != -1):
>            line = '\\setmainfont[Mapping=tex-text]{xFFFx}\n'
>            foundFont = True
>        elif (line.find('%') == -1
>                and line.find('\\usepackage[paperwidth=') != -1
>                and line.find('}]{geometry}') != -1):
>            line = (line[:line.find('hdivide')] +
>                    'hdivide={xLLLxin,,xRRRxin},'
>                    'vdivide={xTTTxin,,xBBBxin}]{geometry}\n')
>            foundMargins = True
>        documenttemplatefile.write(line)
>    documentfile.close()
>    documenttemplatefile.close()
>    if not foundFont:
>        print ('Error, could not find a line like '
>               '"\\setmainfont[Mapping=tex-text]{...}" in document.tex; '
>               'no comments are allowed in this line')
>        exit(1)
>    if not foundMargins:
>        print ('Error, could not find a line like '
>               '"\\usepackage[paperwidth=...in,paperheight=...in,'
>               'includehead,includefoot,hdivide={...in,,...in},'
>               'vdivide={...in,,...in}]{geometry}" '
>               'in document.tex; no comments are allowed in this line')
>        exit(1)
>
>    # making backup of document.tex to document.tex.backup
>    os.system('/bin/mv -f document.tex document.tex.backup')
>
>    # analyse all possibilities
>    fonts = ('Gentium', 'Gentium Basic', 'Gentium Book Basic', 'FreeSerif')
>    min_right_margin = .7
>    max_right_margin = 1.3
>    steps_in_margin = 2
>    min_bottom_margin = min_right_margin
>    step_size = (max_right_margin - min_right_margin) / steps_in_margin
>    results = []
>    n = 0
>    for f in fonts:
>        for i in range(steps_in_margin + 1):
>            right = min_right_margin + step_size * float(i)
>            left = right * 2.0 / 3.0
>            r = "%1.3f" % right
>            l = "%1.3f" % left
>            for j in range(steps_in_margin + 1):
>                n = n + 1
>                # create tex file
>                bottom = min_bottom_margin + step_size * float(j)
>                top = bottom * 2.0 / 3.0
>                b = "%1.3f" % bottom
>                t = "%1.3f" % top
>                os.system("/bin/cp -f document.tex.template document.tex")
>                os.system("sed -ie 's/xLLLx/%s/' document.tex" % l)
>                os.system("sed -ie 's/xRRRx/%s/' document.tex" % r)
>                os.system("sed -ie 's/xTTTx/%s/' document.tex" % t)
>                os.system("sed -ie 's/xBBBx/%s/' document.tex" % b)
>                os.system("sed -ie 's/xFFFx/%s/' document.tex" % f)
>
>                # create log file
>                os.system('make clean')
>                os.system('make pdf')
>
>                os.system("/bin/cp -f document.tex %d.tex" % n)
>                os.system("/bin/cp -f document.log %d.log" % n)
>
>                # analyse result in log file
>                over = 0
>                under = 0
>                hyphen_exceptions = 0
>                hyphen_total = 0
>                hyphen_percent = 0
>                pages = 0
>                logfile = open('document.log', 'r')
>                for line in logfile:
>                    if 'Overfull' in line:
>                        over = over + 1
>                    elif 'Underfull' in line:
>                        under = under + 1
>                    elif 'hyphenation exceptions out of' in line:
>                        hyphen = line.split()
>                        hyphen_exceptions = int(hyphen[0])
>                        hyphen_total = int(hyphen[5])
>                        hyphen_percent = (100 * float(hyphen_exceptions) /
>                                          float(hyphen_total))
>                    elif 'Output written on ' in line:
>                        pages = int(line.split()[4][1:])
>                logfile.close()
>                error = math.sqrt((over * over) + (under * under) +
>                                  (hyphen_percent * hyphen_percent) +
>                                  (pages * pages))
>                results.append((error,
>                    'F=' + f + ' R=' + r + ' L=' + l + ' B=' + b + ' T=' + t +
>                    '   Over:' + str(over) + ' Under:' + str(under) +
>                    ' Pages:%d HyphenExcept:%1.3f (%d/%d)   Error:%1.3f' %
>                    (pages, hyphen_percent, hyphen_exceptions, hyphen_total,
>                     error)))
>
>    # present results
>    results.sort()
>    for i in results:
>        print i[1]
>
>    # restore document.tex from document.tex.backup
>    os.system('/bin/cp -f document.tex.backup document.tex')
>
>
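
The log-analysis step of the script can also be written as a function that works on the log text, which makes it easy to try without running TeX. The following Python 3 sketch assumes the usual TeX log wording for these four kinds of lines. The function name `summarise_log` and the sample text are made up for illustration and do not come from a real run.

```python
import re

PAGES_RE = re.compile(r"Output written on \S+ \((\d+) pages?")
HYPHEN_RE = re.compile(r"(\d+) hyphenation exceptions? out of (\d+)")


def summarise_log(text):
    """Count box warnings and pull page and hyphenation totals from a TeX log."""
    stats = {"over": 0, "under": 0, "pages": 0,
             "hyphen_exceptions": 0, "hyphen_total": 0}
    for line in text.splitlines():
        if line.startswith("Overfull"):
            stats["over"] += 1
        elif line.startswith("Underfull"):
            stats["under"] += 1
        else:
            m = HYPHEN_RE.search(line)
            if m:
                stats["hyphen_exceptions"] = int(m.group(1))
                stats["hyphen_total"] = int(m.group(2))
            m = PAGES_RE.search(line)
            if m:
                stats["pages"] = int(m.group(1))
    return stats


# Made-up log excerpt for illustration; not taken from a real run.
sample = "\n".join([
    r"Overfull \hbox (3.2pt too wide) in paragraph at lines 10--12",
    r"Underfull \hbox (badness 10000) in paragraph at lines 20--22",
    r"Underfull \vbox (badness 1234) has occurred while \output is active",
    " 674 hyphenation exceptions out of 8191",
    "Output written on document.pdf (98 pages).",
])
print(summarise_log(sample))
```

Unlike the script, which counts any line containing the word, this sketch only counts lines that start with `Overfull` or `Underfull`, so document text echoed into the log is less likely to be miscounted. Whether that matters depends on the document.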

