[XeTeX] Line-breaking algorithms in XeTeX
John Was
john.was at ntlworld.com
Mon Apr 27 15:43:02 CEST 2009
Dear Pander
Thanks for that - I'll try to get my head round it next week some time,
though I'm not an ideal beta-tester for technicalities like this. I'm sure
others will give you good feedback, though. It looks very impressive!
Best
John
----- Original Message -----
From: "Pander" <pander at users.sourceforge.net>
To: "Unicode-based TeX for Mac OS X and other platforms" <xetex at tug.org>
Sent: Monday, April 27, 2009 2:08 PM
Subject: Re: [XeTeX] Line-breaking algorithms in XeTeX
> Here is the script. Example output is below, where F=font, R=right
> margin, L=left margin, B=bottom margin, T=top margin, Over=overfull,
> Under=underfull, etc. The error is calculated like this:
> math.sqrt((over * over) + (under * under) + (hyphen_percent *
> hyphen_percent) + (pages * pages))
>
> I think in the next version I will omit the pages from the error
> calculation.
>
> Please test the script and let me know what to improve.
>
> F=FreeSerif R=0.700 L=0.467 B=0.700 T=0.467 Over:3 Under:31 Pages:98
> HyphenExcept:8.229 (674/8191) Error:103.159
> F=Gentium Basic R=0.700 L=0.467 B=0.700 T=0.467 Over:18 Under:33
> Pages:100 HyphenExcept:8.229 (674/8191) Error:107.148
> F=Gentium R=0.700 L=0.467 B=0.700 T=0.467 Over:18 Under:33 Pages:100
> HyphenExcept:8.229 (674/8191) Error:107.148
> F=Gentium Book Basic R=0.700 L=0.467 B=0.700 T=0.467 Over:27 Under:31
> Pages:100 HyphenExcept:8.229 (674/8191) Error:108.433
> F=FreeSerif R=0.700 L=0.467 B=1.000 T=0.667 Over:3 Under:41 Pages:102
> HyphenExcept:8.229 (674/8191) Error:110.280
> F=Gentium Basic R=0.700 L=0.467 B=1.000 T=0.667 Over:18 Under:42
> Pages:102 HyphenExcept:8.229 (674/8191) Error:112.070
> F=Gentium R=0.700 L=0.467 B=1.000 T=0.667 Over:18 Under:42 Pages:102
> HyphenExcept:8.229 (674/8191) Error:112.070
> F=FreeSerif R=1.000 L=0.667 B=0.700 T=0.467 Over:17 Under:40 Pages:104
> HyphenExcept:8.229 (674/8191) Error:113.016
> F=Gentium R=1.000 L=0.667 B=0.700 T=0.467 Over:31 Under:39 Pages:104
> HyphenExcept:8.229 (674/8191) Error:115.610
> F=Gentium Basic R=1.000 L=0.667 B=0.700 T=0.467 Over:36 Under:38
> Pages:104 HyphenExcept:8.229 (674/8191) Error:116.721
> F=Gentium Book Basic R=0.700 L=0.467 B=1.000 T=0.667 Over:27 Under:47
> Pages:110 HyphenExcept:8.229 (674/8191) Error:122.905
> F=Gentium Book Basic R=1.000 L=0.667 B=0.700 T=0.467 Over:46 Under:39
> Pages:108 HyphenExcept:8.229 (674/8191) Error:123.971
> F=FreeSerif R=1.000 L=0.667 B=1.000 T=0.667 Over:17 Under:46 Pages:114
> HyphenExcept:8.229 (674/8191) Error:124.373
> F=FreeSerif R=0.700 L=0.467 B=1.300 T=0.867 Over:3 Under:46 Pages:116
> HyphenExcept:8.229 (674/8191) Error:125.095
> F=FreeSerif R=1.300 L=0.867 B=0.700 T=0.467 Over:29 Under:44 Pages:114
> HyphenExcept:8.229 (674/8191) Error:125.860
> F=Gentium R=1.000 L=0.667 B=1.000 T=0.667 Over:31 Under:45 Pages:114
> HyphenExcept:8.229 (674/8191) Error:126.687
> F=Gentium R=0.700 L=0.467 B=1.300 T=0.867 Over:18 Under:47 Pages:118
> HyphenExcept:8.229 (674/8191) Error:128.548
> F=Gentium Basic R=0.700 L=0.467 B=1.300 T=0.867 Over:18 Under:48
> Pages:118 HyphenExcept:8.229 (674/8191) Error:128.917
> F=Gentium Basic R=1.000 L=0.667 B=1.000 T=0.667 Over:36 Under:46
> Pages:116 HyphenExcept:8.229 (674/8191) Error:130.137
> F=FreeSerif R=1.000 L=0.667 B=1.300 T=0.867 Over:17 Under:47 Pages:122
> HyphenExcept:8.229 (674/8191) Error:132.097
> F=Gentium Book Basic R=0.700 L=0.467 B=1.300 T=0.867 Over:27 Under:48
> Pages:120 HyphenExcept:8.229 (674/8191) Error:132.290
> F=FreeSerif R=1.300 L=0.867 B=1.000 T=0.667 Over:29 Under:47 Pages:122
> HyphenExcept:8.229 (674/8191) Error:134.170
> F=Gentium R=1.000 L=0.667 B=1.300 T=0.867 Over:31 Under:48 Pages:122
> HyphenExcept:8.229 (674/8191) Error:134.969
> F=Gentium Book Basic R=1.000 L=0.667 B=1.000 T=0.667 Over:46 Under:46
> Pages:120 HyphenExcept:8.229 (674/8191) Error:136.747
> F=Gentium Basic R=1.000 L=0.667 B=1.300 T=0.867 Over:36 Under:46
> Pages:124 HyphenExcept:8.229 (674/8191) Error:137.316
> F=Gentium Book Basic R=1.000 L=0.667 B=1.300 T=0.867 Over:46 Under:51
> Pages:124 HyphenExcept:8.229 (674/8191) Error:141.988
> F=FreeSerif R=1.300 L=0.867 B=1.300 T=0.867 Over:29 Under:56 Pages:130
> HyphenExcept:8.229 (674/8191) Error:144.723
> F=Gentium R=1.300 L=0.867 B=0.700 T=0.467 Over:104 Under:47 Pages:116
> HyphenExcept:8.229 (674/8191) Error:162.938
> F=Gentium Basic R=1.300 L=0.867 B=0.700 T=0.467 Over:107 Under:48
> Pages:116 HyphenExcept:8.229 (674/8191) Error:165.157
> F=Gentium R=1.300 L=0.867 B=1.000 T=0.667 Over:104 Under:48 Pages:124
> HyphenExcept:8.229 (674/8191) Error:169.008
> F=Gentium Basic R=1.300 L=0.867 B=1.000 T=0.667 Over:107 Under:50
> Pages:124 HyphenExcept:8.229 (674/8191) Error:171.443
> F=Gentium R=1.300 L=0.867 B=1.300 T=0.867 Over:104 Under:54 Pages:130
> HyphenExcept:8.229 (674/8191) Error:175.213
> F=Gentium Basic R=1.300 L=0.867 B=1.300 T=0.867 Over:107 Under:55
> Pages:130 HyphenExcept:8.229 (674/8191) Error:177.318
> F=Gentium Book Basic R=1.300 L=0.867 B=0.700 T=0.467 Over:136 Under:48
> Pages:118 HyphenExcept:8.229 (674/8191) Error:186.525
> F=Gentium Book Basic R=1.300 L=0.867 B=1.000 T=0.667 Over:136 Under:51
> Pages:124 HyphenExcept:8.229 (674/8191) Error:191.156
> F=Gentium Book Basic R=1.300 L=0.867 B=1.300 T=0.867 Over:136 Under:60
> Pages:134 HyphenExcept:8.229 (674/8191) Error:200.299
>
>
>
> Pander wrote:
>> John Was wrote:
>>> Dear All
>>>
>>> Since starting to use (plain) XeTeX I've noticed something strange with
>>> the paragraphing/line-breaking mechanism which has never happened during
>>> the ten years or so during which I have used traditional TeX. It is
>>> cropping up in the fourth issue of a periodical that I have set with
>>> XeTeX, so I'm pretty sure that it's not a random fluke.
>>>
>>> (1) I sometimes get an overfull rule (i.e. rectangular box) at the
>>> right-hand side which will disappear when I either (a) attach the word
>>> causing the problem to the next word with ~, forcing it over (I
>>> sometimes have to put the word in an \hbox{} as well); or (b) when I
>>> increase the line-count by giving \looseness1 for the paragraph. In the
>>> past, plain TeX would always make such decisions for itself and never
>>> generate an overfull rule when it could find a way to justify the
>>> paragraph without doing so. This happens most frequently in the reviews
>>> section of the periodical, where \looseness is set to -1 by default to
>>> save as much space as possible: but until I started to use XeTeX, it
>>> was always the case that if the paragraph could not lose a line, then
>>> the negative looseness was ignored and the paragraph was set
>>> successfully with normal looseness (i.e. \looseness = 0). It was never
>>> (I think) the case that a tight looseness which generated an overfull
>>> box would get through and need manual intervention from me. So has
>>> something altered in the way XeTeX is handling the line-breaks, giving
>>> priority to the looseness command even at the expense of generating an
>>> overfull rule, and even when zero looseness would cause that error to
>>> disappear?
>>>
>>> (2) This is even more puzzling (and more of a nuisance). For the
>>> purpose of sending contributors proofs of their reviews I start each
>>> review on a new page so that they don't also receive the tops and tails
>>> of adjacent reviews, but while initially typesetting I have the reviews
>>> running on consecutively, as they will do in the final published
>>> version. There is a switch at the end of each review which generates a
>>> \vfill \eject when \ifseparatereviews is true, otherwise it just
>>> produces a \vskip: there is no other difference. Yet I sometimes get
>>> overfull rules showing up (at random points) when the reviews are
>>> separated out, even though the same paragraph typeset without error
>>> while the reviews were set to run on continuously. The problem almost
>>> (but not entirely) disappears if I double the \hfuzz when the
>>> \ifseparatereviews switch is true, but that is no more than a quick fix
>>> to prevent authors receiving proofs with worrying blobs at the
>>> right-hand side. This seems incomprehensible, but as it has happened
>>> with four out of four periodical issues I can't be imagining it - and
>>> the commands are precisely the same as the ones I used when the
>>> periodical was typeset using traditional plain TeX, with no new
>>> parameters such as alteration to \spaceskip or anything else that might
>>> cause this to happen.
>>>
>>> (1) and (2) seem likely to be part of the same problem (though not
>>> necessarily so). Any ideas, or at least insight into what XeTeX is
>>> doing that old plain TeX didn't?
>>>
>>> Thanks
>>>
>>>
>>> John
>>
>> Hi all,
>>
>> Slightly related is something I have made. Sometimes you have some
>> freedom of choice in font and in the dimensions of the margins of the
>> work you are about to make. Each selection will have a different amount
>> of:
>> - Overfull
>> - Underfull
>> - hyphenation exceptions
>>
>> I have made a python script that, via exhaustive enumeration, will find
>> the optimum settings for a minimum amount of occurrences of the list
>> above. Using those optimal settings could be a smarter starting point
>> for fixing widows, orphans and hyphenation exceptions.
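
The exhaustive enumeration described above can be sketched as follows. This is an illustration only, not Pander's code: the helper names `margin_steps` and `candidate_layouts` are hypothetical, while the font list, the 0.7 to 1.3 inch margin range, the two steps, and the 2:3 ratio of left to right and top to bottom margins are taken from the script quoted elsewhere in this thread.

```python
import itertools

FONTS = ("Gentium", "Gentium Basic", "Gentium Book Basic", "FreeSerif")


def margin_steps(minimum=0.7, maximum=1.3, steps=2):
    """Evenly spaced margin values in inches, both endpoints included."""
    step = (maximum - minimum) / steps
    return [minimum + step * i for i in range(steps + 1)]


def candidate_layouts(fonts=FONTS):
    """Yield every combination of font, right margin and bottom margin."""
    for font, right, bottom in itertools.product(
            fonts, margin_steps(), margin_steps()):
        yield {
            "font": font,
            "right": right,
            "left": right * 2.0 / 3.0,   # left margin is 2/3 of the right
            "bottom": bottom,
            "top": bottom * 2.0 / 3.0,   # top margin is 2/3 of the bottom
        }


layouts = list(candidate_layouts())
print(len(layouts))  # 36
```

Each candidate would then be typeset once and scored from its log file, and the candidates sorted by score; with four fonts and three values per margin this means 36 compilation runs.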
>>
>> If someone is interested in this script, please contact me.
>>
>> Regards,
>>
>> Pander
>>
>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> XeTeX mailing list
>>> postmaster at tug.org
>>> http://tug.org/mailman/listinfo/xetex
>>
>
>
--------------------------------------------------------------------------------
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
> #
> # name: find-optimum
> # description: Find margins and font for optimal minimum of overfulls,
> #              underfulls and hyphenation exceptions for an easy start when
> #              fixing widows, orphans and hyphenation exceptions.
> # license: GPL
> # version date       author                                  comments
> # 0.1     2009-04-27 Pander <pander at users.sourceforge.net> initial version
>
> import os
> import math
>
> if __name__ == '__main__':
>     # safety: comment out the print and exit below to actually run the script
>     print('Make absolutely sure you know what you are doing. Please read '
>           'the script before executing it. No warranty or guarantee apply')
>     exit(0)
>
>     # create a template
>     if not os.path.exists('document.tex'):
>         print('Error, document.tex does not exist.')
>         exit(1)
>     documentfile = open('document.tex', 'r')
>     documenttemplatefile = open('document.tex.template', 'w')
>     foundFont = False
>     foundMargins = False
>     for line in documentfile:
>         if (line.find('%') == -1
>                 and line.find('\\setmainfont[Mapping=tex-text]{') != -1
>                 and line.find('}') != -1):
>             # replace the font name with a placeholder
>             line = '\\setmainfont[Mapping=tex-text]{xFFFx}\n'
>             foundFont = True
>         elif (line.find('%') == -1
>                 and line.find('\\usepackage[paperwidth=') != -1
>                 and line.find('}]{geometry}') != -1):
>             # replace the margins with placeholders
>             line = (line[:line.find('hdivide')] +
>                     'hdivide={xLLLxin,,xRRRxin},vdivide={xTTTxin,,xBBBxin}]{geometry}\n')
>             foundMargins = True
>         documenttemplatefile.write(line)
>     documentfile.close()
>     documenttemplatefile.close()
>     if not foundFont:
>         print('Error, could not find a line like '
>               '"\\setmainfont[Mapping=tex-text]{...}" in document.tex, '
>               'no comments are allowed in this line')
>         exit(1)
>     if not foundMargins:
>         print('Error, could not find a line like '
>               '"\\usepackage[paperwidth=...in,paperheight=...in,includehead,'
>               'includefoot,hdivide={...in,,...in},vdivide={...in,,...in}]{geometry}" '
>               'in document.tex, no comments are allowed in this line')
>         exit(1)
>
>     # move document.tex to document.tex.backup as a backup
>     os.system('/bin/mv -f document.tex document.tex.backup')
>
>     # analyse all possibilities
>     fonts = ('Gentium', 'Gentium Basic', 'Gentium Book Basic', 'FreeSerif')
>     min_right_margin = .7
>     max_right_margin = 1.3
>     steps_in_margin = 2
>     min_bottom_margin = min_right_margin
>     step_size = (max_right_margin - min_right_margin) / steps_in_margin
>     results = []
>     n = 0
>     for f in fonts:
>         for i in range(steps_in_margin + 1):
>             right = min_right_margin + step_size * float(i)
>             left = right * 2.0 / 3.0
>             r = "%1.3f" % right
>             l = "%1.3f" % left
>             for j in range(steps_in_margin + 1):
>                 n = n + 1
>                 # create tex file
>                 bottom = min_bottom_margin + step_size * float(j)
>                 top = bottom * 2.0 / 3.0
>                 b = "%1.3f" % bottom
>                 t = "%1.3f" % top
>                 os.system("/bin/cp -f document.tex.template document.tex")
>                 os.system("sed -ie 's/xLLLx/%s/' document.tex" % l)
>                 os.system("sed -ie 's/xRRRx/%s/' document.tex" % r)
>                 os.system("sed -ie 's/xTTTx/%s/' document.tex" % t)
>                 os.system("sed -ie 's/xBBBx/%s/' document.tex" % b)
>                 os.system("sed -ie 's/xFFFx/%s/' document.tex" % f)
>
>                 # create log file
>                 os.system('make clean')
>                 os.system('make pdf')
>
>                 # keep a copy of each run's source and log
>                 os.system("/bin/cp -f document.tex %d.tex" % n)
>                 os.system("/bin/cp -f document.log %d.log" % n)
>
>                 # analyse result in log file
>                 over = 0
>                 under = 0
>                 hyphen_exceptions = 0
>                 hyphen_total = 0
>                 hyphen_percent = 0
>                 pages = 0
>                 logfile = open('document.log', 'r')
>                 for line in logfile:
>                     if 'Overfull' in line:
>                         over = over + 1
>                     elif 'Underfull' in line:
>                         under = under + 1
>                     elif 'hyphenation exceptions out of' in line:
>                         hyphen = line.split()
>                         hyphen_exceptions = int(hyphen[0])
>                         hyphen_total = int(hyphen[5])
>                         hyphen_percent = (100 * float(hyphen_exceptions) /
>                                           float(hyphen_total))
>                     elif 'Output written on ' in line:
>                         pages = int(line.split()[4][1:])
>                 logfile.close()
>                 error = math.sqrt((over * over) + (under * under) +
>                                   (hyphen_percent * hyphen_percent) +
>                                   (pages * pages))
>                 results.append((error,
>                     'F=' + f + ' R=' + r + ' L=' + l + ' B=' + b + ' T=' + t +
>                     ' Over:' + str(over) + ' Under:' + str(under) +
>                     ' Pages:%d HyphenExcept:%1.3f (%d/%d) Error:%1.3f' %
>                     (pages, hyphen_percent, hyphen_exceptions, hyphen_total,
>                      error)))
>
>     # present results, lowest error first
>     results.sort()
>     for i in results:
>         print(i[1])
>
>     # restore document.tex from document.tex.backup
>     os.system('/bin/cp -f document.tex.backup document.tex')
>
--------------------------------------------------------------------------------