New ghostscript/ghostpdl release candidate
Bruno Voisin
bvoisin at icloud.com
Fri Feb 23 09:56:30 CET 2024
> Nelson Beebe wrote:
>
> I just received news on the gs-devel at ghostscript.com list about a new
> release candidate:
>
> https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/tag/gs10030rc1
>
> The news site at
>
> https://ghostscript.readthedocs.io/en/gs10.03.0/News.html
>
> contains important updates about the changes, some of which enhance
> security, and the better integration of optical character recognition
> support with the Tesseract engine. Building with the latter takes
> extra steps, described here:
>
> https://ghostscript.com/blog/ocr.html
>
> I've been using an earlier gs-9.54.0 build with Tesseract support
> since 20-Mar-2021, and employ it often to turn bitmap PDFs into text
> searchable ones.
>
> That early release has dictionaries for multiple human languages, but
> no readily visible way to change from English OCR to other languages.
> Reinhard Kotucha wrote:
>
> Let me add that TeX Live provides Ghostscript only for Windows.
> Because its sole purpose is to support scripts shipped with TeX Live,
> some stuff was removed, see
>
> https://tug.org/svn/texlive/trunk/Master/tlpkg/tlgs/README.TEXLIVE
>
> Maybe we should mention the removal of Tesseract there too.
>
> If Windows users need Ghostscript for anything else (printer driver,
> OCR,...) they have to install an external Ghostscript themselves.
>
> TeX Live's Ghostscript binaries are not in PATH and thus do not
> interfere with external installations.
Hi Reinhard, hi Nelson,
Please let me add a couple of things.
First, regarding Ghostscript in the Mac version of TeX Live.
MacTeX includes three Ghostscript-related components:
- One is Ghostscript itself, compiled with the usual configure && make && make install and installed in /usr/local/bin, hence in PATH. There is no customization (hence Tesseract is included) other than --disable-compile-inits (aka COMPILE_INITS=0, which I think is also used on Windows). It is installed by default, but the user can opt out.
- The other two are libgs from Ghostscript, and mutool from MuPDF. These are used by dvisvgm for PS -> SVG conversion and PDF -> SVG conversion, respectively, if present. Their install is optional, unchecked by default.
The Ghostscript package is also available separately from
https://tug.org/mactex/morepackages.html
As new versions of Ghostscript are released twice a year in Spring and Autumn, Dick Koch makes two packages (Ghostscript, and Ghostscript Extras namely libgs and mutool) available from
https://pages.uoregon.edu/koch/
This is also where you can find builds of older Ghostscript versions.
The version of Ghostscript included in MacTeX 2024 will most likely be the current release version, 10.02.1.
Like Nelson yesterday, I received the announcement of the release candidate of 10.03.0 on gs-devel, and tested its compilation on the latest macOS, on an ARM processor. Nothing to report, things work just as before.
The release notes say "As of this release (10.03.0) pdfwrite creates PDF files with XRef streams and ObjStm streams. This can result in considerably smaller PDF output files." I tested with an old TeX document of mine, from a time (2010) when I was still using EPS figures: 3 857 104 bytes using gs 10.02.1, 3 809 998 bytes using gs 10.03.0. Maybe there are other types of PostScript documents for which the size reduction is more significant.
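For scale, the saving between the two sizes quoted above can be worked out with a few lines of shell. This is only a sketch: the variable names are made up for illustration, and awk is assumed to be available.

```shell
#!/bin/sh
# File sizes in bytes, as reported above (variable names are illustrative).
old_size=3857104   # pdfwrite output with gs 10.02.1
new_size=3809998   # pdfwrite output with gs 10.03.0

# Absolute saving in bytes, using POSIX shell arithmetic.
saved=$((old_size - new_size))

# Relative saving as a percentage of the old size, rounded to two decimals.
# LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
percent=$(LC_ALL=C awk -v o="$old_size" -v n="$new_size" \
    'BEGIN { printf "%.2f", (o - n) * 100 / o }')

echo "saved $saved bytes ($percent%)"
```

For this document the saving is 47 106 bytes, or about 1.2% of the original size.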
Second, regarding OCR: there's now a switch -sOCRLanguage to change language, see
https://ghostscript.readthedocs.io/en/gs10.03.0/Devices.html#ocr-text-output
I had tested OCR briefly when Tesseract/Leptonica support was added to Ghostscript, but I had not used it since.
Your message motivated me to try again with a text in my native language (French), taking the attached first page of a paper by Henri Poincaré, installing /usr/local/share/tessdata/fra.traineddata then running
gs -sDEVICE=pdfocr8 -sOCRLanguage=fra -o Poincare-1910-Ghostscript.pdf -r600 -dDownScaleFactor=3 Poincare-1910.pdf
The output is attached, together with the output of OCR with Acrobat DC.
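The command above generalizes to other languages and files. Here is a minimal sketch of a wrapper, not a tested tool: the function names ocr_command and have_language and the TESSDATA_DIR variable are hypothetical names introduced for illustration, and only the gs options are taken from the command above. It prints the command rather than running it, so that it can be inspected first.

```shell
#!/bin/sh
# Directory holding the *.traineddata files; the default is the location
# used above. TESSDATA_DIR is a name made up for this sketch.
TESSDATA_DIR=${TESSDATA_DIR:-/usr/local/share/tessdata}

# have_language LANG: succeed if LANG.traineddata exists in TESSDATA_DIR.
have_language() {
    [ -f "$TESSDATA_DIR/$1.traineddata" ]
}

# ocr_command LANG INPUT OUTPUT: print the gs command line used above,
# with the language and file names substituted. File names containing
# spaces are not quoted in the printed line.
ocr_command() {
    printf 'gs -sDEVICE=pdfocr8 -sOCRLanguage=%s -o %s -r600 -dDownScaleFactor=3 %s\n' \
        "$1" "$3" "$2"
}

# Example: show the command for the French page discussed here.
ocr_command fra Poincare-1910.pdf Poincare-1910-Ghostscript.pdf
```

Calling have_language fra beforehand lets a script fail early when the dictionary has not been installed in that directory.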
The first paragraph is OCR'ed by Ghostscript/Tesseract as
Lord Kelvin s’est, l’un des premiers, prononcé en faveur de ; la solidité du globe terrestre, et il a cherché de tous côtés des
arguments en faveur de son opinion; quelques-uns sont fondés sur les observations de précession et de nutation. Je renverrai en
particulier à ses Popular Lectures, Vol. IIL, page 244, et à ses Mathematical Papers, Vol. IT, page 320. Dans ses investiga- tions, il envisage l’hypothèse d’une croûte solide invariable, à
l’intérieur de laquelle se trouve un liquide homogène; il suppose que la surface extérieure de cette croûte solide est un ellipsoïde
et que la cavité interne est également ellipsoïdale.
There are a couple of mistakes caused by the formatting of the printed document, but other than that all the French accents and punctuation are there, exactly as in the original French text. Pretty impressive!
Acrobat, by contrast, gives
Lord Kelvin s'est., l'un des premiers, prononcé en faveur de la soJidité du globe terrestre, et il a cherché de tous côtés des arguments en faveur de son opinion; quelques-uns sont fondés sur les observations de précession et de nutation. Je renverrai en particulier à ses Popular Lectures, VoJ. III, page 244, et à ses Mathe,nalical Papers, Vol. III, page 320. Dans ses investiga-
tions, il envisage l'hypothèse d'une croùte solide invariable, à l'intérieur de laquelle ;e trouve un liquide homogène; il suppose que la surface extérieure de cette croûte solide est un ellipso,·de
et que la cavité intern~ est également ellipsoïdale.
Not really bad, but not the same quality!
Bruno Voisin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Poincare-1910-Acrobat.pdf
Type: application/pdf
Size: 68260 bytes
Desc: not available
URL: <https://tug.org/pipermail/tex-live/attachments/20240223/8703c049/attachment-0003.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Poincare-1910-Ghostscript.pdf
Type: application/pdf
Size: 164974 bytes
Desc: not available
URL: <https://tug.org/pipermail/tex-live/attachments/20240223/8703c049/attachment-0004.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Poincare-1910.pdf
Type: application/pdf
Size: 85498 bytes
Desc: not available
URL: <https://tug.org/pipermail/tex-live/attachments/20240223/8703c049/attachment-0005.pdf>