XeLaTeX to Word/OpenOffice - the state of the art?

Zdenek Wagner zdenek.wagner at gmail.com
Fri Mar 15 13:47:15 CET 2019


On Fri, 15 Mar 2019 at 13:35, BPJ <bpj at melroch.se> wrote:
>
> On 2019-03-15 at 08:31, Janusz S. Bień wrote:
> > On Fri, Mar 15 2019 at  7:19 +01, BPJ wrote:
> >> I use, despite myself, Google Docs to convert PDF to DOCX,
> >
> > How???
> >
> >> then Pandoc from DOCX to everything else. It works even with weird
> >> magazine layouts.
> >
> > Best regards
> >
> > Janusz
> >
>
> This may be old news to some, but I can’t remember having seen it,
> so I make a post for the record.
>
> I just discovered that you can convert a PDF to Markdown (or any
> other format Pandoc supports) by uploading it to Google Drive,
> opening it in Google Docs and downloading it from there as DOCX,
> then converting the DOCX to Markdown with Pandoc. The result is
> quite good!
>
> The steps:
>
> 1.  Log into <drive.google.com> in a web browser.
>
> 2.  Select the menu [My Drive⏷] → [Upload files…] in the top bar.
>
OK, this is exactly what I did. I am attaching screenshots of the
original PDF and of the downloaded DOCX opened in LibreOffice. As you
can see, the equations are unusable; the last one even contains Tibetan
characters (why???), and they are mixed into the text paragraphs.

I do not have a Tibetan font configured properly in Firefox (I do not
know Tibetan, so I do not care), hence I see different garbage glyphs
in Google Docs.

>      More recently there is a “button” [+ New] in the top left
> corner. Click on it and select [File upload] in the menu which
> appears.
>
> 3.  At least on my system a file dialog opens. Browse to the PDF
> file; select it; click [Open].
>
> 4.  (If this doesn’t work try step 5.)
>
>      i.  The file appears in the “Quick access” field just below
> the top bar. You may need to refresh a couple of times.
>      ii. Right-click the file thumbnail; choose [Open with] →
> [Google Docs].
>
> 5.  If step 4 doesn’t work (the PDF file doesn’t appear in the
> quick access field):
>
>      i.  Start typing the PDF file name in the [Search Drive] box
> at the top.
>      ii. Click on the file in the menu which appears.
>      iii. The file opens in the Drive PDF viewer.
>      iv. At the top there is a menu [Open with Google Docs]. Click
> on it and select Google Docs.
>
>      Or look up the file in the file list and follow 4.ii. (Hard
> when there are lots of files in the list!)
>
> 6.  You should now find yourself in the Google Docs document view.
>
> 7.  In the [File] menu choose [Download as] → [Microsoft Word
> (.docx)].
>
> 8.  Save the DOCX file to disk and convert it with Pandoc the same
> as you would any DOCX file, or edit it with Word/LibreOffice/… if
> you are of that persuasion.
>
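
Step 8 can be scripted. What follows is only a minimal sketch in Python,
assuming pandoc is installed and on the PATH; the file names and the
function names are made up for illustration. The options --from, --to
and --output are ordinary pandoc options.

```python
import shutil
import subprocess


def pandoc_docx_to_markdown_cmd(docx_path, md_path):
    """Build the pandoc command line for a DOCX -> Markdown conversion."""
    return [
        "pandoc",
        "--from", "docx",
        "--to", "markdown",
        "--output", str(md_path),
        str(docx_path),
    ]


def convert(docx_path, md_path):
    """Run pandoc if it is installed; return True when the conversion ran."""
    if shutil.which("pandoc") is None:
        return False  # pandoc is not on the PATH, so nothing is done
    subprocess.run(pandoc_docx_to_markdown_cmd(docx_path, md_path), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names: the DOCX downloaded from Google Docs.
    print(" ".join(pandoc_docx_to_markdown_cmd("converted.docx", "converted.md")))
```

Building the command as a list and passing it to subprocess.run avoids
shell quoting problems with file names that contain spaces.
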
> Basic formatting — paragraphs, bold, italics — works very well.
> Some more advanced formatting is more or less broken:
>
> -   Tables become ordinary text, not very well lined up.
> -   Nested lists are flattened.
> -   Small caps text disappears entirely! If you have access to the
> original LaTeX file I suggest putting this in your preamble:
>
>          \renewcommand\textsc[1]{\textbf{\textit{#1}}}
>
>      or if bold italics actually occur in your document this:
>
>          \usepackage{textcase}
>
>          \renewcommand\textsc[1]{\textbf{\textit{\MakeTextUppercase{#1}}}}
>
>      Ugly as hell, but sequences of uppercase bold italics are
> unlikely to actually occur in a document and are relatively easy
> to find and replace with something better in a “word processor” or
> in a text editor after conversion from DOCX to some sensible
> format with Pandoc.
>
>      If you post-edit in a “WP” you may try (x)color and something
> like \renewcommand\textsc[1]{\textcolor{red}{#1}} instead. That
> may be hard to find _with_ the “WP” but is relatively easy to find
> _in_ the “WP” for a human eye.
>
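
The find-and-replace for the uppercase bold italic workaround can be
done on the Markdown side. A minimal sketch in Python, assuming that
pandoc writes bold italics as ***...*** in its Markdown output (check
your own output first); the function name is hypothetical. The original
capitalisation is lost by \MakeTextUppercase, so the sketch lowercases
everything and initial capitals have to be restored by hand.

```python
import re

# Any bold-italic run on one line, e.g. ***SMALL CAPS***.
# Assumption: pandoc's Markdown output marks bold italics with ***...***.
BOLD_ITALIC = re.compile(r"\*\*\*([^*\n]+)\*\*\*")


def restore_small_caps(markdown):
    """Turn ***UPPERCASE*** runs into pandoc's [text]{.smallcaps} spans."""
    def to_span(match):
        text = match.group(1)
        # Leave the run alone unless it is entirely uppercase and
        # actually contains cased letters.
        if text != text.upper() or text == text.lower():
            return match.group(0)
        return "[" + text.lower() + "]{.smallcaps}"

    return BOLD_ITALIC.sub(to_span, markdown)


if __name__ == "__main__":
    print(restore_small_caps("See ***UNESCO*** and ***really*** now."))
```

Genuine bold italics in mixed or lower case are left untouched, which is
the point of forcing the stand-in text to uppercase.
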
> You may want to correct these things in the “word processor” but
> my definite preference is to convert the DOCX file to Pandoc’s
> extended Markdown with Pandoc, fix things up and then convert
> (back) to DOCX. You can then also apply your own custom named
> styles for things like color.
>
> http://pandoc.org/MANUAL.html#custom-styles
>
> http://pandoc.org/MANUAL.html#option--reference-doc
>
> It still says “For best results, do not make changes to this file
> other than modifying the styles used by pandoc” but that is just
> what you want to do if you are using custom styles, including
> adding your own! BTW you may want to avoid non-ASCII and
> non-alphanumeric characters in your custom style names so that you
> don’t need to quote your custom-style attribute values!
>
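
The remark about quoting can be made concrete. A minimal sketch in
Python that writes a Markdown span with a custom-style attribute and
quotes the value only when the style name is not plain ASCII
alphanumeric; the function name and the style names are hypothetical.
Always quoting would also be fine.

```python
import re


def custom_style_span(text, style_name):
    """Return a pandoc Markdown span carrying a custom-style attribute."""
    if '"' in style_name:
        raise ValueError("style name must not contain a double quote")
    if re.fullmatch(r"[A-Za-z0-9]+", style_name):
        value = style_name  # safe to leave unquoted
    else:
        value = '"' + style_name + '"'  # spaces etc. need quoting
    return "[" + text + "]{custom-style=" + value + "}"


if __name__ == "__main__":
    print(custom_style_span("red text", "Red"))
    print(custom_style_span("some text", "Small Caps"))
```
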
> Speaking of small caps it has its official Pandoc syntax: [small
> caps text]{.smallcaps}, but that is far too verbose by Markdown
> standards! ;-) I usually overload Pandoc’s generally useless
> strikeout syntax so that I can type ~~small caps text~~ with this
> Pandoc Lua filter:
>
>      function Strikeout (elem)
>          return pandoc.SmallCaps(elem.content)
>      end
>
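
For the way back to DOCX, the Lua filter and a reference document can
be passed on the pandoc command line. A minimal sketch in Python that
only builds the command; the file names smallcaps.lua and
custom-reference.docx are hypothetical, while --lua-filter and
--reference-doc are the pandoc options the manual describes.

```python
def pandoc_markdown_to_docx_cmd(md_path, docx_path,
                                lua_filter=None, reference_doc=None):
    """Build the pandoc command line for a Markdown -> DOCX conversion."""
    cmd = ["pandoc", "--from", "markdown", "--to", "docx"]
    if lua_filter is not None:
        # e.g. a file containing the Strikeout -> SmallCaps filter
        cmd += ["--lua-filter", str(lua_filter)]
    if reference_doc is not None:
        # a DOCX file whose styles, including custom ones, pandoc reuses
        cmd += ["--reference-doc", str(reference_doc)]
    cmd += ["--output", str(docx_path), str(md_path)]
    return cmd


if __name__ == "__main__":
    print(" ".join(pandoc_markdown_to_docx_cmd(
        "converted.md", "fixed.docx",
        lua_filter="smallcaps.lua",
        reference_doc="custom-reference.docx",
    )))
```
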
> I hope this is of use to someone!
>
> /bpj
>

Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz
-------------- next part --------------
A non-text attachment was scrubbed...
Name: original.png
Type: image/png
Size: 248583 bytes
Desc: not available
URL: <https://tug.org/pipermail/xetex/attachments/20190315/fefe745c/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: converted.png
Type: image/png
Size: 164810 bytes
Desc: not available
URL: <https://tug.org/pipermail/xetex/attachments/20190315/fefe745c/attachment-0003.png>

