XeLaTeX to Word/OpenOffice - the state of the art?

Janusz S. Bień jsbien at mimuw.edu.pl
Sat Mar 16 07:46:37 CET 2019

On Fri, Mar 15 2019 at 13:34 +01, BPJ wrote:
> Den 2019-03-15 kl. 08:31, skrev Janusz S. Bień:
>> On Fri, Mar 15 2019 at  7:19 +01, BPJ wrote:
>>> I use, despite myself, Google Docs to convert PDF to DOCX,
>> How???
>>> then Pandoc from DOCX to everything else. It works even with weird
>>> magazine layouts.
>> Best regards
>> Janusz
> This may be old news to some, but I can’t remember having seen it, so
> I make a post for the record.
> I just discovered that you can convert a PDF to Markdown (or any other
> format Pandoc supports) by uploading it to Google Drive, opening it in
> Google Docs and downloading it from there as DOCX, then converting the
> DOCX to Markdown with Pandoc. The result is quite good!
> The steps:
> 1.  Log into <drive.google.com> in a web browser.
> 2.  Select the menu [My Drive⏷] → [Upload files…] in the top bar.
>     More recently there is a “button” [+ New] in the top left
> corner. Click on it and select [File upload] in the menu which
> appears.
> 3.  At least on my system a file dialog opens. Browse to the PDF file;
> select it; click [Open].
> 4.  (If this doesn’t work try step 5.)
>     i.  The file appears in the “Quick access” field just below the
> top bar. You may need to refresh a couple of times.
>     ii. Right-click the file thumbnail; choose [Open with] → [Google
> Docs].
> 5.  If step 4 doesn’t work (the PDF file doesn’t appear in the quick
> access field):
>     i.  Start typing the PDF file name in the [Search Drive] box at
> the top.
>     ii. Click on the file in the menu which appears.
>     iii. The file opens in the Drive PDF viewer.
>     iv. At the top there is a menu [Open with Google Docs]. Click on
> it and select Google Docs.
>     Or look up the file in the file list and follow 4.ii. (Hard when
> there are lots of files in the list!)
> 6.  You should now find yourself in the Google Docs document view.
> 7.  In the [File] menu choose [Download as] → [Microsoft Word
> (.docx)].
> 8.  Save the DOCX file to disk and convert it with Pandoc the same as
> you would any DOCX file, or edit it with Word/LibreOffice/… if you are
> of that persuasion.
> Basic formatting — paragraphs, bold, italics — works very well. Some
> more advanced formatting is more or less broken:
> -   Tables become ordinary text, not very well lined up.
> -   Nested lists are flattened.
> -   Small caps text disappears entirely! If you have access to the
> original LaTeX file I suggest putting this in your preamble:
>         \renewcommand\textsc[1]{\textbf{\textit{#1}}}
>     or if bold italics actually occur in your document this:
>         \usepackage{textcase}
> \renewcommand\textsc[1]{\textbf{\textit{\MakeTextUppercase{#1}}}}
>     Uggly as hell but sequences of uppercase bold italics are unlikely
> to actually occur in a document and are relatively easy to find and
> replace with something better in a “word processor” or in a text
> editor after conversion from DOCX to some sensible format with Pandoc.
>     If you post-edit in a “WP” you may try (x)color and something like
> \renewcommand\textsc[1]{\textcolor{red}{#1}} instead. That may be hard
> to find _with_ the “WP” but is relatively easy to find _in_ the “WP”
> for a human eye.
> You may want to correct these things in the “word processor” but my
> definite preference is to convert the DOCX file to Pandoc’s extended
> Markdown with Pandoc, fix things up and then convert (back) to
> DOCX. You can then also apply your own custom named styles for things
> like color.
> http://pandoc.org/MANUAL.html#custom-styles
> http://pandoc.org/MANUAL.html#option--reference-doc
> It still says “For best results, do not make changes to this file
> other than modifying the styles used by pandoc” but that is just what
> you want to do if you are using custom styles, including adding your
> own! BTW you may want to avoid non-ASCII and non-alphanumeric
> characters in your custom style names so that you don’t need to quote
> your custom-style attribute values!
> Speaking of small caps it has its official Pandoc syntax: [small caps
> text]{.smallcaps}, but that is far too verbose by Markdown standards!
> ;-) I usually overload Pandoc’s generally useless strikeout syntax so
> that I can type ~~small caps text~~ with this Pandoc Lua filter:
>     function Strikeout (elem)
>         return pandoc.SmallCaps(elem.content)
>     end
> I hope this is of use to someone!
> /bpj

Janusz S. Bien
emeryt (emeritus)

More information about the XeTeX mailing list