LaTeX to Word: the basic issue

TeX, TeX4ht

It seems to me that there are two possible avenues for the challenge of converting LaTeX source to Microsoft Word—a challenge which humanists will have to take up whenever they collaborate with non-TeXnical co-authors and editors. One route is to parse the source and output Word or, more likely, a less closed intermediate markup format. Humanists are actually better off here than scientists, since they don’t use the major feature that TeX handles very differently from word processors—and far, far better: namely, mathematical equation typesetting. And assuming that Word is not going to be the format for final publication, there’s no need to cry (much) over the loss of typographical quality resulting from subjecting your paragraphs to Word’s justification and pagination “algorithms” rather than TeX’s. The goal is most likely just to share and edit content. For these purposes (from shared authorship to substantial editing to copy-editing), LaTeX markup for humanists can be pretty simple—indeed, practically equivalent to the text-markup subset of HTML itself: paragraph, header, section (division), span. For this very minimal version of LaTeX—entirely conceivable as a possibility for humanists—conversion to Word is a matter of getting a parser for the minimal LaTeX and outputting its equivalent in XML.
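
To make the “parser for the minimal LaTeX” idea concrete, here is a rough sketch in Python. It assumes a deliberately tiny subset (blank-line-separated paragraphs, `\section`, `\subsection`, `\emph`, `\textbf`) and emits generic XML whose element names (`doc`, `h1`, `p`, `em`, and so on) are my own placeholders, not any real Word or OpenDocument schema.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical mapping from a tiny LaTeX subset to generic XML element names.
INLINE = {"emph": "em", "textbf": "strong"}
HEADINGS = {"section": "h1", "subsection": "h2"}


def convert_inline(text):
    # Escape XML special characters, then rewrite \emph{...} and \textbf{...}.
    # Each pass only matches arguments without inner braces, so we repeat
    # until nothing changes; that handles nesting from the inside out.
    out = escape(text)
    while True:
        before = out
        for command, tag in INLINE.items():
            pattern = r"\\" + command + r"\{([^{}]*)\}"
            out = re.sub(pattern, rf"<{tag}>\1</{tag}>", out)
        if out == before:
            return out


def latex_to_xml(source):
    # Blocks are separated by blank lines, as paragraphs are in LaTeX.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", source) if b.strip()]
    parts = []
    for block in blocks:
        heading = re.fullmatch(r"\\(section|subsection)\{([^{}]*)\}", block)
        if heading:
            tag = HEADINGS[heading.group(1)]
            parts.append(f"<{tag}>{convert_inline(heading.group(2))}</{tag}>")
        else:
            # Collapse line breaks inside a paragraph into single spaces.
            text = " ".join(block.split())
            parts.append(f"<p>{convert_inline(text)}</p>")
    return "<doc>" + "".join(parts) + "</doc>"


if __name__ == "__main__":
    sample = "\\section{Intro}\n\nSome \\emph{very} plain\ntext.\n"
    print(latex_to_xml(sample))
```

Anything outside that subset (a `\cite`, a user-defined macro) simply passes through as literal text, which is exactly where this approach stops being enough.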

Numerous converters do this. The one I have nearest to hand is the astonishing pandoc, an all-purpose converter. Pandoc’s “native” format is a plain-text format called markdown, which pretty much corresponds to the minimal text markup I mentioned above. And pandoc writes lots of markup formats, including HTML and OpenDocument. I’m writing this post in markdown and using pandoc to output the HTML. If your LaTeX has a markdown equivalent, pandoc can very robustly produce an ODT. Then NeoOffice (or a similar office suite) can convert it to .doc format.
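
For the record, the pandoc step looks something like the following sketch. The `-f`, `-t`, and `-o` options are pandoc’s standard from/to/output flags; the file names and the two helper functions are hypothetical, and the script only runs pandoc if it finds it on your PATH.

```python
import shutil
import subprocess


def pandoc_command(tex_path, odt_path):
    # Read LaTeX, write OpenDocument text, save to the given output file.
    return ["pandoc", tex_path, "-f", "latex", "-t", "odt", "-o", odt_path]


def convert(tex_path, odt_path):
    # Returns False without doing anything if pandoc is not installed;
    # otherwise runs the conversion and raises if pandoc reports an error.
    if shutil.which("pandoc") is None:
        return False
    subprocess.run(pandoc_command(tex_path, odt_path), check=True)
    return True


if __name__ == "__main__":
    # "article.tex" and "article.odt" are placeholder file names.
    print(" ".join(pandoc_command("article.tex", "article.odt")))
```

The resulting ODT still has to be opened in an office suite and saved as .doc, as described above.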

As far as I can tell, this is also the approach taken by the python-based Word converters supplied for LyX. But I haven’t tried those.

There’s a problem, however. What if your LaTeX markup exceeds the capacities of markdown? After all, TeX itself is a Turing-complete programming language (sez Wikipedia), and markdown, lacking loops and conditionals, definitely isn’t. If your LaTeX uses any of LaTeX’s more robust algorithmic powers to generate your text, the magic of pandoc will not, at least in its present version, be powerful enough for you. (I’d love to be wrong about this, but I’m pretty sure this argument holds. Because pandoc is written in Haskell, I guess a robust TeX parser is in principle possible for pandoc, but that’s not really in the spirit of pandoc’s minimalism. Possibly some kind of compromise involving markdown with embedded Haskell fragments would be possible. Sounds painful.)

But humanists—when would they use such powers? Alas, they will if they want to use those lovely bibliography-generation capabilities. A lot of algorithmic work goes into lining up all those nice Chicago-style footnotes and short references and ibids.
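
A toy illustration of why this is algorithmic work rather than markup: the text of each note depends on what has already been cited. The sketch below is emphatically not biblatex’s logic, just a simplified full/short/ibid rule of my own, and the bibliography entries are obvious placeholders.

```python
def cite_notes(citations, full, short):
    # Decide the text of each footnote from the citation history:
    #   first mention of a source      -> full reference
    #   same source as the last note   -> "Ibid."
    #   source seen earlier, not last  -> short reference
    notes = []
    seen = set()
    previous = None
    for key in citations:
        if key == previous:
            notes.append("Ibid.")
        elif key in seen:
            notes.append(short[key])
        else:
            notes.append(full[key])
        seen.add(key)
        previous = key
    return notes


if __name__ == "__main__":
    # Placeholder entries, not real references.
    full = {"a": "A. Author, First Title (City: Press, 2000)",
            "b": "B. Writer, Second Title (City: Press, 2001)"}
    short = {"a": "Author, First Title", "b": "Writer, Second Title"}
    for note in cite_notes(["a", "a", "b", "a"], full, short):
        print(note)
```

The same `\cite{a}` in the source produces three different strings depending on where it falls, which a converter that maps markup to markup, one command at a time, cannot know.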

Now we come to the other avenue for TeX-to-Word conversion: reading the output rather than the input and converting that. On the plus side, all the algorithmic hard work will have already taken place, so all that nice generated text will be easy alphanumeric characters, spaces, and punctuation. On the minus side, TeX’s output is a DVI or a PDF, images of pages with much less semantic structure and lots and lots of non-semantic layout information. That’s the whole point of LaTeX! I guess you could use a PDF-to-Word converter, like the one embedded in Acrobat Pro; but the layout-not-semantics problems quickly spiral out of control (I’ve tried; the results with footnotes make you cry). The converted Word document may sort of look like the PDF you make with LaTeX, but it will be very hard to use it in collaborating on content.

Now I’ve often wondered whether there isn’t some intermediate stage in the LaTeX processing that would be more suitable for conversion into ODT (which is just XML markup inside a zip archive). After all, a package like biblatex doesn’t output DVI code when it generates a citation; it outputs TeX. Isn’t there some mid-processing version of my article (say) which consists of all the LaTeX I wrote, but with all the \cite commands replaced with their output? Without knowing biblatex’s internals, I think we can be pretty sure that it’s not so simple. Lots of programmatic magic happens in a \cite call, magic that depends on global knowledge of the TeX processing run (decisions about pagination, information about sections and chapters, counters, etc.).

So what’s left? tex4ht. The ingenious tactic used by this general-purpose TeX-to-markup converter is to piggyback on the TeX processing run, allowing TeX to do the work of output generation but annotating the result (a DVI file) with reminders of the original semantic structure. Then tex4ht reads the annotated TeX output back in and converts it to a new markup format; the package speaks XML and can output several flavors of HTML and, the key desideratum, OpenDocument XML. To produce the annotations it needs, tex4ht redefines basic TeX/LaTeX commands. Of course you can immediately see the challenge. This means:

  1. tex4ht needs to “annotate” every command your document uses.

  2. Which means tex4ht needs to redefine every command you use—or a generating subset of them (i.e. a set {f_1, f_2, f_3, …} such that every command you use can be expressed in terms of the f_i).

  3. And those redefinitions are supposed to behave just like the original commands, modulo output format: fancy TeX typesetting can be lost, but not any actual text or basic style information.
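
The “generating subset” condition in point 2 can be phrased as a small reachability check. The sketch below is a toy model of my own, not how tex4ht works internally: `definitions` maps each command to the commands its definition is built from, `annotated` is the set of commands that have been redefined, and all the command names in the example are made up for illustration.

```python
def covered(command, definitions, annotated, _visiting=None):
    # A command is covered if it has been redefined directly, or if it is
    # defined entirely in terms of covered commands. An un-annotated
    # primitive (no recorded definition) is not covered; neither is a
    # command whose definition loops back on itself.
    visiting = _visiting if _visiting is not None else set()
    if command in annotated:
        return True
    parts = definitions.get(command, [])
    if not parts or command in visiting:
        return False
    visiting.add(command)
    result = all(covered(p, definitions, annotated, visiting) for p in parts)
    visiting.discard(command)
    return result


def uncovered_commands(used, definitions, annotated):
    # The commands in a document that still need a workalike definition.
    return sorted(c for c in used if not covered(c, definitions, annotated))


if __name__ == "__main__":
    # Hypothetical command names and dependencies, for illustration only.
    definitions = {
        "chapter": ["section", "textbf"],
        "fancyquote": ["emph", "ornament"],
    }
    annotated = {"section", "textbf", "emph"}
    used = {"chapter", "fancyquote", "emph"}
    print(uncovered_commands(used, definitions, annotated))
```

In this toy, one un-annotated building block is enough to leave a whole command uncovered, which gives some feel for why every new package means more work for the converter.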

The result is that the maintainers of tex4ht are constantly playing catch-up with the entire TeX ecosystem, writing “.4ht” workalike packages to convert the commands offered by popular packages into working surrogates for tex4ht. The task has been even more challenging because tex4ht was the brainchild of one person, Eitan Gurari, who died suddenly in 2009; the current maintainers have had to plunge into his work in medias res.

So there you have it: the basic issue. As I blog on, I’ll discuss the little ways of coping with it. Amazingly, you really can.