A (bib)latex to html workflow, for some values of “work”

TeX, TeX4ht, kludgetastic

Nerdery: how I’m doing latex + biblatex → html, in order to make this syllabus page.

For the last couple of years I have always prepared my syllabuses in such a way that I could generate parallel web-ready and print-ready versions. Usually the source is markdown, and then, with a few tweaks here and there, I use pandoc and xelatex to produce html and PDF respectively. One can even go directly from syllabus markdown files to a website using jekyll. For example: last semester’s Twentieth Century Fiction I site and syllabus.pdf. Unfortunately a jekyll static site cannot supply a multi-user blog, and since I’ve moved to wordpress sites for my courses this term, the mouse has reinserted itself into my web-update workflow.

Anyway, that was all very well and not too hard to manage. This semester I decided the setup was not elaborate enough. One of the limitations of markdown as a source for a syllabus is that it doesn’t provide good enough citation generation. pandoc actually does support citation generation through citeproc-hs and CSL, but I don’t know how to make CSL do what I want, whereas I do know how to make biblatex do what I want. It struck me that, in conjunction with tex4ht’s support for biblatex, I ought to be able to have a syllabus written in a mix of latex+biblatex and markdown that could be produced for the web and for print. So all the assigned texts in my course went into a .bib file, and every line on my reading schedule is made with a \cite{} command. (Bonus: easy export from the .bib to the course Zotero bibliography. And interoperation was his name-o.)

The tricky part is generating HTML, with the further constraint that the HTML has to work when I paste it into wordpress. (See, mouse in workflow! Arrgh!) I decided on the following procedure:

  1. use tex4ht to produce html from latex
  2. use pandoc to transform html to markdown
  3. use pandoc to transform markdown to html again

tex4ht relies heavily on css when it produces html mimicking latex. The roundtrip through pandoc helps to reduce the html to a minimal set of tags, unencumbered by piles of <span class="foo">s.
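As a rough sketch, assuming htlatex and pandoc are on the PATH, the three steps could be wrapped in a shell function like the one below. The names run and roundtrip, the DRY_RUN switch, and the -clean.html output name are placeholders made up for illustration; the real setup also needs a config file and a biber run, as described further on.

```shell
# Sketch of the three-step roundtrip: latex -> html -> markdown -> html.
# Assumes htlatex (tex4ht) and pandoc are installed.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1 is the basename of the wrapper file (hypothetical argument),
# e.g. "syllabus-web" for syllabus-web.tex
roundtrip() {
    base=$1
    # 1. tex4ht: latex -> html
    run htlatex "$base.tex"
    # 2. pandoc: html -> markdown, which drops most of the css-driven markup
    run pandoc -f html -t markdown -o "$base.md" "$base.html"
    # 3. pandoc: markdown -> minimal html
    run pandoc -f markdown -t html -o "$base-clean.html" "$base.md"
}
```

Calling DRY_RUN=1 roundtrip syllabus-web prints the three commands without running anything, which is a cheap way to check the order of operations.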

Of course there are some subtleties. First, for tex4ht to handle biblatex successfully, you have to run a pdflatex-biber-pdflatex cycle on the latex source beforehand. Then you can run tex4ht (which runs latex three times itself!). Second, the rinse cycle with pandoc is too comprehensive: it devours even such basic formatting as \emph and \textbf, which tex4ht, as with all other font choices, renders with CSS classes by default. Thanks to a blog post on transforming to html5 by the wizardly maintainer of tex4ht, CV Radhakrishnan, though, I learned how to tweak tex4ht’s output enough to do what I wanted. The tweak requires a tex4ht configuration file, whose point is to ensure that \emph{...} gets rendered as <em>...</em> and \textbf{...} as <strong>...</strong>.

The tex4ht config file (syllabus.cfg in the Makefile further down) looks like this:

\Preamble{xhtml,NoFonts,ext=html,charset="utf-8"}
\begin{document}
\EndPreamble
\Configure{emph}{\Protect\HCode{<em>}}{\Protect\HCode{</em>}}
\Configure{textbf}{\Protect\HCode{<strong>}}{\Protect\HCode{</strong>}}

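The syllabus sources consist of a set of markdown files, plus one latex file with biblatex commands: schedule.tex. The markdown files are converted to latex by pandoc, and the results, together with schedule.tex, are then \input{...} in a minimal wrapper latex file, syllabus-web.tex: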

\documentclass[12pt]{article}

\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\setcounter{secnumdepth}{-2}
\pagestyle{empty}

\usepackage{csquotes}
\usepackage[notes,annotation,short,hyperref=false,backend=biber]{biblatex-chicago}
\bibliography{course.bib}

\usepackage[dvipsnames]{xcolor}
\usepackage[colorlinks=true,urlcolor=blue,citecolor=BlueViolet]{hyperref}
\usepackage{hanging}

\begin{document}

A printable PDF version of the course syllabus is available here: \href{http://www.rci.rutgers.edu/~ag978/arf/syllabus.pdf}{syllabus.pdf}.

\input{out/overview-web.tex}
\input{out/goals-web.tex}
\input{out/reqs-web.tex}
\input{schedule.tex}
\input{out/ack-web.tex}

\end{document}

(The actual printable PDF is generated by a similar but not identical wrapper xelatex file with more layout tweaks. As with Word generation from LaTeX, one has to work around tex4ht’s tendency to get upset with xelatex.)

One final subtlety. pandoc’s built-in latex translation uses the enumerate package to do all its lists. I couldn’t get tex4ht to deal with this correctly. But I wasn’t using any of the enumerate extensions to the basic latex environment, so I just cut the gordian knot and added a line to the Makefile to strip away the optional argument to \begin{enumerate}[1.] generated by pandoc for numbered lists. I’ll find a less kludgy solution…later.
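To see what that stripping does, here is a small standalone check of the same sed substitution used in the Makefile rule. The function name strip_enum_option and the sample lines are invented for the demonstration.

```shell
# Same substitution as the Makefile cleanup rule, wrapped in a function.
# The pattern only matches a two-character option such as [1.] or [a.].
strip_enum_option() {
    sed 's/enumerate}\[..\]/enumerate}/'
}

# Sample input imitating pandoc's latex for a numbered list
printf '%s\n' '\begin{enumerate}[1.]' '\item Read the syllabus' '\end{enumerate}' \
    | strip_enum_option
# prints:
# \begin{enumerate}
# \item Read the syllabus
# \end{enumerate}
```

Because the pattern is exactly two characters wide, a longer option such as [(a)] would pass through untouched, which is one more reason to call this a kludge.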

Finally, here are the relevant Makefile rules:

# Directory for pandoc-generated files
generated_dir := out

# The syllabus source markdowns
syllabus_mdfiles := overview.md goals.md reqs.md ack.md blogging.md
syllabus_genfiles := $(patsubst %.md, $(generated_dir)/%.tex, $(syllabus_mdfiles))

# pandoc turns .md into .tex
$(syllabus_genfiles): $(generated_dir)/%.tex: %.md
	mkdir -p $(generated_dir)
	pandoc -o $@ $<

# these targets are cleaned-up versions of pandoc's tex, for tex4ht
web_genfiles := $(patsubst %.tex, %-web.tex,$(syllabus_genfiles))

# cleanup just consists of dealing with \begin{enumerate}[1.]
$(web_genfiles): %-web.tex: %.tex
	sed 's/enumerate}\[..\]/enumerate}/' $< > $@

# for tex4ht to handle biblatex, need a biber run
# otherwise this very plain-looking pdf is unused
syllabus-web.pdf: syllabus-web.tex $(web_genfiles) schedule.tex course.bib
	pdflatex syllabus-web.tex
	biber syllabus-web
	pdflatex syllabus-web.tex

# tex4ht call. Note the use of the config file
syllabus-web.html: syllabus-web.pdf syllabus.cfg
	htlatex syllabus-web.tex syllabus.cfg " -cunihtf -utf8" "-cvalidate"

# rinse with pandoc again
syllabus-web.md: syllabus-web.html
	pandoc -o $@ $<

# Final html generation. Since I paste into wordpress,
# the generated html goes straight to the OS X clipboard.
# Also, I strip off one level of headers. Kludgetastic!
# (--smart is a pandoc 1.x flag; it was removed in pandoc 2.0,
# where markdown input is smart by default.)
.PHONY: syllabus-web
syllabus-web: syllabus-web.md syllabus-web.pdf
	sed 's/^#//' $< | pandoc -f markdown -t html --smart | pbcopy
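The header-stripping sed in that last rule is easy to try on its own. A minimal check, with the function name demote_headers and the sample headings made up for the purpose:

```shell
# Same substitution as the final Makefile rule: drop one leading '#'.
demote_headers() {
    sed 's/^#//'
}

printf '%s\n' '# Course' '## Schedule' 'Plain text' | demote_headers
# prints:
#  Course
# # Schedule
# Plain text
```

Note what happens to the first line: a top-level header loses its only # and turns into an ordinary paragraph (with a leading space), so this only demotes cleanly from the second level down.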

The result: this webpage for my grad seminar (and this pdf counterpart).

It’s clear to me that the “proper” way to do all these little tweaks would involve a fuller tex4ht configuration file that bypasses pandoc altogether, some scripting of the pandoc parser in Haskell, or a combination of both. I especially want to solve the problems that munged up the bibliography (something to do with definition-list handling produced a : at the start of every line). That sounds like good procrastifun for another day.

[Adding 4/21/2013: This post is my initial foray into pandoc scripting in Haskell.]