Pandoc is a document converter that handles many types of files.
Format conversion is a chore that often takes a lot of time. I recently wanted to convert a long MSWord document into a version of Markdown that I could use to update some software documentation.
Pandoc can convert many document types into many others. Here we’ll see how to convert into Markdown using a Docker version of the software, so that we don’t even need to install Pandoc. For more on Docker see Docker – Beginner for Biologists. I used the Docker image called pandoc/latex.
First, download the docker image:
docker pull pandoc/latex
Then we can create an alias, as suggested by the Docker page:
alias pandock='docker run --rm -v "$(pwd):/data" -u $(id -u):$(id -g) pandoc/latex'
so now we can simply use the alias name.
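Aliases are normally only expanded in interactive shells, so inside a script a shell function is a handy substitute. The following is a minimal sketch, assuming the same docker invocation as the alias; it reuses the name pandock and removes any existing alias of that name first so the definition does not clash.

```shell
# Remove the alias first, if it is defined, so the name can be reused for a function.
unalias pandock 2>/dev/null || true

# Same docker invocation as the alias; "$@" forwards every argument to pandoc.
pandock() {
  docker run --rm -v "$(pwd):/data" -u "$(id -u):$(id -g)" pandoc/latex "$@"
}
```

A quick check such as pandock --version should print the Pandoc version from inside the container if everything is set up.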
To convert a “generic” MSWord file into a Markdown file it would be as simple as:
pandock -s example.docx -t markdown -o example.md
However, my documents had a few images, and I wanted to convert to a version of Markdown that is “GitHub friendly” (called gfm). In addition, I wanted to make sure that no lines were hard-wrapped, so I added the option --wrap=none. The final command looked like this:
pandock -s my.docx --wrap=none -t gfm --extract-media=images -o my.md
In this process the MSWord file my.docx gets converted to the Markdown file my.md without wrapping lines at the default width of 72 characters, and at the same time the images are extracted into a directory called images.
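If there are several MSWord files to convert, the same command can be wrapped in a loop. This is only a sketch: the function name convert_all_docx and the per-document images_NAME directories are my own assumptions (separate directories are meant to keep images from different documents from overwriting one another), and it assumes pandock is available, either as the alias in an interactive shell or as an equivalent function in a script.

```shell
# Hypothetical helper: convert every .docx file in the current directory to GFM.
convert_all_docx() {
  for f in *.docx; do
    [ -e "$f" ] || continue        # skip the literal pattern when nothing matches
    base=${f%.docx}                # my.docx -> my
    # Same options as the single-file command, with one media directory per document
    pandock -s "$f" --wrap=none -t gfm --extract-media="images_$base" -o "$base.md"
  done
}
```

Running convert_all_docx in the directory that holds the documents should leave one .md file next to each .docx file.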
While the result was not perfect, it was very useful for creating the final documentation that is now here: htcondor_biochem_v1.5.5
These online examples were useful in crafting the final command:
The latter document shows the large scope of pandoc, with examples converting from:
– Markdown to HTML
– HTML to Markdown
– Word to Markdown
– Word to HTML
– Markdown to PDF
– Markdown to plain text
There are 39 more examples on the Pandoc demos page.
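As a rough illustration of three of the conversions listed above, here is a sketch that renders one Markdown file as HTML, PDF, and plain text. The function name md_to_all and the file names are hypothetical, and it assumes pandock is available as the alias or an equivalent function. The PDF step relies on the LaTeX installation bundled in the pandoc/latex image.

```shell
# Hypothetical helper: render one Markdown file in three output formats.
md_to_all() {
  src=$1
  base=${src%.md}                             # notes.md -> notes
  pandock -s "$src" -o "$base.html"           # Markdown to HTML (format inferred from .html)
  pandock "$src" -o "$base.pdf"               # Markdown to PDF (uses LaTeX inside the image)
  pandock "$src" -t plain -o "$base.txt"      # Markdown to plain text
}
```

For example, md_to_all notes.md would be expected to write notes.html, notes.pdf, and notes.txt next to the source file.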