Learn Markdown – Episode 6. Convert Word to Markdown with pandoc

Summary

Pandoc is a document converter that handles many file formats.

Converting example

Format conversion is a chore that often takes a lot of time. I recently wanted to convert a long MSWord document into a version of Markdown that I could use to update some software documentation.

Pandoc can convert many document types into many others. Here we’ll see how to convert to Markdown using a Docker version of the software, so that we don’t even need to install Pandoc. For more on Docker see Docker – Beginner for Biologists. I used the Docker image called pandoc/latex.

First, download the docker image: docker pull pandoc/latex

Then we can create an alias, as suggested by the Docker page:

alias pandock=\
'docker run --rm -v "$(pwd):/data" -u $(id -u):$(id -g) pandoc/latex'

Now we can simply use the alias name in place of the full docker command.
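If the conversion is meant to run from a script rather than an interactive shell, a shell function may be more convenient, since bash does not expand aliases in non-interactive shells by default. The following is a minimal sketch of a function with the same name and the same docker arguments as the alias; it is an alternative to the alias, not something the Docker page itself suggests.

```shell
# Function equivalent of the pandock alias.
# Unlike an alias, a function is also available inside shell scripts.
pandock() {
    # Mount the current directory as /data and run as the current user,
    # then pass every argument through to pandoc inside the container.
    docker run --rm \
        -v "$(pwd):/data" \
        -u "$(id -u):$(id -g)" \
        pandoc/latex "$@"
}
```

Once defined, it is called exactly like the alias, for example `pandock --version`.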

To convert a “generic” MSWord file into a Markdown file it would be as simple as:

pandock -s example.docx -t markdown -o example.md

However, my documents had a few images, and I wanted to convert to a Markdown version that is “GitHub friendly” (called gfm). In addition, I wanted to make sure that long lines were not wrapped, so I added the option --wrap=none. The final command looked like this:

pandock -s my.docx --wrap=none -t gfm --extract-media=images -o my.md

In this process the MSWord file my.docx gets converted to the Markdown file my.md without wrapping lines at pandoc's default width of 72 columns, and at the same time the images are extracted into a directory called images.
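When several Word files need the same treatment, the command can be wrapped in a small loop. The sketch below is only an illustration: the function name docx_to_gfm, the PANDOC variable and the per-document `_images` directory naming are assumptions of this example, not part of pandoc. PANDOC defaults to a locally installed pandoc and can be pointed at any wrapper function or script that accepts pandoc arguments.

```shell
# Convert each .docx file given as an argument to GitHub-flavored Markdown.
# PANDOC is a hypothetical variable naming the converter command
# (defaults to a locally installed pandoc).
docx_to_gfm() {
    for docx in "$@"; do
        base="${docx%.docx}"    # strip the extension: my.docx -> my
        # Same options as the single-file command, but with one media
        # directory per document so extracted images do not collide.
        "${PANDOC:-pandoc}" -s "$docx" --wrap=none -t gfm \
            --extract-media="${base}_images" -o "${base}.md" || return 1
    done
}
```

A call such as `docx_to_gfm *.docx` would then write one .md file and one image directory next to each Word file, stopping at the first conversion that fails.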

While it was not perfect, this was very useful for creating the final documentation that is now here: htcondor_biochem_v1.5.5

Other examples

These online examples were useful in crafting the final command:

Convert Docx To Markdown With Pandoc – [Archived]
Convert Word documents to Markdown, HTML or any other format – [Archived]

The latter document shows the broad scope of pandoc, with examples converting from:

– Markdown to HTML
– HTML to Markdown
– Word to Markdown
– Word to HTML
– Markdown to PDF
– Markdown to plain text
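As a rough sketch of what these conversions can look like on the command line, each one is written below as a small shell function. The function names, the run_pandoc helper and the PANDOC variable are hypothetical names made up for this example; the commands in the linked document may differ. Output file names are derived from the input name, and the PDF conversion assumes a PDF engine such as LaTeX is available (the pandoc/latex image includes one).

```shell
# PANDOC is a hypothetical variable naming the converter command
# (defaults to a locally installed pandoc).
run_pandoc() { "${PANDOC:-pandoc}" "$@"; }

# Markdown to HTML (-s produces a complete, standalone HTML page)
md_to_html()   { run_pandoc -s -f markdown -t html -o "${1%.md}.html" "$1"; }
# HTML to Markdown
html_to_md()   { run_pandoc -f html -t markdown -o "${1%.html}.md" "$1"; }
# Word to Markdown
docx_to_md()   { run_pandoc -f docx -t markdown -o "${1%.docx}.md" "$1"; }
# Word to HTML
docx_to_html() { run_pandoc -s -f docx -t html -o "${1%.docx}.html" "$1"; }
# Markdown to PDF (pandoc picks PDF output from the .pdf extension)
md_to_pdf()    { run_pandoc -f markdown -o "${1%.md}.pdf" "$1"; }
# Markdown to plain text
md_to_text()   { run_pandoc -f markdown -t plain -o "${1%.md}.txt" "$1"; }
```

For example, `md_to_html notes.md` would write notes.html in the same directory.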

There are 39 more examples on the Pandoc demos page.