Convert Confluence pages to Markdown
There may come a day when you want to stop using Confluence and convert all of its pages to plain Markdown text files.
Luckily this is achievable using the Confluence XML export function, some custom XML parsing code and the markdownify Python package.
Confluence stores data in an Atlassian-specific XHTML format that mixes HTML with custom tags (using the namespace prefix "ac"). I have created a fork of markdownify with adaptations for Confluence XML and wiki markup.
To use the code you will have to download a zip file of the repository and edit the __init__.py file in the markdownify directory.
You will need to update the configuration at the top of the __init__.py file and provide:
- The URL of a JIRA instance (can be ignored if there are no JIRA references in the Confluence documents)
- A mapping from keys to names for your Confluence spaces
- A mapping from user IDs (32 hex characters) to user names
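As a rough illustration of what those three settings might look like, here is a minimal sketch. The variable names `JIRA_URL`, `SPACE_NAMES` and `USER_NAMES`, the helper `user_name` and all values are hypothetical placeholders, not the names used in the fork, so check the top of __init__.py for the real ones.

```python
# Hypothetical names and values; the fork's __init__.py defines the real ones.
JIRA_URL = "https://jira.example.com"  # base URL used when linking JIRA issues

# Confluence space key -> human-readable space name
SPACE_NAMES = {
    "DEV": "Development",
    "OPS": "Operations",
}

# Confluence user ID (32 hex characters) -> user name
USER_NAMES = {
    "0123456789abcdef0123456789abcdef": "Jane Doe",
}


def user_name(user_id):
    """Look up a user name, falling back to the raw ID if it is unmapped."""
    return USER_NAMES.get(user_id, user_id)
```

Falling back to the raw ID keeps the conversion running when a user is missing from the mapping, and makes unmapped users easy to find in the output afterwards.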
Next you will need to:
- Export an XML archive of your Confluence pages, which is only possible for a single space at a time, unless you create a full Confluence site backup.
- Write code to parse through the XML tree to extract the XHTML for each page.
- Call your modified version of the markdownify package for each page, e.g.:
  md = markdownify.markdownify(xhtml, heading_style='ATX')
- Save the markdown text in a folder.
- Live your new Confluence-free life.
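The parsing, conversion and saving steps can be sketched roughly as follows. This is a minimal sketch, not the code I used. It assumes the export contains a file named entities.xml in which `<object class="Page">` elements hold the page title and `<object class="BodyContent">` elements hold the XHTML body together with a reference back to the page ID. Verify that layout against your own export. All function names are made up for illustration, and historical page versions are not filtered out, so a later entry with the same title overwrites an earlier one.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_pages(xml_source):
    """Return {page title: XHTML body} from a Confluence XML export.

    Assumed layout: Page objects carry a "title" property, and BodyContent
    objects carry a "body" property plus a "content" property whose <id>
    points back to the owning page.
    """
    root = ET.parse(xml_source).getroot()
    titles, bodies = {}, {}
    for obj in root.iter("object"):
        cls = obj.get("class")
        if cls == "Page":
            page_id = obj.findtext("id")
            title = obj.findtext("property[@name='title']")
            if page_id and title:
                titles[page_id] = title
        elif cls == "BodyContent":
            page_id = obj.findtext("property[@name='content']/id")
            body = obj.findtext("property[@name='body']")
            if page_id and body:
                bodies[page_id] = body
    # Keep only bodies that belong to a page with a known title.
    return {titles[pid]: xhtml for pid, xhtml in bodies.items() if pid in titles}


def safe_filename(title):
    """Replace characters that are awkward in file names with underscores."""
    return re.sub(r"[^\w\-. ]", "_", title).strip() + ".md"


def export_markdown(pages, out_dir, convert):
    """Convert each page with `convert` and write one .md file per page."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for title, xhtml in pages.items():
        (out / safe_filename(title)).write_text(convert(xhtml), encoding="utf-8")


def convert_export(xml_path, out_dir):
    """Run the whole pipeline using the modified markdownify package."""
    import markdownify  # the edited fork must be importable from here

    pages = extract_pages(xml_path)
    export_markdown(
        pages,
        out_dir,
        lambda xhtml: markdownify.markdownify(xhtml, heading_style="ATX"),
    )
```

Calling `convert_export("entities.xml", "markdown")` would then write one Markdown file per page into a folder named markdown. Passing the converter in as a function keeps the XML handling testable without the fork installed.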