2nd October 2023, 11 min read

Converting Journal Article from LaTeX to Markdown

Original post is here eklausmeier.goip.de/blog/2023/10-02-converting-journal-article-from-latex-to-markdown.

1. Problem statement. You have a scientific journal article in LaTeX format on arXiv but want it in Markdown format for a personal blog. In our case we take the article "A Parsec-Scale Galactic 3D Dust Map out to 1.25 kpc from the Sun" by Gordian Edenhofer et al. The original paper is here: https://arxiv.org/abs/2308.01295

If the article is in Markdown format, it can then be easily transformed into HTML. Having an article in Markdown format has a number of advantages over having the article in LaTeX format:

It is much easier to write Markdown than LaTeX
Reading HTML is easier than reading a PDF
The notion of a page, i.e., a paper-sized page, has little meaning in the world of smartphones, tablets, etc.

Of course, the math in the LaTeX document will be converted to MathJax.

2. Overview of the content of the scientific article. The article briefly describes the importance of dust:

Interstellar dust comprises only 1% of the interstellar medium by mass, but absorbs and re-radiates more than 30% of starlight at infrared wavelengths. As such, dust plays an outsized role in the evolution of galaxies, catalyzing the formation of molecular hydrogen, shielding complex molecules from the UV radiation field, coupling the magnetic field to interstellar gas, and regulating the overall heating and cooling of the interstellar medium.

Dust's ability to scatter and absorb starlight is precisely the reason why we can probe it in three spatial dimensions.

A novel $\cal O(n)$ method called Iterative Charted Refinement (ICR) was used to analyze the more than 122 billion of data from the Gaia mission.

The algorithm ran for 4 weeks using the SLURM workload manager.

We employ a new Python framework called NIFTy.re for deploying NIFTy models to GPUs. NIFTy.re is part of the NIFTy Python package and internally uses JAX to run models on the GPU. We are able to speed up the evaluation of the value and gradient of ... by two orders of magnitude by transitioning from CPUs to GPUs. Our reconstruction ran on a single NVIDIA A100 GPU with 80 GB of memory for about four weeks.

Needless to say, this 4 week run was only one of the very many runs to actually produce the final result.

The result is a 3D dust map

achieving an angular resolution of ${14'}$ ($N_\text{side}=256$). We sample the dust extinction in 516 distance bins spanning 69 pc to 1250 pc. We obtain a maximum distance resolution of 0.4pc at 69pc and a minimum distance resolution of 7pc at 1.25 kpc.

3. Solution. Initially a Pandoc approach was tried. Pandoc and all its dependencies on Arch Linux need more than half a GB (Gigabyte!) of space, just for the installation. Even after installation, the Pandoc approach failed.

Perl, the workhorse, had to do the job again. For the conversion I created two Perl scripts:

blogparsec: converts main.tex, i.e., the actual paper
blogbibtex: converts the BibTeX-formatted file literature.bib

Using those two scripts, creating the Markdown file goes like this:

blogparsec main.tex > 08-03-a-parsec-scale-galactic-3d-dust-map-out-to-1-25-kpc-from-the-sun.md
blogbibtex literature.bib >> 08-03-a-parsec-scale-galactic-3d-dust-map-out-to-1-25-kpc-from-the-sun.md

This file still needs some manual editing. One prominent case is moving the table of contents to the top, as it is appended at the end.
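
This manual step could probably be scripted as well. The following is only a sketch and not part of the two scripts: the helper name `move_toc` is made up, and it assumes that the table-of-contents lines look exactly like the ones blogparsec generates and that the frontmatter is delimited by two `---` lines.

```perl
#!/usr/bin/perl
# Hypothetical helper, not part of the original scripts:
# move the generated table of contents directly below the frontmatter.
use strict;
use warnings;

# Takes the chomped lines of the Markdown file, returns them with the TOC moved up.
sub move_toc {
    my @lines = @_;
    my (@toc, @rest);
    for (@lines) {
        # TOC entries look like "- [3. Title](#s3)" or "\t- [3.1 Title](#s3_1)"
        if (/^\t?- \[\d+\.\d* .+\]\(#(?:s\d+(?:_\d+)?|Literature)\)$/) {
            push @toc, $_;
        } else {
            push @rest, $_;
        }
    }
    # The frontmatter ends at the second line consisting only of "---"
    my ($dashes, $pos) = (0, 0);
    for my $i (0 .. $#rest) {
        if ($rest[$i] =~ /^---\s*$/ && ++$dashes == 2) { $pos = $i + 1; last; }
    }
    splice(@rest, $pos, 0, "", @toc, "");   # blank line, TOC, blank line
    return @rest;
}

# Usage: perl movetoc.pl article.md > article-with-toc-on-top.md
if (@ARGV) {
    chomp(my @lines = <>);
    print "$_\n" for move_toc(@lines);
}
```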

4. blogparsec script. Some notes on this Perl script. The input to this script is the actual LaTeX text with all the formulas etc.

First define some variables and use strict mode.

#!/bin/perl -W
# Convert paper in "Astronomy & Astrophysics" LaTeX format to something resembling Markdown
# Manual post-processing is still necessary but a lot easier

use strict;
my ($ignore,$sectionCnt,$subSectionCnt,$replaceAlgo,$replaceTable) = (1,0,0,0,0);
my (@sections) = ();

The frontmatter header is a simple here-document:

print <<'EOF';
---
date: "2023-08-03 14:00:00"
title: "A Parsec-Scale Galactic 3D Dust Map out to 1.25 kpc from the Sun"
description: "A 3D map of the spatial distribution of interstellar dust extinction out to a distance of 1.25 kpc from the Sun"
MathJax: true
categories: ["mathematics", "astronomy"]
tags: ["interstellar dust", "interstellar medium", "Milky Way", "Gaia", "Gaussian processes", "Bayesian inference"]
---

EOF

The main loop looks at each line in main.tex. After the loop the literature section is added, then all sections collected so far are printed.

while (<>) {
    $ignore = 0 if (/\\author\{Gordian~Edenhofer/);
    next if ($ignore);

    (...)

    print;

    print "\$\$\n" if (/(\\end\{equation\}|\\end\{align\})/);	# enclose with $$ #2
}


print "## Literature<a id=Literature></a>\n";
for (@sections) {
    print $_ . "\n";
}
++$sectionCnt;
print "- [$sectionCnt. Literature](#Literature)\n";

What follows is the part marked as (...) in the code above.

Here is the special case for processing the algorithms and the table in the paper: each algorithm is simply replaced by a screenshot from the original PDF, the table by a here-document:


    # In this particular case we replace the two algorithms with a corresponding screenshot
    if (/^\\begin\{algorithm/) {
        $replaceAlgo = 1;
        next;
    } elsif (/^\s+Pseudocode for ICR creating a GP/) {
        s/^(\s+)//;
        s/(\\left|right)\\/$1\\\\/g;	# probably MathJax bug
        $replaceAlgo = 0;
        print "![](*<?=\$rbase?>*/img/parsec_res/Algorithm1.webp)\n\n";
    } elsif (/^\s+Pseudocode for our expansion point variational/) {
        s/^(\s+)//;
        $replaceAlgo = 0;
        print "![](*<?=\$rbase?>*/img/parsec_res/Algorithm2.webp)\n\n";
    } elsif ($replaceAlgo == 1) { next; }

    if (/^\\begin\{table/) {
        $replaceTable = 1;
        next;
    } elsif (/^\\end\{table/) {
        $replaceTable = 0;
        print <<'EOF';

Parameters of the prior distributions.
The parameters $s$, $\mathrm{scl}$, and $\mathrm{off}$ fully determine $\rho$.
They are jointly chosen to a priori yield the kernel reconstructed in [Leike2020][].



 Name | Distribution | Mean | Standard Deviation | Degrees of Freedom
 -----|--------------|------|--------------------|--------------------
_s_   | Normal       | 0.0  | Kernel from [Leike2020][] | 786,432 &times; 772
scl   | Log-Normal   | 1.0  | 0.5                |  1
off   |  Normal      | $-6.91\left(\approx\ln10^{-3}\right)$ <br>prior median extinction <br>from [Leike2020][] | 1.0 | 1
      |              |      | Shape Parameter    | Scale Parameter  
$n_\sigma$ | Inverse Gamma | 3.0 | 4.0 | #Stars = 53,880,655

EOF
        next;
    } elsif ($replaceTable == 1) { next; }

The header with its authors and institutions needs some extra handling:

s/^\\(author|institute)\{/\n<p>\u$1s:<\/p>\n\n1. /;

s/\~/ /g;

# Authors, institutions, abstract, etc.
s/\(\\begin\{CJK\*.+?CJK\*\}\)//;
s/\\inst\{(.+?)\}/ \($1\)/g;
if (/^\s+\\and/) { print "1. "; next; }
s/^\{% (\w+) heading \(.+$/\n\n_\u$1._ /;
s/^\\abstract/## Abstract/;
s/^\\keywords\{/__Key words.__ /;

Many lines are simply no longer needed in Markdown and are therefore dropped:

# Lines to drop, not relevant
next if (/(^\\maketitle|^%\s+|^%In general|^\\date|^\\begin\{figure|^\\end\{figure|\s+\\centering|\s+\\begin\{split\}|\s+\\end\{split\}|^\s*\\label|^\\end\{acknowledgements\}|^\\FloatBarrier|^\\bibliograph|^\\end\{algorithm\}|^\\begin\{appendix|^\\end\{appendix\}|^\\end\{document\})/);

s/\s+%\s+[^%].+$//;	# Drop LaTeX comments
s/\\fnmsep.+$//;	# drop e-mail

Display math is enclosed in double dollars:

print "\$\$\n" if (/(\\begin\{equation\}|\\begin\{align\})/);	# enclose with $$ #1
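
To see the effect in isolation, here is a small sketch that combines this print with its counterpart at the end of the main loop. The function name `wrap_display_math` is made up for illustration; the two regular expressions are the ones from the script, only written more compactly.

```perl
use strict;
use warnings;

# Wrap equation/align environments in $$ ... $$, as the two print statements do.
sub wrap_display_math {
    my @out;
    for (@_) {
        push @out, '$$' if /\\begin\{(?:equation|align)\}/;   # opening $$ before \begin
        push @out, $_;
        push @out, '$$' if /\\end\{(?:equation|align)\}/;     # closing $$ after \end
    }
    return @out;
}

print "$_\n" for wrap_display_math('\begin{equation}', '  E = mc^2', '\end{equation}');
```

The environment itself is kept between the dollar signs, everything outside of it stays untouched.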

Images are replaced with the usual Markdown code ![]():

# images
s/\s+\\includegraphics.+res\/(\w+)\}/!\[Photo\]\(\*<\?=\$rbase\?>\*\/img\/parsec_res\/$1\.png)/;
s/\s+\\subcaptionbox\{(.+?)\}\{\%/\n__$1__\n/g;

Some LaTeX macros are not present in MathJax and therefore need to be replaced.

# MathJax doesn't know \nicefrac
s/\\nicefrac\{(.+?)\}\{(.+?)\}/\{$1\}\/\{$2\}/g;
s/\\coloneqq/:=/g;	# MathJax doesn't know \coloneqq + \argmin + \SI
s/\\argmin/\\mathop\{\\hbox\{arg min\}\}/g;
s/\\SI(|\[parse\-numbers=false\])\{(.+?)\}/$2/g;
s/\\SIrange\{(.+?)\}\{(.+?)\}\{(|\\)([^\\]+?)\}/$1 $4 to $2 $4/g;
s/\\nano\\meter/nm/g;
s/\{\\pc\}/pc/g;
s/\{\\kpc\}/kpc/g;
s/(kpc|pc)\$/\\\\,\\hbox\{$1\}\$/g;
s/\{\\cubic\\pc\}/\\\\,\\hbox\{pc\}^3/g;
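
A quick way to check such substitutions is to run them over a few sample strings. The sketch below copies a subset of the rules above into a function; the name `mathjaxify` and the sample inputs are made up for illustration and are not taken from the paper's source.

```perl
use strict;
use warnings;

# Apply a subset of the substitutions from the post to one line of LaTeX.
sub mathjaxify {
    my ($s) = @_;
    $s =~ s/\\nicefrac\{(.+?)\}\{(.+?)\}/\{$1\}\/\{$2\}/g;
    $s =~ s/\\coloneqq/:=/g;
    $s =~ s/\\argmin/\\mathop\{\\hbox\{arg min\}\}/g;
    $s =~ s/\\SI(|\[parse\-numbers=false\])\{(.+?)\}/$2/g;      # keeps the number, unit follows
    $s =~ s/\\SIrange\{(.+?)\}\{(.+?)\}\{(|\\)([^\\]+?)\}/$1 $4 to $2 $4/g;
    $s =~ s/\{\\pc\}/pc/g;
    $s =~ s/\{\\kpc\}/kpc/g;
    return $s;
}

print mathjaxify($_), "\n" for (
    '$\nicefrac{1}{2}$',               # prints ${1}/{2}$
    '$x \coloneqq \argmin_y f(y)$',    # prints $x := \mathop{\hbox{arg min}}_y f(y)$
    '\SI{69}{\pc}',                    # prints 69pc
    '\SIrange{69}{1250}{\pc}',         # prints 69 pc to 1250 pc
);
```

The order matters: the `\SI` rule only strips the macro and leaves the unit group behind, which the later `{\pc}` and `{\kpc}` rules then turn into plain text.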

What looks good in LaTeX does not necessarily look good in Markdown:

s/i\.e\.\\ /i.e., /g;

# Special cases
s/``([A-Za-z])/"$1/g;	# double backquotes in LaTeX have an entirely different meaning than in Markdown

More MathJax specialities:

# These are probably MathJax bugs, which we correct here
s/\$\\tilde\{Q\}_\{\\bar\{\\xi\}\}\$/\$\\tilde\{Q\}\\_\{\\bar\{\\xi\}\}\$/g;
s/\$\\mathcal\{D\}_/\$\\mathcal\{D\}\\_/g;
s/\$P\(d\|\\mathcal\{D\}_/\$P\(d\|\\mathcal\{D\}\\_/g;
s/\$\\mathrm\{sf\}_/\$\\mathrm\{sf\}\\_/g;

Various LaTeX text-macros:

s/\\url\{(.+?)\}/$1/g;	# Markdown automatically URL-ifies URLs, so we can dispense with \url{}

# Thousands separator, see https://stackoverflow.com/questions/33442240/perl-printf-to-use-commas-as-thousands-separator
s/\\num\[group-separator=\{,\}\]\{(\d+)\}/scalar reverse(join(",",unpack("(A3)*", reverse int($1))))/eg;

# Code
s/\\lstinline\|(.+?)\|/`$1`/g;
s/\\texttt\{(.+?)\}/`$1`/g;
s/quality\\_flags\$<\$8/quality_flags<8/g;	# special case

# Special cases for preventing code blocks because of indentation
s/   (The angular resolution)/$1/;
s/   (The stated highest r)/$1/;
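
The thousands-separator one-liner is the least obvious of these rules, so here it is pulled apart into a small sketch. The function names `commify` and `group_num_macro` are made up; the expression itself is the one from the script.

```perl
use strict;
use warnings;

# Insert commas as thousands separators into a non-negative integer.
sub commify {
    my ($n) = @_;
    # reverse the digits, cut into groups of three, join with commas, reverse back
    return scalar reverse(join(",", unpack("(A3)*", reverse int($n))));
}

# Replace \num[group-separator={,}]{...} by the grouped number.
sub group_num_macro {
    my ($s) = @_;
    $s =~ s/\\num\[group-separator=\{,\}\]\{(\d+)\}/commify($1)/eg;
    return $s;
}

print commify(53880655), "\n";                                          # prints 53,880,655
print group_num_macro('\num[group-separator={,}]{53880655} stars'), "\n";
```

Working on the reversed string is what makes the groups of three start at the right end of the number.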

Section and subsection headers become ## and ### in Markdown:

# sections + subsections
if (/\\section\{(.+?)\}\s*$/) {
    my $s = $1;
    ++$sectionCnt; $subSectionCnt = 0;
    push @sections, "- [$sectionCnt. $s](#s$sectionCnt)";
    $_ = "\n## $sectionCnt. $s<a id=s$sectionCnt></a>\n";
} elsif (/\\subsection\{(.+?)\}\s*$/) {
    my $s = $1;
    ++$subSectionCnt;
    push @sections, "\t- [$sectionCnt.$subSectionCnt $s](#s${sectionCnt}_$subSectionCnt)";
    $_ = "\n### $sectionCnt.$subSectionCnt $s<a id=s${sectionCnt}_$subSectionCnt></a>\n";
}
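
As a sketch of what this produces, the same logic can be wrapped into a function and fed a few headings. The function name `number_headings` and the three heading titles are invented for this example; the leading newlines of the original are left out to keep the output compact.

```perl
use strict;
use warnings;

# Number \section and \subsection headings and collect the TOC entries,
# following the logic of the snippet above.
sub number_headings {
    my @lines = @_;
    my ($sec, $subsec, @toc) = (0, 0);
    for (@lines) {
        if (/\\section\{(.+?)\}\s*$/) {
            my $s = $1;
            ++$sec; $subsec = 0;               # a new section restarts subsection numbering
            push @toc, "- [$sec. $s](#s$sec)";
            $_ = "## $sec. $s<a id=s$sec></a>";
        } elsif (/\\subsection\{(.+?)\}\s*$/) {
            my $s = $1;
            ++$subsec;
            push @toc, "\t- [$sec.$subsec $s](#s${sec}_$subsec)";
            $_ = "### $sec.$subsec $s<a id=s${sec}_$subsec></a>";
        }
    }
    return (\@lines, \@toc);
}

my ($md, $toc) = number_headings('\section{Introduction}', '\section{Methods}', '\subsection{Data}');
print "$_\n" for @$md, @$toc;
```

Each heading carries an anchor such as `s2_1`, and the collected entries link to exactly these anchors, which is why the table of contents can only be printed after the whole document has been read.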

For footnotes I used block quotes in Markdown.

if (/(\\footnotetext\{%|^\\begin\{acknowledgements\})/) { print "> "; next; }

I fought a little bit with citations and initially had something like:

# Citations
#s/\\citep(|\[.*?\]\[\])\{(\w+)\}/'('.(length($1)>4?substr($1,1,-3).' ':'').'['.join('], [',split(',',$2)).'][])'/eg;
# First approach, now obsolete through eval()-approach
#s/\\citep\{(\w+)\}/([$1][])/g;
#s/\\citep\{(\w+),(\w+)\}/([$1][], [$2][])/g;
#s/\\citep\{(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][], [$5][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][], [$5][], [$6][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][], [$5][], [$6][], [$7][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][], [$5][], [$6][], [$7][], [$8][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][], [$5][], [$6][], [$7][], [$8][], [$9][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][], [$5][], [$6][], [$7][], [$8][], [$9][], [$10][])/g;
#s/\\citet\{(\w+)\}/[$1][]/g;

Luckily this can be handled by evaluating Perl code in the replacement part of the substitution. Watch out for the s///eg: the e flag is important:

s!\\citep\{([,\w]+)\}!'(['.join('][], [',split(/,/,$1)).'][])'!eg;	# cite-parenthesis without any prefix text
s!\\citep\[(.+?)\]\[\]\{([,\w]+)\}!'('.$1.' ['.join('][], [',split(/,/,$2)).'][])'!eg;	# citep with prefix text
s!\\(citet|citeauthor)\{([,\w]+)\}!'['.join('][], [',split(/,/,$2)).'][]'!eg;	# we handle citet+citeauthor the same
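
Here is a minimal sketch of these three rules in action. The helper names `link_keys` and `convert_citations` are made up, the splitting is moved into a small function for readability, and in this sketch the prefix form also accepts several comma-separated keys. The citation keys are taken from the examples in this post.

```perl
use strict;
use warnings;

# Turn "Key1,Key2" into "[Key1][], [Key2][]" (Markdown reference links).
sub link_keys {
    my ($keys) = @_;
    return join(', ', map { '[' . $_ . '][]' } split(/,/, $keys));
}

# Convert \citep, \citep[prefix][], \citet and \citeauthor in one line of text.
sub convert_citations {
    my ($s) = @_;
    $s =~ s!\\citep\{([,\w]+)\}!'(' . link_keys($1) . ')'!eg;
    $s =~ s!\\citep\[(.+?)\]\[\]\{([,\w]+)\}!'(' . $1 . ' ' . link_keys($2) . ')'!eg;
    $s =~ s!\\(?:citet|citeauthor)\{([,\w]+)\}!link_keys($1)!eg;
    return $s;
}

print convert_citations($_), "\n" for (
    '\citep{Draine2011,Popescu2002}',    # prints ([Draine2011][], [Popescu2002][])
    '\citep[e.g.,][]{Leike2020}',        # prints (e.g., [Leike2020][])
    '\citet{Leike2020}',                 # prints [Leike2020][]
);
```

Without the e flag the replacement would be inserted as literal text instead of being executed, so the join and split would never run.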

During development of this Perl script I used Beyond Compare quite intensively, to compare the original against the changed file.

5. blogbibtex script. The input to this script is the BibTeX file with all literature references. The BibTeX file looks something like this:

@book{Draine2011,
  author  = {{Draine}, Bruce T.},
  title   = {{Physics of the Interstellar and Intergalactic Medium}},
  year    = 2011,
  adsurl  = {https://ui.adsabs.harvard.edu/abs/2011piim.book.....D},
  adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
@article{Popescu2002,
  author        = {{Popescu}, Cristina C. and {Tuffs}, Richard J.},
  title         = {{The percentage of stellar light re-radiated by dust in late-type Virgo Cluster galaxies}},
  journal       = {\mnras},
  keywords      = {galaxies: clusters: individual: Virgo Cluster, galaxies: fundamental parameters, galaxies: photometry, galaxies: spiral, galaxies: statistics, infrared: galaxies, Astrophysics},
  year          = 2002,
  month         = sep,
  volume        = {335},
  number        = {2},
  pages         = {L41-L44},
  doi           = {10.1046/j.1365-8711.2002.05881.x},
  archiveprefix = {arXiv},
  eprint        = {astro-ph/0208285},
  primaryclass  = {astro-ph},
  adsurl        = {https://ui.adsabs.harvard.edu/abs/2002MNRAS.335L..41P},
  adsnote       = {Provided by the SAO/NASA Astrophysics Data System}
}

The Perl script has some journal names preloaded:

#!/bin/perl -W
# Convert BibTeX to Markdown. Produce the following:
#    1. List of URL targets
#    2. Sorted list of literature entries

use strict;
my ($inArticle,$entry,$entryOrig,$type) = (0,"","","");
my %H;	# hash of hashes (each element in the hash is yet another hash)
my %Journals = (	# see http://cdsads.u-strasbg.fr/abs_doc/aas_macros.html
    '\aap'   => 'Astronomy & Astrophysics',
    '\aj'    => 'Astronomical Journal',
    '\apj'   => 'The Astrophysical Journal',
    '\apjl'  => 'Astrophysical Journal, Letters',
    '\apjs'  => 'Astrophysical Journal, Supplement',
    '\mnras' => 'Monthly Notices of the RAS',
    '\nat'   => 'Nature'
);

The actual loop populates the hash %H:

while (<>) {
    if (/^@(article|book|inproceedings|misc|software)\{(\w+),$/) {
        ($type,$entry,$entryOrig,$inArticle) = ($1,uc $2,$2,1);
        $H{$entry}{'entry'} = $entryOrig;
        $H{$entry}{'type'} = $type;
        #printf("\t\tentry = |%s|, type = |%s|\n",$entry,$type);
    } elsif ($inArticle) {
        if (/^}\s*$/) { $inArticle = 0; next; }
        if (/^\s+(\w+)\s*=\s*(.+)(|,)$/) {
            my ($key,$value) = ($1,$2);

            # LaTeX foreign language character handling
            $value =~ s/\{\\ss\}/ß/g;
            $value =~ s/\{\\"A\}/Ä/g;
            $value =~ s/\{\\"U\}/Ü/g;
            $value =~ s/\{\\"O\}/Ö/g;
            $value =~ s/\{\\"a\}/ä/g;
            $value =~ s/\{\\"u\}/ü/g;
            $value =~ s/\{\\"i\}/ï/g;
            $value =~ s/\{\\H\{o\}\}/ő/g;
            $value =~ s/\{\\"\\i\}/ï/g;
            $value =~ s/\{\\"o\}/ö/g;
            $value =~ s/\{\\'A\}/Á/g;	# accent aigu
            $value =~ s/\{\\'E\}/É/g;	# accent aigu
            $value =~ s/\{\\'O\}/Ó/g;	# accent aigu
            $value =~ s/\{\\'U\}/Ú/g;	# accent aigu
            $value =~ s/\{\\'a\}/á/g;	# accent aigu
            $value =~ s/\{\\'e\}/é/g;	# accent aigu
            $value =~ s/\{\\'o\}/ó/g;	# accent aigu
            $value =~ s/\{\\'u\}/ú/g;	# accent aigu
            $value =~ s/\{\\`a\}/à/g;	# accent grave
            $value =~ s/\{\\`e\}/è/g;	# accent grave
            $value =~ s/\{\\`u\}/ù/g;	# accent grave
            $value =~ s/\{\\\^a\}/â/g;	# accent circonflexe, the caret must be escaped or it acts as an anchor
            $value =~ s/\{\\\^e\}/ê/g;	# accent circonflexe
            $value =~ s/\{\\\^i\}/î/g;	# accent circonflexe
            $value =~ s/\{\\\^\\i\}/î/g;	# accent circonflexe
            $value =~ s/\{\\\^o\}/ô/g;	# accent circonflexe
            $value =~ s/\{\\\^u\}/û/g;	# accent circonflexe
            $value =~ s/\{\\~A\}/Ã/g;	# tilde
            $value =~ s/\{\\~a\}/ã/g;	# tilde
            $value =~ s/\{\\~O\}/Õ/g;	# tilde
            $value =~ s/\{\\~o\}/õ/g;	# tilde
            $value =~ s/\{\\~n\}/ñ/g;	# tilde (palatal n)
            $value =~ s/\{\\v\{C\}/Č/g;	# caron
            $value =~ s/\{\\v\{c\}/č/g;	# caron
            $value =~ s/\{\\v\{S\}/Š/g;	# caron
            $value =~ s/\{\\v\{s\}/š/g;	# caron
            $value =~ s/\{\\v\{Z\}/Ž/g;	# caron
            $value =~ s/\{\\v\{z\}/ž/g;	# caron
    
            $value =~ s/\{|\}|\~//g;	# drop {}~
            $value =~ s/,$//;	# drop last comma
            $H{$entry}{$key} = $value;
            #printf("\t\t\tentry = |%s|, key = |%s|\n", $entry, $key);
        }
    }
}
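
To make the resulting data structure visible, here is a reduced sketch of the same loop that works on a string instead of a file. The function names `delatex` and `parse_bibtex` are made up, only a handful of the accent rules are included, and the sample is a shortened version of the Draine2011 entry shown earlier.

```perl
use strict;
use warnings;
use utf8;                                   # the source contains non-ASCII literals
binmode(STDOUT, ':encoding(UTF-8)');

# Replace a few LaTeX accent macros by their Unicode characters.
# Note the escaped caret: an unescaped ^ in a regex is an anchor, not a literal.
sub delatex {
    my ($v) = @_;
    $v =~ s/\{\\"o\}/ö/g;
    $v =~ s/\{\\'e\}/é/g;
    $v =~ s/\{\\\^o\}/ô/g;
    $v =~ s/\{\\v\{c\}/č/g;
    $v =~ s/\{|\}|~//g;                     # drop remaining braces and ties
    return $v;
}

# Parse BibTeX text into a hash of hashes keyed by the upper-cased entry name.
sub parse_bibtex {
    my ($text) = @_;
    my (%H, $entry);
    for (split /\n/, $text) {
        if (/^\@(\w+)\{(\w+),$/) {
            my ($type, $name) = ($1, $2);
            $entry = uc $name;
            $H{$entry} = { entry => $name, type => $type };
        } elsif (defined $entry) {
            if (/^\}\s*$/) { undef $entry; next; }
            if (/^\s+(\w+)\s*=\s*(.+?),?$/) {
                my ($key, $value) = ($1, $2);
                $H{$entry}{$key} = delatex($value);
            }
        }
    }
    return \%H;
}

my $sample = <<'EOF';
@book{Draine2011,
  author  = {{Draine}, Bruce T.},
  title   = {{Physics of the Interstellar and Intergalactic Medium}},
  year    = 2011
}
EOF

my $bib = parse_bibtex($sample);
print "$bib->{DRAINE2011}{author} ($bib->{DRAINE2011}{year})\n";   # prints Draine, Bruce T. (2011)
```

Keying the hash by the upper-cased name gives a case-insensitive sort order later on, while the original spelling is kept in the `entry` field for the link targets.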

Once everything is loaded into the hash, the hash is printed out in formatted form.

print("\n");
for my $e (sort keys %H) {
    printf("[%s]: %s\n", $H{$e}{'entry'},
        exists($H{$e}{'doi'}) ? 'https://doi.org/'.$H{$e}{'doi'}
        : exists($H{$e}{'url'}) ? $H{$e}{'url'} : '#Literature');
}
print("\n");

for my $e (sort keys %H) {
    my ($He,$date,$journal) = (\$H{$e},"","");
    if (exists($$He->{'year'}) && exists($$He->{'month'}) && exists($$He->{'day'})) {
        $date = sprintf("%02d-%s-%d", $$He->{'day'}, $$He->{'month'}, $$He->{'year'});
    } elsif (exists($$He->{'year'}) && exists($$He->{'month'})) {
        my $m = $$He->{'month'};
        $date = "\u$m" . "-" . 	$$He->{'year'};
    } elsif (exists($$He->{'year'})) {
        $date = $$He->{'year'};
    }
    if (exists($$He->{'journal'})) {
        my $t = $$He->{'journal'};
        $journal = ", " . ((substr($t,0,1) eq '\\') ? ($Journals{$t} // $t) : $t);	# unknown journal macros are kept as they are
        $journal .= ", Vol. " . $$He->{'volume'} if (exists($$He->{'volume'}));
        $journal .= ", Nr. " . $$He->{'number'} if (exists($$He->{'number'}));
        $journal .= ", pp. " . $$He->{'pages'} if (exists($$He->{'pages'}));
    }

    printf("1. \\[%s\\] %s: _%s_, %s%s%s\n", $H{$e}{'entry'}, $H{$e}{'author'},
        defined($H{$e}{'title'}) ? $H{$e}{'title'} : $H{$e}{'howpublished'},
        $date, $journal,
        exists($H{$e}{'doi'}) ? ', https://doi.org/'.$H{$e}{'doi'}
        : exists($H{$e}{'url'}) ? ', ' . $H{$e}{'url'} : ''
    );
}
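
For a concrete picture of one output line, the sketch below formats the Popescu2002 entry from above. It is a simplified variant, not the script itself: the function name `format_entry` is made up, it takes a plain hash reference, handles only year and month, and leaves out the `howpublished` fallback for the title.

```perl
use strict;
use warnings;

# Only the journal macro needed for this example; the script's table has more.
my %Journals = ('\mnras' => 'Monthly Notices of the RAS');

# Format one parsed entry (a plain hash reference) as a numbered list item.
sub format_entry {
    my ($h) = @_;
    my $date = '';
    if (exists $h->{year} && exists $h->{month}) {
        $date = ucfirst($h->{month}) . '-' . $h->{year};
    } elsif (exists $h->{year}) {
        $date = $h->{year};
    }
    my $journal = '';
    if (exists $h->{journal}) {
        my $t = $h->{journal};
        $journal  = ', ' . ($Journals{$t} // $t);   # unknown macros stay as they are
        $journal .= ', Vol. ' . $h->{volume} if exists $h->{volume};
        $journal .= ', Nr. '  . $h->{number} if exists $h->{number};
        $journal .= ', pp. '  . $h->{pages}  if exists $h->{pages};
    }
    my $link = exists $h->{doi} ? ', https://doi.org/' . $h->{doi}
             : exists $h->{url} ? ', ' . $h->{url} : '';
    return sprintf("1. \\[%s\\] %s: _%s_, %s%s%s",
        $h->{entry}, $h->{author}, $h->{title}, $date, $journal, $link);
}

my %popescu = (
    entry => 'Popescu2002', author => 'Popescu, Cristina C. and Tuffs, Richard J.',
    title => 'The percentage of stellar light re-radiated by dust in late-type Virgo Cluster galaxies',
    journal => '\mnras', year => 2002, month => 'sep',
    volume => 335, number => 2, pages => 'L41-L44',
    doi => '10.1046/j.1365-8711.2002.05881.x',
);
print format_entry(\%popescu), "\n";
```

Every item starts with `1.` because Markdown renumbers ordered lists on its own, and the brackets around the key are escaped so that the key is shown as text and not turned into a link.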

The output of this blogbibtex script is then appended to the output of the previous script blogparsec.

6. Open issues. I had already worked for two days on these two Perl scripts and wanted to finish them. Therefore the following topics are not addressed but can be solved quite easily.

There are still some stray curly braces, which should be removed.
Back and forward references, i.e., all these still visible \Cref tags should be converted using link references in Markdown.
LaTeX tables were converted manually; this should be fully automatic.
Converting the \begin{algorithm} and \end{algorithm} probably is a lot trickier, as it needs extra CSS to work properly.