Citation parsing in plain language

Last reviewed on April 24, 2026

Pulling a references section out of a paper is the easy part. Splitting it into individual citations and labelling the pieces — author, year, title, venue, pages, DOI — is where the work happens. This guide walks through the steps an extractor takes, the styles it has to recognise, and the cases where even an experienced parser quietly gets it wrong.

Locating the references

The first step is to find where the references begin. In most papers a heading announces them: "References", "Bibliography", "Works cited", "Literature cited". Capitalisation varies, the heading sometimes lives at the top of a column rather than on its own line, and a small minority of papers omit the heading entirely and rely on a horizontal rule.

A robust parser scans for the heading from the bottom of the document upward, because a paper can use the word "references" in its body text long before the bibliography starts. Once it finds the heading, it treats everything after it as the references block, stopping at the first appendix if there is one and at the end of the document otherwise.
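The bottom-up scan can be sketched in a few lines of Python. This is a minimal illustration, not a complete locator: the function name find_references_block, the heading pattern, and the appendix pattern are all assumptions made for the example, and it does not handle papers that omit the heading.

```python
import re

# Heading variants named in the text; matching ignores capitalisation.
HEADING_RE = re.compile(
    r"^\s*(references|bibliography|works cited|literature cited)\s*:?\s*$",
    re.IGNORECASE,
)
# Hypothetical signal for the start of an appendix.
APPENDIX_RE = re.compile(r"^\s*appendi(x|ces)\b", re.IGNORECASE)


def find_references_block(lines):
    """Return (start, end) indices of the references block, or None.

    start is the first line after the heading; end is exclusive.
    """
    # Walk upward so the last heading in the document wins, not an
    # earlier line in the body that happens to say "References".
    for i in range(len(lines) - 1, -1, -1):
        if HEADING_RE.match(lines[i]):
            start = i + 1
            end = len(lines)
            # Stop the block at the first appendix heading, if any.
            for j in range(start, len(lines)):
                if APPENDIX_RE.match(lines[j]):
                    end = j
                    break
            return start, end
    return None


paper = [
    "Introduction",
    "See the references below.",
    "References",
    "[1] A. Author. First title. 2020.",
    "[2] B. Author. Second title. 2021.",
    "Appendix A",
    "Extra material",
]
block = find_references_block(paper)
```

Because the pattern requires the heading to stand on its own line, a heading that shares a line with column text would be missed; a real locator would need a looser match for that case.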

Splitting the block into entries

Citation styles cluster into a few visual patterns:

For the first three patterns, a heuristic parser typically does this: take the references block, split it into lines, look at each line for an "entry-start" signal (a leading number, or a name followed by a year, or an unindented first character) and collect the lines that follow as continuation until the next start signal. The result is a list of strings, one per citation.
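The split-and-collect loop described above might look like the following sketch. The three predicates are hypothetical stand-ins for the start signals (leading number, name followed by a year, unindented first character); their regular expressions are assumptions chosen for the example and will misfire on some real entries, for instance a continuation line that happens to begin with a number.

```python
import re

# Assumed shapes for entry-start signals; real styles vary more than this.
NUMBERED_RE = re.compile(r"^\s*(\[\d+\]|\d+[.)])\s+")
AUTHOR_YEAR_RE = re.compile(r"^[A-Z][^\d]{0,120}?\(?(19|20)\d{2}[a-z]?\)?")


def starts_numbered(line):
    """Leading number such as "[12] " or "12. "."""
    return bool(NUMBERED_RE.match(line))


def starts_author_year(line):
    """Capitalised name followed, before any other digit, by a year."""
    return bool(AUTHOR_YEAR_RE.match(line))


def starts_unindented(line):
    """Hanging-indent style: only the first line of an entry is flush left."""
    return bool(line) and not line[0].isspace()


def split_entries(block_lines, is_start):
    """Group lines into one string per citation using a start predicate."""
    entries = []
    current = []
    for line in block_lines:
        if not line.strip():
            continue  # blank lines carry no signal in this sketch
        if is_start(line):
            if current:
                entries.append(" ".join(current))
            current = [line.strip()]
        elif current:
            # Continuation of the entry opened by the last start signal.
            current.append(line.strip())
        # Lines before the first start signal are dropped.
    if current:
        entries.append(" ".join(current))
    return entries


numbered_block = [
    "[1] A. Smith. A long title that",
    "    wraps onto a second line. 2020.",
    "",
    "[2] B. Jones. Short title. 2021.",
]
numbered_entries = split_entries(numbered_block, starts_numbered)
```

Passing the predicate as an argument keeps the loop identical across styles; only the notion of "entry start" changes.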

Labelling the pieces

Inside each citation, the parser tries to attach types to substrings:

Why DOIs are worth chasing first

If a citation contains a DOI, the parser can stop guessing. A single lookup against Crossref typically returns a structured record with the title, author list, journal name, ISSN, year, and pages, as deposited by the publisher. When a DOI is present, it is the cheapest and most accurate way to upgrade a noisy citation into a clean reference.

It also means that, in practice, a useful parser does not have to be perfect at extracting titles or authors from the raw entry — it only has to be good enough to recognise an entry when it sees one and to find any DOIs inside it. The expensive labelling work moves to the metadata service that owns the DOI.
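Finding DOIs inside a raw entry is mostly a pattern-matching job. The sketch below assumes the common "10.", registrant code, slash, suffix shape and trims punctuation that usually belongs to the sentence rather than to the DOI. The names find_dois and crossref_url are made up for the example; the URL it builds follows the works route of the public Crossref REST API as generally documented, which is worth checking against the current documentation before relying on it. No network request is made here.

```python
import re
from urllib.parse import quote

# Assumed DOI shape: "10." + 4-9 digit registrant code + "/" + suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)


def find_dois(entry):
    """Return the DOIs found in one citation string, in order, de-duplicated."""
    found = []
    seen = set()
    for match in DOI_RE.finditer(entry):
        doi = match.group(0).rstrip(".,;:")
        # Drop a closing parenthesis that wraps the DOI rather than
        # belonging to it, e.g. "(doi:10.1000/abc)".
        while doi.endswith(")") and doi.count("(") < doi.count(")"):
            doi = doi[:-1]
        key = doi.lower()  # DOIs are case-insensitive
        if key not in seen:
            seen.add(key)
            found.append(doi)
    return found


def crossref_url(doi):
    """Build a lookup URL for a DOI (endpoint shape is an assumption)."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


entry = "Smith, A. (2020). Title. Journal 12(3), 45-67. https://doi.org/10.1000/xyz123."
dois = find_dois(entry)
```

Keeping extraction separate from the lookup means the parser can be tested offline, and the metadata call can be batched, cached, or rate-limited without touching the parsing code.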

Where parsers usually go wrong

What to do with the output

Treat the parsed list as a draft. For each entry, look at the confidence (does it have a DOI? does the title look like a title?) and let the high-confidence entries flow straight into your reference library. Keep the low-confidence ones in a holding area and either complete them by hand or discard them. A single wrong entry in a bibliography usually costs more than the time saved by automating an easy one.
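One way to express this triage in code is sketched below. Everything in it is an assumption made for illustration: entries are taken to be dictionaries with hypothetical "doi" and "title" keys, the looks_like_title check is deliberately crude, and the rule that an entry needs both signals to be accepted is a conservative choice rather than a requirement.

```python
def looks_like_title(title):
    """Crude, hypothetical plausibility check: a few words, mostly letters."""
    if not title:
        return False
    words = title.split()
    letters = sum(ch.isalpha() for ch in title)
    return len(words) >= 3 and letters / len(title) > 0.6


def triage(entries):
    """Split parsed entries into (accepted, holding) lists.

    An entry is accepted only if it has a DOI and a plausible title;
    everything else waits for manual review.
    """
    accepted = []
    holding = []
    for entry in entries:
        if entry.get("doi") and looks_like_title(entry.get("title")):
            accepted.append(entry)
        else:
            holding.append(entry)
    return accepted, holding


parsed = [
    {"doi": "10.1000/xyz123", "title": "A study of citation parsing"},
    {"doi": None, "title": "A study of citation parsing"},
    {"doi": "10.1000/abc", "title": "12(3), 45-67"},
]
accepted, holding = triage(parsed)
```

Requiring both signals errs toward sending entries to the holding area, which matches the view that a wrong entry is more expensive than a manual check.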