Citation parsing in plain language
Pulling a references section out of a paper is the easy part. Splitting it into individual citations and labelling the pieces (author, year, title, venue, pages, DOI) is where the work happens. This guide walks through the steps an extractor takes, the styles it has to recognise, and the cases where even a mature parser quietly gets it wrong.
Locating the references
The first step is to find where the references begin. In most papers a heading announces them: "References", "Bibliography", "Works cited", "Literature cited". Capitalisation varies, the heading sometimes lives at the top of a column rather than on its own line, and a small minority of papers omit the heading entirely and rely on a horizontal rule.
A robust parser scans for the heading from the bottom of the document upward, because a paper can use the word "references" in its body text long before the bibliography starts. Once it finds the heading, it treats everything from there to the end of the body — and before any appendices — as the references block.
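The bottom-up scan can be sketched as follows. This is a minimal illustration, not a complete detector: the function name, the appendix markers, and the assumption that the heading sits on its own line are all choices made for the example, so headings embedded in a column or replaced by a horizontal rule are not handled.

```python
import re

# Heading words taken from the list above; extend as needed.
HEADING_RE = re.compile(
    r"^\s*(?:\d+\.?\s+)?(references|bibliography|works cited|literature cited)\s*:?\s*$",
    re.IGNORECASE,
)
# Hypothetical end markers: a line that opens an appendix.
APPENDIX_RE = re.compile(r"^\s*(appendix|appendices|supplementary material)\b", re.IGNORECASE)


def find_references_block(lines):
    """Return the lines between the last references heading and any appendix."""
    start = None
    # Scan upward so that "references" in the body text is never matched first.
    for i in range(len(lines) - 1, -1, -1):
        if HEADING_RE.match(lines[i]):
            start = i + 1
            break
    if start is None:
        return []
    end = len(lines)
    for j in range(start, len(lines)):
        if APPENDIX_RE.match(lines[j]):
            end = j
            break
    return lines[start:end]
```

Because the heading pattern must match the whole line, a body sentence that merely mentions "references" is ignored even if it is the only candidate.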
Splitting the block into entries
Citation styles cluster into a few visual patterns:
- Numbered. Each entry begins with a number, sometimes in brackets: "[1]", "1.", "(1)". This is friendly for parsers — the numbers act as anchors.
- Author-year. Entries begin with one or more authors followed by a year. There is no anchor character; the boundary between entries is implied by the change from the previous entry's tail to a new author block.
- Hanging indent. Each entry's first line starts at the left margin and subsequent lines are indented. Visually this is the easiest pattern; for a parser it requires the original layout coordinates, not just the linearised text.
- Footnote-style. In law and the humanities, references appear at the bottom of each page rather than in a list. Pulling those into a single bibliography is a separate problem and usually deserves a tool of its own.
For the first three patterns, a heuristic parser typically does this: take the references block, split it into lines, look at each line for an "entry-start" signal (a leading number, or a name followed by a year, or an unindented first character) and collect the lines that follow as continuation until the next start signal. The result is a list of strings, one per citation.
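A rough sketch of that loop is shown here. The style names, the regular expressions, and the use of leading whitespace as a stand-in for real layout coordinates are assumptions made for illustration; the author-year signal in particular is only a heuristic and will misfire on some continuation lines.

```python
import re

# Entry-start signals described above. The patterns are illustrative, not exhaustive.
NUMBERED_RE = re.compile(r"^\s*(?:\[\d+\]|\(\d+\)|\d+\.)\s+")
# "Surname, I": a word, a comma, then another letter.
AUTHOR_RE = re.compile(r"^[^\W\d_][\w'\-]*,\s+[^\W\d_]")
YEAR_RE = re.compile(r"\b(?:18|19|20)\d{2}[a-z]?\b")


def is_entry_start(line, style):
    if style == "numbered":
        return bool(NUMBERED_RE.match(line))
    if style == "author_year":
        return bool(AUTHOR_RE.match(line)) and bool(YEAR_RE.search(line))
    if style == "hanging":
        # Assumes the extractor preserved indentation as leading whitespace.
        return bool(line) and not line[0].isspace()
    raise ValueError(f"unknown style: {style}")


def split_entries(block_lines, style):
    """Group lines into one string per citation."""
    entries, current = [], []
    for line in block_lines:
        if not line.strip():
            continue
        if is_entry_start(line, style) and current:
            entries.append(" ".join(current))
            current = []
        current.append(line.strip())
    if current:
        entries.append(" ".join(current))
    return entries
```

Passing the style in explicitly keeps the sketch simple; a real parser would more likely guess the style from the first few lines of the block.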
Labelling the pieces
Inside each citation, the parser tries to attach types to substrings:
- Year. Almost always a four-digit number from 1800 onwards. The first such number in the entry is usually the publication year; later four-digit numbers might be page numbers, volume numbers, or issue numbers.
- DOI. Starts with "10." followed by a registrant code, a slash, and a suffix, so it can be matched with a regular expression; when present, it is the most reliable handle the entry has. Sometimes rendered as a URL ("https://doi.org/10.1234/abc") and sometimes as a bare DOI; either should be normalised to the bare form.
- URL. Anything starting with http:// or https://. Worth keeping even when no DOI is present.
- Page range. Two numbers joined by a hyphen or en dash, near the end of the entry, usually after a comma.
- Authors. The text up to the first year. Splitting that into individual authors then depends on the punctuation used: commas, semicolons, "and", or "&".
- Title and venue. The hardest pieces to separate without style-specific knowledge. The title is normally the longest noun phrase between the year and the venue; the venue is whatever appears in italics in the original PDF (which the linearised text has lost).
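The simpler labels above can be sketched with a handful of regular expressions. Everything here is an assumption made for illustration: the function name, the DOI pattern (a common approximation, not a validator), and the choice to split authors only on semicolons, "and", and "&", since splitting on commas depends on the style. The "text up to the first year" rule suits author-year entries; in numbered styles that put the year last it would swallow the title. Title and venue are left out because they need style-specific knowledge.

```python
import re

YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")
# Approximate DOI shape: "10.", registrant code, "/", suffix.
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)")
URL_RE = re.compile(r"https?://[^\s\"<>]+")
PAGES_RE = re.compile(r"\b(\d+)\s*[-\u2013]\s*(\d+)\b")  # hyphen or en dash
NUMBER_PREFIX_RE = re.compile(r"^\s*(?:\[\d+\]|\(\d+\)|\d+\.)\s*")
AUTHOR_SPLIT_RE = re.compile(r"\s*;\s*|\s*,?\s+(?:and|&)\s+")


def label_entry(entry):
    labels = {"year": None, "doi": None, "urls": [], "pages": None, "authors": []}

    doi = DOI_RE.search(entry)
    if doi:
        # Drop sentence punctuation that trails the DOI; this also yields the bare form.
        labels["doi"] = doi.group(1).rstrip(".,;)")
    labels["urls"] = [u.rstrip(".,;)") for u in URL_RE.findall(entry)]

    # Mask URLs and DOIs so their digits are not read as years or pages.
    masked = DOI_RE.sub(" ", URL_RE.sub(" ", entry))

    year = YEAR_RE.search(masked)
    if year:
        labels["year"] = int(year.group(1))
        author_text = NUMBER_PREFIX_RE.sub("", masked[: year.start()]).rstrip(" (,[").strip()
        labels["authors"] = [
            a.strip(" ,") for a in AUTHOR_SPLIT_RE.split(author_text) if a.strip(" ,")
        ]

    ranges = list(PAGES_RE.finditer(masked))
    if ranges:
        # Page ranges tend to sit near the end, so take the last match.
        last = ranges[-1]
        labels["pages"] = (int(last.group(1)), int(last.group(2)))
    return labels
```

For an entry such as "Doe, J. and Roe, A. (2019). A title. Journal of Things, 4, 11-19. https://doi.org/10.1234/abc." the sketch keeps the URL and also reports the bare DOI "10.1234/abc".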
Why DOIs are worth chasing first
If a citation contains a DOI, the parser can stop guessing. For DOIs registered with Crossref, a single lookup returns a structured record with a clean title, author list, ISSN, journal name, year, and pages. The DOI is the cheapest and most accurate way to upgrade a noisy citation into a clean reference.
It also means that, in practice, a useful parser does not have to be perfect at extracting titles or authors from the raw entry — it only has to be good enough to recognise an entry when it sees one and to find any DOIs inside it. The expensive labelling work moves to the metadata service that owns the DOI.
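A lookup of this kind might look like the sketch below. It assumes the public Crossref REST endpoint at api.crossref.org/works/ and a JSON response whose "message" object carries the fields "title", "author", "container-title", "ISSN", "issued", and "page"; the function names and the User-Agent string are made up for the example, and the field names should be checked against the current Crossref documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi):
    # Percent-encode unusual characters but keep the slash inside the DOI.
    return CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")


def summarise(message):
    """Flatten the assumed 'message' object into a small dict."""
    def first(key):
        values = message.get(key) or []
        return values[0] if values else None

    authors = []
    for a in message.get("author") or []:
        name = " ".join(p for p in (a.get("given"), a.get("family")) if p)
        if name:
            authors.append(name)

    date_parts = (message.get("issued") or {}).get("date-parts") or [[None]]
    return {
        "title": first("title"),
        "authors": authors,
        "journal": first("container-title"),
        "issn": message.get("ISSN") or [],
        "year": date_parts[0][0] if date_parts[0] else None,
        "pages": message.get("page"),
    }


def fetch_crossref(doi, timeout=10):
    """Network call. Returns None if the lookup or decoding fails."""
    request = urllib.request.Request(
        crossref_url(doi), headers={"User-Agent": "citation-parser-sketch/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return summarise(json.load(response)["message"])
    except (OSError, ValueError, KeyError):
        return None
```

Keeping the flattening step separate from the network call means the mapping can be tested against a saved response, and a failed lookup simply leaves the entry with whatever the heuristic labels produced.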
Where parsers usually go wrong
- Author lists with "et al." hide the real number of authors. Parsers should keep the literal "et al." rather than guess.
- Two-column references linearise into a stream that mixes the right column of one entry with the left column of the next. Without column-aware extraction, the resulting entries are nonsense.
- Citation styles with periods inside names ("J.E. Doe") confuse splitters that use ". " as an entry boundary.
- References that span multiple pages can be split by page footers and headers; a parser should remove those before splitting.
- Non-Latin scripts in author names need careful handling so they are not silently dropped or transliterated.
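As one example of a fix, the header and footer problem can be reduced with a pass that runs before splitting. The sketch below assumes the extractor delivers the references as a list of pages, each a list of lines; the function name and the 60% threshold are arbitrary choices for illustration. It drops any line that recurs on most pages once digits are masked, which catches running titles and bare page numbers.

```python
import re
from collections import Counter


def strip_running_lines(pages, min_share=0.6):
    """Return all lines in order, minus those that repeat across pages."""
    def key(line):
        # Mask digits so "12" and "13" count as the same footer.
        return re.sub(r"\d+", "#", line.strip()).lower()

    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({key(line) for line in page if line.strip()})

    threshold = max(2, min_share * len(pages))
    cleaned = []
    for page in pages:
        cleaned.extend(
            line for line in page if line.strip() and counts[key(line)] < threshold
        )
    return cleaned
```

A stricter variant would only consider the first and last line of each page, which lowers the risk of deleting a genuine entry that happens to repeat.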
What to do with the output
Treat the parsed list as a draft. For each entry, look at the confidence — does it have a DOI? does the title look like a title? — and let the high-confidence entries flow straight into your reference library. Keep the low-confidence ones in a holding area and either complete them by hand or discard them. The cost of a single wrong entry in a bibliography is usually higher than the saving from automating the easy ones.
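The triage step could be sketched like this. The scoring weights, the threshold, the title check, and the entry layout (a dict with "doi", "title", and "year" keys) are all hypothetical; they only illustrate the idea of letting confident entries through and holding the rest.

```python
def looks_like_title(title):
    """Rough check: at least three words and mostly letters. Thresholds are arbitrary."""
    if not title:
        return False
    letters = sum(ch.isalpha() for ch in title)
    return len(title.split()) >= 3 and letters / len(title) >= 0.6


def confidence(entry):
    score = 0
    if entry.get("doi"):
        score += 2  # a DOI is the strongest signal
    if looks_like_title(entry.get("title")):
        score += 1
    if entry.get("year"):
        score += 1
    return score


def triage(entries, threshold=3):
    """Split entries into (accepted, held) by confidence score."""
    accepted = [e for e in entries if confidence(e) >= threshold]
    held = [e for e in entries if confidence(e) < threshold]
    return accepted, held
```

With the default threshold an entry needs a DOI plus at least one other signal to skip the holding area, which errs on the side of manual review.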