From extracted records to a reference list

Last reviewed on April 24, 2026

An extractor gives you a draft. Turning a draft into a reference list a thesis, article, or library catalogue can live with takes a few more steps. None of them is hard. Skipping them is what causes the inevitable "your bibliography has three entries for the same paper" moment two weeks before submission.

Normalise first, polish later

Before doing anything clever, put every extracted record into the same canonical shape. A reasonable target is something close to CSL-JSON: a JSON object per entry, with fields for type (article-journal, book, chapter, report), title, authors as an array of {family, given} pairs, issued date as a partial date, container title (journal or book title), volume, issue, page, publisher, URL, and DOI.
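As a rough illustration of that target shape, here is one record written as a Python dictionary using CSL-JSON field names. The entry is a made-up placeholder, not a real publication, and the exact set of fields you keep is a choice for your own pipeline.

```python
import json

# Hypothetical placeholder entry; every value below is invented for illustration.
record = {
    "type": "article-journal",
    "title": "An example article title",
    "author": [
        {"family": "Doe", "given": "Jane"},
        {"family": "Roe", "given": "Richard"},
    ],
    # Partial date: year and month are known, the day is not.
    "issued": {"date-parts": [[2021, 3]]},
    "container-title": "Journal of Examples",
    "volume": "12",
    "issue": "4",
    "page": "101-118",
    "publisher": "Example Press",
    "URL": "https://example.org/articles/101",
    "DOI": "10.1234/example.2021.101",
}

print(json.dumps(record, indent=2))
```

Keeping numeric-looking fields such as volume and page as strings avoids trouble with values like "12A" or "e1042".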

Normalisation also means cleaning up whitespace, stripping surrounding punctuation, decoding Unicode escapes, and making sure identifiers are in their bare form (DOIs without the https://doi.org/ prefix, for example). Doing this once at the top of the pipeline saves every subsequent step from doing it again.
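A minimal sketch of those clean-up steps might look like the following. The function names clean_text and bare_doi are hypothetical, and lowercasing the DOI is an assumption made here for easier comparison (DOIs are case-insensitive).

```python
import re
import unicodedata

# Matches the common ways a DOI gets wrapped: resolver URLs or a "doi:" label.
DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
# Matches literal \uXXXX escape sequences left behind by an extractor.
UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def clean_text(value: str) -> str:
    """Decode \\uXXXX escapes, collapse whitespace, strip surrounding punctuation."""
    value = UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)
    value = unicodedata.normalize("NFC", value)
    value = re.sub(r"\s+", " ", value)
    return value.strip(" .,;:\"'")


def bare_doi(value: str) -> str:
    """Return the DOI without resolver prefix or trailing punctuation, lowercased."""
    doi = DOI_PREFIX.sub("", value.strip())
    return doi.rstrip(".,;").lower()


print(clean_text('  "Caf\\u00e9   culture." '))
print(bare_doi("https://doi.org/10.1234/ABC.5."))
```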

Deduplicate

The same paper appears in different citations in different styles. Two records point to the same paper if:

Prefer DOI-based matching whenever a DOI is available on both sides. For the rest, a string-similarity metric over a normalised title (lowercased, punctuation removed, stop-words dropped) plus exact-match year gets most of the remaining duplicates without too many false positives.
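The matching rule described above can be sketched with the standard library's difflib. Several things here are assumptions for illustration: the 0.9 similarity threshold, the short stop-word list, the flat "year" field, and the helper names title_key, same_work, and deduplicate.

```python
import re
from difflib import SequenceMatcher

# A deliberately small stop-word list; extend it for your own corpus.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "on", "for", "to", "with"}


def title_key(title):
    """Lowercase, replace punctuation with spaces, drop stop-words."""
    words = re.sub(r"[^\w\s]", " ", title.lower()).split()
    return " ".join(w for w in words if w not in STOP_WORDS)


def same_work(a, b, threshold=0.9):
    """Compare by DOI when both sides have one, else by title and year."""
    doi_a, doi_b = a.get("DOI"), b.get("DOI")
    if doi_a and doi_b:
        return doi_a.lower() == doi_b.lower()
    if a.get("year") != b.get("year"):
        return False
    ratio = SequenceMatcher(None, title_key(a["title"]), title_key(b["title"])).ratio()
    return ratio >= threshold


def deduplicate(records):
    """Keep the first record of each group of matches, preserving order."""
    kept = []
    for rec in records:
        if not any(same_work(rec, k) for k in kept):
            kept.append(rec)
    return kept


records = [
    {"title": "The Structure of Example Networks", "year": 2019, "DOI": "10.1234/abc"},
    {"title": "Structure of example networks.", "year": 2019},
    {"title": "Structure of example networks", "year": 2020},
]
print(len(deduplicate(records)))
```

This pairwise comparison is quadratic in the number of records, which is usually fine for a single bibliography but would need blocking (for example by year) for a large catalogue.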

Enrich from identifier services

For every entry that has a DOI, ask Crossref (or DataCite, for data citations) for the canonical record. Replace the extractor's guesses with the service's structured values. The extractor still contributes the ordering and the "raw" entry, which is useful later when spot-checking.
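One way to sketch the DOI lookup is against the public Crossref REST API, which returns a JSON document whose "message" object holds the record. The helper names (crossref_url, fetch_crossref, from_crossref, enrich) and the "raw" key are hypothetical, and the field mapping covers only a handful of fields; the type is left alone because Crossref's type names differ from CSL's.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def crossref_url(doi):
    """Build the lookup URL; the DOI is percent-encoded, including its slash."""
    return "https://api.crossref.org/works/" + quote(doi, safe="")


def fetch_crossref(doi, timeout=10):
    """Network call: return the 'message' object for one DOI."""
    with urlopen(crossref_url(doi), timeout=timeout) as resp:
        return json.load(resp)["message"]


def from_crossref(message):
    """Map a Crossref message onto the canonical field names."""
    def first(key):
        # Crossref returns title and container-title as lists.
        values = message.get(key) or []
        return values[0] if values else None

    return {
        "title": first("title"),
        "container-title": first("container-title"),
        "author": [
            {"family": a.get("family"), "given": a.get("given")}
            for a in message.get("author", [])
        ],
        "issued": message.get("issued"),
        "volume": message.get("volume"),
        "issue": message.get("issue"),
        "page": message.get("page"),
        "publisher": message.get("publisher"),
    }


def enrich(extracted, message):
    """Overlay non-empty service values; keep everything else from the extractor."""
    enriched = dict(extracted)
    enriched.update({k: v for k, v in from_crossref(message).items() if v})
    return enriched
```

A typical call would be enrich(record, fetch_crossref(record["DOI"])). Because only non-empty service values are overlaid, fields the service lacks, and the extractor's raw entry, survive untouched.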

For entries with only an arXiv ID or PMID, do the same with the arXiv and PubMed APIs. Keep a note of where each field came from; when two sources disagree (Crossref versus a publisher's own page, say), it is easier to audit if provenance is recorded.
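Provenance tracking can be as simple as a second dictionary that mirrors the record. The sketch below assumes sources are passed in priority order and that the first non-empty value wins; the function name merge_with_provenance and the source labels are hypothetical.

```python
def merge_with_provenance(sources):
    """Merge field dictionaries given as (source_name, fields), highest priority first.

    Returns the merged record, a field -> source map, and the disagreements
    that were overridden, so they can be audited later.
    """
    merged, provenance, conflicts = {}, {}, {}
    for name, fields in sources:
        for key, value in fields.items():
            if value in (None, "", []):
                continue  # an empty value never wins and never conflicts
            if key not in merged:
                merged[key] = value
                provenance[key] = name
            elif merged[key] != value:
                conflicts.setdefault(key, []).append((name, value))
    return merged, provenance, conflicts


merged, provenance, conflicts = merge_with_provenance([
    ("crossref", {"title": "Example Title", "page": "101-118"}),
    ("extractor", {"title": "Exampel Title", "page": "101-118", "volume": "12"}),
])
print(provenance)
print(conflicts)
```

Recording the losing values, not just the winning source, is what makes a later audit quick: the conflicts map is the list of fields worth a second look.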

Handle the entries you could not enrich

A residual set will have no identifier at all. These are the ones that need human attention. Split them:

Export into a reference manager

Most reference managers — Zotero, Mendeley, EndNote, Papers, JabRef — accept BibTeX, RIS, or CSL-JSON. Choose whichever your workflow already uses. Tips:

Cross-check the final list

Before submitting a paper or closing a cataloguing session, run a few sanity checks on the output:
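Checks of this kind are easy to automate. The sketch below assumes records in the CSL-JSON-like shape described earlier; the particular checks shown (missing title, author, or date, and a DOI that appears more than once) are illustrative assumptions, and sanity_check is a hypothetical name.

```python
from collections import Counter


def sanity_check(records):
    """Return (index, problem) pairs for entries that deserve a second look."""
    problems = []
    # Count DOIs case-insensitively so repeated identifiers stand out.
    doi_counts = Counter(r["DOI"].lower() for r in records if r.get("DOI"))
    for index, rec in enumerate(records):
        for field in ("title", "author", "issued"):
            if not rec.get(field):
                problems.append((index, f"missing {field}"))
        doi = rec.get("DOI")
        if doi and doi_counts[doi.lower()] > 1:
            problems.append((index, "DOI appears more than once"))
    return problems


# Hypothetical placeholder records for demonstration only.
sample = [
    {"title": "Example Title", "author": [{"family": "Doe", "given": "Jane"}],
     "issued": {"date-parts": [[2021]]}, "DOI": "10.1234/abc"},
    {"title": "Example title", "author": [], "DOI": "10.1234/ABC"},
]
for index, problem in sanity_check(sample):
    print(index, problem)
```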

A final pass by a human — yourself, a co-author, a supervisor — is still cheaper than a correction notice later. The extractor is there to remove the drudgery, not the judgement.