From extracted records to a reference list
An extractor gives you a draft. Turning a draft into a reference list a thesis, article, or library catalogue can live with takes a few more steps. None of them is hard. Skipping them is what causes the inevitable "your bibliography has three entries for the same paper" moment two weeks before submission.
Normalise first, polish later
Before doing anything clever, put every extracted record into the same canonical shape. A reasonable target is something close to CSL-JSON: a JSON object per entry, with fields for type (article-journal, book, chapter, report), title, authors as an array of {family, given} pairs, issued date as a partial date, container title (journal or book title), volume, issue, page, publisher, URL, and DOI.
Normalisation also means cleaning up whitespace, stripping surrounding punctuation, decoding Unicode escapes, and making sure identifiers are in their bare form (DOIs without the https://doi.org/ prefix, for example). Doing this once at the top of the pipeline saves every subsequent step from doing it again.
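A minimal sketch of this normalisation step, in Python. The input field names (title, authors, year, doi) are assumptions about what an extractor might emit, and the output is only CSL-JSON-like, not a complete record. Lowercasing the DOI is a design choice here: DOIs are case-insensitive, so a single case makes later comparisons simpler.

```python
import re
import unicodedata

# Matches the common ways a DOI gets wrapped in a resolver URL or "doi:" label.
DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def squash(value: str) -> str:
    """Normalise Unicode composition and collapse runs of whitespace."""
    value = unicodedata.normalize("NFC", value)
    return re.sub(r"\s+", " ", value).strip()


def clean_title(value: str) -> str:
    """Squash whitespace, then strip surrounding punctuation."""
    return squash(value).strip(" .,;:")


def bare_doi(value: str) -> str:
    """Reduce a DOI to its bare form, without resolver prefix."""
    return DOI_PREFIX.sub("", value.strip()).lower()


def normalise(raw: dict) -> dict:
    """Map a hypothetical extractor record onto a CSL-JSON-like shape."""
    record = {
        "type": raw.get("type", "article-journal"),
        "title": clean_title(raw.get("title", "")),
        "author": [
            {"family": squash(a.get("family", "")), "given": squash(a.get("given", ""))}
            for a in raw.get("authors", [])
        ],
    }
    if raw.get("year"):
        # A partial date: only the year is known.
        record["issued"] = {"date-parts": [[int(raw["year"])]]}
    if raw.get("doi"):
        record["DOI"] = bare_doi(raw["doi"])
    return record
```

Calling normalise on every record at the top of the pipeline means the later steps can compare titles and DOIs directly.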
Deduplicate
The same paper often appears in several citations, each formatted in a different style. Two records point to the same paper if:
- they share a DOI, or
- they share an arXiv ID or PubMed ID, or
- their author lists (normalised), year, and title are close enough by a fuzzy match.
Prefer DOI-based matching whenever a DOI is available on both sides. For the rest, a string-similarity metric over a normalised title (lowercased, punctuation removed, stop-words dropped) plus exact-match year gets most of the remaining duplicates without too many false positives.
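The matching rule can be sketched as follows, using Python's standard difflib for the string similarity. The stop-word list, the 0.9 threshold, and the identifier field names (in particular arxiv_id) are assumptions for illustration. The sketch leaves out the author-list comparison for brevity.

```python
import re
from difflib import SequenceMatcher

STOP_WORDS = {"a", "an", "the", "of", "and", "in", "on", "for", "to", "with"}
ID_FIELDS = ("DOI", "PMID", "arxiv_id")  # field names are assumptions


def title_key(title: str) -> str:
    """Lowercase, replace punctuation with spaces, drop stop-words."""
    words = re.sub(r"[^\w\s]", " ", title.lower()).split()
    return " ".join(w for w in words if w not in STOP_WORDS)


def year_of(record: dict):
    """Return the year from a CSL-style partial date, or None."""
    try:
        return record["issued"]["date-parts"][0][0]
    except (KeyError, IndexError, TypeError):
        return None


def same_work(a: dict, b: dict, threshold: float = 0.9) -> bool:
    # An identifier present on both sides decides the question outright.
    for field in ID_FIELDS:
        if a.get(field) and b.get(field):
            return a[field] == b[field]
    # Otherwise require an exact year match plus a close title match.
    year_a, year_b = year_of(a), year_of(b)
    if year_a is None or year_a != year_b:
        return False
    key_a, key_b = title_key(a.get("title", "")), title_key(b.get("title", ""))
    if not key_a or not key_b:
        return False  # never merge on empty titles
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold


def deduplicate(records: list) -> list:
    """Keep the first record of each group of matching records."""
    kept = []
    for rec in records:
        if not any(same_work(rec, k) for k in kept):
            kept.append(rec)
    return kept
```

The pairwise loop is quadratic, which is fine for a single bibliography but would need blocking (by year, say) for a whole catalogue.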
Enrich from identifier services
For every entry that has a DOI, ask CrossRef (or DataCite, for data citations) for the canonical record. Replace the extractor's guesses with the service's structured values. The extractor still contributes the ordering and the "raw" entry, which is useful later when spot-checking.
For entries with only an arXiv ID or PMID, do the same with the arXiv and PubMed APIs. Keep a note of where each field came from; when two sources disagree (CrossRef versus a publisher's own page, say), it is easier to audit if provenance is recorded.
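One way to record provenance while enriching is sketched below. The lookup function is injected rather than hard-coded, so the sketch makes no network calls. In practice it would wrap a request to CrossRef, DataCite, arXiv, or PubMed and map the response onto the same field names as the extracted record. The _raw and _provenance keys are hypothetical names chosen for this example.

```python
from typing import Callable, Optional


def enrich(record: dict, lookup: Callable[[str], Optional[dict]], source: str) -> dict:
    """Overlay a service's values on an extracted record, noting where each field came from.

    `lookup` takes a bare DOI and returns a dict of canonical fields, or None
    if the service has no record. `source` is a label such as "crossref".
    """
    enriched = dict(record)
    provenance = {field: "extractor" for field in record}

    canonical = lookup(record["DOI"]) if record.get("DOI") else None
    for field, value in (canonical or {}).items():
        if value in (None, "", []):
            continue  # an empty service value should not erase an extracted one
        enriched[field] = value
        provenance[field] = source

    enriched["_raw"] = record  # the extractor's version, kept for spot-checking
    enriched["_provenance"] = provenance
    return enriched
```

Because the original record is kept under _raw, a reviewer can see exactly what the service overwrote.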
Handle the entries you could not enrich
A residual set will have no identifier at all. These are the ones that need human attention. Split them:
- High confidence. Clear author list, a plausible title, a year, maybe a venue. Keep; flag for review.
- Low confidence. Title looks like a header, author list is empty, year missing. Either discard or escalate to a hand-typed entry.
- Non-standard entries. Software, datasets, URLs, personal communications. Format according to the style guide the bibliography is using; most modern styles now have explicit types for software and datasets.
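The split can be roughed out in code before a human looks at anything. The heuristics below (a title of at least three words that is not all capitals) are assumptions, and so is the exact set of type names. Treat them as a starting point to tune against your own data.

```python
# Entry types routed to style-guide-specific formatting (names are assumptions).
NON_STANDARD_TYPES = {"software", "dataset", "webpage", "personal_communication"}


def triage(record: dict) -> str:
    """Sort an identifier-less record into 'non-standard', 'high', or 'low'."""
    if record.get("type") in NON_STANDARD_TYPES:
        return "non-standard"

    has_authors = bool(record.get("author"))
    has_year = bool((record.get("issued") or {}).get("date-parts"))
    title = record.get("title", "")
    # A one-word or all-capitals "title" is more likely a section header.
    plausible_title = len(title.split()) >= 3 and not title.isupper()

    if has_authors and has_year and plausible_title:
        return "high"  # keep, but flag for review
    return "low"  # discard, or escalate to a hand-typed entry
```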
Export into a reference manager
Most reference managers — Zotero, Mendeley, EndNote, Papers, JabRef — accept BibTeX, RIS, or CSL-JSON. Choose whichever your workflow already uses. Tips:
- Use stable citation keys that survive re-exports. A key based on firstAuthorLastName + year + a short slug of the title avoids collisions across projects.
- Preserve the DOI and URL fields through the export; many styles now print them.
- Check how the manager handles multi-word family names ("van der Waals"), suffixes ("Jr", "III"), and non-Latin scripts. Quirks in this area are common.
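As an illustration of the citation-key tip, here is a small deterministic key generator. The two-word slug length, the skipped-word list, and the "anon" and "nd" fallbacks are assumptions. Folding to ASCII is only one way to handle diacritics: it leaves family names in non-Latin scripts with the fallback, so those entries need a different scheme.

```python
import re
import unicodedata

SKIP_WORDS = {"a", "an", "the", "of", "on", "in", "and", "for", "to"}


def ascii_slug(text: str) -> str:
    """Fold diacritics to ASCII, lowercase, and keep only letters and digits."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())


def citation_key(record: dict, slug_words: int = 2) -> str:
    """Build firstAuthorLastName + year + short title slug."""
    authors = record.get("author") or [{}]
    family = ascii_slug(authors[0].get("family", "")) or "anon"
    try:
        year = str(record["issued"]["date-parts"][0][0])
    except (KeyError, IndexError, TypeError):
        year = "nd"  # no date
    words = [ascii_slug(w) for w in record.get("title", "").split()]
    words = [w for w in words if w and w not in SKIP_WORDS]
    return family + year + "".join(words[:slug_words])
```

Because the key depends only on the record's own fields, re-exporting the same record produces the same key.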
Cross-check the final list
Before submitting a paper or closing a cataloguing session, run a few sanity checks on the output:
- Every in-text citation resolves to an entry in the bibliography, and every entry is cited at least once.
- No duplicates survive.
- Year, title, and first author in each entry match what a DOI lookup returns.
- Non-English titles render correctly (no mojibake, no stripped diacritics).
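The first two checks are mechanical and can be sketched as simple set and counter operations. This assumes the in-text citation keys have already been collected into a list, and that each bibliography entry is a CSL-JSON-like dict with an id and an optional DOI.

```python
from collections import Counter


def cross_check(cited_keys: list, bibliography: list) -> dict:
    """Report citations without entries, entries without citations, and duplicates."""
    keys = [entry["id"] for entry in bibliography]
    dois = [entry["DOI"] for entry in bibliography if entry.get("DOI")]
    cited = set(cited_keys)
    return {
        "missing_entries": sorted(cited - set(keys)),
        "uncited_entries": sorted(set(keys) - cited),
        "duplicate_keys": sorted(k for k, n in Counter(keys).items() if n > 1),
        "duplicate_dois": sorted(d for d, n in Counter(dois).items() if n > 1),
    }
```

A report with four empty lists means these checks passed. Comparing each entry against a DOI lookup and checking how titles render still need their own pass.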
A final pass by a human — yourself, a co-author, a supervisor — is still cheaper than a correction notice later. The extractor is there to remove the drudgery, not the judgement.