A practical guide to PDF metadata extraction
"Metadata" sounds tidy. PDFs are not. A typical academic PDF wears at least three different layers of metadata at once, each written by a different actor with a different idea of what should appear there. Anyone who has tried to import a stack of PDFs into a reference manager has met the result: titles full of journal names, authors that include the dean of the faculty, and dates that belong to the file rather than the paper. This guide walks through what is actually in a PDF, how an extractor goes hunting for the useful bits, and why the answer is sometimes "ask a human".
Three layers of PDF metadata
The Document Information dictionary
Every PDF can carry a small dictionary of named fields — Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate. This is the layer most viewers display in their "File > Properties" panel. It is also the layer most likely to be wrong. The Title field is often the file name the author originally saved as ("paper-final-v3.pdf"). The Creator and Producer fields tell you which software wrote the file, not who wrote the paper. CreationDate is the date someone clicked Export.
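To see where those values live, here is a minimal sketch that pulls the Information dictionary out of raw PDF bytes. All the values below are made up, and the parsing is deliberately naive: real files may hex- or UTF-16-encode strings, compress the object, or encrypt it.

```python
import re

# A fragment of raw PDF syntax carrying an Information dictionary.
# Hypothetical values; real files are rarely this tidy.
raw = (
    b"1 0 obj\n"
    b"<< /Title (paper-final-v3.pdf)\n"
    b"   /Author (Jane Doe)\n"
    b"   /Producer (LibreOffice 7.4)\n"
    b"   /CreationDate (D:20240115093000Z) >>\n"
    b"endobj"
)

def read_info(raw: bytes) -> dict:
    """Naively collect /Key (literal string) pairs from raw PDF bytes."""
    return {
        key.decode(): value.decode("latin-1")
        for key, value in re.findall(rb"/(\w+)\s*\(([^)]*)\)", raw)
    }

info = read_info(raw)
print(info["Title"])  # prints paper-final-v3.pdf: the exported file name, not the paper's title
```

Note how the Title field carries exactly the kind of wrong-but-plausible value described above.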
XMP metadata
The Extensible Metadata Platform is an XML packet, normally embedded as a stream inside the PDF, that uses Dublin Core and other vocabularies to describe the document. Publishers populate XMP with structured fields — title, contributors, DOI, copyright statement, journal, volume, issue. When XMP is present and well-formed it is the most reliable source of bibliographic metadata. It is also the layer that authors writing from a Word template most often leave empty.
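When the packet is present, plain XML tooling is enough to read it. A sketch with a trimmed, hypothetical XMP packet; dc:title sits inside an rdf:Alt language alternative, which is why the path below is nested:

```python
import xml.etree.ElementTree as ET

# A trimmed XMP packet of the kind publishers embed (values are invented).
xmp = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/"
                     xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">A Study of Things</rdf:li></rdf:Alt></dc:title>
      <prism:doi>10.1234/example.5678</prism:doi>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "prism": "http://prismstandard.org/namespaces/basic/2.0/",
}

root = ET.fromstring(xmp)
title = root.find(".//dc:title/rdf:Alt/rdf:li", NS).text
doi = root.find(".//prism:doi", NS).text
```

A real extractor first has to locate and decompress the metadata stream inside the PDF before it can parse the XML like this.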
Visible content
The third layer is the text on page one. The "real" title and author list almost always appear there in some recognisable typographic shape — biggest text on the page, centred, followed by a row of names and affiliations, then an abstract. Heuristic extractors lean heavily on this layer because it is the one authors actually take care to get right.
How a heuristic extractor reads a paper
Open a PDF and the first thing a parser sees is a graph of objects: page trees, content streams, fonts, images, annotations. PDF.js (and similar libraries) will turn each page into a stream of text fragments, each with a position, a font size, and a font name. From that, an extractor tries to reconstruct the human-readable text of the page.
A very simple title heuristic looks like this: consider only the document's first page; group the text fragments into lines using their vertical positions; sort the lines by font size; pick the largest non-trivial line that appears above any line containing the word "abstract". That gets the title right surprisingly often. It also picks up plenty of running headers, journal mastheads, and conference names when those happen to be set in larger type.
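In code, those steps might look like the sketch below. The fragment tuples are invented; a real extractor would get them from PDF.js-style text items.

```python
def guess_title(fragments):
    """fragments: (text, y, size) tuples from page one; larger y = higher on the page."""
    # group fragments into visual lines by rounded vertical position
    lines = {}
    for text, y, size in fragments:
        lines.setdefault(round(y), []).append((text, size))
    merged = [(" ".join(t for t, _ in parts), max(s for _, s in parts), y)
              for y, parts in lines.items()]
    # only lines above the abstract are title candidates
    abstract_ys = [y for text, _, y in merged if "abstract" in text.lower()]
    cutoff = max(abstract_ys) if abstract_ys else float("-inf")
    candidates = [(size, text) for text, size, y in merged
                  if y > cutoff and len(text) > 3]
    return max(candidates)[1] if candidates else None

fragments = [
    ("JOURNAL OF EXAMPLES", 780, 9),   # masthead: higher up, but small font
    ("A Study of Things", 720, 18),    # the actual title
    ("Jane Doe", 660, 11),
    ("Abstract", 620, 11),
    ("We present a study.", 600, 10),
]
```

`guess_title(fragments)` returns the 18-point line here; a masthead set in larger type than the title would win instead, which is exactly the failure mode described above.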
Author detection usually keys on the band of text between the title and the abstract. Within that band, a parser looks for runs of capitalised words, sometimes paired with email addresses or superscript affiliations. The result is a candidate list — and the same heuristic that finds authors will quite happily catch institution names from the corresponding-author footnote.
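A toy version of that band scan, using a capitalised-word regex; the sample band text is invented, and real extractors add checks for nearby emails and affiliation markers:

```python
import re

# runs of two to four capitalised words; deliberately permissive
NAME = re.compile(r"\b(?:[A-Z][a-z]+ ){1,3}[A-Z][a-z]+\b")

band = "Jane Doe 1, John Smith 2\n1 Example University, 2 Other Institute"
candidates = NAME.findall(band)
print(candidates)  # ['Jane Doe', 'John Smith', 'Example University', 'Other Institute']
```

The institute names match too, which is precisely the false-positive class mentioned above; filtering them out is where the real work lives.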
Identifiers are the most reliable thing on the page. A DOI matches a stable regular expression (10.\d{4,9}/...); an arXiv ID looks like 2401.12345 or the older cs.CL/0401001; PubMed IDs and ISBNs each have their own predictable shape. An extractor that finds any of these on the first two pages can often skip everything else and resolve the bibliographic record directly through CrossRef, arXiv, or PubMed.
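The patterns below are illustrative rather than exhaustive: DOIs allow more suffix characters than people expect, and trailing punctuation usually needs trimming in practice.

```python
import re

# DOI: registrant prefix, slash, then any non-whitespace suffix
DOI = re.compile(r"\b10\.\d{4,9}/\S+")
# arXiv: new-style YYMM.NNNNN (optionally versioned) or legacy archive/NNNNNNN
ARXIV = re.compile(r"\b(?:\d{4}\.\d{4,5}(?:v\d+)?|[a-z-]+(?:\.[A-Z]{2})?/\d{7})\b")

page = "Preprint, arXiv:2401.12345v2, doi:10.1234/abc-def"
print(DOI.search(page).group())    # 10.1234/abc-def
print(ARXIV.search(page).group())  # 2401.12345v2
```

Either hit is enough to resolve the full record from the matching registry instead of trusting the page layout.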
Why the same paper produces different output in different tools
If you run the same PDF through three extraction tools you will get three different records. The differences come from a small number of design choices:
- Where the tool looks first. Some extractors trust XMP unconditionally; others ignore it because they know how often it is empty or wrong.
- How aggressively it merges lines. A title broken across two visual lines needs to be glued back together; do it too eagerly and you swallow the journal title underneath it.
- What it considers an author. Some tools accept any "Firstname Lastname" pattern; others insist on seeing an affiliation marker or an email nearby.
- Whether it normalises identifiers. A DOI rendered as 10.1234/abc-def at the end of one line and continued onto the next is the same DOI; an extractor that does not stitch lines together will lose it.
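A sketch of that stitching fallback; the line contents are invented, and joining lines without a separator is only right when the break falls mid-identifier:

```python
import re

DOI = re.compile(r"\b10\.\d{4,9}/\S+")

lines = [
    "... available at https://doi.org/10.1234/",
    "abc-def under a CC-BY licence.",
]

def find_doi(lines):
    # try each line on its own first, then the joined text as a fallback;
    # a real stitcher would check the break actually splits an identifier
    joined = "".join(line.rstrip() for line in lines)
    for text in lines + [joined]:
        m = DOI.search(text)
        if m:
            return m.group().rstrip(".,;")
    return None
```

Neither line matches on its own (the first ends right after the slash), but the joined text recovers the full DOI.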
Failure modes that are worth knowing
A few common situations defeat almost every heuristic extractor unless it has special handling:
- Two-column layouts are read as a single linearised string by default. The text "Abstract — We present" turns into a sentence that races down both columns at once. A column-aware reader is needed; PDF.js exposes positions but does not segment columns for you.
- Scanned PDFs have no text layer at all, only images. They need an OCR pass before any of this is possible. See the scanned PDFs guide.
- Conference proceedings exported as a single file contain dozens of papers concatenated together. Most extractors only look at the first page and stop, returning the metadata of the first paper for the whole volume.
- Slide decks almost never have a paper-shaped title block, and their authors are often on a slide somewhere in the middle. A bibliographic extractor will return an unhelpful result.
- Form-style PDFs — application forms, government documents, invoices — break every assumption made above. They should be processed by tools written for forms, not for papers.
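For the two-column case, a minimal column-aware reordering looks like the sketch below. It assumes the gutter position is already known; real readers estimate it from the x-coordinate histogram of the page, and the fragment tuples here are invented.

```python
def read_in_columns(fragments, gutter_x):
    """fragments: (text, x, y) tuples; larger y = higher on the page."""
    left = [f for f in fragments if f[1] < gutter_x]
    right = [f for f in fragments if f[1] >= gutter_x]
    top_down = lambda f: -f[2]  # top of the page first
    ordered = sorted(left, key=top_down) + sorted(right, key=top_down)
    return " ".join(text for text, _, _ in ordered)

fragments = [
    ("Abstract. We present", 50, 700), ("a heuristic extractor.", 50, 688),
    ("2. Method", 320, 700), ("We begin by parsing.", 320, 688),
]
```

A naive top-to-bottom sort would interleave "Abstract. We present" with "2. Method"; reading the left column to the bottom first keeps each sentence intact.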
What a good workflow looks like
The most useful pattern, in practice, is: extract; spot-check; resolve the identifiers you trust; correct the rest by hand. If a paper has a DOI, the cleanest record always comes from the metadata service that owns that DOI (CrossRef for the majority of academic publishers). The extractor's job is to find the DOI quickly and to provide a sensible fallback record for the papers that do not have one.
For larger collections, treat the extracted records as a first pass. Sort them by confidence — for example, "has a DOI" is high confidence, "has neither a DOI nor a recognisable author block" is low — and put the low-confidence ones in front of a human before they enter your reference library.
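That triage can be as simple as a tiering function over the extracted record; the field names here are hypothetical, and the tiers mirror the rule of thumb above:

```python
def confidence(record):
    if record.get("doi"):
        return "high"    # resolvable against CrossRef directly
    if record.get("title") and record.get("authors"):
        return "medium"
    return "low"         # send to a human before it enters the library

batch = [
    {"title": "A Study of Things", "doi": "10.1234/abc-def"},
    {"title": "Untitled", "authors": ["Jane Doe"]},
    {"title": ""},
]
needs_review = [r for r in batch if confidence(r) == "low"]
```

Only the low tier goes in front of a human; the high tier can be resolved automatically through the identifier it carries.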
Trying it out
The homepage extractor implements the heuristics described here. Drop a PDF on it and inspect the JSON output to see exactly what was found and what was guessed. If something looks wrong, the JSON includes the underlying PDF Information dictionary so you can see where the wrong value came from.