Guides

Last reviewed on April 24, 2026

The guides below explain how PDF metadata is structured, how a heuristic extractor goes about pulling it out, and how to make the resulting records useful in a real workflow. They are written for people who handle a lot of academic and technical PDFs and want to understand what the tools are actually doing — and where they fall short.

Start here

A practical guide to PDF metadata extraction — what counts as metadata in a PDF, where it lives, and how to pull it out cleanly.
Citation parsing in plain language — how a tool decides where one reference ends and the next begins.
DOIs, arXiv IDs, and friends — the persistent identifiers that make a citation actually resolvable.

Beyond the basics

Working with scanned PDFs and OCR — what to do when the text layer is missing or unreliable.
Why process PDFs in the browser — a tour of the architectural choices behind a privacy-first extractor.
From extracted records to a reference list — turning raw output into something a paper, thesis, or library catalogue can use.

Reference material

For format-specific notes — BibTeX, RIS, CSL-JSON, Markdown, DOCX — see the formats reference.