Guides
The guides below explain how PDF metadata is structured, how a heuristic extractor pulls it out, and how to make the resulting records useful in a real workflow. They are written for people who handle many academic and technical PDFs and want to understand what the tools are actually doing, and where they fall short.
Start here
- A practical guide to PDF metadata extraction — what counts as metadata in a PDF, where it lives, and how to pull it out cleanly.
- Citation parsing in plain language — how a tool decides where one reference ends and the next begins.
- DOIs, arXiv IDs, and friends — the persistent identifiers that make a citation actually resolvable.
Beyond the basics
- Working with scanned PDFs and OCR — what to do when the text layer is missing or unreliable.
- Why process PDFs in the browser — a tour of the architectural choices behind a privacy-first extractor.
- From extracted records to a reference list — turning raw output into something a paper, thesis, or library catalogue can use.
Reference material
For format-specific notes — BibTeX, RIS, CSL-JSON, Markdown, DOCX — see the formats reference.