Working with scanned PDFs and OCR
The browser extractor on this site reads the text layer of a PDF. If a document has no text layer — because it was produced by photographing or scanning paper pages — the extractor will return an empty result no matter how clean the file looks on screen. This guide explains how to tell the difference, what optical character recognition adds, and what level of accuracy is realistic for academic material.
Born-digital versus scanned
A born-digital PDF is one written by software that knew the text it was placing: LaTeX, Word, InDesign, a browser printing to PDF. The file contains the text as a sequence of glyphs with positions and font metrics. Selecting a passage in a viewer and copying it works.
A scanned PDF is a series of images wrapped in PDF containers. Each page is a picture; there is no text underneath the picture for a copy command to grab. Selection either does nothing or selects the rectangle of the image. PDFs from older library archives, government records, and many books digitised before the 2010s fall into this category.
A third case sits between the two: a "searchable" scanned PDF, where someone has run OCR over the images and stored the recognised text as an invisible layer behind each page image. These look like scans but behave like born-digital files for the purpose of extraction.
How to tell which one you have
- Try to select a sentence. If you can highlight individual words and copy them as text, the file has a text layer.
- Look at the file size. A 200-page born-digital document of pure text typically weighs a few hundred kilobytes; a 200-page scan can run to dozens of megabytes.
- Zoom in. Born-digital text stays sharp at any zoom; scanned text becomes pixelated.
- Open the file's Document Properties. The "Producer" field for a scan often names a scanner driver or OCR engine, not a typesetting tool.
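The checks above can be roughly automated. The sketch below is a stdlib-only heuristic, not a PDF parser: it scans raw bytes for the text-showing operators (`BT`/`Tj`) and image XObject markers that uncompressed content streams expose. A real check should use a proper PDF library, and compressed streams will defeat this entirely; treat it as a first-pass triage only.

```python
def guess_pdf_kind(raw: bytes) -> str:
    """Rough heuristic: classify raw PDF bytes by what the content streams expose.

    Only works when streams are uncompressed; a compressed file will come back
    'unknown' even if it has a perfectly good text layer.
    """
    has_text = b"BT" in raw and (b"Tj" in raw or b"TJ" in raw)
    has_image = b"/Subtype /Image" in raw or b"/Subtype/Image" in raw
    if has_text:
        return "text-layer"   # born-digital, or a searchable scan
    if has_image:
        return "image-only"   # pure scan: needs OCR before extraction
    return "unknown"          # compression hides the operators
```
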
What OCR does, and what it does not
OCR is an image-to-text problem: take a picture of a page and produce the underlying characters. Modern engines combine image preprocessing (deskewing, denoising, binarisation), layout analysis (detecting columns, paragraphs, headings), and a recognition model to output text together with bounding boxes for each word.
It does not, on its own, give you bibliographic structure. The output of OCR is a stream of characters; turning that into a title, an author list, and a references section is the same heuristic problem as any other PDF, only with more noise. Treat OCR as a preprocessing step that promotes a scanned PDF into a born-digital one, then run the same metadata extraction on top.
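To see why structure recovery stays heuristic even after OCR, consider a naive references-section splitter. Everything here is illustrative: the heading pattern deliberately tolerates one common OCR confusion ('c' read as 'e'), and real extraction needs layout cues and far more noise tolerance than a single regex.

```python
import re

# Illustrative pattern: matches "References" and OCR near-misses like
# "Referenees" where 'c' was misread as 'e'.
_HEADING = re.compile(r"^\s*refere?n[ce]+s?\s*$", re.IGNORECASE)

def split_references(text: str) -> list[str]:
    """Naive heuristic: return the non-empty lines after a 'References'
    heading, or an empty list if no heading is found."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if _HEADING.match(line):
            return [l for l in lines[i + 1:] if l.strip()]
    return []
```
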
Tools commonly used
- Tesseract. The long-standing open-source OCR engine. Good baseline accuracy on clean modern scans, with language packs for many scripts. Used by ocrmypdf to add a text layer to scanned PDFs.
- ocrmypdf. A command-line wrapper around Tesseract that takes a scanned PDF and produces a new PDF with the same images plus an embedded text layer. The result can then be processed by the homepage extractor.
- Cloud OCR services. Google Cloud Vision, AWS Textract, Azure Document Intelligence. Often higher accuracy on tricky layouts, especially tables and mixed-script pages, at the cost of sending the file to a third party.
- End-to-end document AI models. Newer transformer-based models that go from page image to structured output in one step. They can produce excellent results on typical academic papers but require more compute and are still maturing.
Accuracy you can expect
For a clean, modern, single-column scan of English text at 300 dpi, character-level accuracy is high — usually well above 99% — and the resulting bibliographic extraction is close to what you would get from a born-digital file. Drop the scan resolution, switch to a multi-column journal layout, add mathematics, formulae, handwritten annotations, or microfilm artefacts, and accuracy falls off quickly. A 19th-century book in a non-Latin script will fight back even with a tuned model.
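Figures like "above 99%" are usually stated as character error rate (CER): the edit distance between the OCR output and a hand-checked reference transcription, divided by the reference length. A minimal stdlib sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_error_rate(reference: str, ocr_output: str) -> float:
    """CER = edit distance / reference length; 0.0 is a perfect read."""
    if not reference:
        return 0.0
    return levenshtein(reference, ocr_output) / len(reference)
```

A CER of 0.01 (99% character accuracy) still means roughly one wrong character per line of a typical reference entry, which is why the references section suffers first.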
The implication for a citation workflow: OCR's first pass gets you a usable text body but not always a usable references section. Hand-checking high-value entries is, again, the realistic plan.
Privacy considerations
The browser extractor on this site is intentionally local. OCR engines are usually not. Cloud OCR services upload the file to a remote server and process it there. If the document is confidential, the cleanest pattern is to run OCR locally — for example with ocrmypdf on a workstation — and only then load the output into a browser tool.
For routine published material that is already public, the privacy concern is smaller, but the operational concern remains: cloud OCR is metered, and for a large back catalogue the bills add up.
What to do when OCR is not enough
If the document is short and important, transcribe the references by hand and look each one up in CrossRef or the relevant database. If the document is one of many in a project, build a triage step that filters out files which are unlikely to OCR cleanly (very old, very low resolution, very mixed layout) and routes them to a human or a more expensive pipeline. The labour saved by skipping a small amount of manual entry is rarely worth the cost of bad metadata in a research workflow.
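Such a triage step can be a simple rule filter. The field names and cutoffs below are illustrative only — in practice you would derive them from metadata your own pipeline already records.

```python
def needs_human_review(doc: dict) -> bool:
    """Triage sketch: flag documents unlikely to OCR cleanly.
    Field names and thresholds are illustrative, not prescriptive."""
    if doc.get("dpi", 300) < 200:      # low-resolution scans degrade fast
        return True
    if doc.get("year", 2000) < 1950:   # old typefaces and paper damage
        return True
    if doc.get("columns", 1) > 2:      # complex layouts confuse analysis
        return True
    return False

def triage(docs: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into (auto_ocr_queue, human_queue)."""
    auto = [d for d in docs if not needs_human_review(d)]
    manual = [d for d in docs if needs_human_review(d)]
    return auto, manual
```
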