GROBID Tools

Extract PDF Metadata & Citations in Your Browser

🔒 100% Private - Your data never leaves your computer
📄

Drop PDF Here or Click to Upload

Maximum file size: 50MB

What Gets Extracted Automatically

🔍

Bibliographic Metadata

  • Document title
  • Author names
  • Publication year
  • Abstract text
  • Keywords
  • Page count
🔗

Identifiers & Links

  • DOI (Digital Object Identifier)
  • URLs from references
  • Citation identifiers
  • ISBN numbers
  • arXiv IDs
  • PubMed IDs
📚

Citations & References

  • Full bibliography list
  • Citation parsing
  • Reference year detection
  • DOI extraction per citation
  • Structured JSON export
  • Raw text preservation
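
As a rough illustration of per-citation DOI extraction and year detection, here is a minimal TypeScript sketch. The names `extractDois` and `detectYear` are hypothetical and do not describe this tool's internals. The DOI pattern follows the widely used `10.<registrant>/<suffix>` shape and will miss unusual legacy DOIs.

```typescript
// Hypothetical helpers for illustration only; not this tool's actual code.
// Matches the common modern DOI shape: "10.", 4-9 digits, "/", then a suffix.
const DOI_PATTERN = /\b10\.\d{4,9}\/[-._;()/:A-Z0-9]+/gi;

function extractDois(reference: string): string[] {
  const matches: string[] = reference.match(DOI_PATTERN) ?? [];
  // Trailing punctuation usually belongs to the sentence, not the DOI.
  // Assumption: this can wrongly trim the rare DOI that really ends in ")".
  const cleaned = matches.map((m) => m.replace(/[.,;:)]+$/, "").toLowerCase());
  // De-duplicate while keeping first-seen order.
  return cleaned.filter((doi, i) => cleaned.indexOf(doi) === i);
}

function detectYear(reference: string): number | null {
  // Takes the first four-digit number starting with 19 or 20.
  // Assumption: a page number such as "1998" can produce a false positive.
  const m = reference.match(/\b(?:19|20)\d{2}\b/);
  return m ? Number(m[0]) : null;
}
```

For a reference such as `Smith, J. (2019). Title. https://doi.org/10.1000/xyz123.`, the sketch yields one DOI with the trailing period removed and the year 2019. Keeping the raw reference text next to the parsed fields makes it possible to check results like these by hand.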

Multiple Export Formats

Export your extracted PDF data as JSON, plain text, DOCX, Markdown, or BibTeX, all processed locally in your browser

📊

JSON

Structured data with all metadata & citations

📄

Plain Text

Clean UTF-8 text file (.txt)

📝

Word

Formatted DOCX document

📋

Markdown

GitHub-flavored Markdown (.md)

📚

BibTeX

Citation format for LaTeX (.bib)
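
To show roughly what a BibTeX export involves, here is a minimal TypeScript sketch that turns one parsed citation into an `@article` entry. The `Citation` shape and the `toBibtex` function are assumptions made for illustration, not this tool's actual data model. A real exporter would also pick the entry type (`@book`, `@inproceedings`, and so on) and handle more fields.

```typescript
// Hypothetical record shape for illustration; field names are assumptions.
interface Citation {
  key: string;        // citation key, e.g. "smith2019"
  authors: string[];  // one "Last, First" string per author
  title: string;
  journal?: string;
  year?: number;
  doi?: string;
}

// Escape characters that LaTeX treats specially in ordinary text fields.
function escapeLatex(value: string): string {
  return value.replace(/([&%$#_])/g, "\\$1");
}

function toBibtex(c: Citation): string {
  // The DOI is left unescaped because it is an identifier, not typeset prose.
  const fields: [string, string | undefined][] = [
    ["author", escapeLatex(c.authors.join(" and "))],
    ["title", escapeLatex(c.title)],
    ["journal", c.journal !== undefined ? escapeLatex(c.journal) : undefined],
    ["year", c.year !== undefined ? String(c.year) : undefined],
    ["doi", c.doi],
  ];
  const body = fields
    .filter(([, value]) => value !== undefined && value !== "")
    .map(([name, value]) => `  ${name} = {${value as string}}`)
    .join(",\n");
  return `@article{${c.key},\n${body}\n}`;
}
```

Fields that were not detected are omitted rather than written as empty braces, so the resulting `.bib` entry contains only what was found in the PDF.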

Why Choose Browser-First PDF Tools?

🔒

100% Private

Your PDFs never leave your computer. All processing happens locally in your browser.

Lightning Fast

No server round-trips. Processing runs on your device using modern browser APIs and web workers.

📡

Works Offline

Install as a PWA and process PDFs anywhere, even without an internet connection.

🆓

Free to Use

No sign-up, no API limits, no upload quotas. Built on the open-source PDF.js and pdf-lib libraries.

Learn more about PDF extraction

Practical guides on how PDF metadata is structured, how citation parsers actually work, and how to turn extracted records into a clean bibliography.

🧭

PDF metadata extraction

Where metadata lives inside a PDF, how heuristic extractors find titles and authors, and the failure modes worth knowing about.

📚

Citation parsing

How a parser splits a references section, labels each field, and keeps its hands off the ambiguous bits.
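
The splitting step can be sketched in a few lines of TypeScript. This assumes a numbered reference list using markers such as `[12]` or `12.` at the start of a line. The function name `splitReferences` is hypothetical, and author-year styles with no numbering would need a different heuristic.

```typescript
// Hypothetical sketch: split a numbered references section into entries.
function splitReferences(section: string): string[] {
  return section
    // Break only at newlines that are followed by a "[n]" or "n." marker,
    // so wrapped continuation lines stay attached to their entry.
    // Assumption: a wrapped line that happens to start with "2019. " would
    // be split by mistake; a careful parser leaves such cases unresolved.
    .split(/\n(?=\s*(?:\[\d+\]|\d+\.)\s)/)
    // Collapse internal line breaks and repeated spaces.
    .map((entry) => entry.replace(/\s+/g, " ").trim())
    .filter((entry) => entry.length > 0);
}
```

Each resulting string can then be passed to field-level labeling (authors, title, year, identifiers), with the original text kept alongside the parsed fields.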

🔗

DOIs and identifiers

The persistent identifiers that make a citation resolvable: DOI, arXiv, PubMed, ORCID, ISBN, ISSN.

🖨️

Scanned PDFs and OCR

What to do when a PDF has no text layer: tools, accuracy expectations, and privacy considerations.

🧩

Browser-side processing

The architectural choices behind a privacy-first PDF extractor, from PDF.js to service workers.

🗂️

Export formats

Reference pages for BibTeX, RIS, CSL-JSON, Markdown, and DOCX, with notes on when to pick each.

Last reviewed on April 24, 2026