DOIs, arXiv IDs, and friends
A persistent identifier is a short string that resolves, today and ten years from now, to a particular work or person. In an academic PDF, identifiers are the most reliable thing an extractor can find. They are short, they have a known shape, and once you have one you can ask a metadata service for the canonical record. This page is a quick reference to the identifiers that show up in scholarly documents and how to recognise them.
DOI — Digital Object Identifier
A DOI looks like 10.1234/abcd.5678: a registrant prefix beginning with "10.", a slash, and a suffix chosen by the publisher. The prefix has at least four digits after the "10." and may have more. The suffix is case-insensitive and in practice contains letters, digits, and a small set of punctuation. The Crossref recommendation for matching modern DOIs in plain text is something close to 10.\d{4,9}/[-._;()/:A-Z0-9]+, applied case-insensitively; it covers the great majority of DOIs but not every legacy one.
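As a minimal sketch, the pattern above can be applied with Python's re module. The function name find_dois, the lowercasing, and the trimming of trailing punctuation are illustrative choices for this example, not part of any recommendation.

```python
import re

# The pattern quoted above; IGNORECASE lets the A-Z class match lowercase suffixes.
DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)


def find_dois(text: str) -> list[str]:
    """Return DOI-shaped substrings of text, in order of first appearance."""
    found = []
    for match in DOI_RE.finditer(text):
        # The suffix class includes sentence punctuation, so a DOI at the end
        # of a sentence picks up the full stop. Trim the common cases.
        doi = match.group(0).rstrip(".,;:")
        # Drop a closing parenthesis that has no opening partner in the DOI.
        if doi.endswith(")") and doi.count(")") > doi.count("("):
            doi = doi[:-1].rstrip(".,;:")
        # DOIs are case-insensitive; lowercase is an assumed normal form.
        found.append(doi.lower())
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))


print(find_dois("See https://doi.org/10.1234/ABCD.5678. Also (doi:10.1234/abcd.5678)"))
# → ['10.1234/abcd.5678']
```

Trimming is a heuristic: a DOI whose suffix really does end in a full stop would be cut short, so a stricter pipeline might keep both forms and check them against a resolver.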
DOIs are issued by registration agencies (Crossref, DataCite, mEDRA, and others) and resolve through doi.org. A DOI gives you three things: a canonical landing page, a structured metadata record from the registration agency's API, and an unambiguous reference for citation. When a citation carries a DOI, looking it up is the fastest way to clean that citation up.
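One way to fetch the structured record is content negotiation at doi.org: a request that asks for a metadata media type is redirected to the registration agency's metadata service. The sketch below uses only the standard library. The helper names build_metadata_request and fetch_metadata are made up for this example, and CSL JSON is just one of the formats an agency may offer.

```python
import json
import urllib.parse
import urllib.request

# Media type for CSL JSON, a citation-oriented metadata format.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_metadata_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a doi.org request asking for CSL JSON."""
    # Percent-encode characters that are unsafe in a URL path; keep "/" literal.
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": CSL_JSON})


def fetch_metadata(doi: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body. Needs network access."""
    request = build_metadata_request(doi)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_metadata_request("10.1234/abcd.5678")
print(request.full_url)  # → https://doi.org/10.1234/abcd.5678
```

Building the request separately from sending it keeps the URL handling testable offline; a production tool would also want retries, rate limiting, and handling for DOIs that do not resolve.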
Two pitfalls are worth knowing. First, a DOI in PDF text often spans a line break, and a hyphen at the end of the first line may be part of the DOI or may have been added by the typesetter; an extractor must stitch the lines together carefully. Second, a publisher's URL ("publisher.com/article/12345") is not a DOI, and it can change at any time. Treat URLs and DOIs as different fields.
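Because the text alone cannot say whether a line-ending hyphen belongs to the DOI, one cautious approach is to produce both joins and let a later lookup decide. The sketch below assumes the extractor has the two lines as strings; stitch_candidates is a hypothetical helper, and the DOI in the usage line is made up.

```python
def stitch_candidates(line: str, next_line: str) -> list[str]:
    """Return the plausible joins of an identifier split across two lines."""
    head_tokens = line.split()
    tail_tokens = next_line.split()
    if not head_tokens or not tail_tokens:
        return []
    # Only the last token of the first line and the first token of the
    # second line can belong to the broken identifier.
    head, tail = head_tokens[-1], tail_tokens[0]
    candidates = [head + tail]  # a trailing hyphen is kept as part of the DOI
    if head.endswith("-"):
        candidates.append(head[:-1] + tail)  # hyphen was added at the break
    return candidates


print(stitch_candidates("doi: 10.1234/abc-", "def.5678 (2020)"))
# → ['10.1234/abc-def.5678', '10.1234/abcdef.5678']
```

Each candidate can then be checked against a resolver, keeping whichever one exists.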
arXiv ID
arXiv assigns identifiers in two formats. New-style identifiers (since April 2007) look like 2401.12345: four digits for the two-digit year and month, a dot, and a sequence number of four digits (through 2014) or five digits (from 2015). They may carry a version suffix ("v2"). Old-style identifiers from before April 2007 look like cs.CL/0401001: a category, a slash, the year and month, and a three-digit sequence number. Both forms appear inside PDFs, often on the first page as part of "arXiv:2401.12345 [cs.CL]".
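Both formats can be matched with one pattern. The sketch below requires the "arXiv:" label, on the assumption that a bare number such as 2401.12345 is too easily confused with an ordinary decimal; the names ArxivRef and find_arxiv_ids are illustrative.

```python
import re
from typing import NamedTuple, Optional

NEW_STYLE = r"\d{4}\.\d{4,5}"                     # e.g. 2401.12345
OLD_STYLE = r"[a-z][a-z-]*(?:\.[A-Z]{2})?/\d{7}"  # e.g. cs.CL/0401001

# The label is required; the version suffix is optional and captured separately.
ARXIV_RE = re.compile(
    rf"arXiv:\s*(?P<id>{NEW_STYLE}|{OLD_STYLE})(?:v(?P<version>\d+))?",
    re.IGNORECASE,
)


class ArxivRef(NamedTuple):
    identifier: str
    version: Optional[int]  # None when no "vN" suffix was given


def find_arxiv_ids(text: str) -> list[ArxivRef]:
    """Return every labelled arXiv identifier in text, with its version."""
    refs = []
    for match in ARXIV_RE.finditer(text):
        version = match.group("version")
        refs.append(ArxivRef(match.group("id"), int(version) if version else None))
    return refs


print(find_arxiv_ids("arXiv:2401.12345v2 [cs.CL] 5 Jan 2024"))
# → [ArxivRef(identifier='2401.12345', version=2)]
```

Keeping the version apart from the identifier makes it easy to treat v1 and v2 of the same preprint as one work when deduplicating.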
An arXiv ID resolves at arxiv.org/abs/<id> and the arXiv API returns a structured record. Many arXiv preprints are later published in a peer-reviewed venue and acquire a DOI; the arXiv record links to the DOI when one is registered. For a citation tool, the rule of thumb is: prefer the DOI when present, fall back to the arXiv ID when only that is given.
PubMed ID and PubMed Central ID
PubMed IDs (PMIDs) are bare integers, typically eight digits in current use, that refer to records in the PubMed bibliographic database. PubMed Central IDs (PMCIDs) take the form PMC1234567. In a PDF the PMID often appears next to a label, as in "PMID: 36720123", which makes it easy to extract with a simple pattern. Both identifiers resolve through the NCBI E-utilities API, which returns a structured record including the corresponding DOI when one exists.
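A simple pattern of the kind mentioned above might look like the following sketch. It assumes the PMID is always labelled, since a bare integer could be anything; find_pubmed_ids is a name chosen for this example.

```python
import re

# A PMID is only trusted when it follows its label; the colon is optional.
PMID_RE = re.compile(r"\bPMID:?\s*(\d+)\b", re.IGNORECASE)
# A PMCID carries its own "PMC" prefix, so no separate label is needed.
PMCID_RE = re.compile(r"\b(PMC\d+)\b")


def find_pubmed_ids(text: str) -> dict[str, list[str]]:
    """Return labelled PMIDs and PMCIDs found in text."""
    return {
        "pmid": PMID_RE.findall(text),
        "pmcid": PMCID_RE.findall(text),
    }


print(find_pubmed_ids("PMID: 36720123; PMCID: PMC1234567"))
# → {'pmid': ['36720123'], 'pmcid': ['PMC1234567']}
```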
ORCID
ORCID identifies researchers, not works. An ORCID iD looks like 0000-0002-1825-0097: four groups of four characters separated by hyphens, with the last character being either a digit or "X" (used as a check digit for the value 10). They are usually shown as URLs of the form https://orcid.org/0000-0002-1825-0097.
ORCID iDs are useful for disambiguating common author names. An extractor can pick them up with a straightforward pattern, but PDFs that include them usually do so only on the first page near the author block, sometimes encoded as a hyperlink rather than visible text.
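The final character of an ORCID iD is computed from the first fifteen digits with the ISO 7064 MOD 11-2 algorithm, so a pattern match can be confirmed before it is kept. The following is a sketch of that check; the function names are illustrative.

```python
import re

ORCID_RE = re.compile(r"\b(\d{4}-\d{4}-\d{4}-\d{3}[\dX])\b")


def orcid_check_char(first15: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in first15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """True if orcid has 16 characters and a correct check character."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_char(compact[:15]) == compact[15]


def find_orcids(text: str) -> list[str]:
    """Return ORCID-shaped strings in text whose check character is valid."""
    return [m for m in ORCID_RE.findall(text) if is_valid_orcid(m)]


print(find_orcids("https://orcid.org/0000-0002-1825-0097"))
# → ['0000-0002-1825-0097']
```

This only sees visible text; iDs stored as hyperlink targets have to be pulled from the PDF's link annotations first and then passed through the same check.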
ISBN and ISSN
ISBNs identify books, ISSNs identify serials. ISBN-13 has thirteen digits, usually printed with hyphens separating five parts whose lengths vary ("978-0-12-345678-6"); ISBN-10 is the older, ten-character form, whose last character may be "X". Both have a check digit at the end, which an extractor can validate to discard false matches.
An ISSN looks like 1234-5679: two groups of four characters separated by a hyphen, where the last character is a check character that may be "X". ISSNs are useful for grouping citations by venue, but they do not by themselves point to a specific article.
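The check-digit rules are simple enough to sketch. ISBN-13 weights its digits 1 and 3 alternately and requires the sum to be divisible by 10; ISBN-10 and ISSN use descending weights and require divisibility by 11, with "X" standing for 10. The function names and sample numbers below are chosen for illustration only.

```python
import re


def _compact(raw: str) -> str:
    """Remove hyphens and spaces and uppercase a possible check 'x'."""
    return raw.replace("-", "").replace(" ", "").upper()


def isbn13_is_valid(raw: str) -> bool:
    s = _compact(raw)
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    # Weights alternate 1, 3, 1, 3, ... starting from the first digit.
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(s))
    return total % 10 == 0


def _mod11_is_valid(s: str, length: int) -> bool:
    if not re.fullmatch(r"[0-9]{%d}[0-9X]" % (length - 1), s):
        return False
    values = [10 if ch == "X" else int(ch) for ch in s]
    # Weights run from `length` down to 1; the check character has weight 1.
    weights = range(length, 0, -1)
    return sum(v * w for v, w in zip(values, weights)) % 11 == 0


def isbn10_is_valid(raw: str) -> bool:
    return _mod11_is_valid(_compact(raw), 10)


def issn_is_valid(raw: str) -> bool:
    return _mod11_is_valid(_compact(raw), 8)


print(isbn13_is_valid("978-0-306-40615-7"), issn_is_valid("0378-5955"))
# → True True
```

A valid check digit does not prove the number was ever assigned, but it discards most random digit runs that happen to have the right length.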
Less common but worth recognising
- Handle (handle.net). A general-purpose persistent identifier; the DOI system is built on the Handle System, so every DOI is also a Handle.
- URN (Uniform Resource Name). A namespaced identifier scheme, often used by national libraries.
- OpenAlex ID. Identifiers issued by the OpenAlex scholarly graph for works, authors, institutions, and venues.
- Semantic Scholar Corpus ID. A numeric identifier used internally by Semantic Scholar; useful when you need to call its API.
- NASA ADS bibcode. A 19-character code used by the Astrophysics Data System: "1995ApJ...445..511L". Common in astronomy and physics references.
- SSRN ID. Numeric identifiers for preprints on the Social Science Research Network.
Putting them together
A reasonable extraction pass is:
- Scan the first two pages and the references section for any of the patterns above.
- Validate where possible (ISBN check digits, DOI prefix shape, ORCID check character).
- Deduplicate: a single citation might give you a DOI and a PubMed ID; keep both, but prefer the DOI as the primary key.
- Stop. Do not try to derive a DOI from a publisher URL by parsing the URL — the relationship is not stable.
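The deduplication step above can be sketched as follows, assuming earlier steps have already produced (kind, value) pairs for one citation. Only "DOI first" comes from the list; the rest of the PRIORITY order, and the names merge_identifiers and primary_key, are assumptions made for this example.

```python
from typing import Optional

# Preference order for the primary key. Only "doi" first is given by the
# steps above; the order of the remaining kinds is an assumption.
PRIORITY = ("doi", "pmid", "pmcid", "arxiv", "isbn")


def merge_identifiers(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Collapse (kind, value) pairs for one citation, dropping repeats."""
    merged: dict[str, list[str]] = {}
    for kind, value in pairs:
        values = merged.setdefault(kind, [])
        if value not in values:
            values.append(value)
    return merged


def primary_key(merged: dict[str, list[str]]) -> Optional[tuple[str, str]]:
    """Pick the highest-priority identifier, or None if there is none."""
    for kind in PRIORITY:
        if merged.get(kind):
            return kind, merged[kind][0]
    return None


found = [
    ("pmid", "36720123"),
    ("doi", "10.1234/abcd.5678"),
    ("doi", "10.1234/abcd.5678"),
]
merged = merge_identifiers(found)
print(primary_key(merged))  # → ('doi', '10.1234/abcd.5678')
```

Every identifier stays in the merged record, so a later lookup can still fall back to the PMID if the DOI fails to resolve.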
Once the identifiers are out, the heavy lifting moves to the metadata services that own them.