Why process PDFs in the browser
Most online PDF tools are server-side. You upload a file, a server somewhere does the work, and a result comes back. That model is operationally simple, but it also means that every document you process leaves your machine, sits on someone else's storage, and depends on that operator's policies for as long as the file is retained. For research material — preprints, drafts, confidential reports — that trade-off is often the wrong one. This guide describes how a browser-only PDF tool is put together and what the practical trade-offs look like.
The core stack
A modern browser is a fully featured runtime. The pieces a privacy-first PDF tool relies on are all standard:
- The File API and ArrayBuffer. Reading a chosen file into memory without uploading it. The file lives in the same process as the page; nothing crosses the network.
- PDF.js. A pure-JavaScript PDF renderer and parser maintained by the Mozilla project. It exposes pages, text content with positions, fonts, and the document information dictionary.
- pdf-lib. A library for creating and modifying PDF files. Useful for redaction, page extraction, and producing new PDFs from extracted content.
- Web Workers. Background threads for keeping the parser off the main UI thread. PDF.js ships its own worker.
- Service Workers. A programmable network proxy that lets the page cache its own assets and remain functional offline.
- IndexedDB and the Cache API. Persistent client-side storage for the application shell and, where useful, for derived data the user explicitly opts to keep.
- The Web Crypto API. If a tool ever needs to hash, sign, or encrypt anything in the browser, the primitives are already there.
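As a concrete sketch of the first item: the chosen file's bytes stay inside the page, and a cheap header check can confirm the file really is a PDF before any parser sees it. The function and variable names below are illustrative, not part of any library:

```javascript
// Sketch: sniff the "%PDF-" magic bytes before handing a file to a parser.
function looksLikePdf(bytes) {
  // Every well-formed PDF begins with the ASCII marker "%PDF-".
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

// Browser usage (runs only in a page, shown here as a comment):
//   const file  = input.files[0];                            // File API
//   const bytes = new Uint8Array(await file.arrayBuffer());  // no network request
//   if (looksLikePdf(bytes)) { /* pass `bytes` to PDF.js or pdf-lib */ }
```

Nothing in that path touches the network: `file.arrayBuffer()` reads from disk into the page's own memory.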
What "privacy-first" actually means
The phrase is overused. In a browser tool it has a concrete shape: there is no server-side component for the user to send data to. The HTML, CSS, and JavaScript are static files. The only network activity once the page is loaded is whatever the user opts into — fetching a fresh copy of a CDN library, for example, or loading an ad. The PDF the user picks is read from disk, processed in memory, and exported back to disk. None of it touches any other machine.
This is testable. Open the browser's network tab, load a PDF into the tool, and watch: no upload request is made. The same is true with the network disconnected after the initial page load — the service worker cache will keep the page running.
Trade-offs of the client-side model
Compute happens on the user's device
Parsing a 200-page PDF is roughly the same amount of work whether it happens on a server farm or on a phone. Pushing it to the user's device means the user pays the battery and CPU cost. On a modern laptop this is a non-issue; on a low-end phone working through a long document it can be slow. A worker thread keeps the page responsive but does not shorten the underlying work.
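One common mitigation is to process long documents in page batches inside the worker, so progress can be reported between batches instead of the page going silent until the end. A minimal batching helper, as a sketch (the function name and parameters are illustrative):

```javascript
// Sketch: split a page range into batches a worker can process one at a
// time, posting a progress message after each batch.
function pageBatches(pageCount, batchSize) {
  const batches = [];
  for (let start = 1; start <= pageCount; start += batchSize) {
    batches.push({ first: start, last: Math.min(start + batchSize - 1, pageCount) });
  }
  return batches;
}

// pageBatches(200, 50) → batches covering pages 1–50, 51–100, 101–150, 151–200
```

The total time is unchanged; what improves is that the UI can show "page 100 of 200" rather than freezing.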
You give up some heavy operations
A server pipeline can pull in OCR engines, language models, large reference databases, and arbitrary tooling. A browser tool can only do what fits inside a few megabytes of JavaScript and runs in the user's browser. Heavy OCR, ML-based citation parsing, and full-text similarity search are usually out of scope for the client; they belong on a workstation or in a service the user has chosen to trust.
Updates are delivery problems, not deployment problems
There is no server to upgrade. New versions ship by replacing static files, and the service worker handles cache invalidation. This makes shipping fast, but a buggy release reaches everyone immediately; staged rollouts have to be implemented in the static layer (for example, by routing a fraction of users to a different bundle).
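Routing a fraction of users to a different bundle can be done client-side with a deterministic bucketing function: hash a stable client identifier into [0, 1) and send that fraction to the new bundle. The sketch below uses FNV-1a for illustration; the bundle paths and the idea of a stored client id are assumptions, not a description of any real site:

```javascript
// Sketch: deterministic rollout bucketing. FNV-1a 32-bit hash of a stable
// client id, mapped to [0, 1). The same id always lands in the same bucket.
function bucket(clientId) {
  let h = 0x811c9dc5;                     // FNV-1a offset basis
  for (const ch of clientId) {
    h ^= ch.codePointAt(0);
    h = Math.imul(h, 0x01000193) >>> 0;   // FNV prime, kept unsigned
  }
  return h / 0x100000000;                 // scale to [0, 1)
}

// Hypothetical bundle paths: serve the new bundle to `rolloutFraction` of users.
function bundleFor(clientId, rolloutFraction) {
  return bucket(clientId) < rolloutFraction ? '/app.next.js' : '/app.js';
}
```

Because the hash is deterministic, a user stays on the same bundle across reloads, and raising the fraction widens the rollout without any server-side state.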
Telemetry is limited by design
Without a server, there is no log of what the tool did. That is a feature for the user and a constraint for the developer. The only signals available are the standard analytics events (page loaded, user interacted with element X) and any errors the page chooses to report. There is no equivalent of "what did the parser produce on this user's file" — and that is the point.
Offline use
The site registers a service worker that caches the application shell on first visit. After that, the page loads and the extractor runs without a network connection. This is genuinely useful for working on a flight, in a library with patchy Wi-Fi, or in a high-security environment where the network is restricted. Installing the page as a Progressive Web App (most desktop browsers offer this from the address bar) makes the offline behaviour explicit.
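A cache-first service worker of the kind described can be sketched in a few lines. The cache name and shell paths below are illustrative; the guard at the top only exists so the constants can be read outside a real `ServiceWorkerGlobalScope`:

```javascript
// sw.js — minimal cache-first shell caching, as a sketch.
const CACHE_NAME = 'pdf-tool-shell-v1';  // bump the suffix to invalidate old caches
const SHELL = ['/', '/index.html', '/app.js', '/app.css'];

// In the browser this runs in a ServiceWorkerGlobalScope, where `self`
// and `caches` exist; elsewhere the handlers are simply not registered.
if (typeof self !== 'undefined' && typeof caches !== 'undefined') {
  self.addEventListener('install', (event) => {
    // Pre-cache the application shell on first visit.
    event.waitUntil(caches.open(CACHE_NAME).then((c) => c.addAll(SHELL)));
  });

  self.addEventListener('fetch', (event) => {
    // Cache-first: serve from the shell cache, fall back to the network.
    event.respondWith(
      caches.match(event.request).then((hit) => hit || fetch(event.request))
    );
  });
}
```

Once the install step has run, every asset the page needs is served from the cache, which is why the extractor keeps working with the network disconnected.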
What the user should still verify
"Client-side" is a property of the design, not a guarantee anyone can take on trust. A user who cares can verify it directly:
- Look at the network tab while processing a file. No upload should appear.
- Disconnect the network and try again. The extractor should still work.
- Inspect the page's content security policy. The site's CSP allows scripts only from itself and from the CDN serving the libraries; no analytics or ad domains are permitted to execute the parser.
- Read the source. The application script is unminified and short.
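For illustration, a policy of the shape described above might look like the following (the CDN host is a placeholder; the real policy is whatever header or meta tag the site actually serves):

```
Content-Security-Policy:
  default-src 'self';
  script-src 'self' https://cdn.example.com;
  connect-src 'self';
  img-src 'self' data:
```

With `script-src` restricted this way, no third-party domain can execute code in the page, and `connect-src 'self'` means the page cannot open network connections to anywhere else even if it tried.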
That is what privacy-by-design looks like in practice — not a promise, but a property someone can confirm in five minutes with the dev tools open.