Parse a PDF in TypeScript
This tutorial walks through parsing a PDF with Parse from a Node.js TypeScript script — install the SDK, set your API key, run the parse, walk the markdown output. It’s the TypeScript counterpart to the Quick Start: Parse a PDF & Interpret Outputs Python tutorial, and it’s the recommended starting point if you’re building a server-side or CLI-style integration in Node.
When to use TypeScript
The Parse SDK works the same way in Python and TypeScript: same tier system, same option buckets, same response shapes. Pick TypeScript when:
- You’re integrating Parse into a Node.js or Bun service, an Edge function, or a CLI tool you’ll distribute as a package
- Your team is already on TS and you want a single language across the stack
- You need to wire Parse into a Next.js, Astro, or other JS-first framework
If you’re prototyping in a notebook, the Python tutorials are a faster path. Once you’re ready to ship, the patterns translate cleanly.
1. Set up your project
Initialize a fresh project (or skip this if you already have one):

    mkdir parse-tutorial && cd parse-tutorial
    npm init -y

Install the SDK and tsx (a TypeScript runner that supports modern ESM syntax, including top-level await):

    npm install @llamaindex/llama-cloud
    npm install --save-dev tsx typescript @types/node

Set your Parse API key as an environment variable so the SDK picks it up automatically:

    export LLAMA_CLOUD_API_KEY="llx-..."

Get an API key from the LlamaCloud dashboard if you don’t have one yet.
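If the variable is missing, the problem tends to surface later as an authentication error. A minimal sketch of an early check, assuming you want the script to stop with a clear message before any network call; requireEnv is a hypothetical helper written for this example, not part of the SDK:

```typescript
// Hypothetical helper (not part of the Parse SDK): fail fast when a required
// environment variable is missing or blank.
function requireEnv(
  env: Record<string, string | undefined>,
  name: string,
): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`${name} is not set. Export it in this shell before running the script.`);
  }
  return value;
}

// In the real script you would pass process.env:
//   requireEnv(process.env, "LLAMA_CLOUD_API_KEY");
const demoEnv = { LLAMA_CLOUD_API_KEY: "llx-example" }; // placeholder value for illustration
console.log(`Key length: ${requireEnv(demoEnv, "LLAMA_CLOUD_API_KEY").length}`);
```

Passing the environment in as an argument keeps the helper easy to exercise without touching the real shell environment.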
2. Download a sample PDF
For this tutorial, we’ll parse the LLaMA paper, a public research paper with the kind of multi-column layout, tables, and figures that show off Parse’s strengths.

    curl -L -o llama.pdf https://arxiv.org/pdf/2302.13971.pdf

You can use any PDF; substitute its path in the script below.
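A download can quietly save an HTML error page under a .pdf name. A rough sanity check, relying on the fact that PDF files normally begin with the bytes %PDF-; looksLikePdf is a hypothetical helper for this sketch, not an SDK function:

```typescript
// Hypothetical helper: true when the buffer starts with the PDF magic bytes "%PDF-".
function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (bytes.length < magic.length) return false;
  return magic.every((byte, i) => bytes[i] === byte);
}

// In the real script you might check the downloaded file before uploading, e.g.:
//   looksLikePdf(fs.readFileSync("./llama.pdf").subarray(0, 5))
const sampleHeader = Uint8Array.from([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x35]); // "%PDF-1.5"
console.log(looksLikePdf(sampleHeader));
```

This is deliberately strict: it only inspects the first five bytes, so treat a false result as a prompt to open the file and look, not as proof that the file is unusable.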
3. Write the parse script
Create parse.ts with the bare minimum: upload, parse, print the markdown.

    import LlamaCloud from "@llamaindex/llama-cloud";
    import fs from "fs";

    const client = new LlamaCloud(); // reads LLAMA_CLOUD_API_KEY from the environment

    console.log("Uploading file...");
    const file = await client.files.create({
      file: fs.createReadStream("./llama.pdf"),
      purpose: "parse",
    });

    console.log(`Uploaded file ${file.id}, parsing...`);
    const result = await client.parsing.parse({
      file_id: file.id,
      tier: "agentic",
      version: "latest",
      expand: ["markdown"],
    });

    console.log(`Job status: ${result.job.status}`);
    console.log(`Total pages: ${result.markdown.pages.length}`);
    console.log("\n--- First page markdown ---\n");
    console.log(result.markdown.pages[0].markdown);

A few things worth noting:
- new LlamaCloud() reads LLAMA_CLOUD_API_KEY from the environment, so you don’t need to pass it explicitly. If you do want to pass it, use new LlamaCloud({ apiKey: "llx-..." }).
- client.parsing.parse() blocks until the job finishes and returns the full result. The SDK handles polling for you. If you’re hitting the raw REST API directly, you’d need to poll yourself; see the REST API tab in Getting Started.
- Top-level await works here because tsx runs the file as an ES module, which for a .ts file depends on "type": "module" being set in your package.json (npm pkg set type=module adds it). If you’re targeting CommonJS ("type": "commonjs" in your package.json), wrap the body in an async function main() and call main().
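For the CommonJS case, the wrapper can be sketched as follows. The body of main() here is a placeholder that awaits a resolved promise in place of the upload and parse calls, and the "done" value is made up for the example:

```typescript
// Placeholder body: in parse.ts this is where the upload and parse calls go.
async function main(): Promise<string> {
  const outcome = await Promise.resolve("done"); // stands in for an awaited SDK call
  return outcome;
}

// No top-level await needed: handle the promise explicitly and set a
// non-zero exit code if anything inside main() throws.
main()
  .then((outcome) => console.log(`Finished: ${outcome}`))
  .catch((error: unknown) => {
    console.error(error);
    process.exitCode = 1;
  });
```

Attaching .catch() matters in this form, since a rejected promise with no handler would otherwise surface as an unhandled rejection.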
4. Run it
    npx tsx parse.ts

You should see the upload, the parse job run, and the first page of the LLaMA paper printed as markdown: title, author block, abstract, and section headings, all preserved as proper markdown structure.
5. Walk every page
The bare-bones script above only prints the first page. To process the entire document, iterate over result.markdown.pages:

    for (const page of result.markdown.pages) {
      console.log(`\n=== Page ${page.page_number} ===\n`);
      console.log(page.markdown);
    }

Or concatenate the whole document into a single markdown string for an LLM prompt:

    const fullDocument = result.markdown.pages
      .map((p) => p.markdown)
      .join("\n\n---\n\n");

    console.log(`Full document is ${fullDocument.length} characters.`);
    // fullDocument is now ready to feed into your LLM

6. Get more than markdown
The expand array controls what comes back in the response. Ask for items if you want the structured tree (tables, headings, figures), text if you want plain text per page, and metadata if you want confidence scores and per-page metadata:
    const result = await client.parsing.parse({
      file_id: file.id,
      tier: "agentic",
      version: "latest",
      expand: ["markdown", "items", "metadata"],
    });

    // Walk the items tree to find tables
    for (const page of result.items.pages) {
      for (const item of page.items) {
        if (item.type === "table") {
          console.log(
            `Table on page ${page.page_number}: ${item.rows.length} rows`,
          );
        }
      }
    }

    // Inspect per-page confidence
    for (const page of result.metadata.pages) {
      console.log(`Page ${page.page_number}: confidence ${page.confidence}`);
    }

See Retrieving Results for every legal expand value and what it returns.
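One way to use the per-page metadata is to flag pages for manual review. A minimal sketch, assuming each metadata page carries the page_number and confidence fields used above and that confidence is a number between 0 and 1 (an assumption; check Retrieving Results for the actual scale). PageMetadata, lowConfidencePages, and the 0.8 threshold are illustrative names and values, not part of the SDK:

```typescript
// Assumed shape, inferred from the fields used in this tutorial.
interface PageMetadata {
  page_number: number;
  confidence: number;
}

// Returns the page numbers whose confidence falls below the threshold.
function lowConfidencePages(pages: PageMetadata[], threshold: number): number[] {
  return pages
    .filter((page) => page.confidence < threshold)
    .map((page) => page.page_number);
}

// Made-up sample data standing in for result.metadata.pages.
const samplePages: PageMetadata[] = [
  { page_number: 1, confidence: 0.97 },
  { page_number: 2, confidence: 0.62 },
  { page_number: 3, confidence: 0.91 },
];
console.log(lowConfidencePages(samplePages, 0.8)); // [ 2 ]
```

In a real script you would pass result.metadata.pages in place of the sample array and pick a threshold that suits your documents.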
7. Add a custom prompt
Steer the parser with natural-language instructions via agentic_options.custom_prompt:

    const result = await client.parsing.parse({
      file_id: file.id,
      tier: "agentic",
      version: "latest",
      agentic_options: {
        custom_prompt:
          "This is a scientific research paper. Preserve all mathematical equations in LaTeX. Keep inline citations like (Smith 2024) intact in the flowing text. Do not flatten the multi-column layout into a single column.",
      },
      expand: ["markdown"],
    });

Custom prompts only work on the cost_effective, agentic, and agentic_plus tiers (not fast). See Custom Prompt for the full prompt-engineering guide.
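Because the fast tier does not accept custom prompts, it can help to build the request in one place and reject that combination before calling the API. A minimal sketch, assuming the four tier names mentioned in this tutorial and the request fields shown above; the Tier and ParseRequest types and buildParseRequest are hypothetical helpers, not SDK exports:

```typescript
// Tier names as mentioned in this tutorial; treated here as an assumption.
type Tier = "fast" | "cost_effective" | "agentic" | "agentic_plus";

// Assumed request shape, mirroring the fields used in the snippets above.
interface ParseRequest {
  file_id: string;
  tier: Tier;
  version: string;
  expand: string[];
  agentic_options?: { custom_prompt: string };
}

// Builds the request and refuses a custom prompt on the fast tier.
function buildParseRequest(fileId: string, tier: Tier, customPrompt?: string): ParseRequest {
  if (customPrompt !== undefined && tier === "fast") {
    throw new Error("Custom prompts are not supported on the fast tier.");
  }
  const request: ParseRequest = {
    file_id: fileId,
    tier,
    version: "latest",
    expand: ["markdown"],
  };
  if (customPrompt !== undefined) {
    request.agentic_options = { custom_prompt: customPrompt };
  }
  return request;
}

// "file-123" is a placeholder id for illustration.
const demoRequest = buildParseRequest("file-123", "agentic", "Preserve all equations in LaTeX.");
console.log(demoRequest.agentic_options?.custom_prompt);
```

The resulting object has the same fields as the calls above, so it should be usable as the argument to client.parsing.parse(); if the SDK’s own types are stricter than this sketch, adapt the interface to match them.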
8. Save the results to disk
The most common pattern after parsing is writing the markdown to a file:

    import { writeFileSync } from "fs";

    const fullDocument = result.markdown.pages
      .map((p) => p.markdown)
      .join("\n\n---\n\n");

    writeFileSync("./llama-parsed.md", fullDocument);
    console.log("Wrote llama-parsed.md");

Or write each page as a separate file for downstream chunking:
    import { mkdirSync } from "fs";

    // writeFileSync does not create missing directories, so create ./pages first
    mkdirSync("./pages", { recursive: true });
    for (const page of result.markdown.pages) {
      writeFileSync(`./pages/page-${page.page_number}.md`, page.markdown);
    }

Common gotchas
- tsx not running top-level await? Check that your package.json has "type": "module" (or use tsx’s ESM mode explicitly). If you’re stuck on CommonJS, wrap the script in async function main() { ... } main().
- Authentication errors? Make sure LLAMA_CLOUD_API_KEY is set in the same shell session you’re running tsx in. echo $LLAMA_CLOUD_API_KEY should print your key, not be empty.
- fs.createReadStream issues? The path is relative to your current working directory, not the script file. Use an absolute path with path.resolve(__dirname, "llama.pdf") if you’re running from a different directory. If __dirname is undefined in your ES module, import.meta.dirname (Node 20.11+) is the equivalent.
- Parse job hangs? It shouldn’t. client.parsing.parse() blocks for as long as the job takes, but the SDK has built-in timeouts. If a single job takes more than a couple of minutes, check the document: very long documents (1000+ pages) on agentic_plus can take several minutes.
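If you want your own upper bound on how long the script waits, independent of the SDK’s built-in timeouts, one option is a generic deadline wrapper around the promise. A minimal sketch; withTimeout is a hypothetical helper, not an SDK feature, and it only stops the script from waiting (it does not cancel the job on the server):

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer so the process can exit.
  return Promise.race([work, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// In the real script you might wrap the parse call, e.g.:
//   const result = await withTimeout(client.parsing.parse({ ... }), 300000, "parse job");
const slowJob = new Promise<string>((resolve) => setTimeout(() => resolve("done"), 10));
withTimeout(slowJob, 1000, "demo job").then((value) => console.log(value));
```

Clearing the timer in finally() is the design choice worth keeping: without it, a pending timer can hold the Node process open after the work has already finished.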
See also
- Getting Started → TypeScript tab: the canonical first-parse setup with simpler scope
- Quick Start: Parse a PDF & Interpret Outputs: the Python equivalent of this tutorial with deeper interpretation of each output view
- Tiers: pick the right tier for your document
- Configuration Model: where every option lives in the request shape
- Retrieving Results: every legal expand value
- Custom Prompt: natural-language steering of the agentic parser