Parse

Parse uploaded files to transform PDFs and drawings into LLM-ready chunks and blocks.

What is File Parsing?

File parsing converts PDF documents into structured, LLM-readable blocks.

Using the Parse API

Download a test file:

import requests

resp = requests.get("https://assets.nomicatlas.com/department-of-labor-data.pdf")
resp.raise_for_status()  # fail early if the download did not succeed
with open("department-of-labor-data.pdf", "wb") as f:
    f.write(resp.content)

Upload and parse it using the Nomic Platform:

from nomic import NomicClient

client = NomicClient()

# upload
file = client.upload_file("department-of-labor-data.pdf")

# parse
result = client.parse(file)

Print the result:

import json
import sys
print("Parsed document:")
json.dump(result["result"], sys.stdout, indent=4)

Example Result

See the response format here.

{
    "chunks": [
        {
            "content": "Department of Labor seal with \"News Release\" text.\nTRANSMISSION OF MATERIALS IN THIS ...",
            "embed": "Department of Labor seal with \"News Release\" text.\nTRANSMISSION OF MATERIALS IN THIS ...",
            "blocks": [
                {
                    "type": "Figure",
                    "bbox": {
                        "left": 0.0791983760260289,
                        "top": 0.03243826374863133,
                        "width": 0.464428297055313,
                        "height": 0.06306505203247072,
                        "page": 1
                    },
                    "content": ""
                },
                {
                    "type": "Text",
                    "bbox": {
                        "left": 0.20591503267973857,
                        "top": 0.13927020202020207,
                        "width": 0.5926797385620916,
                        "height": 0.026579545454545356,
                        "page": 1
                    },
                    "content": "TRANSMISSION OF MATERIALS IN THIS RELEASE IS EMBARGOED UNTIL 8:30 A.M. (Eastern) Thursday, January 30, 2025"
                } ...
            ]
        },
        {
            "content": "SEASONALLY ADJUSTED DATA\n## UNADJUSTED DATA\nThe advance number of actual initial claims under ...",
            "embed": "SEASONALLY ADJUSTED DATA\n## UNADJUSTED DATA\nThe advance number of actual initial claims under ...",
            "blocks": [
                {
                    "type": "Section Header",
                    "bbox": {
                        "left": 0.10200000000000001,
                        "top": 0.07279166666666662,
                        "width": 0.17254248366013072,
                        "height": 0.011214646464646538,
                        "page": 2
                    },
                    "content": "## UNADJUSTED DATA"
                } ...
            ]
        } ...
    ]
}
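
Once you have a result shaped like the example above, you will usually want to walk its chunks and blocks. The following is a minimal sketch that uses only the fields visible in the example (chunks, blocks, type, bbox, content). The helper names iter_blocks, blocks_of_type, and bbox_to_pixels are hypothetical and not part of the Nomic client. The pixel conversion assumes that bbox values are fractions of the page width and height, which the example values suggest but this page does not state.

```python
# Sample shaped like the example result (values shortened for illustration).
sample_result = {
    "chunks": [
        {
            "content": "Department of Labor seal ...",
            "embed": "Department of Labor seal ...",
            "blocks": [
                {
                    "type": "Figure",
                    "bbox": {"left": 0.079, "top": 0.032, "width": 0.464, "height": 0.063, "page": 1},
                    "content": "",
                },
                {
                    "type": "Text",
                    "bbox": {"left": 0.206, "top": 0.139, "width": 0.593, "height": 0.027, "page": 1},
                    "content": "TRANSMISSION OF MATERIALS IN THIS RELEASE IS EMBARGOED ...",
                },
            ],
        },
    ]
}


def iter_blocks(parsed):
    """Yield (chunk_index, block) pairs for every block in a parsed result."""
    for chunk_index, chunk in enumerate(parsed.get("chunks", [])):
        for block in chunk.get("blocks", []):
            yield chunk_index, block


def blocks_of_type(parsed, block_type):
    """Return all blocks whose "type" field equals block_type."""
    return [block for _, block in iter_blocks(parsed) if block["type"] == block_type]


def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a bbox to (left, top, width, height) in pixels.

    Assumption: bbox values are fractions of the page size (0.0 to 1.0).
    """
    return (
        round(bbox["left"] * page_width),
        round(bbox["top"] * page_height),
        round(bbox["width"] * page_width),
        round(bbox["height"] * page_height),
    )


for chunk_index, block in iter_blocks(sample_result):
    page = block["bbox"]["page"]
    print(f"chunk {chunk_index}, page {page}, {block['type']}: {block['content'][:40]!r}")
```

With a real response, pass result["result"] in place of sample_result.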

Direct URL Usage

You can also pass a public URL directly to parse:

result = client.parse("https://assets.nomicatlas.com/department-of-labor-data.pdf")

Parse Options

You can customize the parsing behavior using ParseOptions:

from nomic import NomicClient
from nomic.client_models import ParseOptions, OcrLanguage, ContentExtractionMode

client = NomicClient()

# Configure parse options
options = ParseOptions(
    ocr_language=OcrLanguage.Chinese_Japanese_English,
    content_extraction_mode=ContentExtractionMode.Ocr,
)

# `file` is the uploaded file returned by client.upload_file(...) earlier
result = client.parse(file, options=options)

OCR Language

The ocr_language parameter controls which language model is used for optical character recognition:

  • OcrLanguage.English (default): Optimized for English text
  • OcrLanguage.Latin: For Latin-based languages (Spanish, French, German, etc.)
  • OcrLanguage.Chinese_Japanese_English: For documents containing Chinese, Japanese, or mixed CJK/English text

Example with Chinese/Japanese documents:

from nomic import NomicClient
from nomic.client_models import ParseOptions, OcrLanguage

client = NomicClient()
options = ParseOptions(ocr_language=OcrLanguage.Chinese_Japanese_English)
result = client.parse("chinese_document.pdf", options=options)

Content Extraction Mode

The content_extraction_mode parameter controls the overall extraction strategy:

  • ContentExtractionMode.Hybrid (default): Uses embedded text where available, OCR on images and bitmaps
  • ContentExtractionMode.Metadata: Only uses embedded document text, no OCR
  • ContentExtractionMode.Ocr: Runs OCR on all pages, even if text is embedded

Other Options

Additional parsing options include:

  • chunking: Configure how documents are split into chunks (chunk size, merge strategy, etc.)
  • table_summary: Enable table content summarization
  • figure_summary: Enable figure/image content summarization

See the Parse Options documentation for complete details on all available options.

Non-Blocking Usage

Pass block=False to request a non-blocking parse. The call then returns a task object immediately instead of waiting for the result:

task = client.parse(file, block=False)

The task object can be polled to get the result. For example:

# check without waiting
result = task.get(block=False)

# wait for 5 seconds
result = task.get(timeout=5)

# wait indefinitely
result = task.get()

Example of Non-Blocking Usage

Catch TaskPending to keep waiting after a timeout:

import time
from nomic import NomicClient
from nomic.client import TaskPending

client = NomicClient()
task = client.parse("https://assets.nomicatlas.com/department-of-labor-data.pdf", block=False)

start = time.time()
while True:
    try:
        result = task.get(timeout=5)
        break
    except TaskPending:
        print(f"Task still pending after {time.time() - start:.2f} seconds")

print(f"Done after {time.time() - start:.2f} seconds")

You may see output like this:

Task still pending after 5.02 seconds
Task still pending after 10.19 seconds
Task still pending after 15.38 seconds
...
Task still pending after 51.49 seconds
Done after 57.25 seconds
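
The loop above waits for as long as the parse takes. If you would rather give up after a fixed amount of time, one option is to wrap the polling in a helper with an overall deadline. The following is a sketch under stated assumptions, not part of the Nomic client: wait_with_deadline and FakeTask are hypothetical names, and TaskPending is redefined locally only so the example runs on its own. In real code, import TaskPending from nomic.client as shown earlier and pass the task returned by client.parse(..., block=False).

```python
import time


class TaskPending(Exception):
    """Local stand-in so this sketch runs without the nomic package."""


class FakeTask:
    """Hypothetical task that becomes ready after a fixed number of get() calls."""

    def __init__(self, ready_after, value):
        self._remaining = ready_after
        self._value = value

    def get(self, timeout=None):
        if self._remaining > 0:
            self._remaining -= 1
            # Simulate a short wait before reporting that the task is pending.
            time.sleep(min(timeout or 0.01, 0.01))
            raise TaskPending()
        return self._value


def wait_with_deadline(task, deadline_s, poll_timeout_s=5.0):
    """Poll task.get() until it returns, or raise TimeoutError after deadline_s seconds."""
    start = time.monotonic()
    while True:
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            raise TimeoutError(f"no result after {deadline_s} seconds")
        try:
            # Never wait longer than the time left before the overall deadline.
            return task.get(timeout=min(poll_timeout_s, remaining))
        except TaskPending:
            continue


# Usage with the stand-in task: ready on the fourth call to get().
result = wait_with_deadline(FakeTask(3, {"chunks": []}), deadline_s=2.0, poll_timeout_s=0.01)
print(result)
```

time.monotonic() is used for the deadline because it is unaffected by system clock adjustments.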