Parse
Parse uploaded files to transform PDFs and drawings into LLM-ready chunks and blocks.
What is File Parsing?
File parsing converts PDF documents into structured, LLM-readable blocks.
Using the Parse API
Download a test file:
import requests
resp = requests.get("https://assets.nomicatlas.com/department-of-labor-data.pdf")
resp.raise_for_status()  # stop early if the download failed
with open("department-of-labor-data.pdf", "wb") as f:
    f.write(resp.content)
Upload and parse it using the Nomic Platform:
from nomic import NomicClient
client = NomicClient()
# upload
file = client.upload_file("department-of-labor-data.pdf")
# parse
result = client.parse(file)
Print the result:
import json
import sys
print("Parsed document:")
json.dump(result["result"], sys.stdout, indent=4)
Example Result
See the response format here.
{
"chunks": [
{
"content": "Department of Labor seal with \"News Release\" text.\nTRANSMISSION OF MATERIALS IN THIS ...",
"embed": "Department of Labor seal with \"News Release\" text.\nTRANSMISSION OF MATERIALS IN THIS ...",
"blocks": [
{
"type": "Figure",
"bbox": {
"left": 0.0791983760260289,
"top": 0.03243826374863133,
"width": 0.464428297055313,
"height": 0.06306505203247072,
"page": 1
},
"content": ""
},
{
"type": "Text",
"bbox": {
"left": 0.20591503267973857,
"top": 0.13927020202020207,
"width": 0.5926797385620916,
"height": 0.026579545454545356,
"page": 1
},
"content": "TRANSMISSION OF MATERIALS IN THIS RELEASE IS EMBARGOED UNTIL 8:30 A.M. (Eastern) Thursday, January 30, 2025"
} ...
]
},
{
"content": "SEASONALLY ADJUSTED DATA\n## UNADJUSTED DATA\nThe advance number of actual initial claims under ...",
"embed": "SEASONALLY ADJUSTED DATA\n## UNADJUSTED DATA\nThe advance number of actual initial claims under ...",
"blocks": [
{
"type": "Section Header",
"bbox": {
"left": 0.10200000000000001,
"top": 0.07279166666666662,
"width": 0.17254248366013072,
"height": 0.011214646464646538,
"page": 2
},
"content": "## UNADJUSTED DATA"
} ...
]
} ...
]
}
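Once the result is a plain dictionary, it can be walked with ordinary Python. The sketch below is illustrative only: it uses a shortened sample shaped like the example above, and the helper names (iter_blocks, count_block_types, bbox_to_pixels) are hypothetical, not part of the Nomic client. It also assumes that bbox values are fractions of the page width and height, which the example values suggest but which this page does not state.

```python
from collections import Counter

# Shortened sample with the same shape as the example result above.
sample_result = {
    "chunks": [
        {
            "content": "Department of Labor seal ...",
            "embed": "Department of Labor seal ...",
            "blocks": [
                {
                    "type": "Figure",
                    "bbox": {"left": 0.079, "top": 0.032, "width": 0.464, "height": 0.063, "page": 1},
                    "content": "",
                },
                {
                    "type": "Text",
                    "bbox": {"left": 0.206, "top": 0.139, "width": 0.593, "height": 0.027, "page": 1},
                    "content": "TRANSMISSION OF MATERIALS IN THIS RELEASE IS EMBARGOED ...",
                },
            ],
        },
        {
            "content": "SEASONALLY ADJUSTED DATA ...",
            "embed": "SEASONALLY ADJUSTED DATA ...",
            "blocks": [
                {
                    "type": "Section Header",
                    "bbox": {"left": 0.102, "top": 0.073, "width": 0.173, "height": 0.011, "page": 2},
                    "content": "## UNADJUSTED DATA",
                },
            ],
        },
    ]
}


def iter_blocks(result):
    """Yield (chunk_index, block) for every block in every chunk."""
    for chunk_index, chunk in enumerate(result.get("chunks", [])):
        for block in chunk.get("blocks", []):
            yield chunk_index, block


def count_block_types(result):
    """Count how many blocks of each type the result contains."""
    return Counter(block["type"] for _, block in iter_blocks(result))


def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a bbox to (x0, y0, x1, y1) in pixels.

    Assumption: left/top/width/height are fractions of the page size.
    """
    x0 = round(bbox["left"] * page_width)
    y0 = round(bbox["top"] * page_height)
    x1 = round((bbox["left"] + bbox["width"]) * page_width)
    y1 = round((bbox["top"] + bbox["height"]) * page_height)
    return x0, y0, x1, y1


print(count_block_types(sample_result))
for chunk_index, block in iter_blocks(sample_result):
    print(chunk_index, block["type"], "page", block["bbox"]["page"])
```

With a real response, the same helpers would presumably take result["result"], the dictionary printed in the previous step.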
Direct URL Usage
You can also pass a public URL directly to parse:
result = client.parse("https://assets.nomicatlas.com/department-of-labor-data.pdf")
Parse Options
You can customize the parsing behavior using ParseOptions:
from nomic import NomicClient
from nomic.client_models import ParseOptions, OcrLanguage, ContentExtractionMode
client = NomicClient()
# Configure parse options
options = ParseOptions(
ocr_language=OcrLanguage.Chinese_Japanese_English,
content_extraction_mode=ContentExtractionMode.Ocr
)
result = client.parse(file, options=options)
OCR Language
The ocr_language parameter controls which language model is used for optical character recognition:
OcrLanguage.English (default): Optimized for English text
OcrLanguage.Latin: For Latin-based languages (Spanish, French, German, etc.)
OcrLanguage.Chinese_Japanese_English: For documents containing Chinese, Japanese, or mixed CJK/English text
Example with Chinese/Japanese documents:
from nomic import NomicClient
from nomic.client_models import ParseOptions, OcrLanguage
client = NomicClient()
options = ParseOptions(ocr_language=OcrLanguage.Chinese_Japanese_English)
result = client.parse("chinese_document.pdf", options=options)
Content Extraction Mode
The content_extraction_mode parameter controls the overall extraction strategy:
ContentExtractionMode.Hybrid (default): Uses embedded text where available and runs OCR on images and bitmaps
ContentExtractionMode.Metadata: Uses only embedded document text, with no OCR
ContentExtractionMode.Ocr: Runs OCR on all pages, even if text is embedded
Other Options
Additional parsing options include:
chunking: Configure how documents are split into chunks (chunk size, merge strategy, etc.)
table_summary: Enable table content summarization
figure_summary: Enable figure/image content summarization
See the Parse Options documentation for complete details on all available options.
Non-Blocking Usage
You can request a non-blocking parse, causing the function to return immediately:
task = client.parse(file, block=False)
The task object can be polled to get the result. For example:
# check without waiting
result = task.get(block=False)
# wait for 5 seconds
result = task.get(timeout=5)
# wait indefinitely
result = task.get()
Example of Non-Blocking Usage
Handle TaskPending in order to continue after a timeout:
import time
from nomic import NomicClient
from nomic.client import TaskPending
client = NomicClient()
task = client.parse("https://assets.nomicatlas.com/department-of-labor-data.pdf", block=False)
start = time.time()
while True:
    try:
        # wait up to 5 seconds for the result
        result = task.get(timeout=5)
        break
    except TaskPending:
        print(f"Task still pending after {time.time() - start:.2f} seconds")
print(f"Done after {time.time() - start:.2f} seconds")
You may see output like this:
Task still pending after 5.02 seconds
Task still pending after 10.19 seconds
Task still pending after 15.38 seconds
...
Task still pending after 51.49 seconds
Done after 57.25 seconds
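The loop above keeps polling until the task finishes. A variation is to cap the total wait and give up after a deadline. The sketch below shows one way to do that, under stated assumptions: TaskPending and FakeTask here are local stand-ins so the example runs without network access, and wait_with_deadline is a hypothetical helper, not part of the Nomic client. The only behavior assumed of a real task is what this page describes: task.get(timeout=...) returns the result or raises TaskPending.

```python
import time


class TaskPending(Exception):
    """Local stand-in for nomic.client.TaskPending so the sketch runs offline."""


class FakeTask:
    """Hypothetical task that becomes ready after a fixed number of polls."""

    def __init__(self, polls_needed, value):
        self._remaining = polls_needed
        self._value = value

    def get(self, timeout=None):
        # A real task would wait up to `timeout` seconds; the fake returns at once.
        if self._remaining > 0:
            self._remaining -= 1
            raise TaskPending()
        return self._value


def wait_with_deadline(task, poll_timeout=5.0, deadline=60.0):
    """Poll task.get() until it returns, or raise TimeoutError after `deadline` seconds."""
    start = time.monotonic()
    while True:
        try:
            return task.get(timeout=poll_timeout)
        except TaskPending:
            elapsed = time.monotonic() - start
            if elapsed >= deadline:
                raise TimeoutError(f"Task still pending after {elapsed:.2f} seconds")


# The fake task reports "pending" three times, then returns its value.
result = wait_with_deadline(FakeTask(3, {"chunks": []}), poll_timeout=0.0, deadline=10.0)
print(result)
```

time.monotonic() is used instead of time.time() so that system clock adjustments do not affect the measured wait.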