Parse
Parse uploaded files to extract structured content and make them ready for analysis, embedding generation, and data visualization.
What is File Parsing?
File parsing converts PDF documents into a structured, machine-readable format. This process extracts text, metadata, and structural information.
Using the Parse API
Download a test file:
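import requests
# Download the sample PDF and save it to the working directory
resp = requests.get("https://assets.nomicatlas.com/department-of-labor-data.pdf")
resp.raise_for_status()  # stop early if the download failed
with open("department-of-labor-data.pdf", "wb") as f:
    f.write(resp.content)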
Upload and parse it using the Nomic Platform:
from nomic import NomicClient
client = NomicClient()
# upload
file = client.upload_file("department-of-labor-data.pdf")
# parse
result = client.parse(file)
Print the result:
import json
import sys
print("Parsed document:")
json.dump(result["result"], sys.stdout, indent=4)
Example Result
See the response format here.
{
    "chunks": [
        {
            "content": "Department of Labor seal with \"News Release\" text.\nTRANSMISSION OF MATERIALS IN THIS ...",
            "embed": "Department of Labor seal with \"News Release\" text.\nTRANSMISSION OF MATERIALS IN THIS ...",
            "blocks": [
                {
                    "type": "Figure",
                    "bbox": {
                        "left": 0.0791983760260289,
                        "top": 0.03243826374863133,
                        "width": 0.464428297055313,
                        "height": 0.06306505203247072,
                        "page": 1
                    },
                    "content": ""
                },
                {
                    "type": "Text",
                    "bbox": {
                        "left": 0.20591503267973857,
                        "top": 0.13927020202020207,
                        "width": 0.5926797385620916,
                        "height": 0.026579545454545356,
                        "page": 1
                    },
                    "content": "TRANSMISSION OF MATERIALS IN THIS RELEASE IS EMBARGOED UNTIL 8:30 A.M. (Eastern) Thursday, January 30, 2025"
                } ...
            ]
        },
        {
            "content": "SEASONALLY ADJUSTED DATA\n## UNADJUSTED DATA\nThe advance number of actual initial claims under ...",
            "embed": "SEASONALLY ADJUSTED DATA\n## UNADJUSTED DATA\nThe advance number of actual initial claims under ...",
            "blocks": [
                {
                    "type": "Section Header",
                    "bbox": {
                        "left": 0.10200000000000001,
                        "top": 0.07279166666666662,
                        "width": 0.17254248366013072,
                        "height": 0.011214646464646538,
                        "page": 2
                    },
                    "content": "## UNADJUSTED DATA"
                } ...
            ]
        } ...
    ]
}
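To make the shape of this result concrete, here is a minimal sketch that walks the chunks and blocks, groups blocks by page, and scales a bounding box to pixel coordinates. It runs on a small hand-written sample that mirrors the structure above (values shortened), not on a real API response. The helper names `blocks_by_page` and `bbox_to_pixels` are hypothetical, and the scaling assumes that `left`, `top`, `width`, and `height` are fractions of the page size, which the example values suggest but this page does not state.

```python
from collections import defaultdict

# Hand-written sample mirroring the structure of the example result (values shortened).
sample_result = {
    "chunks": [
        {
            "content": "Department of Labor seal",
            "embed": "Department of Labor seal",
            "blocks": [
                {"type": "Figure", "content": "",
                 "bbox": {"left": 0.08, "top": 0.03, "width": 0.46, "height": 0.06, "page": 1}},
                {"type": "Text", "content": "TRANSMISSION OF MATERIALS",
                 "bbox": {"left": 0.21, "top": 0.14, "width": 0.59, "height": 0.03, "page": 1}},
            ],
        },
        {
            "content": "## UNADJUSTED DATA",
            "embed": "## UNADJUSTED DATA",
            "blocks": [
                {"type": "Section Header", "content": "## UNADJUSTED DATA",
                 "bbox": {"left": 0.10, "top": 0.07, "width": 0.17, "height": 0.01, "page": 2}},
            ],
        },
    ]
}


def blocks_by_page(result):
    """Group every block in every chunk by the page number in its bbox."""
    pages = defaultdict(list)
    for chunk in result.get("chunks", []):
        for block in chunk.get("blocks", []):
            pages[block["bbox"]["page"]].append(block)
    return dict(pages)


def bbox_to_pixels(bbox, page_width, page_height):
    """Return (left, top, right, bottom) in pixels.

    ASSUMPTION: bbox values are fractions of the page width and height.
    """
    left = round(bbox["left"] * page_width)
    top = round(bbox["top"] * page_height)
    right = round((bbox["left"] + bbox["width"]) * page_width)
    bottom = round((bbox["top"] + bbox["height"]) * page_height)
    return left, top, right, bottom


pages = blocks_by_page(sample_result)
for page, blocks in sorted(pages.items()):
    print(page, [block["type"] for block in blocks])
```

With a real response, the same helpers would be called on `result["result"]`, and the page dimensions passed to `bbox_to_pixels` would come from whatever image or renderer you use to display the page.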
Direct URL Usage
You can also pass a public URL directly to parse:
result = client.parse("https://assets.nomicatlas.com/department-of-labor-data.pdf")
Parse Options
You can customize the parsing behavior using ParseOptions:
from nomic import NomicClient
from nomic.client_models import ParseOptions, OcrLanguage, ContentExtractionMode
client = NomicClient()
# Configure parse options
options = ParseOptions(
    ocr_language=OcrLanguage.Chinese_Japanese_English,
    content_extraction_mode=ContentExtractionMode.Ocr,
)
# `file` is the object returned by client.upload_file(...) in the earlier example
result = client.parse(file, options=options)
OCR Language
The ocr_language parameter controls which language model is used for optical character recognition:
- OcrLanguage.English (default): Optimized for English text
- OcrLanguage.Latin: For Latin-based languages (Spanish, French, German, etc.)
- OcrLanguage.Chinese_Japanese_English: For documents containing Chinese, Japanese, or mixed CJK/English text
Example with Chinese/Japanese documents:
from nomic import NomicClient
from nomic.client_models import ParseOptions, OcrLanguage
client = NomicClient()
options = ParseOptions(ocr_language=OcrLanguage.Chinese_Japanese_English)
result = client.parse("chinese_document.pdf", options=options)
Content Extraction Mode
The content_extraction_mode parameter controls the overall extraction strategy:
- ContentExtractionMode.Hybrid (default): Uses embedded text where available and runs OCR on images and bitmaps
- ContentExtractionMode.Metadata: Uses only embedded document text, with no OCR
- ContentExtractionMode.Ocr: Runs OCR on all pages, even if text is embedded
Other Options
Additional parsing options include:
- chunking: Configure how documents are split into chunks (chunk size, merge strategy, etc.)
- table_summary: Enable table content summarization
- figure_summary: Enable figure/image content summarization
See the Parse Options documentation for complete details on all available options.
Non-Blocking Usage
You can request a non-blocking parse, causing the function to return immediately:
task = client.parse(file, block=False)
The task object can be polled to get the result. For example:
# check without waiting
result = task.get(block=False)
# wait for 5 seconds
result = task.get(timeout=5)
# wait indefinitely
result = task.get()
Example of Non-Blocking Usage
Catch TaskPending to keep waiting after each timeout expires:
import time
from nomic import NomicClient
from nomic.client import TaskPending
client = NomicClient()
task = client.parse("https://assets.nomicatlas.com/department-of-labor-data.pdf", block=False)
start = time.time()
while True:
    try:
        # wait up to 5 seconds for the result
        result = task.get(timeout=5)
        break
    except TaskPending:
        print(f"Task still pending after {time.time() - start:.2f} seconds")
print(f"Done after {time.time() - start:.2f} seconds")
You may see output like this:
Task still pending after 5.02 seconds
Task still pending after 10.19 seconds
Task still pending after 15.38 seconds
...
Task still pending after 51.49 seconds
Done after 57.25 seconds
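The loop above waits indefinitely. If you would rather give up after a fixed amount of time, one option is to wrap the polling in a helper with an overall deadline. The following is a minimal sketch and not part of the Nomic client: `wait_with_deadline` and `FakeTask` are hypothetical names, and the `TaskPending` class here is a local stand-in so the snippet runs on its own. It assumes only what the example above shows, namely that `task.get(timeout=...)` raises `TaskPending` while the parse is still running.

```python
import time


class TaskPending(Exception):
    """Local stand-in for the TaskPending exception imported from nomic.client above."""


def wait_with_deadline(task, poll_timeout=5.0, deadline=120.0):
    """Poll task.get() until it returns, or raise TimeoutError after `deadline` seconds."""
    start = time.monotonic()
    while True:
        remaining = deadline - (time.monotonic() - start)
        if remaining <= 0:
            raise TimeoutError(f"Task not finished after {deadline} seconds")
        try:
            # Never wait longer than the time left before the deadline
            return task.get(timeout=min(poll_timeout, remaining))
        except TaskPending:
            continue


class FakeTask:
    """Hypothetical task that stays pending for a fixed number of polls (for demonstration)."""

    def __init__(self, pending_polls, value):
        self.pending_polls = pending_polls
        self.value = value

    def get(self, timeout=None):
        if self.pending_polls > 0:
            self.pending_polls -= 1
            raise TaskPending()
        return self.value


result = wait_with_deadline(FakeTask(3, {"chunks": []}), poll_timeout=0.01, deadline=1.0)
print(result)
```

With the real client, you would drop the stand-in classes, import TaskPending as in the example above, and pass the object returned by client.parse(..., block=False). The helper uses time.monotonic rather than time.time so that system clock adjustments do not affect the deadline.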
Next Steps
After parsing files, you can:
- Generate embeddings for semantic search
- Create visualizations and maps in Atlas
You may also consider:
- Extracting specific data using a custom schema