Parse
Parse uploaded files to extract structured content and make them ready for analysis, embedding generation, and data visualization.
What is File Parsing?
File parsing converts PDF documents into a structured, machine-readable format. This process extracts text, metadata, and structural information.
Using the Parse API
Download a test file:
import requests
# Download the sample PDF; raise_for_status() fails early on an HTTP error
resp = requests.get("https://assets.nomicatlas.com/department-of-labor-data.pdf")
resp.raise_for_status()
with open("department-of-labor-data.pdf", "wb") as f:
    f.write(resp.content)
Upload and parse it using the Nomic Platform:
from nomic import NomicClient
client = NomicClient()
# upload
file = client.upload_file("department-of-labor-data.pdf")
# parse
result = client.parse(file)
Print the result:
import json
import sys
print("Parsed document:")
json.dump(result["result"], sys.stdout, indent=4)
Example Result
The output below is abridged ("..." marks omitted items); see the response format reference for the full schema.
{
    "chunks": [
        {
            "content": "Department of Labor seal with \"News Release\" text.\nTRANSMISSION OF MATERIALS IN THIS ...",
            "embed": "Department of Labor seal with \"News Release\" text.\nTRANSMISSION OF MATERIALS IN THIS ...",
            "blocks": [
                {
                    "type": "Figure",
                    "bbox": {
                        "left": 0.0791983760260289,
                        "top": 0.03243826374863133,
                        "width": 0.464428297055313,
                        "height": 0.06306505203247072,
                        "page": 1
                    },
                    "content": ""
                },
                {
                    "type": "Text",
                    "bbox": {
                        "left": 0.20591503267973857,
                        "top": 0.13927020202020207,
                        "width": 0.5926797385620916,
                        "height": 0.026579545454545356,
                        "page": 1
                    },
                    "content": "TRANSMISSION OF MATERIALS IN THIS RELEASE IS EMBARGOED UNTIL 8:30 A.M. (Eastern) Thursday, January 30, 2025"
                } ...
            ]
        },
        {
            "content": "SEASONALLY ADJUSTED DATA\n## UNADJUSTED DATA\nThe advance number of actual initial claims under ...",
            "embed": "SEASONALLY ADJUSTED DATA\n## UNADJUSTED DATA\nThe advance number of actual initial claims under ...",
            "blocks": [
                {
                    "type": "Section Header",
                    "bbox": {
                        "left": 0.10200000000000001,
                        "top": 0.07279166666666662,
                        "width": 0.17254248366013072,
                        "height": 0.011214646464646538,
                        "page": 2
                    },
                    "content": "## UNADJUSTED DATA"
                } ...
            ]
        } ...
    ]
}
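To work with a result shaped like the example, it can help to flatten the chunks into their individual blocks. The sketch below is illustrative only: it assumes `parsed` is the dictionary printed earlier (that is, `result["result"]`) with the `chunks`, `blocks`, `type`, `bbox`, and `content` keys shown in the example. The helper names `iter_blocks`, `text_on_page`, and `bbox_to_pixels` are hypothetical and not part of the Nomic client, and treating `bbox` values as fractions of the page size is an assumption based on the example values lying between 0 and 1.

```python
from collections import Counter

# Abbreviated stand-in with the same keys as the example result (values shortened)
parsed = {
    "chunks": [
        {
            "blocks": [
                {"type": "Figure",
                 "bbox": {"left": 0.08, "top": 0.03, "width": 0.46, "height": 0.06, "page": 1},
                 "content": ""},
                {"type": "Text",
                 "bbox": {"left": 0.21, "top": 0.14, "width": 0.59, "height": 0.03, "page": 1},
                 "content": "TRANSMISSION OF MATERIALS IN THIS RELEASE ..."},
            ],
        },
        {
            "blocks": [
                {"type": "Section Header",
                 "bbox": {"left": 0.10, "top": 0.07, "width": 0.17, "height": 0.01, "page": 2},
                 "content": "## UNADJUSTED DATA"},
            ],
        },
    ]
}


def iter_blocks(parsed):
    """Yield every block from every chunk, in document order."""
    for chunk in parsed.get("chunks", []):
        yield from chunk.get("blocks", [])


def text_on_page(parsed, page):
    """Collect the non-empty content of all blocks on one page."""
    return [
        block["content"]
        for block in iter_blocks(parsed)
        if block["bbox"]["page"] == page and block["content"]
    ]


def bbox_to_pixels(bbox, page_width, page_height):
    """Scale a bbox to pixels, ASSUMING its values are fractions of the page size."""
    return (
        round(bbox["left"] * page_width),
        round(bbox["top"] * page_height),
        round(bbox["width"] * page_width),
        round(bbox["height"] * page_height),
    )


# Count how many blocks of each type the parser produced
type_counts = Counter(block["type"] for block in iter_blocks(parsed))
print(dict(type_counts))
print(text_on_page(parsed, 1))
```

With a real response, replace the stand-in with `parsed = result["result"]`; the helpers only read keys that appear in the example.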
Direct URL Usage
You can also pass a public URL directly to parse:
result = client.parse("https://assets.nomicatlas.com/department-of-labor-data.pdf")
Non-Blocking Usage
Pass block=False to request a non-blocking parse. The call then returns a task object immediately instead of waiting for the result:
task = client.parse(file, block=False)
The task object can be polled to get the result. For example:
# check without waiting
result = task.get(block=False)
# wait for 5 seconds
result = task.get(timeout=5)
# wait indefinitely
result = task.get()
Example of Non-Blocking Usage
Catch TaskPending to keep polling when a timeout elapses before the result is ready:
import time
from nomic import NomicClient
from nomic.client import TaskPending
client = NomicClient()
task = client.parse("https://assets.nomicatlas.com/department-of-labor-data.pdf", block=False)
start = time.time()
while True:
    try:
        result = task.get(timeout=5)
        break
    except TaskPending:
        print(f"Task still pending after {time.time() - start:.2f} seconds")
print(f"Done after {time.time() - start:.2f} seconds")
You may see an output like this:
Task still pending after 5.02 seconds
Task still pending after 10.19 seconds
Task still pending after 15.38 seconds
...
Task still pending after 51.49 seconds
Done after 57.25 seconds
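The loop in the example polls forever if the task never completes. One possible way to bound the total wait is sketched below. The helper `wait_with_deadline` is hypothetical and not part of the Nomic client. It assumes only what the example shows: that the task has a `get(timeout=...)` method and that a "pending" exception is raised when the timeout elapses. It also assumes that `timeout` accepts fractional seconds. `FakeTask` and `FakePending` are stand-ins so the sketch runs without network access.

```python
import time


def wait_with_deadline(task, pending_exc, poll_seconds=5, max_seconds=120):
    """Poll task.get(timeout=...) until it returns or max_seconds elapse."""
    deadline = time.monotonic() + max_seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"no result after {max_seconds} seconds")
        try:
            # Never wait past the overall deadline on the final attempt
            return task.get(timeout=min(poll_seconds, remaining))
        except pending_exc:
            continue


class FakePending(Exception):
    """Stand-in for the pending exception so this sketch runs offline."""


class FakeTask:
    """Stand-in task that reports 'pending' a fixed number of times."""

    def __init__(self, pending_polls, value):
        self.pending_polls = pending_polls
        self.value = value

    def get(self, timeout=None):
        if self.pending_polls > 0:
            self.pending_polls -= 1
            raise FakePending()
        return self.value


demo = wait_with_deadline(FakeTask(2, {"chunks": []}), FakePending, poll_seconds=0.01, max_seconds=1)
print(demo)
```

With the real client, the call would look like `wait_with_deadline(task, TaskPending, poll_seconds=5, max_seconds=120)`, using the `task` and `TaskPending` from the example.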
Next Steps
After parsing files, you can:
- Generate embeddings for semantic search
- Create visualizations and maps in Atlas
You may also consider:
- Extracting specific data using a custom schema