
Parse

Parse uploaded files to extract structured content and make them ready for analysis, embedding generation, and data visualization.

What is File Parsing?

File parsing converts PDF documents into a structured, machine-readable format. This process extracts text, metadata, and structural information.

Using the Parse API

Download a test file:

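import requests

# fetch the sample PDF and fail early on an HTTP error
resp = requests.get("https://assets.nomicatlas.com/department-of-labor-data.pdf")
resp.raise_for_status()

# write the raw bytes to a local file
with open("department-of-labor-data.pdf", "wb") as f:
    f.write(resp.content)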

Upload and parse it using the Nomic Platform:

from nomic import NomicClient

client = NomicClient()

# upload
file = client.upload_file("department-of-labor-data.pdf")

# parse
result = client.parse(file)

Print the result:

import json
import sys
print("Parsed document:")
json.dump(result["result"], sys.stdout, indent=4)

Example Result

See the response format here.

{
    "chunks": [
        {
            "content": "Department of Labor seal with \"News Release\" text.\nTRANSMISSION OF MATERIALS IN THIS ...",
            "embed": "Department of Labor seal with \"News Release\" text.\nTRANSMISSION OF MATERIALS IN THIS ...",
            "blocks": [
                {
                    "type": "Figure",
                    "bbox": {
                        "left": 0.0791983760260289,
                        "top": 0.03243826374863133,
                        "width": 0.464428297055313,
                        "height": 0.06306505203247072,
                        "page": 1
                    },
                    "content": ""
                },
                {
                    "type": "Text",
                    "bbox": {
                        "left": 0.20591503267973857,
                        "top": 0.13927020202020207,
                        "width": 0.5926797385620916,
                        "height": 0.026579545454545356,
                        "page": 1
                    },
                    "content": "TRANSMISSION OF MATERIALS IN THIS RELEASE IS EMBARGOED UNTIL 8:30 A.M. (Eastern) Thursday, January 30, 2025"
                } ...
            ]
        },
        {
            "content": "SEASONALLY ADJUSTED DATA\n## UNADJUSTED DATA\nThe advance number of actual initial claims under ...",
            "embed": "SEASONALLY ADJUSTED DATA\n## UNADJUSTED DATA\nThe advance number of actual initial claims under ...",
            "blocks": [
                {
                    "type": "Section Header",
                    "bbox": {
                        "left": 0.10200000000000001,
                        "top": 0.07279166666666662,
                        "width": 0.17254248366013072,
                        "height": 0.011214646464646538,
                        "page": 2
                    },
                    "content": "## UNADJUSTED DATA"
                } ...
            ]
        } ...
    ]
}
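
As a rough illustration of how the example result above might be consumed, the sketch below walks the chunks and their blocks, counts block types, and collects the non-empty blocks with their page numbers. It assumes only the keys visible in the example ("chunks", "blocks", "type", "bbox", "page", "content"). The sample dictionary is a shortened, hand-written stand-in for a real response, and summarize_blocks is a hypothetical helper, not part of the Nomic client.

```python
from collections import Counter

# Shortened stand-in for a parse result; values are abbreviated from the
# example above and are for illustration only.
sample = {
    "chunks": [
        {
            "content": "Department of Labor seal ...",
            "embed": "Department of Labor seal ...",
            "blocks": [
                {
                    "type": "Figure",
                    "bbox": {"left": 0.08, "top": 0.03, "width": 0.46, "height": 0.06, "page": 1},
                    "content": "",
                },
                {
                    "type": "Text",
                    "bbox": {"left": 0.21, "top": 0.14, "width": 0.59, "height": 0.03, "page": 1},
                    "content": "TRANSMISSION OF MATERIALS IN THIS RELEASE IS EMBARGOED ...",
                },
            ],
        },
        {
            "content": "SEASONALLY ADJUSTED DATA ...",
            "embed": "SEASONALLY ADJUSTED DATA ...",
            "blocks": [
                {
                    "type": "Section Header",
                    "bbox": {"left": 0.10, "top": 0.07, "width": 0.17, "height": 0.01, "page": 2},
                    "content": "## UNADJUSTED DATA",
                },
            ],
        },
    ]
}


def summarize_blocks(parsed):
    """Count block types and list (page, type, content) for non-empty blocks.

    Hypothetical helper: it relies only on the keys shown in the example result.
    """
    counts = Counter()
    rows = []
    for chunk in parsed.get("chunks", []):
        for block in chunk.get("blocks", []):
            counts[block["type"]] += 1
            if block.get("content"):  # skip blocks with empty content, e.g. the figure
                rows.append((block["bbox"]["page"], block["type"], block["content"]))
    return counts, rows


counts, rows = summarize_blocks(sample)
for page, block_type, content in rows:
    print(f"page {page} [{block_type}]: {content[:40]}")
```

With a real response, the same function would presumably be called as summarize_blocks(result["result"]), matching the way the result is printed earlier on this page.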

Direct URL Usage

You can also pass a public URL directly to parse:

result = client.parse("https://assets.nomicatlas.com/department-of-labor-data.pdf")

Non-Blocking Usage

Pass block=False to request a non-blocking parse. The call then returns a task object immediately instead of waiting for the result:

task = client.parse(file, block=False)

The task object can be polled to get the result. For example:

# check without waiting
result = task.get(block=False)

# wait for 5 seconds
result = task.get(timeout=5)

# wait indefinitely
result = task.get()

Example of Non-Blocking Usage

Catch TaskPending to keep polling after each timeout:

import time
from nomic import NomicClient
from nomic.client import TaskPending

client = NomicClient()
task = client.parse("https://assets.nomicatlas.com/department-of-labor-data.pdf", block=False)

start = time.time()
while True:
    try:
        # wait up to 5 seconds for the parse to finish
        result = task.get(timeout=5)
        break
    except TaskPending:
        # not done yet: report progress and poll again
        print(f"Task still pending after {time.time() - start:.2f} seconds")

print(f"Done after {time.time() - start:.2f} seconds")

You may see output like this:

Task still pending after 5.02 seconds
Task still pending after 10.19 seconds
Task still pending after 15.38 seconds
...
Task still pending after 51.49 seconds
Done after 57.25 seconds
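
The loop above polls forever if the task never finishes. One possible way to bound the total wait is sketched below. It relies only on the behavior shown on this page: task.get(timeout=...) raises TaskPending while the task is unfinished. So that the sketch runs without the SDK, TaskPending here is a local stand-in class and FakeTask is a hypothetical substitute for a real task. The helper wait_with_deadline and its parameter names are likewise assumptions, not part of the Nomic client.

```python
import time


class TaskPending(Exception):
    """Local stand-in for nomic.client.TaskPending (assumption for this sketch)."""


class FakeTask:
    """Hypothetical task that reports pending for a fixed number of polls."""

    def __init__(self, polls_needed):
        self.polls_left = polls_needed

    def get(self, timeout=None):
        if self.polls_left > 0:
            self.polls_left -= 1
            raise TaskPending()
        return {"result": {"chunks": []}}


def wait_with_deadline(task, deadline_s, poll_s, pending_exc=TaskPending):
    """Poll task.get(timeout=...) until it returns or deadline_s seconds pass."""
    start = time.monotonic()
    while True:
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            raise TimeoutError(f"task not finished after {deadline_s} seconds")
        try:
            # never wait past the overall deadline on a single poll
            return task.get(timeout=min(poll_s, remaining))
        except pending_exc:
            continue  # still pending: loop and poll again


done = wait_with_deadline(FakeTask(polls_needed=3), deadline_s=5, poll_s=0.01)
print("finished with keys:", list(done))
```

With the real client, the idea would be to pass the task returned by client.parse(..., block=False) and pending_exc=TaskPending imported from nomic.client, then handle TimeoutError wherever giving up is acceptable.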

Next Steps

After parsing files, you can:

  • Generate embeddings for semantic search
  • Create visualizations and maps in Atlas

You may also consider: