Parse

Parse uploaded files to transform PDFs and drawings into LLM-ready chunks and blocks.

What is File Parsing?

File parsing converts PDF documents into structured, LLM-readable blocks.

Using the Parse API

Download a test file:

import requests

resp = requests.get("https://assets.nomicatlas.com/department-of-labor-data.pdf")
with open("department-of-labor-data.pdf", "wb") as f:
    f.write(resp.content)

Upload and parse it using the Nomic Platform:

from nomic import NomicClient

client = NomicClient()

# upload
file = client.upload_file("department-of-labor-data.pdf")

# parse
result = client.parse(file)

Print the result:

import json
import sys
print("Parsed document:")
json.dump(result["result"], sys.stdout, indent=4)
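Beyond printing, the chunk text is often what gets passed to a downstream embedding or indexing step. The sketch below assumes the parsed payload has the shape shown under Example Result (a "chunks" list whose items carry "embed" and "content" strings). The helper name chunks_to_jsonl and the output record layout are illustrative, not part of the Nomic client, and the sample data is a stand-in so the snippet runs without a parse call.

```python
import io
import json


def chunks_to_jsonl(parsed, out):
    """Write one JSON line per non-empty chunk and return how many were written.

    `parsed` is assumed to look like result["result"]: a dict with a "chunks" list.
    Falls back to "content" when a chunk has no "embed" text.
    """
    count = 0
    for index, chunk in enumerate(parsed.get("chunks", [])):
        text = chunk.get("embed") or chunk.get("content", "")
        if not text:
            continue  # skip chunks with no usable text
        out.write(json.dumps({"chunk": index, "text": text}) + "\n")
        count += 1
    return count


# Stand-in data; in practice pass result["result"] from client.parse(...)
parsed = {"chunks": [{"content": "a", "embed": "a"}, {"content": "b"}, {"content": ""}]}
buffer = io.StringIO()
written = chunks_to_jsonl(parsed, buffer)
```

Any writable text stream works in place of the StringIO buffer, for example a file opened with open("chunks.jsonl", "w").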

Example Result

See the response format here.

{
    "chunks": [
        {
            "content": "Department of Labor seal with \"News Release\" text.\nTRANSMISSION OF MATERIALS IN THIS ...",
            "embed": "Department of Labor seal with \"News Release\" text.\nTRANSMISSION OF MATERIALS IN THIS ...",
            "blocks": [
                {
                    "type": "Figure",
                    "bbox": {
                        "left": 0.0791983760260289,
                        "top": 0.03243826374863133,
                        "width": 0.464428297055313,
                        "height": 0.06306505203247072,
                        "page": 1
                    },
                    "content": ""
                },
                {
                    "type": "Text",
                    "bbox": {
                        "left": 0.20591503267973857,
                        "top": 0.13927020202020207,
                        "width": 0.5926797385620916,
                        "height": 0.026579545454545356,
                        "page": 1
                    },
                    "content": "TRANSMISSION OF MATERIALS IN THIS RELEASE IS EMBARGOED UNTIL 8:30 A.M. (Eastern) Thursday, January 30, 2025"
                } ...
            ]
        },
        {
            "content": "SEASONALLY ADJUSTED DATA\n## UNADJUSTED DATA\nThe advance number of actual initial claims under ...",
            "embed": "SEASONALLY ADJUSTED DATA\n## UNADJUSTED DATA\nThe advance number of actual initial claims under ...",
            "blocks": [
                {
                    "type": "Section Header",
                    "bbox": {
                        "left": 0.10200000000000001,
                        "top": 0.07279166666666662,
                        "width": 0.17254248366013072,
                        "height": 0.011214646464646538,
                        "page": 2
                    },
                    "content": "## UNADJUSTED DATA"
                } ...
            ]
        } ...
    ]
}
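
The nested chunk and block structure above can be walked with ordinary dictionary access. The following is a minimal sketch, not part of the Nomic client: blocks_by_page and bbox_to_pixels are hypothetical helper names, the sample dict is a trimmed copy of the example, and the pixel conversion assumes that bbox values are fractions of the page width and height (the example values all fall between 0 and 1, but the source does not state this).

```python
from collections import defaultdict

# Trimmed stand-in shaped like the example result (values shortened).
sample = {
    "chunks": [
        {
            "blocks": [
                {"type": "Figure", "content": "",
                 "bbox": {"left": 0.079, "top": 0.032, "width": 0.464, "height": 0.063, "page": 1}},
                {"type": "Text", "content": "TRANSMISSION OF MATERIALS IN THIS RELEASE ...",
                 "bbox": {"left": 0.206, "top": 0.139, "width": 0.593, "height": 0.027, "page": 1}},
            ]
        },
        {
            "blocks": [
                {"type": "Section Header", "content": "## UNADJUSTED DATA",
                 "bbox": {"left": 0.102, "top": 0.073, "width": 0.173, "height": 0.011, "page": 2}},
            ]
        },
    ]
}


def blocks_by_page(parsed):
    """Group every block from every chunk under its bbox page number."""
    pages = defaultdict(list)
    for chunk in parsed.get("chunks", []):
        for block in chunk.get("blocks", []):
            pages[block["bbox"]["page"]].append(block)
    return dict(pages)


def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a bbox to (left, top, right, bottom) in pixels.

    Assumption: bbox values are fractions of the page dimensions.
    """
    left = round(bbox["left"] * page_width)
    top = round(bbox["top"] * page_height)
    right = round((bbox["left"] + bbox["width"]) * page_width)
    bottom = round((bbox["top"] + bbox["height"]) * page_height)
    return left, top, right, bottom


pages = blocks_by_page(sample)
for page, blocks in sorted(pages.items()):
    print(page, [block["type"] for block in blocks])
```

Grouping by page is convenient when overlaying blocks on a rendered page image; if the coordinate assumption does not hold for your documents, only bbox_to_pixels needs to change.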

Direct URL Usage

You can also pass a public URL directly to parse:

result = client.parse("https://assets.nomicatlas.com/department-of-labor-data.pdf")

Parse Options

You can customize the parsing behavior using ParseOptions:

from nomic import NomicClient
from nomic.client_models import ParseOptions, OcrLanguage, ContentExtractionMode

client = NomicClient()

# Configure parse options
options = ParseOptions(
    ocr_language=OcrLanguage.Chinese_Japanese_English,
    content_extraction_mode=ContentExtractionMode.Ocr
)

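# Upload the file first so there is a file handle to parse
file = client.upload_file("department-of-labor-data.pdf")
result = client.parse(file, options=options)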

OCR Language

The ocr_language parameter controls which language model is used for optical character recognition:

OcrLanguage.English (default): Optimized for English text
OcrLanguage.Latin: For Latin-based languages (Spanish, French, German, etc.)
OcrLanguage.Chinese_Japanese_English: For documents containing Chinese, Japanese, or mixed CJK/English text

Example with Chinese/Japanese documents:

from nomic import NomicClient
from nomic.client_models import ParseOptions, OcrLanguage

client = NomicClient()
options = ParseOptions(ocr_language=OcrLanguage.Chinese_Japanese_English)
result = client.parse("chinese_document.pdf", options=options)

Content Extraction Mode

The content_extraction_mode parameter controls the overall extraction strategy:

ContentExtractionMode.Hybrid (default): Uses embedded text where available, OCR on images and bitmaps
ContentExtractionMode.Metadata: Only uses embedded document text, no OCR
ContentExtractionMode.Ocr: Runs OCR on all pages, even if text is embedded

Other Options

Additional parsing options include:

chunking: Configure how documents are split into chunks (chunk size, merge strategy, etc.)
table_summary: Enable table content summarization
figure_summary: Enable figure/image content summarization

See the Parse Options documentation for complete details on all available options.

Non-Blocking Usage

You can request a non-blocking parse, causing the function to return immediately:

task = client.parse(file, block=False)

The task object can be polled to get the result. For example:

# check without waiting
result = task.get(block=False)

# wait for 5 seconds
result = task.get(timeout=5)

# wait indefinitely
result = task.get()

Example of Non-Blocking Usage

Catch TaskPending to keep waiting after each timeout:

import time
from nomic import NomicClient
from nomic.client import TaskPending

client = NomicClient()
task = client.parse("https://assets.nomicatlas.com/department-of-labor-data.pdf", block=False)

start = time.time()
while True:
    try:
        result = task.get(timeout=5)
        break
    except TaskPending:
        print(f"Task still pending after {time.time() - start:.2f} seconds")

print(f"Done after {time.time() - start:.2f} seconds")

You may see output like this:

Task still pending after 5.02 seconds
Task still pending after 10.19 seconds
Task still pending after 15.38 seconds
...
Task still pending after 51.49 seconds
Done after 57.25 seconds
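
The loop above retries forever. A common variation is to cap the total wait. The sketch below shows one way to do that under stated assumptions: wait_with_deadline is a hypothetical helper, and PendingError and FakeTask are local stand-ins so the snippet runs without the Nomic client. With the real client you would pass the task returned by client.parse(..., block=False) and the TaskPending exception class instead.

```python
import time


class PendingError(Exception):
    """Stand-in for the client's TaskPending exception."""


class FakeTask:
    """Stand-in task whose get() succeeds on the `ready_after`-th call."""

    def __init__(self, ready_after, value):
        self.calls = 0
        self.ready_after = ready_after
        self.value = value

    def get(self, timeout=None):
        self.calls += 1
        if self.calls < self.ready_after:
            raise PendingError()
        return self.value


def wait_with_deadline(task, pending_exc, poll_seconds=5.0, max_seconds=300.0):
    """Poll task.get() until it returns or max_seconds has elapsed.

    Raises TimeoutError if the overall deadline passes first.
    """
    deadline = time.monotonic() + max_seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"task not finished after {max_seconds} seconds")
        try:
            # Never wait past the overall deadline on a single poll.
            return task.get(timeout=min(poll_seconds, remaining))
        except pending_exc:
            continue  # still pending, poll again


task = FakeTask(ready_after=3, value={"chunks": []})
result = wait_with_deadline(task, PendingError, poll_seconds=0.01, max_seconds=1.0)
```

Using time.monotonic() rather than time.time() keeps the deadline unaffected by system clock adjustments.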
