Parse Options

Customize document parsing behavior using the ParseOptions class and its related configuration objects.

ParseOptions

The main options object for configuring document parsing.

Parameters

`content_extraction_mode`

Type: ContentExtractionMode (default: ContentExtractionMode.Hybrid)

The overall strategy for extracting content from the document.

ContentExtractionMode.Hybrid (default): Uses embedded document text where available, and runs OCR on images and bitmaps found in the document. Best balance of speed and accuracy for most documents.
ContentExtractionMode.Metadata: Only uses embedded document text. Disables all OCR. Fastest option, but may miss content in scanned documents or images.
ContentExtractionMode.Ocr: Runs OCR on all pages, even if text is embedded. Slowest option, but provides the most consistent results across different document types.

Example:

from nomic.client_models import ParseOptions, ContentExtractionMode

options = ParseOptions(
    content_extraction_mode=ContentExtractionMode.Ocr
)

`ocr_language`

Type: OcrLanguage (default: OcrLanguage.English)

Language selection for OCR. Choosing the correct language model significantly improves accuracy for non-English documents.

OcrLanguage.English (default): Optimized for English text
OcrLanguage.Latin: For languages written in the Latin script, including Spanish, French, German, Italian, Portuguese, and other Romance and Germanic languages
OcrLanguage.Chinese_Japanese_English: For documents containing Chinese, Japanese, or mixed CJK/English text

Example:

from nomic.client_models import ParseOptions, OcrLanguage

# For a Spanish document
options = ParseOptions(ocr_language=OcrLanguage.Latin)

# For a mixed Chinese/English document
options = ParseOptions(ocr_language=OcrLanguage.Chinese_Japanese_English)

`table_summary`

Type: TableSummaryOptions | None (default: None)

Options for generating table summaries. When None, the defaults documented under TableSummaryOptions apply.

See TableSummaryOptions below for details.

`figure_summary`

Type: FigureSummaryOptions | None (default: None)

Options for generating figure summaries. When None, the defaults documented under FigureSummaryOptions apply.

See FigureSummaryOptions below for details.

TableSummaryOptions

Options for generating summaries of table content.

Parameters

`enabled`

Type: bool (default: False)

Whether to generate a natural language summary of table content. When enabled, tables will have an additional text description that can improve retrieval and understanding.

Example:

from nomic.client_models import ParseOptions, TableSummaryOptions

options = ParseOptions(
    table_summary=TableSummaryOptions(enabled=True)
)

FigureSummaryOptions

Options for generating summaries of figures and images.

Parameters

`enabled`

Type: bool (default: True)

Whether to generate a natural language description of figure content. When enabled, each figure or image gets a text description of its visual content.

Example:

from nomic.client_models import ParseOptions, FigureSummaryOptions

# Disable figure summarization for faster processing
options = ParseOptions(
    figure_summary=FigureSummaryOptions(enabled=False)
)

Complete Example

The following example combines all of the options:

from nomic import NomicClient
from nomic.client_models import (
    ParseOptions,
    ContentExtractionMode,
    OcrLanguage,
    TableSummaryOptions,
    FigureSummaryOptions,
)

client = NomicClient()

# Configure comprehensive parse options
options = ParseOptions(
    # Use OCR on all pages for maximum consistency
    content_extraction_mode=ContentExtractionMode.Ocr,

    # Use Latin language model for Spanish document
    ocr_language=OcrLanguage.Latin,

    # Enable table summaries for better retrieval
    table_summary=TableSummaryOptions(enabled=True),

    # Keep figure summaries enabled (default)
    figure_summary=FigureSummaryOptions(enabled=True)
)

# Parse with custom options
file = client.upload_file("document.pdf")
result = client.parse(file, options=options)

Best Practices

Choosing Content Extraction Mode

Use Hybrid (default) for most documents - it provides the best balance of speed and accuracy
Use Metadata when you know the document has good embedded text and you need maximum speed
Use Ocr for scanned documents, documents with poor embedded text, or when you need consistent results across different document types
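
The guidance above can be written down as a small decision helper. This is a minimal sketch under stated assumptions: the function name `choose_extraction_mode` and its three boolean inputs are hypothetical and not part of the Nomic SDK, and the helper returns the member name as a string so that it runs without the SDK installed.

```python
def choose_extraction_mode(
    is_scanned: bool = False,
    has_reliable_embedded_text: bool = True,
    needs_max_speed: bool = False,
) -> str:
    """Return the name of a ContentExtractionMode member (hypothetical helper)."""
    # Scanned pages or poor embedded text: run OCR on every page.
    if is_scanned or not has_reliable_embedded_text:
        return "Ocr"
    # Good embedded text and speed matters most: skip OCR entirely.
    if needs_max_speed:
        return "Metadata"
    # Otherwise keep the default mode.
    return "Hybrid"
```

With the SDK available, the returned name could be resolved with `getattr(ContentExtractionMode, choose_extraction_mode(is_scanned=True))` and passed as `content_extraction_mode`.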

Selecting OCR Language

Always set ocr_language to match your document's primary language:

Documents in English → OcrLanguage.English
Documents in Spanish, French, German, etc. → OcrLanguage.Latin
Documents in Chinese, Japanese, or mixed CJK/English → OcrLanguage.Chinese_Japanese_English
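
When the document language is already known as a language code, the mapping above can be kept in a lookup table. This is a sketch only: the table `OCR_LANGUAGE_BY_CODE`, the function `ocr_language_name`, and the use of ISO 639-1 codes are assumptions made for illustration, and only languages named in this page are included.

```python
# Hypothetical lookup: ISO 639-1 code -> OcrLanguage member name.
OCR_LANGUAGE_BY_CODE = {
    "en": "English",
    "es": "Latin",
    "fr": "Latin",
    "de": "Latin",
    "it": "Latin",
    "pt": "Latin",
    "zh": "Chinese_Japanese_English",
    "ja": "Chinese_Japanese_English",
}


def ocr_language_name(language_code: str) -> str:
    """Return the OcrLanguage member name for a language code."""
    code = language_code.strip().lower()
    try:
        return OCR_LANGUAGE_BY_CODE[code]
    except KeyError:
        # Fail loudly rather than guess a model for an unlisted language.
        raise ValueError(f"No OCR language listed for {language_code!r}") from None
```

The returned name could then be resolved with `getattr(OcrLanguage, ocr_language_name("es"))`. Raising on unknown codes is a design choice here, since this page does not say which model suits languages outside the three groups.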

Using Summaries

Table summaries: Enable when tables contain important information that should be retrievable through semantic search
Figure summaries: Usually beneficial to keep enabled unless you're processing documents with many irrelevant images
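
The two recommendations above reduce to a pair of flags. The helper below is a hypothetical sketch (the name `summary_flags`, its inputs, and the dictionary keys are not part of the SDK) that turns two facts about a document set into `enabled` values.

```python
def summary_flags(
    tables_matter_for_search: bool,
    images_mostly_irrelevant: bool,
) -> dict:
    """Return `enabled` values for table and figure summaries (hypothetical helper)."""
    return {
        # Summarize tables only when their content should be retrievable.
        "table_summary_enabled": tables_matter_for_search,
        # Keep figure summaries on unless most images are irrelevant.
        "figure_summary_enabled": not images_mostly_irrelevant,
    }
```

Each value could then be passed on, for example `TableSummaryOptions(enabled=flags["table_summary_enabled"])` and `FigureSummaryOptions(enabled=flags["figure_summary_enabled"])`.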
