Parse Options

Customize document parsing behavior using the ParseOptions class and its related configuration objects.

ParseOptions

The main options object for configuring document parsing.

Parameters

content_extraction_mode

Type: ContentExtractionMode (default: ContentExtractionMode.Hybrid)

The overall strategy for extracting content from the document.

  • ContentExtractionMode.Hybrid (default): Uses embedded document text where available, and runs OCR on images and bitmaps found in the document. Best balance of speed and accuracy for most documents.
  • ContentExtractionMode.Metadata: Only uses embedded document text. Disables all OCR. Fastest option, but may miss content in scanned documents or images.
  • ContentExtractionMode.Ocr: Runs OCR on all pages, even if text is embedded. Slowest option, but provides the most consistent results across different document types.

Example:

from nomic.client_models import ParseOptions, ContentExtractionMode

options = ParseOptions(
    content_extraction_mode=ContentExtractionMode.Ocr
)

ocr_language

Type: OcrLanguage (default: OcrLanguage.English)

Language selection for OCR. Choosing the correct language model significantly improves accuracy for non-English documents.

  • OcrLanguage.English (default): Optimized for English text
  • OcrLanguage.Latin: For languages written in the Latin script, including Spanish, French, German, Italian, Portuguese, and other Romance and Germanic languages
  • OcrLanguage.Chinese_Japanese_English: For documents containing Chinese, Japanese, or mixed CJK/English text

Example:

from nomic.client_models import ParseOptions, OcrLanguage

# For a Spanish document
options = ParseOptions(ocr_language=OcrLanguage.Latin)

# For a mixed Chinese/English document
options = ParseOptions(ocr_language=OcrLanguage.Chinese_Japanese_English)

table_summary

Type: TableSummaryOptions | None (default: None)

Options for generating table summaries. When None, default behavior is used.

See TableSummaryOptions below for details.

figure_summary

Type: FigureSummaryOptions | None (default: None)

Options for generating figure summaries. When None, default behavior is used.

See FigureSummaryOptions below for details.


TableSummaryOptions

Options for generating summaries of table content.

Parameters

enabled

Type: bool (default: False)

Whether to generate a natural language summary of table content. When enabled, tables will have an additional text description that can improve retrieval and understanding.

Example:

from nomic.client_models import ParseOptions, TableSummaryOptions

options = ParseOptions(
    table_summary=TableSummaryOptions(enabled=True)
)

FigureSummaryOptions

Options for generating summaries of figures and images.

Parameters

enabled

Type: bool (default: True)

Whether to generate a natural language description of figure content. When enabled, figures and images get a text description of their visual content.

Example:

from nomic.client_models import ParseOptions, FigureSummaryOptions

# Disable figure summarization for faster processing
options = ParseOptions(
    figure_summary=FigureSummaryOptions(enabled=False)
)

Complete Example

Here's a comprehensive example showing how to use all the options together:

from nomic import NomicClient
from nomic.client_models import (
    ParseOptions,
    ContentExtractionMode,
    OcrLanguage,
    TableSummaryOptions,
    FigureSummaryOptions,
)

client = NomicClient()

# Configure comprehensive parse options
options = ParseOptions(
    # Use OCR on all pages for maximum consistency
    content_extraction_mode=ContentExtractionMode.Ocr,

    # Use the Latin language model for a Spanish document
    ocr_language=OcrLanguage.Latin,

    # Enable table summaries for better retrieval
    table_summary=TableSummaryOptions(enabled=True),

    # Keep figure summaries enabled (default)
    figure_summary=FigureSummaryOptions(enabled=True),
)

# Parse with custom options
file = client.upload_file("document.pdf")
result = client.parse(file, options=options)

Best Practices

Choosing Content Extraction Mode

  • Use Hybrid (default) for most documents: it offers the best balance of speed and accuracy
  • Use Metadata when you know the document has good embedded text and you need maximum speed
  • Use Ocr for scanned documents, documents with poor embedded text, or when you need consistent results across different document types
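
The guidance above can be sketched as a small decision helper. This is a hypothetical illustration, not part of the nomic client: the function name choose_extraction_mode and its parameters are assumptions, and it returns the mode's member name as a plain string so it runs without the library installed.

```python
def choose_extraction_mode(
    scanned: bool = False,
    embedded_text: str = "unknown",  # hypothetical labels: "good", "poor", "unknown"
    need_max_speed: bool = False,
    need_consistency: bool = False,
) -> str:
    """Return the name of a ContentExtractionMode member (hypothetical helper)."""
    # Scanned pages, poor embedded text, or a need for uniform output -> OCR everywhere.
    if scanned or embedded_text == "poor" or need_consistency:
        return "Ocr"
    # Known-good embedded text plus a speed requirement -> skip OCR entirely.
    if embedded_text == "good" and need_max_speed:
        return "Metadata"
    # Everything else falls back to the documented default.
    return "Hybrid"


print(choose_extraction_mode(scanned=True))
print(choose_extraction_mode(embedded_text="good", need_max_speed=True))
print(choose_extraction_mode())
```

Assuming ContentExtractionMode is a standard Python Enum with the member names listed on this page, the returned name could be resolved with getattr(ContentExtractionMode, name) before passing it to ParseOptions.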

Selecting OCR Language

Always set ocr_language to match your document's primary language:

  • Documents in English → OcrLanguage.English
  • Documents in Spanish, French, German, etc. → OcrLanguage.Latin
  • Documents in Chinese, Japanese, or mixed CJK/English → OcrLanguage.Chinese_Japanese_English
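
When the document language is already known as an ISO 639-1 code, the mapping above can be kept in a lookup table. The sketch below is hypothetical: the names OCR_LANGUAGE_BY_CODE and ocr_language_name are not part of the nomic client, only the languages named on this page are covered, and raising an error for unlisted languages is an assumption made here rather than documented behavior.

```python
# Hypothetical lookup: ISO 639-1 code -> OcrLanguage member name from this page.
OCR_LANGUAGE_BY_CODE = {
    "en": "English",
    "es": "Latin",
    "fr": "Latin",
    "de": "Latin",
    "it": "Latin",
    "pt": "Latin",
    "zh": "Chinese_Japanese_English",
    "ja": "Chinese_Japanese_English",
}


def ocr_language_name(language_code: str) -> str:
    """Return the OcrLanguage member name for a language code (hypothetical helper)."""
    code = language_code.strip().lower()
    try:
        return OCR_LANGUAGE_BY_CODE[code]
    except KeyError:
        # This page does not say which model suits other languages,
        # so the sketch refuses to guess.
        raise ValueError(f"No OCR language mapping for {language_code!r}") from None


print(ocr_language_name("es"))
print(ocr_language_name("ja"))
```

Returning the member name rather than the enum keeps the sketch runnable on its own; with the library installed, getattr(OcrLanguage, name) should yield the matching member, assuming OcrLanguage is a standard Enum.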

Using Summaries

  • Table summaries: Enable when tables contain important information that should be retrievable through semantic search
  • Figure summaries: Usually beneficial to keep enabled unless you're processing documents with many irrelevant images
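
The two rules above reduce to a pair of booleans. The following is a minimal sketch with hypothetical names (summary_flags and its parameters are not part of the nomic client); it only computes the enabled values and leaves building the option objects to the caller.

```python
def summary_flags(tables_matter_for_search: bool, many_irrelevant_images: bool) -> dict:
    """Return the 'enabled' value for each summary type (hypothetical helper)."""
    return {
        # Table summaries are off by default; turn them on when table
        # content should be retrievable through semantic search.
        "table_summary": tables_matter_for_search,
        # Figure summaries are on by default; turn them off only when
        # the documents carry many irrelevant images.
        "figure_summary": not many_irrelevant_images,
    }


flags = summary_flags(tables_matter_for_search=True, many_irrelevant_images=False)
print(flags)
```

The values would then feed the documented constructors, for example TableSummaryOptions(enabled=flags["table_summary"]) and FigureSummaryOptions(enabled=flags["figure_summary"]).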