Extraction overview

Extract turns unstructured documents (PDFs, scans, images, HTML) into predictable, structured JSON. You provide a JSON Schema describing the output you want, and the API extracts those fields from your document into that shape.

At a high level:

  • You upload a file or pass a URL.
  • You define a JSON Schema for the output.
  • The API parses your document and extracts values according to your schema, returning structured JSON.

How it works

  1. Input: a document (URL or uploaded file).
  2. Schema: a JSON Schema describing the fields, types, and structure you expect in the output.
  3. Extraction: the service reads the parsed content of your document and fills the schema, returning structured JSON.

Extraction does not invent data. If a value does not exist in the document (or in its parsed content), the field is left empty or omitted from the output unless you have marked it as required, in which case the missing data is surfaced.

Design your schema

Schemas follow standard JSON Schema. Tips for reliable extractions:

  • Use clear field names and helpful description text that mirrors how the value appears in the document.
  • Set type for every field (string, number, boolean, object, array).
  • Use required for must-have fields so missing data is surfaced.
  • Use enum to constrain values when there are limited options.
  • Keep nesting shallow (ideally ≤ 2–3 levels).
  • Use array for long lists or line items.

Example schema with an array of accounts:

{
  "type": "object",
  "properties": {
    "customerName": {
      "type": "string",
      "description": "Full name of the customer as shown on the statement."
    },
    "accounts": {
      "type": "array",
      "description": "List of the customer's financial accounts.",
      "items": {
        "type": "object",
        "properties": {
          "accountType": {
            "type": "string",
            "description": "Type of account.",
            "enum": ["checking", "savings", "investment", "other"]
          },
          "accountNumber": {
            "type": "string",
            "description": "Account identifier as printed on the document."
          },
          "endingValue": {
            "type": "number",
            "description": "Closing balance or value of the account."
          }
        },
        "required": ["accountType", "accountNumber", "endingValue"]
      }
    }
  },
  "required": ["customerName", "accounts"]
}
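
For reference, a statement extracted with this schema comes back as JSON in the same shape. The values below are illustrative placeholders, not output from a real document:

{
  "customerName": "Jane Example",
  "accounts": [
    {
      "accountType": "checking",
      "accountNumber": "1234567890",
      "endingValue": 2450.75
    },
    {
      "accountType": "savings",
      "accountNumber": "0987654321",
      "endingValue": 10200.0
    }
  ]
}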

Quickstart

Define your schema and run an extraction. The minimal example below extracts a few fields from a sample PDF:

from nomic import NomicClient

client = NomicClient()

# Define a schema
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "The title of the document"},
        "university": {"type": "string", "description": "The university of the original document"},
        "translator": {"type": "string", "description": "The translator of the document"},
        "old god": {"type": "string", "description": "The name of the famous Egyptian old god"}
    },
    "required": ["title"]
}

# Extract using a URL
url = "https://assets.nomicatlas.com/Phaedrus.pdf"
result = client.extract(url, schema)

# Or extract using an uploaded file
file = client.upload_file("sample.pdf")
result = client.extract(file, schema)

To print the result:

import json
import sys
print("Extracted result:")
json.dump(result["result"], sys.stdout, indent=4)

Non-blocking usage

As with parse, you can pass block=False to receive a task handle immediately instead of waiting for completion.
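
A minimal sketch, assuming the same client and schema as the quickstart above. The exact interface of the returned task handle is not shown in this section, so the result() call below is an assumption; check the parse documentation for the handle's actual methods:

# Start the extraction without waiting for it to finish
task = client.extract(url, schema, block=False)

# ... do other work ...

# Hypothetical: retrieve the finished output from the task handle
result = task.result()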

Example result

{
  "title": "Phaedrus",
  "old god": "Theuth",
  "translator": "Benjamin Jowett",
  "university": "MIT",
  "_metadata": {
    "processing_status": "completed",
    "fields_extracted": 4,
    "schema_used": {
      "type": "object",
      "properties": {
        "title": {"type": "string", "description": "The title of the document"},
        "old god": {"type": "string", "description": "The name of the famous Egyptian old god"},
        "translator": {"type": "string", "description": "The translator of the document"},
        "university": {"type": "string", "description": "The university of the original document"}
      }
    }
  }
}
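
Alongside the extracted fields, the result carries a _metadata object describing the run. A short sketch of reading it, assuming the response shape shown above:

# Inspect the run metadata returned alongside the extracted fields
metadata = result["result"]["_metadata"]
if metadata["processing_status"] == "completed":
    print(f"{metadata['fields_extracted']} fields extracted")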

Troubleshooting basics

  • Missing values: Confirm the value actually appears in the document, strengthen the field's description text, and keep the schema shallow.
  • Long lists: Use an array type at the appropriate level for tables or line items.
  • Consistency: Use enum where fields have limited options; add required for must-have fields.

What’s next

After extraction, you can:

  • Feed results to downstream systems or databases
  • Build search indexes
  • Enrich with other pipelines as needed