Prepare Data for Atlas

Atlas accepts data files containing text and image content for embedding generation, along with any relevant associated metadata.

note

For programmatic dataset uploads, our Python SDK allows you to upload datasets directly from Python code.
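
A minimal sketch of such an upload, assuming the nomic package is installed and you have authenticated with nomic login (parameter names may differ slightly between SDK versions):

```python
from nomic import atlas

# Rows to upload: one dict per datapoint, with the text to embed
# plus any metadata fields (names here are illustrative).
documents = [
    {"id": str(i), "text": f"Example document number {i}", "category": "demo"}
    for i in range(20)  # Atlas requires at least 20 rows
]

dataset = atlas.map_data(
    data=documents,
    indexed_field="text",           # the field Atlas embeds
    identifier="my-first-dataset",  # dataset name shown in the Dashboard
)
```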

File Requirements

  • Supported formats: CSV, TSV, JSON, or JSONL
  • Column names:
    • For CSV/TSV: Files must include a header row specifying column names
    • For JSON/JSONL: Each object must use consistent field names
  • Content: At least one column/field to embed, containing either natural language text or a path to an image (see the example file after this list).
  • Size limits: at least 20 rows (and for free users, at most 250,000 rows), with a total file size under 1GB
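
For illustration, the first rows of a valid JSONL file might look like this (field names are hypothetical, and a real file needs at least 20 rows):

```json
{"id": "0", "text": "The product arrived on time and works great.", "rating": 5, "created_at": "2024-03-14T15:30:00Z"}
{"id": "1", "text": "Shipping was slow and support never responded.", "rating": 2, "created_at": "2024-03-15T09:12:00Z"}
```

The equivalent CSV would begin with the header row id,text,rating,created_at.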

Best Practices

Null Data

When uploading numeric data that contains missing values, the proper formatting is:

  • JSON/JSONL: Use the null keyword (without quotes) for missing values
  • CSV/TSV: Leave cells empty to represent NULL values

Important: If a column contains mixed data types (such as numbers and text like "n/a"), Atlas will likely interpret the entire column as string data. For optimal analysis in Atlas, ensure your numeric columns contain only numbers and properly formatted NULL values.
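
For example, a row missing its rating value would look like this in JSONL versus CSV (fields are hypothetical):

```json
{"id": "2", "text": "No rating was given.", "rating": null}
```

```csv
id,text,rating
2,No rating was given.,
```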

Date and Time Formatting

For consistent handling of temporal information, we recommend using ISO 8601 format: YYYY-MM-DD for dates (e.g., 2024-03-14) and YYYY-MM-DDThh:mm:ssZ for timestamps (e.g., 2024-03-14T15:30:00Z).

Our CSV parser will attempt to detect other date formats, but using the ISO standard ensures the most reliable parsing.
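
As a quick sketch, a US-style timestamp can be normalized to ISO 8601 in Python like this (assuming the source times are in UTC):

```python
from datetime import datetime, timezone

# Parse a local-style string and re-serialize it as ISO 8601 with a Z suffix.
raw = "3/14/24 3:30 PM"
parsed = datetime.strptime(raw, "%m/%d/%y %I:%M %p").replace(tzinfo=timezone.utc)
print(parsed.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2024-03-14T15:30:00Z
```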

Include All Available Metadata

When preparing your dataset, include all relevant metadata fields since Atlas can efficiently handle datasets with many columns. These additional columns provide valuable context and enable richer analysis through the data map controls. Consider including metadata like timestamps, categories, authors, status fields, IDs, ratings, and any other fields that could provide useful context or filtering capabilities.

Unnest Data

If your data contains nested structures or objects (like JSON objects), unnest them into separate columns before uploading to Atlas. For example, instead of having a single column containing {"status": "open", "priority": 3}, split it into two separate columns: status with value "open" and priority with value 3. This flat structure allows for better filtering and analysis in Atlas.
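
One way to flatten nested records before upload is with pandas, sketched here with illustrative field and file names:

```python
import pandas as pd

# Nested records as they might appear in a raw export.
records = [
    {"id": "0", "text": "Ticket about login errors", "meta": {"status": "open", "priority": 3}},
    {"id": "1", "text": "Ticket about billing", "meta": {"status": "closed", "priority": 1}},
]

# json_normalize flattens nested objects into dotted column names
# (meta.status, meta.priority); strip the prefix for cleaner columns.
flat = pd.json_normalize(records)
flat.columns = [c.replace("meta.", "") for c in flat.columns]
flat.to_csv("tickets.csv", index=False)  # flat file, ready for upload
```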

Upload

Visit Atlas Dashboard

Once you are signed up for Atlas and logged in, visit https://atlas.nomic.ai/data to open your Atlas Dashboard.

Alternatively, on the Nomic Atlas homepage, click the Dashboard button.

Your organization's data maps live here. For new organizations with no datasets yet, you will be prompted to get started uploading your first dataset.

Create New Dataset

If you have no datasets yet, visiting your dashboard will prompt you to create one.

If you have existing datasets, click the Create Dataset button in Atlas to create a new one.

You can upload text to Atlas with dataset connectors, by dragging your file in directly, or by using the Atlas Python SDK.

If you are using a connector or dragging in your own file, make sure to inspect the auto-inferred Name, Embedding Field, and dataset settings (e.g., whether to use a multilingual model when embedding) before clicking the Upload Dataset button in Atlas. Your map will take a few minutes to build, with larger uploads requiring more time.

Upload Options

These options are available when uploading data in the Atlas UI. More options are available when you upload data with the Atlas Python SDK.

Dataset Name

The name used to display your data map in the Dashboard, as well as in its URL.

Dataset Description

The description for your dataset, visible in the dataset information menu and in your dataset settings.

Field to Embed

The attribute of your dataset used to arrange the points in the Atlas map.

Atlas automatically selects the best field for embedding from your dataset, but you can select a different field.

The embedding field determines how datapoints are arranged on the map in Atlas: points that appear as neighbors in the data map have similar semantic content in this field/column (and thus similar embeddings from the embedding model). Typically, this should be the main text column of your data, rather than non-semantic content like IDs or numerical metadata.

Embedding Model

The embedding model to use, which groups data points together based on their semantic meaning:

English (nomic-embed-text-v1.5): If you use this model and your data contains multiple languages, this model will create distinct clusters of text depending on the language used.

Multilingual (gte-multilingual-base): If you use this model and your data contains multiple languages, this model will cluster text regardless of language used.

Dataset Visibility

For Pro and Enterprise accounts, you can make a map private so that it is not publicly accessible; only members of your organization will be able to find and access it.

  • Public: anyone can view
  • Private: only your fellow organization members can view
  • Restricted: manually grant access

Data Preparation Guides & Examples

View our data export guides to see walkthroughs of getting data from common sources and platforms into the right format for Atlas.

For more advanced data preparation examples, check out our cookbook tutorials:

Reddit Comments Analysis

Learn how to analyze Reddit comments data in Atlas: View the Reddit Comments Tutorial

GitHub Commit History Analysis

Learn how to analyze Git repository commit histories in Atlas: View the GitHub Commits Tutorial

CSV Parsing with DuckDB

Atlas uses DuckDB for parsing and processing CSV files, which provides robust automatic type detection capabilities. When you upload a CSV file, DuckDB automatically:

  • Detects the dialect of your CSV file (delimiter, quoting rules, escape characters)
  • Infers the data types of each column
  • Detects whether the file has a header row

Type Detection

Atlas uses DuckDB to convert values in each column to candidate types, following this priority order:

  1. BIGINT (integer)
  2. DOUBLE (floating point)
  3. TIMESTAMP
  4. VARCHAR (string)

A column is assigned the highest-priority type that successfully converts all sampled values. If values cannot be converted to a particular type, that candidate type is eliminated. VARCHAR is always the fallback type.
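
If you want to preview this locally, recent versions of the duckdb Python package expose the sniffer directly (the file name is illustrative; Atlas performs this detection server-side during upload):

```python
import duckdb

# Report the detected dialect (delimiter, quoting, header) and column types.
duckdb.sql("SELECT * FROM sniff_csv('my_data.csv')").show()

# Alternatively, inspect the inferred schema of the parsed file.
duckdb.sql("DESCRIBE SELECT * FROM read_csv_auto('my_data.csv')").show()
```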

For more details on DuckDB's CSV parsing and type inference, visit the DuckDB CSV auto detection documentation.