Prepare Data for Atlas

Atlas accepts data files containing text and image content for embedding generation, along with any associated metadata.

note

For programmatic dataset uploads, our Python SDK allows you to upload datasets directly from Python code.
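
As a rough illustration, the sketch below builds upload-ready records in memory and writes them out as JSONL; the commented-out SDK call (`atlas.map_data` with an `indexed_field` parameter) is an assumption about the current SDK surface, so check the SDK docs before relying on it:

```python
import json

# Build records with the field to embed plus metadata columns.
records = [
    {"text": f"Support ticket {i}: login fails on mobile", "source": "demo"}
    for i in range(20)  # Atlas requires at least 20 rows
]

# Write JSONL, one upload-ready option among the supported formats.
with open("tickets.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# With the Nomic SDK (assumed call signature; verify against the SDK docs):
# from nomic import atlas
# atlas.map_data(data=records, indexed_field="text")
```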

File Requirements

  • Supported formats: CSV, TSV, JSON, or JSONL
  • Column names:
    • For CSV/TSV: Files must include a header row specifying column names
    • For JSON/JSONL: Each object must use consistent field names
  • Content: At least one column/field to embed, containing either natural language text or a path to an image.
  • Size limits: at least 20 rows (and for free users, at most 250,000 rows), with a total file size under 1GB
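
As a quick sanity check before uploading, the requirements above can be verified with a short script. This is a minimal sketch using only the standard library; the thresholds mirror the limits listed here:

```python
import csv
import os

def check_csv_for_atlas(path, max_rows=250_000, max_bytes=1_000_000_000):
    """Check a CSV file against the Atlas upload limits described above."""
    problems = []
    if os.path.getsize(path) >= max_bytes:
        problems.append("file is 1GB or larger")
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)  # CSV/TSV files need a header row
        if not header:
            problems.append("missing header row")
        n_rows = sum(1 for _ in reader)
    if n_rows < 20:
        problems.append(f"only {n_rows} rows (need at least 20)")
    if n_rows > max_rows:
        problems.append(f"{n_rows} rows exceeds the free-tier limit")
    return problems
```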

Best Practices

Include All Available Metadata

When preparing your dataset, include all relevant metadata fields since Atlas can efficiently handle datasets with many columns. These additional columns provide valuable context and enable richer analysis through the data map controls. Consider including metadata like timestamps, categories, authors, status fields, IDs, ratings, and any other fields that could provide useful context or filtering capabilities.

Date and Time Formatting

For consistent handling of temporal information, convert all datetime fields to ISO 8601 format. Instead of using formats like 3/14/24 3:30 PM, use the standardized format 2024-03-14T15:30:00Z. This ensures that Atlas can properly process and display temporal data across your dataset.
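
A conversion like the one above can be done with the standard library. This sketch assumes the source timestamps are US-style and already in UTC; adjust the format string and timezone for your data:

```python
from datetime import datetime, timezone

def to_iso8601(raw, fmt="%m/%d/%y %I:%M %p"):
    """Convert a timestamp like '3/14/24 3:30 PM' to ISO 8601 (UTC assumed)."""
    dt = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(to_iso8601("3/14/24 3:30 PM"))  # → 2024-03-14T15:30:00Z
```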

Unnest Data

If your data contains nested structures or objects (like JSON objects), unnest them into separate columns before uploading to Atlas. For example, instead of having a single column containing {"status": "open", "priority": 3}, split it into two separate columns: status with value "open" and priority with value 3. This flat structure allows for better filtering and analysis in Atlas.
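
A small recursive helper is one way to flatten nested objects before upload; this is a sketch, and the `sep` naming convention (joining keys with `_`) is just one reasonable choice:

```python
def unnest(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single level of columns, e.g.
    {"meta": {"status": "open", "priority": 3}} ->
    {"meta_status": "open", "meta_priority": 3}."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing child keys.
            flat.update(unnest(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

row = {"text": "Ticket about login", "meta": {"status": "open", "priority": 3}}
print(unnest(row))
```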

Upload

Visit Atlas Dashboard

Once you are signed up for Atlas and logged in, visit https://atlas.nomic.ai/data to open your Atlas Dashboard.

Alternatively, click the Dashboard button on the Nomic Atlas homepage.

Your organization's data maps live here. For new organizations with no datasets yet, you will be prompted to get started uploading your first dataset.

Create New Dataset

If you have no datasets yet, visiting your dashboard will prompt you to create one.

If you have existing datasets, click the Create Dataset button in Atlas to create a new dataset.

You can upload text to Atlas via dataset connectors, via Python, or by uploading your own file directly. Here we upload a .csv file by dropping it onto the Atlas upload page.

Review the auto-inferred Name, Embedding Field, and dataset settings (e.g., whether to use a multilingual model when embedding) before clicking the Upload Dataset button. Your map will take a few minutes to build, and larger datasets require more time.

Upload Options

Name

The name used to display your data map in the Dashboard; it also determines the map's URL.

Embedding Field / Indexed Field

The attribute of your dataset used to arrange the points in the Atlas map.

Uploading data to Atlas requires choosing which field/column from your dataset to embed with an embedding model. This choice determines how the datapoints are arranged as a map in Atlas: data that show up as neighboring points in the data map have similar semantic content in this field/column (and thus similar embeddings via the embedding model). Typically, you will want this to be the main text column of your data, as opposed to non-semantic content like IDs or numerical metadata.

Build Topic Model

Whether to build a topic model, which displays labels over clusters and subclusters within the Atlas data map interface. You can read more about how it works here.

Duplicate Detection

Whether to activate duplicate detection, which will create a new column of metadata for your dataset indicating which points are likely duplicates of other data points.

Use Multilingual Model

Whether to use a multilingual embedding model, which will group data points together based on semantic meaning regardless of language used in the text (as opposed to the default nomic-embed-text-v1.5 embedding model, which will create distinct clusters of data depending on the language used in the text).

Private Map

For Pro and Enterprise accounts, you can make a map private so that it is not publicly accessible; only members of your organization will be able to find or access the map.

Data Preparation Guides & Examples

View our data export guides to see walkthroughs of getting data from common sources and platforms into the right format for Atlas.

For more advanced data preparation examples, check out our cookbook tutorials:

Reddit Comments Analysis

Learn how to analyze Reddit comments data in Atlas: View the Reddit Comments Tutorial

GitHub Commit History Analysis

Learn how to analyze Git repository commit histories in Atlas: View the GitHub Commits Tutorial