Prepare Data for Atlas
Atlas accepts data files containing text and image content for embedding generation, along with any associated metadata.
For programmatic dataset uploads, our Python SDK allows you to upload datasets directly from Python code.
File Requirements
- Supported formats: CSV, TSV, JSON, or JSONL
- Column names:
- For CSV/TSV: Files must include a header row specifying column names
- For JSON/JSONL: Each object must use consistent field names
- Content: At least one column/field to embed, containing either natural language text or a path to an image.
- Size limits: at least 20 rows (and for free users, at most 250,000 rows), with a total file size under 1GB
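The requirements above can be sketched with a small stdlib-only script that writes a JSONL file meeting them: one JSON object per line, consistent field names across objects, a text field to embed, and extra metadata columns. The field names and values here are hypothetical examples, not ones Atlas requires.

```python
import json

# Hypothetical rows: a "text" field to embed plus metadata columns.
# In a real upload you would need at least 20 rows.
rows = [
    {"text": "Shipping was delayed by two weeks.", "category": "logistics", "rating": 2},
    {"text": "Great product, arrived on time.", "category": "fulfillment", "rating": 5},
]

# JSONL: one JSON object per line, every object using the same field names.
with open("atlas_upload.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```

The same rows could equally be written as a CSV with a header row; JSONL is shown because the consistent-field-names requirement is easy to see in it.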
Best Practices
Include All Available Metadata
When preparing your dataset, include all relevant metadata fields since Atlas can efficiently handle datasets with many columns. These additional columns provide valuable context and enable richer analysis through the data map controls. Consider including metadata like timestamps, categories, authors, status fields, IDs, ratings, and any other fields that could provide useful context or filtering capabilities.
Date and Time Formatting
For consistent handling of temporal information, convert all datetime fields to ISO 8601 format. Instead of a format like `3/14/24 3:30 PM`, use the standardized form `2024-03-14T15:30:00Z`. This ensures that Atlas can properly process and display temporal data across your dataset.
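A minimal sketch of that conversion, assuming US-style `M/D/YY h:mm AM/PM` input and that the timestamps are already in UTC (if they are in a local timezone, convert to UTC first):

```python
from datetime import datetime, timezone

def to_iso8601(raw: str) -> str:
    """Convert a timestamp like '3/14/24 3:30 PM' to ISO 8601, assuming UTC."""
    dt = datetime.strptime(raw, "%m/%d/%y %I:%M %p")
    return dt.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

print(to_iso8601("3/14/24 3:30 PM"))  # 2024-03-14T15:30:00Z
```

Apply a function like this to every datetime column before writing your upload file, adjusting the `strptime` format string to match whatever format your source data actually uses.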
Unnest Data
If your data contains nested structures (like JSON objects), unnest them into separate columns before uploading to Atlas. For example, instead of a single column containing `{"status": "open", "priority": 3}`, split it into two separate columns: `status` with value `"open"` and `priority` with value `3`. This flat structure allows for better filtering and analysis in Atlas.
Upload
Visit Atlas Dashboard
Once you are signed up for Atlas and logged in, visit https://atlas.nomic.ai/data to open your Atlas Dashboard.
Alternatively, on the Nomic Atlas homepage, click the Dashboard button.
Your organization's data maps live here. For new organizations with no datasets yet, you will be prompted to get started uploading your first dataset.
Create New Dataset
If you have no datasets yet, visiting your dashboard will prompt you to create one.
If you have existing datasets, click to create a new dataset.
You can upload text to Atlas via dataset connectors, via Python, or by uploading your own file directly. Here we upload a `.csv` file by dropping it onto the Atlas upload page.
Take a look at the auto-inferred Name, Embedding Field, and dataset settings (e.g., whether to use a multilingual model when embedding, which we deactivate in the above video) before clicking to create your map. Your map will take a few minutes to load; the more data you upload, the longer it takes.
Upload Options
Name
The name which will be used for your data map's display in the Dashboard, as well as its URL.
Embedding Field / Indexed Field
The attribute of your dataset used to arrange the points in the Atlas map.
Uploading data to Atlas requires choosing which field/column from your dataset to embed with an embedding model. This choice determines how the datapoints are arranged as a map in Atlas: data that show up as neighboring points in the data map have similar semantic content in this field/column (and thus similar embeddings via the embedding model). Typically, you will want this to be the text column from your data, as opposed to non-semantic content like IDs or numerical metadata.
Build Topic Model
Whether to build a topic model, which displays labels over clusters and subclusters within the Atlas data map interface. You can read more about how it works here.
Duplicate Detection
Whether to activate duplicate detection, which will create a new column of metadata for your dataset indicating which points are likely duplicates of other data points.
Use Multilingual Model
Whether to use a multilingual embedding model, which will group data points together based on semantic meaning regardless of language used in the text (as opposed to the default nomic-embed-text-v1.5 embedding model, which will create distinct clusters of data depending on the language used in the text).
Private Map
For Pro & Enterprise accounts, you can make a map private so that it is not publicly accessible; only members of your organization will be able to find and access the map.
Data Preparation Guides & Examples
View our data export guides to see walkthroughs of getting data from common sources and platforms into the right format for Atlas.
For more advanced data preparation examples, check out our cookbook tutorials:
Reddit Comments Analysis
Learn how to analyze Reddit comments data in Atlas: View the Reddit Comments Tutorial
GitHub Commit History Analysis
Learn how to analyze Git repository commit histories in Atlas: View the GitHub Commits Tutorial