Data Curation
Learn how to use Atlas to improve the quality of your complex datasets by annotating and curating data with ease.
Introduction
Use Atlas to remove outliers, identify and remove clusters, and annotate data from an intuitive visual layout. Atlas supports every step of the data annotation and cleaning pipeline, helping you:
- Visualize your data
- Find and annotate your datapoints
- Access your annotated data in Python
- Use data annotations for data cleaning, new maps, and downstream apps
Quickstart
Learn how to select, annotate, and curate data with Atlas and Python in the example below.
Exploring and labeling a news dataset
In this example, we will map and label a news dataset. To start, load the ag_news dataset (source on the Hugging Face website), randomly sample 10,000 points, and map them in Atlas.
Follow along with this tutorial in a Colab notebook here.
Step 0. Build an Atlas map of your dataset
The code example below shows how to build an Atlas map with 10,000 news article data points.
from nomic import atlas
import nomic
import numpy as np
import random
from datasets import load_dataset

np.random.seed(0)  # so your map has the same points sampled.

dataset = load_dataset('ag_news')['train']

max_documents = 10000
subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
documents = [dataset[i] for i in subset_idxs]

# Give each sampled document a unique integer id
for i in range(len(documents)):
    documents[i]['id'] = i

dataset = atlas.map_data(data=documents,
                         id_field='id',
                         indexed_field='text',
                         identifier='News 10k Tutorial',
                         description='10k News Articles for Labeling'
                         )

# Re-run this cell to view map updates
dataset.maps[0]
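To see what the sampling and id assignment do in isolation, here is a minimal sketch on a toy list of records. It is not the tutorial's own code: it swaps in the standard library's random.sample for np.random.choice, and the helper name sample_with_ids and the toy records are hypothetical.

```python
import random

def sample_with_ids(records, k, seed=0):
    """Sample k records without replacement and attach sequential ids.

    Hypothetical helper for illustration; it mirrors the tutorial's steps
    but uses random.sample instead of np.random.choice.
    """
    rng = random.Random(seed)  # seeded so the same subset is drawn each run
    idxs = rng.sample(range(len(records)), k)  # k distinct indices
    sampled = [dict(records[i]) for i in idxs]  # copy so the input is not mutated
    for new_id, doc in enumerate(sampled):
        doc['id'] = new_id
    return sampled

# Toy stand-in for the news dataset
toy_records = [{'text': f'article {n}', 'label': n % 4} for n in range(100)]
toy_sample = sample_with_ids(toy_records, k=10)
```

Sampling without replacement guarantees that no document appears twice, and the sequential ids give each sampled document a unique key to pass as id_field.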

Step 1. Select datapoints
Using the toolbar on the right side of the map, you can interact with your mapped datapoints.
- To select your points, click the Lasso tool and circle the sports-related points.

You know where they are because the map pre-organizes them all together!
Step 2. Annotate datapoints
Making a selection brings up the Selection pane, where you can:
- Filter through your selected datapoints (try the arrow or WASD keybindings)
- Tag your selected points
- Click the + tag all button in the Selection pane and tag the region as sports. This annotates every point in that region.

Now here is where Atlas shines. The interaction you just did in your web browser is synced to your dataset's state.
Load your dataset in Python by initializing it with your organization name and dataset name, and then access your tags.
from nomic import AtlasDataset

dataset = AtlasDataset('<YOUR ORGANIZATION HERE>/news-10k-tutorial')
map = dataset.maps[0]

# Print each tag with its size and a sample of its datapoint ids
tags = map.tags.get_tags()
for tag in tags:
    datum_ids = map.tags.get_datums_in_tag(tag['tag_name'])
    print(tag['tag_name'], "Count:", len(datum_ids), "Sample:", datum_ids[:10])

# Fetch the records for the first two datapoints tagged 'sports'
dataset.get_data(ids=map.tags.get_datums_in_tag('sports')[:2])
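Once the tag membership is in Python, it is ordinary data you can summarize however you like. The sketch below assumes you have already collected the ids per tag into a plain dictionary; the dictionary toy_tags and the helper summarize_tags are hypothetical and not part of the Atlas client.

```python
def summarize_tags(tag_to_ids, sample_size=10):
    """Return (tag_name, count, sample_ids) for each tag, largest tag first."""
    rows = [
        (name, len(ids), list(ids[:sample_size]))
        for name, ids in tag_to_ids.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Toy stand-in for ids gathered per tag (hypothetical values)
toy_tags = {'business': [1, 2], 'sports': [3, 8, 15, 21]}
summary = summarize_tags(toy_tags, sample_size=3)

for name, count, sample in summary:
    print(name, "Count:", count, "Sample:", sample)
```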
Step 3. Curate data
Now let's make a new map from this dataset without the data tagged 'sports':
from nomic import atlas  # needed if you start from a fresh Python session

old_df = map.data.df
ids_to_remove = map.tags.get_datums_in_tag('sports')

# Keep only the rows whose id is not tagged 'sports'
new_df = old_df[~old_df.id.apply(lambda x: x in ids_to_remove)]

new_dataset = atlas.map_data(data=new_df,
                             id_field='id',
                             indexed_field='text',
                             identifier='News 10k Tutorial without sports',
                             description='10k News Articles for Labeling, sports removed'
                             )
After a few minutes, your new map should be organized without any sports articles. Notice how all other positions remain largely the same.
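The filtering in Step 3 amounts to dropping every record whose id appears in the tagged set. A minimal sketch of that idea on plain Python records follows; the helper drop_tagged and the toy data are hypothetical. It converts the ids to a set first, since membership checks against a list scan the whole list for every record.

```python
def drop_tagged(records, ids_to_remove, id_field='id'):
    """Return the records whose id is not in ids_to_remove."""
    remove = set(ids_to_remove)  # average O(1) lookups instead of a list scan
    return [r for r in records if r[id_field] not in remove]

# Toy stand-in for the mapped documents and the ids tagged 'sports'
toy_records = [{'id': i, 'text': f'article {i}'} for i in range(6)]
kept = drop_tagged(toy_records, ids_to_remove=[1, 4])
```

With a pandas DataFrame, the same filter can be written as old_df[~old_df['id'].isin(ids_to_remove)], assuming the id column and the tagged ids share the same type.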

With this workflow, you can quickly triage, tag, and clean large unstructured datasets.
Tagging will only work if you have edit access to the map.