Data Curation

Learn how to use Atlas to improve the quality of your complex datasets by annotating and curating data with ease.

Introduction

Use Atlas to remove outliers, identify and remove clusters, and annotate data from an intuitive visual layout. Atlas supports every step of the data annotation and cleaning pipeline, letting you:

  • Visualize your data
  • Find and annotate your datapoints
  • Access your annotated data in Python
  • Use data annotations for data cleaning, new maps, and downstream apps

Quickstart

Learn how to select, annotate, and curate data with Atlas and Python in the example below.

Exploring and labeling a news dataset

In this example, we will map and label a news dataset. To start, load the ag_news dataset (available on the Hugging Face website), randomly sample 10,000 points, and map them in Atlas.

Follow along with this tutorial in a Colab notebook here.

Step 0. Build an Atlas map of your dataset

The code example below shows how to build an Atlas map with 10,000 news article data points.

from nomic import atlas
import nomic
import numpy as np
import random
from datasets import load_dataset

np.random.seed(0) # so your map has the same points sampled.

dataset = load_dataset('ag_news')['train']

max_documents = 10000
subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
documents = [dataset[i] for i in subset_idxs]
for i in range(len(documents)):
    documents[i]['id'] = i

dataset = atlas.map_data(data=documents,
                         id_field='id',
                         indexed_field='text',
                         identifier='News 10k Tutorial',
                         description='10k News Articles for Labeling'
                         )
# Re-run this cell to view map updates
dataset.maps[0]
[Figure: Atlas news map demo]
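
Because the map is built with an `id_field` and an `indexed_field`, it can help to confirm before uploading that every record has a unique ID and non-empty text. The following is a minimal sketch in plain Python: `check_records` is a hypothetical helper (not part of the nomic library), and the toy records stand in for the sampled `documents` list.

```python
def check_records(records, id_field="id", text_field="text"):
    """Return a list of human-readable problems found in `records`.

    Hypothetical helper for illustration; not part of the nomic library.
    """
    problems = []
    seen_ids = set()
    for position, record in enumerate(records):
        record_id = record.get(id_field)
        if record_id is None:
            problems.append(f"record {position}: missing '{id_field}'")
        elif record_id in seen_ids:
            problems.append(f"record {position}: duplicate id {record_id!r}")
        else:
            seen_ids.add(record_id)
        text = record.get(text_field)
        if not isinstance(text, str) or not text.strip():
            problems.append(f"record {position}: empty '{text_field}'")
    return problems


# Toy records standing in for the sampled `documents` list.
toy_documents = [
    {"id": 0, "text": "Stocks rally as markets open", "label": 2},
    {"id": 1, "text": "Home team wins the final", "label": 1},
    {"id": 1, "text": "   ", "label": 3},  # duplicate id and blank text
]
issues = check_records(toy_documents)
for issue in issues:
    print(issue)
```

An empty result suggests the records are ready to pass to `atlas.map_data`; any reported problem is cheaper to fix locally than after the map has been built.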

Step 1. Select datapoints

Using the toolbar on the right side of the map, you can interact with your mapped datapoints.

  1. To select points, click the Lasso tool and circle the sports-related points.

[Figure: Lasso on Atlas news map demo]

The sports articles are easy to find because the map places similar articles close together.

Step 2. Annotate datapoints

Selecting points with the Lasso tool brings up the Selection pane, where you can:

  1. Browse through your selected datapoints (try the arrow or WASD keybindings)
  2. Tag your selected points

To annotate every point in the region at once, click the + tag all button in the Selection pane and tag the region as sports.

[Figure: Lasso + tag on Atlas news map demo]

This is where Atlas shines: the tags you just applied in your web browser are synced to your dataset's state.

Load your dataset in Python by initializing it with your organization name and dataset name, then access your tags.

from nomic import AtlasDataset
dataset = AtlasDataset('<YOUR ORGANIZATION HERE>/news-10k-tutorial')
map = dataset.maps[0]
tags = map.tags.get_tags()
for tag in tags:
    datum_ids = map.tags.get_datums_in_tag(tag['tag_name'])
    print(tag['tag_name'], "Count:", len(datum_ids), "Sample:", datum_ids[:10])

dataset.get_data(ids=map.tags.get_datums_in_tag('sports')[:2])
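
For downstream use, it is often convenient to invert the tag-to-IDs view into a per-datapoint label lookup. The sketch below uses a plain dictionary as a stand-in for the results of `map.tags.get_datums_in_tag`. The tag names, the IDs, and the `invert_tags` helper are illustrative assumptions, not Atlas output or part of the nomic library.

```python
from collections import defaultdict


def invert_tags(tag_to_ids):
    """Map each datum id to the sorted list of tags that contain it."""
    id_to_tags = defaultdict(list)
    for tag_name, datum_ids in tag_to_ids.items():
        for datum_id in datum_ids:
            id_to_tags[datum_id].append(tag_name)
    # Return a plain dict so lookups of untagged ids do not create entries.
    return {datum_id: sorted(tags) for datum_id, tags in id_to_tags.items()}


# Made-up tag membership, shaped like {tag_name: [datum ids]}.
tag_to_ids = {
    "sports": [3, 7, 12],
    "needs-review": [7, 40],
}
labels = invert_tags(tag_to_ids)
print(labels[7])          # tags attached to datum 7
print(labels.get(5, []))  # an untagged datum has no entry
```

In practice, `tag_to_ids` could be filled by looping over the tags as in the snippet above and storing each list of IDs under its tag name.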

Step 3. Curate data

Now let's make a new map from this dataset without the data tagged 'sports':

old_df = map.data.df
ids_to_remove = map.tags.get_datums_in_tag('sports')
# Keep only the rows whose id is not tagged 'sports'
new_df = old_df[~old_df['id'].isin(ids_to_remove)]

new_dataset = atlas.map_data(data=new_df,
                             id_field='id',
                             indexed_field='text',
                             identifier='News 10k Tutorial without sports',
                             description='10k News Articles for Labeling, sports removed'
                             )

After a few minutes, your new map should be organized without any sports articles. Notice how all other positions remain largely the same.

[Figure: Atlas news map demo with sports removed]
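
When curating by ID, it may be worth checking that the filter removed what you expect before building a new map. One pitfall is a type mismatch: if the tagged IDs and the dataset's `id` values differ in type (for example strings versus integers), a membership filter silently removes nothing. The sketch below works on plain lists of IDs; `split_by_removal` is a hypothetical helper, and the IDs are made up.

```python
def split_by_removal(all_ids, ids_to_remove):
    """Return (kept_ids, unknown_ids).

    kept_ids:    ids from `all_ids` that are not marked for removal
    unknown_ids: ids marked for removal that never appear in `all_ids`,
                 which usually signals a typo or a type mismatch
    """
    remove_set = set(ids_to_remove)
    kept_ids = [i for i in all_ids if i not in remove_set]
    unknown_ids = sorted(remove_set - set(all_ids))
    return kept_ids, unknown_ids


all_ids = [0, 1, 2, 3, 4]

# Matching types: two ids are removed and nothing is unknown.
kept, unknown = split_by_removal(all_ids, [1, 3])
print(len(all_ids) - len(kept), "removed;", len(unknown), "unknown")

# Mismatched types: nothing is removed and every requested id is unknown.
kept_bad, unknown_bad = split_by_removal(all_ids, ["1", "3"])
print(len(all_ids) - len(kept_bad), "removed;", len(unknown_bad), "unknown")
```

A non-empty `unknown_ids` list is a cue to inspect the ID types before trusting the curated dataset.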

With this workflow, you can quickly triage, tag, and clean large unstructured datasets.

Note: Tagging only works if you have edit access to the map.