
Data cleaning and improvement

Learn how to use Atlas to improve the quality of your complex datasets by annotating, adding, and deleting data with ease.

Introduction

Use Atlas to remove outliers, identify and remove clusters, and annotate data from an intuitive visual layout. Atlas supports each step of the data annotation and cleaning pipeline:

  • Visualize your data
  • Find and annotate your datapoints
  • Access your annotated data in Python
  • Use data annotations for data cleaning, new maps, and downstream apps

Quickstart

Learn how to select, annotate, and delete datapoints with Atlas and Python in the example below.

Exploring and labeling a news dataset

In this example, we will map and label a news dataset. To start, load the ag_news dataset (hosted on Hugging Face), randomly sample 10,000 points, and map them in Atlas.

Follow along with this tutorial in a Colab notebook here.

Step 0. Build an Atlas map of your dataset

The code example below shows how to build an Atlas map with 10,000 news article data points.

from nomic import atlas
import nomic
import numpy as np
import random
from datasets import load_dataset

np.random.seed(0)  # fix the seed so your map samples the same points

dataset = load_dataset('ag_news')['train']

# Sample 10,000 articles without replacement
max_documents = 10000
subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
documents = [dataset[i] for i in subset_idxs]

# Give each document a unique id for Atlas to index on
for i in range(len(documents)):
    documents[i]['id'] = i

dataset = atlas.map_data(
    data=documents,
    id_field='id',
    indexed_field='text',
    name='News 10k Tutorial',
    description='10k News Articles for Labeling'
)
# Re-run this cell to view map updates
dataset.maps[0]
Atlas news map demo
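
The sampling and id assignment can be checked locally before anything is uploaded. The following is a minimal sketch under stated assumptions: plain dicts stand in for ag_news rows, the standard library's random module replaces NumPy, and the helper name sample_with_ids is hypothetical rather than part of the nomic package.

```python
import random

def sample_with_ids(rows, k, seed=0):
    """Sample k rows without replacement and give each copy a sequential 'id'.

    Hypothetical helper for illustration; not part of the nomic package.
    """
    rng = random.Random(seed)  # local generator, so the global seed is untouched
    picked = rng.sample(range(len(rows)), k)
    # Copy each row so the source rows are not mutated
    return [{**rows[i], "id": n} for n, i in enumerate(picked)]

# Stand-in rows shaped like ag_news records (a 'text' and a 'label' field)
rows = [{"text": f"article {i}", "label": i % 4} for i in range(100)]
documents = sample_with_ids(rows, k=10)

ids = [doc["id"] for doc in documents]
assert ids == list(range(10))                            # ids are sequential
assert len({doc["text"] for doc in documents}) == 10     # no row sampled twice
```

Confirming that the ids are unique up front matters because the id field is what later ties tags back to individual documents.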

Step 1. Select datapoints

Using the toolbar on the right side of the map, you can interact with your mapped datapoints.

  1. To select points, click the Lasso tool and circle the sports-related points.
Lasso on Atlas news map demo

They are easy to find because the map already groups similar articles together!

Step 2. Annotate datapoints

Making a selection brings up the Selection pane, where you can:

  1. Filter through your selected datapoints (try the arrow or WASD keybindings)
  2. Tag your selected points
  3. Click the + tag all button in the Selection pane and tag the region as sports. This annotates every point in that region.
Lasso + tag on Atlas news map demo

Now here is where Atlas shines. The interaction you just performed in your web browser is synced to your dataset's state.

Load your dataset in Python by initializing it by name, then access your tags.

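from nomic import AtlasDataset
dataset = AtlasDataset('news-10k-tutorial')
atlas_map = dataset.maps[0]  # renamed from `map` to avoid shadowing the Python builtin
tags = atlas_map.tags.get_tags()
# Print each tag with its point count and the first ten ids
for tag, datum_ids in tags.items():
    print(tag, "Count:", len(datum_ids), datum_ids[:10])

# Fetch the first two documents tagged as sports
print(dataset.get_data(ids=tags['sports'][:2]))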
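
For downstream use it is often handier to look up tags per datapoint than datapoints per tag. The sketch below assumes only what the loop above shows, namely that the tags are a dict mapping each tag name to a list of ids. The example_tags values are made-up stand-ins and tags_by_datum is a hypothetical helper.

```python
from collections import defaultdict

def tags_by_datum(tags):
    """Invert {tag: [datum_id, ...]} into {datum_id: [tag, ...]}.

    Hypothetical helper for illustration; not part of the nomic package.
    """
    inverted = defaultdict(list)
    for tag, datum_ids in tags.items():
        for datum_id in datum_ids:
            inverted[datum_id].append(tag)
    return dict(inverted)

# Stand-in for the dict returned by get_tags(); ids here are invented
example_tags = {"sports": ["3", "7", "9"], "finance": ["7", "12"]}
labels = tags_by_datum(example_tags)
print(labels["7"])  # ['sports', 'finance']
```

With real data, you would pass the tags dict from your own map in place of example_tags.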

Step 3. Delete datapoints by tag

Now let's delete all points related to sports from the map. Call the delete_data method on your dataset with the ids of the sports data points. Then, rebuild the map.

dataset.delete_data(ids=tags['sports'])
dataset.rebuild_maps()

After about a minute, your map should be re-organized without any sports articles. Notice how all other positions remain largely the same.

Atlas news map demo with sports removed
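
If some points carry more than one tag, it can help to preview the id list before passing it to delete_data. This is a rough sketch, again assuming the tags are a dict of tag name to id list. The helper ids_to_delete, the "keep" tag, and the example ids are all hypothetical.

```python
def ids_to_delete(tags, target, protect=()):
    """Return ids tagged `target`, skipping any id that also has a protected tag.

    Hypothetical helper for illustration; not part of the nomic package.
    """
    protected = set()
    for tag in protect:
        protected.update(tags.get(tag, []))
    # Preserve the original order of the target tag's ids
    return [datum_id for datum_id in tags.get(target, []) if datum_id not in protected]

# Stand-in tags; "keep" is an invented tag marking points to retain
example_tags = {"sports": ["1", "2", "3"], "keep": ["2"]}
to_delete = ids_to_delete(example_tags, "sports", protect=["keep"])
print(to_delete)  # ['1', '3']
```

The resulting list could then be reviewed and, once it looks right, supplied as the ids argument in place of tags['sports'].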

With this workflow, you can quickly triage, tag, and clean large unstructured datasets. Check out the Chatbot tutorial to learn how you can use tags and labels that Atlas has automatically extracted from your data as part of your workflow.

note

Tagging will only work if you have edit access to the map.