Data cleaning and improvement
Learn how to use Atlas to improve the quality of your complex datasets by annotating, adding, and deleting data with ease.
Introduction
Use Atlas to remove outliers, identify and remove clusters, and annotate data from an intuitive visual layout. Atlas aids you in every step of the data annotation and cleaning pipeline to:
- Visualize your data
- Find and annotate your datapoints
- Access your annotated data in Python
- Use data annotations for data cleaning, new maps, and downstream apps
Quickstart
Learn how to select, annotate, and delete datapoints with Atlas and Python in the example below.
Exploring and labeling a news dataset
In this example, we will map and label a news dataset. To start, load the ag_news dataset (source on the Hugging Face website), randomly sample 10,000 points, and map them in Atlas.
Follow along with this tutorial in a Colab notebook here.
Step 0. Build an Atlas map of your dataset
The code example below shows how to build an Atlas map with 10,000 news article data points.
from datasets import load_dataset
from nomic import atlas
import numpy as np

np.random.seed(0)  # fixed seed so your map samples the same points

# Load the training split and sample 10,000 articles without replacement
dataset = load_dataset('ag_news')['train']
max_documents = 10000
subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
documents = [dataset[i] for i in subset_idxs]

# Give each document a unique id for Atlas to index by
for i in range(len(documents)):
    documents[i]['id'] = i

dataset = atlas.map_data(
    data=documents,
    id_field='id',
    indexed_field='text',
    name='News 10k Tutorial',
    description='10k News Articles for Labeling'
)
# Re-run this cell to view map updates
dataset.maps[0]
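The sampling step above depends on a fixed seed so that reruns pick the same rows. As a minimal sketch of that idea using only the standard library, the hypothetical helper `sample_indices` below draws distinct indices from a seeded generator. It does not reproduce the exact rows NumPy selects, and the row count used is illustrative.

```python
import random

def sample_indices(n_rows, k, seed=0):
    """Return k distinct row indices in [0, n_rows), reproducible for a given seed."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    return rng.sample(range(n_rows), k)

# Illustrative sizes: sample 10,000 rows out of 120,000
first = sample_indices(120_000, 10_000)
second = sample_indices(120_000, 10_000)
print(first == second)   # the same seed gives the same sample
print(len(set(first)))   # every index appears once
```

Using a local generator rather than the global seed keeps the sample stable even if other code in the notebook draws random numbers first.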
Step 1. Select datapoints
Using the toolbar on the right side of the map, you can interact with your mapped datapoints.
- To select your points, click the Lasso tool and circle the sports-related points.
You can find them easily because the map groups similar articles together.
Step 2. Annotate datapoints
Completing the selection brings up the Selection pane, where you can:
- Filter through your selected datapoints (try the arrow or WASD keybindings)
- Tag your selected points
- Click the + tag all button in the Selection pane and tag the region as sports. This annotates every point in that region.
Now here is where Atlas shines. The interaction you just did in your web browser is synced to your dataset's state.
Load up your dataset in Python by initializing it by name and then access your tags.
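from nomic import AtlasDataset

dataset = AtlasDataset('news-10k-tutorial')
atlas_map = dataset.maps[0]  # named atlas_map to avoid shadowing Python's built-in map()
tags = atlas_map.tags.get_tags()

# Print each tag with its point count and the first ten tagged ids
for tag, datum_ids in tags.items():
    print(tag, "Count:", len(datum_ids), datum_ids[:10])

# Fetch the first two datapoints tagged as sports
print(dataset.get_data(ids=tags['sports'][:2]))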
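The loop above suggests that the tags come back as a mapping from tag name to a list of datapoint ids. Assuming that shape, a small sketch of inverting it into a per-datapoint lookup might look like the following. The helper `tags_by_datum` and the sample ids are hypothetical, and the real ids may be returned as strings rather than integers.

```python
from collections import defaultdict

def tags_by_datum(tags):
    """Invert {tag: [datum_id, ...]} into {datum_id: [tag, ...]}."""
    inverted = defaultdict(list)
    for tag, datum_ids in tags.items():
        for datum_id in datum_ids:
            inverted[datum_id].append(tag)
    return dict(inverted)

# Made-up stand-in for the tag mapping, not real Atlas output
example_tags = {"sports": [3, 7, 12], "finance": [7, 20]}
lookup = tags_by_datum(example_tags)
print(lookup[7])  # ['sports', 'finance']
```

A per-datapoint lookup like this is convenient when a point can carry more than one tag and you want to write the labels back onto your own records.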
Step 3. Deleting datapoints by tag
Now let's delete all points related to sports from the map. Call the delete_data method on your dataset with the ids of the sports datapoints. Then, rebuild the map.
dataset.delete_data(ids=tags['sports'])
dataset.rebuild_maps()
After about a minute, your map should be re-organized without any sports articles. Notice how all other positions remain largely the same.
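If you also keep a local copy of the data, such as the documents list from Step 0, the same tagged ids can be used to clean it for downstream use. The sketch below is one way to do that. The helper `drop_tagged` and the sample records are hypothetical, and ids are compared as strings on the assumption that they may come back as either integers or strings.

```python
def drop_tagged(documents, tagged_ids, id_field="id"):
    """Return the documents whose id is not in tagged_ids."""
    # Normalize to strings so that 2 and "2" are treated as the same id
    blocked = {str(i) for i in tagged_ids}
    return [doc for doc in documents if str(doc[id_field]) not in blocked]

# Made-up records standing in for the sampled news articles
docs = [
    {"id": 0, "text": "Team wins the cup final"},
    {"id": 1, "text": "Markets close higher"},
    {"id": 2, "text": "Striker signs new contract"},
]
cleaned = drop_tagged(docs, tagged_ids=["0", "2"])
print(cleaned)  # [{'id': 1, 'text': 'Markets close higher'}]
```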
With this workflow, you can quickly triage, tag, and clean large unstructured datasets. Check out the Chatbot tutorial to learn how you can use the tags and labels that Atlas automatically extracts from your data as part of your workflow.
Tagging will only work if you have edit access to the map.