Data Exploration, Cleaning and Labeling in Atlas
This tutorial describes how to use Atlas to quickly label and tag a large corpus of text.
Atlas provides insight into a text corpus by organizing its documents on a map. Semantically similar documents cluster together on the map, which enables the following data labeling workflow:
- Make a map of your data.
- Use the pencil tool in Atlas to tag regions based on your domain expertise.
- Access your annotated tags from a Python script or Jupyter notebook.
Tags can then be fed into a downstream machine learning model, used to clean your dataset by deleting points from your project, and leveraged to build new maps on subsets of your data.
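As a rough illustration of the first use, the sketch below turns a tag mapping into one label per document, the kind of table a downstream classifier could train on. It assumes tags arrive as a dictionary from tag name to a list of ids, which is the shape printed later in this tutorial. The helper `tags_to_labels` and the `"unlabeled"` default are hypothetical names and are not part of the nomic client.

```python
def tags_to_labels(tags, all_ids, default="unlabeled"):
    """Return {id: tag} for every id in all_ids.

    tags: dict mapping tag name -> list of ids (ids may be str or int).
    Ids without a tag get `default`. If an id carries several tags,
    the tag that is iterated last wins.
    """
    labels = {str(i): default for i in all_ids}
    for tag, ids in tags.items():
        for i in ids:
            key = str(i)
            if key in labels:  # ignore ids that are not in all_ids
                labels[key] = tag
    return labels


# Toy data standing in for real tags.
example_tags = {"sports": ["1", "3"], "business": ["2"]}
labels = tags_to_labels(example_tags, all_ids=range(5))
print(labels)
```

Keys are normalized to strings here so that string ids from tags and integer ids from your own records compare equal.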
Exploring and Labeling a News Dataset
In this example, we will map and label a news dataset from the Hugging Face Hub. To start, load the ag_news dataset, randomly sample 10,000 points, and map them.
The dataset is composed of news articles scraped by an academic news scraping engine after 2004.
from nomic import atlas
import nomic
import numpy as np
import random
from datasets import load_dataset
np.random.seed(0) # so your map has the same points sampled.
dataset = load_dataset('ag_news')['train']
max_documents = 10000
subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
documents = [dataset[i] for i in subset_idxs]
for i in range(len(documents)):
    documents[i]['id'] = i
project = atlas.map_text(data=documents,
                         id_field='id',
                         indexed_field='text',
                         name='News 10k Tutorial',
                         description='10k News Articles for Labeling'
                         )
# Once your map is built, run this cell.
project.maps[0]
Project: News 10k Tutorial
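The sampling step can be checked in isolation. This minimal sketch uses only NumPy and a small stand-in population size (an assumption, in place of the real dataset length) to show that seeding makes the draw repeatable and that replace=False yields distinct indices. The helper name `sample_indices` is hypothetical.

```python
import numpy as np


def sample_indices(population_size, sample_size, seed=0):
    """Draw sample_size distinct indices, reproducibly for a given seed."""
    np.random.seed(seed)
    return np.random.choice(population_size, size=sample_size, replace=False).tolist()


first = sample_indices(1000, 10)
second = sample_indices(1000, 10)

print(first == second)        # same seed, so the two draws match
print(len(set(first)) == 10)  # replace=False, so no index repeats
```

Because the ids assigned in the tutorial are just positions in the sampled list, a repeatable draw is what keeps an id pointing at the same article across runs.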
Annotating the Map
Using the toolbar on the right side of the Map, you can interact with your uploaded data. In this example, we are going to:
- Select all sports-related points.
- Tag them as 'sports'.
- Delete the points via the Python client and then see the map update!
To get started, click the Pencil tool and circle the sports-related points. You know where they are because the map has already grouped them together!
This brings up the Selection pane where you can:
- Filter through your selected datapoints (try the arrow or WASD keybindings)
- Tag your selected points (tagging only works if that map belongs to you)
Click the Bulk Tag button (top right of the pane) in the Selection pane and tag the region as 'sports'.
Now here is where Atlas shines. The interaction you just performed in your web browser is synced to your project's state.
Load your project by name, then access your tags.
from nomic import AtlasProject
project = AtlasProject(name='News 10k Tutorial')
# Named atlas_map to avoid shadowing Python's built-in map().
atlas_map = project.maps[0]
tags = atlas_map.tags.get_tags()
for tag, datum_ids in tags.items():
    print(tag, "Count:", len(datum_ids), datum_ids[:10])
print(project.get_data(ids=tags['sports'][:2]))
sports Count: 2574 ['1003', '1005', '1006', '1010', '1015', '1016', '102', '103', '1031', '1038']
[{'id': 1003, 'text': 'Boxing: Ronald Wright retains titles Ronald Wright used an effective right jab to retain his World Boxing Association and World Boxing Council junior middleweight titles when posting a majority decision over former champion Shane Mosley in Las Vegas yesterday.', 'label': 1}, {'id': 1005, 'text': 'U.S. Furious at Proposal That Hamm Return His Gold ATHENS (Reuters) - U.S. Olympic chiefs reacted furiously Friday to a suggestion all-round champion Paul Hamm should give his gold medal to a South Korean rival under a plan floated by the governing body of world gymnastics.', 'label': 1}]
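In the output above, the tag ids are strings listed in lexicographic order, while the uploaded documents carry integer ids. The sketch below, run on made-up toy records, normalizes the ids and measures how closely a hand-drawn selection agrees with the dataset's own label column. It assumes label 1 means Sports, as in the ag_news label set. `selection_agreement` is a hypothetical helper, not a nomic function.

```python
def selection_agreement(tagged_ids, documents, target_label):
    """Compare a tagged selection against a ground-truth label.

    Returns (precision, recall) of the selection with respect to
    documents whose 'label' equals target_label.
    """
    selected = {int(i) for i in tagged_ids}  # tag ids arrive as strings
    relevant = {d["id"] for d in documents if d["label"] == target_label}
    hits = selected & relevant
    precision = len(hits) / len(selected) if selected else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Toy records; in the tutorial these would be the uploaded documents.
toy_documents = [
    {"id": 0, "text": "match report", "label": 1},
    {"id": 1, "text": "earnings call", "label": 2},
    {"id": 2, "text": "transfer news", "label": 1},
    {"id": 3, "text": "election recap", "label": 0},
]
toy_tagged = ["0", "1"]  # one sports article, one business article

precision, recall = selection_agreement(toy_tagged, toy_documents, target_label=1)
print(precision, recall)
```

A check like this is worth running before deleting anything, since a loosely drawn region can sweep in neighboring non-sports articles.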
Removing all sports articles
Now let's delete all sports-related points from the map. Call the delete_data method on your project with the ids of the sports data points.
Then, rebuild the map.
project.delete_data(ids=tags['sports'])
project.rebuild_maps()
After about a minute, your map should be re-organized without any sports articles.
Notice how all other positions remain largely the same.
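If you also keep a local copy of the documents, you may want to mirror the deletion so that your local data and the map stay consistent. This is a minimal sketch, assuming the local records carry the same integer id field that was uploaded. `split_by_ids` is a hypothetical helper and makes no calls to Atlas.

```python
def split_by_ids(documents, ids_to_remove):
    """Partition documents into (kept, removed) by id."""
    remove = {int(i) for i in ids_to_remove}  # accept str or int ids
    kept = [d for d in documents if d["id"] not in remove]
    removed = [d for d in documents if d["id"] in remove]
    return kept, removed


local_documents = [{"id": i, "text": f"article {i}"} for i in range(6)]
sports_ids = ["1", "4"]  # string ids, as returned for a tag

kept, removed = split_by_ids(local_documents, sports_ids)
print([d["id"] for d in kept])
print([d["id"] for d in removed])
```

The kept list is the subset you would pass along if you wanted to build a new map from the remaining articles.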
With this workflow, you can quickly triage, tag, and clean large unstructured datasets. Check out the Monitoring Text Over Time tutorial to learn how you can use the tags and labels that Atlas automatically extracts from your data as part of your workflow.