Duplicate detection
Duplicate detection lets you detect, filter, and remove duplicate datapoints in your datasets. Atlas uses your dataset's embeddings to identify datapoints that are near-duplicates of each other and then gives you the ability to take action on them.
Duplicate detection works by finding duplicate clusters in your dataset and assigning each datapoint to one of three categories. If a cluster contains only one point, it is a singleton cluster and its point is labeled a singleton. In each duplicate cluster with more than one point, one arbitrary datapoint is labeled a retention candidate and the rest are labeled deletion candidates, as sketched below.
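The labeling rule itself is simple. As a rough illustration, here is a sketch of the rule described above (not Atlas source code):

# Sketch of the labeling rule -- not Atlas source code.
def label_cluster(point_ids):
    # A one-point cluster is a singleton.
    if len(point_ids) == 1:
        return {point_ids[0]: "singleton"}
    # One arbitrary point is retained; the rest become deletion candidates.
    keep, *rest = point_ids
    labels = {keep: "retention candidate"}
    labels.update({pid: "deletion candidate" for pid in rest})
    return labels

print(label_cluster(["a"]))            # {'a': 'singleton'}
print(label_cluster(["a", "b", "c"]))  # 'a' retained, 'b' and 'c' marked for deletion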
Duplicate detection can be used in the web interface or via the API.
- Duplicate Detection in the Browser
- Duplicate Detection with API
Enabling duplicate detection
You must enable duplicate detection when you create a dataset. You can do this by checking Detect Duplicates in the web browser upload flow, or by setting duplicate_detection=True when uploading data through the Python client (see the code examples below).
Learn about the configuration options for duplicate detection in the API Reference.
Filtering duplicates
Click the filter tool in the selection pane and then select Duplicate Class. Duplicate detection assigns each datapoint to one of three categories (a programmatic filtering sketch follows the list):
- Deletion candidates: The set of points you can remove because they are near-duplicates of other points.
- Retention candidates: The set of points that are part of a duplicate cluster, each being the single point from its cluster chosen to be retained.
- Singletons: The set of datapoints that have no near-duplicates in the dataset.
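If you prefer to filter programmatically, the same categories are available through the duplicates dataframe exposed by the Python client (covered under Access duplicates with Python below). A minimal sketch, assuming a dataset named 'my-dataset':

from nomic import AtlasDataset

dataset = AtlasDataset('my-dataset')
map = dataset.maps[0]

# One row per datapoint, with its duplicate class and cluster ID.
df = map.duplicates.df
deletion_candidates = df[df['duplicate_class'] == 'deletion candidate']
print(f"{len(deletion_candidates)} deletion candidates out of {len(df)} datapoints")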
Deduplicating your dataset
Form a compound selection of the retention candidates and singleton datapoints, then download the selection; a programmatic equivalent is sketched below.
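Programmatically, the equivalent is to keep every row that is not a deletion candidate. Continuing the sketch above:

# Keep retention candidates and singletons; drop deletion candidates.
deduplicated = df[df['duplicate_class'] != 'deletion candidate']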
Exploring duplicate datapoints
Color your dataset by Duplicate Class and then combine it with the other selection tools.
Configuring duplicate detection
You can configure duplicate detection by changing the duplicate cluster cutoff threshold. Smaller thresholds result in duplicate clusters containing datapoints that are closer to exact matches. The default threshold is 0.1. See the API Reference for details.
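At the time of writing, the Python client exposes the cutoff through a NomicDuplicatesOptions object passed in place of the boolean flag; treat the exact import path and field names here as assumptions and confirm them against the API Reference:

from nomic import atlas
from nomic.data_inference import NomicDuplicatesOptions  # assumed import path; see API Reference

dataset = atlas.map_data(
    data=my_data,
    indexed_field='text',
    name='My Map',
    # A cutoff tighter than the 0.1 default groups only near-exact matches.
    duplicate_detection=NomicDuplicatesOptions(
        tag_duplicates=True,
        duplicate_cutoff=0.05,
    ),
)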
Duplicate detection in Atlas
To enable duplicate detection on your Atlas map, set the flag duplicate_detection=True when uploading your dataset:
from nomic import atlas

dataset = atlas.map_data(
    data=my_data,  # a list of records (dicts) or a pandas DataFrame
    indexed_field='text',
    name='My Map',
    duplicate_detection=True,
)
or
from nomic import AtlasDataset

dataset = AtlasDataset(
    name="My Map",
    unique_id_field="id",
)
dataset.add_data(my_data)  # my_data is a list of records (dicts) or a pandas DataFrame
dataset.create_index(
    indexed_field="text",
    duplicate_detection=True,
)
Access duplicates with Python
Atlas automatically groups embeddings that are sufficiently close into semantic clusters. You can use these clusters for semantic duplicate detection, allowing you to quickly deduplicate your data.
Access duplicate data
See how Atlas's duplicate detection system categorizes your data into duplicate classes: singletons, retention candidates, and deletion candidates.
from nomic import AtlasDataset
dataset = AtlasDataset('my-dataset')
map = dataset.maps[0]
# Retrieve dataframe containing all points, their duplicate class, and cluster ID.
print(map.duplicates.df)
Output:
index duplicate_class cluster_id
0 100001 singleton 16492
1 100011 singleton 16017
2 100016 singleton 7826
3 100020 singleton 5044
4 100030 retention candidate 412
... ... ... ...
19995 99721 singleton 9218
19996 99730 deletion candidate 371
19997 99918 singleton 13311
19998 99926 singleton 13725
19999 9997 singleton 3866
[20000 rows x 3 columns]
Clean data with deletion candidates
Directly access the deletion candidates and remove them from your dataset to improve your data quality.
# Get list of IDs to delete (all duplicate data not including retention candidates)
ids_to_delete = map.duplicates.deletion_candidates()
# Delete duplicate data from dataset, keeping one copy of each duplicate cluster
dataset.delete_data(ids_to_delete)
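Before deleting anything, you can sanity-check how much data the cleanup will remove. A small sketch reusing the dataframe from above:

# Preview the scale of the cleanup before committing to it.
df = map.duplicates.df
n_delete = (df['duplicate_class'] == 'deletion candidate').sum()
print(f"Removing {n_delete} of {len(df)} datapoints")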