
Duplicate detection

Duplicate detection allows you to detect, filter, and remove duplicate datapoints in your datasets.

The Atlas duplicate detection capability uses your dataset's embeddings to identify near-duplicate datapoints and then lets you take action on them.

Duplicate detection works by grouping datapoints into duplicate clusters and assigning each datapoint one of three categories. A cluster containing only one point is a singleton cluster, and its point is labeled a singleton. In each cluster with more than one point, one arbitrary datapoint is labeled a retention candidate and the rest are labeled deletion candidates.
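The clustering and labeling scheme above can be sketched as follows. This is an illustrative single-linkage grouping under a distance threshold, not the actual Atlas implementation; the function name and threshold semantics are assumptions for demonstration.

```python
# Illustrative sketch (not the Atlas implementation): group points whose
# embeddings fall within a distance threshold, then label each point as a
# singleton, retention candidate, or deletion candidate.
import numpy as np

def label_duplicates(embeddings, threshold=0.1):
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Union-find with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Single-linkage: join any pair of points closer than the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(embeddings[i] - embeddings[j]) < threshold:
                union(i, j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    labels = [None] * n
    for members in clusters.values():
        if len(members) == 1:
            labels[members[0]] = "singleton"
        else:
            # One (arbitrary) point is kept; the rest are flagged for deletion.
            labels[members[0]] = "retention candidate"
            for i in members[1:]:
                labels[i] = "deletion candidate"
    return labels

points = np.array([[0.0, 0.0], [0.05, 0.0], [1.0, 1.0]])
print(label_duplicates(points, threshold=0.1))
# ['retention candidate', 'deletion candidate', 'singleton']
```

The first two points fall within the threshold, so they form a duplicate cluster with one retention candidate; the third point is far from both and is labeled a singleton.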

Duplicate detection can be used in the web interface or via the API.

Enabling duplicate detection

You must enable duplicate detection on dataset creation. You can do this by checking Detect Duplicates in the web browser upload flow or by setting duplicate_detection = True when uploading data through the Python client.
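A minimal sketch of the Python-client path, assuming the `nomic` package and an `atlas.map_data` upload call; only the `duplicate_detection=True` flag is taken from this page, so check the API Reference for the exact function and parameter names in your client version. The upload itself is guarded behind a flag here since it requires a logged-in session.

```python
# Hypothetical sketch: enabling duplicate detection at dataset creation
# with the Nomic Atlas Python client. Function and parameter names other
# than duplicate_detection are assumptions; consult the API Reference.

documents = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "The quick brown fox jumps over the lazy dog!"},  # near-duplicate
    {"text": "An entirely different sentence."},
]

def upload(run_upload=False):
    if run_upload:
        from nomic import atlas  # requires `pip install nomic` and `nomic login`
        return atlas.map_data(
            data=documents,
            indexed_field="text",
            duplicate_detection=True,  # must be set at dataset creation
        )
    return None
```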

Learn about various configuration operations for duplicate detection in the API Reference.

Filtering duplicates

Click the filter tool in the selection pane and then select Duplicate Class. Duplicate detection assigns each datapoint to one of three categories:

  • Deletion candidates: The set of points you can remove because they are near-duplicates of other points.
  • Retention candidates: The set of points that are part of a duplicate cluster and are the single point from that cluster chosen to be retained.
  • Singletons: The set of datapoints that have no near-duplicates in the dataset.

Deduplicating your dataset

Form a compound selection of the retention candidates and singleton datapoints and then download the selection.
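The same compound selection can be reproduced offline once each downloaded row carries its duplicate-class label. This sketch assumes a `duplicate_class` field name, which is an illustrative choice rather than the documented column name:

```python
# Sketch: keep only retention candidates and singletons to deduplicate.
# The "duplicate_class" field name is an assumption for illustration.
rows = [
    {"text": "a", "duplicate_class": "retention candidate"},
    {"text": "a!", "duplicate_class": "deletion candidate"},
    {"text": "b", "duplicate_class": "singleton"},
]

deduplicated = [r for r in rows if r["duplicate_class"] != "deletion candidate"]
print([r["text"] for r in deduplicated])  # ['a', 'b']
```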


Exploring duplicate datapoints

Color your dataset by Duplicate Class and combine this view with other selection tools.

Configuring duplicate detection

You can configure duplicate detection by changing the duplicate cluster cutoff threshold. Smaller thresholds result in duplicate clusters containing datapoints that are closer to exact matches. The default threshold is 0.1. See API Reference for details.
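To build intuition for the threshold's effect, this toy sketch (not the server-side Atlas clustering) counts how many point pairs fall under a given distance cutoff; tightening the cutoff shrinks what qualifies as a near-duplicate:

```python
# Sketch of how the cutoff threshold changes what counts as a duplicate.
# Illustrative only; the actual Atlas clustering runs server-side.
import numpy as np

def count_duplicate_pairs(embeddings, threshold):
    n = len(embeddings)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if np.linalg.norm(embeddings[i] - embeddings[j]) < threshold
    )

points = np.array([[0.0], [0.05], [0.3]])
print(count_duplicate_pairs(points, threshold=0.1))  # 1 pair (points 0 and 1)
print(count_duplicate_pairs(points, threshold=0.5))  # all 3 pairs
```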