Unstructured data interface
Nomic Atlas introduces a revolutionary interface for working with unstructured data: The Atlas Map.
The Atlas Map plots your entire dataset on one screen and organizes it by meaning into clusters. Clusters on your Atlas Map contain datapoints that are semantically similar. With an Atlas Map, you can search, filter, and export data at scale.
All operations on the Atlas Map browser interface can be executed with the API.
Use cases
- Understand and find insights in your text, image, audio and video datasets.
- Cluster and auto-categorize your datasets.
- Discover outliers and anomalous regions in your data.
- Rapidly and collaboratively iterate on your dataset by removing undesired datapoints, applying tags and sharing insights.
Navigating the Atlas Map Interface
To better explain how to interact with an Atlas Map, let’s take a tour around a dataset of news articles. Click around the American News Sample map below or visit it in the browser.
An Atlas Map has the following properties:
- Points close to each other on the map are semantically similar or related.
  All news articles about sports are on the left side of the map. Inside the sports region, the map breaks down by type of sport, because news articles about one sport (e.g. football) are more similar to each other than to news articles about other sports (e.g. baseball).
- Numerical distances between 2D point positions do not have concrete meaning.
  For example, the fact that the Baseball and Football clubs news article clusters are adjacent signifies a relationship between Baseball and Football in the embedding space. You should not, however, make claims or draw conclusions using the Euclidean distance between points in the two clusters. Distance information is only meaningful in the ambient embedding space and can be retrieved with vector_search.
- Floating labels correspond to distinct topics in your data.
  For example, the Baseball cluster has the label 'Astros win Game 1 of World'. Labels are automatically determined from the textual contents of your data and are crucial for navigating the map. Learn more about how topic labels are generated.
- Topics have a hierarchy.
  Topics group your dataset into homogeneous regions. As you zoom in on the map, more granular versions of your dataset's topics will appear.
- Maps update as your data updates.
  When new data enters your dataset, Atlas can rebuild the map to reflect how the new data relates to existing data.
- Built for collaboration across technical and non-technical teams.
  All information and operations that are visually presented on an Atlas Map have a programmatic analog. For example, engineers can access cluster ids, topic information, and duplicate clusters, and run vector search, through the Python client.
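To make the point about distances concrete, here is a minimal sketch of ranking neighbors in the embedding space rather than by 2D map position. It does not call Atlas or its vector_search; the tiny four-dimensional vectors are made up, and cosine distance is an assumed metric chosen for illustration only.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up embeddings (real embeddings have hundreds of dimensions).
embeddings = {
    "baseball": [0.9, 0.1, 0.0, 0.2],
    "football": [0.8, 0.2, 0.1, 0.3],
    "election": [0.0, 0.9, 0.4, 0.1],
}

def nearest_neighbor(query_id, vectors):
    """Return the id of the vector closest to query_id in embedding space."""
    query = vectors[query_id]
    candidates = [key for key in vectors if key != query_id]
    return min(candidates, key=lambda key: cosine_distance(query, vectors[key]))

closest_to_baseball = nearest_neighbor("baseball", embeddings)
print(closest_to_baseball)
```

The comparison happens on the full vectors, so the result does not depend on where the points happen to land in the 2D projection.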
Search
Instantly search datasets with up to tens of millions of points in Atlas.
You can search over any column in your dataset and matches will display on your Atlas Map.
For more involved searches, you can layer the helper options below onto your query to match complete words only or to match case exactly. Regular expressions let you apply complex pattern matching to your search.
Search options
- Only match complete words
- Match case exactly
- Regular expression search
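As a rough local illustration of how these three options change what matches, the sketch below emulates them with Python's re module over a list of rows. This is not the Atlas search implementation; the helper names, the sample reviews, and the assumption that a plain search is a case-insensitive substring match are all hypothetical.

```python
import re

def build_pattern(query, whole_word=False, match_case=False, regex=False):
    """Compile a search pattern from a query and the three toggles."""
    # Treat the query as literal text unless regular expression search is on.
    body = query if regex else re.escape(query)
    if whole_word:
        # Require word boundaries on both sides of the match.
        body = r"\b(?:" + body + r")\b"
    flags = 0 if match_case else re.IGNORECASE
    return re.compile(body, flags)

def search(rows, column, query, **options):
    """Return the rows whose `column` value matches the query."""
    pattern = build_pattern(query, **options)
    return [row for row in rows if pattern.search(str(row[column]))]

# Made-up sample rows.
reviews = [
    {"id": 1, "text": "Great for hair removal"},
    {"id": 2, "text": "My hairstyling routine improved"},
    {"id": 3, "text": "Hair feels soft after one wash"},
    {"id": 4, "text": "Nice scent, lasts all day"},
]

plain = [r["id"] for r in search(reviews, "text", "hair")]
whole = [r["id"] for r in search(reviews, "text", "hair", whole_word=True)]
cased = [r["id"] for r in search(reviews, "text", "hair", match_case=True)]
styled = [r["id"] for r in search(reviews, "text", r"hair\s?styl\w+", regex=True)]
print(plain, whole, cased, styled)
```

Matching complete words drops "hairstyling", matching case drops the capitalized "Hair", and the regular expression targets only styling-related wording.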
Example: Search over beauty reviews
On an Atlas map of Amazon Beauty Reviews, searching for the keyword “hair” highlights at least two large areas of the map that contain the word “hair.”
Zooming in shows that the cluster on the left side is mostly composed of reviews related to shaving and hair removal products, while the cluster on the right side has reviews related to hairstyling.
Check out the Atlas map on Amazon Beauty Reviews yourself and try running a search!
Atlas Map on Amazon Beauty Reviews
Below: Above map zoomed in on right-most cluster for "hair"
Metadata filters
Apply metadata filters to get new views of, and greater insight into, your dataset. Slicing by timestamp lets you see how topics change over time. You can filter on any numerical metadata value, such as sentiment, temperature, price, or score.
Example: Filter a dataset of TikTok videos
The example below shows the same map, before and after applying a filter. This map uses a dataset of TikTok videos from a one-week span in 2023. In this dataset, metadata such as like count and play count were collected along with the videos.
If you are interested in popular videos, you can filter your map to view only datapoints above a certain threshold. In the map below, the data was filtered to videos with like counts above 1M. As the data sidebar shows, only 10 of the 39k videos surpassed 1M likes.
Check out the TikTok map shown below and try applying your own filters.
Original map on sample of a week of TikTok videos
Below: Above map filtered for videos with more than 1M likes
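The same kind of threshold and time-range filter can be sketched locally in plain Python. This is only an illustration of the logic, not the Atlas filter API; the field names like_count and posted, the sample rows, and the dates are made up.

```python
from datetime import datetime

def filter_rows(rows, field, above=None, below=None):
    """Keep rows where above < row[field] < below; a bound of None is ignored."""
    kept = []
    for row in rows:
        value = row[field]
        if above is not None and not value > above:
            continue
        if below is not None and not value < below:
            continue
        kept.append(row)
    return kept

# Made-up sample rows standing in for video metadata.
videos = [
    {"id": "a", "like_count": 2_400_000, "posted": datetime(2023, 5, 1)},
    {"id": "b", "like_count": 1_000_000, "posted": datetime(2023, 5, 2)},
    {"id": "c", "like_count": 15_000, "posted": datetime(2023, 5, 3)},
    {"id": "d", "like_count": 1_200_000, "posted": datetime(2023, 5, 5)},
    {"id": "e", "like_count": 300, "posted": datetime(2023, 5, 6)},
]

# Numerical filter: videos with strictly more than 1M likes.
popular_ids = [v["id"] for v in filter_rows(videos, "like_count", above=1_000_000)]

# Timestamp slice: videos posted strictly between May 1 and May 5.
midweek_ids = [
    v["id"]
    for v in filter_rows(
        videos, "posted", above=datetime(2023, 5, 1), below=datetime(2023, 5, 5)
    )
]
print(popular_ids, midweek_ids)
```

Because the bounds are strict, a video with exactly 1,000,000 likes is excluded, which matches the "more than 1M likes" wording.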
Lasso and tagging
The Lasso tool allows you to select points on the map by circling them with your mouse. Lassoing can be a part of your data pipeline as you find, select, tag, and clean your data.
Example: Identifying an outlier cluster from a news dataset
In the news dataset example below, we can use Lasso to select an outlier cluster.
On inspection, we see an area of the map containing points related to betting and casinos. Let's say we don’t want to include these points in our news analysis.
To tag these points using the Lasso tool:
- Select the lasso function under “Selection Tools.”
- Draw an outline on the map which captures the points of interest.
- On the data sidebar, click “+tag all” and add the name of the tag you want to apply to all lassoed points.
- Your points are now tagged!
See the API reference or the data tagging walkthrough to learn how to use your tags in Python to clean your data.
Atlas News Map zoomed into betting and gambling area
Video: Example of lasso tool used to tag Atlas map
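Conceptually, a lasso selection is a point-in-polygon test followed by a tagging step. The sketch below shows one common way to do that (ray casting) on made-up 2D coordinates. It is an assumption for illustration, not a description of how Atlas implements the Lasso tool, and the point ids and the "betting" tag are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Made-up map positions and a square lasso outline.
points = {"p1": (1.0, 1.0), "p2": (3.0, 3.0), "p3": (0.5, 1.5), "p4": (2.5, 0.5)}
lasso = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]

# Tag every point that falls inside the lasso.
tags = {}
for point_id, (x, y) in points.items():
    if point_in_polygon(x, y, lasso):
        tags.setdefault(point_id, set()).add("betting")

# Cleaning step: keep only the points that did not receive the tag.
kept_ids = sorted(pid for pid in points if "betting" not in tags.get(pid, set()))
print(sorted(tags), kept_ids)
```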
Duplicate detection
Duplicate detection in Atlas streamlines your data by identifying and consolidating duplicate entries. This tool ensures data accuracy and integrity, enabling cleaner datasets for more reliable analysis. Use the tool in the browser to find your duplicate datapoints.
Visual configuration
You can customize the color scheme and point sizes on your map in View Settings.
You can color by existing columns in your data. Depending on your metadata, you might be able to color a news map, for example, by language, news outlet name, country of origin, or number of views. Coloring works for both categorical and numerical data types.
To color datapoints by topic clusters, you can color by Nomic Topic: 1/2/3. Depth level 1 is most general and depth level 3 is the most specific. This can give you a clearer view of the divisions and overlap between different topics in your data.
The legend on your map shows which labels correspond to the colors of the points. If the field you color by is one of the Nomic Topic depth levels, then the labels in the legend and on the map will be the topic labels themselves.
Adjust the point size to whatever works for you. The right point size can better highlight the structures in the map and help you identify outliers or color patterns more quickly.
Point positioning
By default, all points are positioned using our own projection algorithm, Nomic Project. However, Atlas allows you to reposition points using alternative positioning schemes.
Combining point repositioning with selection filters allows for more precise data selection. For example, make a lasso in one positioning scheme and then switch to another to see the selected points in a different context.
X-Y positioning
To use an alternative X-Y coordinate positioning scheme, you must include a pair of named X and Y columns in your dataset. For example, if you want a position scheme called MyPosition, you would include any of the following pairs of columns in your dataset:
- MyPosition_X and MyPosition_Y (x and y are case-insensitive)
- MyPosition-X and MyPosition-Y
- MyPosition.X and MyPosition.Y
- MyPosition X and MyPosition Y
This will appear as "MyPosition XY" in the "Position Mode" dropdown. Multiple X-Y pairs can be used in the same dataset, and you can switch between them in the dropdown.
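Before uploading, it can help to check which position schemes your column names would produce. The following is a minimal sketch of the naming convention described above, not Atlas's actual detection logic. The function name is hypothetical, and it assumes that both columns of a pair must use the same separator.

```python
import re

# A base name, one separator ("_", "-", ".", or a space), then X or Y in any case.
AXIS_COLUMN = re.compile(r"^(?P<name>.+)(?P<sep>[_\-. ])(?P<axis>[xXyY])$")

def find_xy_schemes(columns):
    """Map a label such as 'MyPosition XY' to its (x_column, y_column) pair."""
    axes_by_scheme = {}
    for column in columns:
        match = AXIS_COLUMN.match(column)
        if match is None:
            continue
        # Assumption: a pair is only formed by columns sharing name and separator.
        key = (match["name"], match["sep"])
        axes_by_scheme.setdefault(key, {})[match["axis"].lower()] = column
    return {
        f"{name} XY": (axes["x"], axes["y"])
        for (name, _sep), axes in axes_by_scheme.items()
        if "x" in axes and "y" in axes
    }

columns = ["id", "text", "MyPosition_X", "MyPosition_y", "umap.x", "umap.y", "orphan-X"]
schemes = find_xy_schemes(columns)
print(schemes)
```

A column such as orphan-X is ignored here because it has no matching Y column.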
Geospatial positioning
To use geospatial positioning, you must include a pair of Latitude and Longitude columns in your dataset. These can be lat and lon, or latitude and longitude.
Unlike X-Y positioning, geospatial positioning only supports one pair of latitude and longitude columns in a dataset for now.
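A quick pre-upload check for geospatial columns might look like the sketch below. It is an illustration only: the helper names are hypothetical, column names are matched exactly as written above (whether Atlas accepts other capitalizations is not stated here), and the range check uses the standard bounds of -90 to 90 for latitude and -180 to 180 for longitude.

```python
# Column-name pairs taken from the text above.
GEO_PAIRS = (("lat", "lon"), ("latitude", "longitude"))

def find_geo_columns(columns):
    """Return the first (latitude_column, longitude_column) pair present, else None."""
    present = set(columns)
    for lat_name, lon_name in GEO_PAIRS:
        if lat_name in present and lon_name in present:
            return lat_name, lon_name
    return None

def invalid_coordinates(rows, lat_column, lon_column):
    """Return the rows whose coordinates fall outside valid ranges."""
    return [
        row
        for row in rows
        if not (-90 <= row[lat_column] <= 90 and -180 <= row[lon_column] <= 180)
    ]

# Made-up sample rows.
rows = [
    {"id": 1, "lat": 40.7, "lon": -74.0},
    {"id": 2, "lat": 95.0, "lon": 10.0},   # latitude out of range
    {"id": 3, "lat": -33.9, "lon": 151.2},
]

geo_columns = find_geo_columns(rows[0].keys())
bad_ids = [row["id"] for row in invalid_coordinates(rows, *geo_columns)]
print(geo_columns, bad_ids)
```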