Map Resources
Atlas generates a variety of useful resources from your data, such as embeddings, 2D projections, and topic labels, which you can use for analysis and integration into your applications.
Dataset
Your Atlas Dataset is the collection of data you upload to Atlas, from which all other resources are generated.
You can access the dataset corresponding to an Atlas Map with atlas_map.data.
from nomic import AtlasDataset
dataset = AtlasDataset('my-dataset')
atlas_map = dataset.maps[0]
df = atlas_map.data.df # use atlas_map.data.tb to load as an Arrow table
AtlasMapData API Reference
class AtlasMapData()
Atlas Map Data (Metadata) State. This is how you can access text and other associated metadata columns you uploaded with your project.
df
@property
def df() -> pd.DataFrame
A pandas DataFrame associating each datapoint on your map to its metadata. Converting to a pandas DataFrame may materialize a large amount of data into memory.
tb
@property
def tb() -> pa.Table
Pyarrow table associating each datapoint on the map to its metadata columns. This table is memmapped from the underlying files and is the most efficient way to access metadata information.
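For large maps, it can be cheaper to work with the Arrow table and materialize only the columns you need. A minimal sketch (the 'text' column name is an assumption; check the schema for your own columns):
tb = atlas_map.data.tb  # memory-mapped Arrow table
print(tb.schema)  # inspect which metadata columns your map has
# materialize a single (assumed) column instead of the whole table
texts = tb.column('text').to_pylist()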
Embeddings
Access your dataset's embeddings with atlas_map.embeddings.
atlas_map.embeddings.latent contains the high-dimensional embeddings produced by a Nomic Embedding Model.
atlas_map.embeddings.projected contains the 2D-reduced version. These are the positions you see on your Atlas Map in the web browser.
This example shows how to access high- and low-dimensional embeddings of your data generated and stored by Atlas.
from nomic import AtlasDataset
dataset = AtlasDataset('my-dataset')
atlas_map = dataset.maps[0]
# projected embeddings are your 2D embeddings
projected_embeddings = atlas_map.embeddings.projected
# latent embeddings are your high-dim vectors
latent_embeddings = atlas_map.embeddings.latent
AtlasMapEmbeddings API Reference
class AtlasMapEmbeddings()
Atlas Embeddings State
Access latent (high-dimensional) and projected (two-dimensional) embeddings of your datapoints.
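Because the latent embeddings are row-aligned with your datapoints, you can use them directly for similarity computations. Below is a minimal cosine-similarity sketch in plain NumPy (illustrative only, not an Atlas API; it assumes latent loads as a NumPy array):
import numpy as np
latent = atlas_map.embeddings.latent  # shape: (num_datapoints, embedding_dim)
query = latent[0]  # use the first datapoint as the query
# cosine similarity between the query and every embedding
sims = (latent @ query) / (np.linalg.norm(latent, axis=1) * np.linalg.norm(query))
top_5 = np.argsort(-sims)[:5]  # indices of the 5 most similar datapoints
print(top_5)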
Topics
Directly access your Atlas-generated topics with atlas_map.topics.df. This dataframe includes depth-unique ids and human-interpretable descriptions for your dataset's topics.
Atlas topic models are hierarchical. As you zoom in and out in the Atlas in-browser data map explorer, you will see topics appear and disappear at different zoom levels.
The topics that Atlas generates behind the scenes are directly accessible via Python, including information about the topic hierarchy and topic density. Topic information can be used in downstream pipelines like visualization, analysis, and prediction.
Your topic information can be accessed via the map.topics attribute of the AtlasDataset:
from nomic import AtlasDataset
map = AtlasDataset('my-dataset').maps[0]
# Pandas df of your data with columns ID, topic_depth_n, topic_depth_n+1, etc.
print(map.topics.df)
id_ topic_depth_1 topic_depth_2 topic_depth_3
0 +Bw Baby, Ray, Sunglasses Apparel T-Shirts (2)
1 fHM Phone Protector Music Genre Blues Music
2 9Ts Lighting Replacement Years Hyundai Engines
3 6mU Women's Fashion (3) Footwear (14) Women's Sandals
4 8j8 Women's Fashion (3) Tops, Shirts, Shirt Women's Tops (2)
... ... ... ... ...
117238 GRs Electronics (5) Smartphones (3) Computer Peripherals
117239 AULT Electronics (5) Computer Hardware (2) Computer Upgrades
117240 P0U Electronics (5) Computer Hardware (2) Computer Hardware
117241 AWnV Electronics (5) Computer Hardware (2) Computer Hardware
117242 5Vg Electronics (5) Computer Hardware (2) Computer Hardware
[117243 rows × 4 columns]
Pandas dataframe where each row corresponds to a unique topic. Metadata associated with each topic includes:
- topic depth
- a human-readable topic description (topic label)
- identifying keywords that differentiate the topic from other topics
# Returns a Pandas df
print(map.topics.metadata)
depth topic_id topic_depth_1 topic_description topic_short_description topic_depth_2 topic_depth_3
0 1 0 Women's Fashion (3) women/tops/dress/sandals/womens/casual/shoes/p... Women's Fashion (3) NaN NaN
1 1 1 Electronics (5) USB/Bluetooth/iPhone/charging/Intel/cable/HDMI... Electronics (5) NaN NaN
2 1 2 Jewelry Collection (2) jewelry/IceCarats/Jewelry/Type/ICECARATS/Sterl... Jewelry Collection (2) NaN NaN
3 1 3 Phone Protector phone/Galaxy/Samsung/dogs/Watch/protector/scre... Phone Protector NaN NaN
4 1 4 Pool Supplies Pool/pool/Floats/chair/Brand/Amazon/Lathe/floa... Pool Supplies NaN NaN
... ... ... ... ... ... ... ...
605 3 507 Lighting Replacement hose/garden/Garden/watering/Hose/ft/plants/Jet... Garden Hose Plumbing S... Garden Hose
606 3 508 Lighting Replacement Rate/9930/gallons/207/months/38℃/125PSI/GPM/34... Water Pump Plumbing S... Water Pump
607 3 509 Lighting Replacement NPT/¼/½/PSI/Pump/Straight/tire/pump/12V/Connec... Tire Pump Plumbing S... Tire Pump
608 3 510 Lighting Replacement drain/Drain/sink/pipe/Sink/stopper/steel/toile... Plumbing Fixtures Plumbing S... Plumbing Fixtures
609 3 511 Lighting Replacement shower/water/Shower/filter/solar/fountain/head... Bathroom Essentials Plumbing S... Bathroom Essentials
[610 rows × 7 columns]
The topic hierarchy branches from the most general topics down to sub-topics at different depths.
# map.topics.hierarchy is a dict
hierarchy = map.topics.hierarchy
print(f'Your depth 1 (most general) topics are: {hierarchy.keys()}')
Your depth 1 (most general) topics are: dict_keys([
("Women's Fashion (3)", 1),
('Electronics (5)', 1),
('Jewelry Collection (2)', 1),
...
])
You can use higher-level topic keys to access lower-level topics in your hierarchy.
import random
# List the subtopics in a random top-level topic
random_topic_1 = random.choice(list(hierarchy.keys()))
print(f'The general topic {random_topic_1} contains subtopics {hierarchy[random_topic_1]}')
The general topic ('Footwear (14)', 2) contains subtopics [
'Shoes (3)', 'Sandal', 'Sneaker Culture', ..., "Women's Sandals"
]
By providing a specific level of the topic hierarchy, you get a list of dictionaries where each item is a distinct topic at that level. Keys for each topic include subtopics, subtopic_ids, topic_id, topic_short_description, topic_long_description, and datum_ids.
your_depth_level = 2
print(map.topics.group_by_topic(your_depth_level)[0])
{
'subtopics': ['Miscellaneous (3)'],
'subtopic_ids': [87],
'topic_id': 16,
'topic_short_description': 'Audio Equipment (3)',
'topic_long_description': 'Bluetooth/speaker/Speaker/music/CarPlay/MP3/prevention/bluetooth/stereo/sound/karaoke/Loss/Radio/⭐/radio',
'datum_ids': {'61c', '/WM', 'Rsw', 'q6I', ..., 'AVjU'}
}
AtlasMapTopics API Reference
class AtlasMapTopics()
Atlas Topics State
df
@property
def df() -> pd.DataFrame
A pandas DataFrame associating each datapoint on your map to its topics at each topic depth.
tb
@property
def tb() -> pa.Table
Pyarrow table associating each datapoint on the map to its Atlas-assigned topics. This table is memmapped from the underlying files and is the most efficient way to access topic information.
metadata
@property
def metadata() -> pd.DataFrame
Pandas DataFrame where each row gives metadata for all map topics, including:
- topic id
- a human-readable topic description (topic label)
- identifying keywords that differentiate the topic from other topics
hierarchy
@property
def hierarchy() -> Dict
A dictionary that allows iteration of the topic hierarchy. Each key is a (topic label, topic depth) tuple mapping to its direct sub-topics. If a topic is not a key in the hierarchy, it is a leaf in the topic hierarchy.
group_by_topic
def group_by_topic(topic_depth: int = 1) -> List[Dict]
Associates topics at a given depth in the topic hierarchy to the identifiers of their contained datapoints.
Arguments:
topic_depth
- Topic depth to group datums by.
Returns:
List of dictionaries where each dictionary contains next depth subtopics, subtopic ids, topic_id, topic_short_description, topic_long_description, and list of datum_ids.
get_topic_density
def get_topic_density(time_field: str, start: datetime, end: datetime)
Computes the density/frequency of topics in a given interval of a timestamp field.
Useful for answering questions such as:
- What topics increased in prevalence between December and January?
Arguments:
time_field
- Your metadata field containing isoformat timestamps
start
- A datetime object for the window start
end
- A datetime object for the window end
Returns:
A list of {topic, count} dictionaries, sorted from largest count to smallest count.
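For example, here is a sketch of counting topics over a one-month window (the 'timestamp' field name is hypothetical; substitute whichever metadata field holds your isoformat timestamps):
from datetime import datetime
density = map.topics.get_topic_density(
    time_field='timestamp',  # hypothetical metadata field with isoformat timestamps
    start=datetime(2023, 12, 1),
    end=datetime(2024, 1, 1),
)
# density is a list of {topic, count} dicts, sorted from largest to smallest count
for entry in density[:5]:
    print(entry)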
vector_search_topics
def vector_search_topics(queries: np.ndarray,
k: int = 32,
depth: int = 3) -> Dict
Given an embedding, returns a normalized distribution over topics.
Useful for answering questions such as:
- What topic does my new datapoint belong to?
- Does my datapoint belong to the "Dog" topic or the "Cat" topic?
Arguments:
queries
- a 2d NumPy array where each row corresponds to a query vector
k
- (Default 32) the number of neighbors to use when estimating the posterior
depth
- (Default 3) the topic depth at which you want to search
Returns:
A dict mapping {topic: posterior probability} for each query.
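For example, you can embed a new piece of text and ask which topics it falls under. A sketch using the Nomic embedding API (it assumes your map was indexed with nomic-embed-text-v1.5, so the query embedding matches the latent dimension; the query text is made up):
import numpy as np
from nomic import embed
# embed a hypothetical new datapoint
output = embed.text(texts=['wireless bluetooth headphones'], model='nomic-embed-text-v1.5')
queries = np.array(output['embeddings'])  # shape: (1, embedding_dim)
# normalized distribution over depth-2 topics for the query
distribution = map.topics.vector_search_topics(queries=queries, k=32, depth=2)
print(distribution)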
Duplicate Detection
Duplicate detection is enabled by default when creating an Atlas map using the Nomic Atlas Python SDK:
from nomic import AtlasDataset
dataset_identifier = "my-dataset" # for my-dataset in the organization connected to your Nomic API key
# dataset_identifier = "<ORG_NAME>/my-dataset" # for my-dataset in other organizations you are a member of
atlas_dataset = AtlasDataset(dataset_identifier)
atlas_dataset.add_data(my_data)
# Duplicate detection runs by default
atlas_dataset.create_index(indexed_field="text")
While duplicate detection runs by default, you can fine-tune its behavior by providing NomicDuplicatesOptions to the create_index method. The main parameter you might adjust is the duplicate_cutoff threshold. Smaller thresholds result in duplicate clusters containing datapoints that are closer to exact matches. The default threshold is 0.1.
from nomic import AtlasDataset, NomicDuplicatesOptions
# ... (Dataset creation and data addition as above) ...
atlas_dataset.create_index(
indexed_field="text",
duplicate_detection=NomicDuplicatesOptions(
tag_duplicates=True,
duplicate_cutoff=0.2, # Adjust the similarity threshold
)
)
Directly access your Atlas-detected duplicate datapoint clusters with atlas_map.duplicates.df.
Every datapoint is assigned a duplicate cluster_id. Two points share a cluster_id if Atlas identified them as semantic duplicates based on latent-space properties.
from nomic import AtlasDataset
atlas_dataset = AtlasDataset(dataset_identifier)
atlas_map = atlas_dataset.maps[0]
duplicate_info_df = atlas_map.duplicates.df
print(duplicate_info_df)
index duplicate_class cluster_id
0 100001 singleton 16492
1 100011 singleton 16017
2 100016 singleton 7826
3 100020 singleton 5044
4 100030 retention candidate 412
... ... ... ...
19995 99721 singleton 9218
19996 99730 deletion candidate 371
19997 99918 singleton 13311
19998 99926 singleton 13725
19999 9997 singleton 3866
You can take the deletion candidates from the duplicate detection results and curate a new dataset without them.
# Get your old dataframe
old_df = atlas_map.data.df
# Get a list of IDs for datapoints marked as deletion candidates
# We will keep only singletons and retention candidates
ids_to_remove = atlas_map.duplicates.deletion_candidates()
if ids_to_remove:
new_df = old_df[~old_df['id'].isin(ids_to_remove)]
new_identifier = dataset_identifier + "-without-duplicates"
new_dataset = AtlasDataset(
new_identifier,
description=f'Map of {dataset_identifier} without duplicates',
)
new_dataset.add_data(new_df)
new_dataset.create_index(
indexed_field="text"
)
AtlasMapDuplicates API Reference
class AtlasMapDuplicates()
Atlas Duplicate Clusters State. Atlas can automatically group embeddings that are sufficiently close into semantic clusters. You can use these clusters for semantic duplicate detection, allowing you to quickly deduplicate your data.
df
@property
def df() -> pd.DataFrame
Pandas DataFrame mapping each data point to its cluster of semantically similar points.
tb
@property
def tb() -> pa.Table
Pyarrow table with information about duplicate clusters and candidates. This table is memmapped from the underlying files and is the most efficient way to access duplicate information.
deletion_candidates
def deletion_candidates() -> List[str]
Returns:
The ids for all data points which are semantic duplicates and are candidates for being deleted from the dataset. If you remove these data points from your dataset, your dataset will be semantically deduplicated.
Tags
Users can tag data in Atlas for workflows like data cleaning or curating training data for a classification model.
From Python, to get the data that has been given a tag, use atlas_map.tags.get_datums_in_tag(tag_name).
For example, here is how to get a map's data that has been tagged 'sports':
from nomic import AtlasDataset
dataset = AtlasDataset('my-dataset')
atlas_map = dataset.maps[0]
sports_ids = atlas_map.tags.get_datums_in_tag('sports')
sports_data = dataset.get_data(sports_ids)
Here is how to get all existing map tags and the number of points in each with atlas_map.tags.get_tags():
from nomic import AtlasDataset
dataset = AtlasDataset('my-dataset')
atlas_map = dataset.maps[0]
tags = atlas_map.tags.get_tags()
for tag in tags:
tag_ids = atlas_map.tags.get_datums_in_tag(tag['tag_name'])
print(tag['tag_name'], "Count:", len(tag_ids))
For more on using tags in a pipeline, check out the data curation walkthrough.
AtlasMapTags API Reference
class AtlasMapTags()
Atlas Map Tag State. You can manipulate tags by filtering over the associated pandas DataFrame.
df
@property
def df(overwrite: bool = False) -> pd.DataFrame
Pandas DataFrame mapping each data point to its tags.
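For example, you can filter the DataFrame down to the points carrying a particular tag. A minimal sketch (it assumes each tag appears as a boolean column named after the tag; inspect the columns for your map first):
tags_df = atlas_map.tags.df
print(tags_df.columns)  # see which tag columns exist for your map
sports_rows = tags_df[tags_df['sports']]  # assumed boolean column per tag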
get_tags
def get_tags() -> List[Dict[str, str]]
Retrieves all tags made in the web browser for a specific map. Each tag is a dictionary containing tag_name, tag_id, and metadata.
Returns:
A list of tags a user has created for the projection.
get_datums_in_tag
def get_datums_in_tag(tag_name: str, overwrite: bool = False)
Returns the datum ids in a given tag.
Arguments:
tag_name
- The name of the tag to fetch datums for.
overwrite
- If True, re-downloads the tag. Otherwise, checks whether an up-to-date tag already exists.
Returns:
List of datum ids.