Map Resources

Atlas generates a variety of useful resources from your data, such as embeddings, 2D projections, and topic labels, which you can use for analysis and integration into your applications.

Dataset

Your Atlas Dataset is the collection of data you upload to Atlas, from which all other resources are generated.

You can access the dataset corresponding to an Atlas Map with atlas_map.data.

from nomic import AtlasDataset

dataset = AtlasDataset('my-dataset')

atlas_map = dataset.maps[0]

df = atlas_map.data.df # use atlas_map.data.tb to load as an Arrow table

AtlasMapData API Reference

class AtlasMapData()

Atlas Map Data (Metadata) State. This is how you can access text and other associated metadata columns you uploaded with your dataset.


df

@property
def df() -> pd.DataFrame

A pandas DataFrame associating each datapoint on your map with its metadata. Converting to a pandas DataFrame may materialize a large amount of data into memory.


tb

@property
def tb() -> pa.Table

Pyarrow table associating each datapoint on the map with its metadata columns. This table is memmapped from the underlying files and is the most efficient way to access metadata information.
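
For example, a minimal sketch of reading one column through the Arrow table without materializing the full DataFrame (the 'text' column name is a hypothetical example):

# Read a single metadata column from the memory-mapped Arrow table
# ('text' is a hypothetical column name; use one from your dataset)
texts = atlas_map.data.tb.column('text').to_pylist()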

Embeddings

Access your dataset's embeddings with atlas_map.embeddings.

atlas_map.embeddings.latent contains the high-dimensional embeddings produced by a Nomic Embedding Model.

atlas_map.embeddings.projected contains their 2D projections. These are the positions you see on your Atlas Map in the web browser.

This example shows how to access high- and low-dimensional embeddings of your data generated and stored by Atlas.

from nomic import AtlasDataset

dataset = AtlasDataset('my-dataset')

atlas_map = dataset.maps[0]

# projected embeddings are your 2D embeddings
projected_embeddings = atlas_map.embeddings.projected

# latent embeddings are your high-dim vectors
latent_embeddings = atlas_map.embeddings.latent
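
As a minimal sketch, assuming latent_embeddings behaves like a 2D array of shape (num_datapoints, embedding_dim), you can use the latent vectors directly for similarity computations:

import numpy as np

# Normalize the latent vectors and find the 5 nearest neighbors
# of the first datapoint by cosine similarity
vectors = np.asarray(latent_embeddings, dtype=np.float32)
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarities = vectors @ vectors[0]
print(np.argsort(-similarities)[:5])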

AtlasMapEmbeddings API Reference

class AtlasMapEmbeddings()

Atlas Embeddings State

Access latent (high-dimensional) and projected (two-dimensional) embeddings of your datapoints.

Topics

Directly access your Atlas-generated topics with atlas_map.topics.df. This dataframe includes depth-unique ids and human-interpretable descriptions for your dataset's topics.

Atlas topic models are hierarchical. As you zoom in and out in the Atlas in-browser data map explorer, you will see topics appear and disappear at different zoom levels.

The topics Atlas generates behind the scenes are directly accessible via Python, including information about the topic hierarchy and topic density. You can use topic information in downstream pipelines like visualization, analysis, and prediction.

Your topic information can be accessed via the topics attribute of an Atlas map:

from nomic import AtlasDataset

map = AtlasDataset('my-dataset').maps[0]
# Pandas df of your data with columns ID, topic_depth_n, topic_depth_n+1, etc.
print(map.topics.df)
         id_   topic_depth_1          topic_depth_2          topic_depth_3
0        +Bw   Baby, Ray, Sunglasses  Apparel                T-Shirts (2)
1        fHM   Phone Protector        Music Genre            Blues Music
2        9Ts   Lighting Replacement   Years                  Hyundai Engines
3        6mU   Women's Fashion (3)    Footwear (14)          Women's Sandals
4        8j8   Women's Fashion (3)    Tops, Shirts, Shirt    Women's Tops (2)
...      ...   ...                    ...                    ...
117238   GRs   Electronics (5)        Smartphones (3)        Computer Peripherals
117239   AULT  Electronics (5)        Computer Hardware (2)  Computer Upgrades
117240   P0U   Electronics (5)        Computer Hardware (2)  Computer Hardware
117241   AWnV  Electronics (5)        Computer Hardware (2)  Computer Hardware
117242   5Vg   Electronics (5)        Computer Hardware (2)  Computer Hardware

[117243 rows × 4 columns]
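
Because this is an ordinary pandas DataFrame, you can filter it directly. For example, a sketch selecting every datapoint assigned to one of the depth-1 topics shown above:

# Select all datapoints whose most general topic is 'Electronics (5)'
topics_df = map.topics.df
electronics = topics_df[topics_df['topic_depth_1'] == 'Electronics (5)']
print(len(electronics))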

map.topics.metadata is a pandas DataFrame where each row corresponds to a unique topic. Metadata associated with each topic includes:

  • topic depth
  • a human-readable topic description (topic label)
  • identifying keywords that differentiate the topic from other topics
# Returns a Pandas df
print(map.topics.metadata)
     depth  topic_id  topic_depth_1           topic_description                                  topic_short_description  topic_depth_2  topic_depth_3
0        1         0  Women's Fashion (3)     women/tops/dress/sandals/womens/casual/shoes/p...  Women's Fashion (3)      NaN            NaN
1        1         1  Electronics (5)         USB/Bluetooth/iPhone/charging/Intel/cable/HDMI...  Electronics (5)          NaN            NaN
2        1         2  Jewelry Collection (2)  jewelry/IceCarats/Jewelry/Type/ICECARATS/Sterl...  Jewelry Collection (2)   NaN            NaN
3        1         3  Phone Protector         phone/Galaxy/Samsung/dogs/Watch/protector/scre...  Phone Protector          NaN            NaN
4        1         4  Pool Supplies           Pool/pool/Floats/chair/Brand/Amazon/Lathe/floa...  Pool Supplies            NaN            NaN
..     ...       ...  ...                     ...                                                ...                      ...            ...
605      3       507  Lighting Replacement    hose/garden/Garden/watering/Hose/ft/plants/Jet...  Garden Hose              Plumbing S...  Garden Hose
606      3       508  Lighting Replacement    Rate/9930/gallons/207/months/38/125PSI/GPM/34...   Water Pump               Plumbing S...  Water Pump
607      3       509  Lighting Replacement    NPT/¼/½/PSI/Pump/Straight/tire/pump/12V/Connec...  Tire Pump                Plumbing S...  Tire Pump
608      3       510  Lighting Replacement    drain/Drain/sink/pipe/Sink/stopper/steel/toile...  Plumbing Fixtures        Plumbing S...  Plumbing Fixtures
609      3       511  Lighting Replacement    shower/water/Shower/filter/solar/fountain/head...  Bathroom Essentials      Plumbing S...  Bathroom Essentials

[610 rows × 7 columns]
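
You can use this metadata to look up the keywords that differentiate a topic from its neighbors. A sketch using the columns shown above:

# Fetch the identifying keywords for a topic by its label
meta = map.topics.metadata
row = meta[meta['topic_short_description'] == 'Pool Supplies']
print(row['topic_description'].iloc[0])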

The topic hierarchy branches from the most general topics down to sub-topics at different depths.

# map.topics.hierarchy is a dict

hierarchy = map.topics.hierarchy
print(f'Your depth 1 (most general) topics are: {hierarchy.keys()}')
Your depth 1 (most general) topics are: dict_keys([
("Women's Fashion (3)", 1),
('Electronics (5)', 1),
('Jewelry Collection (2)', 1),
...
])

You can use higher-level topic keys to access lower-level topics in your hierarchy.

import random
# List the subtopics of a random non-leaf topic
random_topic_1 = random.choice(list(hierarchy.keys()))
print(f'The general topic {random_topic_1} contains subtopics {hierarchy[random_topic_1]}')
The general topic ('Footwear (14)', 2) contains subtopics [
'Shoes (3)', 'Sandal', 'Sneaker Culture', ..., "Women's Sandals"
]

By passing a depth level of the topic hierarchy to group_by_topic, you get a list of dictionaries where each item is a distinct topic at that level.

Keys for that topic include subtopics, subtopic_ids, topic_id, topic_short_description, topic_long_description, and datum_ids.

your_depth_level = 2
print(map.topics.group_by_topic(your_depth_level)[0])
{
'subtopics': ['Miscellaneous (3)'],
'subtopic_ids': [87],
'topic_id': 16,
'topic_short_description': 'Audio Equipment (3)',
'topic_long_description': 'Bluetooth/speaker/Speaker/music/CarPlay/MP3/prevention/bluetooth/stereo/sound/karaoke/Loss/Radio/⭐/radio',
'datum_ids': {'61c', '/WM', 'Rsw', 'q6I', ..., 'AVjU'}
}
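
For example, the datum_ids key lets you count how many datapoints fall into each topic at a given depth:

# Count datapoints per depth-2 topic
for topic in map.topics.group_by_topic(2):
    print(topic['topic_short_description'], len(topic['datum_ids']))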

AtlasMapTopics API Reference

class AtlasMapTopics()

Atlas Topics State


df

@property
def df() -> pd.DataFrame

A pandas DataFrame associating each datapoint on your map with its topics at each topic depth.


tb

@property
def tb() -> pa.Table

Pyarrow table associating each datapoint on the map with its Atlas-assigned topics. This table is memmapped from the underlying files and is the most efficient way to access topic information.


metadata

@property
def metadata() -> pd.DataFrame

Pandas DataFrame where each row gives metadata for all map topics, including:

  • topic id
  • a human readable topic description (topic label)
  • identifying keywords that differentiate the topic from other topics

hierarchy

@property
def hierarchy() -> Dict

A dictionary that allows iteration of the topic hierarchy. Each key is a (topic label, topic depth) tuple mapping to its direct sub-topics. If a topic is not a key in the hierarchy, it is a leaf in the topic hierarchy.


group_by_topic

def group_by_topic(topic_depth: int = 1) -> List[Dict]

Associates topics at a given depth in the topic hierarchy to the identifiers of their contained datapoints.

Arguments:

  • topic_depth - Topic depth to group datums by.

Returns:

List of dictionaries where each dictionary contains the next-depth subtopics, subtopic ids, topic_id, topic_short_description, topic_long_description, and a list of datum_ids.


get_topic_density

def get_topic_density(time_field: str, start: datetime, end: datetime)

Computes the density/frequency of topics in a given interval of a timestamp field.

Useful for answering questions such as:

  • What topics increased in prevalence between December and January?

Arguments:

  • time_field - Your metadata field containing isoformat timestamps
  • start - A datetime object for the window start
  • end - A datetime object for the window end

Returns:

A list of {topic, count} dictionaries, sorted from largest count to smallest count.
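
For example, a minimal sketch, assuming your dataset has a metadata field named 'timestamp' containing isoformat timestamps:

from datetime import datetime

# Topic counts between December and January
# ('timestamp' is a hypothetical metadata field name)
density = atlas_map.topics.get_topic_density(
    time_field='timestamp',
    start=datetime(2023, 12, 1),
    end=datetime(2024, 1, 31),
)
print(density)  # [{topic, count}, ...] sorted from largest to smallest count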


vector_search_topics

def vector_search_topics(queries: np.ndarray,
                         k: int = 32,
                         depth: int = 3) -> Dict

Given one or more query embeddings, returns a normalized distribution over topics.

Useful for answering questions such as:

  • What topic does my new datapoint belong to?
  • Does my datapoint belong to the "Dog" topic or the "Cat" topic?

Arguments:

  • queries - a 2d NumPy array where each row corresponds to a query vector
  • k - (Default 32) the number of neighbors to use when estimating the posterior
  • depth - (Default 3) the topic depth at which you want to search

Returns:

A dict mapping {topic: posterior probability} for each query.
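
For example, a sketch that embeds a new query and asks which depth-2 topics it most likely belongs to. The model name is an assumption; use the same embedding model your map was built with so the query lives in the same latent space:

import numpy as np
from nomic import embed

# Embed a new query ('nomic-embed-text-v1.5' is an assumed model choice)
output = embed.text(texts=['wireless bluetooth headphones'], model='nomic-embed-text-v1.5')
queries = np.array(output['embeddings'])  # shape (num_queries, embedding_dim)

# Normalized distribution over depth-2 topics for each query
print(atlas_map.topics.vector_search_topics(queries=queries, k=32, depth=2))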

Duplicate Detection

Duplicate detection is enabled by default when creating an Atlas map using the Nomic Atlas Python SDK:

from nomic import AtlasDataset

dataset_identifier = "my-dataset" # for my-dataset in the organization connected to your Nomic API key
# dataset_identifier = "<ORG_NAME>/my-dataset" # for my-dataset in other organizations you are a member of

atlas_dataset = AtlasDataset(dataset_identifier)

atlas_dataset.add_data(my_data)

# Duplicate detection runs by default
atlas_dataset.create_index(indexed_field="text")

While duplicate detection runs by default, you can fine-tune its behavior by providing NomicDuplicatesOptions to the create_index method. The main parameter you might adjust is the duplicate_cutoff threshold. Smaller thresholds result in duplicate clusters containing datapoints that are closer to exact matches. The default threshold is 0.1.

from nomic import AtlasDataset, NomicDuplicatesOptions

# ... (Dataset creation and data addition as above) ...

atlas_dataset.create_index(
    indexed_field="text",
    duplicate_detection=NomicDuplicatesOptions(
        tag_duplicates=True,
        duplicate_cutoff=0.2,  # Adjust the similarity threshold
    )
)

Directly access your Atlas-detected duplicate datapoint clusters with atlas_map.duplicates.df.

Every datapoint is assigned a duplicate cluster_id. Two points share a cluster_id if Atlas identified them as semantic duplicates based on latent space properties.

from nomic import AtlasDataset

atlas_dataset = AtlasDataset(dataset_identifier)
atlas_map = atlas_dataset.maps[0]
duplicate_info_df = atlas_map.duplicates.df
print(duplicate_info_df)
        index      duplicate_class  cluster_id
0      100001            singleton       16492
1      100011            singleton       16017
2      100016            singleton        7826
3      100020            singleton        5044
4      100030  retention candidate         412
...       ...                  ...         ...
19995   99721            singleton        9218
19996   99730   deletion candidate         371
19997   99918            singleton       13311
19998   99926            singleton       13725
19999    9997            singleton        3866

[20000 rows × 3 columns]
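
Since this is a pandas DataFrame, a quick way to see how many datapoints fall into each duplicate class:

# Tally singletons, retention candidates, and deletion candidates
print(duplicate_info_df['duplicate_class'].value_counts())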

You can take the deletion candidates from the duplicate detection results and curate a new dataset without them.

# Get your old dataframe
old_df = atlas_map.data.df

# Get a list of IDs for datapoints marked as deletion candidates
# We will keep only singletons and retention candidates
ids_to_remove = atlas_map.duplicates.deletion_candidates()

if ids_to_remove:
    # Keep only rows whose id is not a deletion candidate
    # (assumes your dataset's unique id field is named 'id')
    new_df = old_df[~old_df['id'].isin(ids_to_remove)]
    new_identifier = dataset_identifier + "-without-duplicates"

    new_dataset = AtlasDataset(
        new_identifier,
        description=f'Map of {dataset_identifier} without duplicates',
    )

    new_dataset.add_data(new_df)

    new_dataset.create_index(indexed_field="text")

AtlasMapDuplicates API Reference

class AtlasMapDuplicates()

Atlas Duplicate Clusters State. Atlas can automatically group embeddings that are sufficiently close into semantic clusters. You can use these clusters for semantic duplicate detection, allowing you to quickly deduplicate your data.


df

@property
def df() -> pd.DataFrame

Pandas DataFrame mapping each data point to its cluster of semantically similar points.


tb

@property
def tb() -> pa.Table

Pyarrow table with information about duplicate clusters and candidates. This table is memmapped from the underlying files and is the most efficient way to access duplicate information.


deletion_candidates

def deletion_candidates() -> List[str]

Returns:

The ids for all data points which are semantic duplicates and are candidates for being deleted from the dataset. If you remove these data points from your dataset, your dataset will be semantically deduplicated.

Tags

Users can tag data in Atlas for workflows like data cleaning or curating training data for a classification model.

From Python, to get the data that has been given a tag, use atlas_map.tags.get_datums_in_tag(tag_name).

For example, here is how to get a map's data that has been tagged 'sports':

from nomic import AtlasDataset

dataset = AtlasDataset('my-dataset')

atlas_map = dataset.maps[0]

sports_ids = atlas_map.tags.get_datums_in_tag('sports')

sports_data = dataset.get_data(sports_ids)

Here is how to get all existing map tags and the number of points in each with atlas_map.tags.get_tags():

from nomic import AtlasDataset

dataset = AtlasDataset('my-dataset')

atlas_map = dataset.maps[0]

tags = atlas_map.tags.get_tags()

for tag in tags:
    tag_ids = atlas_map.tags.get_datums_in_tag(tag['tag_name'])
    print(tag['tag_name'], "Count:", len(tag_ids))

For more on using tags in a pipeline, check out the data curation walkthrough.

AtlasMapTags API Reference

class AtlasMapTags()

Atlas Map Tag State. You can manipulate tags by filtering over the associated pandas DataFrame.


df

@property
def df(overwrite: bool = False) -> pd.DataFrame

Pandas DataFrame mapping each data point to its tags.


get_tags

def get_tags() -> List[Dict[str, str]]

Retrieves all tags made in the web browser for a specific map. Each tag is a dictionary containing tag_name, tag_id, and metadata.

Returns:

A list of tags a user has created for the projection.


get_datums_in_tag

def get_datums_in_tag(tag_name: str, overwrite: bool = False)

Returns the datum ids in a given tag.

Arguments:

  • tag_name - The name of the tag to fetch datums for.
  • overwrite - If True, re-downloads the tag. Otherwise, checks whether an up-to-date version of the tag already exists.

Returns:

List of datum ids.