Accessing Dataset State

A key feature of the AtlasDataset is that actions in the web browser sync to your state in Python.

Make a tag, find some duplicates, and retrieve your dataset's embeddings from anywhere.

Your dataset metadata is stored in attributes on your Atlas Maps. Maps are the output of indexing your dataset; think of them as semantic views into your dataset.

from nomic import AtlasDataset
dataset = AtlasDataset('my-dataset')
map = dataset.maps[0]



Access your dataset's embeddings with map.embeddings.

map.embeddings.latent contains the high-dimensional embeddings produced by a Nomic Embedding Model.

map.embeddings.projected contains the 2D reduced version. These are the positions you see on your Atlas Map in the web browser.


This example shows how to access high- and low-dimensional embeddings of your data generated and stored by Atlas.

from nomic import AtlasDataset

map = AtlasDataset('my-dataset').maps[0]

# projected embeddings are your 2D embeddings
projected_embeddings = map.embeddings.projected

# latent embeddings are your high-dim vectors
latent_embeddings = map.embeddings.latent


Access your uploaded data columns with map.data.

from nomic import AtlasDataset

dataset = AtlasDataset('my-dataset')
map = dataset.maps[0]


API Reference:

class AtlasMapData()

Atlas Map Data (Metadata) State. This is how you can access text and other associated metadata columns you uploaded with your project.


def df() -> pd.DataFrame

A pandas DataFrame associating each datapoint on your map with its metadata. Converting to a pandas DataFrame may materialize a large amount of data into memory.


def tb() -> pa.Table

PyArrow table associating each datapoint on the map with its metadata columns. This table is memory-mapped from the underlying files and is the most efficient way to access metadata information.


Directly access your Atlas-generated topics with map.topics.df. This DataFrame includes depth-unique ids and human-interpretable descriptions for your dataset's topics. Atlas topic models are hierarchical. Each topic has a corresponding label, as shown in the Atlas map interface.

API Reference:

class AtlasMapTopics()

Atlas Topics State


def df() -> pd.DataFrame

A pandas DataFrame associating each datapoint on your map with its topics at each topic depth.


def tb() -> pa.Table

PyArrow table associating each datapoint on the map with its Atlas-assigned topics. This table is memory-mapped from the underlying files and is the most efficient way to access topic information.


def metadata() -> pd.DataFrame

A pandas DataFrame in which each row gives metadata for one map topic, including:

  • topic id
  • a human readable topic description (topic label)
  • identifying keywords that differentiate the topic from other topics


def hierarchy() -> Dict

A dictionary that allows iteration over the topic hierarchy. Each key is a (topic label, topic depth) tuple that maps to the topic's direct sub-topics. If a topic is not a key in the hierarchy, it is a leaf in the topic hierarchy.
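As a minimal sketch of iterating this structure, the helper below walks a hierarchy dictionary depth-first. It assumes that each value is a list of sub-topic labels and that top-level topics sit at depth 1. The `walk_topics` helper and the sample dictionary are hypothetical stand-ins for illustration, not part of the Atlas API.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the hierarchy dictionary of a real map.
# Assumption: each (topic label, topic depth) key maps to a list of sub-topic labels.
sample_hierarchy: Dict[Tuple[str, int], List[str]] = {
    ("Animals", 1): ["Dogs", "Cats"],
    ("Dogs", 2): ["Puppies"],
}


def walk_topics(hierarchy, label, depth=1):
    """Return (label, depth) pairs for a topic and all of its descendants, depth-first."""
    visited = [(label, depth)]
    # A topic that is not a key in the hierarchy is a leaf, so .get() falls back to [].
    for sub_label in hierarchy.get((label, depth), []):
        visited.extend(walk_topics(hierarchy, sub_label, depth + 1))
    return visited


# Print the tree with one level of indentation per depth.
for label, depth in walk_topics(sample_hierarchy, "Animals"):
    print("  " * (depth - 1) + label)
```

With a real map you would pass the hierarchy dictionary itself in place of `sample_hierarchy`, starting from one of its top-level topic labels.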


def group_by_topic(topic_depth: int = 1) -> List[Dict]

Associates topics at a given depth in the topic hierarchy to the identifiers of their contained datapoints.


  • topic_depth - Topic depth to group datums by.


Returns a list of dictionaries, where each dictionary contains the next-depth subtopics, subtopic ids, topic_id, topic_short_description, topic_long_description, and a list of datum_ids.
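One way to use this result is to rank topics by how many datapoints they contain. The sketch below works on a hand-written list that imitates the return value, using only the topic_short_description and datum_ids keys named above. The `topic_sizes` helper and the sample data are hypothetical.

```python
# Hypothetical stand-in for the list returned by group_by_topic(topic_depth=1).
# Only the two keys used below are included.
sample_groups = [
    {"topic_short_description": "Dogs", "datum_ids": ["a1", "a2", "a3"]},
    {"topic_short_description": "Cats", "datum_ids": ["b1"]},
    {"topic_short_description": "Birds", "datum_ids": ["c1", "c2"]},
]


def topic_sizes(groups):
    """Return (topic description, datapoint count) pairs, largest topic first."""
    sizes = [(g["topic_short_description"], len(g["datum_ids"])) for g in groups]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)


for description, count in topic_sizes(sample_groups):
    print(f"{description}: {count} datapoints")
```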


def get_topic_density(time_field: str, start: datetime, end: datetime)

Computes the density/frequency of topics in a given interval of a timestamp field.

Useful for answering questions such as:

  • What topics increased in prevalence between December and January?


  • time_field - Your metadata field containing isoformat timestamps
  • start - A datetime object for the window start
  • end - A datetime object for the window end


Returns a list of {topic, count} dictionaries, sorted from largest count to smallest.
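To answer a question like the December-to-January one, you could call get_topic_density once per window and compare the two results. The sketch below shows only the comparison step on hand-written data. It assumes the dictionary keys are literally named "topic" and "count", and the `topic_count_changes` helper is hypothetical.

```python
def topic_count_changes(before, after):
    """Return {topic: after_count - before_count} for every topic seen in either window."""
    before_counts = {row["topic"]: row["count"] for row in before}
    after_counts = {row["topic"]: row["count"] for row in after}
    topics = set(before_counts) | set(after_counts)
    # A topic missing from one window is treated as having a count of 0 there.
    return {t: after_counts.get(t, 0) - before_counts.get(t, 0) for t in topics}


# Hypothetical results of two get_topic_density calls over consecutive windows.
december = [{"topic": "Shipping", "count": 40}, {"topic": "Billing", "count": 25}]
january = [
    {"topic": "Billing", "count": 60},
    {"topic": "Shipping", "count": 30},
    {"topic": "Returns", "count": 5},
]

changes = topic_count_changes(december, january)
# Topics whose count grew, ordered by the size of the increase.
increased = sorted((t for t, d in changes.items() if d > 0), key=lambda t: -changes[t])
print(increased)
```

Raw counts depend on how many datapoints fall in each window, so comparing shares of the window total may be more informative when the windows differ in size.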


def vector_search_topics(queries: np.ndarray,
                         k: int = 32,
                         depth: int = 3) -> Dict

Given an embedding, returns a normalized distribution over topics.

Useful for answering questions such as:

  • What topic does my new datapoint belong to?
  • Does my datapoint belong to the "Dog" topic or the "Cat" topic?


  • queries - a 2d NumPy array where each row corresponds to a query vector
  • k - (Default 32) the number of neighbors to use when estimating the posterior
  • depth - (Default 3) the topic depth at which you want to search


Returns a dict mapping {topic: posterior probability} for each query.
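To turn these distributions into a single answer per query, you can take the topic with the highest posterior probability. The sketch below assumes the result can be iterated as one {topic: probability} dictionary per query. That shape, the `most_likely_topics` helper, and the sample values are assumptions for illustration.

```python
def most_likely_topics(posteriors):
    """For each {topic: probability} dict, return the (topic, probability) pair with the highest probability."""
    return [max(dist.items(), key=lambda item: item[1]) for dist in posteriors]


# Hypothetical posteriors for two query vectors.
sample_posteriors = [
    {"Dog": 0.75, "Cat": 0.25},
    {"Dog": 0.125, "Cat": 0.875},
]

for topic, probability in most_likely_topics(sample_posteriors):
    print(f"{topic} ({probability:.0%})")
```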

Duplicate Detection

Directly access your Atlas-detected duplicate datapoint clusters with map.duplicates.df.

Every datapoint is assigned a duplicate cluster_id. Two points share a cluster_id if Atlas identified them as semantic duplicates based on latent space properties.

API Reference:

class AtlasMapDuplicates()

Atlas Duplicate Clusters State. Atlas can automatically group embeddings that are sufficiently close into semantic clusters. You can use these clusters for semantic duplicate detection, allowing you to quickly deduplicate your data.


def df() -> pd.DataFrame

Pandas DataFrame mapping each data point to its cluster of semantically similar points.


def tb() -> pa.Table

PyArrow table with information about duplicate clusters and candidates. This table is memory-mapped from the underlying files and is the most efficient way to access duplicate information.


def deletion_candidates() -> List[str]


Returns the ids of all data points that are semantic duplicates and are candidates for deletion. Removing these data points semantically deduplicates your dataset.
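A minimal sketch of applying these ids to a local copy of your data follows. It filters a list of records by id. The `drop_deletion_candidates` helper, the "id" field name, and the sample records are hypothetical; in practice the candidate ids would come from deletion_candidates().

```python
def drop_deletion_candidates(records, candidate_ids, id_field="id"):
    """Return the records whose id is not among the deletion candidates."""
    candidates = set(candidate_ids)  # a set makes each membership check cheap
    return [r for r in records if r[id_field] not in candidates]


# Hypothetical local records and candidate ids.
sample_records = [
    {"id": "1", "text": "hello world"},
    {"id": "2", "text": "Hello, world!"},
    {"id": "3", "text": "goodbye"},
]
sample_candidates = ["2"]

deduplicated = drop_deletion_candidates(sample_records, sample_candidates)
print([r["id"] for r in deduplicated])
```

This only filters the local copy; it does not remove anything from the dataset stored in Atlas.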


Atlas tags allow users to annotate their data for easy access in any kind of data pipeline, from cleaning to classification. Retrieve existing tags from your data with map.tags.get_tags(), add them with map.tags.add(), and remove tags with map.tags.remove().


from nomic import AtlasDataset
dataset = AtlasDataset('my-dataset')
map = dataset.maps[0]

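tags = map.tags.get_tags()
for tag in tags:
    # each tag is a dictionary; look up its datum ids by tag name
    datum_ids = map.tags.get_datums_in_tag(tag["tag_name"])
    print(tag["tag_name"], "Count:", len(datum_ids), list(datum_ids)[:10])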


For more on using tags in a pipeline, check out the walkthrough at Atlas Capabilities: Data cleaning.

API Reference:

class AtlasMapTags()

Atlas Map Tag State. You can manipulate tags by filtering over the associated pandas DataFrame.


def df(overwrite: bool = False) -> pd.DataFrame

Pandas DataFrame mapping each data point to its tags.


def get_tags() -> List[Dict[str, str]]

Retrieves all tags made in the web browser for a specific map. Each tag is a dictionary containing tag_name, tag_id, and metadata.


Returns a list of the tags a user has created for the projection.


def get_datums_in_tag(tag_name: str, overwrite: bool = False)

Returns the datum ids in a given tag.


  • overwrite - If True, re-downloads the tag. Otherwise, checks whether an up-to-date tag already exists.


Returns a list of datum ids.
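Once you have the datum ids for several tags, it can be handy to invert the mapping and see which tags each datapoint carries. The sketch below does this on a plain dictionary of {tag name: datum ids}. The `tags_by_datum` helper and the sample data are hypothetical; the input could be built by calling get_datums_in_tag once per tag name.

```python
from collections import defaultdict


def tags_by_datum(tag_members):
    """Invert {tag name: datum ids} into {datum id: sorted list of tag names}."""
    inverted = defaultdict(list)
    for tag_name, datum_ids in tag_members.items():
        for datum_id in datum_ids:
            inverted[datum_id].append(tag_name)
    return {datum_id: sorted(names) for datum_id, names in inverted.items()}


# Hypothetical tag membership, e.g. collected with get_datums_in_tag per tag.
sample_tag_members = {"spam": ["id1", "id2"], "reviewed": ["id2", "id3"]}

print(tags_by_datum(sample_tag_members))
```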