Accessing Dataset State

A key feature of the AtlasDataset is that actions in the web browser sync to your state in Python.

Make a tag, find duplicates, and retrieve your dataset's embeddings from anywhere.

Your dataset's metadata is stored in attributes on your Atlas Maps. Maps are the output of indexing your dataset; think of them as semantic views into your data.

from nomic import AtlasDataset
dataset = AtlasDataset('my-dataset')
map = dataset.maps[0]

map.topics      # topic model state
map.duplicates  # duplicate cluster state
map.embeddings  # embedding state
map.data        # uploaded metadata columns

Embeddings

Access your dataset's embeddings with map.embeddings.

map.embeddings.latent contains the high-dimensional embeddings produced by a Nomic Embedding Model.

map.embeddings.projected contains the 2D reduced version. These are the positions you see on your Atlas Map in the web browser.

Example

This example shows how to access high- and low-dimensional embeddings of your data generated and stored by Atlas.

from nomic import AtlasDataset

map = AtlasDataset('my-dataset').maps[0]

# projected embeddings are your 2D embeddings
projected_embeddings = map.embeddings.projected

# latent embeddings are your high-dim vectors
latent_embeddings = map.embeddings.latent

Data

Access your uploaded data columns with map.data.

from nomic import AtlasDataset

dataset = AtlasDataset('my-dataset')
map = dataset.maps[0]

print(map.data.df)

API Reference:

class AtlasMapData()

Atlas Map Data (Metadata) State. This is how you can access text and other associated metadata columns you uploaded with your project.


df

@property
def df() -> pd.DataFrame

A pandas DataFrame associating each datapoint on your map to their metadata. Converting to pandas DataFrame may materialize a large amount of data into memory.


tb

@property
def tb() -> pa.Table

Pyarrow table associating each datapoint on the map to their metadata columns. This table is memmapped from the underlying files and is the most efficient way to access metadata information.
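
A minimal sketch of column-level access through the Arrow table; the column name 'text' is hypothetical, so substitute a column from your own upload:

from nomic import AtlasDataset

map = AtlasDataset('my-dataset').maps[0]

# Read a single column without materializing the full DataFrame
text_column = map.data.tb.column('text')
print(text_column.slice(0, 5))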


Topics

Directly access your Atlas-generated topics with map.topics.df. This dataframe includes depth-unique ids and human-interpretable descriptions for your dataset's topics. Atlas topic models are hierarchical: each topic has a corresponding label, visible in the Atlas map interface.
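
Example

A minimal sketch of inspecting per-datapoint topic assignments and the per-topic metadata table:

from nomic import AtlasDataset

map = AtlasDataset('my-dataset').maps[0]

# Per-datapoint topic assignments at each depth
print(map.topics.df.head())

# One row per topic: id, human-readable label, and keywords
print(map.topics.metadata.head())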

API Reference:

class AtlasMapTopics()

Atlas Topics State


df

@property
def df() -> pd.DataFrame

A pandas DataFrame associating each datapoint on your map to its topics at each topic depth.


tb

@property
def tb() -> pa.Table

Pyarrow table associating each datapoint on the map to their Atlas assigned topics. This table is memmapped from the underlying files and is the most efficient way to access topic information.


metadata

@property
def metadata() -> pd.DataFrame

Pandas DataFrame where each row gives metadata for one map topic, including:

  • topic id
  • a human readable topic description (topic label)
  • identifying keywords that differentiate the topic from other topics

hierarchy

@property
def hierarchy() -> Dict

A dictionary for iterating the topic hierarchy. Each key is a (topic label, topic depth) tuple mapping to its direct sub-topics. If a topic is not a key in the hierarchy, it is a leaf of the topic hierarchy.
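
A minimal sketch of walking the hierarchy, assuming depth 1 is the top level:

from nomic import AtlasDataset

map = AtlasDataset('my-dataset').maps[0]

# Keys are (topic label, topic depth) tuples per the description above
for (label, depth), subtopics in map.topics.hierarchy.items():
    if depth == 1:  # top-level topics
        print(label, '->', subtopics)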


group_by_topic

def group_by_topic(topic_depth: int = 1) -> List[Dict]

Associates topics at a given depth in the topic hierarchy to the identifiers of their contained datapoints.

Arguments:

  • topic_depth - Topic depth to group datums by.

Returns:

List of dictionaries, one per topic at the given depth. Each dictionary contains the next-depth subtopics, subtopic ids, topic_id, topic_short_description, topic_long_description, and a list of datum_ids.
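
A minimal sketch that counts datapoints per top-level topic, using the keys described above:

from nomic import AtlasDataset

map = AtlasDataset('my-dataset').maps[0]

groups = map.topics.group_by_topic(topic_depth=1)
for group in groups:
    print(group['topic_short_description'], len(group['datum_ids']))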


get_topic_density

def get_topic_density(time_field: str, start: datetime, end: datetime)

Computes the density/frequency of topics in a given interval of a timestamp field.

Useful for answering questions such as:

  • What topics increased in prevalence between December and January?

Arguments:

  • time_field - Your metadata field containing isoformat timestamps
  • start - A datetime object for the window start
  • end - A datetime object for the window end

Returns:

A list of {topic, count} dictionaries, sorted from largest count to smallest count.
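
A minimal sketch; the time_field name 'created_at' is hypothetical and should match a timestamp field from your own metadata:

from datetime import datetime

from nomic import AtlasDataset

map = AtlasDataset('my-dataset').maps[0]

# 'created_at' is a hypothetical metadata field of isoformat timestamps
densities = map.topics.get_topic_density(
    time_field='created_at',
    start=datetime(2023, 12, 1),
    end=datetime(2024, 1, 31),
)
print(densities[:5])  # sorted from largest count to smallest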


vector_search_topics

def vector_search_topics(queries: np.ndarray,
                         k: int = 32,
                         depth: int = 3) -> Dict

Given one or more query embeddings, returns a normalized distribution over topics.

Useful for answering questions such as:

  • What topic does my new datapoint belong to?
  • Does my datapoint belong to the "Dog" topic or the "Cat" topic?

Arguments:

  • queries - a 2d NumPy array where each row corresponds to a query vector
  • k - (Default 32) the number of neighbors to use when estimating the posterior
  • depth - (Default 3) the topic depth at which you want to search

Returns:

A dict mapping {topic: posterior probability} for each query.
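
A minimal sketch, reusing a few of the map's own latent embeddings as stand-in queries; in practice you would embed new datapoints with the same embedding model:

import numpy as np

from nomic import AtlasDataset

map = AtlasDataset('my-dataset').maps[0]

# Two query vectors, one per row
queries = np.asarray(map.embeddings.latent[:2])
distribution = map.topics.vector_search_topics(queries, k=32, depth=1)
print(distribution)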


Duplicate Detection

Directly access your Atlas-detected duplicate datapoint clusters with map.duplicates.df.

Every datapoint is assigned a duplicate cluster_id. Two points share a cluster_id if Atlas identified them as semantic duplicates based on latent space properties.
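
Example

For example, to inspect the cluster assignments:

from nomic import AtlasDataset

map = AtlasDataset('my-dataset').maps[0]

# Each row pairs a datapoint with its duplicate cluster_id
print(map.duplicates.df.head())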

API Reference:

class AtlasMapDuplicates()

Atlas Duplicate Clusters State. Atlas can automatically group embeddings that are sufficiently close into semantic clusters. You can use these clusters for semantic duplicate detection, allowing you to quickly deduplicate your data.


df

@property
def df() -> pd.DataFrame

Pandas DataFrame mapping each data point to its cluster of semantically similar points.


tb

@property
def tb() -> pa.Table

Pyarrow table with information about duplicate clusters and candidates. This table is memmapped from the underlying files and is the most efficient way to access duplicate information.


deletion_candidates

def deletion_candidates() -> List[str]

Returns:

The ids for all data points which are semantic duplicates and are candidates for being deleted from the dataset. If you remove these data points from your dataset, your dataset will be semantically deduplicated.
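
A minimal sketch of a deduplication pass; the delete_data call is an assumption about the client API, so verify it exists in your version of the nomic client before running:

from nomic import AtlasDataset

dataset = AtlasDataset('my-dataset')
map = dataset.maps[0]

duplicate_ids = map.duplicates.deletion_candidates()
print('Candidates for deletion:', len(duplicate_ids))

# Assumed client method for removing datapoints by id; verify it
# exists in your version of the nomic client before running:
# dataset.delete_data(ids=duplicate_ids)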

Tags

Atlas tags allow users to annotate their data for easy access in any kind of data pipeline, from cleaning to classification. Retrieve existing tags from your data with map.tags.get_tags(), add them with map.tags.add(), and remove tags with map.tags.remove().

Example

from nomic import AtlasDataset
dataset = AtlasDataset('my-dataset')
map = dataset.maps[0]

tags = map.tags.get_tags()
for tag, datum_ids in tags.items():
    print(tag, "Count:", len(datum_ids), list(datum_ids)[:10])

print(dataset.get_data(ids=list(tags['sports'])[:2]))

For more on using tags in a pipeline, check out the walkthrough at Atlas Capabilities: Data cleaning.

API Reference:

class AtlasMapTags()

Atlas Map Tag State. You can manipulate tags by filtering over the associated pandas DataFrame.


df

@property
def df(overwrite: bool = False) -> pd.DataFrame

Pandas DataFrame mapping each data point to its tags.


get_tags

def get_tags() -> List[Dict[str, str]]

Retrieves all tags made in the web browser for a specific map. Each tag is a dictionary containing tag_name, tag_id, and metadata.

Returns:

A list of tags a user has created for the map's projection.


get_datums_in_tag

def get_datums_in_tag(tag_name: str, overwrite: bool = False)

Returns the datum ids in a given tag.

Arguments:

  • tag_name - The name of the tag to fetch datums for.
  • overwrite - If True, re-downloads the tag. Otherwise, checks to see if an up-to-date version of the tag already exists.

Returns:

List of datum ids.
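
A minimal sketch, using the 'sports' tag from the example above:

from nomic import AtlasDataset

map = AtlasDataset('my-dataset').maps[0]

datum_ids = map.tags.get_datums_in_tag('sports')
print(len(datum_ids), datum_ids[:5])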