Accessing Dataset State

A key feature of the AtlasDataset is that actions in the web browser sync to your state in Python.

Make a tag, find duplicates, and retrieve your dataset's embeddings from anywhere.

Your dataset metadata is stored in attributes on your Atlas Maps - maps are the output of indexing your dataset. You should think of maps as semantic views into your dataset.

from nomic import AtlasDataset
dataset = AtlasDataset('my-dataset')
map = dataset.maps[0]

map.topics
map.duplicates
map.embeddings
map.data

Accessing Topic Data

Directly access your Atlas-generated topics with map.topics.df. This DataFrame includes depth-unique ids and human-interpretable descriptions for your dataset's topics. Atlas topic models are hierarchical: each topic has a corresponding label, visible in the Atlas map interface.
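
A quick sketch of the kind of per-depth analysis the topics DataFrame supports. The frame below is made up (the real column names may differ), but the shape, one row per datapoint with a topic assignment at each depth, follows the description above:

```python
import pandas as pd

# Hypothetical sample of map.topics.df: one row per datapoint,
# with an assigned topic at each depth of the hierarchy.
topics_df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "topic_depth_1": ["Animals", "Animals", "Vehicles", "Vehicles"],
    "topic_depth_2": ["Dogs", "Cats", "Cars", "Cars"],
})

# Count datapoints per top-level topic.
counts = topics_df["topic_depth_1"].value_counts()
print(counts)
```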


Topics API Reference

class AtlasMapTopics()

Atlas Topics State


df

@property
def df() -> pandas.DataFrame

A pandas DataFrame associating each datapoint on your map to their topics at each topic depth.


tb

@property
def tb() -> pa.Table

Pyarrow table associating each datapoint on the map to their Atlas assigned topics. This table is memmapped from the underlying files and is the most efficient way to access topic information.


metadata

@property
def metadata() -> pandas.DataFrame

Pandas DataFrame where each row gives metadata for a map topic, including:

  • topic id
  • a human readable topic description (topic label)
  • identifying keywords that differentiate the topic from other topics

hierarchy

@property
def hierarchy() -> Dict

A dictionary that allows iteration of the topic hierarchy. Each key is a tuple of (topic label, topic depth) mapping to its direct sub-topics. If a topic is not a key in the hierarchy, it is a leaf in the topic hierarchy.
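
A minimal sketch of walking this dictionary, using made-up topic labels. The key shape, (topic label, topic depth) mapping to a list of sub-topic labels, follows the description above; absent keys are leaves:

```python
# Hypothetical topic hierarchy in the shape described above.
hierarchy = {
    ("Animals", 1): ["Dogs", "Cats"],
    ("Vehicles", 1): ["Cars"],
    ("Dogs", 2): ["Terriers"],
}

def leaves(label, depth):
    """Collect the leaf topics beneath a given topic."""
    children = hierarchy.get((label, depth), [])
    if not children:
        return [label]  # not a key => leaf topic
    out = []
    for child in children:
        out.extend(leaves(child, depth + 1))
    return out

print(leaves("Animals", 1))
```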


group_by_topic

def group_by_topic(topic_depth: int = 1) -> List[Dict]

Associates topics at a given depth in the topic hierarchy to the identifiers of their contained datapoints.

Arguments:

  • topic_depth - Topic depth to group datums by.

Returns:

List of dictionaries where each dictionary contains next depth subtopics, subtopic ids, topic_id, topic_short_description, topic_long_description, and list of datum_ids.
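
A sketch of a common follow-up on this return value, ranking topics by how many datapoints they contain. The list below is a made-up stand-in for a group_by_topic(topic_depth=1) result, showing only a subset of the documented keys:

```python
# Hypothetical return value of group_by_topic(topic_depth=1).
groups = [
    {"topic_id": 1, "topic_short_description": "Sports",
     "datum_ids": ["a", "b", "c"]},
    {"topic_id": 2, "topic_short_description": "Politics",
     "datum_ids": ["d"]},
]

# Rank topics by the number of datapoints they contain.
ranked = sorted(groups, key=lambda g: len(g["datum_ids"]), reverse=True)
for g in ranked:
    print(g["topic_short_description"], len(g["datum_ids"]))
```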


get_topic_density

def get_topic_density(time_field: str, start: datetime, end: datetime)

Computes the density/frequency of topics in a given interval of a timestamp field.

Useful for answering questions such as:

  • What topics increased in prevalence between December and January?

Arguments:

  • time_field - Your metadata field containing isoformat timestamps
  • start - A datetime object for the window start
  • end - A datetime object for the window end

Returns:

A list of {topic, count} dictionaries, sorted from largest count to smallest count.
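
To answer the December-vs-January question above, you can call get_topic_density once per window and diff the counts. The two result lists below are made up to sketch that comparison locally:

```python
# Hypothetical get_topic_density results for two time windows.
december = [{"topic": "Skiing", "count": 10}, {"topic": "Cooking", "count": 40}]
january = [{"topic": "Skiing", "count": 55}, {"topic": "Cooking", "count": 42}]

dec = {d["topic"]: d["count"] for d in december}
# Topics sorted by how much their count grew between the two windows.
growth = sorted(
    ((d["topic"], d["count"] - dec.get(d["topic"], 0)) for d in january),
    key=lambda t: t[1],
    reverse=True,
)
print(growth)
```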


vector_search_topics

def vector_search_topics(queries: np.array,
k: int = 32,
depth: int = 3) -> Dict

Given an embedding, returns a normalized distribution over topics.

Useful for answering questions such as:

  • What topic does my new datapoint belong to?
  • Does my datapoint belong to the "Dog" topic or the "Cat" topic?

Arguments:

  • queries - a 2d NumPy array where each row corresponds to a query vector
  • k - (Default 32) the number of neighbors to use when estimating the posterior
  • depth - (Default 3) the topic depth at which you want to search

Returns:

A dict mapping {topic: posterior probability} for each query.
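
The normalization step behind this estimate can be reproduced locally: count the topics of a query's k nearest neighbors and divide by k. The neighbor labels below are made up; the real method derives them from the embedding search:

```python
from collections import Counter

# Hypothetical topics of a query's k=4 nearest neighbors.
neighbor_topics = ["Dogs", "Dogs", "Cats", "Dogs"]

# Normalize neighbor counts into a posterior distribution over topics.
counts = Counter(neighbor_topics)
posterior = {topic: n / len(neighbor_topics) for topic, n in counts.items()}
print(posterior)
```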


Accessing Duplicates

Directly access your Atlas-detected duplicate datapoint clusters with map.duplicates.df.

Every datapoint is assigned a duplicate cluster_id. Two points share a cluster_id if Atlas identified them as semantic duplicates based on latent space properties.


Duplicates API Reference

class AtlasMapDuplicates()

Atlas Duplicate Clusters State. Atlas can automatically group embeddings that are sufficiently close into semantic clusters. You can use these clusters for semantic duplicate detection allowing you to quickly deduplicate your data.


df

@property
def df() -> pd.DataFrame

Pandas DataFrame mapping each data point to its cluster of semantically similar points.


tb

@property
def tb() -> pa.Table

Pyarrow table with information about duplicate clusters and candidates. This table is memmapped from the underlying files and is the most efficient way to access duplicate information.


deletion_candidates

def deletion_candidates() -> List[str]

Returns:

The ids for all data points which are semantic duplicates and are candidates for being deleted from the dataset. If you remove these data points from your dataset, your dataset will be semantically deduplicated.
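
A local sketch of how deletion candidates relate to duplicate clusters, using a made-up duplicates DataFrame (the real column names may differ): within each cluster, every point after the first is a candidate for deletion, so removing the candidates keeps exactly one representative per cluster.

```python
import pandas as pd

# Hypothetical duplicates DataFrame: each datapoint with its cluster id.
df = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "duplicate_cluster_id": [0, 0, 1, 2],
})

# Everything after the first occurrence of a cluster id is a candidate.
candidates = df[df.duplicated("duplicate_cluster_id", keep="first")]["id"].tolist()
print(candidates)
```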


Accessing Data Embeddings

Access your dataset's embeddings with map.embeddings.

map.embeddings.latent contains the high-dimensional embeddings produced by a Nomic Embedding Model.

map.embeddings.projected contains the 2D reduced version. These are the positions you see on your Atlas Map in the web browser.

Example

This example shows how to access high- and low-dimensional embeddings of your data generated and stored by Atlas.

from nomic import AtlasDataset

map = AtlasDataset('my-dataset').maps[0]

# projected embeddings are your 2D embeddings
projected_embeddings = map.embeddings.projected

# latent embeddings are your high-dim vectors
latent_embeddings = map.embeddings.latent
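
One thing the latent embeddings make easy is a nearest-neighbor lookup. This sketch uses random vectors as a stand-in for map.embeddings.latent, so it runs without an Atlas dataset:

```python
import numpy as np

# Random stand-in for map.embeddings.latent (100 points, 64 dims).
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 64))
query = latent[0]  # pretend this is a newly embedded datapoint

# Cosine similarity between the query and every stored embedding.
sims = latent @ query / (np.linalg.norm(latent, axis=1) * np.linalg.norm(query))

# Indices of the five most similar points.
nearest = np.argsort(-sims)[:5]
print(nearest)
```

Since the query is itself a stored point, it comes back as its own nearest neighbor.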

Accessing Dataset Tags

Atlas tags allow users to annotate their data for easy access in any kind of data pipeline, from cleaning to classification. Retrieve existing tags from your data with map.tags.get_tags(), add them with map.tags.add(), and remove tags with map.tags.remove().

For more on using tags in a pipeline, check out the walkthrough at Atlas Capabilities: Data cleaning.

Example with Tags

from nomic import AtlasDataset
dataset = AtlasDataset('my-dataset')
map = dataset.maps[0]

tags = map.tags.get_tags()
for tag, datum_ids in tags.items():
    print(tag, "Count:", len(datum_ids), list(datum_ids)[:10])

print(dataset.get_data(ids=list(tags['sports'])[:2]))


Tags API Reference

class AtlasMapTags()

Atlas Map Tag State. You can manipulate tags by filtering over the associated pandas DataFrame.


df

@property
def df() -> pd.DataFrame

Pandas DataFrame mapping each data point to its tags.


get_tags

def get_tags() -> Dict[str, List[str]]

Retrieves all tags made in the web browser for a specific map.

Returns:

A dictionary mapping each tag to the list of datum ids carrying that tag.


add

def add(ids: List[str], tags: List[str])

Adds tags to datapoints.

Arguments:

  • ids - The datum ids you want to tag
  • tags - A list containing the tags you want to apply to these data points.

remove

def remove(ids: List[str], tags: List[str], delete_all: bool = False) -> bool

Deletes the specified tags from the given data points.

Arguments:

  • ids - The datum_ids to delete tags from.
  • tags - The list of tags to delete from the data points. Each tag will be applied to all data points in ids.
  • delete_all - If true, ignores ids parameter and deletes all specified tags from all data points.

Returns:

True on success.


Accessing Your Data

Access your uploaded data columns with map.data.

Example with Data

from nomic import AtlasDataset

dataset = AtlasDataset('my-dataset')
map = dataset.maps[0]

print(map.data.df)

Data API Reference

class AtlasMapData()

Atlas Map Data (Metadata) State. This is how you can access text and other associated metadata columns you uploaded with your project.


df

@property
def df() -> pandas.DataFrame

A pandas DataFrame associating each datapoint on your map to their metadata. Converting to a pandas DataFrame may materialize a large amount of data in memory.


tb

@property
def tb() -> pa.Table

Pyarrow table associating each datapoint on the map to their metadata columns. This table is memmapped from the underlying files and is the most efficient way to access metadata information.