Atlas Dataset

The AtlasDataset class manages your Atlas Dataset. Atlas Datasets store information server-side and download it to your local environment on demand, caching it locally.

Any action you perform on your dataset in the web browser you can accomplish by interacting with an AtlasDataset object.


Modality Support

Atlas Datasets natively support text and embedding datasets. Image, audio, and video data can first be embedded with the Atlas Embedding API and then used via the embedding modality.

Creating an Atlas Dataset

from nomic import AtlasDataset

dataset = AtlasDataset(
    "example-dataset",
    unique_id_field="id",
)

Datasets are uniquely identified in your Nomic organization with a URL-safe name.


Adding data to an Atlas Dataset

from nomic import AtlasDataset

dataset = AtlasDataset(
    "example-dataset",
    unique_id_field="id",
)

dataset.add_data(
    data=[{'text': 'my first document'}, {'text': 'my second document'}]
)
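The dataset above declares unique_id_field="id", so it is reasonable to assume each record should carry an id value. The following is a minimal sketch of preparing such records; make_records is a hypothetical helper (not part of nomic), and sequential string ids are an assumption made for illustration.

```python
from typing import Dict, Iterable, List


def make_records(texts: Iterable[str], id_field: str = "id") -> List[Dict[str, str]]:
    """Attach a sequential string id to each text (hypothetical helper)."""
    return [{id_field: str(i), "text": text} for i, text in enumerate(texts)]


records = make_records(["my first document", "my second document"])
print(records)

# With a dataset created as shown above, the records could then be uploaded:
# dataset.add_data(data=records)
```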

Creating an Atlas Map

To structure your dataset in Atlas, you must index it. Indexing your dataset creates a map view of the data at a point in time, automatically detects topics, generates embeddings for unstructured data fields, and augments the data with metadata such as duplicate information.

from nomic import AtlasDataset

dataset = AtlasDataset(
    "example-dataset",
    unique_id_field="id",
)

dataset.add_data(
    data=[{'text': 'my first document'}, {'text': 'my second document'}]
)

# Each keyword below is explained in the sections that follow.
atlas_map = dataset.create_index(
    indexed_field='text',
    topic_model=True,
    duplicate_detection=True,
    projection=None,
    embedding_model='NomicEmbed',
)

There are several options you can configure for how Atlas will index your dataset:

Adding a Topic Model

Specifying topic_model during index creation will build a topic model over your dataset's embeddings.

class NomicTopicOptions(BaseModel)

Options for Nomic Topic Model

Arguments:

  • build_topic_model - If True, builds a topic model over your dataset's embeddings.
  • topic_label_field - The dataset field/column that Atlas will use to assign a human-readable description to each topic.
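Because create_index is documented as accepting a bool, a dict, or a NomicTopicOptions instance for topic_model, one option is to pass a plain dict. The sketch below assumes the dict keys mirror the argument names listed above; topic_options is a hypothetical helper, and "text" is only an illustrative field name.

```python
from typing import Any, Dict


def topic_options(label_field: str, build: bool = True) -> Dict[str, Any]:
    """Build a topic_model dict (hypothetical helper).

    Keys are assumed to mirror the NomicTopicOptions argument names.
    """
    return {"build_topic_model": build, "topic_label_field": label_field}


options = topic_options("text")
print(options)

# Assumed usage with an existing dataset:
# dataset.create_index(indexed_field="text", topic_model=options)
```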

Detecting Duplicate Datapoints

Specifying duplicate_detection during index creation will automatically identify datapoints in your data that are semantic duplicates.

class NomicDuplicatesOptions(BaseModel)

Options for Duplicate Detection

Arguments:

  • tag_duplicates - Whether duplicate detection should run over your dataset's embeddings.
  • duplicate_cutoff - A hyperparameter of duplicate detection; smaller values capture more exact duplicates.

Modifying the 2D Reduction Algorithm

Specifying projection during index creation will allow you to configure the hyperparameters that define the 2D map layout.

class NomicProjectOptions(BaseModel)

Options for Nomic 2D Dimensionality Reduction Model

Arguments:

  • n_neighbors - The number of neighbors to use when approximating the high dimensional embedding space during reduction.
  • n_epochs - How many dataset passes to train the projection model.

Customizing the embedding model

Specifying embedding_model during index creation will allow you to configure the hyperparameters of the embedding model. If you've uploaded your own embeddings, this option is ignored.

class NomicEmbedOptions(BaseModel)

Options for Configuring the Nomic Embedding Model

Arguments:

  • model - The Nomic Embedding Model to use.
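The option groups described above can be gathered into one set of keyword arguments for create_index. This is a sketch under stated assumptions: index_kwargs is a hypothetical helper, the dict keys are assumed to mirror the documented argument names, and the numeric values are placeholders rather than recommended settings.

```python
from typing import Any, Dict


def index_kwargs(indexed_field: str) -> Dict[str, Any]:
    """Collect create_index keyword arguments (hypothetical helper).

    The numeric values are arbitrary placeholders, not library defaults.
    """
    return {
        "indexed_field": indexed_field,
        "topic_model": True,
        "duplicate_detection": {"tag_duplicates": True, "duplicate_cutoff": 0.1},
        "projection": {"n_neighbors": 15, "n_epochs": 50},
        "embedding_model": "NomicEmbed",
    }


kwargs = index_kwargs("text")
print(sorted(kwargs))

# Assumed usage with an existing dataset:
# atlas_map = dataset.create_index(**kwargs)
```

Keeping the options in one dict makes it easier to reuse the same configuration across several datasets.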

The AtlasDataset

class AtlasDataset(AtlasClass)

__init__

def __init__(identifier: Optional[str] = None,
             description: Optional[str] = 'A description for your map.',
             unique_id_field: Optional[str] = None,
             is_public: bool = True,
             dataset_id=None,
             organization_name=None)

Creates or loads an AtlasDataset. AtlasDatasets store data (text, embeddings, etc.) that you can organize by building indices. If the organization already contains a dataset with this name, it will be returned instead.

Parameters:

  • identifier - The dataset identifier in the form dataset or organization/dataset. If no organization is passed, your default organization will be used.
  • description - A description for the dataset.
  • unique_id_field - The field that uniquely identifies each data point.
  • is_public - Whether this dataset is publicly accessible for viewing (read-only). If False, only members of your Nomic organization can view it.
  • dataset_id - An alternative way to load a dataset is by passing the dataset_id directly. This only works if a dataset exists.
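To make the two identifier forms concrete, here is a small sketch that separates an optional organization prefix from the dataset name. split_identifier is a hypothetical helper, not part of nomic, and "my-org" is a made-up organization name.

```python
from typing import Optional, Tuple


def split_identifier(identifier: str) -> Tuple[Optional[str], str]:
    """Split 'organization/dataset' into its parts (hypothetical helper).

    Returns (None, dataset) when no organization prefix is present.
    """
    organization, separator, dataset = identifier.rpartition("/")
    return (organization if separator else None, dataset)


print(split_identifier("example-dataset"))
print(split_identifier("my-org/example-dataset"))
```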

delete

def delete()

Deletes an Atlas dataset along with all associated metadata.


id

@property
def id() -> str

The UUID of the dataset.


total_datums

@property
def total_datums() -> int

The total number of data points in the dataset.


name

@property
def name() -> str

The customizable name of the dataset.


slug

@property
def slug() -> str

The URL-safe identifier for this dataset.


identifier

@property
def identifier() -> str

The Atlas globally unique, URL-safe identifier for this dataset.


is_accepting_data

@property
def is_accepting_data() -> bool

Checks if the dataset can accept data. Datasets cannot accept data when they are being indexed.

Returns:

True if the dataset is unlocked for data additions, False otherwise.


wait_for_dataset_lock

@contextmanager
def wait_for_dataset_lock()

Blocks thread execution until the dataset is in a state where it can ingest data.
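Since wait_for_dataset_lock is a context manager, it is used in a with statement. The sketch below wraps that pattern in a hypothetical helper, add_when_unlocked; FakeDataset is an offline stand-in written for this example (not a nomic class) so the snippet runs without network access.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List


def add_when_unlocked(dataset, records: List[Dict]) -> None:
    """Wait until the dataset can ingest data, then upload the records."""
    with dataset.wait_for_dataset_lock():
        dataset.add_data(data=records)


class FakeDataset:
    """Offline stand-in for AtlasDataset (not a nomic class)."""

    def __init__(self) -> None:
        self.rows: List[Dict] = []
        self.events: List[str] = []

    @contextmanager
    def wait_for_dataset_lock(self) -> Iterator[None]:
        self.events.append("waited")
        yield

    def add_data(self, data: List[Dict]) -> None:
        self.events.append("added")
        self.rows.extend(data)


fake = FakeDataset()
add_when_unlocked(fake, [{"id": "0", "text": "my first document"}])
print(fake.events)
```

With a real AtlasDataset, the same helper would be called as add_when_unlocked(dataset, records).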


get_map

def get_map(name: str = None,
            atlas_index_id: str = None,
            projection_id: str = None) -> AtlasProjection

Retrieves a map.

Arguments:

  • name - The name of your map. This defaults to your dataset name but can be different if you build multiple maps in your dataset.
  • atlas_index_id - If specified, will only return a map if there is one built under the index with the id atlas_index_id.
  • projection_id - If specified, will only return a map if there is one built under the projection with the id projection_id.

Returns:

The map. Raises a ValueError if no matching map is found.
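Callers that prefer a None result over an exception can wrap the lookup. This sketch assumes get_map raises ValueError when no map matches, as the return note suggests; find_map is a hypothetical helper and FakeDataset is an offline stand-in (not a nomic class).

```python
from typing import Dict, Optional


def find_map(dataset, name: Optional[str] = None):
    """Return the requested map, or None if get_map raises ValueError."""
    try:
        return dataset.get_map(name=name)
    except ValueError:
        return None


class FakeDataset:
    """Offline stand-in for AtlasDataset (not a nomic class)."""

    def __init__(self, maps: Dict[str, str]) -> None:
        self._maps = maps

    def get_map(self, name: Optional[str] = None) -> str:
        if name not in self._maps:
            raise ValueError(f"no map named {name!r}")
        return self._maps[name]


fake = FakeDataset({"example-dataset": "<map object>"})
print(find_map(fake, "example-dataset"))
print(find_map(fake, "missing-map"))
```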


create_index

def create_index(name: str = None,
                 indexed_field: str = None,
                 modality: str = None,
                 projection: Union[bool, Dict, NomicProjectOptions] = True,
                 topic_model: Union[bool, Dict, NomicTopicOptions] = True,
                 duplicate_detection: Union[bool, Dict, NomicDuplicatesOptions] = True,
                 embedding_model: Optional[Union[str, Dict, NomicEmbedOptions]] = None,
                 reuse_embeddings_from_index: str = None) -> AtlasProjection

Creates an index in the specified dataset.

Arguments:

  • name - The name of the index and the map.
  • indexed_field - For text datasets, name the data field corresponding to the text to be mapped.
  • reuse_embeddings_from_index - The name of the index to reuse embeddings from.
  • modality - The data modality of this index. Currently, Atlas supports either text or embedding indices.
  • projection - Options for configuring the 2D projection algorithm.
  • topic_model - Options for configuring the topic model.
  • duplicate_detection - Options for configuring semantic duplicate detection.
  • embedding_model - Options for configuring the embedding model.

Returns:

The projection this index has built.


get_data

def get_data(ids: List[str]) -> List[Dict]

Retrieves the contents of the data points with the given ids.

Arguments:

  • ids - A list of datum ids.

Returns:

A list of dictionaries corresponding to the data.


delete_data

def delete_data(ids: List[str]) -> bool

Deletes the specified datapoints from the dataset.

Arguments:

  • ids - A list of data ids to delete

Returns:

True if the data was deleted successfully.
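get_data and delete_data take the same kind of id list, so they can be combined to keep a local copy of rows before removing them. The sketch below is one possible pattern, not a documented workflow; delete_with_backup is a hypothetical helper and FakeDataset is an offline stand-in (not a nomic class).

```python
from typing import Dict, List


def delete_with_backup(dataset, ids: List[str]) -> List[Dict]:
    """Fetch the rows for ids, delete them, and return the fetched copies."""
    backup = dataset.get_data(ids=ids)
    if not dataset.delete_data(ids=ids):
        raise RuntimeError("delete_data did not report success")
    return backup


class FakeDataset:
    """Offline stand-in for AtlasDataset (not a nomic class)."""

    def __init__(self, rows: List[Dict]) -> None:
        self._rows = {row["id"]: row for row in rows}

    def get_data(self, ids: List[str]) -> List[Dict]:
        return [self._rows[i] for i in ids]

    def delete_data(self, ids: List[str]) -> bool:
        for i in ids:
            del self._rows[i]
        return True

    def remaining_ids(self) -> List[str]:
        return sorted(self._rows)


fake = FakeDataset([
    {"id": "0", "text": "my first document"},
    {"id": "1", "text": "my second document"},
])
removed = delete_with_backup(fake, ["0"])
print(removed, fake.remaining_ids())
```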


add_data

def add_data(data: Union[DataFrame, List[Dict], pa.Table],
             embeddings: np.array = None,
             pbar=None)

Adds data of varying modality to an Atlas dataset.

Arguments:

  • data - A pandas DataFrame, list of dictionaries, or pyarrow Table matching the dataset schema.
  • embeddings - A numpy array of embeddings: each row corresponds to a row in the table. Use if you already have embeddings for your datapoints.
  • pbar - (Optional). A tqdm progress bar to update.
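Because each embedding row must correspond to a row of data, it can help to check the alignment before uploading. The following sketch uses a hypothetical helper, check_alignment, that works on a numpy array or any sequence of rows; the sample vectors are made up.

```python
from typing import Dict, List, Sequence


def check_alignment(data: List[Dict], embeddings: Sequence[Sequence[float]]) -> int:
    """Verify one embedding row per record and a constant width; return the width."""
    if len(data) != len(embeddings):
        raise ValueError(f"{len(data)} records but {len(embeddings)} embedding rows")
    widths = {len(row) for row in embeddings}
    if len(widths) != 1:
        raise ValueError("embedding rows must all have the same length")
    return widths.pop()


data = [
    {"id": "0", "text": "my first document"},
    {"id": "1", "text": "my second document"},
]
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
dimension = check_alignment(data, embeddings)
print(dimension)

# Assumed usage with an existing dataset and numpy installed:
# dataset.add_data(data=data, embeddings=np.array(embeddings))
```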

update_maps

def update_maps(data: List[Dict],
                embeddings: Optional[np.array] = None,
                num_workers: int = 10)

Utility method to update a dataset's maps by adding the given data.

Arguments:

  • data - An [N,] element list of dictionaries containing metadata for each embedding.
  • embeddings - An [N, d] matrix of embeddings for updating an embedding dataset. Leave as None to update a text dataset.
  • shard_size - Data is uploaded in parallel by many threads. Adjust the number of datums to upload by each worker.
  • num_workers - The number of workers to use when sending data.

update_indices

def update_indices(rebuild_topic_models: bool = False)

Rebuilds all maps in a dataset with the latest state of the dataset's data. Maps will not be rebuilt to reflect the additions, deletions, or updates you have made to your data until this method is called.

Arguments:

  • rebuild_topic_models - If True, creates new topic models when updating these indices. Defaults to False.
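Since maps only reflect new data after update_indices is called, the two calls are often paired. The sketch below shows that pairing with a hypothetical helper, add_and_refresh; FakeDataset is an offline stand-in (not a nomic class) that only records which calls were made.

```python
from typing import Dict, List


def add_and_refresh(dataset, records: List[Dict], rebuild_topic_models: bool = False) -> None:
    """Upload records, then rebuild the dataset's maps so they reflect them."""
    dataset.add_data(data=records)
    dataset.update_indices(rebuild_topic_models=rebuild_topic_models)


class FakeDataset:
    """Offline stand-in for AtlasDataset (not a nomic class)."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def add_data(self, data: List[Dict]) -> None:
        self.calls.append(f"add_data({len(data)} rows)")

    def update_indices(self, rebuild_topic_models: bool = False) -> None:
        self.calls.append(f"update_indices(rebuild_topic_models={rebuild_topic_models})")


fake = FakeDataset()
add_and_refresh(fake, [{"id": "2", "text": "my third document"}], rebuild_topic_models=True)
print(fake.calls)
```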