# Atlas Dataset

The `AtlasDataset` class manages your Atlas Dataset. Atlas Datasets store information server side and dynamically download it to your local environment, with caching, when necessary. Any action you can perform on your dataset in the web browser can also be accomplished by interacting with an `AtlasDataset` object.
## Modality Support

Atlas Datasets natively support text, image, and embedding datasets. Other modalities, such as video and audio, are supported by uploading user-generated embeddings.
## Creating an Atlas Dataset

```python
from nomic import AtlasDataset

dataset = AtlasDataset(
    "example-dataset",
    unique_id_field="id",
)
```
Datasets are uniquely identified in your Nomic organization with a URL-safe name.
## Adding data to an Atlas Dataset

```python
from nomic import AtlasDataset

dataset = AtlasDataset(
    "example-dataset",
    unique_id_field="id",
)

dataset.add_data(
    data=[{'id': 1, 'text': 'my first document'}, {'id': 2, 'text': 'my second document'}]
)
```
## Adding data to an Image Atlas Dataset

```python
from nomic import AtlasDataset

dataset = AtlasDataset(
    "example-image-dataset",
    unique_id_field="id",
)

dataset.add_data(
    blobs=['cat.jpg', 'dog.jpg'],
    data=[{'label': 'cat', 'id': 1}, {'label': 'dog', 'id': 2}]
)
```
## Creating an Index

To structure your dataset in Atlas, you must index it. Indexing your dataset creates a map view of the data at a point in time, automatically detects topics, generates embeddings for unstructured data fields, and augments the data with metadata such as duplicate information.

```python
from nomic import AtlasDataset

dataset = AtlasDataset(
    "example-dataset",
    unique_id_field="id",
)

dataset.add_data(
    data=[{'id': 1, 'text': 'my first document'}, {'id': 2, 'text': 'my second document'}]
)

map = dataset.create_index(
    indexed_field='text',
    topic_model=True,
    duplicate_detection=True,
    projection=None,
    embedding_model='NomicEmbed',
)
```
There are several options you can configure for how Atlas will index your dataset:
### Adding a Topic Model

Specifying `topic_model` during index creation will build a topic model over your dataset's embeddings.

```python
class NomicTopicOptions(BaseModel)
```

Options for the Nomic Topic Model.

Arguments:

- `build_topic_model` - If True, builds a topic model over your dataset's embeddings.
- `topic_label_field` - The dataset column (usually the column you embedded) that Atlas will use to assign a human-readable description to each topic.
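Because `create_index` accepts a plain dict in place of a `NomicTopicOptions` instance, the options above can be sketched as a dict whose keys mirror the arguments listed here (a sketch; the field values are illustrative):

```python
# Topic model options expressed as a plain dict; create_index accepts
# a bool, a dict, or a NomicTopicOptions instance for topic_model.
topic_options = {
    "build_topic_model": True,
    # Label each detected topic using the embedded text column.
    "topic_label_field": "text",
}

# This would then be passed as:
# dataset.create_index(indexed_field="text", topic_model=topic_options)
```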
### Detecting Duplicate Datapoints

Specifying `duplicate_detection` during index creation will automatically identify datapoints in your data that are semantic duplicates.

```python
class NomicDuplicatesOptions(BaseModel)
```

Options for Duplicate Detection.

Arguments:

- `tag_duplicates` - Should duplicate detection run over your dataset's embeddings?
- `duplicate_cutoff` - A hyperparameter of duplicate detection; smaller values capture more exact duplicates.
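As with the topic model, the duplicate detection options can be passed to `create_index` as a plain dict (a sketch; the cutoff value below is illustrative, not a recommended default):

```python
# Duplicate detection options as a plain dict; create_index accepts
# a bool, a dict, or a NomicDuplicatesOptions instance for duplicate_detection.
duplicate_options = {
    "tag_duplicates": True,
    # Smaller cutoffs flag only near-exact duplicates.
    "duplicate_cutoff": 0.1,
}
```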
### Modifying the 2D Reduction Algorithm

Specifying `projection` during index creation allows you to configure the hyperparameters that define the 2D map layout.

```python
class NomicProjectOptions(BaseModel)
```

Options for the Nomic 2D Dimensionality Reduction Model.

Arguments:

- `n_neighbors` - The number of neighbors to use when approximating the high-dimensional embedding space during reduction. Default: `None` (auto-inferred).
- `n_epochs` - How many passes over the dataset to train the projection model. Default: `None` (auto-inferred).
- `model` - The model to use when generating the 2D projected embedding space layout. Possible values: `None`, `nomic-project-v1`, or `nomic-project-v2`. Default: `None`.
- `local_neighborhood_size` - Only used when model is `nomic-project-v2`. Controls the size of the neighborhood used in the local structure optimizing step of the `nomic-project-v2` algorithm. Min value: `max(n_neighbors, 1)`; max value: `128`. Default: `None` (auto-inferred).
- `spread` - Determines how tightly packed points appear; larger values produce a more spread out layout. Min value: `0`. It is recommended to leave this as the default `None` (auto-inferred).
- `rho` - Only used when model is `nomic-project-v2`. Controls the spread in the local structure optimizing step of `nomic-project-v2`. Min value: `0`; max value: `1`. It is recommended to leave this as the default `None` (auto-inferred).
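The projection options can likewise be passed as a plain dict, and the documented bound on `local_neighborhood_size` can be checked locally before creating the index (a sketch; the specific values are illustrative):

```python
# Projection options as a plain dict; create_index accepts a bool,
# a dict, or a NomicProjectOptions instance for projection.
n_neighbors = 15
projection_options = {
    "n_neighbors": n_neighbors,
    "model": "nomic-project-v2",
    # Only valid for nomic-project-v2; must lie in [max(n_neighbors, 1), 128].
    "local_neighborhood_size": 64,
}

# Sanity-check the documented bounds before sending the options.
assert max(n_neighbors, 1) <= projection_options["local_neighborhood_size"] <= 128
```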
### Customizing the embedding model

Specifying `embedding_model` during index creation allows you to configure the hyperparameters of the embedding model. If you've uploaded your own embeddings, this option is ignored.

```python
class NomicEmbedOptions(BaseModel)
```

Options for Configuring the Nomic Embedding Model.

Arguments:

- `model` - The Nomic Embedding Model to use.
## AtlasDataset API Reference

```python
class AtlasDataset(AtlasClass)
```

### `__init__`

```python
def __init__(identifier: Optional[str] = None,
             description: Optional[str] = "A description for your map.",
             unique_id_field: Optional[str] = None,
             is_public: bool = True,
             dataset_id=None,
             organization_name=None)
```

Creates or loads an AtlasDataset. AtlasDatasets store data (text, embeddings, etc.) that you can organize by building indices. If the organization already contains a dataset with this name, it will be returned instead.

Parameters:

- `identifier` - The dataset identifier, in the form `dataset` or `organization/dataset`. If no organization is passed, your default organization will be used.
- `description` - A description for the dataset.
- `unique_id_field` - The field that uniquely identifies each data point.
- `is_public` - Should this dataset be publicly accessible for viewing (read only)? If False, only members of your Nomic organization can view it.
- `dataset_id` - An alternative way to load a dataset is by passing the dataset_id directly. This only works if the dataset already exists.
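The two identifier forms can be illustrated with a small helper; `resolve_identifier` and the default-organization fallback below are hypothetical, written only to show how a bare name and a fully qualified name relate (they are not part of the nomic API):

```python
def resolve_identifier(identifier: str, default_organization: str) -> str:
    """Hypothetical helper: expand a bare dataset name into the
    organization/dataset form that uniquely identifies it in Atlas."""
    if "/" in identifier:
        return identifier  # already fully qualified
    return f"{default_organization}/{identifier}"

resolve_identifier("example-dataset", "my-org")         # "my-org/example-dataset"
resolve_identifier("my-org/example-dataset", "my-org")  # unchanged
```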
### delete

```python
def delete()
```

Deletes an Atlas dataset with all associated metadata.

### id

```python
@property
def id() -> str
```

The UUID of the dataset.

### total_datums

```python
@property
def total_datums() -> int
```

The total number of data points in the dataset.

### name

```python
@property
def name() -> str
```

The customizable name of the dataset.

### slug

```python
@property
def slug() -> str
```

The URL-safe identifier for this dataset.

### identifier

```python
@property
def identifier() -> str
```

The globally unique, URL-safe Atlas identifier for this dataset.

### is_accepting_data

```python
@property
def is_accepting_data() -> bool
```

Checks if the dataset can accept data. Datasets cannot accept data while they are being indexed.

Returns:

True if the dataset is unlocked for data additions, False otherwise.

### wait_for_dataset_lock

```python
@contextmanager
def wait_for_dataset_lock()
```

Blocks thread execution until the dataset is in a state where it can ingest data.
### get_map

```python
def get_map(name: Optional[str] = None,
            atlas_index_id: Optional[str] = None,
            projection_id: Optional[str] = None) -> AtlasProjection
```

Retrieves a map.

Arguments:

- `name` - The name of your map. This defaults to your dataset name, but can be different if you build multiple maps in your dataset.
- `atlas_index_id` - If specified, will only return a map if there is one built under the index with the id `atlas_index_id`.
- `projection_id` - If specified, will only return a map if there is one built under the index with the id `projection_id`.

Returns:

The map; raises a ValueError if no matching map exists.
### create_index

```python
def create_index(
    name: Optional[str] = None,
    indexed_field: Optional[str] = None,
    modality: Optional[str] = None,
    projection: Union[bool, Dict, NomicProjectOptions] = True,
    topic_model: Union[bool, Dict, NomicTopicOptions] = True,
    duplicate_detection: Union[bool, Dict, NomicDuplicatesOptions] = True,
    embedding_model: Optional[Union[str, Dict, NomicEmbedOptions]] = None,
    reuse_embeddings_from_index: Optional[str] = None
) -> Optional[AtlasProjection]
```

Creates an index in the specified dataset.

Arguments:

- `name` - The name of the index and the map.
- `indexed_field` - For text datasets, the name of the data field corresponding to the text to be mapped.
- `reuse_embeddings_from_index` - The name of the index to reuse embeddings from.
- `modality` - The data modality of this index. Currently, Atlas supports either `text`, `image`, or `embedding` indices.
- `projection` - Options for configuring the 2D projection algorithm.
- `topic_model` - Options for configuring the topic model.
- `duplicate_detection` - Options for configuring semantic duplicate detection.
- `embedding_model` - Options for configuring the embedding model.

Returns:

The projection this index has built.
### get_data

```python
def get_data(ids: List[str]) -> List[Dict]
```

Retrieves the contents of the data with the given ids.

Arguments:

- `ids` - A list of datum ids.

Returns:

A list of dictionaries corresponding to the data.

### delete_data

```python
def delete_data(ids: List[str]) -> bool
```

Deletes the specified datapoints from the dataset.

Arguments:

- `ids` - A list of data ids to delete.

Returns:

True if the data was deleted successfully.
### add_data

```python
def add_data(data: Union[DataFrame, List[Dict], pa.Table],
             embeddings: Optional[np.ndarray] = None,
             blobs: Optional[List[Union[str, bytes, Image.Image]]] = None,
             pbar=None)
```

Adds data of varying modality to an Atlas dataset.

Arguments:

- `data` - A pandas DataFrame, list of dictionaries, or pyarrow Table matching the dataset schema.
- `embeddings` - A numpy array of embeddings: each row corresponds to a row in the table. Use if you already have embeddings for your datapoints.
- `blobs` - A list of image paths, bytes, or PIL Images. Use if you want to create an AtlasDataset using image embeddings over your images. Note: blobs are stored locally only.
- `pbar` - (Optional) A tqdm progress bar to update.
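The row-alignment requirement between `data` and `embeddings` can be sketched as follows (embeddings are shown as plain lists for brevity; the real call takes a numpy array):

```python
# Metadata rows and embedding rows must align one-to-one:
# row i of `embeddings` is the vector for data[i].
data = [
    {"id": 1, "text": "my first document"},
    {"id": 2, "text": "my second document"},
]
embeddings = [
    [0.1, 0.2, 0.3],  # vector for id 1
    [0.4, 0.5, 0.6],  # vector for id 2
]

# Check the lengths match before uploading.
assert len(data) == len(embeddings)

# With nomic installed, this would be uploaded as:
# dataset.add_data(data=data, embeddings=np.array(embeddings))
```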
### update_maps

```python
def update_maps(data: List[Dict],
                embeddings: Optional[np.ndarray] = None,
                num_workers: int = 10)
```

Utility method to update a project's maps by adding the given data.

Arguments:

- `data` - An [N,] element list of dictionaries containing metadata for each embedding.
- `embeddings` - An [N, d] matrix of embeddings for updating an embedding dataset. Leave as None to update a text dataset.
- `num_workers` - The number of workers to use when sending data. Data is uploaded in parallel by multiple workers.
### update_indices

```python
def update_indices(rebuild_topic_models: bool = False)
```

Rebuilds all maps in a dataset with the latest dataset state. Maps will not be rebuilt to reflect the additions, deletions, or updates you have made to your data until this method is called.

Arguments:

- `rebuild_topic_models` - (Default False) If True, creates new topic models when updating these indices.