Atlas Datasets
Atlas Datasets are the foundation for storing and managing data in Atlas. They provide a scalable, columnar storage format that supports text, images, and embeddings.
Core Concepts
Atlas Datasets store data in a columnar format optimized for large-scale operations. When you create a dataset, Atlas automatically handles data partitioning, caching, and synchronization between your local environment and Atlas servers. This enables efficient access to large datasets without loading everything into memory at once.
Each dataset belongs to an organization and has a unique identifier in the format organization/dataset-name. Datasets can be public (accessible to anyone) or private (accessible only to organization members).
Every Atlas Dataset can have one or more maps: semantic views of your data created through indexing. Maps organize your data spatially based on content similarity, making it easy to explore and understand large datasets. While the dataset stores your raw data, maps provide the structure for visualization and interaction.
Think of an Atlas Dataset as the source of truth for your data, and an Atlas Data Map as a snapshot of that data at a point in time.
To work with Atlas Datasets, you typically follow these steps:
- Create or Load: Initialize a dataset by specifying an organization and dataset name. You can create a new dataset or load an existing one.
- Add Data: Add data to your dataset in batches. Atlas supports various input formats, including lists of dictionaries, pandas DataFrames, and PyArrow Tables.
- Create Maps: Generate semantic views of your data by creating maps. Maps require specifying which field to index (for text) or providing pre-computed embeddings.
- Access Data: Access your data through the dataset's maps. Atlas provides both DataFrame and PyArrow Table interfaces for data retrieval.
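The steps above can be sketched with the `nomic` Python client. This is a minimal sketch, not a definitive implementation: the dataset identifier `my-org/my-documents` and the field names are placeholders, and the exact client calls (`AtlasDataset`, `add_data`, `create_index`) should be checked against the current client documentation.

```python
def build_dataset(records):
    """Create an Atlas dataset, upload records, and index the text field.

    Assumes the `nomic` package is installed and you are authenticated
    (e.g. via `nomic login`). Names below are illustrative, not canonical.
    """
    from nomic import AtlasDataset  # deferred import: requires `pip install nomic`

    # Step 1: create or load the dataset, with a unique ID field for tracking.
    dataset = AtlasDataset("my-org/my-documents", unique_id_field="id")

    # Step 2: add data as a list of dictionaries (one batch here).
    dataset.add_data(data=records)

    # Step 3: create a map by indexing the text field.
    return dataset.create_index(indexed_field="text")


# Each record carries the unique ID field plus the text to be indexed.
records = [
    {"id": "0", "text": "Arrow makes columnar storage fast."},
    {"id": "1", "text": "Maps organize data by similarity."},
]
# build_dataset(records)  # uncomment once authenticated
```

Step 4 (accessing data) then goes through the returned map's DataFrame or PyArrow Table interfaces.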
Data Types and Storage
Atlas natively supports several data types:
- Text documents and metadata
- Images (as files or URLs)
- Embedding vectors (high-dimensional numerical arrays)
Data is stored in Apache Arrow format, providing efficient memory usage and fast data access. When you interact with a dataset, Atlas dynamically loads data to your local environment with automatic caching.
Best Practices
When working with Atlas Datasets:
- Avoid scheduling uploads while a dataset is being indexed; data cannot be added during indexing
- Use descriptive dataset names that reflect the content
- Specify a unique ID field when creating datasets to ensure consistent data tracking
- Batch large data uploads into manageable chunks
- Use PyArrow Tables for memory-efficient data access with large datasets
- Consider data privacy requirements when setting dataset visibility
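To illustrate the batching advice above, here is a small, dependency-free sketch of chunking records before upload. The batch size and record shape are illustrative; each chunk would then be passed to the dataset's add-data call.

```python
def batched(records, batch_size):
    """Yield successive fixed-size chunks from a list of records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Illustrative rows: 2500 records uploaded in chunks of 1000.
rows = [{"id": str(i), "text": f"document {i}"} for i in range(2500)]
batches = list(batched(rows, batch_size=1000))
# Yields three chunks: 1000, 1000, and 500 records.
```

Uploading in chunks keeps memory usage bounded and makes it easier to retry a single failed batch instead of restarting the whole upload.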