Data Upload
Setup
Make sure you have the nomic package installed:
pip install nomic
Then log in with your API key from a terminal or command line:
nomic login nk-...
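You can also log in from within Python. A minimal sketch, assuming the client's login helper:

import nomic

nomic.login('nk-...')  # your Nomic API key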
The following code snippets show how to use the map_data function to create Atlas maps from different types of data. Additionally, you can add batches of data to existing Atlas datasets; see the Batches section below.
Upload a text dataset
For text data, use:
from nomic import atlas

atlas.map_data(
    data=your_data,                 # your DataFrame or list of dictionaries containing the text to embed, plus optional metadata
    indexed_field=your_text_field,  # the field/column containing text to embed
    identifier=your_dataset_name    # dataset name
)
Example
This will create a map of 25,000 news articles. Each article will be embedded automatically with Nomic Embed Text.
from nomic import atlas
import pandas

news_articles = pandas.read_csv(
    'https://raw.githubusercontent.com/nomic-ai/maps/main/data/ag_news_25k.csv'
)

atlas.map_data(
    data=news_articles,
    indexed_field='text',
    identifier="Example-text-dataset-news"
)
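map_data returns an AtlasDataset, so you can capture the return value of the call above to work with the dataset programmatically. A minimal sketch, assuming the client's maps attribute on the returned object:

dataset = atlas.map_data(
    data=news_articles,
    indexed_field='text',
    identifier="Example-text-dataset-news"
)
print(dataset.maps)  # the projection(s) created for this dataset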
Upload an image dataset
For image data, use:
from nomic import atlas

atlas.map_data(
    blobs=your_images,            # your list of images stored locally (image paths, bytes, or PIL images)
    data=your_metadata,           # optional metadata for each image
    identifier=your_dataset_name  # dataset name
)
Example
This will create a map of the CIFAR10 dataset. Each image will be embedded automatically with Nomic Embed Vision.
from nomic import atlas
from datasets import load_dataset

cifar = load_dataset('cifar10', split="train")

images = cifar["img"]
data = [{"label": label} for label in cifar["label"]]

atlas.map_data(
    blobs=images,
    data=data,
    identifier="Example-image-dataset-CIFAR10"
)
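Because blobs also accepts local file paths, you can map a folder of images directly. A minimal sketch, assuming a hypothetical photos/ directory of JPEG files:

from pathlib import Path
from nomic import atlas

image_paths = [str(p) for p in Path("photos").glob("*.jpg")]  # hypothetical local folder

atlas.map_data(
    blobs=image_paths,
    data=[{"filename": path} for path in image_paths],  # optional per-image metadata
    identifier="Example-image-dataset-local-photos"
)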
Batches
If you have a large dataset of images, you can instead add images to the dataset successively in batches using the add_data method.
This example creates a map of the Imagenette dataset.
from nomic import AtlasDataset
from datasets import load_dataset
from tqdm import tqdm

id2label = {
    "0": "tench",
    "1": "English springer",
    "2": "cassette player",
    "3": "chain saw",
    "4": "church",
    "5": "French horn",
    "6": "garbage truck",
    "7": "gas pump",
    "8": "golf ball",
    "9": "parachute"
}

image_dataset = load_dataset('frgfm/imagenette', '160px')['train'].shuffle(seed=42)

images = image_dataset["image"]
labels = image_dataset["label"]
metadata = [{"label": id2label[str(label)]} for label in tqdm(labels, desc="Creating metadata")]

dataset = AtlasDataset(
    '<YOUR_ORGANIZATION_HERE>/imagenette10k-successive-adds',
    unique_id_field="id",
)

# Each record needs a unique id to match unique_id_field
for i, record in enumerate(metadata):
    record["id"] = i

# Upload images and metadata in batches of 1,000
for i in range(0, len(images), 1000):
    dataset.add_data(
        blobs=images[i:i+1000],
        data=metadata[i:i+1000],
    )

# Build the map once all data is uploaded
atlas_map = dataset.create_index(topic_model={"build_topic_model": False}, embedding_model="nomic-embed-vision-v1.5")
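The same batching pattern works for text: upload records with add_data, then build the map with create_index. A minimal sketch with hypothetical placeholder records, assuming create_index accepts indexed_field for text data:

from nomic import AtlasDataset

dataset = AtlasDataset('my-large-text-dataset', unique_id_field="id")  # hypothetical dataset name

records = [{"id": i, "text": f"document {i}"} for i in range(10_000)]  # placeholder records

# Upload in batches of 1,000
for i in range(0, len(records), 1000):
    dataset.add_data(data=records[i:i+1000])

atlas_map = dataset.create_index(indexed_field="text")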
Upload an embeddings dataset
For pre-computed embeddings, use:
from nomic import atlas

atlas.map_data(
    embeddings=your_embeddings,   # np.ndarray of shape (n_embeddings, embedding_dim)
    data=your_metadata,           # optional metadata for each embedding
    identifier=your_dataset_name  # dataset name
)
Example
This will create a map of the same 25,000 news articles as the text dataset example above. Instead of having Atlas create the embeddings automatically on upload, we first generate the embeddings ourselves in Python and then upload those embedding vectors directly.
We use Nomic's embedding model nomic-embed-text-v1.5 to create the embeddings, with the task prefix "clustering" to produce embeddings specialized for visual clustering, and with local inference mode so the embeddings are generated on your own device using the downloaded model.
Additionally, we use NomicTopicOptions to create topic labels for our embeddings dataset. Topics are generated automatically when you create a text dataset, but when you create an embeddings dataset you need to specify in this way which field to use for descriptive topics. To learn more about topics and topic modeling in Atlas, visit our Topic modeling explainer guide.
from nomic import atlas, embed
from nomic.data_inference import NomicTopicOptions
import pandas

news_articles = pandas.read_csv(
    'https://raw.githubusercontent.com/nomic-ai/maps/main/data/ag_news_25k.csv'
)

embeddings = embed.text(
    texts=news_articles.text.values,
    model='nomic-embed-text-v1.5',
    task_type='clustering',
    inference_mode='local'
)['embeddings']

atlas.map_data(
    data=news_articles,
    embeddings=embeddings,
    topic_model=NomicTopicOptions(
        build_topic_model=True,
        topic_label_field='text'
    ),
    identifier='Example-embeddings-dataset-news'
)
Note: Nomic Embed vs other embedding vectors
You can upload embedding vectors from any embedding model to Atlas, but features like vector search are only available when you upload vectors in Nomic's embedding space.
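For example, vectors from any model are uploaded through the same embeddings parameter. A minimal sketch, using random vectors as a stand-in for another model's output:

import numpy as np
from nomic import atlas

embeddings = np.random.rand(1000, 384)   # stand-in for vectors from any embedding model
data = [{"id": i} for i in range(1000)]  # optional metadata for each vector

atlas.map_data(
    embeddings=embeddings,
    data=data,
    identifier="Example-custom-embeddings"
)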
API Reference
atlas.map_data
def map_data(
    data: Optional[Union[DataFrame, List[Dict], Table]] = None,
    blobs: Optional[List[Union[str, bytes, Image.Image]]] = None,
    embeddings: Optional[np.ndarray] = None,
    identifier: Optional[str] = None,
    description: str = "",
    id_field: Optional[str] = None,
    is_public: bool = True,
    indexed_field: Optional[str] = None,
    projection: Union[bool, Dict, NomicProjectOptions] = True,
    topic_model: Union[bool, Dict, NomicTopicOptions] = True,
    duplicate_detection: Union[bool, Dict, NomicDuplicatesOptions] = True,
    embedding_model: Optional[Union[str, Dict, NomicEmbedOptions]] = None
) -> AtlasDataset
Arguments:
data: An ordered collection of the datapoints you are structuring. Can be a list of dictionaries, a Pandas DataFrame, or a PyArrow Table.
blobs: A list of locally stored image paths, bytes, or PIL images to add to your image dataset.
embeddings: An [N, d] NumPy array containing the N embeddings to add.
identifier: A name for your dataset that is used to generate the dataset identifier. A unique name will be chosen if not supplied.
description: The description of your dataset.
id_field: Specify your data's unique id field. This field can be up to 36 characters in length. If not specified, one will be created for you named id_.
is_public: Whether the dataset should be accessible outside your Nomic Atlas organization.
indexed_field: The text field from the dataset that will be used to create embeddings, which determines the layout of the data map in Atlas. Required for text data, but has no effect when uploading embeddings or image blobs.
projection: Options to adjust Nomic Project - the dimensionality reduction algorithm organizing your dataset.
topic_model: Options to adjust Nomic Topic - the topic model organizing your dataset.
duplicate_detection: Options to adjust Nomic Duplicates - the duplicate detection algorithm.
embedding_model: Options to adjust the embedding model used to embed your dataset.
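Because projection, topic_model, and duplicate_detection all accept booleans, individual features can be switched off at upload time. A minimal sketch, assuming the news_articles DataFrame from the text example above:

from nomic import atlas

atlas.map_data(
    data=news_articles,
    indexed_field='text',
    identifier='Example-minimal-options',
    topic_model=False,          # skip topic modeling
    duplicate_detection=False   # skip duplicate detection
)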