# Map Your Text
Map your text documents with Atlas using the `map_text` function. Atlas will ingest your documents, organize them with state-of-the-art AI, and then serve you back an interactive map. Any interaction you perform on your data (e.g. tagging) can be accessed programmatically with the Atlas Python API.
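For example, tags applied in the Atlas browser can be pulled back into Python. A minimal sketch; the `AtlasProject` loader and `tags.get_tags()` accessor here are assumptions about the client API and may differ across versions:

```python
from nomic import AtlasProject

# Assumption: load an existing project by name and take its first map.
project = AtlasProject(name='News 10k Example')
atlas_map = project.maps[0]

# Assumption: returns a mapping from tag name to tagged datum ids.
tags = atlas_map.tags.get_tags()
print(tags)
```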
## Map text with Atlas
When sending text, specify an `indexed_field` in the `map_text` function. This tells Atlas which metadata field to use when building your map.
```python
from nomic import atlas
import numpy as np
from datasets import load_dataset

# Make a dataset with the shape [{'col1': 'val', 'col2': 'val', ...}, ...]
# Tip: if you're working with a pandas DataFrame,
# use pandas.DataFrame.to_dict('records') -- see the sketch below.
dataset = load_dataset('ag_news')['train']

max_documents = 10000
subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
documents = [dataset[i] for i in subset_idxs]

project = atlas.map_text(data=documents,
                         indexed_field='text',
                         name='News 10k Example',
                         colorable_fields=['label'],
                         description='News 10k Example.')
```
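As the tip in the snippet notes, a pandas DataFrame converts to this record format with `to_dict('records')`. A minimal sketch, assuming a hypothetical DataFrame with `text` and `label` columns (substitute your own schema):

```python
import pandas as pd
from nomic import atlas

# Hypothetical DataFrame; replace with your own data.
df = pd.DataFrame({
    'text': ['first document', 'second document'],
    'label': ['news', 'sports'],
})

# to_dict('records') yields [{'text': ..., 'label': ...}, ...],
# the shape map_text expects.
documents = df.to_dict('records')

project = atlas.map_text(data=documents,
                         indexed_field='text',
                         name='DataFrame Example')
```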
## Map text with your own models
Nomic integrates with embedding providers such as co:here and Hugging Face to help you build maps of text.
### Text maps with a 🤗 HuggingFace model
This code snippet is a complete example of how to make a map with a Hugging Face model.
**Note:** This example requires additional packages (torch, transformers, and datasets, per the imports below). Install them with:
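```bash
pip install transformers torch datasets
```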
```python
from nomic import atlas
from transformers import AutoTokenizer, AutoModel
import numpy as np
import torch
from datasets import load_dataset

# Make a dataset: sample 10,000 documents from sentiment140.
max_documents = 10000
dataset = load_dataset("sentiment140")['train']
documents = [dataset[i] for i in np.random.choice(len(dataset), size=max_documents, replace=False).tolist()]

model = AutoModel.from_pretrained("prajjwal1/bert-mini")
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-mini")

embeddings = []
with torch.no_grad():
    batch_size = 10  # lower this if needed
    for i in range(0, len(documents), batch_size):
        batch = [document['text'] for document in documents[i:i+batch_size]]
        encoded_input = tokenizer(batch, return_tensors='pt', padding=True)
        # Use the final hidden state of the [CLS] token as the document embedding.
        cls_embeddings = model(**encoded_input)['last_hidden_state'][:, 0]
        embeddings.append(cls_embeddings)

embeddings = torch.cat(embeddings).numpy()

response = atlas.map_embeddings(embeddings=embeddings,
                                data=documents,
                                colorable_fields=['sentiment'],
                                name="Huggingface Model Example",
                                description="An example of building a text map with a huggingface model.")
print(response)
```
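The snippet above takes the final hidden state of the `[CLS]` token as each document's embedding. Mean pooling over token embeddings is a common alternative; here is a minimal sketch of a drop-in replacement for the inside of the loop, assuming the same `tokenizer` and `model` as above:

```python
encoded_input = tokenizer(batch, return_tensors='pt', padding=True)
token_embeddings = model(**encoded_input)['last_hidden_state']  # (batch, seq, dim)

# Average token embeddings, masking out padding positions.
mask = encoded_input['attention_mask'].unsqueeze(-1)  # (batch, seq, 1)
mean_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
embeddings.append(mean_embeddings)
```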
### Text maps with a Cohere model
Obtain an API key from cohere.ai to embed your text data. Add your Cohere API key to the example below to see how their large language model organizes text from a sentiment analysis dataset.
```python
from nomic import atlas
from nomic import CohereEmbedder
import numpy as np
from datasets import load_dataset

cohere_api_key = ''

# Sample 10,000 documents from sentiment140.
dataset = load_dataset("sentiment140")['train']
max_documents = 10000
subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
documents = [dataset[i] for i in subset_idxs]

embedder = CohereEmbedder(cohere_api_key=cohere_api_key)

print(f"Embedding {len(documents)} documents with the Cohere API")
embeddings = embedder.embed(texts=[document['text'] for document in documents],
                            model='small')
if len(embeddings) != len(documents):
    raise Exception("Embedding job failed")
print("Embedding job complete.")

response = atlas.map_embeddings(embeddings=np.array(embeddings),
                                data=documents,
                                colorable_fields=['sentiment'],
                                name='Sentiment 140',
                                description='A 10,000 point sample of the huggingface sentiment140 dataset embedded with the co:here small model.')
print(response)
```