How to Visualize Embeddings with t-SNE, UMAP, and Nomic Atlas
Neural networks transform data (e.g. text, images, audio) into numerical vectors called embeddings. These vectors capture semantic or structural relationships within the data, allowing models to perform tasks like recommendation, classification, and clustering. However, embeddings typically live in high-dimensional spaces (e.g., 128, 512, or 1024 dimensions), making them hard to interpret directly without the right tools.
This guide explores three leading methods for embedding visualization:
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
- UMAP (Uniform Manifold Approximation and Projection)
- Nomic Atlas (Cloud-based, Interactive Visualization)
By the end, you will understand how each technique works, what its pros and cons are, and when to use it for debugging models, analyzing clustering, and refining vector search performance.
What Are Embeddings?
In machine learning, embeddings are dense vector representations of data. For example:
- Text Embeddings: Words or sentences mapped into high-dimensional vectors where similar meanings cluster together.
- Image Embeddings: Visual features extracted by a convolutional or transformer network.
Visualization techniques like t-SNE, UMAP, and Nomic Atlas project these high-dimensional embeddings into 2D so you can explore them on a flat canvas.
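To make "similar meanings cluster together" concrete, here is a minimal sketch that compares toy embeddings with cosine similarity. The three 4-dimensional vectors are hand-written, hypothetical values chosen for illustration; a real model would produce them, and in far more dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, not the output of any real model.
toy_embeddings = {
    "cat":   [0.9, 0.8, 0.1, 0.0],
    "dog":   [0.8, 0.9, 0.2, 0.1],
    "truck": [0.1, 0.0, 0.9, 0.8],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_truck = cosine_similarity(toy_embeddings["cat"], toy_embeddings["truck"])

print(f"cat vs dog:   {sim_cat_dog:.3f}")
print(f"cat vs truck: {sim_cat_truck:.3f}")
```

The two animal vectors point in nearly the same direction, so their similarity is much higher than the cat/truck pair. A good 2D projection should place them close together for the same reason.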
Why Visualize Embeddings?
- Identify Clusters: See if your data forms distinct groups, which can be crucial for classification or recommendation tasks.
- Debug Model Failures: Spot misclassified or outlier points where a model’s assumptions break down.
- Refine Vector Search: Understand how embeddings are distributed, which can improve search accuracy and performance.
- Explainability: Provide stakeholders with a tangible view of how a model “sees” the data.
Visualizing Embeddings with t-SNE
t-SNE is a non-linear dimensionality reduction method that emphasizes local relationships among data points.
Key Characteristics
- Local Structure: Excels at forming tight clusters of similar points.
- Complexity: Computationally expensive. The exact algorithm scales as O(N²) in time and memory; the Barnes-Hut approximation (the scikit-learn default) reduces this to roughly O(N log N).
- Best For: Small or medium-sized datasets (a few thousand to tens of thousands of points).
Pros & Cons
Pros
- Captures local relationships well
- Effective at visually separating small, distinct clusters
Cons
- Slow for large datasets
- May distort global distances; clusters might appear artificially separated
- Requires careful hyperparameter tuning (perplexity, learning rate)
- Python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Sample high-dimensional data (1,000 points, 128-dimensional)
X = np.random.rand(1000, 128)
# Reduce to 2D using t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)
# Plot results
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], s=5)
plt.title("t-SNE Visualization")
plt.show()
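The perplexity argument used above can be read as the effective number of neighbors each point pays attention to. The sketch below illustrates the idea from the t-SNE paper for a single point: neighbor probabilities come from a Gaussian kernel, and the kernel width sigma is searched until the distribution reaches the requested perplexity. It is a simplified illustration, not scikit-learn's internal implementation; the function names, search bounds, and sample data are assumptions.

```python
import numpy as np

def conditional_probs(sq_dists, sigma):
    """p_{j|i} for one point i, given squared distances to the other points."""
    logits = -np.asarray(sq_dists, dtype=float) / (2.0 * sigma ** 2)
    logits -= logits.max()  # stabilize the exponentials
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """2 ** (Shannon entropy in bits): the effective number of neighbors."""
    p = p[p > 0]
    return float(2.0 ** (-(p * np.log2(p)).sum()))

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, iters=60):
    """Binary-search sigma so the perplexity matches `target`.

    The bounds lo/hi are assumptions that suit data of roughly unit scale.
    """
    for _ in range(iters):
        mid = (lo * hi) ** 0.5  # geometric midpoint
        if perplexity(conditional_probs(sq_dists, mid)) > target:
            hi = mid
        else:
            lo = mid
    return (lo * hi) ** 0.5

# Random sample data: squared distances from point 0 to the other 49 points.
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 10))
sq_dists = ((points[1:] - points[0]) ** 2).sum(axis=1)

sigma = sigma_for_perplexity(sq_dists, target=5.0)
achieved = perplexity(conditional_probs(sq_dists, sigma))
print(f"sigma={sigma:.3f}, perplexity={achieved:.2f}")
```

A small perplexity yields a narrow kernel that only "sees" the closest points, which is why low values tend to fragment the plot into many tiny clusters.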
Visualizing Embeddings with UMAP
UMAP (Uniform Manifold Approximation and Projection) is a faster alternative to t-SNE that also aims to preserve some global structure.
Key Characteristics
- Speed: Typically much faster than t-SNE, often utilizing approximate nearest neighbor search.
- Structure: Balances local and global neighborhood preservation.
- Best For: Medium to large datasets, where you need quicker computation than t-SNE.
Pros & Cons
Pros
- Faster than t-SNE, scales better
- Tends to preserve more global structure than t-SNE while keeping local neighborhoods intact (less likely to split single clusters)
Cons
- Results can vary based on hyperparameters (n_neighbors, min_dist)
- Some interpretability challenges remain for very large datasets
- Python
import numpy as np
import umap
import matplotlib.pyplot as plt
# Sample high-dimensional data (1,000 points, 128-dimensional)
X = np.random.rand(1000, 128)
# Reduce to 2D using UMAP
umap_model = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine', random_state=42)
X_embedded = umap_model.fit_transform(X)
# Plot results
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], s=5)
plt.title("UMAP Visualization")
plt.show()
Visualizing Embeddings with Atlas
Nomic Atlas is a cloud-based platform for scalable, interactive embedding visualization. Unlike standalone local methods (UMAP, t-SNE), it:
- Handles millions of embeddings with real-time interactivity
- Provides collaborative, web-based interactive maps you can share
- Allows filtering and annotation by metadata
When to Use Nomic Atlas
- Very Large Datasets: Seamlessly upload and visualize millions of embeddings.
- Collaboration: Share interactive maps and analysis with team members or stakeholders.
- Exploration: Filter, zoom in on clusters, and annotate specific points.
- Data Workflows: Build workflows on top of your embeddings, such as model improvement and data curation.
Example: Uploading Embeddings to Atlas
- Python
from nomic import atlas
import numpy as np
# Suppose we have text embeddings for 10,000 documents (512-dimensional)
embeddings = np.random.rand(10000, 512)
metadata = [{'id': str(i), 'category': 'demo'} for i in range(10000)]
# Create an interactive map
atlas_data_map = atlas.map_data(
    embeddings=embeddings,
    data=metadata,
    identifier='my-first-embedding-map'
)
- Install the Nomic Atlas Python SDK:
pip install nomic
- Run the above script to upload embeddings and metadata.
- Log in to Nomic Atlas to explore and share the interactive map.
Comparison of t-SNE, UMAP, and Nomic Atlas
Below is a quick reference to help you decide which method best fits your needs:
Feature | t-SNE | UMAP | Nomic Atlas |
---|---|---|---|
Dataset Size | Small to Medium | Medium to Large | Small to Massive |
Speed | Often Slow | Faster than t-SNE | Cloud-based, scalable |
Local/Global Structure | Emphasizes Local | Balances Local/Global | Interactive, full overview |
Ease of Sharing | No | No | Yes (web-based) |
Interactivity | No | No | Yes (zoom, filter, annotate) |
Metadata Filtering | Manual / External Tool | Manual / External Tool | Built-in filtering |
Use Case Example | Small custom datasets | Larger text/image data | Millions of embeddings |
Best Practices for Embedding Visualization
- Pick the Right Method:
  - t-SNE for smaller datasets where local detail is paramount.
  - UMAP for medium-large datasets and a better global picture.
  - Nomic Atlas for collaborative, interactive embedding visualization and exploration.
- Tune Hyperparameters:
  - t-SNE: Experiment with perplexity (often 5–50) and learning_rate.
  - UMAP: Adjust n_neighbors (5–50) and min_dist.
  - Nomic Atlas: Configure map settings (e.g., dimensionality reduction method, metadata columns).
- Use Metadata:
  - Always color or label points by known categories, class labels, or other relevant metadata.
- Compare Multiple Methods:
  - If suspicious clusters appear, try a second technique (e.g., UMAP after t-SNE) to confirm.
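The "Use Metadata" advice can be sketched as follows: convert category labels into integer codes, then draw one scatter series per category so the legend is readable. The helper names, the three category names, and the random 2D points (standing in for a t-SNE or UMAP output) are assumptions made for illustration.

```python
import numpy as np

def label_codes(labels):
    """Map arbitrary category labels to integer codes (sorted by label)."""
    categories, codes = np.unique(np.asarray(labels), return_inverse=True)
    return categories, codes

def plot_by_category(points_2d, labels, title="Projection colored by category"):
    """Scatter plot with one color and one legend entry per category."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    categories, codes = label_codes(labels)
    fig, ax = plt.subplots()
    for code, name in enumerate(categories):
        mask = codes == code
        ax.scatter(points_2d[mask, 0], points_2d[mask, 1], s=5, label=str(name))
    ax.legend(title="category")
    ax.set_title(title)
    return fig

# Stand-in for a real projection, with hypothetical labels.
rng = np.random.default_rng(0)
points_2d = rng.normal(size=(300, 2))
labels = rng.choice(["news", "sports", "tech"], size=300)
categories, codes = label_codes(labels)
```

Calling plot_by_category(points_2d, labels) and then plt.show() displays the figure. With real embeddings, a cluster that mixes several colors is a prompt to inspect those points more closely.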
Common Pitfalls and How to Avoid Them
- Misinterpreting Distances: 2D projections can distort high-dimensional distances. Validate with metadata or alternative projections.
- Not Tuning Hyperparameters: Default perplexity in t-SNE or default n_neighbors in UMAP might not reveal the best structure.
- Overlooking Global Structure: t-SNE, in particular, can overemphasize local clusters. Use UMAP or a larger perplexity to see global relationships.
- Using Local Tools for Huge Datasets: For datasets over 100k embeddings, a local method might become too slow or memory-intensive. Consider Nomic Atlas or other cloud-based solutions.
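One way to check how much a projection distorts distances is to measure how many of each point's nearest neighbors in the original space are still its nearest neighbors in 2D. The sketch below is a brute-force, NumPy-only version of that idea. The function names are illustrative, and the O(N²) distance matrix makes it suitable only for samples of a few thousand points.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of every row (Euclidean, brute force)."""
    X = np.asarray(X, dtype=float)
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(sq_dists, np.inf)  # a point is not its own neighbor
    return np.argsort(sq_dists, axis=1)[:, :k]

def neighbor_overlap(X_high, X_low, k=10):
    """Mean fraction of each point's k nearest neighbors shared by both spaces."""
    high = knn_indices(X_high, k)
    low = knn_indices(X_low, k)
    shared = [len(set(h) & set(l)) for h, l in zip(high, low)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
X_high = rng.normal(size=(200, 20))
X_unrelated = rng.normal(size=(200, 2))  # stand-in for a projection that kept nothing

overlap_self = neighbor_overlap(X_high, X_high, k=10)
overlap_unrelated = neighbor_overlap(X_high, X_unrelated, k=10)
print(f"identical spaces: {overlap_self:.2f}, unrelated 2D points: {overlap_unrelated:.2f}")
```

In practice you would pass your original embeddings and the t-SNE or UMAP output. A score near the unrelated baseline suggests the plot should not be trusted for neighborhood-level conclusions. scikit-learn's sklearn.manifold.trustworthiness is a related, more refined score.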
Frequently Asked Questions (FAQ)
1. How do I choose perplexity for t-SNE?
A good rule of thumb is to try values between 5 and 50. Larger datasets often work better with a higher perplexity. Experiment with different values and visually inspect the results.
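A perplexity sweep might look like the sketch below. The helper names and the candidate list are assumptions; the filter reflects scikit-learn's requirement that perplexity be smaller than the number of samples.

```python
import numpy as np

def candidate_perplexities(n_samples, candidates=(5, 15, 30, 50)):
    """Keep only candidates below n_samples (scikit-learn rejects larger values)."""
    return [p for p in candidates if p < n_samples]

def tsne_sweep(X, perplexities, random_state=42):
    """Run t-SNE once per perplexity and return {perplexity: 2D array}."""
    from sklearn.manifold import TSNE  # lazy import: only needed when sweeping

    return {
        p: TSNE(n_components=2, perplexity=p, random_state=random_state).fit_transform(X)
        for p in perplexities
    }

# With only 40 samples, the candidate 50 is dropped.
n_samples = 40
perplexities = candidate_perplexities(n_samples)
print(perplexities)
```

Calling tsne_sweep(X, candidate_perplexities(len(X))) returns one projection per value, which you can plot side by side. Structure that persists across several perplexities is more likely to be real than structure that appears at only one setting.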
2. What about GPU acceleration for t-SNE, UMAP and Nomic Atlas?
Libraries such as RAPIDS cuML provide GPU-accelerated implementations of both t-SNE and UMAP, which can be dramatically faster on compatible hardware. openTSNE is a fast t-SNE implementation, but it runs on the CPU.
Nomic Atlas will automatically select the fastest hardware to generate your embedding visualization.
3. Which embedding distances work best for UMAP?
Cosine is a popular choice for text data (e.g., NLP embeddings). Euclidean might work well for image embeddings. Always pick a distance metric aligned with how embeddings were generated.
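The two metrics are more closely related than they may appear. If embeddings are scaled to unit length, squared Euclidean distance equals 2 - 2 * cosine similarity, so both metrics rank neighbors identically. The sketch below checks this on random vectors (sample data, not real embeddings); the function name is illustrative.

```python
import numpy as np

def l2_normalize(X):
    """Scale every row to unit length."""
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))
query, corpus = X[0], X[1:]

# Cosine similarity on the raw vectors.
cos_sim = (corpus @ query) / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))

# Squared Euclidean distance on the unit-length vectors.
Xn = l2_normalize(X)
sq_dist = ((Xn[1:] - Xn[0]) ** 2).sum(axis=1)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(sq_dist, 2.0 - 2.0 * cos_sim)
same_ranking = np.array_equal(np.argsort(-cos_sim), np.argsort(sq_dist))
print(identity_holds, same_ranking)
```

If your embedding model already returns normalized vectors, the choice between the two matters little for neighbor ranking. If it does not, vector length influences Euclidean distance but not cosine, and the projections can differ noticeably.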
Further Reading and References
- t-SNE Paper: van der Maaten & Hinton, 2008
- UMAP Documentation: UMAP-Learn Docs
- Getting started with Atlas: Nomic Atlas
Conclusion and Next Steps
Which Method Should You Use?
- t-SNE for smaller datasets and detailed cluster separation.
- UMAP for larger datasets where speed and partial global structure matter.
- Nomic Atlas for scaling to millions of points, collaborative features, and interactive exploration.
Whether you choose t-SNE, UMAP, or Nomic Atlas, visualizing embeddings is a crucial step in understanding, debugging, and refining your machine learning models.
Get Started with Nomic Atlas:
- Install Nomic Atlas:
pip install nomic
- Upload and Explore:
from nomic import atlas
import numpy as np
embeddings = np.random.rand(50000, 128)
dataset = atlas.map_data(
    embeddings=embeddings
)
- Visualize: Log in to Nomic Atlas, open your project, and start exploring your embeddings interactively.