How to Visualize Embeddings with t-SNE, UMAP, and Nomic Atlas

Neural networks transform data (e.g. text, images, audio) into numerical vectors called embeddings. These vectors capture semantic or structural relationships within the data, allowing models to perform tasks like recommendation, classification, and clustering. However, embeddings typically live in high-dimensional spaces (e.g., 128, 512, or 1024 dimensions), making them hard to interpret directly without the right tools.

This guide explores three leading methods for embedding visualization:

  • t-SNE (t-Distributed Stochastic Neighbor Embedding)
  • UMAP (Uniform Manifold Approximation and Projection)
  • Nomic Atlas (Cloud-based, Interactive Visualization)

By the end, you will understand how each technique works, what its pros and cons are, and when to use it for debugging models, analyzing clusters, and refining vector search performance.


What Are Embeddings?

In machine learning, embeddings are dense vector representations of data. For example:

  • Text Embeddings: Words or sentences mapped into high-dimensional vectors where similar meanings cluster together.
  • Image Embeddings: Visual features extracted by a convolutional or transformer network.

Visualization techniques like t-SNE, UMAP, and Nomic Atlas help you project these high-dimensional embeddings into 2D space for easy exploration on a flat canvas.
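
To make "similar meanings cluster together" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three 4-dimensional vectors are hand-written, hypothetical values chosen for illustration; real embeddings come from a model and have far more dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings", not produced by any real model.
cat = np.array([0.9, 0.8, 0.1, 0.0])
kitten = np.array([0.85, 0.9, 0.2, 0.05])
invoice = np.array([0.0, 0.1, 0.9, 0.8])

sim_related = cosine_similarity(cat, kitten)     # high: vectors point the same way
sim_unrelated = cosine_similarity(cat, invoice)  # low: vectors are nearly orthogonal
print(sim_related, sim_unrelated)
```

A 2D projection tries to place pairs like the first one close together and pairs like the second one far apart.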


Why Visualize Embeddings?

  • Identify Clusters: See if your data forms distinct groups, which can be crucial for classification or recommendation tasks.
  • Debug Model Failures: Spot misclassified or outlier points where a model’s assumptions break down.
  • Refine Vector Search: Understand how embeddings are distributed, which can improve search accuracy and performance.
  • Explainability: Provide stakeholders with a tangible view of how a model “sees” the data.

Visualizing Embeddings with t-SNE

t-SNE is a non-linear dimensionality reduction method that emphasizes local relationships among data points.

Key Characteristics

  • Local Structure: Excels at forming tight clusters of similar points.
  • Complexity: Can be computationally expensive. The exact algorithm scales as O(N²); the Barnes-Hut approximation (the default in scikit-learn) reduces this to roughly O(N log N).
  • Best For: Small or medium-sized datasets (a few thousand to tens of thousands of points).

Pros & Cons

Pros

  • Captures local relationships well
  • Effective at visually separating small, distinct clusters

Cons

  • Slow for large datasets
  • May distort global distances; clusters might appear artificially separated
  • Requires careful hyperparameter tuning (perplexity, learning rate)

Example: Running t-SNE with scikit-learn

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data (1,000 points, 128-dimensional). Uniform random values have
# no real cluster structure, so substitute your own embeddings here.
X = np.random.rand(1000, 128)

# Reduce to 2D using t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)

# Plot results
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], s=5)
plt.title("t-SNE Visualization")
plt.show()
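
The `perplexity=30` argument above can be read as a target for the effective number of neighbors each point considers. The sketch below shows the idea for a single point: t-SNE searches for a Gaussian bandwidth (called `sigma` here) whose neighbor distribution has the requested perplexity. The function names are illustrative and are not scikit-learn's API; the library does this internally for every point.

```python
import numpy as np

def conditional_probs(sq_dists: np.ndarray, sigma: float) -> np.ndarray:
    """Gaussian neighbor probabilities for one point, given squared distances to the others."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits -= logits.max()  # shift for numerical stability; does not change the result
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p: np.ndarray) -> float:
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    nz = p[p > 0]
    return float(2.0 ** (-(nz * np.log2(nz)).sum()))

def sigma_for_perplexity(sq_dists, target, lo=1e-4, hi=1e4, iters=60):
    """Bisection on a log scale; perplexity grows as sigma grows."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if perplexity(conditional_probs(sq_dists, mid)) > target:
            hi = mid
        else:
            lo = mid
    return float(np.sqrt(lo * hi))

rng = np.random.default_rng(0)
X_demo = rng.random((200, 16))                        # placeholder data
sq = ((X_demo[1:] - X_demo[0]) ** 2).sum(axis=1)      # squared distances from point 0
sigma = sigma_for_perplexity(sq, target=30.0)
achieved = perplexity(conditional_probs(sq, sigma))   # approximately 30
print(sigma, achieved)
```

A small perplexity makes each point attend to only a handful of neighbors, while a large one spreads attention more widely, which is why the setting changes how tight or merged the clusters look.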

Visualizing Embeddings with UMAP

UMAP (Uniform Manifold Approximation and Projection) is a faster alternative to t-SNE that also aims to preserve some global structure.

Key Characteristics

  • Speed: Typically much faster than t-SNE, often utilizing approximate nearest neighbor search.
  • Structure: Balances local and global neighborhood preservation.
  • Best For: Medium to large datasets, where you need quicker computation than t-SNE.

Pros & Cons

Pros

  • Faster than t-SNE, scales better
  • Tends to preserve more global structure than t-SNE while keeping local neighborhoods intact (less likely to split a single cluster)

Cons

  • Results can vary based on hyperparameters (n_neighbors, min_dist)
  • Some interpretability challenges remain for very large datasets

Example: Running UMAP with umap-learn

import numpy as np
import umap  # provided by the umap-learn package: pip install umap-learn
import matplotlib.pyplot as plt

# Placeholder data (1,000 points, 128-dimensional). Uniform random values have
# no real cluster structure, so substitute your own embeddings here.
X = np.random.rand(1000, 128)

# Reduce to 2D using UMAP
umap_model = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine', random_state=42)
X_embedded = umap_model.fit_transform(X)

# Plot results
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], s=5)
plt.title("UMAP Visualization")
plt.show()
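
UMAP starts by building a nearest-neighbor graph, and `n_neighbors` and `metric` control that step. As a rough illustration, the sketch below finds each point's 15 nearest neighbors under cosine distance by brute force. This is not UMAP's implementation (which typically uses approximate search, as noted above); `knn_cosine` is a hypothetical helper written for this example.

```python
import numpy as np

def knn_cosine(X: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors of every row under cosine distance (brute force)."""
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    dist = 1.0 - X_unit @ X_unit.T   # cosine distance between all pairs
    np.fill_diagonal(dist, np.inf)   # a point is not counted as its own neighbor here
    return np.argsort(dist, axis=1)[:, :k]

rng = np.random.default_rng(42)
X_demo = rng.random((1000, 128))     # placeholder data
neighbors = knn_cosine(X_demo, k=15)
print(neighbors.shape)               # (1000, 15)
```

A smaller `k` makes the graph capture only very local structure, while a larger `k` links each point to a broader neighborhood, which is the trade-off `n_neighbors` exposes.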

Visualizing Embeddings with Nomic Atlas

Nomic Atlas is a cloud-based platform for scalable, interactive embedding visualization. Unlike standalone local methods (UMAP, t-SNE), it:

  • Handles millions of embeddings with real-time interactivity
  • Provides collaborative, web-based interactive maps you can share
  • Allows filtering and annotation by metadata

When to Use Nomic Atlas

  • Very Large Datasets: Seamlessly upload and visualize millions of embeddings.
  • Collaboration: Share interactive maps and analysis with team members or stakeholders.
  • Exploration: Filter, zoom in on clusters, and annotate specific points.
  • Data Workflows: Build embedding-powered data workflows such as model improvement and data curation.

Example: Uploading Embeddings to Atlas

from nomic import atlas
import numpy as np

# Suppose we have text embeddings for 10,000 documents (512-dimensional)
embeddings = np.random.rand(10000, 512)
metadata = [{'id': str(i), 'category': 'demo'} for i in range(10000)]

# Create an interactive map
atlas_data_map = atlas.map_data(
    embeddings=embeddings,
    data=metadata,
    identifier='my-first-embedding-map'
)

To run this example:

  1. Install the Nomic Atlas Python SDK:
    pip install nomic
  2. Run the above script to upload embeddings and metadata.
  3. Log in to Nomic Atlas to explore and share the interactive map.

Comparison of t-SNE, UMAP, and Nomic Atlas

Below is a quick reference to help you decide which method best fits your needs:

| Feature | t-SNE | UMAP | Nomic Atlas |
|---|---|---|---|
| Dataset Size | Small to Medium | Medium to Large | Small to Massive |
| Speed | Often Slow | Faster than t-SNE | Cloud-based, scalable |
| Local/Global Structure | Emphasizes Local | Balances Local/Global | Interactive, full overview |
| Ease of Sharing | No | No | Yes (web-based) |
| Interactivity | No | No | Yes (zoom, filter, annotate) |
| Metadata Filtering | Manual / External Tool | Manual / External Tool | Built-in filtering |
| Use Case Example | Small custom datasets | Larger text/image data | Millions of embeddings |

Best Practices for Embedding Visualization

  • Pick the Right Method:

    • t-SNE for smaller datasets where local detail is paramount.
    • UMAP for medium-large datasets and a better global picture.
    • Nomic Atlas for collaborative, interactive embedding visualization and exploration.
  • Tune Hyperparameters:

    • t-SNE: Experiment with perplexity (often 5–50) and learning_rate.
    • UMAP: Adjust n_neighbors (5–50), min_dist.
    • Nomic Atlas: Configure map settings (e.g., dimensionality reduction method, metadata columns).
  • Use Metadata:

    • Always color or label points by known categories, class labels, or other relevant metadata.
  • Compare Multiple Methods:

    • If suspicious clusters appear, try a second technique (e.g., UMAP after t-SNE) to confirm.

Common Pitfalls and How to Avoid Them

  • Misinterpreting Distances: 2D projections can distort high-dimensional distances. Validate with metadata or alternative projections.
  • Not Tuning Hyperparameters: Default perplexity in t-SNE or default n_neighbors in UMAP might not reveal the best structure.
  • Overlooking Global Structure: t-SNE, in particular, can overemphasize local clusters. Use UMAP or a larger perplexity to see global relationships.
  • Using Local Tools for Huge Datasets: For datasets over 100k embeddings, a local method might become too slow or memory-intensive. Consider Nomic Atlas or other cloud-based solutions.
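
One way to check how much a projection distorts neighborhoods is to measure how many of each point's nearest neighbors in the original space are still its neighbors in 2D. The sketch below is a simple, assumed metric written for this guide (`neighbor_overlap` is not a library function); it uses the first two coordinates as a stand-in projection, where in practice you would pass the output of t-SNE or UMAP.

```python
import numpy as np

def knn_indices(X: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors of every row under Euclidean distance."""
    sq = (X ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared distances, all pairs
    np.fill_diagonal(dist, np.inf)                     # exclude each point itself
    return np.argsort(dist, axis=1)[:, :k]

def neighbor_overlap(X_high: np.ndarray, X_low: np.ndarray, k: int = 10) -> float:
    """Mean fraction of each point's k nearest neighbors that survive the projection."""
    high = knn_indices(X_high, k)
    low = knn_indices(X_low, k)
    kept = [len(set(h) & set(l)) / k for h, l in zip(high, low)]
    return float(np.mean(kept))

rng = np.random.default_rng(0)
X_demo = rng.random((200, 32))   # placeholder high-dimensional data
X_2d = X_demo[:, :2]             # stand-in "projection": just the first two coordinates

print(neighbor_overlap(X_demo, X_demo))  # 1.0 (nothing is distorted)
print(neighbor_overlap(X_demo, X_2d))    # much lower for this crude projection
```

A score near 1 suggests local neighborhoods are well preserved; a low score is a warning that apparent clusters in the plot may not reflect the original space.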

Frequently Asked Questions (FAQ)

1. How do I choose perplexity for t-SNE?
A good rule of thumb is to try values between 5 and 50. Larger datasets often work better with a higher perplexity. Experiment with different values and visually inspect the results.

2. What about GPU acceleration for t-SNE, UMAP and Nomic Atlas?
GPU implementations such as those in RAPIDS cuML can speed up t-SNE and UMAP dramatically on compatible hardware. openTSNE, by contrast, is a fast multi-threaded CPU implementation rather than a GPU library. Nomic Atlas will automatically select the fastest hardware to generate your embedding visualization.

3. Which embedding distances work best for UMAP?
Cosine is a popular choice for text data (e.g., NLP embeddings). Euclidean might work well for image embeddings. Always pick a distance metric aligned with how embeddings were generated.
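
If you are unsure whether the choice between cosine and Euclidean matters for your data, note that the two agree on neighbor rankings once vectors are scaled to unit length. The sketch below verifies the identity ||a - b||² = 2 - 2·cos(a, b) on placeholder data.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 8))  # placeholder embeddings

# Scale every row to unit length (L2 normalization).
X_unit = X_demo / np.linalg.norm(X_demo, axis=1, keepdims=True)

cos_sim = X_unit @ X_unit.T
sq_euclid = ((X_unit[:, None, :] - X_unit[None, :, :]) ** 2).sum(axis=-1)

# For unit vectors, squared Euclidean distance is a decreasing function of cosine similarity.
identity_holds = np.allclose(sq_euclid, 2.0 - 2.0 * cos_sim)
print(identity_holds)  # True
```

So if your embeddings are already normalized, either metric yields the same nearest neighbors; if they are not, the two can rank neighbors differently.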


Conclusion and Next Steps

Which Method Should You Use?

  • t-SNE for smaller datasets and detailed cluster separation.
  • UMAP for larger datasets where speed and partial global structure matter.
  • Nomic Atlas for scaling to millions of points, collaborative features, and interactive exploration.
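
The guidance above can be written down as a small decision helper. This is only a sketch: `suggest_method` is a hypothetical function, and the 10,000 and 100,000 thresholds are rough assumptions taken from the dataset sizes mentioned in this guide, not hard limits.

```python
def suggest_method(n_points: int, need_sharing: bool = False) -> str:
    """Suggest a visualization method from dataset size and sharing needs."""
    # Assumed thresholds: local tools get slow past ~100k points,
    # and t-SNE is most comfortable up to roughly tens of thousands.
    if need_sharing or n_points > 100_000:
        return "Nomic Atlas"
    if n_points <= 10_000:
        return "t-SNE"
    return "UMAP"

print(suggest_method(2_000))                     # t-SNE
print(suggest_method(50_000))                    # UMAP
print(suggest_method(5_000_000))                 # Nomic Atlas
print(suggest_method(2_000, need_sharing=True))  # Nomic Atlas
```

Treat the result as a starting point; comparing two methods on the same data is still the best way to confirm what you see.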

Whether you choose t-SNE, UMAP, or Nomic Atlas, visualizing embeddings is a crucial step in understanding, debugging, and refining your machine learning models.

Get Started with Nomic Atlas:

  1. Install Nomic Atlas:

    pip install nomic
  2. Upload and Explore:

    from nomic import atlas
    import numpy as np

    embeddings = np.random.rand(50000, 128)
    dataset = atlas.map_data(
        embeddings=embeddings
    )
  3. Visualize: Log in to Nomic Atlas, open your project, and start exploring your embeddings interactively.