Key terms
Atlas is a platform for visually and programmatically interacting with massive unstructured datasets of text documents, images and embeddings.
Atlas augments your unstructured dataset with four key steps: Vectorization, 2D Layout Optimization, Annotation, and Presentation.
- Vectorization is the transformation of data into high dimensional vector space. Atlas uses a contrastively trained deep neural network, similar to Neelakantan et al., to vectorize text. Users may also upload their own embeddings for Atlas.
- 2D Layout Optimization is the arrangement of data points into two dimensions for human interpretability. Nomic’s layout optimizer performs dimensionality reduction, and decides on parameters for the high dimensional kernel, low dimensional kernel, and optimizer that enable better projection speed, size, and quality.
- Annotation colors and labels the map using user- and Atlas-generated metadata (like topic labels and duplicate detection).
- Presentation in the Atlas system is done with Deepscatter, a library developed by the Nomic Team which can efficiently visualize billions of datapoints in the browser using custom WebGL and Apache Arrow
feather
files in a custom quadtree format.
Unstructured data
Nomic Atlas excels in managing and visualizing unstructured data, which encompasses diverse formats such as text documents, image galleries, audio files, videos, and AI model datasets. This type of data, characterized by its lack of conventional structure, is challenging to organize and analyze using traditional spreadsheets or databases.
To address this, Nomic Atlas introduces an AI-powered interface known as "the map." This innovative tool spatially organizes unstructured datapoints, grouping similar items together to facilitate intuitive understanding and interaction. Whether it's exploring large collections of multimedia content or collaborating on complex datasets, Nomic Atlas makes it possible to visualize and work with unstructured data seamlessly. Users can share insights and collaborate through a simple browser link, while developers gain programmatic access to the map's data and functionalities, enhancing the overall experience of unstructured data analysis.
Embeddings
Embeddings are the core of understanding and processing unstructured data in Nomic Atlas. They are numerical vectors representing real-world data, converting various forms of unstructured data into a format that computers can understand and process. These embeddings encapsulate the semantics and contextual relationships of data, whether it's text, images, or other media. Nomic Atlas automatically assigns embeddings to uploaded data, organizing it meaningfully. This process involves advanced AI and embedding models, transforming data into vector representations that capture their underlying meaning and relationships.
Projection
Projection in Nomic Atlas involves compressing high-dimensional data (like embeddings) into lower-dimensional spaces, primarily 2-D, for visualization and analysis. Techniques like UMAP, t-SNE, and PCA are employed to ensure that relative distances and relationships in high-dimensional space are preserved in the projected lower-dimensional space. This feature allows users to view and interact with complex datasets in a more comprehensible 2-D format on their screens, facilitating better understanding and manipulation of the data.
Topic modeling
Topic modeling is an advanced technique used in Nomic Atlas to automatically discover and categorize abstract themes within large text collections. By analyzing patterns of keywords and themes across documents, this method effectively groups texts into clusters based on shared topics. For instance, articles about health, finance, and technology can be automatically categorized into distinct clusters, each characterized by relevant sets of keywords. This automated categorization aids in sifting through vast amounts of unstructured text data to uncover underlying thematic structures without manual intervention.