Topic modeling
Nomic Atlas organizes your data into a semantic topic hierarchy, allowing you to quickly group similar datapoints together.
Learn how to access your topics in Python or read more about the topic modeling algorithms behind the Atlas system.
Use cases
- Document clustering and classification
- Content recommendation
- Trend analysis and monitoring / topic evolution over time
- Text summarization
- Knowledge discovery, pattern-finding, and data mining
What is an "indexed field"?
Your indexed_field is the attribute of your data that Atlas uses to arrange the map, and the choice is up to you. As a result, a map of news articles indexed over their text content will be a semantic layout of news article content. Because topic labels are derived from the data being indexed, they describe article content, ideally summarizing entire clusters of datapoints.
For example, your data may contain metadata such as time of publication, language, author name, and outlet name, and possibly other user-generated attributes like bias, objectivity, and polarity.
If you build an Atlas map on this dataset, you will likely want to tell Atlas to "index over" the article's text content so that the resulting map shows the landscape of article content, not some other metadata attribute like time or polarity.
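As a rough sketch of what indexing over article text could look like in Python: the records, field names, dataset name, and both helpers below are hypothetical, and the call to `atlas.map_data` with an `indexed_field` argument is an assumption about the `nomic` client that should be checked against its reference before use. The upload function is defined but not called, so the snippet runs without an Atlas login.

```python
# Hypothetical news records: one text field plus metadata attributes.
articles = [
    {"id": "a1", "text": "Oil prices climb as supply tightens.",
     "outlet": "Example Wire", "language": "en", "polarity": 0.1},
    {"id": "a2", "text": "Red Sox win the World Series.",
     "outlet": "Example Sports", "language": "en", "polarity": 0.8},
]

INDEXED_FIELD = "text"  # the attribute the map should be laid out by


def missing_indexed_field(records, field):
    """Return ids of records whose indexed field is absent or empty."""
    return [r["id"] for r in records if not str(r.get(field) or "").strip()]


def build_news_map(records, field=INDEXED_FIELD):
    """Upload records and index over `field` (assumed nomic client call)."""
    from nomic import atlas  # lazy import: needs the nomic package and a login

    return atlas.map_data(
        data=records,
        id_field="id",
        indexed_field=field,
        identifier="news-articles-demo",  # hypothetical dataset name
    )


# Check the data before uploading: every record should have usable text.
print(missing_indexed_field(articles, INDEXED_FIELD))
```

Checking the indexed field first is a design choice of this sketch: records with empty text carry nothing to lay out semantically, so it is cheaper to catch them before building a map.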
Understanding topics in Atlas
The topic labels on the Atlas map are automatically generated based on the underlying data. More specifically, the topics describe one user-selected attribute of the data, like the text content of news articles.
Example: News map with Atlas topics
Reading clusters and labels
We can see clustering and topic labeling in action in the map above, a multilingual map on world news (try browsing it yourself!).
- High-level topics: A very general topic like "Russia-Ukraine conflict" broadly describes the data points within the bottom-right cluster.
- Sub-topics: Within the cluster, there are sub-clusters with topic labels that describe more specific themes, like "Russian Navy," "Drones," "Russia-Canada Relations," and "Black Sea Fleet Headquarters."
- Individual points comprise clusters: Zooming in and hovering over individual data points in the "Russia-Canada Relations" cluster, we see headlines like "Le président de l'Ukraine, Volodymyr Zelenskyy, effectuera une visite au Canada" ("Ukrainian President Volodymyr Zelenskyy to visit Canada") and "Tổng thống Ukraine tiếp tục chuyến vận động tới Canada sau những thách thức tại Mỹ" ("The President of Ukraine continues his campaign trip to Canada after challenges in the US").
- Topic inference from clusters: The topic model infers labels such as "Russia-Canada Relations" based on clusters of individual datapoints like these. In this case in particular, the system uses a multilingual-aware model.
To learn more about how topic labels are generated, see the Topics section in How Atlas Works.
Editing topic labels
An editor of a dataset can update topic labels from within the Atlas Map.
In the View Settings panel, click the "Edit Topics" toggle to enter edit mode. In edit mode, click on a topic label to open a modal where the topic label can be altered. Within that modal the most prominent keywords related to that topic can also be viewed.
Once an edited topic label is saved, it is immediately reflected in the map, and propagated to the server. Other users of the map will see the updated topic upon refreshing the map.
Intuition behind Atlas topics
An Atlas map is much like an actual map, whether one in a phone app or a classic tri-fold tucked in your glove compartment: both serve as a guide through a landscape, except that Atlas does so with your data.
To understand the labels on an Atlas map, we can look to real maps as an analogy.
Consider a digital map of Earth.
- When we see the whole world on our screen at once, we can see labels of continents, countries, ocean names and mountain ranges.
- Zooming in brings more granularity, like states and provinces, rivers, and lakes.
- Zooming in further, names of cities, towns and villages come into view.
- Zooming in even further, we may see labeled buildings, roads, paths, bridges, or monuments.
The Atlas projection algorithm applies statistical methods to generate a "landscape" of your data where semantically similar datapoints are close to each other on the map (for more info on semantic layouts, see the note below). This imposes order over a previously disordered dataset.
Now, to help users better understand the data landscape, Atlas generates topic labels which describe the underlying data. Hierarchical clustering results in more general labels and more specific labels. Zooming into the map reveals more specific labels; zooming out of the map hides these specific labels.
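The zoom behavior described above can be illustrated with a toy sketch. It is not Atlas's rendering logic: the `visible_labels` helper and the rule that one additional depth appears per zoom step are assumptions made purely for illustration, and the labels are borrowed from the news map example.

```python
# (label, depth) pairs; depth 1 is the most general level.
labels = [
    ("Russia-Ukraine conflict", 1),
    ("Russian Navy", 2),
    ("Drones", 2),
    ("Russia-Canada Relations", 2),
]


def visible_labels(labels, zoom_step):
    """Return labels whose depth is at most zoom_step + 1 (assumed rule)."""
    max_depth = zoom_step + 1
    return [name for name, depth in labels if depth <= max_depth]


print(visible_labels(labels, zoom_step=0))  # zoomed out: general labels only
print(visible_labels(labels, zoom_step=1))  # zoomed in: sub-topics appear too
```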
To provide some other examples of unstructured data that gets structured in real life:
- The grocery store: A grocery store organizes thousands of individual products into sections like produce, dairy, frozen goods, and baked goods based on the temperature, taste, type, culinary use, and origin of food and drink products. Often, there is an order within each section: the baked goods section may keep sweet and savory items separate.
- The library: A library is organized by different attributes of its many books. The Dewey Decimal System divides books into ten main classes of content like history, science, and the arts, and these get arranged around the library's stacks. Libraries could also be organized by author's last name, genre, reading level, and more.
Although far-removed from data analysis, these examples show how a complex disarray of objects can be turned into a navigable space.
Atlas topics in Python
The topics that the Atlas system generates behind the scenes are directly accessible via Python. Information is available about topic hierarchy and topic density. Topic information can be used for downstream pipelines like visualization, analyses, and predictions.
Your topic information can be accessed through the topics attribute of a map belonging to your AtlasDataset:
from nomic import AtlasDataset
map = AtlasDataset(identifier='my-dataset').maps[0]
map.topics
Access data with topics
# Pandas dataframe of your data with columns id, topic_depth_1, topic_depth_2, etc.
print(map.topics.df)
# print(map.topics.df)
       id     topic_depth_1                     topic_depth_2
0      0      Oil Prices                        Space Exploration
1      1      Basketball                        Movies
2      10     Iran-Bush-Presidential-Elections  Canadian Budget
3      100    Red Sox win World Series          Women's Basketball
4      1000   Oil Prices                        Cutbacks
...    ...    ...                               ...
24995  14975  Red Sox win World Series          Contract
24996  14988  Oil Prices                        Airline bankruptcies
24997  14991  Red Sox win World Series          Baseball
24998  14994  Red Sox win World Series          Football coaches
24999  14997  Red Sox win World Series          Vikings receiver injured in game
25000 rows × 3 columns
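One possible downstream use of this dataframe is counting datapoints per top-level topic. The sketch below works on plain dictionaries so that it runs without an Atlas login; the `rows` list copies the first five rows of the output above, and it is assumed that in practice you would obtain such records with `map.topics.df.to_dict('records')`.

```python
from collections import Counter

# First five rows of the output above, as records.
rows = [
    {"id": "0", "topic_depth_1": "Oil Prices",
     "topic_depth_2": "Space Exploration"},
    {"id": "1", "topic_depth_1": "Basketball",
     "topic_depth_2": "Movies"},
    {"id": "10", "topic_depth_1": "Iran-Bush-Presidential-Elections",
     "topic_depth_2": "Canadian Budget"},
    {"id": "100", "topic_depth_1": "Red Sox win World Series",
     "topic_depth_2": "Women's Basketball"},
    {"id": "1000", "topic_depth_1": "Oil Prices",
     "topic_depth_2": "Cutbacks"},
]

# Count how many datapoints fall under each depth-1 topic.
counts = Counter(row["topic_depth_1"] for row in rows)
for topic, n in counts.most_common():
    print(f"{topic}: {n}")
```

On the dataframe itself, calling `value_counts()` on the `topic_depth_1` column should give the same tally.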
Access topic metadata
map.topics.metadata is a Pandas dataframe in which each row corresponds to a unique topic. Metadata associated with each topic includes:
- topic depth
- a human-readable topic description (topic label)
- identifying keywords that differentiate the topic from other topics
# Returns a Pandas df
print(map.topics.metadata)
# print(map.topics.metadata)
     depth  topic_id  topic_depth_1                     topic_description                                  topic_short_description          topic_depth_2
0    1      0         Economy                           economy/jobs/percent/workers/growth/strike/uni...  Economy                          NaN
1    1      1         Computer Hardware                 Windows/IBM/Intel/software/Apple/Linux/Interne...  Computer Hardware                NaN
2    1      2         Basketball                        game/Bryant/Kobe/Theft/Andreas/Neal/Xbox/Auto/...  Basketball                       NaN
3    1      3         War                               killed/Iraq/Palestinian/Gaza/Israeli/Darfur/Ar...  War                              NaN
4    1      4         Red Sox win World Series          Cup/win/season/Sox/night/coach/victory/team/Le...  Red Sox win World Series         NaN
...  ...    ...       ...                               ...                                                ...                              ...
199  2      191       Iran-Bush-Presidential-Elections  Pakistan/Musharraf/Kashmir/Pervez/Aziz/Pakista...  Pakistan-India peace talks       Pakistan-India peace talks
200  2      192       Iran-Bush-Presidential-Elections  Kerry/Bush/John/Democratic/Vietnam/Northern/AF...  Bush vs. Kerry                   Bush vs. Kerry
201  2      193       Iran-Bush-Presidential-Elections  Brown/Gordon/Canadian/tax/CP/pension/Canada/Qu...  Canadian Budget                  Canadian Budget
202  2      194       Iran-Bush-Presidential-Elections  Karzai/Afghan/Afghanistan/Hamid/KABUL/presiden...  Afghan presidential election     Afghan presidential election
203  2      195       Iran-Bush-Presidential-Elections  Nations/United/NATIONS/UNITED/Council/Security...  United Nations Security Council  United Nations Security Council
204 rows × 6 columns
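As an illustration of working with this metadata, the sketch below keeps only depth-1 topics and splits their keyword strings into lists. It uses plain dictionaries so it runs standalone; the single record copies the Basketball topic, with its untruncated keyword string taken from the `group_by_topic` output at the end of this page. Obtaining such records via `map.topics.metadata.to_dict('records')` is an assumption, and `top_keywords` is a hypothetical helper.

```python
# One metadata record, shaped like a row of map.topics.metadata.
metadata = [
    {
        "depth": 1,
        "topic_id": 2,
        "topic_short_description": "Basketball",
        "topic_description": "game/Bryant/Kobe/Theft/Andreas/Neal/Xbox/Auto/"
                             "NBA/games/Grand/Game/Shaquille/gamers/Lakers",
    },
]


def top_keywords(record, n=5):
    """Split the slash-separated keyword string and keep the first n."""
    return record["topic_description"].split("/")[:n]


for record in metadata:
    if record["depth"] == 1:  # keep only the most general topics
        print(record["topic_short_description"], top_keywords(record))
```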
Access topic hierarchy
map.topics.hierarchy exposes your topic breakdown as a Python dictionary. What are the most general topics, and which sub-topics do they contain?
# map.topics.hierarchy is a dict
hierarchy = map.topics.hierarchy
print(f'Your depth 1 (most general) topics are: {hierarchy.keys()}')
# print(f'Your depth 1 (most general) topics are: {hierarchy.keys()}')
Your depth 1 (most general) topics are: dict_keys([('Software sales', 1), ("Pinochet's Dictatorship", 1), ('Palestinian leader', 1), ('Middle East Conflict', 1), ('Olympic Gold Medal', 1), ('Space', 1), ('Sports', 1)])
You can use higher-level topic keys to access lower-level topics in your hierarchy.
import random
# List the subtopics in a random top-level topic
random_topic_1 = random.choice(list(hierarchy.keys()))
print(f'The general topic {random_topic_1} contains subtopics {hierarchy[random_topic_1]}')
# print(f'The general topic {random_topic_1} contains subtopics {hierarchy[random_topic_1]}')
The general topic ('Music Sharing', 1) contains subtopics ['Patent lawsuits', 'Broadband', 'Music Piracy']
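To see the whole breakdown at once, the hierarchy can be printed as an indented tree. The sketch below uses a hand-written dictionary in the shape shown above, with `(label, depth)` tuples as keys and lists of subtopic labels as values. `format_tree` is a hypothetical helper, and it handles only the two levels visible in this example.

```python
# Sample in the shape of map.topics.hierarchy, taken from the output above.
hierarchy = {
    ("Music Sharing", 1): ["Patent lawsuits", "Broadband", "Music Piracy"],
}


def format_tree(hierarchy):
    """Return one line per topic, indenting subtopics under their parent."""
    lines = []
    for (label, _depth), subtopics in hierarchy.items():
        lines.append(label)
        lines.extend(f"  - {sub}" for sub in subtopics)
    return lines


print("\n".join(format_tree(hierarchy)))
```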
Access topic groups
By providing a level of the hierarchy, you get a list of dictionaries in which each item is a distinct topic at that level. Keys for each topic include subtopics, subtopic_ids, topic_id, topic_short_description, topic_long_description, and datum_ids.
# Access high- and low-level info for your topics at different depths
your_depth_level = 1
# This accesses info for your first topic at topic depth 1
print(map.topics.group_by_topic(your_depth_level)[0])
# print(map.topics.group_by_topic(your_depth_level)[0])
{'subtopics': ['Grand Theft Auto', 'Basketball players', 'Movies'],
'subtopic_ids': [35, 36, 37],
'topic_id': 2,
'topic_short_description': 'Basketball',
'topic_long_description': 'game/Bryant/Kobe/Theft/Andreas/Neal/Xbox/Auto/NBA/games/Grand/Game/Shaquille/gamers/Lakers',
'datum_ids': {'1',
'10070',
'10167',
'10186',
'10195',
'10266',
'10288',
'10289',
'10295',
'10358'}}
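The `datum_ids` set makes it possible to go from a topic back to its datapoints, or the reverse. The following sketch builds a datum-to-topic lookup from a list shaped like the `group_by_topic` result. The `groups` sample reuses part of the output above (only the first five ids are copied), and `topic_of_datum` is a hypothetical helper name.

```python
# One entry shaped like an item of map.topics.group_by_topic(1).
# Only the first five datum ids from the output above are included.
groups = [
    {
        "subtopics": ["Grand Theft Auto", "Basketball players", "Movies"],
        "subtopic_ids": [35, 36, 37],
        "topic_id": 2,
        "topic_short_description": "Basketball",
        "datum_ids": {"1", "10070", "10167", "10186", "10195"},
    },
]


def topic_of_datum(groups):
    """Map each datum id to the short description of its topic."""
    return {
        datum_id: group["topic_short_description"]
        for group in groups
        for datum_id in group["datum_ids"]
    }


lookup = topic_of_datum(groups)
print(lookup["10070"])

# Number of datapoints per topic at this depth.
print({g["topic_short_description"]: len(g["datum_ids"]) for g in groups})
```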