%load_ext autoreload
%autoreload 2
Grouping Data Samples by Topic Data
from nomic import AtlasProject
proj = AtlasProject(project_id = "change me")
Download tiles
We can download all the tiles by grabbing the first projection and then calling its web_tile_data method, which returns a table of all the data for the tiles. This is a scalable and performant way to fetch topic labels per data sample. By default, these files are stored in ~/.nomic/cache.
Downloading these files only needs to happen once. You can avoid redownloading and rewriting the tiles by setting overwrite = False.
tb = proj.indices[0].projections[0].web_tile_data(overwrite=True)
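To check what is already on disk before deciding whether to overwrite, you can list the contents of the cache directory. The following is a minimal sketch using only the standard library. The helper name list_cached_files is hypothetical and not part of the nomic client, and it makes no assumption about how files are named or laid out inside ~/.nomic/cache.

```python
from pathlib import Path


def list_cached_files(cache_dir=None):
    """Return all files under the cache directory, or an empty list if it is missing.

    Hypothetical helper for illustration; not part of the nomic client.
    """
    # Default to the cache location mentioned above.
    if cache_dir is None:
        cache_dir = Path.home() / ".nomic" / "cache"
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    # Walk the directory recursively; no file layout is assumed.
    return sorted(p for p in cache_dir.rglob("*") if p.is_file())


cached = list_cached_files()
print(f"{len(cached)} cached file(s) found")
```

If this reports zero files, the first call to web_tile_data will need to download the tiles regardless of the overwrite setting.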
Viewing data
Now we can access some of the Atlas-generated data. Topics are stored in _topic_depth_1, _topic_depth_2, etc.
ID fields that you supplied are stored under their own names; in this dataset, the ID field is 'id_'. Data you uploaded that wasn't used for visualization (like text fields) will not be included here, but any date or categorical fields will be.
The data is returned as an Apache Arrow table; you may want to use the to_pandas() or to_pylist() method to put it in a more familiar format.
import pyarrow as pa
tb.select(['id_', '_topic_depth_1', '_duplicate_class']).to_pandas()
 | id_ | _topic_depth_1 | _duplicate_class |
---|---|---|---|
0 | 00001524-17ea-4424-9c9a-da114648b11d | Imagining and reasoning about consciousness | retention candidate |
1 | 0007f8c6-f3a3-47f3-a9ba-5314266b5b6f | Hypothesis | singleton |
2 | 001af751-d4c5-4029-a2b3-da1045f6a331 | Hypothesis | singleton |
3 | 001b2c8b-b054-46cb-b75f-5f1bbdf3adb2 | Sentences | singleton |
4 | 0020c06b-f5d8-49f1-84b1-7f8c773617b9 | Verifiable Claims | singleton |
... | ... | ... | ... |
183843 | fc29b66a-1819-4484-877c-b65fab78b60f | Thinking about thinking | singleton |
183844 | fe2a707e-0aad-427f-b096-16e5ef072ec3 | Thinking about thinking | singleton |
183845 | fe658d3b-a845-44b8-8887-31283d26605e | Thinking about thinking | singleton |
183846 | ff8029ab-16e3-43b0-bf0c-109bc8b0a7e4 | Thinking about thinking | singleton |
183847 | ffbcdb2f-527a-42a2-bd0c-c8df110277d3 | Thinking about thinking | singleton |
183848 rows × 3 columns
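Once the table is in a plain Python format, ordinary tools apply. As a small sketch, the snippet below counts samples per depth-1 topic over rows shaped like the output of to_pylist(), which returns one dictionary per row. The rows here are made-up stand-ins rather than real project data, so the snippet runs without an Atlas project.

```python
from collections import Counter

# Hypothetical rows in the shape tb.to_pylist() would return:
# one dict per data sample. The ids and labels are invented for illustration.
rows = [
    {"id_": "a1", "_topic_depth_1": "Hypothesis"},
    {"id_": "a2", "_topic_depth_1": "Hypothesis"},
    {"id_": "a3", "_topic_depth_1": "Sentences"},
]

# Count how many samples carry each depth-1 topic label.
topic_counts = Counter(row["_topic_depth_1"] for row in rows)

for topic, count in topic_counts.most_common():
    print(f"{topic}: {count}")
```

With a real table you would replace the hand-written rows with tb.to_pylist().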
Grouping by Topics
We provide a group_by_topic method in our Projection API that returns a list of topic dictionaries. Under the hood, it performs group operations on the web tile data found above.
Each dictionary contains topic metadata, including description, subtopics, etc., as well as a list of datum_ids that fall under the topic.
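To make the grouping idea concrete, here is a rough sketch of grouping sample ids by topic label in plain Python. It is not the library's implementation: the function group_ids_by_topic and the "topic" key are hypothetical, the sample rows are invented, and the real group_by_topic output carries additional metadata (description, subtopics, and so on) that this sketch omits.

```python
from collections import defaultdict


def group_ids_by_topic(rows, topic_depth=1, id_field="id_"):
    """Group sample ids by their topic label at the given depth.

    Hypothetical illustration only; not the nomic implementation.
    """
    topic_field = f"_topic_depth_{topic_depth}"
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[topic_field]].append(row[id_field])
    # One dictionary per topic. "topic" is an assumed key name;
    # "datum_ids" follows the field name mentioned in the text.
    return [{"topic": topic, "datum_ids": ids} for topic, ids in grouped.items()]


# Invented rows in the shape of the web tile data after to_pylist().
sample_rows = [
    {"id_": "a1", "_topic_depth_1": "Hypothesis"},
    {"id_": "a2", "_topic_depth_1": "Sentences"},
    {"id_": "a3", "_topic_depth_1": "Hypothesis"},
]

groups = group_ids_by_topic(sample_rows)
for group in groups:
    print(group["topic"], group["datum_ids"])
```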
As our topics are hierarchical, you can change which topic depth to group on using the topic_depth parameter; the greater the topic depth, the more specific the topic.
proj.indices[0].projections[0].group_by_topic(topic_depth=1)
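The effect of topic depth can be illustrated on a toy example: grouping the same samples on a deeper column yields more, narrower groups. The rows, topic labels, and the helper topics_at_depth below are all hypothetical and only mimic the _topic_depth_N column naming described earlier.

```python
# Invented samples with two levels of topic labels.
toy_rows = [
    {"id_": "a1", "_topic_depth_1": "Science", "_topic_depth_2": "Physics"},
    {"id_": "a2", "_topic_depth_1": "Science", "_topic_depth_2": "Biology"},
    {"id_": "a3", "_topic_depth_1": "Art", "_topic_depth_2": "Painting"},
]


def topics_at_depth(rows, topic_depth):
    """Map each topic label at the given depth to the ids that carry it."""
    field = f"_topic_depth_{topic_depth}"
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).append(row["id_"])
    return groups


coarse = topics_at_depth(toy_rows, 1)  # broad topics
fine = topics_at_depth(toy_rows, 2)    # more specific topics

print(len(coarse), "topics at depth 1;", len(fine), "topics at depth 2")
```

Here the two "Science" samples share one group at depth 1 but split into separate groups at depth 2.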