Tutorial

Importing packages

 
import warnings
warnings.filterwarnings('ignore')
 
import tmplot as tmp
import pickle as pkl
import pandas as pd

Importing data

Let’s take the BTM model trained on a test dataset (SearchSnippets) as an example. We will begin with reading it from a file:

 
with open('data/model_btm.pkl', 'rb') as file:
    model = pkl.load(file)
 
docs = pd.read_csv('data/SearchSnippets.txt.gz', header=None).values.ravel()

Matrices

Researchers working with topic models often need the phi matrix (word probabilities within each topic) and the theta matrix (topic probabilities within each document). tmplot provides two functions for extracting these matrices from tomotopy, bitermplus, and gensim models.
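To make the orientation of the two matrices concrete, here is a minimal sketch with made-up numbers, written in plain Python so that it runs without tmplot. It assumes the usual convention that phi holds P(word | topic) and theta holds P(topic | document), which matches the row and column labels in the outputs shown in this tutorial. The words and probabilities are hypothetical.

```python
# Toy topic model: 3 words, 2 topics, 2 documents (made-up numbers).

# phi[word][topic] = P(word | topic): rows are words, columns are topics,
# so each topic column sums to 1 over the vocabulary.
phi = {
    "apple":  [0.7, 0.1],
    "banana": [0.2, 0.3],
    "cherry": [0.1, 0.6],
}

# theta[topic][doc] = P(topic | doc): rows are topics, columns are documents,
# so each document column sums to 1 over the topics.
theta = [
    [0.9, 0.25],  # topic 0
    [0.1, 0.75],  # topic 1
]

n_topics = len(theta)
n_docs = len(theta[0])

# Column sums: one value per topic for phi, one value per document for theta.
phi_col_sums = [sum(row[t] for row in phi.values()) for t in range(n_topics)]
theta_col_sums = [sum(theta[t][d] for t in range(n_topics)) for d in range(n_docs)]

print(phi_col_sums)    # each value is 1 up to floating-point rounding
print(theta_col_sums)  # each value is 1 up to floating-point rounding
```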

Phi matrix

Note that you will need to pass a vocabulary for a gensim model.

 
phi = tmp.get_phi(model)
phi.head()
topics                    0             1             2             3             4             5             6             7
words
aaa            3.195102e-08  3.012856e-08  3.047842e-08  3.542745e-08  3.836165e-08  2.961217e-08  2.362519e-08  4.831267e-08
aaas           3.837318e-05  3.012856e-08  3.047842e-08  3.542745e-08  3.836165e-08  5.922729e-04  6.144912e-05  2.903592e-05
aaron          3.195102e-08  3.012856e-08  3.047842e-08  3.542745e-08  4.296888e-04  2.961217e-08  2.362519e-08  4.831267e-08
aau            3.195102e-08  3.012856e-08  3.047842e-08  3.542745e-08  3.836165e-08  2.961217e-08  2.362519e-08  4.203686e-04
abbreviations  7.990951e-05  3.163800e-04  3.047842e-08  3.542745e-08  3.836165e-08  2.961217e-08  2.386144e-06  4.831267e-08

Theta matrix

 
tmp.get_theta(model).head()
docs           0         1         2         3         4         5         6         7         8         9  ...       990       991       992       993       994       995       996       997       998       999
topics
0       0.354702  0.294777  0.178074  0.332888  0.596412  0.726975  0.099094  0.257602  0.532725  0.471059  ...  0.007651  0.085897  0.025840  0.019194  0.033898  0.020408  0.030728  0.036133  0.084323  0.024301
1       0.000245  0.007173  0.021324  0.019411  0.029472  0.008740  0.011804  0.036323  0.011349  0.003909  ...  0.069988  0.263869  0.058431  0.227196  0.022920  0.021660  0.040932  0.060534  0.150018  0.071271
2       0.003073  0.057144  0.013837  0.014514  0.011813  0.002588  0.000247  0.027391  0.002325  0.005435  ...  0.007558  0.014669  0.014206  0.002697  0.008854  0.017299  0.014710  0.027672  0.061375  0.011318
3       0.003678  0.029281  0.010010  0.001287  0.027349  0.004351  0.018189  0.085879  0.011453  0.002965  ...  0.007010  0.022462  0.007516  0.006018  0.001193  0.007400  0.007335  0.021119  0.012309  0.006168
4       0.000927  0.035162  0.001736  0.319421  0.024606  0.042996  0.019524  0.036119  0.001910  0.039332  ...  0.016587  0.056386  0.005925  0.003503  0.001620  0.006468  0.004151  0.018374  0.008712  0.087364

5 rows × 1000 columns

Documents

Here is how to get the documents with the highest probability \(P(t|d)\) for each topic:

 
tmp.get_top_docs(docs, model=model)
topic0 topic1 topic2 topic3 topic4 topic5 topic6 topic7
0 speakeasy speedtest speakeasy speed test test ... links jstor sici sici jstor postwar consumptio... imdb name julia roberts julia roberts imdb mov... guitars bodies amps guitars strings vcic unc edu vcic venture capital investment c... washington edu drivers device drivers device d... apache api dom document document xml standard ... hypotheses hypotheses author illustrates hypot...
1 speedtest bandwidth speed test bandwidth speed... econpapers repec article econpapers postwar co... celebrities cruise celebrity tom cruise tom cr... louis french fashion designer designer manufac... national venture capital association foster un... manufactures parallel serial drives schools dom default xml dom tutorial xml docum... surreal surreal
2 home bandwidth broadband speedtest bandwidth c... findarticles articles consumption consumer exp... imdb name tom cruise tom cruise imdb movies ce... fashion designers default fashion designers fa... san jose mercury news venture capital expanded... leonardo leonardo vinci inventor information c... access cards ieee access allposters surrealism posters surrealism poste...
3 home bandwidth broadband speedtest bandwidth c... financial financial international health insur... absolutely roberts absolutely julia roberts ph... fashion designers audio fashion designer net f... seattlepi nwsource venture seattle venture cap... journals searching biomedical journals engine ... generator xml generator sample xml instance do... hypotheses hypotheses nature research hypothes...
4 portfolio shareholder services manage investme... consumption consumer rights consumption consum... imdb title imdb movies celebs fashion fashion designers fashion designers fa... venture capital journal listening model ventur... lwn articles driver lwn device drivers kernel ... reference standard template library standard t... allposters beatles posters beatles prints allp...
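The selection behind a table like the one above can be sketched in plain Python: for each topic, rank the documents by their probability P(t|d) and keep the first few. This is only an illustration of the idea, not tmplot's implementation; the helper top_docs_per_topic and the toy data are hypothetical.

```python
def top_docs_per_topic(docs, theta, docs_num=2):
    """Return, for each topic, the docs_num documents with the highest P(t|d).

    theta[t][d] is the probability of topic t in document d.
    """
    result = {}
    for t, row in enumerate(theta):
        # Rank document indices by descending probability for this topic.
        ranked = sorted(range(len(row)), key=lambda d: row[d], reverse=True)
        result[f"topic{t}"] = [docs[d] for d in ranked[:docs_num]]
    return result


# Made-up documents and topic-document probabilities.
docs = ["speed test bandwidth", "julia roberts movies", "xml dom tutorial"]
theta = [
    [0.80, 0.05, 0.30],  # topic 0
    [0.20, 0.95, 0.70],  # topic 1
]

top = top_docs_per_topic(docs, theta, docs_num=2)
for topic, texts in top.items():
    print(topic, texts)
```

Note that the same document can appear under several topics, because each topic is ranked independently.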

Visualization

tmplot borrows heavily from LDAvis but extends its functionality with a number of algorithms and metrics for plotting topics and terms. tmplot is built on ipywidgets and Altair (a Vega-based plotting package).

Topics

First, we need to calculate topic coordinates from the intertopic distances. By default, symmetric Kullback-Leibler divergence is used as the distance metric and t-distributed Stochastic Neighbor Embedding (t-SNE) projects the topics into 2D, but a number of other metrics and algorithms are also available (see the tmplot.get_topics_dist and tmplot.get_topics_scatter functions for additional information).

 
topics_coords = tmp.prepare_coords(model)
topics_coords.head()
           x          y  topic       size  label
0 -41.183987 -30.480648      0  21.160233      0
1 -11.704910 -34.631725      1   4.265470      1
2 -56.292171  -4.832846      2  20.599346      2
3   9.921317 -14.181945      3   7.176289      3
4 -45.702721  22.987968      4   4.535249      4
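The symmetric Kullback-Leibler divergence mentioned above can be sketched for two topic-word distributions as follows. This is a minimal illustration in plain Python that assumes one common definition, the sum of the two directed divergences; tmplot's exact formula and scaling may differ. The three toy distributions are made up.

```python
import math


def kl_divergence(p, q):
    """Directed Kullback-Leibler divergence KL(p || q).

    Both inputs must be strictly positive and of equal length.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetrized divergence: KL(p || q) + KL(q || p) (assumed definition)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Made-up word distributions for three topics over a 3-word vocabulary.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.3, 0.6]
topic_c = [0.6, 0.3, 0.1]

dist_ab = symmetric_kl(topic_a, topic_b)
dist_ac = symmetric_kl(topic_a, topic_c)

# topic_c is much closer to topic_a than topic_b is.
print(dist_ab, dist_ac)
```

A matrix of such pairwise distances is what a dimensionality reduction method like t-SNE would then embed into two dimensions.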

Plotting topics:

 
tmp.plot_scatter_topics(topics_coords, size_col='size', label_col='label')

Words (or terms)

tmplot sorts terms by relevance, a measure introduced by Sievert and Shirley (2014).

 
terms_probs = tmp.calc_terms_probs_ratio(phi, topic=0, lambda_=1)
 
tmp.plot_terms(terms_probs)
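For reference, Sievert and Shirley (2014) define the relevance of a term w to a topic t as lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), where p(w) is the marginal probability of the term in the corpus. The sketch below implements that formula in plain Python on made-up probabilities; it follows the paper's definition and is not necessarily how tmplot computes its values internally. The names relevance and rank_terms are hypothetical.

```python
import math


def relevance(p_w_given_t, p_w, lambda_):
    """Term relevance as defined by Sievert and Shirley (2014).

    p_w_given_t: probability of the term within the topic, p(w|t)
    p_w:         marginal probability of the term in the corpus, p(w)
    lambda_:     weight between 0 and 1
    """
    return (lambda_ * math.log(p_w_given_t)
            + (1 - lambda_) * math.log(p_w_given_t / p_w))


def rank_terms(terms, lambda_):
    """Sort terms by descending relevance for the given lambda_."""
    return sorted(
        terms,
        key=lambda w: relevance(*terms[w], lambda_=lambda_),
        reverse=True,
    )


# Hypothetical probabilities: term -> (p(w|t), p(w)).
terms = {
    "the":       (0.050, 0.0500),
    "bandwidth": (0.020, 0.0025),
    "speedtest": (0.010, 0.0011),
}

rank_by_probability = rank_terms(terms, lambda_=1.0)
rank_by_lift = rank_terms(terms, lambda_=0.0)

print(rank_by_probability)  # ['the', 'bandwidth', 'speedtest']
print(rank_by_lift)         # ['speedtest', 'bandwidth', 'the']
```

With lambda_=1, as in the call above, terms are ranked by their probability within the topic alone; lower values increasingly favor terms that are specific to the topic.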

Documents

 
top_docs_topic0 = tmp.get_top_docs(docs, model=model, docs_num=2, topics=[0])
top_docs_topic0
topic0
0 speakeasy speedtest speakeasy speed test test ...
1 speedtest bandwidth speed test bandwidth speed...

The following output is used within the interactive interface that we will explore shortly:

 
tmp.plot_docs(top_docs_topic0)
topic0
0 speakeasy speedtest speakeasy speed test test speed internet connection speakeasy speed test
1 speedtest bandwidth speed test bandwidth speed test bandwidth bandwidth speed internet service

Interactive report interface

To run the report interface, call the tmplot.report() function with your model and documents. Most of the underlying parameters can be adjusted through keyword arguments (see the function docstring).

 
tmp.report(model, docs=docs, height=400, width=250)

Report