Distance functions

tmplot.get_topics_dist(phi: ndarray | DataFrame, method: str = 'sklb', **kwargs) → ndarray

Calculate pairwise distances between the topics of a model.

Parameters:

phi (Union[ndarray, DataFrame]) – Words vs topics matrix (W x T).
method (str = "sklb") – Comparison method. Possible values: 1) “klb” - Kullback-Leibler divergence. 2) “sklb” - Symmetric Kullback-Leibler divergence. 3) “jsd” - Jensen-Shannon divergence. 4) “jef” - Jeffreys divergence. 5) “hel” - Hellinger distance. 6) “bhat” - Bhattacharyya distance. 7) “tv” - Total variation distance. 8) “jac” - Jaccard index.
**kwargs (dict) – Keyword arguments passed to distance function.

Returns:

Matrix of pairwise topic distances.

Return type:

numpy.ndarray
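
As an illustration of what the default "sklb" method measures, the following is a minimal NumPy sketch of a pairwise symmetric Kullback-Leibler divergence between the columns (topics) of a W x T matrix. It is not tmplot's implementation: the helper name symmetric_kl_matrix, the eps smoothing term, and the choice to average the two KL directions are assumptions made for this example, and the library may use a different convention. With tmplot installed, the equivalent call would be tmplot.get_topics_dist(phi, method='sklb'), per the signature above.

```python
import numpy as np


def symmetric_kl_matrix(phi, eps=1e-12):
    """Pairwise symmetric KL divergence between the columns of phi.

    Hypothetical helper for illustration only; not part of tmplot.
    """
    # Smooth and renormalise so every column is a strictly positive
    # probability distribution (avoids log(0)).
    p = np.asarray(phi, dtype=float) + eps
    p = p / p.sum(axis=0, keepdims=True)
    log_p = np.log(p)

    # KL(p_i || p_j) = sum_w p_i[w] * (log p_i[w] - log p_j[w])
    self_term = (p * log_p).sum(axis=0)   # shape (T,)
    cross_term = p.T @ log_p              # cross_term[i, j] = sum_w p_i[w] * log p_j[w]
    kl = self_term[:, None] - cross_term  # kl[i, j] = KL(p_i || p_j)

    # Assumed convention: average of the two directions.
    return 0.5 * (kl + kl.T)


# Toy words-vs-topics matrix: 4 words (rows) x 3 topics (columns).
phi = np.array([
    [0.7, 0.1, 0.25],
    [0.1, 0.7, 0.25],
    [0.1, 0.1, 0.25],
    [0.1, 0.1, 0.25],
])
dists = symmetric_kl_matrix(phi)
print(dists.round(3))
```

The result is a symmetric T x T matrix with zeros on the diagonal. In this toy input the first two topics concentrate on different words, so they end up farther from each other than either is from the uniform third topic.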

tmplot.get_topics_scatter(topic_dists: ndarray, theta: ndarray, method: str = 'tsne', method_kws: dict = None) → DataFrame

Calculate topics coordinates for a scatter plot.

Parameters:

topic_dists (numpy.ndarray) – Topics distance matrix.
theta (numpy.ndarray) – Topics vs documents probability matrix.
method (str = 'tsne') – Method to calculate topics scatter coordinates (X and Y). Possible values: 1) ‘tsne’ - t-distributed Stochastic Neighbor Embedding. 2) ‘sem’ - SpectralEmbedding. 3) ‘mds’ - MDS. 4) ‘lle’ - LocallyLinearEmbedding. 5) ‘ltsa’ - LocallyLinearEmbedding with LTSA method. 6) ‘isomap’ - Isomap.
method_kws (dict = None) – Keyword arguments passed to method function.

Returns:

Topics scatter coordinates.

Return type:

DataFrame
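
To show how a topic distance matrix can be turned into X and Y coordinates, here is a hedged sketch using classical (Torgerson) multidimensional scaling in plain NumPy. This is a stand-in chosen because it needs no extra dependencies; it is not the algorithm behind any of the method values listed above, and the helper name classical_mds is hypothetical. The theta argument is not used in this sketch.

```python
import numpy as np


def classical_mds(dists, n_components=2):
    """Embed points in n_components dimensions from a distance matrix.

    Hypothetical helper for illustration only; not part of tmplot.
    """
    d = np.asarray(dists, dtype=float)
    n = d.shape[0]

    # Double-centre the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering

    # Keep the eigenvectors with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]

    # Clip small negative eigenvalues caused by non-Euclidean input.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy distance matrix for 3 topics (a 3-4-5 triangle, exactly planar).
topic_dists = np.array([
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
])
coords = classical_mds(topic_dists)
print(coords)  # one (x, y) row per topic
```

Because the toy distances are exactly Euclidean, the pairwise distances between the returned points reproduce topic_dists. Divergence-based matrices usually cannot be embedded exactly in a plane, so any 2-D layout of them is an approximation.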

tmplot.get_top_topic_words(phi: DataFrame, words_num: int = 20, topics_idx: List[int] | ndarray = None) → DataFrame

Select top topic words from a fitted model.

Parameters:

phi (pandas.DataFrame) – Words vs topics matrix (phi) with words as indices and topics as columns.
words_num (int = 20) – The number of words to select.
topics_idx (Union[List, numpy.ndarray] = None) – Indices of the topics to select. If None, all topics are used.

Returns:

Words with highest probabilities in all (or selected) topics.

Return type:

DataFrame
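
The selection described above can be sketched with pandas as follows. This is an illustrative approximation rather than tmplot's code: the helper name top_words_per_topic is hypothetical, topics_idx is assumed to hold positional column indices, and the layout of the DataFrame actually returned by tmplot.get_top_topic_words may differ.

```python
import pandas as pd


def top_words_per_topic(phi, words_num=20, topics_idx=None):
    """Return the highest-probability words for each selected topic.

    Hypothetical helper for illustration only; not part of tmplot.
    """
    # Assumption: topics_idx contains positional indices of phi's columns.
    if topics_idx is None:
        columns = phi.columns
    else:
        columns = phi.columns[list(topics_idx)]

    # For every topic, take the index labels (words) of the largest values,
    # ordered from most to least probable.
    return pd.DataFrame({
        col: phi[col].nlargest(words_num).index.to_numpy()
        for col in columns
    })


# Toy phi: words as the index, topics as columns.
phi = pd.DataFrame(
    [[0.5, 0.1], [0.3, 0.2], [0.2, 0.7]],
    index=["cat", "dog", "fish"],
    columns=["topic0", "topic1"],
)
top = top_words_per_topic(phi, words_num=2)
only_second = top_words_per_topic(phi, words_num=1, topics_idx=[1])
print(top)
```

Each column of the result lists one topic's words in descending order of probability, so the first row holds the most probable word of every selected topic.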