Helper functions

tmplot.get_phi(model: object, vocabulary: Sequence | None = None) → DataFrame

Get words vs topics matrix (phi).

Returns phi matrix of shape W x T, where W is the number of words, and T is the number of topics.

Parameters:

model (object) – Topic model instance.
vocabulary (Optional[Sequence], optional) – Vocabulary as a list of words. Required to build the phi matrix from a gensim model instance.

Returns:

Words vs topics matrix (phi).

Return type:

pandas.DataFrame

tmplot.get_theta(model: object, corpus: List | None = None) → DataFrame

Get topics vs documents (theta) matrix.

Returns theta matrix of shape T x D, where T is the number of topics, D is the number of documents.

Parameters:

model (object) – Topic model instance.
corpus (Optional[List], optional) – Corpus.

Returns:

Topics vs documents matrix (theta).

Return type:

pandas.DataFrame
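
To make the orientation of the two matrices concrete, here is a minimal NumPy sketch of a toy phi (W x T) and theta (T x D). It does not call tmplot. The sizes and random values are hypothetical, and the normalization (each column is a probability distribution) is the usual topic-model convention, assumed here rather than stated by this page.

```python
import numpy as np

# Hypothetical toy sizes: 4 words, 2 topics, 3 documents.
n_words, n_topics, n_docs = 4, 2, 3
rng = np.random.default_rng(0)

# phi has shape W x T. Assumption: column t holds p(word | topic t),
# so every column sums to 1.
phi = rng.random((n_words, n_topics))
phi /= phi.sum(axis=0, keepdims=True)

# theta has shape T x D. Assumption: column d holds p(topic | document d),
# so every column sums to 1.
theta = rng.random((n_topics, n_docs))
theta /= theta.sum(axis=0, keepdims=True)

print(phi.shape, theta.shape)
```

Reading a row of phi gives one word across topics; reading a row of theta gives one topic across documents.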

tmplot.get_docs(model: object) → List[str]

Retrieve documents from topic model object.

Parameters:

model (object) – Topic model instance.

Returns:

List of documents.

Return type:

List[str]

tmplot.get_top_docs(docs: Sequence[str], model: object = None, theta: ndarray = None, corpus: List | None = None, docs_num: int = 5, topics: Sequence[int] = None) → DataFrame

Get top documents for all topics or for selected topics.

Parameters:

docs (Sequence) – List of documents.
model (object, optional) – Topic model instance.
theta (numpy.ndarray, optional) – Topics vs documents matrix.
corpus (Optional[List], optional) – Corpus for gensim model.
docs_num (int, optional) – Number of documents to return.
topics (Sequence[int], optional) – Sequence of topic indices.

Returns:

Top documents.

Return type:

pandas.DataFrame

Raises:

ValueError – If neither a model nor a theta matrix is passed.
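
The idea of ranking documents per topic can be sketched with plain NumPy. This is an illustration under assumptions, not tmplot's implementation: the helper name top_docs_sketch is hypothetical, it returns a dict rather than a DataFrame, and it assumes theta is laid out as T x D as described for get_theta.

```python
import numpy as np

def top_docs_sketch(docs, theta, docs_num=5, topics=None):
    """Hypothetical helper: map each topic index to its top documents."""
    theta = np.asarray(theta, dtype=float)
    if topics is None:
        # Default to every topic (every row of theta).
        topics = range(theta.shape[0])
    result = {}
    for t in topics:
        # Document indices sorted by descending probability of topic t.
        order = np.argsort(-theta[t], kind="stable")[:docs_num]
        result[t] = [docs[i] for i in order]
    return result

docs = ["doc a", "doc b", "doc c"]
theta = np.array([[0.9, 0.2, 0.5],
                  [0.1, 0.8, 0.5]])
top = top_docs_sketch(docs, theta, docs_num=2)
print(top)
```

Passing topics=[1] restricts the result to the second topic only.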

tmplot.get_relevant_terms(phi: ndarray | DataFrame, topic: int, lambda_: float = 0.6) → Series

Select relevant terms.

Parameters:

phi (Union[numpy.ndarray, pandas.DataFrame]) – Words vs topics matrix (phi).
topic (int) – Topic index.
lambda_ (float, optional) – Weight parameter, 0.6 by default. It determines the weight given to the probability of term W under topic T relative to its lift [2]. Setting it to 1 ranks terms by their topic-specific probabilities alone.

References

Returns:

Terms sorted by relevance in descending order.

Return type:

pandas.Series
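
The weighting between probability and lift can be illustrated with the commonly used relevance formula, relevance(w, t) = lambda * log p(w | t) + (1 - lambda) * log(p(w | t) / p(w)). The sketch below is a NumPy illustration under assumptions, not tmplot's code: the name relevance_sketch and the topic_probs argument are hypothetical, and the marginal p(w) is computed with uniform topic weights unless others are given.

```python
import numpy as np

def relevance_sketch(phi, topic, lambda_=0.6, topic_probs=None):
    """Hypothetical helper: relevance of every term for one topic."""
    phi = np.asarray(phi, dtype=float)
    n_topics = phi.shape[1]
    if topic_probs is None:
        # Assumption: all topics are equally likely.
        topic_probs = np.full(n_topics, 1.0 / n_topics)
    # Marginal term probability p(w) = sum_t p(w | t) * p(t).
    p_w = phi @ topic_probs
    p_w_t = phi[:, topic]
    lift = p_w_t / p_w
    # Terms with zero probability under the topic get -inf relevance.
    return lambda_ * np.log(p_w_t) + (1.0 - lambda_) * np.log(lift)

phi = np.array([[0.5, 0.5],
                [0.4, 0.1],
                [0.1, 0.4]])

# lambda_ = 1 ranks by p(w | t) only; lambda_ = 0 ranks by lift only.
by_prob = np.argsort(-relevance_sketch(phi, topic=0, lambda_=1.0))
by_lift = np.argsort(-relevance_sketch(phi, topic=0, lambda_=0.0))
print(by_prob, by_lift)
```

In this toy matrix the first term is the most probable under topic 0 but is equally common in topic 1, so lowering lambda_ moves the second, more topic-specific term to the top.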

tmplot.get_salient_terms(terms_freqs: ndarray, phi: ndarray, theta: ndarray) → ndarray

Get salient terms.

Calculated as: saliency(w) = frequency(w) * [sum_t p(t | w) * log(p(t | w) / p(t))], where w is a term index and t is a topic index.

Parameters:

terms_freqs (numpy.ndarray) – Term frequencies.
phi (numpy.ndarray) – Words vs topics matrix.
theta (numpy.ndarray) – Topics vs documents matrix.

Returns:

Terms saliency values.

Return type:

numpy.ndarray
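
The saliency formula above can be written out in NumPy as follows. This is a sketch under stated assumptions rather than tmplot's implementation: the name saliency_sketch is hypothetical, p(t) is taken to be the mean of theta over documents, p(t | w) is obtained from phi by Bayes' rule, and all probabilities are assumed to be strictly positive so that the logarithm is finite.

```python
import numpy as np

def saliency_sketch(terms_freqs, phi, theta):
    """Hypothetical helper: saliency(w) = frequency(w) * distinctiveness(w)."""
    terms_freqs = np.asarray(terms_freqs, dtype=float)
    phi = np.asarray(phi, dtype=float)      # W x T, p(w | t)
    theta = np.asarray(theta, dtype=float)  # T x D, p(t | d)

    # Assumption: p(t) is the average topic weight over documents.
    p_t = theta.mean(axis=1)
    p_t = p_t / p_t.sum()

    # Bayes' rule: p(t | w) is proportional to p(w | t) * p(t).
    joint = phi * p_t
    p_t_w = joint / joint.sum(axis=1, keepdims=True)

    # sum_t p(t | w) * log(p(t | w) / p(t)) for every term w.
    distinctiveness = (p_t_w * np.log(p_t_w / p_t)).sum(axis=1)
    return terms_freqs * distinctiveness

phi = np.array([[0.5, 0.5],
                [0.4, 0.1],
                [0.1, 0.4]])
theta = np.array([[0.5, 0.5],
                  [0.5, 0.5]])
terms_freqs = np.array([10, 5, 5])
saliency = saliency_sketch(terms_freqs, phi, theta)
print(saliency)
```

The first term is the most frequent, yet it is spread evenly over both topics, so its saliency is zero; the other two terms are topic-specific and score higher.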

tmplot.calc_terms_marg_probs(phi: ndarray | DataFrame, word_id: int | None = None) → ndarray | Series

Calculate marginal terms probabilities.

Parameters:

phi (Union[numpy.ndarray, pandas.DataFrame]) – Words vs topics matrix.
word_id (Optional[int]) – Word index.

Returns:

Marginal terms probabilities.

Return type:

Union[numpy.ndarray, pandas.Series]

tmplot.calc_topics_marg_probs(theta: DataFrame | ndarray, topic_id: int = None) → DataFrame | ndarray

Calculate marginal topics probabilities.

Parameters:

theta (Union[pandas.DataFrame, numpy.ndarray]) – Topics vs documents matrix.
topic_id (int, optional) – Topic index.

Returns:

Marginal topics probabilities.

Return type:

Union[pandas.DataFrame, numpy.ndarray]

tmplot.calc_terms_probs_ratio(phi: DataFrame, topic: int, terms_num: int = 30, lambda_: float = 0.6) → DataFrame

Get conditional and marginal probabilities of terms.

Parameters:

phi (pandas.DataFrame) – Words vs topics matrix.
topic (int) – Topic index.
terms_num (int, optional) – Number of words to return.
lambda_ (float, optional) – Weight parameter, 0.6 by default. It determines the weight given to the probability of term W under topic T relative to its lift [1]. Setting it to 1 ranks terms by their topic-specific probabilities alone.

References

Returns:

Conditional and marginal probabilities of words.

Return type:

pandas.DataFrame