Helper functions

tmplot.get_phi(model: object, vocabulary: Sequence | None = None) → DataFrame

Get words vs topics matrix (phi).

Returns phi matrix of shape W x T, where W is the number of words, and T is the number of topics.

Parameters:
  • model (object) – Topic model instance.

  • vocabulary (Optional[Sequence], optional) – Vocabulary as a list of words. Required to build the phi matrix from a gensim model instance.

Returns:

Words vs topics matrix (phi).

Return type:

pandas.DataFrame

tmplot.get_theta(model: object, corpus: List | None = None) → DataFrame

Get topics vs documents (theta) matrix.

Returns theta matrix of shape T x D, where T is the number of topics, D is the number of documents.

Parameters:
  • model (object) – Topic model instance.

  • corpus (Optional[List], optional) – Corpus.

Returns:

Topics vs documents matrix (theta).

Return type:

pandas.DataFrame
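
To make the two shapes concrete, here is a minimal sketch that uses hand-made arrays instead of a fitted model. The toy values are invented for illustration, and the assumption that every column of phi and of theta sums to 1 is not confirmed by this reference.

```python
import numpy as np

# Toy phi: W = 4 words (rows) x T = 2 topics (columns).
# Assumption: each column is a distribution p(word | topic).
phi = np.array([
    [0.5, 0.1],
    [0.3, 0.1],
    [0.1, 0.3],
    [0.1, 0.5],
])

# Toy theta: T = 2 topics (rows) x D = 3 documents (columns).
# Assumption: each column is a distribution p(topic | document).
theta = np.array([
    [0.9, 0.5, 0.2],
    [0.1, 0.5, 0.8],
])

# (W x T) @ (T x D) gives a W x D matrix of p(word | document).
word_doc = phi @ theta

print(phi.shape, theta.shape, word_doc.shape)
```

Under these assumptions the inner dimension T of both matrices must agree, which is a quick sanity check before passing them to the other helpers.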

tmplot.get_docs(model: object) → List[str]

Retrieve documents from topic model object.

Parameters:
  • model (object) – Topic model instance.

Returns:

List of documents.

Return type:

List[str]

tmplot.get_top_docs(docs: Sequence[str], model: object = None, theta: ndarray = None, corpus: List | None = None, docs_num: int = 5, topics: Sequence[int] = None) → DataFrame

Get the top documents for all topics or for selected topics.

Parameters:
  • docs (Sequence) – List of documents.

  • model (object, optional) – Topic model instance.

  • theta (numpy.ndarray, optional) – Topics vs documents matrix.

  • corpus (Optional[List], optional) – Corpus for gensim model.

  • docs_num (int, optional) – Number of documents to return per topic.

  • topics (Sequence[int], optional) – Sequence of topic indices.

Returns:

Top documents.

Return type:

pandas.DataFrame

Raises:

ValueError – If neither a model nor a theta matrix is passed.
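
The selection logic can be sketched as follows, assuming a T x D theta array: for each topic, sort documents by that topic's weight and keep the first docs_num. The helper name top_docs_sketch is hypothetical, and it returns a plain dict rather than the DataFrame the real function returns; it is not tmplot's implementation.

```python
import numpy as np

def top_docs_sketch(docs, theta=None, docs_num=5, topics=None):
    """Hypothetical helper: top documents per topic from a T x D theta."""
    if theta is None:
        # Mirrors the documented error: nothing to rank documents by.
        raise ValueError("Pass a theta matrix (or a model to derive it from).")
    theta = np.asarray(theta)
    if topics is None:
        topics = range(theta.shape[0])
    result = {}
    for t in topics:
        # Document indices sorted by descending weight of topic t.
        order = np.argsort(-theta[t], kind="stable")[:docs_num]
        result[t] = [docs[i] for i in order]
    return result

docs = ["doc a", "doc b", "doc c"]
theta = np.array([
    [0.9, 0.5, 0.2],
    [0.1, 0.5, 0.8],
])
top = top_docs_sketch(docs, theta=theta, docs_num=2)
print(top)
```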

tmplot.get_relevant_terms(phi: ndarray | DataFrame, topic: int, lambda_: float = 0.6) → Series

Select relevant terms.

Parameters:
  • phi (Union[numpy.ndarray, pandas.DataFrame]) – Words vs topics matrix (phi).

  • topic (int) – Topic index.

  • lambda_ (float, optional) – Weight parameter. It determines the weight given to the probability of term W under topic T relative to its lift [2]. Setting it to 1 ranks terms purely by their topic-specific probabilities.

Returns:

Terms sorted by relevance in descending order.

Return type:

pandas.Series
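
One common formulation of relevance is lambda * log p(w | t) + (1 - lambda) * log(p(w | t) / p(w)), where the ratio p(w | t) / p(w) is the lift. The sketch below assumes that formulation and also assumes p(w) is the average of p(w | t) over topics (equal topic weights); tmplot may compute either quantity differently. The helper name relevance_sketch is hypothetical.

```python
import numpy as np

def relevance_sketch(phi, topic, lambda_=0.6):
    """Hypothetical helper: relevance of every term for one topic."""
    phi = np.asarray(phi, dtype=float)
    p_w_given_t = phi[:, topic]
    # Assumption: marginal p(w) computed with equal topic weights.
    p_w = phi.mean(axis=1)
    lift = p_w_given_t / p_w
    return lambda_ * np.log(p_w_given_t) + (1 - lambda_) * np.log(lift)

phi = np.array([
    [0.5, 0.1],
    [0.3, 0.1],
    [0.1, 0.3],
    [0.1, 0.5],
])
scores = relevance_sketch(phi, topic=0, lambda_=0.6)
# Term indices ordered from most to least relevant for topic 0.
ranking = np.argsort(-scores)
print(ranking)
```

With lambda_ set to 1 the lift term drops out and the score reduces to log p(w | t), which matches the parameter description. The toy matrix has no zero entries; real data would need smoothing or masking before taking logarithms.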

tmplot.get_salient_terms(terms_freqs: ndarray, phi: ndarray, theta: ndarray) → ndarray

Get salient terms.

Calculated as: saliency(w) = frequency(w) * [sum_t p(t | w) * log(p(t | w)/p(t))], where w is a term index, t is a topic index.

Parameters:
  • terms_freqs (numpy.ndarray) – Word frequencies.

  • phi (numpy.ndarray) – Words vs topics matrix.

  • theta (numpy.ndarray) – Topics vs documents matrix.

Returns:

Term saliency values.

Return type:

numpy.ndarray
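
The saliency formula above can be sketched directly in NumPy. This is an illustration under stated assumptions, not tmplot's code: p(t) is taken as the mean of theta over documents, p(t | w) is obtained from phi and p(t) by Bayes' rule, and the helper name saliency_sketch is hypothetical.

```python
import numpy as np

def saliency_sketch(terms_freqs, phi, theta):
    """Hypothetical helper following the saliency formula above."""
    terms_freqs = np.asarray(terms_freqs, dtype=float)
    phi = np.asarray(phi, dtype=float)      # W x T, p(w | t)
    theta = np.asarray(theta, dtype=float)  # T x D, p(t | d)
    # Assumption: p(t) is the average topic weight over documents.
    p_t = theta.mean(axis=1)
    # Bayes' rule: p(t | w) is proportional to p(w | t) * p(t).
    joint = phi * p_t
    p_t_given_w = joint / joint.sum(axis=1, keepdims=True)
    # sum_t p(t | w) * log(p(t | w) / p(t)) for every term w.
    distinctiveness = (p_t_given_w * np.log(p_t_given_w / p_t)).sum(axis=1)
    return terms_freqs * distinctiveness

phi = np.array([
    [0.5, 0.1],
    [0.3, 0.1],
    [0.1, 0.3],
    [0.1, 0.5],
])
theta = np.array([
    [0.9, 0.5, 0.2],
    [0.1, 0.5, 0.8],
])
terms_freqs = np.array([10, 6, 6, 10])
saliency = saliency_sketch(terms_freqs, phi, theta)
print(saliency)
```

The bracketed sum is a Kullback-Leibler divergence, so it is never negative, and a term that is equally likely under every topic gets a saliency of zero regardless of its frequency.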

tmplot.calc_terms_marg_probs(phi: ndarray | DataFrame, word_id: int | None = None) → ndarray | Series

Calculate marginal term probabilities.

Parameters:
  • phi (Union[numpy.ndarray, pandas.DataFrame]) – Words vs topics matrix.

  • word_id (Optional[int]) – Word index.

Returns:

Marginal term probabilities.

Return type:

Union[numpy.ndarray, pandas.Series]
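
As a rough illustration of what a marginal term probability is, the sketch below averages p(w | t) over topics, which assumes every topic carries equal weight. That weighting is an assumption and tmplot may weight topics differently; the helper name terms_marg_probs_sketch is hypothetical.

```python
import numpy as np

def terms_marg_probs_sketch(phi, word_id=None):
    """Hypothetical helper: p(w) for all words, or for one word index."""
    phi = np.asarray(phi, dtype=float)
    # Assumption: equal topic weights, so p(w) is the mean of each row.
    probs = phi.mean(axis=1)
    return probs if word_id is None else probs[word_id]

phi = np.array([
    [0.5, 0.1],
    [0.3, 0.1],
    [0.1, 0.3],
    [0.1, 0.5],
])
all_probs = terms_marg_probs_sketch(phi)
one_prob = terms_marg_probs_sketch(phi, word_id=0)
print(all_probs, one_prob)
```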

tmplot.calc_topics_marg_probs(theta: DataFrame | ndarray, topic_id: int = None) → DataFrame | ndarray

Calculate marginal topic probabilities.

Parameters:
  • theta (Union[pandas.DataFrame, numpy.ndarray]) – Topics vs documents matrix.

  • topic_id (int, optional) – Topic index.

Returns:

Marginal topic probabilities.

Return type:

Union[pandas.DataFrame, numpy.ndarray]

tmplot.calc_terms_probs_ratio(phi: DataFrame, topic: int, terms_num: int = 30, lambda_: float = 0.6) → DataFrame

Get conditional and marginal probabilities of terms.

Parameters:
  • phi (pandas.DataFrame) – Words vs topics matrix.

  • topic (int) – Topic index.

  • terms_num (int, optional) – Number of words to return.

  • lambda_ (float, optional) – Weight parameter. It determines the weight given to the probability of term W under topic T relative to its lift [1]. Setting it to 1 ranks terms purely by their topic-specific probabilities.

Returns:

Conditional and marginal probabilities of words.

Return type:

pandas.DataFrame