Helper functions
- tmplot.get_phi(model: object, vocabulary: Sequence | None = None) DataFrame
Get words vs topics matrix (phi).
Returns phi matrix of shape W x T, where W is the number of words, and T is the number of topics.
- Parameters:
model (object) – Topic model instance.
vocabulary (Optional[Sequence], optional) – Vocabulary as a list of words. Needed for getting the phi matrix from a gensim model instance.
- Returns:
Words vs topics matrix (phi).
- Return type:
pandas.DataFrame
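To make the W x T layout concrete, here is a minimal sketch that builds a phi-like DataFrame by hand with pandas. It does not call tmplot. The toy vocabulary, the raw counts, and the assumption that each topic column is a probability distribution over words are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy vocabulary (W = 4 words) and raw weights for T = 2 topics.
vocabulary = ["cat", "dog", "stock", "bond"]
raw = np.array([
    [8.0, 1.0],
    [6.0, 1.0],
    [1.0, 9.0],
    [1.0, 5.0],
])

# Assumption: each column holds p(word | topic), so every column sums to 1.
phi = pd.DataFrame(raw / raw.sum(axis=0), index=vocabulary)

print(phi.shape)  # (W, T): rows are words, columns are topics
```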
- tmplot.get_theta(model: object, corpus: List | None = None) DataFrame
Get topics vs documents (theta) matrix.
Returns theta matrix of shape T x D, where T is the number of topics and D is the number of documents.
- Parameters:
model (object) – Topic model instance.
corpus (Optional[List], optional) – Corpus.
- Returns:
Topics vs documents matrix (theta).
- Return type:
pandas.DataFrame
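The T x D orientation is easy to mix up with its transpose, so the following sketch builds a theta-like DataFrame by hand. It does not call tmplot. The numbers, and the assumption that each document column is a probability distribution over topics, are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical weights for T = 2 topics (rows) and D = 3 documents (columns).
raw = np.array([
    [3.0, 1.0, 2.0],
    [1.0, 3.0, 2.0],
])

# Assumption: each column holds p(topic | document), so every column sums to 1.
theta = pd.DataFrame(raw / raw.sum(axis=0))

# Dominant topic per document: row label of the largest value in each column.
# Ties are resolved in favour of the first topic.
dominant_topic = theta.idxmax(axis=0)
```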
- tmplot.get_docs(model: object) List[str]
Retrieve documents from topic model object.
- Parameters:
model (object) – Topic model instance.
- Returns:
List of documents.
- Return type:
List[str]
- tmplot.get_top_docs(docs: Sequence[str], model: object = None, theta: ndarray = None, corpus: List | None = None, docs_num: int = 5, topics: Sequence[int] = None) DataFrame
Get top documents for all (or selected) topics.
- Parameters:
docs (Sequence) – List of documents.
model (object, optional) – Topic model instance.
theta (numpy.ndarray, optional) – Topics vs documents matrix.
corpus (Optional[List], optional) – Corpus for gensim model.
docs_num (int, optional) – Number of documents to return.
topics (Sequence[int], optional) – Sequence of topic indices.
- Returns:
Top documents.
- Return type:
pandas.DataFrame
- Raises:
ValueError – If neither a model nor a theta matrix is passed.
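One way to picture what "top documents" means is to rank documents by their weight in each row of theta. The sketch below does this with NumPy. The helper name `top_docs_sketch`, the `topic{n}` column labels, and the toy data are hypothetical, and the code is not tmplot's implementation.

```python
import numpy as np
import pandas as pd

def top_docs_sketch(docs, theta, docs_num=2, topics=None):
    """Hypothetical helper: rank documents by their weight in each topic."""
    theta = np.asarray(theta)  # shape (T, D)
    if topics is None:
        topics = range(theta.shape[0])
    columns = {}
    for t in topics:
        # Indices of the documents with the largest theta[t, d], best first.
        order = np.argsort(-theta[t], kind="stable")[:docs_num]
        columns[f"topic{t}"] = [docs[d] for d in order]
    return pd.DataFrame(columns)

docs = ["cats and dogs", "stocks and bonds", "dogs chase cats", "bond yields"]
theta = np.array([
    [0.9, 0.1, 0.8, 0.2],
    [0.1, 0.9, 0.2, 0.8],
])
top = top_docs_sketch(docs, theta, docs_num=2)
```

Passing `topics=[1]` to the sketch would restrict the result to a single column, mirroring the role of the `topics` parameter described above.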
- tmplot.get_relevant_terms(phi: ndarray | DataFrame, topic: int, lambda_: float = 0.6) Series
Select relevant terms.
- Parameters:
phi (Union[numpy.ndarray, pandas.DataFrame]) – Words vs topics matrix (phi).
topic (int) – Topic index.
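lambda_ (float, optional) – Weight parameter. It determines the weight given to the probability of term W under topic T relative to its lift [2]. Setting it to 1 ranks terms by their topic-specific probabilities alone.
References
[2] Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).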
- Returns:
Terms sorted by relevance in descending order.
- Return type:
pandas.Series
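The effect of the weight parameter is easiest to see on a toy matrix. The sketch below follows the relevance definition from Sievert & Shirley (2014), relevance = lambda * log p(w | t) + (1 - lambda) * log(p(w | t) / p(w)). The helper name `relevance_sketch` and the assumption that p(w) is the row mean of phi (equally likely topics) are mine, so tmplot's own computation may differ.

```python
import numpy as np
import pandas as pd

def relevance_sketch(phi, topic, lambda_=0.6):
    """Hypothetical helper: term relevance as defined by Sievert & Shirley (2014)."""
    phi = pd.DataFrame(phi)
    p_w_given_t = phi.iloc[:, topic]
    # Assumption: topics are equally likely, so p(w) is the row mean of phi.
    p_w = phi.mean(axis=1)
    lift = p_w_given_t / p_w
    relevance = lambda_ * np.log(p_w_given_t) + (1 - lambda_) * np.log(lift)
    return relevance.sort_values(ascending=False)

# Toy phi: "the" is frequent in both topics, "cat" is specific to topic 0.
phi = pd.DataFrame(
    [[0.30, 0.02], [0.60, 0.58], [0.10, 0.40]],
    index=["cat", "the", "bond"],
)
by_prob = relevance_sketch(phi, topic=0, lambda_=1.0)  # pure p(w | t)
by_lift = relevance_sketch(phi, topic=0, lambda_=0.0)  # pure lift
```

With `lambda_=1.0` the common word "the" ranks first for topic 0, while `lambda_=0.0` promotes the topic-specific word "cat". Intermediate values trade off between the two rankings.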
- tmplot.get_salient_terms(terms_freqs: ndarray, phi: ndarray, theta: ndarray) ndarray
Get salient terms.
Calculated as: saliency(w) = frequency(w) * [sum_t p(t | w) * log(p(t | w)/p(t))], where w is a term index, t is a topic index.
- Parameters:
terms_freqs (numpy.ndarray) – Words frequencies.
phi (numpy.ndarray) – Words vs topics matrix.
theta (numpy.ndarray) – Topics vs documents matrix.
- Returns:
Terms saliency values.
- Return type:
numpy.ndarray
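The saliency formula above can be evaluated directly with NumPy. The following is a sketch under stated assumptions rather than tmplot's code: the helper name `saliency_sketch` is hypothetical, p(t) is taken as the mean of theta over documents, p(t | w) is obtained from phi via Bayes' rule, and all probabilities are assumed strictly positive.

```python
import numpy as np

def saliency_sketch(terms_freqs, phi, theta):
    """Hypothetical helper evaluating the saliency formula with NumPy."""
    terms_freqs = np.asarray(terms_freqs, dtype=float)
    phi = np.asarray(phi, dtype=float)      # shape (W, T), p(w | t)
    theta = np.asarray(theta, dtype=float)  # shape (T, D), p(t | d)

    # Assumption: documents are weighted equally, so p(t) is the mean over documents.
    p_t = theta.mean(axis=1)
    # Bayes' rule: p(t | w) is proportional to p(w | t) * p(t).
    joint = phi * p_t
    p_t_given_w = joint / joint.sum(axis=1, keepdims=True)
    # Bracketed term of the formula: sum_t p(t | w) * log(p(t | w) / p(t)).
    # Assumption: all probabilities are strictly positive, so the log is finite.
    distinctiveness = (p_t_given_w * np.log(p_t_given_w / p_t)).sum(axis=1)
    return terms_freqs * distinctiveness

phi = np.array([[0.30, 0.02], [0.60, 0.58], [0.10, 0.40]])
theta = np.array([[0.7, 0.3, 0.5], [0.3, 0.7, 0.5]])
terms_freqs = np.array([10, 50, 12])
saliency = saliency_sketch(terms_freqs, phi, theta)
```

In this toy example the second term is the most frequent but spreads almost evenly across topics, so its saliency is close to zero despite its high frequency.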
- tmplot.calc_terms_marg_probs(phi: ndarray | DataFrame, word_id: int | None = None) ndarray | Series
Calculate marginal terms probabilities.
- Parameters:
phi (Union[numpy.ndarray, pandas.DataFrame]) – Words vs topics matrix.
word_id (Optional[int]) – Word index.
- Returns:
Marginal terms probabilities.
- Return type:
Union[numpy.ndarray, pandas.Series]
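As an illustration of what a marginal term probability is, the sketch below applies the law of total probability, p(w) = sum_t p(w | t) * p(t), to a toy phi matrix. The topic probabilities `p_t` are an assumption introduced here. The function documented above takes only phi, so its exact weighting and normalization may differ.

```python
import numpy as np

# Toy phi of shape (W, T) = (3, 2); each column is p(w | t).
phi = np.array([[0.30, 0.02], [0.60, 0.58], [0.10, 0.40]])

# Assumption: topic probabilities p(t) are known; here both topics are equally likely.
p_t = np.array([0.5, 0.5])

# Law of total probability: p(w) = sum_t p(w | t) * p(t).
terms_marg_probs = phi @ p_t

# The marginal probability of a single word is one entry of that vector.
word_id = 0
word_marg_prob = terms_marg_probs[word_id]
```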
- tmplot.calc_topics_marg_probs(theta: DataFrame | ndarray, topic_id: int = None) DataFrame | ndarray
Calculate marginal topics probabilities.
- Parameters:
theta (Union[pandas.DataFrame, numpy.ndarray]) – Topics vs documents matrix.
topic_id (int, optional) – Topic index.
- Returns:
Marginal topics probabilities.
- Return type:
Union[pandas.DataFrame, numpy.ndarray]
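One plausible reading of a marginal topic probability is the average share of a topic across all documents. The sketch below computes that with NumPy on a toy theta matrix. Weighting every document equally is an assumption made here, and tmplot's own normalization may differ.

```python
import numpy as np

# Toy theta of shape (T, D) = (2, 3); each column is p(t | d).
theta = np.array([
    [0.7, 0.3, 0.5],
    [0.3, 0.7, 0.5],
])

# Assumption: every document carries equal weight, so p(t) is the row mean of theta.
topics_marg_probs = theta.mean(axis=1)

# The marginal probability of a single topic is one entry of that vector.
topic_id = 1
topic_marg_prob = topics_marg_probs[topic_id]
```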
- tmplot.calc_terms_probs_ratio(phi: DataFrame, topic: int, terms_num: int = 30, lambda_: float = 0.6) DataFrame
Get terms conditional and marginal probabilities.
- Parameters:
phi (pandas.DataFrame) – Words vs topics matrix.
topic (int) – Topic index.
terms_num (int, optional) – Number of words to return.
lambda_ (float, optional) – Weight parameter. It determines the weight given to the probability of term W under topic T relative to its lift [1]. Setting it to 1 ranks terms by their topic-specific probabilities alone.
References
[1] Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).
- Returns:
Words conditional and marginal probabilities.
- Return type:
pandas.DataFrame