Stability functions

tmplot.get_closest_topics(models: List[Any], ref: int = 0, method: str = 'sklb', top_words: int = 100, verbose: bool = True) → Tuple[ndarray, ndarray]

Finding closest topics in models.

Parameters:
  • models (List[Any]) – List of supported and fitted topic models.

  • ref (int = 0) – Index of reference matrix (zero-based indexing).

  • method (str = "sklb") – Distance calculation method. Possible values: “klb” (Kullback-Leibler divergence), “sklb” (symmetric Kullback-Leibler divergence), “jsd” (Jensen-Shannon divergence), “jef” (Jeffreys divergence), “hel” (Hellinger distance), “bhat” (Bhattacharyya distance), “tv” (total variation distance), “jac” (Jaccard index).

  • top_words (int = 100) – Number of top words in each topic to use in Jaccard index calculation.

  • verbose (bool = True) – Verbose output (progress bar).

Returns:

  • closest_topics (np.ndarray) – Indices of the closest topics in a single two-dimensional array (topics ✕ models). Each column corresponds to one of the compared models (by index); each row holds a set of mutually closest topics, one per model.

  • dist (np.ndarray) – Distances between the closest topics (e.g., Kullback-Leibler divergence or Jaccard index values). This array has the same shape as the first returned value.

Example

>>> # `models` must be an iterable of fitted models
>>> closest_topics, kldiv = tmplot.get_closest_topics(models)
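
To make the matching idea concrete, here is a minimal pure-Python sketch of picking the closest topic by symmetric Kullback-Leibler divergence. It is an illustration only, not tmplot's implementation: the helper names (kl_divergence, symmetric_kl, closest_topic) are hypothetical, the toy topic-word distributions are made up, and averaging the two KL directions is an assumption about what “sklb” means.

```python
import math
from typing import List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for distributions over the same vocabulary (q must be > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Symmetrised KL; averaging both directions is an assumption here."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def closest_topic(
    ref_topic: Sequence[float], other_topics: List[Sequence[float]]
) -> Tuple[int, float]:
    """Return (index, distance) of the topic nearest to ref_topic."""
    distances = [symmetric_kl(ref_topic, topic) for topic in other_topics]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


# Toy topic-word distributions over a three-word vocabulary (hypothetical data).
ref_topics = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]      # reference model
other_topics = [[0.15, 0.15, 0.7], [0.6, 0.3, 0.1]]  # model being compared

# One (index, distance) pair per reference topic.
matches = [closest_topic(topic, other_topics) for topic in ref_topics]
matched_indices = [index for index, _ in matches]
```

In this toy case the topics of the second model appear in swapped order, so matched_indices pairs reference topic 0 with topic 1 and reference topic 1 with topic 0.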

tmplot.get_stable_topics(closest_topics: ndarray, dist: ndarray, norm: bool = True, inverse: bool = True, inverse_factor: float = 1.0, ref: int = 0, thres: float = 0.9, thres_models: int = 2) → Tuple[ndarray, ndarray]

Finding stable topics in models.

Parameters:
  • closest_topics (np.ndarray) – Indices of the closest topics in a two-dimensional array. Each column corresponds to one of the compared matrices (by index); each row holds a set of mutually closest topics. Typically, this is the first value returned by tmplot.get_closest_topics().

  • dist (np.ndarray) – Distance values (e.g., Kullback-Leibler divergence or Jaccard index values) corresponding to the matrix of the closest topics. Typically, this is the second value returned by tmplot.get_closest_topics().

  • norm (bool = True) – Normalize distance values (passed as dist argument).

  • inverse (bool = True) – Invert distance values by subtracting them from inverse_factor. Should be set to False if the Jaccard index was used to calculate the closest topics.

  • inverse_factor (float = 1.0) – Factor from which distance values are subtracted when inverse is True.

  • ref (int = 0) – Index of reference matrix (i.e. reference column index, zero-based indexing).

  • thres (float = 0.9) – Threshold for filtering distance values.

  • thres_models (int = 2) – Minimum topic recurrence frequency across all models.

Returns:

  • stable_topics (np.ndarray) – Filtered matrix of the closest topics indices (i.e. stable topics).

  • dist (np.ndarray) – Filtered distance values corresponding to the matrix of the closest topics.

Example

>>> closest_topics, kldiv = tmplot.get_closest_topics(models)
>>> stable_topics, stable_kldiv = tmplot.get_stable_topics(
...     closest_topics, kldiv)
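
The parameters above suggest a normalize, invert, then threshold pipeline. The pure-Python sketch below shows one plausible reading of it and is not tmplot's actual code: the function filter_stable_rows and the toy data are hypothetical, normalization by the global maximum is an assumption, and so is excluding the reference column when counting how many models pass the threshold.

```python
from typing import List, Tuple


def filter_stable_rows(
    closest: List[List[int]],
    dist: List[List[float]],
    norm: bool = True,
    inverse: bool = True,
    inverse_factor: float = 1.0,
    ref: int = 0,
    thres: float = 0.9,
    thres_models: int = 2,
) -> Tuple[List[List[int]], List[List[float]]]:
    """Keep rows whose scores pass `thres` in at least `thres_models` models."""
    scores = [list(row) for row in dist]
    if norm:
        # Assumption: scale all values by the global maximum.
        largest = max(max(row) for row in scores)
        if largest > 0:
            scores = [[v / largest for v in row] for row in scores]
    if inverse:
        # Turn distances into similarities: small distance -> high score.
        scores = [[inverse_factor - v for v in row] for row in scores]
    # Assumption: the reference column is not counted toward thres_models.
    keep = [
        sum(v >= thres for j, v in enumerate(row) if j != ref) >= thres_models
        for row in scores
    ]
    stable = [row for row, k in zip(closest, keep) if k]
    kept_scores = [row for row, k in zip(scores, keep) if k]
    return stable, kept_scores


# Hypothetical inputs: 3 reference topics compared across 3 models.
closest = [[0, 2, 1], [1, 0, 2], [2, 1, 0]]
dist = [[0.0, 0.05, 0.02], [0.0, 0.9, 0.8], [0.0, 0.04, 1.0]]

stable, stable_scores = filter_stable_rows(closest, dist)
```

Under these assumptions only the first row survives: its topic is close to its counterparts in both non-reference models, while the other rows pass the threshold in at most one model. With a similarity measure such as the Jaccard index, the same sketch would be called with inverse=False, in line with the note on the inverse parameter.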