LDA model

This page describes the LDA class.

class artm.LDA(num_topics=None, num_processors=None, cache_theta=False, dictionary=None, num_document_passes=10, seed=-1, alpha=0.01, beta=0.01, theta_columns_naming='id')
__init__(num_topics=None, num_processors=None, cache_theta=False, dictionary=None, num_document_passes=10, seed=-1, alpha=0.01, beta=0.01, theta_columns_naming='id')
Parameters:
  • num_topics (int) – the number of topics in the model; overwritten if topic_names is set
  • num_processors (int) – number of threads used for model training; if not specified, the library detects the number of threads
  • cache_theta (bool) – whether to save the Theta matrix in the model. Required if get_theta() will be called
  • num_document_passes (int) – number of inner iterations over each document
  • dictionary (str or reference to Dictionary object) – dictionary used for initialization; if None, no initialization is performed
  • reuse_theta (bool) – whether to reuse Theta from the previous iteration
  • seed (unsigned int or -1) – seed for random initialization, -1 means no seed
  • alpha (float) – hyperparameter of Theta smoothing regularizer
  • beta (float or list of floats with len == num_topics) – hyperparameter of Phi smoothing regularizer
  • theta_columns_naming (str) – either ‘id’ or ‘title’, determines how to name columns (documents) in theta dataframe
Note:
  • the type (not the value) of beta should not change after initialization: a scalar should stay a scalar, and a list should stay a list.
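The constraint in the note above can be expressed as a small check. This is an illustrative sketch only: beta_kind and beta_type_unchanged are hypothetical helpers, not part of the artm package, and the library itself may validate beta differently.

```python
from numbers import Real


def beta_kind(beta, num_topics):
    """Classify beta as 'scalar' or 'list' (hypothetical helper, not an artm API)."""
    if isinstance(beta, Real) and not isinstance(beta, bool):
        return "scalar"
    if isinstance(beta, list):
        # The parameter description requires len(beta) == num_topics.
        if len(beta) != num_topics:
            raise ValueError("beta list must have exactly num_topics entries")
        return "list"
    raise TypeError("beta must be a float or a list of floats")


def beta_type_unchanged(old_beta, new_beta, num_topics):
    """Return True if new_beta keeps the kind (scalar or list) of old_beta."""
    return beta_kind(old_beta, num_topics) == beta_kind(new_beta, num_topics)


# A scalar replaced by another scalar keeps its kind.
print(beta_type_unchanged(0.01, 0.05, num_topics=3))
# A scalar replaced by a list changes kind, which the note advises against.
print(beta_type_unchanged(0.01, [0.1, 0.1, 0.1], num_topics=3))
```

Running such a check before assigning a new beta is one way to catch an accidental switch between the scalar and list forms early.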
fit_offline(batch_vectorizer, num_collection_passes=1)
Description:

trains the topic model in offline mode

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • num_collection_passes (int) – number of iterations over the whole given collection
fit_online(batch_vectorizer, tau0=1024.0, kappa=0.7, update_every=1)
Description:

trains the topic model in online mode

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • update_every (int) – the number of batches between model updates; the model is updated once per update_every batches
  • tau0 (float) – coefficient (see the ‘Update formulas’ paragraph)
  • kappa (float) – exponent applied to (tau0 + update_count) (see the ‘Update formulas’ paragraph)
  • update_after (list of int) – number of batches to be passed for Phi synchronizations
Update formulas:

The formulas for decay_weight and apply_weight:
  • update_count = current_processed_docs / (batch_size * update_every);
  • rho = pow(tau0 + update_count, -kappa);
  • decay_weight = 1-rho;
  • apply_weight = rho;
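The formulas above can be evaluated directly to see how the weights evolve as more documents are processed. The following is a minimal sketch, not the library's implementation: online_update_weights is a hypothetical helper name, and the use of true (floating-point) division for update_count is an assumption.

```python
def online_update_weights(current_processed_docs, batch_size,
                          update_every=1, tau0=1024.0, kappa=0.7):
    """Return (decay_weight, apply_weight) following the update formulas.

    Hypothetical helper for illustration; floating-point division is assumed.
    """
    update_count = current_processed_docs / (batch_size * update_every)
    rho = pow(tau0 + update_count, -kappa)
    decay_weight = 1.0 - rho
    apply_weight = rho
    return decay_weight, apply_weight


# The two weights always sum to 1; apply_weight shrinks as more
# documents are processed, so later updates change the model less.
for docs in (1000, 10000, 100000):
    decay, apply_ = online_update_weights(docs, batch_size=1000)
    print(docs, decay, apply_)
```

With the default tau0=1024.0, rho starts small, so even the first updates are strongly damped; a smaller tau0 or kappa gives new batches more influence.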
get_theta()
Description: gets the Theta matrix for the training set of documents
Returns:
  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the ids of documents, for which the Theta matrix was requested;
  • rows — the names of topics in topic model, that was used to create Theta;
  • data — content of Theta matrix.
get_top_tokens(num_tokens=10, with_weights=False)
Description:

returns most probable tokens for each topic

Parameters:
  • num_tokens (int) – number of top tokens to be returned
  • with_weights (bool) – if False, return only tokens; if True, return tuples (token, its p_wt)
Returns:

  • if with_weights == False, a list of lists of str, where each inner list corresponds to one topic in natural order; otherwise, a list of lists of tuples, where each tuple is (str, float)
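The two return shapes can be handled by the same consumer code. The sketch below uses made-up lists in the documented shapes rather than real model output; format_topics is a hypothetical helper, and the tokens and weights are invented for illustration.

```python
# Invented sample data in the two documented shapes (not real model output).
tokens_only = [["market", "price", "trade"],
               ["team", "match", "goal"]]
tokens_weighted = [[("market", 0.05), ("price", 0.03)],
                   [("team", 0.06), ("match", 0.04)]]


def format_topics(top_tokens):
    """Render one line per topic, accepting either documented shape."""
    lines = []
    for topic_index, entries in enumerate(top_tokens):
        parts = []
        for entry in entries:
            if isinstance(entry, tuple):
                # with_weights == True: each entry is (token, p_wt).
                token, weight = entry
                parts.append(f"{token} ({weight:.3f})")
            else:
                # with_weights == False: each entry is just the token.
                parts.append(entry)
        lines.append(f"topic {topic_index}: " + ", ".join(parts))
    return lines


for line in format_topics(tokens_only) + format_topics(tokens_weighted):
    print(line)
```

Because the inner lists follow the natural topic order, the position in the outer list can serve as the topic index.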

initialize(dictionary)
Description: initializes the topic model before learning
Parameters: dictionary (str or reference to Dictionary object) – loaded BigARTM collection dictionary
load(filename, model_name='p_wt')
Description:

loads from disk the topic model saved by LDA.save()

Parameters:
  • filename (str) – the name of file containing model
  • model_name (str) – the name of the matrix to be loaded, ‘p_wt’ or ‘n_wt’
Note:
  • After loading, we strongly recommend resetting all important parameters of the LDA model that were used earlier.
remove_theta()
Description: removes the cached Theta matrix
save(filename, model_name='p_wt')
Description:

saves one Phi-like matrix to disk

Parameters:
  • filename (str) – the name of the file in which to store the model
  • model_name (str) – the name of matrix to be saved, ‘p_wt’ or ‘n_wt’
transform(batch_vectorizer, theta_matrix_type='dense_theta')
Description:

finds the Theta matrix for new documents

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • theta_matrix_type (str) – type of matrix to be returned; possible values: ‘dense_theta’ or None; default: ‘dense_theta’
Returns:

  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the ids of documents, for which the Theta matrix was requested;
  • rows — the names of topics in topic model, that was used to create Theta;
  • data — content of Theta matrix.