ARTM model

This page describes the ARTM class.

class artm.ARTM(num_topics=None, topic_names=None, num_processors=None, class_ids=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1)
__init__(num_topics=None, topic_names=None, num_processors=None, class_ids=None, scores=None, regularizers=None, num_document_passes=10, reuse_theta=False, dictionary=None, cache_theta=False, theta_columns_naming='id', seed=-1)
Parameters:
  • num_topics (int) – the number of topics in the model; overwritten if topic_names is set
  • num_processors (int) – number of threads used for model training; if not specified, the library detects the number of threads
  • topic_names (list of str) – names of topics in model
  • class_ids (dict) – class_ids and their weights to be used in the model (key: class_id, value: weight); if not specified, all class_ids are used
  • cache_theta (bool) – whether to save the Theta matrix in the model. Required if you intend to call ARTM.get_theta()
  • scores (list) – list of scores (objects of artm.*Score classes)
  • regularizers (list) – list with regularizers (objects of artm.*Regularizer classes)
  • num_document_passes (int) – number of inner iterations over each document
  • dictionary (str or reference to Dictionary object) – dictionary to be used for initialization, if None nothing will be done
  • reuse_theta (bool) – reuse Theta from previous iteration or not
  • theta_columns_naming (str) – either ‘id’ or ‘title’, determines how to name columns (documents) in theta dataframe
  • seed (unsigned int or -1) – seed for random initialization, -1 means no seed
Important public fields:
 
  • regularizers: a dict of the regularizers included in the model
  • scores: a dict of the scores included in the model
  • score_tracker: a dict of scoring results (key: score name, value: ScoreTracker object holding the list of score values at each synchronization, e.g. each collection pass)
Note:
  • Here and elsewhere in BigARTM, an empty topic_names or class_ids means that the model (or regularizer, or score) should use all topics or class_ids.
  • If some fields of regularizers or scores are not defined by the user, internal library defaults are used.
  • If the field ‘topic_names’ is None, the names are generated by BigARTM and are available via ARTM.topic_names.
dispose()
Description:

frees all native memory allocated for this model

Note:
  • This method does not free memory occupied by dictionaries, because dictionaries are shared across all models
  • The ARTM class implements the __exit__ and __del__ methods, which automatically call dispose.
fit_offline(batch_vectorizer=None, num_collection_passes=1)
Description:

trains the topic model in offline mode

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • num_collection_passes (int) – number of iterations over the whole given collection
fit_online(batch_vectorizer=None, tau0=1024.0, kappa=0.7, update_every=1, apply_weight=None, decay_weight=None, update_after=None, async=False)
Description:

trains the topic model in online mode

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • update_every (int) – the number of batches between model updates
  • tau0 (float) – coefficient (see ‘Update formulas’ paragraph)
  • kappa (float) – exponent applied to (tau0 + update_count) (see ‘Update formulas’ paragraph)
  • update_after (list of int) – number of batches to be passed for Phi synchronizations
  • apply_weight (list of float) – weight of applying new counters
  • decay_weight (list of float) – weight of applying old counters
  • async (bool) – whether to use the asynchronous implementation of the EM algorithm
Note:

With async=True, scores cannot be extracted via score_tracker. Use get_score() instead.

Update formulas:
 
  • The formulas for decay_weight and apply_weight:
  • update_count = current_processed_docs / (batch_size * update_every);
  • rho = pow(tau0 + update_count, -kappa);
  • decay_weight = 1-rho;
  • apply_weight = rho;
  • if apply_weight, decay_weight and update_after are set, they will be used; otherwise the formulas above are used (with update_every, tau0 and kappa)
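
The update formulas above can be sketched in plain Python as follows. This is only an illustration of the arithmetic, not the library's code: the function name online_weights and the argument processed_docs are hypothetical, and the division is taken exactly as written in the formula (the source does not say whether the library rounds update_count).

```python
def online_weights(processed_docs, batch_size, update_every=1,
                   tau0=1024.0, kappa=0.7):
    """Return (decay_weight, apply_weight) following the update formulas.

    Hypothetical helper for illustration; not part of the artm package.
    """
    update_count = processed_docs / (batch_size * update_every)
    rho = (tau0 + update_count) ** (-kappa)
    return 1.0 - rho, rho


# Weights after 0, 10 and 100 batches of 1000 documents each.
schedule = [online_weights(n * 1000, batch_size=1000) for n in (0, 10, 100)]
```

With the default tau0 and kappa, the first update uses rho = 1024^(-0.7) = 2^(-7), about 0.0078, and apply_weight shrinks as more documents are processed, so later batches change the model less.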
get_phi(topic_names=None, class_ids=None, model_name=None)
Description:

gets a custom part of the model's Phi matrix. To extract the whole Phi matrix, use ARTM.phi_.

Parameters:
  • topic_names (list of str) – list with topics to extract, None value means all topics
  • class_ids (list of str) – list with class ids to extract, None means all class ids
  • model_name (str) – self.model_pwt by default; self.model_nwt can be used to extract unnormalized counters
Returns:

  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the names of topics in topic model;
  • rows — the tokens of topic model;
  • data — content of Phi matrix.

get_phi_sparse(topic_names=None, class_ids=None, model_name=None, eps=None)
Description:

get phi matrix in sparse format

Parameters:
  • topic_names (list of str) – list with topics to extract, None value means all topics
  • class_ids (list of str) – list with class ids to extract, None means all class ids
  • model_name (str) – self.model_pwt by default; self.model_nwt can be used to extract unnormalized counters
  • eps (float) – threshold to consider values as zero
Returns:

  • a 3-tuple of (data, rows, columns), where
  • data — scipy.sparse.csr_matrix with values
  • columns — the names of topics in topic model;
  • rows — the tokens of topic model;

get_score(score_name)
Description: get score after fit_offline, fit_online or transform
Parameters: score_name (str) – the name of the score to return
get_theta(topic_names=None)
Description: get Theta matrix for training set of documents (or cached after transform)
Parameters: topic_names (list of str) – list with topics to extract, None means all topics
Returns:
  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the ids of documents, for which the Theta matrix was requested;
  • rows — the names of topics in topic model, that was used to create Theta;
  • data — content of Theta matrix.
get_theta_sparse(topic_names=None, eps=None)
Description:

get Theta matrix in sparse format

Parameters:
  • topic_names (list of str) – list with topics to extract, None means all topics
  • eps (float) – threshold to consider values as zero
Returns:

  • a 3-tuple of (data, rows, columns), where
  • data — scipy.sparse.csr_matrix with values
  • columns — the ids of documents;
  • rows — the names of topics in topic model;

info
Description: returns internal diagnostics information about the model
initialize(dictionary=None)
Description: initialize topic model before learning
Parameters: dictionary (str or reference to Dictionary object) – loaded BigARTM collection dictionary
library_version
Description: the version of BigARTM library in a MAJOR.MINOR.PATCH format
load(filename, model_name='p_wt')
Description:

loads from disk the topic model saved by ARTM.save()

Parameters:
  • filename (str) – the name of file containing model
  • model_name (str) – the name of matrix to be loaded, ‘p_wt’ or ‘n_wt’
Note:
  • The loaded model will overwrite the ARTM.topic_names and class_ids fields.
  • All class_ids weights will be set to 1.0; specify them by hand if necessary.
  • The method call will empty ARTM.score_tracker.
  • All regularizers and scores will be forgotten.
  • Other settings may be affected as well, so we strongly recommend setting again all important parameters of the ARTM model that were used earlier.
remove_theta()
Description: removes cached theta matrix
reshape_topics(topic_names)
Description: update topic names of the model.

Adds, removes, and reorders columns of phi matrices according to the new set of topic names. New topics are initialized with zeros.
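
The described behaviour can be illustrated with a small stand-alone sketch. It does not use the library: the Phi matrix is modelled here as a plain dict from topic name to a column of per-token values, and reshape_columns is a hypothetical helper that only mirrors the add, remove, reorder and zero-fill semantics stated above.

```python
def reshape_columns(phi, new_topic_names):
    """Return a new column mapping ordered as new_topic_names.

    phi maps topic name -> list of per-token values (one column per topic).
    Topics missing from phi get a zero column; topics not listed are dropped.
    """
    num_tokens = len(next(iter(phi.values()))) if phi else 0
    return {
        name: list(phi[name]) if name in phi else [0.0] * num_tokens
        for name in new_topic_names
    }


phi = {"topic0": [0.5, 0.5], "topic1": [0.9, 0.1]}
# Drop topic0, keep topic1, add a new zero-initialized topic2.
reshaped = reshape_columns(phi, ["topic1", "topic2"])
```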

save(filename, model_name='p_wt')
Description:

saves one Phi-like matrix to disk

Parameters:
  • filename (str) – the name of file to store model
  • model_name (str) – the name of matrix to be saved, ‘p_wt’ or ‘n_wt’
topic_names
Description:

Gets or sets the list of topic names of the model.

Note:
  • Setting topic names allows you to put new labels on the existing topics. To add, remove or reorder topics, use the ARTM.reshape_topics() method.
  • In ARTM, topic names are used just as string identifiers, which give a unique name to each column of the phi matrix. Typically you want to set topic names as something like “topic0”, “topic1”, etc. Later operations like get_phi() allow you to specify which topics you need to retrieve. Most regularizers allow you to limit the set of topics they act upon. If you configure a rich set of regularizers, it is important to design your topic names according to how they are regularized. For example, you may use names obj0, obj1, ..., objN for objective topics (those where you enable sparsity regularizers), and back0, back1, ..., backM for background topics (those where you enable smoothing regularizers).
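
The naming scheme from the note above can be generated with a few lines of plain Python. The topic counts are hypothetical; the resulting lists are ordinary lists of str, which is the type the topic_names parameters documented on this page expect.

```python
num_objective, num_background = 3, 2  # hypothetical sizes, chosen for illustration

objective_topics = ["obj{}".format(i) for i in range(num_objective)]
background_topics = ["back{}".format(i) for i in range(num_background)]

# Full list for the model, objective topics first.
topic_names = objective_topics + background_topics

# A consistent prefix makes it easy to select one group later.
selected = [name for name in topic_names if name.startswith("obj")]
```

A list such as objective_topics can then be passed wherever a topic_names argument is accepted, for example get_phi(topic_names=objective_topics).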
transform(batch_vectorizer=None, theta_matrix_type='dense_theta', predict_class_id=None)
Description:

find Theta matrix for new documents

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • theta_matrix_type (str) – type of matrix to be returned, possible values: ‘dense_theta’, ‘dense_ptdw’, ‘cache’, None, default=’dense_theta’
  • predict_class_id (str) – class_id of a target modality to predict. When this option is enabled the resulting columns of theta matrix will correspond to unique labels of a target modality. The values will represent p(c|d), the probability of class label c for document d.
Returns:

  • pandas.DataFrame: (data, columns, rows), where:
  • columns — the ids of documents, for which the Theta matrix was requested;
  • rows — the names of topics in topic model, that was used to create Theta;
  • data — content of Theta matrix.
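
To make the meaning of p(c|d) under predict_class_id concrete, here is a minimal sketch of how such a value can be obtained from topic-level quantities. It assumes the usual topic-model decomposition p(c|d) = sum over t of p(c|t) * p(t|d); the source does not state how the library computes it, and the function and variable names are hypothetical.

```python
def predict_class_probabilities(p_c_given_t, p_t_given_d):
    """Combine p(c|t) and p(t|d) into p(c|d) for a single document.

    p_c_given_t maps class label -> list of p(c|t), one value per topic.
    p_t_given_d is the document's Theta column, one value per topic.
    """
    return {
        label: sum(pc * pt for pc, pt in zip(per_topic, p_t_given_d))
        for label, per_topic in p_c_given_t.items()
    }


# Toy numbers: two topics, two class labels.
p_c_given_t = {"pos": [0.9, 0.2], "neg": [0.1, 0.8]}
p_t_given_d = [0.5, 0.5]
scores = predict_class_probabilities(p_c_given_t, p_t_given_d)
```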

Note:
  • ‘dense_ptdw’ mode provides simple access to values of p(t|w,d). The resulting pandas.DataFrame object will contain a flat (two-dimensional, not 3D) theta matrix where each item has multiple columns, as many as the number of tokens in that document. These columns will have the same item_id. The order of columns with equal item_id is the same as the order of tokens in the input data (batch.item.token_id).
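
Because columns that belong to one document share the same item_id, a flat ‘dense_ptdw’ result usually has to be regrouped per document. The sketch below works on a plain list of column labels rather than on a real DataFrame; group_ptdw_columns and the sample labels are hypothetical.

```python
from collections import OrderedDict


def group_ptdw_columns(column_ids):
    """Map each item_id to the positions of its columns, kept in token order."""
    groups = OrderedDict()
    for position, item_id in enumerate(column_ids):
        groups.setdefault(item_id, []).append(position)
    return groups


# Hypothetical column labels: document 7 has three tokens, document 9 has two.
columns = [7, 7, 7, 9, 9]
groups = group_ptdw_columns(columns)
```

The positions collected for each item_id can then be used to slice out that document's block of p(t|w,d) columns.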
transform_sparse(batch_vectorizer, eps=None)
Description:

find Theta matrix for new documents as sparse scipy matrix

Parameters:
  • batch_vectorizer (object_reference) – an instance of BatchVectorizer class
  • eps (float) – threshold to consider values as zero
Returns:

  • a 3-tuple of (data, rows, columns), where
  • data — scipy.sparse.csr_matrix with values
  • columns — the ids of documents;
  • rows — the names of topics in topic model;