Master Component

This page describes the MasterComponent class.

class artm.MasterComponent(library=None, topic_names=None, class_ids=None, transaction_typenames=None, scores=None, regularizers=None, num_processors=None, pwt_name=None, nwt_name=None, num_document_passes=None, reuse_theta=None, cache_theta=False, parent_model_id=None, parent_model_weight=None, config=None, master_id=None)
__init__(library=None, topic_names=None, class_ids=None, transaction_typenames=None, scores=None, regularizers=None, num_processors=None, pwt_name=None, nwt_name=None, num_document_passes=None, reuse_theta=None, cache_theta=False, parent_model_id=None, parent_model_weight=None, config=None, master_id=None)
Parameters:
  • library – an instance of LibArtm
  • topic_names (list of str) – list of topic names to use in model
  • class_ids (dict) – key - class_id, value - class_weight
  • transaction_typenames (dict) – key - transaction_typename, value - transaction_weight; specify class_ids when using custom transaction_typenames
  • scores (dict) – key - score name, value - config
  • regularizers (dict) – key - regularizer name, value - tuple (config, tau) or triple (config, tau, gamma)
  • num_processors (int) – number of worker threads to use for processing the collection
  • pwt_name (str) – name of pwt matrix
  • nwt_name (str) – name of nwt matrix
  • num_document_passes (int) – number of passes through each document
  • reuse_theta (bool) – whether to reuse Theta from the previous iteration
  • cache_theta (bool) – whether to save the Theta matrix
  • parent_model_id (int) – master_id of parent model (previous level of hierarchy)
  • parent_model_weight (float) – weight of parent model (plays a role in fit_offline; defines how much to respect the parent model compared to batches)
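The two accepted shapes of a regularizers value, (config, tau) and (config, tau, gamma), can be handled uniformly. The following is a minimal sketch rather than library code: unpack_regularizer_spec is a hypothetical helper, the dictionary keys are made-up names, and the string configs stand in for real regularizer config objects.

```python
def unpack_regularizer_spec(spec):
    """Split a regularizer spec into (config, tau, gamma).

    Accepts the two shapes described for the `regularizers` argument:
    a pair (config, tau) or a triple (config, tau, gamma).
    gamma is returned as None when the pair form is used.
    """
    if len(spec) == 2:
        config, tau = spec
        return config, tau, None
    if len(spec) == 3:
        config, tau, gamma = spec
        return config, tau, gamma
    raise ValueError("expected (config, tau) or (config, tau, gamma)")


# Placeholder configs and names; real code would pass config objects.
regularizers = {
    "sparse_phi": ("<config A>", -0.1),
    "decorrelator": ("<config B>", 1e5, 0.5),
}
unpacked = {
    name: unpack_regularizer_spec(spec)
    for name, spec in regularizers.items()
}
```

A dictionary in this form would be passed as the regularizers argument; the helper only illustrates how the pair and triple forms differ.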
attach_model(model)
Parameters: model (str) – name of matrix in BigARTM
Returns:
  • messages.TopicModel() object with info about Phi matrix
  • numpy.ndarray with Phi data (i.e., p(w|t) values)
clear_score_array_cache()

Clears all entries from the score array cache.

clear_score_cache()

Clears all entries from the score cache.

clear_theta_cache()

Clears all entries from the theta matrix cache.

create_dictionary(dictionary_data, dictionary_name=None)
Parameters:
  • dictionary_data – an instance of DictionaryData with info about dictionary
  • dictionary_name (str) – name of exported dictionary
create_regularizer(name, config, tau, gamma=None)
Parameters:
  • name (str) – the name of the future regularizer
  • config – the config of the future regularizer
  • tau (float) – the coefficient of the regularization
create_score(name, config, model_name=None)
Parameters:
  • name (str) – the name of the future score
  • config – an instance of ***ScoreConfig
  • model_name – pwt or nwt model name
export_dictionary(filename, dictionary_name)
Parameters:
  • filename (str) – full name of dictionary file
  • dictionary_name (str) – name of exported dictionary
export_model(model, filename)
Parameters:
  • model (str) – name of matrix in BigARTM
  • filename (str) – name of the file to save the model to, in binary format
export_score_tracker(filename)
Parameters: filename (str) – name of the file to save the score tracker to, in binary format
filter_dictionary(dictionary_name=None, dictionary_target_name=None, class_id=None, min_df=None, max_df=None, min_df_rate=None, max_df_rate=None, min_tf=None, max_tf=None, max_dictionary_size=None, recalculate_value=None, args=None)
Parameters:
  • dictionary_name (str) – name of the dictionary in the core to filter
  • dictionary_target_name (str) – name for the new filtered dictionary in the core
  • class_id (str) – class_id to filter
  • min_df (float) – min df value to pass the filter
  • max_df (float) – max df value to pass the filter
  • min_df_rate (float) – min df rate to pass the filter
  • max_df_rate (float) – max df rate to pass the filter
  • min_tf (float) – min tf value to pass the filter
  • max_tf (float) – max tf value to pass the filter
  • max_dictionary_size (int) – limits the dictionary size; rare tokens are excluded until the dictionary reaches the given size
  • recalculate_value (bool) – whether to recalculate the value field in the dictionary after filtering, according to the new sum of tf values
  • args – an instance of FilterDictionaryArgs
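A call to filter_dictionary might look as follows. This is a sketch under assumptions: master is taken to be an already constructed artm.MasterComponent, filter_by_document_frequency is a hypothetical wrapper, and only keyword arguments from the signature above are used.

```python
def filter_by_document_frequency(master, source_name, target_name,
                                 min_df=None, max_df_rate=None):
    """Create a filtered copy of a dictionary held in the core.

    `master` is assumed to be an existing artm.MasterComponent.
    Thresholds left as None are passed through unchanged, matching the
    defaults in the filter_dictionary signature.
    """
    master.filter_dictionary(
        dictionary_name=source_name,
        dictionary_target_name=target_name,
        min_df=min_df,
        max_df_rate=max_df_rate,
    )
    # Return the name under which the filtered dictionary was requested.
    return target_name
```

Writing the result to a separate dictionary_target_name keeps the source dictionary available for other filters.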
fit_offline(batch_filenames=None, batch_weights=None, num_collection_passes=None, batches_folder=None, reset_nwt=True)
Parameters:
  • batch_filenames (list of str) – names of batches to process
  • batch_weights (list of float) – weights of batches to process
  • num_collection_passes (int) – number of outer iterations
  • batches_folder (str) – folder containing batches to process
  • reset_nwt (bool) – a flag indicating whether to reset n_wt matrix to 0.
fit_online(batch_filenames=None, batch_weights=None, update_after=None, apply_weight=None, decay_weight=None, asynchronous=None)
Parameters:
  • batch_filenames (list of str) – names of batches to process
  • batch_weights (list of float) – weights of batches to process
  • update_after (list of int) – number of batches to be passed for Phi synchronizations
  • apply_weight (list of float) – weight of applying new counters (len == len of update_after)
  • decay_weight (list of float) – weight of applying old counters (len == len of update_after)
  • asynchronous (bool) – whether to use the asynchronous implementation of the EM-algorithm or not
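The three list arguments update_after, apply_weight and decay_weight must have equal length. One way to build them is sketched below. The sketch rests on assumptions not stated on this page: update_after is treated as cumulative batch counts, and the weights follow a decaying schedule rho = (tau0 + update_count) ** (-kappa) with decay_weight = 1 - rho. The names online_schedule, update_every, tau0 and kappa are hypothetical.

```python
def online_schedule(num_batches, update_every, tau0=1024.0, kappa=0.7):
    """Build update_after, apply_weight and decay_weight of equal length.

    Assumptions (not taken from this page): update points are cumulative
    batch counts, and the weight of new counters shrinks with each update.
    """
    if update_every <= 0:
        raise ValueError("update_every must be positive")

    update_after, apply_weight, decay_weight = [], [], []
    processed = 0
    update_count = 0
    while processed < num_batches:
        # Advance to the next synchronization point, capped at the end.
        processed = min(processed + update_every, num_batches)
        update_count += 1
        rho = (tau0 + update_count) ** (-kappa)
        update_after.append(processed)
        apply_weight.append(rho)        # weight of the new counters
        decay_weight.append(1.0 - rho)  # weight of the old counters
    return update_after, apply_weight, decay_weight


update_after, apply_weight, decay_weight = online_schedule(10, 4)
```

The resulting lists could then be passed to fit_online together with batch_filenames.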
gather_dictionary(dictionary_target_name=None, data_path=None, cooc_file_path=None, vocab_file_path=None, symmetric_cooc_values=None, args=None)
Parameters:
  • dictionary_target_name (str) – name of the dictionary in the core
  • data_path (str) – full path to batches folder
  • cooc_file_path (str) – full path to the file with cooc info
  • vocab_file_path (str) – full path to the file with vocabulary
  • symmetric_cooc_values (bool) – whether the cooc matrix should be considered symmetric
  • args – an instance of GatherDictionaryArgs
get_dictionary(dictionary_name)
Parameters: dictionary_name (str) – name of dictionary to get
get_info()
get_phi_info(model)
Parameters: model (str) – name of matrix in BigARTM
Returns: messages.TopicModel object
get_phi_matrix(model, topic_names=None, class_ids=None, use_sparse_format=None)
Parameters:
  • model (str) – name of matrix in BigARTM
  • topic_names (list of str or None) – list of topics to retrieve (None means all topics)
  • class_ids (list of str or None) – list of class ids to retrieve (None means all class ids)
  • use_sparse_format (bool) – use sparse/dense layout
Returns: numpy.ndarray with Phi data (i.e., p(w|t) values)

get_score(score_name)
Parameters:
  • score_name (str) – the user defined name of score to retrieve
  • score_config – reference to score data object
get_score_array(score_name)
Parameters:
  • score_name (str) – the user defined name of score to retrieve
  • score_config – reference to score data object
get_theta_info()
Returns: messages.ThetaMatrix object
get_theta_matrix(topic_names=None)
Parameters: topic_names (list of str or None) – list of topics to retrieve (None means all topics)
Returns: numpy.ndarray with Theta data (i.e., p(t|d) values)
import_batches(batches=None)
Parameters: batches (list) – list of BigARTM batches loaded into RAM
import_dictionary(filename, dictionary_name)
Parameters:
  • filename (str) – full name of dictionary file
  • dictionary_name (str) – name of imported dictionary
import_model(model, filename)
Parameters:
  • model (str) – name of matrix in BigARTM
  • filename (str) – name of the binary file to load the model from
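export_model and import_model take the same two arguments, so a matrix can be moved between two master components through a file. A minimal sketch, assuming both objects are existing artm.MasterComponent instances; copy_model_via_file is a hypothetical helper, and whether the target needs any prior setup is not covered here.

```python
def copy_model_via_file(source_master, target_master, model, filename):
    """Write a matrix from one master component and load it into another.

    Both masters are assumed to be artm.MasterComponent instances.
    `model` is the matrix name in BigARTM; `filename` is the binary file
    used as the intermediate storage.
    """
    source_master.export_model(model, filename)
    target_master.import_model(model, filename)
```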
import_score_tracker(filename)
Parameters: filename (str) – name of the binary file to load the score tracker from
initialize_model(model_name=None, topic_names=None, dictionary_name=None, seed=None, args=None)
Parameters:
  • model_name (str) – name of pwt matrix in BigARTM
  • topic_names (list of str) – the list of names of topics to be used in model
  • dictionary_name (str) – name of imported dictionary
  • seed (unsigned int or -1, default None) – seed for random initialization, None means no seed
  • args – an instance of InitializeModelArgs
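The methods gather_dictionary, initialize_model and fit_offline can be chained into a basic training run. The sketch below assumes that master is an existing artm.MasterComponent, that model_name matches the pwt matrix name the master was configured with, and that this ordering of steps is appropriate; train_from_batches and its parameter names are hypothetical.

```python
def train_from_batches(master, batches_folder, topic_names, model_name,
                       dictionary_name, num_collection_passes):
    """Gather a dictionary, initialize a model from it, then fit offline.

    `master` is assumed to be an existing artm.MasterComponent.
    Only keyword arguments from the documented signatures are used.
    """
    # Step 1: build a dictionary in the core from the batches folder.
    master.gather_dictionary(
        dictionary_target_name=dictionary_name,
        data_path=batches_folder,
    )
    # Step 2: create the initial pwt matrix for the requested topics.
    master.initialize_model(
        model_name=model_name,
        topic_names=topic_names,
        dictionary_name=dictionary_name,
    )
    # Step 3: run the requested number of passes over the collection.
    master.fit_offline(
        batches_folder=batches_folder,
        num_collection_passes=num_collection_passes,
    )
```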
merge_model(models, nwt, topic_names=None, dictionary_name=None)

Merge multiple nwt-increments together.

Parameters:
  • models (dict) – list of models with nwt-increments and their weights, key - nwt_source_name, value - source_weight.
  • nwt (str) – the name of target matrix to store combined nwt. The matrix will be created by this operation.
  • topic_names (list of str) – names of topics in the resulting model. By default topic names are taken from the first model in the list.
  • dictionary_name – name of dictionary that defines which tokens to include in merged model
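Conceptually, merging nwt-increments with weights can be pictured as a weighted sum of count matrices. The pure-Python sketch below illustrates that idea only; it is an assumption about the arithmetic, not the library's implementation, and merge_counts with its nested-dict matrix layout is hypothetical.

```python
def merge_counts(sources, weights):
    """Weighted sum of token-by-topic count matrices.

    sources: matrix name -> {token: [count per topic]}
    weights: matrix name -> weight (mirrors the `models` dict shape)
    Tokens missing from a source contribute nothing for that source.
    """
    merged = {}
    for name, weight in weights.items():
        for token, row in sources[name].items():
            target = merged.setdefault(token, [0.0] * len(row))
            for topic_index, value in enumerate(row):
                target[topic_index] += weight * value
    return merged


sources = {
    "nwt_a": {"x": [1.0, 2.0]},
    "nwt_b": {"x": [3.0, 4.0], "y": [1.0, 1.0]},
}
merged = merge_counts(sources, {"nwt_a": 1.0, "nwt_b": 0.5})
```

The weights argument has the same key-to-weight shape as the models parameter of merge_model.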
normalize_model(pwt, nwt, rwt=None)
Parameters:
  • pwt (str) – name of pwt matrix in BigARTM
  • nwt (str) – name of nwt matrix in BigARTM
  • rwt (str) – name of rwt matrix in BigARTM
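The relation between the three matrices can be illustrated in pure Python. The sketch assumes the usual ARTM M-step, in which p_wt is obtained by clipping n_wt + r_wt at zero and normalizing each topic column over tokens; that formula is an assumption about what normalize_model computes, and normalize_columns is a hypothetical helper working on nested lists.

```python
def normalize_columns(nwt, rwt=None):
    """Return p_wt from n_wt (and optional r_wt), both tokens x topics.

    Each entry is max(n_wt + r_wt, 0); every topic column is then scaled
    to sum to 1. A column whose clipped sum is 0 stays all zeros.
    """
    if not nwt:
        return []
    num_topics = len(nwt[0])
    raw = []
    for w, row in enumerate(nwt):
        raw.append([
            max(row[t] + (rwt[w][t] if rwt is not None else 0.0), 0.0)
            for t in range(num_topics)
        ])
    totals = [sum(row[t] for row in raw) for t in range(num_topics)]
    return [
        [row[t] / totals[t] if totals[t] > 0 else 0.0
         for t in range(num_topics)]
        for row in raw
    ]


pwt = normalize_columns([[1.0, 0.0], [3.0, 2.0]])
```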
process_batches(pwt, nwt=None, num_document_passes=None, batches_folder=None, batches=None, regularizer_name=None, regularizer_tau=None, class_ids=None, class_weights=None, find_theta=False, transaction_typenames=None, transaction_weights=None, reuse_theta=False, find_ptdw=False, predict_class_id=None, predict_transaction_type=None)
Parameters:
  • pwt (str) – name of pwt matrix in BigARTM
  • nwt (str) – name of nwt matrix in BigARTM
  • num_document_passes (int) – number of inner iterations during processing
  • batches_folder (str) – full path to data folder (alternative 1)
  • batches (list of str) – full file names of batches to process (alternative 2)
  • regularizer_name (list of str) – list of names of Theta regularizers to use
  • regularizer_tau (list of float) – list of tau coefficients for Theta regularizers
  • class_ids (list of str) – list of class ids to use during processing.
  • class_weights (list of float) – list of corresponding weights of class ids.
  • transaction_typenames (list of str) – list of transaction types to use during processing.
  • transaction_weights (list of float) – list of corresponding weights of transaction types.
  • find_theta (bool) – find theta matrix for ‘batches’ (if alternative 2)
  • reuse_theta (bool) – initialize by theta from previous collection pass
  • find_ptdw (bool) – calculate and return Ptdw matrix or not (works if find_theta == False)
  • predict_class_id (str, default None) – class_id of a target modality to predict
Returns:

  • tuple (messages.ThetaMatrix, numpy.ndarray) – the info about Theta (if find_theta == True)
  • messages.ThetaMatrix – the info about Theta (if find_theta == False)

reconfigure(topic_names=None, class_ids=None, transaction_typenames=None, scores=None, regularizers=None, num_processors=None, pwt_name=None, nwt_name=None, num_document_passes=None, reuse_theta=None, cache_theta=None, parent_model_id=None, parent_model_weight=None)
reconfigure_regularizer(name, config=None, tau=None, gamma=None)
reconfigure_score(name, config, model_name=None)
reconfigure_topic_name(topic_names=None)
regularize_model(pwt, nwt, rwt, regularizer_name, regularizer_tau, regularizer_gamma=None)
Parameters:
  • pwt (str) – name of pwt matrix in BigARTM
  • nwt (str) – name of nwt matrix in BigARTM
  • rwt (str) – name of rwt matrix in BigARTM
  • regularizer_name (list of str) – list of names of Phi regularizers to use
  • regularizer_tau (list of floats) – list of tau coefficients for Phi regularizers
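The methods process_batches, regularize_model and normalize_model share the pwt, nwt and rwt matrix names, which suggests how a single pass over the collection can be assembled by hand. The sketch below assumes that master is an existing artm.MasterComponent and that this ordering is the intended one; run_collection_pass is a hypothetical helper.

```python
def run_collection_pass(master, batches_folder, pwt, nwt, rwt,
                        regularizer_name, regularizer_tau,
                        num_document_passes=1):
    """One hand-assembled pass: process, regularize, normalize.

    `master` is assumed to be an existing artm.MasterComponent.
    pwt, nwt and rwt are matrix names in BigARTM; regularizer_name and
    regularizer_tau are parallel lists for the Phi regularizers.
    """
    # Accumulate counters into nwt using the current pwt.
    master.process_batches(
        pwt, nwt,
        num_document_passes=num_document_passes,
        batches_folder=batches_folder,
    )
    # Compute the regularization matrix rwt from pwt and nwt.
    master.regularize_model(pwt, nwt, rwt, regularizer_name, regularizer_tau)
    # Produce the updated pwt from nwt and rwt.
    master.normalize_model(pwt, nwt, rwt)
```

Calling this helper in a loop would correspond to several passes over the collection.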
remove_batch(batch_id=None)
Parameters: batch_id (unicode) – id of batch, loaded in RAM
transform(batches=None, batch_filenames=None, theta_matrix_type=None, predict_class_id=None)
Parameters:
  • batches – list of Batch instances
  • batch_weights (list of float) – weights of batches to transform
  • theta_matrix_type (int) – type of matrix to be returned
  • predict_class_id (str, default None) – class_id of a target modality to predict
Returns: messages.ThetaMatrix object