Master Component

This page describes the MasterComponent class.

class artm.MasterComponent(library=None, topic_names=None, class_ids=None, transaction_typenames=None, scores=None, regularizers=None, num_processors=None, pwt_name=None, nwt_name=None, num_document_passes=None, reuse_theta=None, cache_theta=False, parent_model_id=None, parent_model_weight=None, config=None, master_id=None)

__init__(library=None, topic_names=None, class_ids=None, transaction_typenames=None, scores=None, regularizers=None, num_processors=None, pwt_name=None, nwt_name=None, num_document_passes=None, reuse_theta=None, cache_theta=False, parent_model_id=None, parent_model_weight=None, config=None, master_id=None)

Parameters:

library – an instance of LibArtm
topic_names (list of str) – list of topic names to use in model
class_ids (dict) – key - class_id, value - class_weight
transaction_typenames (dict) – key - transaction_typename, value - transaction_weight, specify class_ids when using custom transaction_typenames
scores (dict) – key - score name, value - config
regularizers (dict) – key - regularizer name, value - tuple (config, tau) or triple (config, tau, gamma)
num_processors (int) – number of worker threads to use for processing the collection
pwt_name (str) – name of pwt matrix
nwt_name (str) – name of nwt matrix
num_document_passes (int) – number of passes through each document
reuse_theta (bool) – reuse Theta from previous iteration or not
cache_theta (bool) – save or not the Theta matrix
parent_model_id (int) – master_id of parent model (previous level of hierarchy)
parent_model_weight (float) – weight of parent model (plays role in fit_offline; defines how much to respect parent model as compared to batches)
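
As a rough illustration of the argument shapes listed above, the sketch below builds the dictionaries in plain Python without importing the library. The helper name split_regularizer_value, the class ids, the weights, and the string placeholders standing in for real config objects are all hypothetical; in real use the configs would be the library's own config objects.

```python
def split_regularizer_value(value):
    """Split a regularizers-dict value into (config, tau, gamma).

    Hypothetical helper: accepts the tuple (config, tau) or the
    triple (config, tau, gamma); gamma falls back to None.
    """
    if len(value) == 2:
        config, tau = value
        return config, tau, None
    if len(value) == 3:
        config, tau, gamma = value
        return config, tau, gamma
    raise ValueError("expected (config, tau) or (config, tau, gamma)")


# class_ids: key is the class_id, value is its weight (values are made up).
class_ids = {"@default_class": 1.0, "@labels": 5.0}

# regularizers: key is the regularizer name; strings stand in for configs.
regularizers = {
    "sparse_phi": ("<config placeholder>", -0.1),
    "decorrelate": ("<config placeholder>", 1000.0, 0.5),
}

# Keyword arguments shaped for the constructor documented above.
master_kwargs = {
    "topic_names": ["topic_{}".format(i) for i in range(3)],
    "class_ids": class_ids,
    "regularizers": regularizers,
    "num_document_passes": 10,
    "cache_theta": True,
}

parsed = {name: split_regularizer_value(v) for name, v in regularizers.items()}
```

With the library installed, a dictionary like master_kwargs could be unpacked into the constructor call; that step is left out here so the sketch runs on its own.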

attach_model(model)

Parameters:	model (str) – name of matrix in BigARTM
Returns:

messages.TopicModel() object with info about Phi matrix
numpy.ndarray with Phi data (i.e., p(w|t) values)

clear_score_array_cache(): Clears all entries from the score array cache.

clear_score_cache(): Clears all entries from the score cache.

clear_theta_cache(): Clears all entries from the theta matrix cache.

create_dictionary(dictionary_data, dictionary_name=None)

Parameters:

dictionary_data – an instance of DictionaryData with info about dictionary
dictionary_name (str) – name of exported dictionary

create_regularizer(name, config, tau, gamma=None)

Parameters:

name (str) – the name of the future regularizer
config – the config of the future regularizer
tau (float) – the coefficient of the regularization

create_score(name, config, model_name=None)

Parameters:

name (str) – the name of the future score
config – an instance of one of the *ScoreConfig classes
model_name – pwt or nwt model name

export_dictionary(filename, dictionary_name)

Parameters:

filename (str) – full name of dictionary file
dictionary_name (str) – name of exported dictionary

export_model(model, filename)

Parameters:

model (str) – name of matrix in BigARTM
filename (str) – the name of the file to save the model to, in binary format

export_score_tracker(filename)

Parameters:	filename (str) – the name of file to save score tracker into binary format

filter_dictionary(dictionary_name=None, dictionary_target_name=None, class_id=None, min_df=None, max_df=None, min_df_rate=None, max_df_rate=None, min_tf=None, max_tf=None, max_dictionary_size=None, recalculate_value=None, args=None)

Parameters:

dictionary_name (str) – name of the dictionary in the core to filter
dictionary_target_name (str) – name for the new filtered dictionary in the core
class_id (str) – class_id to filter
min_df (float) – min df value to pass the filter
max_df (float) – max df value to pass the filter
min_df_rate (float) – min df rate to pass the filter
max_df_rate (float) – max df rate to pass the filter
min_tf (float) – min tf value to pass the filter
max_tf (float) – max tf value to pass the filter
max_dictionary_size (int) – give an easy option to limit dictionary size; rare tokens will be excluded until dictionary reaches given size
recalculate_value (bool) – recalculate or not value field in dictionary after filtration according to new sum of tf values
args – an instance of FilterDictionaryArgs

fit_offline(batch_filenames=None, batch_weights=None, num_collection_passes=None, batches_folder=None, reset_nwt=True)

Parameters:

batch_filenames (list of str) – name of batches to process
batch_weights (list of float) – weights of batches to process
num_collection_passes (int) – number of outer iterations
batches_folder (str) – folder containing batches to process
reset_nwt (bool) – a flag indicating whether to reset n_wt matrix to 0

fit_online(batch_filenames=None, batch_weights=None, update_after=None, apply_weight=None, decay_weight=None, asynchronous=None)

Parameters:

batch_filenames (list of str) – name of batches to process
batch_weights (list of float) – weights of batches to process
update_after (list of int) – number of batches to be passed for Phi synchronizations
apply_weight (list of float) – weight of applying new counters (len == len of update_after)
decay_weight (list of float) – weight of applying old counters (len == len of update_after)
asynchronous (bool) – whether to use the asynchronous implementation of the EM-algorithm or not
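
The three list arguments must have the same length, one entry per Phi synchronization. The sketch below shows one way such a schedule could be built. It rests on assumptions this page does not state: that update_after holds cumulative batch counts, and that the weight of new counters decays as rho = (tau0 + update_index) ** (-kappa), with the old counters weighted by 1 - rho. The helper name and the default values of tau0 and kappa are illustrative only.

```python
def make_online_schedule(num_batches, update_every, tau0=1024.0, kappa=0.7):
    """Build update_after, apply_weight and decay_weight lists.

    Assumptions (not taken from this page): update_after is a cumulative
    batch count, apply_weight is rho = (tau0 + update_index) ** (-kappa),
    and decay_weight is 1 - rho.
    """
    update_after, apply_weight, decay_weight = [], [], []
    position = 0
    update_index = 0
    while position < num_batches:
        # Synchronize after every `update_every` batches; the last step
        # is clipped so that it ends exactly at the final batch.
        position = min(position + update_every, num_batches)
        rho = (tau0 + update_index) ** (-kappa)
        update_after.append(position)
        apply_weight.append(rho)
        decay_weight.append(1.0 - rho)
        update_index += 1
    return update_after, apply_weight, decay_weight


update_after, apply_weight, decay_weight = make_online_schedule(
    num_batches=10, update_every=4
)
```

The resulting lists could then be passed as the update_after, apply_weight and decay_weight arguments together with a list of ten batch file names.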

gather_dictionary(dictionary_target_name=None, data_path=None, cooc_file_path=None, vocab_file_path=None, symmetric_cooc_values=None, args=None)

Parameters:

dictionary_target_name (str) – name of the dictionary in the core
data_path (str) – full path to batches folder
cooc_file_path (str) – full path to the file with cooc info
vocab_file_path (str) – full path to the file with vocabulary
symmetric_cooc_values (bool) – whether the cooc matrix should be considered symmetric or not
args – an instance of GatherDictionaryArgs

get_dictionary(dictionary_name)

Parameters:	dictionary_name (str) – name of dictionary to get

get_info()

get_phi_info(model)

Parameters:	model (str) – name of matrix in BigARTM
Returns:	messages.TopicModel object

get_phi_matrix(model, topic_names=None, class_ids=None, use_sparse_format=None)

Parameters:

model (str) – name of matrix in BigARTM
topic_names (list of str or None) – list of topics to retrieve (None means all topics)
class_ids (list of str or None) – list of class ids to retrieve (None means all class ids)
use_sparse_format (bool) – whether to use the sparse layout instead of the dense one

Returns:	numpy.ndarray with Phi data (i.e., p(w|t) values)

get_score(score_name)

Parameters:

score_name (str) – the user defined name of score to retrieve
score_config – reference to score data object

get_score_array(score_name)

Parameters:

score_name (str) – the user defined name of score to retrieve
score_config – reference to score data object

get_theta_info()

Returns:	messages.ThetaMatrix object

get_theta_matrix(topic_names=None)

Parameters:	topic_names (list of str or None) – list of topics to retrieve (None means all topics)
Returns:	numpy.ndarray with Theta data (i.e., p(t|d) values)

import_batches(batches=None)

Parameters:	batches (list) – list of BigARTM batches loaded into RAM

import_dictionary(filename, dictionary_name)

Parameters:

filename (str) – full name of dictionary file
dictionary_name (str) – name of imported dictionary

import_model(model, filename)

Parameters:

model (str) – name of matrix in BigARTM
filename (str) – the name of the binary file to load the model from

import_score_tracker(filename)

Parameters:	filename (str) – the name of file to load score tracker from binary format

initialize_model(model_name=None, topic_names=None, dictionary_name=None, seed=None, args=None)

Parameters:

model_name (str) – name of pwt matrix in BigARTM
topic_names (list of str) – the list of names of topics to be used in model
dictionary_name (str) – name of imported dictionary
seed (unsigned int or -1, default None) – seed for random initialization, None means no seed
args – an instance of InitializeModelArgs

merge_model(models, nwt, topic_names=None, dictionary_name=None)

Merge multiple nwt-increments together.

Parameters:

models (dict) – models with nwt-increments and their weights, key - nwt_source_name, value - source_weight.
nwt (str) – the name of target matrix to store combined nwt. The matrix will be created by this operation.
topic_names (list of str) – names of topics in the resulting model. By default topic names are taken from the first model in the list.
dictionary_name – name of dictionary that defines which tokens to include in merged model
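
To make the idea of merging weighted nwt-increments concrete, here is a small plain-Python sketch. It assumes the merge is a weighted element-wise sum over matrices that share the same tokens and topics, which this page does not spell out. The function merge_increments and its input shape are hypothetical: the real method receives only matrix names and weights, because the matrices live inside the library.

```python
def merge_increments(models):
    """Weighted element-wise sum of token-by-topic count matrices.

    `models` maps a source name to (matrix, weight). This pairing is a
    stand-in for the name-to-weight dict that merge_model accepts.
    Assumes all matrices have identical shape.
    """
    merged = None
    for matrix, weight in models.values():
        scaled = [[weight * value for value in row] for row in matrix]
        if merged is None:
            merged = scaled
        else:
            merged = [
                [a + b for a, b in zip(row_a, row_b)]
                for row_a, row_b in zip(merged, scaled)
            ]
    return merged


merged = merge_increments({
    "nwt_a": ([[1.0, 2.0], [0.0, 4.0]], 1.0),
    "nwt_b": ([[2.0, 0.0], [2.0, 2.0]], 0.5),
})
```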

normalize_model(pwt, nwt, rwt=None)

Parameters:

pwt (str) – name of pwt matrix in BigARTM
nwt (str) – name of nwt matrix in BigARTM
rwt (str) – name of rwt matrix in BigARTM
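
This page does not state the arithmetic behind normalization. In the ARTM literature the update is usually written as p_wt proportional to max(n_wt + r_wt, 0), normalized over tokens within each topic. The sketch below illustrates that formula on nested lists under this assumption; the function name and the sample numbers are made up, and the library's actual computation may differ in details.

```python
def normalize_columns(nwt, rwt=None):
    """Return a p_wt-like matrix from n_wt and an optional r_wt.

    Rows are tokens, columns are topics. Assumption: each entry becomes
    max(n_wt + r_wt, 0), then every topic column is scaled to sum to 1.
    Columns whose entries are all zero stay zero.
    """
    num_tokens = len(nwt)
    num_topics = len(nwt[0])
    if rwt is None:
        rwt = [[0.0] * num_topics for _ in range(num_tokens)]
    raw = [
        [max(nwt[w][t] + rwt[w][t], 0.0) for t in range(num_topics)]
        for w in range(num_tokens)
    ]
    col_sums = [
        sum(raw[w][t] for w in range(num_tokens)) for t in range(num_topics)
    ]
    return [
        [
            raw[w][t] / col_sums[t] if col_sums[t] > 0 else 0.0
            for t in range(num_topics)
        ]
        for w in range(num_tokens)
    ]


nwt = [[2.0, 0.0], [1.0, 3.0], [1.0, 1.0]]
rwt = [[0.0, 0.0], [-2.0, 0.0], [0.0, 0.0]]
pwt = normalize_columns(nwt, rwt)
```

In this toy input the negative regularizer increment pushes the second token out of the first topic, so its probability there becomes zero while the column still sums to one.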

process_batches(pwt, nwt=None, num_document_passes=None, batches_folder=None, batches=None, regularizer_name=None, regularizer_tau=None, class_ids=None, class_weights=None, find_theta=False, transaction_typenames=None, transaction_weights=None, reuse_theta=False, find_ptdw=False, predict_class_id=None, predict_transaction_type=None)

Parameters:

pwt (str) – name of pwt matrix in BigARTM
nwt (str) – name of nwt matrix in BigARTM
num_document_passes (int) – number of inner iterations during processing
batches_folder (str) – full path to data folder (alternative 1)
batches (list of str) – full file names of batches to process (alternative 2)
regularizer_name (list of str) – list of names of Theta regularizers to use
regularizer_tau (list of float) – list of tau coefficients for Theta regularizers
class_ids (list of str) – list of class ids to use during processing.
class_weights (list of float) – list of corresponding weights of class ids.
transaction_typenames (list of str) – list of transaction types to use during processing.
transaction_weights (list of float) – list of corresponding weights of transaction types.
find_theta (bool) – find theta matrix for ‘batches’ (if alternative 2)
reuse_theta (bool) – initialize by theta from previous collection pass
find_ptdw (bool) – calculate and return Ptdw matrix or not (works if find_theta == False)
predict_class_id (str, default None) – class_id of a target modality to predict

Returns:

tuple (messages.ThetaMatrix, numpy.ndarray) — the info about Theta (if find_theta == True)
messages.ThetaMatrix — the info about Theta (if find_theta == False)
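
Because the return value is either a pair or a single object, calling code may want to normalize it before use. A minimal sketch follows; the helper name is hypothetical, and plain strings and lists stand in for the real messages.ThetaMatrix and numpy.ndarray values so that the example runs without the library.

```python
def split_process_batches_result(result):
    """Return (theta_info, theta_array) for either return shape.

    Hypothetical helper: theta_array is None when only the info
    object was returned.
    """
    if isinstance(result, tuple):
        theta_info, theta_array = result
        return theta_info, theta_array
    return result, None


# Stand-ins for the two shapes described above.
info_only = split_process_batches_result("theta-info")
info_and_data = split_process_batches_result(("theta-info", [[0.2, 0.8]]))
```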

reconfigure(topic_names=None, class_ids=None, transaction_typenames=None, scores=None, regularizers=None, num_processors=None, pwt_name=None, nwt_name=None, num_document_passes=None, reuse_theta=None, cache_theta=None, parent_model_id=None, parent_model_weight=None)

reconfigure_regularizer(name, config=None, tau=None, gamma=None)

reconfigure_score(name, config, model_name=None)

reconfigure_topic_name(topic_names=None)

regularize_model(pwt, nwt, rwt, regularizer_name, regularizer_tau, regularizer_gamma=None)

Parameters:

pwt (str) – name of pwt matrix in BigARTM
nwt (str) – name of nwt matrix in BigARTM
rwt (str) – name of rwt matrix in BigARTM
regularizer_name (list of str) – list of names of Phi regularizers to use
regularizer_tau (list of floats) – list of tau coefficients for Phi regularizers
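
The pwt, nwt and rwt arguments shared by process_batches, regularize_model and normalize_model suggest that the three calls can be chained into one pass over the collection. The sketch below shows that chaining. The ordering is an assumption drawn from the parameter names rather than something this page states, and run_one_pass, RecordingMaster, the matrix names and the regularizer name are all hypothetical. RecordingMaster only records calls so the example runs without the library; a real MasterComponent instance would be passed instead.

```python
def run_one_pass(master, batches_folder, phi_regularizers,
                 pwt="pwt", nwt="nwt", rwt="rwt", num_document_passes=10):
    """Chain the three low-level calls for one collection pass.

    `master` is any object exposing process_batches, regularize_model and
    normalize_model with the signatures listed on this page.
    `phi_regularizers` maps a regularizer name to its tau (made-up shape).
    """
    # Process the batches with the current pwt, filling the nwt matrix.
    master.process_batches(pwt, nwt,
                           num_document_passes=num_document_passes,
                           batches_folder=batches_folder)
    # Apply the Phi regularizers, storing their output in rwt.
    names = list(phi_regularizers)
    taus = [phi_regularizers[name] for name in names]
    master.regularize_model(pwt, nwt, rwt, names, taus)
    # Produce the new pwt from nwt and rwt.
    master.normalize_model(pwt, nwt, rwt)


class RecordingMaster:
    """Stand-in that records calls; not part of the library."""

    def __init__(self):
        self.calls = []

    def process_batches(self, pwt, nwt=None, **kwargs):
        self.calls.append(("process_batches", pwt, nwt, kwargs))

    def regularize_model(self, pwt, nwt, rwt, regularizer_name,
                         regularizer_tau, regularizer_gamma=None):
        self.calls.append(("regularize_model", regularizer_name, regularizer_tau))

    def normalize_model(self, pwt, nwt, rwt=None):
        self.calls.append(("normalize_model", pwt, nwt, rwt))


fake = RecordingMaster()
run_one_pass(fake, "my_batches", {"sparse_phi": -0.1})
call_order = [call[0] for call in fake.calls]
```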

remove_batch(batch_id=None)

Parameters:	batch_id (unicode) – id of batch, loaded in RAM

transform(batches=None, batch_filenames=None, theta_matrix_type=None, predict_class_id=None)

Parameters:

batches – list of Batch instances
batch_weights (list of float) – weights of batches to transform
theta_matrix_type (int) – type of matrix to be returned
predict_class_id (str, default None) – class_id of a target modality to predict

Returns:	messages.ThetaMatrix object