Batches Utils

This page describes the BatchVectorizer class.

class artm.BatchVectorizer(batches=None, collection_name=None, data_path='', data_format='batches', target_folder='', batch_size=1000, batch_name_type='code', data_weight=1.0)
__init__(batches=None, collection_name=None, data_path='', data_format='batches', target_folder='', batch_size=1000, batch_name_type='code', data_weight=1.0)
Parameters:
  • collection_name (str) – the name of the text collection (required if data_format == ‘bow_uci’)
  • data_path (str) – the input location, depending on data_format:
    1. if data_format == ‘bow_uci’ => folder containing the ‘docword.collection_name.txt’ and ‘vocab.collection_name.txt’ files
    2. if data_format == ‘vowpal_wabbit’ => file in Vowpal Wabbit format
    3. if data_format == ‘plain_text’ => file with text
    4. if data_format == ‘batches’ => folder containing batches
  • data_format (str) – the type of input data: 1) ‘bow_uci’ — Bag-Of-Words in UCI format; 2) ‘vowpal_wabbit’ — Vowpal Wabbit format; 3) ‘batches’ — the BigARTM data format
  • batch_size (int) – number of documents to be stored in each batch
  • target_folder (str) – full path to the folder where newly created batches will be stored
  • batches (list of str) – list of batch file names without the full path (in this case the necessary parameters are batches, data_path and data_format == ‘batches’)
  • batch_name_type (str) – name batches in natural order (‘code’) or using random GUIDs (‘guid’)
  • data_weight (float) – weight for a group of batches from data_path; it can also be a list of floats, in which case data_path (and target_folder, if data_format != ‘batches’) must also be lists, with one weight per path in the data_path list
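
The parameter combinations described above can be sketched in Python as follows. This is a minimal illustration, not part of the library: the helpers vectorizer_kwargs and make_vectorizer, the collection name 'kos' and every path are hypothetical, and the equal-length check on the lists is an assumption drawn from "one weight per path". Only the keyword names come from the signature above, and the final call assumes the artm package is installed and the data exists on disk.

```python
def vectorizer_kwargs(data_format, data_path, collection_name=None,
                      target_folder='', batches=None, data_weight=1.0):
    """Hypothetical helper: build keyword arguments for artm.BatchVectorizer
    and check the parameter combinations described in the list above."""
    if data_format == 'bow_uci' and not collection_name:
        raise ValueError("collection_name is required for data_format='bow_uci'")
    if batches is not None and data_format != 'batches':
        raise ValueError("a batches list requires data_format='batches'")
    if isinstance(data_weight, list):
        # One weight per path, so data_path must be a list of equal length
        # (assumed interpretation of "one weight corresponds to one path").
        if not isinstance(data_path, list) or len(data_path) != len(data_weight):
            raise ValueError("data_path must be a list with one path per weight")
        # target_folder only matters when batches still have to be generated.
        if data_format != 'batches' and (
                not isinstance(target_folder, list)
                or len(target_folder) != len(data_weight)):
            raise ValueError("target_folder must be a list with one folder per weight")
    kwargs = {'data_format': data_format, 'data_path': data_path,
              'data_weight': data_weight}
    if collection_name:
        kwargs['collection_name'] = collection_name
    if target_folder:
        kwargs['target_folder'] = target_folder
    if batches is not None:
        kwargs['batches'] = batches
    return kwargs


def make_vectorizer(kwargs):
    """Hypothetical wrapper: pass the checked arguments to the real class."""
    import artm  # imported lazily so the checks above run without BigARTM
    return artm.BatchVectorizer(**kwargs)


# Convert a UCI bag-of-words collection (made-up name and paths) into batches.
uci_kwargs = vectorizer_kwargs('bow_uci', 'data/kos', collection_name='kos',
                               target_folder='data/kos_batches')

# Reuse two existing batch folders (made-up paths) with different weights.
weighted_kwargs = vectorizer_kwargs('batches',
                                    ['batches/news', 'batches/wiki'],
                                    data_weight=[1.0, 0.5])
```

With real data in place, make_vectorizer(uci_kwargs) would construct the vectorizer; the checks run before the library is touched, so a mismatch between data_weight and data_path is reported early.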
batch_size
Returns: the user-defined size of the batches
batches_list
Returns: list of batch names
data_path
Returns: the disk path of the batches
num_batches
Returns: the number of batches
weights
Returns: list of batch weights
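
The attributes listed above can be read together for a quick sanity check. The sketch below is an illustration only: summarize is a hypothetical helper, and the stand-in object with made-up values replaces a real artm.BatchVectorizer so that the snippet runs without BigARTM installed.

```python
from types import SimpleNamespace


def summarize(batch_vectorizer):
    """Hypothetical helper: collect the documented attributes into a dict."""
    return {
        'batch_size': batch_vectorizer.batch_size,
        'num_batches': batch_vectorizer.num_batches,
        'data_path': batch_vectorizer.data_path,
        'batches_list': list(batch_vectorizer.batches_list),
        'weights': list(batch_vectorizer.weights),
    }


# Stand-in with made-up values; pass a real artm.BatchVectorizer in practice.
stand_in = SimpleNamespace(
    batch_size=1000,
    num_batches=2,
    data_path='batches/news',
    batches_list=['first.batch', 'second.batch'],
    weights=[1.0, 1.0],
)

summary = summarize(stand_in)
for name, value in summary.items():
    print(name, value)
```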