Batches Utils
This page describes the BatchVectorizer class.
class artm.BatchVectorizer(batches=None, collection_name=None, data_path='', data_format='batches', target_folder='', batch_size=1000, batch_name_type='code', data_weight=1.0)
__init__(batches=None, collection_name=None, data_path='', data_format='batches', target_folder='', batch_size=1000, batch_name_type='code', data_weight=1.0)
Parameters:
- collection_name (str) – the name of the text collection (required if data_format == 'bow_uci')
- data_path (str) – 1) if data_format == 'bow_uci' => folder containing the 'docword.collection_name.txt' and 'vocab.collection_name.txt' files; 2) if data_format == 'vowpal_wabbit' => file in Vowpal Wabbit format; 3) if data_format == 'plain_text' => file with text; 4) if data_format == 'batches' => folder containing batches
- data_format (str) – the type of input data: 1) 'bow_uci': Bag-of-Words in UCI format; 2) 'vowpal_wabbit': Vowpal Wabbit format; 3) 'batches': the BigARTM data format
- batch_size (int) – number of documents to be stored in each batch
- target_folder (str) – full path to the folder where the generated batches will be stored
- batches (list of str) – list of batch file names given without the full path (in this case the required parameters are batches, data_path and data_format == 'batches')
- batch_name_type (str) – how batches are named: in natural order ('code') or using random GUIDs ('guid')
- data_weight (float) – weight for the group of batches from data_path; it can also be a list of floats, in which case data_path (and target_folder, if data_format != 'batches') must also be lists, with one weight per path in data_path
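
To make the 'bow_uci' case concrete, here is a minimal sketch of preparing a tiny collection and passing it to the constructor. The helper names write_uci_collection and build_vectorizer, the collection name and the sample documents are hypothetical and not part of the artm API. The sketch assumes the standard UCI bag-of-words layout (three header lines, then 1-based "docID wordID count" triples). The artm import is guarded, so the file-writing part runs even without BigARTM installed.

```python
import os
import tempfile


def write_uci_collection(folder, collection_name, docs):
    """Write tokenized documents as UCI bag-of-words files (hypothetical helper).

    Produces vocab.<collection_name>.txt and docword.<collection_name>.txt
    in `folder` and returns (num_docs, vocab_size, num_nonzero_entries).
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    word_id = {tok: i + 1 for i, tok in enumerate(vocab)}  # UCI ids are 1-based

    triples = []
    for doc_id, doc in enumerate(docs, start=1):
        counts = {}
        for tok in doc:
            counts[tok] = counts.get(tok, 0) + 1
        for tok in sorted(counts):
            triples.append((doc_id, word_id[tok], counts[tok]))

    with open(os.path.join(folder, 'vocab.%s.txt' % collection_name), 'w') as f:
        f.write('\n'.join(vocab) + '\n')

    with open(os.path.join(folder, 'docword.%s.txt' % collection_name), 'w') as f:
        # Header: number of documents, vocabulary size, number of triples.
        f.write('%d\n%d\n%d\n' % (len(docs), len(vocab), len(triples)))
        for triple in triples:
            f.write('%d %d %d\n' % triple)

    return len(docs), len(vocab), len(triples)


def build_vectorizer(data_path, collection_name, target_folder=None):
    """Create a BatchVectorizer for a UCI collection, or None if artm is missing."""
    try:
        import artm
    except ImportError:
        return None
    if target_folder is None:
        # Assumption: any writable folder is acceptable for the new batches.
        target_folder = tempfile.mkdtemp()
    return artm.BatchVectorizer(data_path=data_path,
                                data_format='bow_uci',
                                collection_name=collection_name,
                                target_folder=target_folder,
                                batch_size=1000)
```

A typical call sequence would be write_uci_collection(folder, 'demo', docs) followed by build_vectorizer(folder, 'demo'). Batches produced this way can later be reloaded with data_format='batches' and data_path pointing at the target folder.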

batch_size
Returns: the user-defined size of the batches

batches_list
Returns: list of batch names

data_path
Returns: the disk path of the batches

num_batches
Returns: the number of batches

weights
Returns: list of batch weights
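
The properties above can be read together for logging or sanity checks. The following is a minimal sketch: summarize_vectorizer is a hypothetical helper, not part of artm, and the SimpleNamespace object is only a stand-in with made-up values so that the snippet runs without BigARTM. With the library installed, a real BatchVectorizer instance would be passed instead.

```python
from types import SimpleNamespace


def summarize_vectorizer(bv):
    """Collect the documented read-only properties into a plain dict."""
    return {
        'batch_size': bv.batch_size,
        'num_batches': bv.num_batches,
        'data_path': bv.data_path,
        'batches_list': list(bv.batches_list),
        'weights': list(bv.weights),
    }


# Stand-in object with illustrative values; replace with a real
# artm.BatchVectorizer instance when BigARTM is available.
stand_in = SimpleNamespace(batch_size=1000,
                           num_batches=2,
                           data_path='my_batches',
                           batches_list=['aaaaaa.batch', 'aaaaab.batch'],
                           weights=[1.0, 1.0])

summary = summarize_vectorizer(stand_in)
for name in sorted(summary):
    print('%s: %r' % (name, summary[name]))
```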