BigARTM Command Line Utility
This document provides an overview of the bigartm command-line utility shipped with BigARTM.
For a detailed description of the bigartm command-line interface, refer to the bigartm.exe notebook (in Russian).
In brief, you need to download some input data (a textual collection represented in bag-of-words format).
We recommend downloading the sample collections in Vowpal Wabbit format via the links provided in the Downloads section of the tutorial.
Then you can use bigartm as described by bigartm --help.
You may also get more information about the built-in regularizers by typing bigartm --help --regularizer.
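For example, the following two invocations print the general help and the regularizer-specific help:
bigartm --help
bigartm --help --regularizer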
- Gathering Co-occurrence Statistics Files
In order to gather co-occurrence statistics files you need 2 files: a collection in Vowpal Wabbit format and a file of tokens (the so-called “vocab”) in UCI format. The vocab is needed to filter the tokens of the collection, which means that a co-occurrence won’t be calculated if some pair isn’t present in the vocab. There are 2 types of co-occurrences available now: tf and df (see the description below). You may also want to calculate the positive PMI of the gathered co-occurrence values. This information can be useful if you want to use a co-occurrence dictionary in coherence computation. The utility produces files with pointwise information as a pseudo-collection in Vowpal Wabbit format.
Note
If you want to compute co-occurrences of tokens of a non-default modality, you should specify those modalities in the vocab file. For more information about the UCI file format, please visit Input Data Formats and Datasets.
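For illustration, a vocab file that assigns some tokens to a non-default modality could look as follows (the token names and the @labels modality are hypothetical; each line contains a token, optionally followed by its class_id):
alpha
beta
gamma @labels
delta @labels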
Here is a combination of command-line keys that allows you to build co-occurrence dictionaries:
bigartm -c vw -v vocab --cooc-window 10 --cooc-min-tf 200 --write-cooc-tf cooc_tf_ --cooc-min-df 200 --write-cooc-df cooc_df_ --write-ppmi-tf ppmi_tf_ --write-ppmi-df ppmi_df_
The numbers and file names here are just an example and can be changed. For a description of each key, see the table of all available keys below.
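As a rough sketch of the output (the token names and values are hypothetical), each line of the resulting pseudo-collection starts with a token and lists its co-occurring tokens with the gathered values, in Vowpal Wabbit style:
human interface:20 computer:15 system:12
interface human:20 response:8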
You can also look at runnable examples and common mistakes in bigartm-book (in Russian).
- BigARTM CLI keys
BigARTM v0.9.0 - library for advanced topic modeling (http://bigartm.org):
Input data:
-c [ --read-vw-corpus ] arg Raw corpus in Vowpal Wabbit format
-d [ --read-uci-docword ] arg docword file in UCI format
-v [ --read-uci-vocab ] arg vocab file in UCI format
--read-cooc arg read co-occurrences format
--batch-size arg (=500) number of items per batch
--use-batches arg folder with batches to use
Dictionary:
--cooc-min-tf arg (=0) minimal value of cooccurrences of a
pair of tokens that are saved in
dictionary of cooccurrences
--cooc-min-df arg (=0) minimal value of documents in which a
specific pair of tokens occurred
together closely
--cooc-window arg (=5) number of tokens around specific token,
which are used in calculation of
cooccurrences
--dictionary-min-df arg filter out tokens present in less than
N documents / less than P% of documents
--dictionary-max-df arg filter out tokens present in more than
N documents / more than P% of documents
--dictionary-size arg (=0) limit dictionary size by filtering out
tokens with high document frequency
--use-dictionary arg filename of binary dictionary file to
use
Model:
--load-model arg load model from file before processing
-t [ --topics ] arg (=16) number of topics
--use-modality arg modalities (class_ids) and their
weights
--predict-class arg target modality to predict by theta
matrix
Learning:
-p [ --num-collection-passes ] arg (=0)
number of outer iterations (passes
through the collection)
--num-document-passes arg (=10) number of inner iterations (passes
through the document)
--update-every arg (=0) [online algorithm] requests an update
of the model after update_every
document
--tau0 arg (=1024) [online algorithm] weight option from
online update formula
--kappa arg (=0.699999988) [online algorithm] exponent option from
online update formula
--reuse-theta reuse theta between iterations
--regularizer arg regularizers (SmoothPhi,SparsePhi,Smoot
hTheta,SparseTheta,Decorrelation)
--threads arg (=-1) number of concurrent processors
(default: auto-detect)
--async invoke asynchronous version of the
online algorithm
Output:
--write-cooc-tf arg save dictionary of co-occurrences with
frequencies of co-occurrences of every
specific pair of tokens in whole
collection
--write-cooc-df arg save dictionary of co-occurrences with
number of documents in which every
specific pair occurred together
--write-ppmi-tf arg save values of positive pmi of pairs of
tokens from cooc_tf dictionary
--write-ppmi-df arg save values of positive pmi of pairs of
tokens from cooc_df dictionary
--save-model arg save the model to binary file after
processing
--save-batches arg batch folder
--save-dictionary arg filename of dictionary file
--write-model-readable arg output the model in a human-readable
format
--write-dictionary-readable arg output the dictionary in a
human-readable format
--write-predictions arg write prediction in a human-readable
format
--write-class-predictions arg write class prediction in a
human-readable format
--write-scores arg write scores in a human-readable format
--write-vw-corpus arg convert batches into plain text file in
Vowpal Wabbit format
--force force overwrite existing output files
--csv-separator arg (=;) columns separator for
--write-model-readable and
--write-predictions. Use \t or TAB to
indicate tab.
--score-level arg (=2) score level (0, 1, 2, or 3)
--score arg scores (Perplexity, SparsityTheta,
SparsityPhi, TopTokens, ThetaSnippet,
or TopicKernel)
--final-score arg final scores (same as scores)
Other options:
-h [ --help ] display this help message
--rand-seed arg specify seed for random number
generator, use system timer when not
specified
--guid-batch-name applies to save-batches and indicates
that batch names should be guids (not
sequential codes)
--response-file arg response file
--paused start paused and wait for a keystroke
(allows attaching a debugger)
--disk-cache-folder arg disk cache folder
--disable-avx-opt disable AVX optimization (gives similar
behavior of the Processor component to
BigARTM v0.5.4)
--time-limit arg (=0) limit execution time in milliseconds
--log-dir arg target directory for logging
(GLOG_log_dir)
--log-level arg min logging level (GLOG_minloglevel;
INFO=0, WARNING=1, ERROR=2, and
FATAL=3)
Examples:
* Download input data:
wget https://s3-eu-west-1.amazonaws.com/artm/docword.kos.txt
wget https://s3-eu-west-1.amazonaws.com/artm/vocab.kos.txt
wget https://s3-eu-west-1.amazonaws.com/artm/vw.mmro.txt
wget https://s3-eu-west-1.amazonaws.com/artm/vw.wiki-enru.txt.zip
* Parse docword and vocab files from UCI bag-of-words format; then fit a topic model with 20 topics:
bigartm -d docword.kos.txt -v vocab.kos.txt -t 20 --num-collection-passes 10
* Parse VW format; then save the resulting batches and dictionary:
bigartm --read-vw-corpus vw.mmro.txt --save-batches mmro_batches --save-dictionary mmro.dict
* Parse VW format from standard input; note usage of single dash '-' after --read-vw-corpus:
cat vw.mmro.txt | bigartm --read-vw-corpus - --save-batches mmro2_batches --save-dictionary mmro2.dict
* Re-save batches back into VW format:
bigartm --use-batches mmro_batches --write-vw-corpus vw.mmro.txt
* Parse only specific modalities from VW file, and save them as a new VW file:
bigartm --read-vw-corpus vw.wiki-enru.txt --use-modality @russian --write-vw-corpus vw.wiki-ru.txt
* Load and filter the dictionary on document frequency; save the result into a new file:
bigartm --use-dictionary mmro.dict --dictionary-min-df 5 --dictionary-max-df 40% --save-dictionary mmro-filter.dict
* Load the dictionary and export it in a human-readable format:
bigartm --use-dictionary mmro.dict --write-dictionary-readable mmro.dict.txt
* Use batches to fit a model with 20 topics; then save the model in a binary format:
bigartm --use-batches mmro_batches --num-collection-passes 10 -t 20 --save-model mmro.model
* Load the model and export it in a human-readable format:
bigartm --load-model mmro.model --write-model-readable mmro.model.txt
* Load the model and use it to generate predictions:
bigartm --read-vw-corpus vw.mmro.txt --load-model mmro.model --write-predictions mmro.predict.txt
* Fit model with two modalities (@default_class and @target), and use it to predict @target label:
bigartm --use-batches <batches> --use-modality @default_class,@target --topics 50 --num-collection-passes 10 --save-model model.bin
bigartm --use-batches <batches> --use-modality @default_class,@target --topics 50 --load-model model.bin
--write-predictions pred.txt --csv-separator=tab
--predict-class @target --write-class-predictions pred_class.txt --score ClassPrecision
* Fit simple regularized model (increase sparsity up to 60-70%):
bigartm -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--num-collection-passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt
--regularizer "0.05 SparsePhi" "0.05 SparseTheta"
* Fit a more advanced regularized model, with 10 sparse objective topics and 2 smooth background topics:
bigartm -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--num-collection-passes 10 --batch-size 50 --topics obj:10;background:2 --write-model-readable model.txt
--regularizer "0.05 SparsePhi #obj"
--regularizer "0.05 SparseTheta #obj"
--regularizer "0.25 SmoothPhi #background"
--regularizer "0.25 SmoothTheta #background"
* Upgrade batches in the old format (from folder 'old_folder' into 'new_folder'):
bigartm --use-batches old_folder --save-batches new_folder
* Configure logger to output into stderr:
set GLOG_logtostderr=1 & bigartm -d docword.kos.txt -v vocab.kos.txt -t 20 --num-collection-passes 10
Additional information about regularizers:
>bigartm.exe --regularizer --help
List of regularizers available in BigARTM CLI:
--regularizer "tau SmoothTheta #topics"
--regularizer "tau SparseTheta #topics"
--regularizer "tau SmoothPhi #topics @class_ids !dictionary"
--regularizer "tau SparsePhi #topics @class_ids !dictionary"
--regularizer "tau Decorrelation #topics @class_ids"
--regularizer "tau TopicSelection #topics"
--regularizer "tau LabelRegularization #topics @class_ids !dictionary"
--regularizer "tau ImproveCoherence #topics @class_ids !dictionary"
--regularizer "tau Biterms #topics @class_ids !dictionary"
List of regularizers available in BigARTM, but not exposed in CLI:
--regularizer "tau SpecifiedSparsePhi"
--regularizer "tau SmoothPtdw"
--regularizer "tau HierarchySparsingTheta"
If you are interested in seeing any of these regularizers in the BigARTM CLI, please send a message to
bigartm-users@googlegroups.com.
By default all regularizers act on the full set of topics and modalities.
To limit the action to a specific set of topics, use the hash sign (#), followed by
a list of topics (for example, #topic1;topic2) or topic groups (#obj).
Similarly, to limit the action to a specific set of class ids, use the at sign (@), followed
by the list of class ids (for example, @default_class).
Some regularizers accept a dictionary. To specify the dictionary, use the exclamation mark (!),
followed by the path to the dictionary (a .dict file in your file system).
Depending on the regularizer, the dictionary can be either optional or required.
Some regularizers expect a dictionary with tokens and their frequencies;
other regularizers expect a dictionary with token co-occurrences.
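For example, a single regularizer string that uses all three selectors (with a hypothetical topic group, class id, and dictionary file) would look like this:
--regularizer "0.1 SparsePhi #obj @default_class !my_dictionary.dict"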
For more information about regularizers refer to wiki-page:
https://github.com/bigartm/bigartm/wiki/Implemented-regularizers
To get the full help, run `bigartm --help` without the --regularizer switch.