BigARTM command line utility

This document provides an overview of cpp_client, a simple command-line utility shipped with BigARTM.

To run cpp_client you need to download input data (a textual collection represented in bag-of-words format). We recommend downloading the vocab and docword files from the links provided in the Downloads section of the tutorial. You can then run cpp_client as follows:

cpp_client -d docword.kos.txt -v vocab.kos.txt

You may append the following options to customize the resulting topic model:

  • -t or --num_topic sets the number of topics in the resulting topic model.
  • -i or --num_iters sets the number of iterative scans over the collection.
  • --num_inner_iters sets the number of updates of the Theta matrix performed on each iteration.
  • --reuse_theta enables caching of the Theta matrix and reuses the Theta matrix from the previous iteration as the initial approximation on the next iteration. Without the --reuse_theta switch, the default is to generate a random approximation of the Theta matrix on each iteration.
  • --tau_phi, --tau_theta and --tau_decor allow you to specify the weights of different regularizers. Currently cpp_client does not allow you to customize regularizer weights for different topics or for different iterations. This limitation applies only to cpp_client; you can achieve this by using the BigARTM interface (either in Python or in C++).
  • --update_every is a parameter of the online algorithm. When specified, the model is updated every update_every documents.
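
The options above can be combined in a single invocation. The following is a minimal Python sketch of assembling such a command line. The helper name build_command and all numeric values are illustrative assumptions; only the flags themselves are taken from the list above. The sketch builds and prints the command and does not run cpp_client.

```python
import shlex


def build_command(docword, vocab, num_topics=16, num_iters=10,
                  num_inner_iters=10, reuse_theta=False,
                  tau_phi=0.0, tau_theta=0.0, tau_decor=0.0,
                  update_every=0):
    """Assemble a cpp_client argument list (hypothetical helper)."""
    cmd = [
        "cpp_client",
        "-d", docword,
        "-v", vocab,
        "-t", str(num_topics),
        "-i", str(num_iters),
        "--num_inner_iters", str(num_inner_iters),
    ]
    if reuse_theta:
        # Reuse the Theta matrix from the previous iteration.
        cmd.append("--reuse_theta")
    # Regularizer weights are passed only when they differ from zero.
    for flag, value in (("--tau_phi", tau_phi),
                        ("--tau_theta", tau_theta),
                        ("--tau_decor", tau_decor)):
        if value != 0:
            cmd += [flag, str(value)]
    if update_every > 0:
        # Online algorithm: update the model every update_every documents.
        cmd += ["--update_every", str(update_every)]
    return cmd


# Illustrative values only; suitable settings depend on the collection.
example = build_command("docword.kos.txt", "vocab.kos.txt",
                        num_topics=32, num_iters=20, reuse_theta=True,
                        tau_phi=0.1)
print(shlex.join(example))
```

The resulting list can be passed to subprocess.run once cpp_client is available on the PATH.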

You may also apply the following optimizations, which should not change the resulting model:

  • -p or --num_processors sets the number of concurrent processors. The recommended value is the number of logical cores on your machine.
  • --no_scores disables calculation and visualization of all scores. This is a clean way of measuring the pure performance of BigARTM, because at the moment some scores take an unnecessarily long time to calculate.
  • --disk_cache_folder applies only together with --reuse_theta. This parameter allows you to specify a writable disk location where BigARTM can cache the Theta matrix between iterations to avoid storing it in main memory.
  • --merger_queue_size limits the size of the merger queue. Decreasing the size of the queue may reduce memory usage, but may also decrease CPU utilization of the processors.
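
As a rough illustration of the performance options above, the following Python sketch derives the -p value from the number of logical cores and pairs --disk_cache_folder with --reuse_theta. The helper name performance_args and the folder name are assumptions made for this example; the fallback of 2 processors mirrors the default shown in the cpp_client help output.

```python
import os


def performance_args(no_scores=False, disk_cache_folder=None,
                     merger_queue_size=None):
    """Return performance-related cpp_client arguments (hypothetical helper)."""
    # os.cpu_count() reports the number of logical cores; it may return
    # None, in which case we fall back to 2.
    args = ["-p", str(os.cpu_count() or 2)]
    if no_scores:
        # Skip score calculation to measure pure processing time.
        args.append("--no_scores")
    if disk_cache_folder is not None:
        # The disk cache applies only together with --reuse_theta,
        # so both switches are emitted as a pair.
        args += ["--reuse_theta", "--disk_cache_folder", disk_cache_folder]
    if merger_queue_size is not None:
        args += ["--merger_queue_size", str(merger_queue_size)]
    return args


# "theta_cache" is an illustrative folder name; it must be writable.
print(performance_args(no_scores=True, disk_cache_folder="theta_cache"))
```

These arguments can be appended to the basic command, for example after the -d and -v options.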

The full list of options is shown by the --help switch:

>cpp_client --help
BigARTM - library for advanced topic modeling (http://bigartm.org):

Basic options:
  -h [ --help ]                        display this help message
  -d [ --docword ] arg                 docword file in UCI format
  -v [ --vocab ] arg                   vocab file in UCI format
  -b [ --batch_folder ] arg            If docword or vocab arguments are not
                                       provided, cpp_client will try to read
                                       pre-parsed batches from batch_folder
                                       location. Otherwise, if both docword and
                                       vocab arguments are provided, cpp_client
                                       will parse data and store batches in
                                       batch_folder location.
  -t [ --num_topic ] arg (=16)         number of topics
  -p [ --num_processors ] arg (=2)     number of concurrent processors
  -i [ --num_iters ] arg (=10)         number of outer iterations
  --num_inner_iters arg (=10)          number of inner iterations
  --reuse_theta                        reuse theta between iterations
  --dictionary_file arg (=dictionary)  filename of dictionary file
  --items_per_batch arg (=500)         number of items per batch
  --tau_phi arg (=0)                   regularization coefficient for PHI
                                       matrix
  --tau_theta arg (=0)                 regularization coefficient for THETA
                                       matrix
  --tau_decor arg (=0)                 regularization coefficient for topics
                                       decorrelation (use with care, since this
                                       value heavily depends on the size of the
                                       dataset)
  --paused                             wait for keystroke (allows to attach a
                                       debugger)
  --no_scores                          disable calculation of all scores
  --update_every arg (=0)              [online algorithm] requests an update of
                                       the model after update_every document
  --parsing_format arg (=0)            parsing format (0 - UCI, 1 - matrix
                                       market)
  --disk_cache_folder arg              disk cache folder
  --merger_queue_size arg              size of the merger queue

Networking options:
  --nodes arg                          endpoints of the remote nodes (enables
                                       network modus operandi)
  --localhost arg (=localhost)         DNS name or the IP address of the
                                       localhost
  --port arg (=5550)                   port to use for master node
  --proxy arg                          proxy endpoint
  --timeout arg (=1000)                network communication timeout in
                                       milliseconds

Examples:
        cpp_client -d docword.kos.txt -v vocab.kos.txt
        set GLOG_logtostderr=1 & cpp_client -d docword.kos.txt -v vocab.kos.txt

For further details please refer to the source code of cpp_client.
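
The second example in the help output sets GLOG_logtostderr with the Windows set command before starting cpp_client. A portable way to do the same from a script is sketched below in Python. The helper names glog_env and run_with_logging are assumptions for this example, and the sketch assumes that cpp_client reads GLOG_logtostderr from its environment, as the help output suggests. Nothing is executed until run_with_logging is called.

```python
import os
import subprocess


def glog_env(base_env=None):
    """Copy an environment and enable logging to stderr (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # Same variable as in "set GLOG_logtostderr=1" from the help output.
    env["GLOG_logtostderr"] = "1"
    return env


def run_with_logging(cmd):
    """Run cmd with GLOG_logtostderr=1; raises if the process fails."""
    return subprocess.run(cmd, env=glog_env(), check=True)


# Command taken from the Examples section of the help output.
command = ["cpp_client", "-d", "docword.kos.txt", "-v", "vocab.kos.txt"]
# run_with_logging(command)  # uncomment once cpp_client is on the PATH
```

Passing the variable through env keeps the setting local to the child process instead of changing the environment of the calling shell.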