BigARTM v0.7.3 Release notes

BigARTM v0.7.3 introduces the following changes:

  • New command line tool for BigARTM
  • Support for classification in bigartm CLI
  • Support for asynchronous processing of batches
  • Improvements in coherence regularizer and coherence score
  • New TopicMass score for phi matrix
  • Support for documents markup
  • New API for importing batches through memory

New command line tool for BigARTM

The new CLI is named bigartm (or bigartm.exe on Windows) and supersedes the previous CLI, cpp_client. It has the following features:

  • Parse a collection in one of the supported formats
  • Load dictionary
  • Initialize a new model, or import previously created model
  • Perform EM-iterations to fit the model
  • Export predicted probabilities for all documents into CSV file
  • Export model into a file

All command-line options are listed here, and several examples are available on the BigARTM page on GitHub. At the moment, full documentation is available only in Russian.

Support for classification in BigARTM CLI

The BigARTM CLI can now perform classification. The following example assumes that your batches have a target_class modality in addition to the default modality (@default_class).

# Fit model
bigartm.exe --use-batches <your batches>
            --use-modality @default_class,target_class
            --topics 50
            --dictionary-min-df 10
            --dictionary-max-df 25%
            --save-model model.bin

# Apply model and output to text files
bigartm.exe --use-batches <your batches>
            --use-modality @default_class,target_class
            --topics 50
            --passes 0
            --load-model model.bin
            --predict-class target_class
            --write-predictions pred.txt
            --write-class-predictions pred_class.txt
            --csv-separator=tab
            --score ClassPrecision

Support for asynchronous processing of batches

Asynchronous processing of batches enables applications to overlap EM-iterations and better utilize CPU resources. The following chart shows CPU utilization of bigartm.exe with the async flag (left-hand side) and without it (right-hand side).

Figure: BigARTM performance in asynchronous mode
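
The general pattern of not blocking on each batch can be sketched with Python's standard concurrent.futures module. This is only an illustration of the idea, not BigARTM's implementation: process_batch, merge, and the toy batches are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(batch):
    """Hypothetical stand-in for per-batch work: count token occurrences."""
    counts = {}
    for token in batch:
        counts[token] = counts.get(token, 0) + 1
    return counts


def merge(total, counts):
    """Fold the counts of one batch into the running totals."""
    for token, n in counts.items():
        total[token] = total.get(token, 0) + n
    return total


# Toy data: three batches of tokens (an assumption for illustration).
batches = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]

with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns immediately, so workers keep processing later
    # batches while earlier results are being merged below.
    futures = [pool.submit(process_batch, batch) for batch in batches]
    total = {}
    for future in futures:
        total = merge(total, future.result())

print(sorted(total.items()))  # [('a', 3), ('b', 2), ('c', 3)]
```

Submitting all work before collecting any result is what lets processing and merging overlap; a synchronous loop would instead wait for each batch before starting the next.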

TopicMass score for phi matrix

The TopicMass score calculates the cumulative mass of each topic. It is a useful metric for monitoring the balance between topics.
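
As a rough illustration of what such a score measures, the sketch below sums the weight of each topic over all tokens of a small token-by-topic matrix and reports each topic's share of the total. The matrix n_wt, the function name topic_mass, and the normalization are assumptions made for illustration; they are not BigARTM's exact definition.

```python
def topic_mass(n_wt):
    """Return (mass, ratio) per topic for a token-by-topic weight matrix.

    n_wt[w][t] is a hypothetical non-normalized weight of token w in topic t.
    """
    num_topics = len(n_wt[0])
    # Cumulative mass of each topic: sum of its column over all tokens.
    mass = [sum(row[t] for row in n_wt) for t in range(num_topics)]
    total = sum(mass)
    # Share of the total mass held by each topic (0.0 if the matrix is empty).
    ratio = [m / total if total > 0 else 0.0 for m in mass]
    return mass, ratio


# Toy matrix: 3 tokens (rows), 2 topics (columns).
n_wt = [
    [4.0, 0.0],
    [1.0, 3.0],
    [1.0, 1.0],
]
mass, ratio = topic_mass(n_wt)
print(mass)   # [6.0, 4.0]
print(ratio)  # [0.6, 0.4]
```

A topic whose share is close to zero, or one that absorbs most of the mass, is a sign that the topics are unbalanced.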

Support for documents markup

Document markup provides the topic distribution for each word in a document. Since BigARTM v0.7.3 it is possible to extract this information and use it. A potential application is a color-highlighted map of the document, where every word is colored according to its most probable topic.

In the code this feature is referred to as the ptdw matrix. It is possible to extract and regularize ptdw matrices. In future versions it will also be possible to calculate scores based on the ptdw matrix.
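
The color-highlighting idea can be sketched as follows, assuming the per-word topic distributions have already been extracted into a plain list of lists. The function color_markup, the toy ptdw values, and the palette are hypothetical and are not part of the BigARTM API.

```python
def color_markup(words, ptdw, palette):
    """Pair each word with the color of its most probable topic.

    ptdw[i][t] is a hypothetical p(t | d, w_i): the probability of topic t
    for the i-th word occurrence in the document.
    """
    colored = []
    for word, dist in zip(words, ptdw):
        # Index of the topic with the highest probability for this word.
        best_topic = max(range(len(dist)), key=dist.__getitem__)
        colored.append((word, best_topic, palette[best_topic]))
    return colored


words = ["topic", "model", "football"]
ptdw = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
palette = ["red", "green", "blue"]  # one illustrative color per topic

for word, topic, color in color_markup(words, ptdw, palette):
    print(f"{word}: topic {topic} ({color})")
```

Each row of the toy ptdw matrix corresponds to one word occurrence, so the same word may receive different colors in different positions or documents.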

New API for importing batches through memory

The new low-level APIs ArtmImportBatches and ArtmDisposeBatches make it possible to import batches from memory into BigARTM. Imported batches are stored in BigARTM and can be used for batch processing.