Messages¶

This document explains all protobuf messages that can be transferred between the user code and the BigARTM library.

Warning

Remember that all fields are marked as optional to enhance backwards compatibility of the binary protobuf format. Some fields will result in a run-time exception when not specified. Please refer to the documentation of each field for more details.

DoubleArray¶

class messages_pb2.DoubleArray¶

Represents an array of double-precision floating point values.

message DoubleArray {
  repeated double value = 1 [packed = true];
}

BoolArray¶

class messages_pb2.BoolArray¶

Represents an array of boolean values.

message BoolArray {
  repeated bool value = 1 [packed = true];
}

Item¶

class messages_pb2.Item¶

Represents a unit of textual information. A typical example of an item is a document that belongs to some text collection.

message Item {
  optional int32 id = 1;
  repeated Field field = 2;
  optional string title = 3;
}

Item.id¶: An integer identifier of the item.

Item.field¶: A set of all fields within the item.

Item.title¶: An optional title of the item.

Field¶

class messages_pb2.Field¶

Represents a field within an item. The idea behind fields is that each item might have its title, author, body, abstract, actual text, links, year of publication, etc. Each of these entities should be represented as a Field. The topic model defines how those fields should be taken into account when BigARTM infers a topic model. Currently each field is represented as a “bag-of-words”: each token is listed together with the number of its occurrences. Note that each Field is always part of an Item, an Item is part of a Batch, and a Batch always contains a list of tokens. Therefore, each Field just lists the indexes of tokens in the Batch.

message Field {
  optional string name = 1 [default = "@body"];
  repeated int32 token_id = 2;
  repeated int32 token_count = 3;
  repeated int32 token_offset = 4;

  optional string string_value = 5;
  optional int64 int_value = 6;
  optional double double_value = 7;
  optional string date_value = 8;

  repeated string string_array = 16;
  repeated int64 int_array = 17;
  repeated double double_array = 18;
  repeated string date_array = 19;
}

Batch¶

class messages_pb2.Batch¶

Represents a set of items. In BigARTM a batch is never split into smaller parts. When it comes to concurrency this means that each batch goes to a single processor. Two batches can be processed concurrently, but items in one batch are always processed sequentially.

message Batch {
  repeated string token = 1;
  repeated Item item = 2;
  repeated string class_id = 3;
  optional string description = 4;
  optional string id = 5;
}

Batch.token¶: A set of values that defines all tokens that may appear in the batch.

Batch.item¶: A set of items of the batch.

Batch.class_id¶: A set of values that define the classes (modalities) of tokens. This repeated field must have the same length as token. This value is optional; use an empty list to indicate that all tokens belong to the default class.

Batch.description¶: An optional text description of the batch. You may describe, for example, the source of the batch, the preprocessing technique, and the structure of its fields.

Batch.id¶: Unique identifier of the batch in the form of a GUID (example: 4fb38197-3f09-4871-9710-392b14f00d2e). This field is required.
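
As a rough illustration of how Batch, Item and Field fit together, the sketch below builds plain Python dictionaries shaped like these messages from tokenized documents. It does not use messages_pb2; the helper name build_batch and the dictionary layout are assumptions made only for this example.

```python
import uuid
from collections import Counter


def build_batch(documents):
    """Build a dict shaped like a Batch from a list of token lists.

    Hypothetical helper: plain dicts stand in for the protobuf messages.
    """
    tokens = []        # plays the role of Batch.token
    token_index = {}   # token -> position in Batch.token
    items = []
    for item_id, words in enumerate(documents):
        counts = Counter(words)
        field = {"name": "@body", "token_id": [], "token_count": []}
        for word in sorted(counts):
            if word not in token_index:
                token_index[word] = len(tokens)
                tokens.append(word)
            # A field stores indexes into Batch.token, not the tokens themselves.
            field["token_id"].append(token_index[word])
            field["token_count"].append(counts[word])
        items.append({"id": item_id, "field": [field]})
    # Batch.id is required and has the form of a GUID.
    return {"id": str(uuid.uuid4()), "token": tokens, "item": items}


batch = build_batch([["cat", "sat", "cat"], ["dog", "sat"]])
```

Each field here carries only token indexes and counts, so the token strings are stored once per batch rather than once per item.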

Stream¶

class messages_pb2.Stream¶

Represents a configuration of a stream. Streams provide a mechanism to split the entire collection into virtual subsets (for example, the ‘train’ and ‘test’ streams).

message Stream {
  enum Type {
    Global = 0;
    ItemIdModulus = 1;
  }

  optional Type type = 1 [default = Global];
  optional string name = 2 [default = "@global"];
  optional int32 modulus = 3;
  repeated int32 residuals = 4;
}

Stream.type¶

A value that defines the type of the stream.

`Global`	Defines a stream containing all items in the collection.
`ItemIdModulus`	Defines a stream containing all items whose ID matches modulus and residuals. An item belongs to the stream if and only if the remainder of the item ID modulo modulus is contained in the residuals field.

Stream.name¶: A value that defines the name of the stream. The name must be unique across all streams defined in the master component.
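
The membership rule for the two stream types can be sketched in a few lines of Python. The function name item_in_stream is hypothetical, stream types are passed as strings rather than enum values, and item IDs are assumed to be non-negative.

```python
def item_in_stream(item_id, stream_type="Global", modulus=None, residuals=()):
    """Return True if an item with the given ID belongs to the stream.

    Hypothetical helper mirroring the Stream.type description.
    """
    if stream_type == "Global":
        return True
    if stream_type == "ItemIdModulus":
        # Assumes non-negative item IDs and a positive modulus.
        return item_id % modulus in residuals
    raise ValueError("unknown stream type: %r" % (stream_type,))


# An 80/20 train/test split over item IDs.
train = [i for i in range(20) if item_in_stream(i, "ItemIdModulus", 10, range(0, 8))]
test = [i for i in range(20) if item_in_stream(i, "ItemIdModulus", 10, (8, 9))]
```

With modulus 10, residuals 0 to 7 select the train items and residuals 8 and 9 select the test items; the two streams do not overlap in this configuration.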

MasterComponentConfig¶

class messages_pb2.MasterComponentConfig¶

Represents a configuration of a master component.

message MasterComponentConfig {
  enum ModusOperandi {
    Local = 0;
    Network = 1;
  }

  optional ModusOperandi modus_operandi = 1 [default = Local];
  optional string disk_path = 2;
  repeated Stream stream = 3;
  optional bool compact_batches = 4 [default = true];
  optional bool cache_theta = 5 [default = false];
  optional int32 processors_count = 6 [default = 1];
  optional int32 processor_queue_max_size = 7 [default = 10];
  optional int32 merger_queue_max_size = 8 [default = 10];
  repeated ScoreConfig score_config = 9;
  optional string create_endpoint = 10;
  optional string connect_endpoint = 11;
  repeated string node_connect_endpoint = 12;
  optional bool online_batch_processing = 13 [default = false];  // obsolete in BigARTM v0.5.8
  optional int32 communication_timeout = 14 [default = 1000];
  optional string disk_cache_path = 15;
}

MasterComponentConfig.modus_operandi¶

A value that defines the modus operandi of the master component.

`Local`	Defines a master component that operates in the local mode. In this mode master component is self-contained, and does not require any external nodes to tune topic model.
`Network`	Defines a master component that operates in the network mode. In this mode master component delegates all heavy processing to external nodes. The master component is then responsible only for merging the results from external nodes into a single topic model.

MasterComponentConfig.disk_path¶: A value that defines the disk location to store or load the collection. In network modus operandi this field is required, and it must point to a network file share, accessible by all nodes connected to the master component.

MasterComponentConfig.stream¶: A set of all data streams to configure in master component. Streams can overlap if needed.

MasterComponentConfig.compact_batches¶: A flag indicating whether to compact batches in AddBatch() operation. Compaction is a process that shrinks the dictionary of each batch by removing all unused tokens.

MasterComponentConfig.cache_theta¶: A flag indicating whether to cache theta matrix. Theta matrix defines the discrete probability distribution of each document across the topics in topic model. By default BigARTM infers this distribution every time it processes the document. Option ‘cache_theta’ allows BigARTM to cache this theta matrix and re-use theta values when the same document is processed on the next iteration. This option must be set to ‘true’ before calling method ‘ArtmRequestThetaMatrix’. This feature is currently not supported in network modus operandi.

MasterComponentConfig.processors_count¶: A value that defines the number of concurrent processor components. In network modus operandi this value defines the processors count at every remote node controller, connected to the master component. The number of processors should normally not exceed the number of CPU cores.

MasterComponentConfig.processor_queue_max_size¶

A value that defines the maximal size of the processor queue. The processor queue contains batches prefetched from disk into memory. In network modus operandi this value defines the maximal queue size at every remote node controller, connected to the master component. Recommendations regarding the maximal queue size are as follows:

the queue size should be at least as large as the number of concurrent processors;
the total size of the queues across all node controllers should not exceed the number of batches in the collection.
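
The two recommendations above can be expressed as a small sanity check. This is only a sketch: check_queue_size is a hypothetical helper, and treating a local master component as a single node (node_count=1) is an assumption.

```python
def check_queue_size(queue_size, processors_count, batch_count, node_count=1):
    """Return a list of violated recommendations (empty if none).

    Hypothetical helper; node_count=1 is assumed for the local mode.
    """
    problems = []
    if queue_size < processors_count:
        problems.append("queue is smaller than processors_count")
    if queue_size * node_count > batch_count:
        problems.append("total queue size exceeds the number of batches")
    return problems
```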

MasterComponentConfig.merger_queue_max_size¶: A value that defines the maximal size of the merger queue. The merger queue contains incremental updates of the topic model, produced by processor components. Try reducing this parameter if BigARTM consumes too much memory.

MasterComponentConfig.score_config¶: A set of all scores, available for calculation.

MasterComponentConfig.create_endpoint¶: A value that defines ZeroMQ endpoint to expose the master component service (example: tcp://*:5555). For further details about the format of endpoint string refer to ZeroMQ documentation (http://api.zeromq.org). The value is used only in the network modus operandi.

MasterComponentConfig.connect_endpoint¶: A value that defines ZeroMQ endpoint to expose the master component service (example: tcp://192.168.0.1:5555). For further details about the format of endpoint string refer to ZeroMQ documentation (http://api.zeromq.org). The value is used only in the network modus operandi.

MasterComponentConfig.node_connect_endpoint¶: A set containing all ZeroMQ endpoints of the external node controllers (example: tcp://192.168.0.2:5556). For further details about the format of endpoint string refer to ZeroMQ documentation (http://api.zeromq.org). The value is used only in the network modus operandi. A node controller component at the remote machine must be created prior to configuring master component with its endpoint.

MasterComponentConfig.online_batch_processing¶: Obsolete in BigARTM v0.5.8.

MasterComponentConfig.communication_timeout¶: A communication timeout in milliseconds. Any remote network call that exceeds the communication timeout will result in an ARTM_NETWORK_ERROR error.

MasterComponentConfig.disk_cache_path¶: A value that defines a writable disk location where this master component can store some temporary files. This can reduce memory usage, particularly when cache_theta option is enabled. Note that on clean shutdown the master component will clean this folder automatically, but otherwise it is your responsibility to clean this folder to avoid running out of disk space.

NodeControllerConfig¶

class messages_pb2.NodeControllerConfig¶

Represents a configuration of a NodeController.

message NodeControllerConfig {
  optional string create_endpoint = 1;
}

NodeControllerConfig.create_endpoint¶: A value that defines ZeroMQ endpoint to expose the node component service (example: tcp://*:5556). For further details about the format of endpoint string refer to ZeroMQ documentation (http://api.zeromq.org).

MasterProxyConfig¶

class messages_pb2.MasterProxyConfig¶

Represents a configuration of a proxy to MasterComponent. The purpose of the proxy is to operate MasterComponent from a remote machine. There is no requirement to run MasterComponent and its proxy on the same operating system (MasterComponent can run on Linux while the proxy runs on Windows). Any type of MasterComponent (either in local or in network modus operandi) can be operated by a proxy.

message MasterProxyConfig {
  optional string node_connect_endpoint = 1;
  optional MasterComponentConfig config = 2;
  optional int32 communication_timeout = 3 [default = 1000];
  optional int32 polling_frequency = 4 [default = 50];
}

MasterProxyConfig.node_connect_endpoint¶: A value that defines ZeroMQ endpoint of an external node controller (example: tcp://192.168.0.2:5556). For further details about the format of endpoint string refer to ZeroMQ documentation (http://api.zeromq.org). A node controller component at the remote machine must be created prior to configuring master component with its endpoint.

MasterProxyConfig.config¶: A message that defines MasterComponent configuration of a remote node.

MasterProxyConfig.communication_timeout¶: A communication timeout in milliseconds. Any remote network call that exceeds the communication timeout will result in an ARTM_NETWORK_ERROR error.

MasterProxyConfig.polling_frequency¶

Defines the frequency that the proxy object uses to repeatedly poll the remote master component for a status.

When ArtmWaitIdle() on the remote master component reports ARTM_STILL_WORKING, the proxy object will retry the request within the specified polling frequency.

ModelConfig¶

class messages_pb2.ModelConfig¶

Represents a configuration of a topic model.

message ModelConfig {
  optional string name = 1 [default = "@model"];
  optional int32 topics_count = 2 [default = 32];
  repeated string topic_name = 3;
  optional bool enabled = 4 [default = true];
  optional int32 inner_iterations_count = 5 [default = 10];
  optional string field_name = 6 [default = "@body"];  // obsolete in BigARTM v0.5.8
  optional string stream_name = 7 [default = "@global"];
  repeated string score_name = 8;
  optional bool reuse_theta = 9 [default = false];
  repeated string regularizer_name = 10;
  repeated double regularizer_tau = 11;
  repeated string class_id = 12;
  repeated float class_weight = 13;
  optional bool use_sparse_bow = 14 [default = true];
  optional bool use_random_theta = 15 [default = false];
}

ModelConfig.name¶: A value that defines the name of the topic model. The name must be unique across all models defined in the master component.

ModelConfig.topics_count¶: A value that defines the number of topics in the topic model.

ModelConfig.topic_name¶: A repeated field that defines the names of the topics. All topic names must be unique within each topic model. This field is optional, but either topics_count or topic_name must be specified. If both are specified, then topics_count will be ignored, and the number of topics in the model will be based on the length of the topic_name field. When topic_name is not specified, the names for all topics will be autogenerated.
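
The precedence between topics_count and topic_name can be sketched as follows. The function resolve_topic_names is hypothetical, and the "topic_%d" pattern is only a placeholder: the format of the autogenerated names is not specified in this document.

```python
def resolve_topic_names(topics_count=32, topic_name=()):
    """Return the effective list of topic names for a model.

    Hypothetical helper; the autogenerated name pattern is a placeholder.
    """
    names = list(topic_name)
    if names:
        if len(set(names)) != len(names):
            raise ValueError("topic names must be unique within a model")
        # topic_name takes priority; topics_count is ignored.
        return names
    return ["topic_%d" % i for i in range(topics_count)]
```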

ModelConfig.enabled¶: A flag indicating whether to update the model during iterations.

ModelConfig.inner_iterations_count¶: A value that defines the fixed number of iterations, performed to infer the theta distribution for each document.

ModelConfig.field_name¶: Obsolete in BigARTM v0.5.8

ModelConfig.stream_name¶: A value that defines which stream the model should use.

ModelConfig.score_name¶: A set of names that defines which scores should be calculated for the model.

ModelConfig.reuse_theta¶: A flag indicating whether the model should reuse theta values cached on the previous iterations. This option requires the cache_theta flag to be set to ‘true’ in MasterComponentConfig.

ModelConfig.regularizer_name¶: A set of names that define which regularizers should be enabled for the model. This repeated field must have the same length as regularizer_tau.

ModelConfig.regularizer_tau¶: A set of values that define the regularization coefficients of the corresponding regularizer. This repeated field must have the same length as regularizer_name.

ModelConfig.class_id¶: A set of values that define for which classes (modalities) to build topic model. This repeated field must have the same length as class_weight.

ModelConfig.class_weight¶: A set of values that define the weights of the corresponding classes (modalities). This repeated field must have the same length as class_id. This value is optional, use an empty list to set equal weights for all classes.
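
The pairing of class_id and class_weight can be illustrated with a short sketch. The helper class_weights is hypothetical, and using 1.0 as the common weight when class_weight is empty is an assumption; the text above only states that the weights are equal.

```python
def class_weights(class_id, class_weight=()):
    """Map each class (modality) to its weight.

    Hypothetical helper; 1.0 as the equal weight is an assumption.
    """
    if not class_weight:
        return {cid: 1.0 for cid in class_id}
    if len(class_id) != len(class_weight):
        raise ValueError("class_id and class_weight must have the same length")
    return dict(zip(class_id, class_weight))
```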

ModelConfig.use_sparse_bow¶

A flag indicating whether to use sparse representation of the Bag-of-words data. The default setting (use_sparse_bow = true) is best suited for processing textual collections where every token is represented in a small fraction of all documents. Dense representation (use_sparse_bow = false) is a better fit for non-textual collections (for example, for matrix factorization).

Note that class_weight and class_id must not be used together with use_sparse_bow=false.

ModelConfig.use_random_theta¶: A flag indicating whether to initialize p(t|d) distribution with random uniform distribution. The default setting (use_random_theta = false) sets p(t|d) = 1/T, where T stands for topics_count. Note that reuse_theta flag takes priority over use_random_theta flag, so that if reuse_theta = true and there is a cache entry from previous iteration the cache entry will be used regardless of use_random_theta flag.
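
The priority between reuse_theta and use_random_theta can be sketched as below. The function initial_theta is hypothetical, and normalizing uniform random draws so that they sum to one is an assumption about how a random p(t|d) might be produced; it is not taken from BigARTM.

```python
import random


def initial_theta(topics_count, reuse_theta=False, use_random_theta=False,
                  cached=None, rng=None):
    """Choose the starting p(t|d) vector for one document.

    Hypothetical helper; the random initialization scheme is an assumption.
    """
    if reuse_theta and cached is not None:
        # A cache entry wins regardless of use_random_theta.
        return list(cached)
    if use_random_theta:
        rng = rng or random.Random()
        raw = [rng.random() for _ in range(topics_count)]
        total = sum(raw)
        return [value / total for value in raw]
    # Default: p(t|d) = 1/T for every topic.
    return [1.0 / topics_count] * topics_count
```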

RegularizerConfig¶

class messages_pb2.RegularizerConfig¶

Represents a configuration of a general regularizer.

message RegularizerConfig {
  enum Type {
    SmoothSparseTheta = 0;
    SmoothSparsePhi = 1;
    DecorrelatorPhi = 2;
  }

  optional string name = 1;
  optional Type type = 2;
  optional bytes config = 3;
}

RegularizerConfig.name¶: A value that defines the name of the regularizer. The name must be unique across all names defined in the master component.

RegularizerConfig.type¶

A value that defines the type of the regularizer.

`SmoothSparseTheta`	Smooth-sparse regularizer for theta matrix
`SmoothSparsePhi`	Smooth-sparse regularizer for phi matrix
`DecorrelatorPhi`	Decorrelator regularizer for phi matrix

RegularizerConfig.config¶: A serialized protobuf message that describes regularizer config for the specific regularizer type.

SmoothSparseThetaConfig¶

class messages_pb2.SmoothSparseThetaConfig¶

Represents a configuration of a SmoothSparse Theta regularizer.

message SmoothSparseThetaConfig {
  repeated string topic_name = 1;
  repeated float alpha_iter = 2;
}

SmoothSparseThetaConfig.topic_name¶: A set of topic names that defines which topics in the model should be regularized. This value is optional, use an empty list to regularize all topics.

SmoothSparseThetaConfig.alpha_iter¶

A field of the same length as ModelConfig.inner_iterations_count that defines the relative regularization weight for every inner iteration. The actual regularization value is calculated as the product of alpha_iter[i] and ModelConfig.regularizer_tau.

To specify different regularization weight for different topics create multiple regularizers with different topic_name set, and use different values of ModelConfig.regularizer_tau.
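
A minimal sketch of the per-iteration weighting described above, assuming a single regularizer with one tau value. The function name theta_regularization_schedule is hypothetical.

```python
def theta_regularization_schedule(alpha_iter, tau, inner_iterations_count):
    """Return the effective regularization value for each inner iteration.

    Hypothetical helper: effective[i] = alpha_iter[i] * tau.
    """
    if len(alpha_iter) != inner_iterations_count:
        raise ValueError("alpha_iter must have inner_iterations_count elements")
    return [alpha * tau for alpha in alpha_iter]


# Regularize only during the last two of four inner iterations.
schedule = theta_regularization_schedule([0, 0, 1, 1], tau=0.5, inner_iterations_count=4)
```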

SmoothSparsePhiConfig¶

class messages_pb2.SmoothSparsePhiConfig¶

Represents a configuration of a SmoothSparse Phi regularizer.

message SmoothSparsePhiConfig {
  repeated string topic_name = 1;
  repeated string class_id = 2;
  optional string dictionary_name = 3;
}

SmoothSparsePhiConfig.topic_name¶: A set of topic names that defines which topics in the model should be regularized. This value is optional, use an empty list to regularize all topics.

SmoothSparsePhiConfig.class_id¶: This set defines which classes in the model should be regularized. This value is optional, use an empty list to regularize all classes.

SmoothSparsePhiConfig.dictionary_name¶

An optional value defining the name of the dictionary to use. The entries of the dictionary are expected to have DictionaryEntry.key_token, DictionaryEntry.class_id and DictionaryEntry.value fields. The actual regularization value will be calculated as a product of DictionaryEntry.value and ModelConfig.regularizer_tau.

This value is optional; if no dictionary is specified, then all tokens will be regularized with the same weight.

DecorrelatorPhiConfig¶

class messages_pb2.DecorrelatorPhiConfig¶

Represents a configuration of a Decorrelator Phi regularizer.

message DecorrelatorPhiConfig {
  repeated string topic_name = 1;
  repeated string class_id = 2;
}

DecorrelatorPhiConfig.topic_name¶: A set of topic names that defines which topics in the model should be regularized. This value is optional, use an empty list to regularize all topics.

DecorrelatorPhiConfig.class_id¶: This set defines which classes in the model should be regularized. This value is optional, use an empty list to regularize all classes.

RegularizerInternalState¶

class messages_pb2.RegularizerInternalState¶

Represents an internal state of a general regularizer.

message RegularizerInternalState {
  enum Type {
    MultiLanguagePhi = 5;
  }

  optional string name = 1;
  optional Type type = 2;
  optional bytes data = 3;
}

DictionaryConfig¶

class messages_pb2.DictionaryConfig¶

Represents a static dictionary.

message DictionaryConfig {
  optional string name = 1;
  repeated DictionaryEntry entry = 2;
  optional int32 total_token_count = 3;
  optional int32 total_items_count = 4;
}

DictionaryConfig.name¶: A value that defines the name of the dictionary. The name must be unique across all dictionaries defined in the master component.

DictionaryConfig.entry¶: A list of all entries of the dictionary.

DictionaryConfig.total_token_count¶: A sum of DictionaryEntry.token_count across all entries in this dictionary. The value is optional and might be missing when the entries in the dictionary do not carry the DictionaryEntry.token_count attribute.

DictionaryConfig.total_items_count¶: A sum of DictionaryEntry.items_count across all entries in this dictionary. The value is optional and might be missing when the entries in the dictionary do not carry the DictionaryEntry.items_count attribute.

DictionaryEntry¶

class messages_pb2.DictionaryEntry¶

Represents one entry in a static dictionary.

message DictionaryEntry {
  optional string key_token = 1;
  optional string class_id = 2;
  optional float value = 3;
  repeated string value_tokens = 4;
  optional FloatArray values = 5;
  optional int32 token_count = 6;
  optional int32 items_count = 7;
}

DictionaryEntry.key_token¶: A token that defines the key of the entry.

DictionaryEntry.class_id¶: The class of the DictionaryEntry.key_token.

DictionaryEntry.value¶: An optional generic value, associated with the entry. The meaning of this value depends on the usage of the dictionary.

DictionaryEntry.token_count¶: An optional value, indicating the overall number of token occurrences in some collection.

DictionaryEntry.items_count¶: An optional value, indicating the overall number of documents containing the token.

ScoreConfig¶

class messages_pb2.ScoreConfig¶

Represents a configuration of a general score.

message ScoreConfig {
  enum Type {
    Perplexity = 0;
    SparsityTheta = 1;
    SparsityPhi = 2;
    ItemsProcessed = 3;
    TopTokens = 4;
    ThetaSnippet = 5;
    TopicKernel = 6;
  }

  optional string name = 1;
  optional Type type = 2;
  optional bytes config = 3;
}

ScoreConfig.name¶: A value that defines the name of the score. The name must be unique across all names defined in the master component.

ScoreConfig.type¶

A value that defines the type of the score.

`Perplexity`	Defines a config of the Perplexity score
`SparsityTheta`	Defines a config of the SparsityTheta score
`SparsityPhi`	Defines a config of the SparsityPhi score
`ItemsProcessed`	Defines a config of the ItemsProcessed score
`TopTokens`	Defines a config of the TopTokens score
`ThetaSnippet`	Defines a config of the ThetaSnippet score
`TopicKernel`	Defines a config of the TopicKernel score

ScoreConfig.config¶: A serialized protobuf message that describes score config for the specific score type.

ScoreData¶

class messages_pb2.ScoreData¶

Represents a general result of score calculation.

message ScoreData {
  enum Type {
    Perplexity = 0;
    SparsityTheta = 1;
    SparsityPhi = 2;
    ItemsProcessed = 3;
    TopTokens = 4;
    ThetaSnippet = 5;
    TopicKernel = 6;
  }

  optional string name = 1;
  optional Type type = 2;
  optional bytes data = 3;
}

ScoreData.name¶: A value that describes the name of the score. This name will match the name of the corresponding score config.

ScoreData.type¶

A value that defines the type of the score.

`Perplexity`	Defines a Perplexity score data
`SparsityTheta`	Defines a SparsityTheta score data
`SparsityPhi`	Defines a SparsityPhi score data
`ItemsProcessed`	Defines an ItemsProcessed score data
`TopTokens`	Defines a TopTokens score data
`ThetaSnippet`	Defines a ThetaSnippet score data
`TopicKernel`	Defines a TopicKernel score data

ScoreData.data¶: A serialized protobuf message that provides the specific score result.

PerplexityScoreConfig¶

class messages_pb2.PerplexityScoreConfig¶

Represents a configuration of a perplexity score.

message PerplexityScoreConfig {
  enum Type {
    UnigramDocumentModel = 0;
    UnigramCollectionModel = 1;
  }

  optional string field_name = 1 [default = "@body"];  // obsolete in BigARTM v0.5.8
  optional string stream_name = 2 [default = "@global"];
  optional Type model_type = 3 [default = UnigramDocumentModel];
  optional string dictionary_name = 4;
  optional float theta_sparsity_eps = 5 [default = 1e-37];
  repeated string theta_sparsity_topic_name = 6;
}

PerplexityScoreConfig.field_name¶: Obsolete in BigARTM v0.5.8

PerplexityScoreConfig.stream_name¶: A value that defines which stream should be used in perplexity calculation.

PerplexityScore¶

class messages_pb2.PerplexityScore¶

Represents a result of calculation of a perplexity score.

message PerplexityScore {
  optional double value = 1;
  optional double raw = 2;
  optional double normalizer = 3;
  optional int32 zero_words = 4;
  optional double theta_sparsity_value = 5;
  optional int32 theta_sparsity_zero_topics = 6;
  optional int32 theta_sparsity_total_topics = 7;
}

PerplexityScore.value¶: A perplexity value which is calculated as exp(-raw/normalizer).

PerplexityScore.raw¶: A numerator of perplexity calculation. This value is equal to the log-likelihood of the topic model.

PerplexityScore.normalizer¶: A denominator of perplexity calculation. This value is equal to the total number of tokens in all processed items.

PerplexityScore.zero_words¶: A number of tokens that have zero probability p(w|t,d) in a document. Such tokens are evaluated based on the unigram document model or the unigram collection model.

PerplexityScore.theta_sparsity_value¶: A fraction of zero entries in the theta matrix.
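
The relation between value, raw and normalizer can be checked with a short sketch. It assumes that raw is a log-likelihood (a non-positive number) and that normalizer is a positive token count; the function name perplexity is hypothetical.

```python
import math


def perplexity(raw, normalizer):
    """Compute exp(-raw / normalizer), as stated for PerplexityScore.value."""
    return math.exp(-raw / normalizer)


# 1000 tokens, each predicted with probability 0.01, give a perplexity of about 100.
example = perplexity(raw=1000 * math.log(0.01), normalizer=1000)
```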

SparsityThetaScoreConfig¶

class messages_pb2.SparsityThetaScoreConfig¶

Represents a configuration of a theta sparsity score.

message SparsityThetaScoreConfig {
  optional string field_name = 1 [default = "@body"];  // obsolete in BigARTM v0.5.8
  optional string stream_name = 2 [default = "@global"];
  optional float eps = 3 [default = 1e-37];
  repeated string topic_name = 4;
}

SparsityThetaScoreConfig.field_name¶: Obsolete in BigARTM v0.5.8

SparsityThetaScoreConfig.stream_name¶: A value that defines which stream should be used in theta sparsity calculation.

SparsityThetaScoreConfig.eps¶: A small value that defines zero threshold for theta probabilities. Theta values below the threshold will be counted as zeros when calculating theta sparsity score.

SparsityThetaScoreConfig.topic_name¶: A set of topic names that defines which topics should be used for score calculation. The names correspond to ModelConfig.topic_name. This value is optional, use an empty list to calculate the score for all topics.

SparsityThetaScore¶

class messages_pb2.SparsityThetaScore¶

Represents a result of calculation of a theta sparsity score.

message SparsityThetaScore {
  optional double value = 1;
  optional int32 zero_topics = 2;
  optional int32 total_topics = 3;
}

SparsityThetaScore.value¶: A value of theta sparsity that is calculated as zero_topics / total_topics.

SparsityThetaScore.zero_topics¶: A numerator of theta sparsity score. A number of topics that have zero probability in a topic-item distribution.

SparsityThetaScore.total_topics¶: A denominator of theta sparsity score. A total number of topics in the topic-item distributions that are used in theta sparsity calculation.
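
A minimal sketch of the theta sparsity calculation, assuming the theta matrix is available as a list of per-item probability lists. The function sparsity_theta is hypothetical and ignores the topic_name filter of SparsityThetaScoreConfig.

```python
def sparsity_theta(theta, eps=1e-37):
    """Return (value, zero_topics, total_topics) for a theta matrix.

    Hypothetical helper; theta is a list of per-item topic distributions.
    """
    total_topics = sum(len(row) for row in theta)
    # Values below eps are counted as zeros.
    zero_topics = sum(1 for row in theta for p in row if p < eps)
    value = zero_topics / total_topics if total_topics else 0.0
    return value, zero_topics, total_topics
```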

SparsityPhiScoreConfig¶

class messages_pb2.SparsityPhiScoreConfig¶

Represents a configuration of a sparsity phi score.

message SparsityPhiScoreConfig {
  optional float eps = 1 [default = 1e-37];
  optional string class_id = 2;
  repeated string topic_name = 3;
}

SparsityPhiScoreConfig.eps¶: A small value that defines zero threshold for phi probabilities. Phi values below the threshold will be counted as zeros when calculating phi sparsity score.

SparsityPhiScoreConfig.class_id¶: A value that defines the class of tokens to use for score calculation. This value corresponds to ModelConfig.class_id field. This value is optional. By default the score will be calculated for the default class ('@default_class').

SparsityPhiScoreConfig.topic_name¶: A set of topic names that defines which topics should be used for score calculation. This value is optional, use an empty list to calculate the score for all topics.

SparsityPhiScore¶

class messages_pb2.SparsityPhiScore¶

Represents a result of calculation of a phi sparsity score.

message SparsityPhiScore {
  optional double value = 1;
  optional int32 zero_tokens = 2;
  optional int32 total_tokens = 3;
}

SparsityPhiScore.value¶: A value of phi sparsity that is calculated as zero_tokens / total_tokens.

SparsityPhiScore.zero_tokens¶: A numerator of phi sparsity score. A number of tokens that have zero probability in a token-topic distribution.

SparsityPhiScore.total_tokens¶: A denominator of phi sparsity score. A total number of tokens in the token-topic distributions that are used in phi sparsity calculation.

ItemsProcessedScoreConfig¶

class messages_pb2.ItemsProcessedScoreConfig¶

Represents a configuration of an items processed score.

message ItemsProcessedScoreConfig {
  optional string field_name = 1 [default = "@body"];  // obsolete in BigARTM v0.5.8
  optional string stream_name = 2 [default = "@global"];
}

ItemsProcessedScoreConfig.field_name¶: Obsolete in BigARTM v0.5.8

ItemsProcessedScoreConfig.stream_name¶: A value that defines which stream should be used in calculation of processed items.

ItemsProcessedScore¶

class messages_pb2.ItemsProcessedScore¶

Represents a result of calculation of an items processed score.

message ItemsProcessedScore {
  optional int32 value = 1;
}

ItemsProcessedScore.value¶: A number of items that belong to the stream ItemsProcessedScoreConfig.stream_name and have been processed during iterations. Currently this number is aggregated throughout all iterations.

TopTokensScoreConfig¶

class messages_pb2.TopTokensScoreConfig¶

Represents a configuration of a top tokens score.

message TopTokensScoreConfig {
  optional int32 num_tokens = 1 [default = 10];
  optional string class_id = 2;
  repeated string topic_name = 3;
}

TopTokensScoreConfig.num_tokens¶: A value that defines how many top tokens should be retrieved for each topic.

TopTokensScoreConfig.class_id¶

A value that defines for which class of the model to collect top tokens. This value corresponds to ModelConfig.class_id field.

This parameter is optional. By default tokens will be retrieved for the default class ('@default_class').

TopTokensScoreConfig.topic_name¶

A set of values that represent the names of the topics to include in the result. The names correspond to ModelConfig.topic_name.

This parameter is optional. By default top tokens will be calculated for all topics in the model.

TopTokensScore¶

class messages_pb2.TopTokensScore¶

Represents a result of calculation of a top tokens score.

message TopTokensScore {
  optional int32 num_entries = 1;
  repeated string topic_name = 2;
  repeated int32 topic_index = 3;
  repeated string token = 4;
  repeated float weight = 5;
}

The data in this score is represented in a table-like format, sorted by topic_index. The following code block gives a typical usage example. The loop below is guaranteed to process all top-N tokens for the first topic, then for the second topic, etc.

for (int i = 0; i < top_tokens_score.num_entries(); i++) {
  // Gives an index from 0 to (model_config.topics_count() - 1)
  int topic_index = top_tokens_score.topic_index(i);

  // Gives one of the top-N tokens for topic 'topic_index'
  const std::string& token = top_tokens_score.token(i);

  // Gives the weight of the token
  float weight = top_tokens_score.weight(i);
}

TopTokensScore.num_entries¶: A value indicating the overall number of entries in the score. All the remaining repeated fields in this score will have this length.

TopTokensScore.token¶: A repeated field of num_entries elements, containing tokens with high probability.

TopTokensScore.weight¶: A repeated field of num_entries elements, containing the p(t|w) probabilities.

TopTokensScore.topic_index¶: A repeated field of num_entries elements, containing integers between 0 and (ModelConfig.topics_count - 1).

TopTokensScore.topic_name¶: A repeated field of num_entries elements, corresponding to the values of ModelConfig.topic_name field.
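The table-like layout can also be regrouped per topic. The following is a minimal Python sketch that uses plain lists as stand-ins for the repeated fields of a TopTokensScore message; the helper name group_top_tokens and the sample values are hypothetical and are not part of BigARTM.

```python
from collections import OrderedDict


def group_top_tokens(topic_index, token, weight):
    """Group the parallel repeated fields by topic index, preserving order."""
    if not (len(topic_index) == len(token) == len(weight)):
        raise ValueError("all repeated fields must have num_entries elements")
    grouped = OrderedDict()
    for t, tok, w in zip(topic_index, token, weight):
        grouped.setdefault(t, []).append((tok, w))
    return grouped


# Hypothetical score with num_entries = 4 (two topics, two tokens each).
example = group_top_tokens(
    topic_index=[0, 0, 1, 1],
    token=["game", "team", "market", "stock"],
    weight=[0.25, 0.125, 0.5, 0.25],
)
```

With a real message the three lists would be read from the topic_index, token and weight fields.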

ThetaSnippetScoreConfig¶

class messages_pb2.ThetaSnippetScoreConfig¶

Represents a configuration of a theta snippet score.

message ThetaSnippetScoreConfig {
  optional string field_name = 1 [default = "@body"];  // obsolete in BigARTM v0.5.8
  optional string stream_name = 2 [default = "@global"];
  repeated int32 item_id = 3 [packed = true];  // obsolete in BigARTM v0.5.8
  optional int32 item_count = 4 [default = 10];
}

ThetaSnippetScoreConfig.field_name¶: Obsolete in BigARTM v0.5.8.

ThetaSnippetScoreConfig.stream_name¶: A value that defines which stream should be used in calculation of a theta snippet.

ThetaSnippetScoreConfig.item_id¶: Obsolete in BigARTM v0.5.8.

ThetaSnippetScoreConfig.item_count¶: The number of items to retrieve. ThetaSnippetScore will select last item_count processed items and return their theta vectors.
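The "last item_count processed items" behaviour can be pictured as a fixed-size sliding window. The sketch below only illustrates that selection rule; how BigARTM tracks processed items internally is not described here, and the function name last_processed_items is hypothetical.

```python
from collections import deque


def last_processed_items(item_ids, item_count=10):
    """Keep only the most recent item_count ids, in processing order."""
    window = deque(maxlen=item_count)
    for item_id in item_ids:
        window.append(item_id)  # older ids fall out once the window is full
    return list(window)


# Fifteen hypothetical item ids processed in order; only the last ten remain.
snippet_ids = last_processed_items(range(100, 115), item_count=10)
```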

ThetaSnippetScore¶

class messages_pb2.ThetaSnippetScore¶

Represents a result of calculation of a theta snippet score.

message ThetaSnippetScore {
  repeated int32 item_id = 1;
  repeated FloatArray values = 2;
}

ThetaSnippetScore.item_id¶: A set of item ids for which the theta snippet has been calculated. Items are identified by the item id.

ThetaSnippetScore.values¶: A set of values that define topic probabilities for each item. The length of this repeated field will match the number of item ids specified in ThetaSnippetScore.item_id. Each element contains a float array of topic probabilities in the natural order of topic ids.

TopicKernelScoreConfig¶

class messages_pb2.TopicKernelScoreConfig¶

Represents a configuration of a topic kernel score.

message TopicKernelScoreConfig {
  optional float eps = 1 [default = 1e-37];
  optional string class_id = 2;
  repeated string topic_name = 3;
  optional double probability_mass_threshold = 4 [default = 0.1];
}

The kernel of a topic t is defined as the list of all tokens w such that the probability p(t | w) exceeds the probability mass threshold.
The kernel size of a topic t is defined as the number of tokens in its kernel.
The purity of a topic t is defined as the sum of p(w | t) across all tokens w in the kernel.
The contrast of a topic t is defined as the sum of p(t | w) across all tokens w in the kernel, divided by the size of the kernel.
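The definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not BigARTM's implementation: phi[w][t] holds p(w | t), the topic prior p_t is supplied by the caller, and p(t | w) is obtained from them by Bayes' rule. The function name kernel_scores is hypothetical.

```python
def kernel_scores(phi, p_t, threshold=0.1):
    """Return (kernel sizes, purities, contrasts), one value per topic.

    phi[w][t] = p(w | t); p_t[t] = p(t) (assumed to be given by the caller).
    """
    num_topics = len(p_t)
    sizes = [0] * num_topics
    purity = [0.0] * num_topics
    contrast_sum = [0.0] * num_topics
    for row in phi:
        joint = [row[t] * p_t[t] for t in range(num_topics)]  # p(w, t)
        p_w = sum(joint)
        if p_w == 0.0:
            continue  # token never occurs, so it belongs to no kernel
        for t in range(num_topics):
            p_t_given_w = joint[t] / p_w
            if p_t_given_w > threshold:
                sizes[t] += 1
                purity[t] += row[t]
                contrast_sum[t] += p_t_given_w
    contrast = [contrast_sum[t] / sizes[t] if sizes[t] else 0.0
                for t in range(num_topics)]
    return sizes, purity, contrast


# Three tokens, two topics; the middle token is shared equally by both topics.
sizes, purity, contrast = kernel_scores(
    phi=[[0.5, 0.0], [0.5, 0.5], [0.0, 0.5]],
    p_t=[0.5, 0.5],
    threshold=0.6,
)
```

In this toy example the shared token has p(t | w) = 0.5 for both topics, so it falls below the threshold of 0.6 and is excluded from both kernels.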

TopicKernelScoreConfig.eps¶: Defines the minimum threshold on kernel size. In most cases this parameter should be kept at the default value.

TopicKernelScoreConfig.class_id¶: A value that defines the class of tokens to use for score calculation. This value corresponds to ModelConfig.class_id field. This value is optional. By default the score will be calculated for the default class ('@default_class').

TopicKernelScoreConfig.topic_name¶: A set of topic names that defines which topics should be used for score calculation. This value is optional. Use an empty list to calculate the score for all topics.

TopicKernelScoreConfig.probability_mass_threshold¶: Defines the probability mass threshold (see the definition of kernel above).

TopicKernelScore¶

class messages_pb2.TopicKernelScore¶

Represents a result of calculation of a topic kernel score.

message TopicKernelScore {
  optional DoubleArray kernel_size = 1;
  optional DoubleArray kernel_purity = 2;
  optional DoubleArray kernel_contrast = 3;
  optional double average_kernel_size = 4;
  optional double average_kernel_purity = 5;
  optional double average_kernel_contrast = 6;
}

TopicKernelScore.kernel_size¶: Provides the kernel size for all requested topics. The length of this DoubleArray is always equal to the overall number of topics. The values of -1 correspond to non-calculated topics. The remaining values carry the kernel size of the requested topics.

TopicKernelScore.kernel_purity¶: Provides the kernel purity for all requested topics. The length of this DoubleArray is always equal to the overall number of topics. The values of -1 correspond to non-calculated topics. The remaining values carry the kernel purity of the requested topics.

TopicKernelScore.kernel_contrast¶: Provides the kernel contrast for all requested topics. The length of this DoubleArray is always equal to the overall number of topics. The values of -1 correspond to non-calculated topics. The remaining values carry the kernel contrast of the requested topics.

TopicKernelScore.average_kernel_size¶: Provides the average kernel size across all the requested topics.

TopicKernelScore.average_kernel_purity¶: Provides the average kernel purity across all the requested topics.

TopicKernelScore.average_kernel_contrast¶: Provides the average kernel contrast across all the requested topics.

TopicModel¶

class messages_pb2.TopicModel¶

Represents a topic model.

message TopicModel {
  optional string name = 1 [default = "@model"];
  optional int32 topics_count = 2;
  repeated string topic_name = 3;
  repeated string token = 4;
  repeated FloatArray token_weights = 5;
  repeated string class_id = 6;

  message TopicModelInternals {
    repeated FloatArray n_wt = 1;
    repeated FloatArray r_wt = 2;
  }

  optional bytes internals = 7;
}

TopicModel.name¶: A value that describes the name of the topic model. This name will match the name of the corresponding model config.

TopicModel.topics_count¶: A value that describes the number of topics in the topic model. This value will match ModelConfig.topics_count value, defined in the model config.

TopicModel.token¶: The set of all tokens, included in the topic model.

TopicModel.token_weights¶: A set of token weights. The length of this repeated field will match the length of the repeated field ‘token’. The length of each FloatArray will match the topics_count field.

TopicModel.class_id¶: A set of values that specify the class (modality) of the tokens. The length of this repeated field will match the length of the repeated field ‘token’.

TopicModel.internals¶: A serialized instance of TopicModelInternals message.
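The parallel layout of token, class_id and token_weights can be turned into a lookup table. The sketch below uses plain Python lists in place of a real TopicModel message; the helper name phi_by_token and the sample data are hypothetical.

```python
def phi_by_token(topics_count, token, class_id, token_weights):
    """Map (class_id, token) to its list of topics_count weights."""
    if not (len(token) == len(class_id) == len(token_weights)):
        raise ValueError("token, class_id and token_weights must have equal length")
    table = {}
    for tok, cls, weights in zip(token, class_id, token_weights):
        if len(weights) != topics_count:
            raise ValueError("each weight array must have topics_count values")
        table[(cls, tok)] = list(weights)
    return table


# Hypothetical model with two topics and two tokens.
phi_table = phi_by_token(
    topics_count=2,
    token=["apple", "python"],
    class_id=["@default_class", "@default_class"],
    token_weights=[[0.75, 0.25], [0.0, 1.0]],
)
```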

ThetaMatrix¶

class messages_pb2.ThetaMatrix¶

Represents a theta matrix.

message ThetaMatrix {
  optional string model_name = 1 [default = "@model"];
  repeated int32 item_id = 2;
  repeated FloatArray item_weights = 3;
  repeated string topic_name = 4;
  optional int32 topics_count = 5;
  repeated string item_title = 6;
}

ThetaMatrix.model_name¶: A value that describes the name of the topic model. This name will match the name of the corresponding model config.

ThetaMatrix.item_id¶: A set of item IDs corresponding to Item.id values.

ThetaMatrix.item_weights¶: A set of item ID weights. The length of this repeated field will match the length of the repeated field ‘item_id’. The length of each FloatArray will match the number of topics in the model.

ThetaMatrix.topic_name¶: A set of values that represent the names of the topics, included in this theta matrix. The names correspond to ModelConfig.topic_name.

ThetaMatrix.topics_count¶: A value that describes the number of topics in the topic model. This value will match ModelConfig.topics_count value, defined in the model config.

ThetaMatrix.item_title¶: A set of item titles, corresponding to Item.title values.
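As an illustration of how item_id, item_weights and topic_name line up, the following sketch picks the highest-weighted topic for each item. Plain lists stand in for the fields of a ThetaMatrix message; the function name dominant_topics and the sample values are hypothetical.

```python
def dominant_topics(item_id, item_weights, topic_name):
    """Return {item id: name of the topic with the largest weight}."""
    if len(item_id) != len(item_weights):
        raise ValueError("item_id and item_weights must have equal length")
    result = {}
    for iid, weights in zip(item_id, item_weights):
        if len(weights) != len(topic_name):
            raise ValueError("each weight array must cover every topic")
        best = max(range(len(weights)), key=weights.__getitem__)
        result[iid] = topic_name[best]
    return result


# Hypothetical theta matrix with two items and three topics.
dominant = dominant_topics(
    item_id=[10, 11],
    item_weights=[[0.125, 0.75, 0.125], [0.5, 0.25, 0.25]],
    topic_name=["sports", "finance", "science"],
)
```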

CollectionParserConfig¶

class messages_pb2.CollectionParserConfig¶

Represents a configuration of a collection parser.

message CollectionParserConfig {
  enum Format {
    BagOfWordsUci = 0;
    MatrixMarket = 1;
  }

  optional Format format = 1 [default = BagOfWordsUci];
  optional string docword_file_path = 2;
  optional string vocab_file_path = 3;
  optional string target_folder = 4;
  optional string dictionary_file_name = 5;
  optional int32 num_items_per_batch = 6 [default = 1000];
  optional string cooccurrence_file_name = 7;
  repeated string cooccurrence_token = 8;
}

CollectionParserConfig.format¶

A value that defines the format of a collection to be parsed.

BagOfWordsUci

A bag-of-words collection, stored in UCI format.
UCI format requires two files, docword.*.txt
and vocab.*.txt, defined by
docword_file_path
and vocab_file_path.
The format of the docword.*.txt file is 3 header
lines, followed by NNZ triples:

D
W
NNZ
docID wordID count
docID wordID count
...
docID wordID count

The file must be sorted on docID.
Values of wordID must be one-based (not zero-based).
The vocab.*.txt file contains one word per line; line n holds the word with wordID=n.
Note that words must not have spaces or tabs.
In vocab.*.txt file it is also possible to specify
Batch.class_id for tokens, as it is shown in this example:

token1 @default_class
token2 custom_class
token3 @default_class
token4

Use space or tab to separate token from its class.
Tokens that are not followed by a class label automatically
get '@default_class' as a label (see 'token4' in the example).

MatrixMarket

See the description at http://math.nist.gov/MatrixMarket/formats.html
In this mode parameter docword_file_path must refer to a file
in Matrix Market format. Parameter vocab_file_path
is also required and must refer to a dictionary file exported in
gensim format (dictionary.save_as_text()).
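To make the BagOfWordsUci layout concrete, here is a small Python reader for both files. It is a sketch of the file format as described above and not the parser used by BigARTM; the function names and the in-memory sample data are hypothetical, and skipping blank lines is an assumption of this sketch.

```python
import io


def parse_uci_docword(stream):
    """Read the 3 header lines (D, W, NNZ) and the NNZ 'docID wordID count' triples."""
    d = int(stream.readline())
    w = int(stream.readline())
    nnz = int(stream.readline())
    triples = []
    for line in stream:
        if not line.strip():
            continue  # assumption: blank lines are ignored
        doc_id, word_id, count = map(int, line.split())
        if not 1 <= word_id <= w:
            raise ValueError("wordID must be one-based and not exceed W")
        triples.append((doc_id, word_id, count))
    if len(triples) != nnz:
        raise ValueError("expected %d triples, found %d" % (nnz, len(triples)))
    if any(a[0] > b[0] for a, b in zip(triples, triples[1:])):
        raise ValueError("the file must be sorted on docID")
    return d, w, triples


def parse_uci_vocab(stream, default_class="@default_class"):
    """Return [(token, class_id)]; list position n - 1 corresponds to wordID = n."""
    vocab = []
    for line in stream:
        parts = line.split()  # splits on spaces and tabs
        if not parts:
            continue  # assumption: blank lines are ignored
        vocab.append((parts[0], parts[1] if len(parts) > 1 else default_class))
    return vocab


docword_text = "2\n4\n3\n1 1 2\n1 3 1\n2 4 5\n"
vocab_text = "token1 @default_class\ntoken2 custom_class\ntoken3 @default_class\ntoken4\n"
d, w, triples = parse_uci_docword(io.StringIO(docword_text))
vocab = parse_uci_vocab(io.StringIO(vocab_text))
```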

CollectionParserConfig.docword_file_path¶: A value that defines the disk location of a docword.*.txt file (the bag of words file in sparse format).

CollectionParserConfig.vocab_file_path¶: A value that defines the disk location of a vocab.*.txt file (the file with the vocabulary of the collection).

CollectionParserConfig.target_folder¶: A value that defines the disk location where all the results of parsing the collection are stored. Usually the resulting location will contain a set of batches, and a DictionaryConfig that contains all unique tokens that occurred in the collection. This location can then be passed to MasterComponent via MasterComponentConfig.disk_path.

CollectionParserConfig.dictionary_file_name¶

The name of the file in which to save the DictionaryConfig message that contains all unique tokens that occurred in the collection. The file will be created in target_folder.

This parameter is optional. The dictionary will still be collected when this parameter is not provided; in that case it is only returned as the result of ArtmRequestParseCollection and is not stored to disk.

In the resulting dictionary each entry will have the following fields:

DictionaryEntry.key_token - the textual representation of the token,
DictionaryEntry.class_id - the label of the default class (“@DefaultClass”),
DictionaryEntry.token_count - the overall number of occurrences of the token in the collection,
DictionaryEntry.items_count - the number of documents in the collection, containing the token.
DictionaryEntry.value - the ratio between token_count and total_token_count.

Use ArtmRequestLoadDictionary method to load the resulting dictionary.
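The per-token statistics listed above can be reproduced with a short Python sketch. It assumes that total_token_count is the sum of token_count over all tokens in the collection; the function name dictionary_entries and the sample documents are hypothetical.

```python
from collections import Counter


def dictionary_entries(documents):
    """documents: list of {token: count} bags. Returns per-token statistics."""
    token_count = Counter()  # overall number of occurrences of each token
    items_count = Counter()  # number of documents containing each token
    for bag in documents:
        for tok, n in bag.items():
            token_count[tok] += n
            items_count[tok] += 1
    total_token_count = sum(token_count.values())  # assumed definition
    return {
        tok: {
            "token_count": token_count[tok],
            "items_count": items_count[tok],
            "value": token_count[tok] / total_token_count,
        }
        for tok in token_count
    }


entries = dictionary_entries([{"a": 3, "b": 1}, {"a": 1, "c": 3}])
```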

CollectionParserConfig.num_items_per_batch¶: A value indicating the desired number of items per batch.

CollectionParserConfig.cooccurrence_file_name¶

A file name where to save the DictionaryConfig message that contains information about co-occurrence of all pairs of tokens in the collection. The file will be created in target_folder.

This parameter is optional. No cooccurrence information will be collected if the filename is not provided.

In the resulting dictionary each entry will correspond to two tokens (‘<first>’ and ‘<second>’), and carry the information about co-occurrence of these tokens in the collection.

DictionaryEntry.key_token - a string of the form ‘<first>~<second>’, produced by concatenation of two tokens together via the tilde symbol (‘~’). The <first> token is guaranteed to be lexicographically less than the <second> token.
DictionaryEntry.class_id - the label of the default class (“@DefaultClass”).
DictionaryEntry.items_count - the number of documents in the collection, containing both tokens (‘<first>’ and ‘<second>’).

Use ArtmRequestLoadDictionary method to load the resulting dictionary.

CollectionParserConfig.cooccurrence_token¶: A list of tokens to collect cooccurrence information. A cooccurrence of the pair <first>~<second> will be collected only when both tokens are present in CollectionParserConfig.cooccurrence_token.
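A sketch of how such co-occurrence entries could be counted is shown below. It follows the key format and the filtering rule described above, but it is not BigARTM's implementation; the function name cooccurrence_entries and the sample documents are hypothetical, and lexicographic order is taken to be Python's default string ordering.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_entries(documents, cooccurrence_token):
    """Return {'<first>~<second>': items_count} for pairs of allowed tokens."""
    allowed = set(cooccurrence_token)
    items_count = Counter()
    for doc in documents:
        # Sorting guarantees first < second in every generated pair.
        present = sorted(set(doc) & allowed)
        for first, second in combinations(present, 2):
            items_count[first + "~" + second] += 1
    return dict(items_count)


pairs = cooccurrence_entries(
    documents=[["b", "a", "c"], ["a", "b"], ["c", "d"]],
    cooccurrence_token=["a", "b", "c"],
)
```

The token "d" is not listed in cooccurrence_token, so the third document contributes no pairs.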

SynchronizeModelArgs¶

class messages_pb2.SynchronizeModelArgs¶

Represents an argument of synchronize model operation.

message SynchronizeModelArgs {
  optional string model_name = 1;
  optional float decay_weight = 2 [default = 1.0];
  optional bool invoke_regularizers = 3 [default = true];
  optional float apply_weight = 4 [default = 1.0];
}

SynchronizeModelArgs.model_name¶: The name of the model to be synchronized. This value is optional. When not set, all models will be synchronized with the same decay weight.

SynchronizeModelArgs.decay_weight¶

The decay_weight and apply_weight parameters define how to combine the existing topic model with all increments calculated since the last ArtmSynchronizeModel(). This is best described by the following formula:

n_wt_new = n_wt_old * decay_weight + n_wt_inc * apply_weight,

where n_wt_old describes the current topic model, n_wt_inc describes the increment calculated since the last ArtmSynchronizeModel(), and n_wt_new defines the resulting topic model.

Expected values of both parameters are between 0.0 and 1.0. Here are some examples:

Combination of decay_weight=0.0 and apply_weight=1.0 states that the previous Phi matrix of the topic model will be disregarded completely, and the new Phi matrix will be formed based on new increments gathered since last model synchronize.
Combination of decay_weight=1.0 and apply_weight=1.0 states that new increments will be appended to the current Phi matrix without any decay.
Combination of decay_weight=1.0 and apply_weight=0.0 states that new increments will be disregarded, and current Phi matrix will stay unchanged.
To reproduce the Online variational Bayes for LDA algorithm by Matthew D. Hoffman set decay_weight = 1 - rho and apply_weight = rho, where parameter rho is defined as rho = pow(tau + t, -kappa). See Online Learning for Latent Dirichlet Allocation for further details.
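The update rule and the listed weight combinations can be checked with a short Python sketch. Nested lists stand in for the n_wt matrices, and rho is taken as (tau + t) raised to the power -kappa, the learning rate from the Hoffman paper; the function names synchronize and online_weights are hypothetical and are not part of the BigARTM API.

```python
def synchronize(n_wt_old, n_wt_inc, decay_weight=1.0, apply_weight=1.0):
    """n_wt_new = n_wt_old * decay_weight + n_wt_inc * apply_weight, element-wise."""
    return [
        [old * decay_weight + inc * apply_weight for old, inc in zip(row_old, row_inc)]
        for row_old, row_inc in zip(n_wt_old, n_wt_inc)
    ]


def online_weights(t, tau, kappa):
    """Return (decay_weight, apply_weight) for update number t."""
    rho = (tau + t) ** (-kappa)
    return 1.0 - rho, rho


old = [[2.0, 4.0]]
inc = [[1.0, 1.0]]
replaced = synchronize(old, inc, decay_weight=0.0, apply_weight=1.0)
appended = synchronize(old, inc, decay_weight=1.0, apply_weight=1.0)
unchanged = synchronize(old, inc, decay_weight=1.0, apply_weight=0.0)
decay, apply = online_weights(t=3, tau=1.0, kappa=0.5)
```

With tau = 1, t = 3 and kappa = 0.5, rho equals 4 to the power -0.5, that is 0.5, so the old model and the new increments are weighted equally.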

SynchronizeModelArgs.apply_weight¶: See decay_weight for the description.

SynchronizeModelArgs.invoke_regularizers¶: A flag indicating whether to invoke all phi-regularizers.

InitializeModelArgs¶

class messages_pb2.InitializeModelArgs¶

Represents an argument of initialize model operation.

message InitializeModelArgs {
  optional string model_name = 1;
  optional string dictionary_name = 2;
}

InitializeModelArgs.model_name¶: The name of the model to be initialized.

InitializeModelArgs.dictionary_name¶: The name of the dictionary containing all tokens that should be initialized.

GetTopicModelArgs¶

Represents an argument of get topic model operation.

message GetTopicModelArgs {
  optional string model_name = 1;
  repeated string topic_name = 2;
  repeated string token = 3;
  repeated string class_id = 4;
}

GetTopicModelArgs.model_name¶: The name of the model to be retrieved.

GetTopicModelArgs.topic_name¶: The list of topic names to be retrieved. This value is optional. When not provided, all topics will be retrieved.

GetTopicModelArgs.token¶: The list of tokens to be retrieved. The length of this field must match the length of class_id field. This field is optional. When not provided, all tokens will be retrieved.

GetTopicModelArgs.class_id¶: The list of classes corresponding to all tokens. The length of this field must match the length of token field. This field is only required together with token, otherwise it is ignored.

GetThetaMatrixArgs¶

Represents an argument of get theta matrix operation.

message GetThetaMatrixArgs {
  optional string model_name = 1;
  optional Batch batch = 2;
  repeated string topic_name = 3;
  repeated int32 topic_index = 4;
  optional bool clean_cache = 5 [default = false];
}

GetThetaMatrixArgs.model_name¶: The name of the model to retrieve the theta matrix for.

GetThetaMatrixArgs.batch¶: The Batch to classify with the model.

GetThetaMatrixArgs.topic_name¶: The list of topic names, describing which topics to include in the Theta matrix. The values of this field should correspond to values in ModelConfig.topic_name. This field is optional, by default all topics will be included.

GetThetaMatrixArgs.topic_index¶

The list of topic indices, describing which topics to include in the Theta matrix. The values of this field should be integers between 0 and (ModelConfig.topics_count - 1). This field is optional. By default all topics will be included.

Note that this field acts similar to GetThetaMatrixArgs.topic_name. It is not allowed to specify both topic_index and topic_name at the same time. The recommendation is to use topic_name.
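The selection rules for topic_name and topic_index can be summarised in a small validation sketch. It mirrors the constraints stated above and does not call BigARTM; the function name resolve_topics and the sample topic names are hypothetical.

```python
def resolve_topics(model_topic_names, topic_name=(), topic_index=()):
    """Return the names of the topics to include in the Theta matrix."""
    if topic_name and topic_index:
        raise ValueError("specify either topic_name or topic_index, not both")
    if topic_name:
        unknown = [name for name in topic_name if name not in model_topic_names]
        if unknown:
            raise ValueError("unknown topic names: %r" % unknown)
        return list(topic_name)
    if topic_index:
        for i in topic_index:
            if not 0 <= i < len(model_topic_names):
                raise ValueError("topic index %d is out of range" % i)
        return [model_topic_names[i] for i in topic_index]
    return list(model_topic_names)  # default: all topics


names = ["sports", "finance", "science"]
by_name = resolve_topics(names, topic_name=["science"])
by_index = resolve_topics(names, topic_index=[0, 2])
all_topics = resolve_topics(names)
```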

GetThetaMatrixArgs.clean_cache¶: An optional flag that defines whether to clear the theta matrix cache after this operation. Setting this value to True will clear the cache for a topic model, defined by GetThetaMatrixArgs.model_name. This value is only applicable when MasterComponentConfig.cache_theta is set to True.

GetScoreValueArgs¶

Represents an argument of get score operation.

message GetScoreValueArgs {
  optional string model_name = 1;
  optional string score_name = 2;
  optional Batch batch = 3;
}

GetScoreValueArgs.model_name¶: The name of the model to retrieve the score for.

GetScoreValueArgs.score_name¶: The name of the score to retrieve.

GetScoreValueArgs.batch¶: The Batch to calculate the score for. This option is only applicable to cumulative scores. When not provided, the score will be reported for all batches processed since the last ArtmInvokeIteration().

AddBatchArgs¶

Represents an argument of ArtmAddBatch() operation.

message AddBatchArgs {
  optional Batch batch = 1;
  optional int32 timeout_milliseconds = 2;
  optional bool reset_scores = 3 [default = false];
}

AddBatchArgs.batch¶: The Batch to add.

AddBatchArgs.timeout_milliseconds¶: Timeout in milliseconds for this operation.

AddBatchArgs.reset_scores¶: An optional flag that defines whether to reset all scores before this operation.

InvokeIterationArgs¶

Represents an argument of ArtmInvokeIteration() operation.

message InvokeIterationArgs {
  optional int32 iterations_count = 1 [default = 1];
  optional bool reset_scores = 2 [default = true];
}

InvokeIterationArgs.iterations_count¶: An integer value describing how many iterations to invoke.

InvokeIterationArgs.reset_scores¶: An optional flag that defines whether to reset all scores before this operation.

WaitIdleArgs¶

Represents an argument of ArtmWaitIdle() operation.

message WaitIdleArgs {
  optional int32 timeout_milliseconds = 1 [default = -1];
}

WaitIdleArgs.timeout_milliseconds¶: Timeout in milliseconds for this operation.