Low-level API in C

This document explains all public methods of the low level BigARTM interface, written in plain C language.

Introduction

The goal of low level API is to expose all functionality of the library in a set of simple functions written in plain C language. This makes it easier to consume BigARTM from various programming environments. For example, the Python Interface of BigARTM uses ctypes module to call the low level BigARTM interface. Most programming environments also have similar functionality: PInvoke in C#, loadlibrary in Matlab, etc.

Typical methods of low-level API may look as follows:

int ArtmCreateMasterModel(int length, const char* master_model_config);
int ArtmFitOfflineMasterModel(int master_id, int length, const char* fit_offline_master_model_args);
int ArtmRequestTopicModel(int master_id, int length, const char* get_model_args);

This methods, similarly to most other methods in low level API, accept a serialized binary representation of some Google Protocol Buffer message. From BigARTM v0.8.2 it is also possible to pass JSON-serialized protobuf message. This might be useful if you are planing to use low-level C interface from environment where configuring protobuf libraries would be challenging. Please, refer to Messages for more details about each particular message, and Protobuf documentation regarding JSON mapping.

Note that this documentation is incomplete. For the actual list the methods of low-level C API please refer to c_interface.h. Same is true about messages documentation. It is always recommended to review the messages.proto definition.

If you plan to implement a high-wrapper around low-level API we recoomend to review the source code of existing wrappers cpp_interface.h, cpp_interface.cc (for C++ wrapper) and spec.py, api.py (for python wrapper).

List of all methods with corresponding protobuf types

                            ArtmConfigureLogging           (           artm.ConfigureLoggingArgs);
const char*               = ArtmGetVersion();
const char*               = ArtmGetLastErrorMessage();

artm.CollectionParserInfo = ArtmParseCollection            (           artm.CollectionParserConfig);

master_id                 = ArtmCreateMasterModel          (           artm.MasterModelConfig);
                            ArtmReconfigureMasterModel     (master_id, artm.MasterModelConfig);
                            ArtmReconfigureTopicName       (master_id, artm.MasterModelConfig);
                            ArtmDisposeMasterComponent     (master_id);

                            ArtmImportBatches              (master_id, artm.ImportBatchesArgs);

                            ArtmGatherDictionary           (master_id, artm.GatherDictionaryArgs);
                            ArtmFilterDictionary           (master_id, artm.FilterDictionaryArgs);
                            ArtmCreateDictionary           (master_id, artm.DictionaryData);
                            ArtmImportDictionary           (master_id, artm.ImportDictionaryArgs);
                            ArtmExportDictionary           (master_id, artm.ExportDictionaryArgs);
artm.DictionaryData       = ArtmRequestDictionary          (master_id, artm.GetDictionaryArgs);

                            ArtmInitializeModel            (master_id, artm.InitializeModelArgs);
                            ArtmExportModel                (master_id, artm.ExportModel);
                            ArtmImportModel                (master_id, artm.ImportModel);
                            ArtmOverwriteTopicModel        (master_id, artm.TopicModel);

                            ArtmFitOfflineMasterModel      (master_id, artm.FitOfflineMasterModelArgs);
                            ArtmFitOnlineMasterModel       (master_id, attm.FitOnlineMasterModelArgs);
artm.ThetaMatrix          = ArtmRequestTransformMasterModel(master_id, artm.TransformMasterModelArgs);
artm.ThetaMatrix          = ArtmRequestTransformMasterModelExternal(master_id, artm.TransformMasterModelArgs);

artm.MasterModelConfig    = ArtmRequestMasterModelConfig   (master_id);
artm.ThetaMatrix          = ArtmRequestThetaMatrix         (master_id, artm.GetThetaMatrix);
artm.ThetaMatrix          = ArtmRequestThetaMatrixExternal (master_id, artm.GetThetaMatrix);
artm.TopicModel           = ArtmRequestTopicModel          (master_id, artm.GetTopicModel);
artm.TopicModel           = ArtmRequestTopicModelExternal  (master_id, artm.GetTopicModel);
artm.ScoreData            = ArtmRequestScore               (master_id, artm.GetScoreValueArgs);
artm.ScoreArray           = ArtmRequestScoreArray          (master_id, artm.GetScoreArrayArgs);
artm.MasterComponentInfo  = ArtmRequestMasterComponentInfo (master_id, artm.GetMasterComponentInfoArgs);

                            ArtmDisposeModel               (master_id, const char* model_name);
                            ArtmDisposeDictionary          (master_id, const char* dictionary_name);
                            ArtmDisposeBatch               (master_id, const char* batch_name);
                            ArtmClearThetaCache            (master_id, artm.ClearThetaCacheArgs);
                            ArtmClearScoreCache            (master_id, artm.ClearScoreCacheArgs);
                            ArtmClearScoreArrayCache       (master_id, artm.ClearScoreArrayCacheArgs);

                            ArtmCopyRequestedMessage        (int length, char* address);
                            ArtmCopyRequestedObject         (int length, char* address);

                            ArtmSetProtobufMessageFormatToJson();
                            ArtmSetProtobufMessageFormatToBinary();
int                       = ArtmProtobufMessageFormatIsJson();

Below we give a short description of these methods.

  • ArtmConfigureLogging allows to configure logging parameters; this method is optional, you may not use it
  • ArtmGetVersion returns the version of BigARTM library
  • ArtmParseCollection parse collection in VW or UCI-BOW formats, creates batches and stores them to disk
  • ArtmCreateMasterModel / ArtmReconfigureMasterModel / ArtmDisposeMasterComponent create master model / updates its parameters / dispose given instance of master model.
  • ArtmImportBatches loads batches from disk into memory for quicker processing. This is optional, most methods that require batches can work directly with files on disk.
  • ArtmGatherDictionary / ArtmFilterDictionary / ArtmImportDictionary / ArtmExportDictionary Main methods to work with dictionaries. Gather initialized the dictionary based on a folder with batches, Filter eliminates tokens based on their frequency, Import/Export save and re-load dictionary to/from disk.
  • You may also created the dictionary from artm.DictionaryData message, that contains the list of all tokens to be included in the dictionary. To do this use method ArtmCreateDictionary (to create a dictionary) and ArtmRequestDictionary (to retrieve artm.DictionaryData for an existing dictionary).
  • ArtmInitializeModel / ArtmExportModel / ArtmImportModel handle models (e.g. matrices of size |T|*|W| such as pwt, nwt or rwt). Initialize* fills the matrix with random 0..1 values. Export and Import saves the matrix to disk and re-loads it back.
  • ArtmOverwriteTopicModel allows to overwite values in topic model (for example to manually specify initial approximation).
  • ArtmFitOfflineMasterModel — fit the model with offline algorithm
  • ArtmFitOnlineMasterModel — fit the model with online algorithm
  • ArtmRequestTransformMasterModel — apply the model to new data
  • ArtmRequestMasterModelConfig — retrieve configuration of master model
  • ArtmRequestThetaMatrix — retrieve cached theta matrix
  • ArtmRequestTopicModel — retrieve a model (e.g. pwt, nwt or rwt matrix)
  • ArtmRequestScore – retrieve score (such as perplexity, sparsity, etc)
  • ArtmRequestScoreArray – retrieve historical information for a given score
  • ArtmRequestMasterComponentInfo – retrieve diagnostics information and internal state of the master model
  • ArtmDisposeModel / ArtmDisposeDictionary / ArtmDisposeBatch – dispose specific objects
  • ArtmClearThetaCache / ArtmClearScoreCache / ArtmClearScoreArrayCache – clear specific caches
  • ArtmSetProtobufMessageFormatToJson / ArtmSetProtobufMessageFormatToBinary / ArtmProtobufMessageFormatIsJson – configure the low-level API to work with JSON-serialized protobuf messages instead of binary-serialized protobuf messages

The following operations are less important part of low-level BigARTM CLI. In most cases you won’e need them, unless you have a very specific needs.

master_id                 = ArtmDuplicateMasterComponent     (master_id, artm.DuplicateMasterComponentArgs);
                            ArtmCreateRegularizer            (master_id, artm.RegularizerConfig);
                            ArtmReconfigureRegularizer       (master_id, artm.RegularizerConfig);
                            ArtmDisposeRegularizer           (master_id, const char* regularizer_name);
                            ArtmOverwriteTopicModelNamed     (master_id, artm.TopicModel, const char* name);
                            ArtmCreateDictionaryNamed        (master_id, artm.DictionaryData, const char* name);
                            ArtmAttachModel                  (master_id, artm.AttachModelArgs, int address_length, char* address);
artm.ProcessBatchesResult = ArtmRequestProcessBatches        (master_id, artm.ProcessBatchesArgs);
artm.ProcessBatchesResult = ArtmRequestProcessBatchesExternal(master_id, artm.ProcessBatchesArgs);
                            ArtmAsyncProcessBatches          (master_id, artm.ProcessBatchesArgs);
                            ArtmMergeModel                   (master_id, artm.MergeModelArgs);
                            ArtmRegularizeModel              (master_id, artm.RegularizeModelArgs);
                            ArtmNormalizeModel               (master_id, artm.NormalizeModelArgs);
artm.Batch                = ArtmRequestLoadBatch             (           const char* filename);
                            ArtmAwaitOperation               (           int operation_id, artm.AwaitOperationArgs);
                            ArtmSaveBatch                    (           const char* disk_path, artm.Batch);

Protocol for retrieving results

The methods in low-level API can be split into two groups — those that execute certain action, and those that request certain data. For example ArtmCreateMasterModel and ArtmFitOfflineMasterModel just execute an action, while ArtmRequestTopicModel is a request for data. Naming convention is that such requests always start with ArtmRequest prefix.

1. To call execute-action method is fairly straighforward — first you create a protobuf message that describe the arguments of the operation. For example, ArtmCreateMasterModel expects artm.MasterModelConfig message, as defined in the documentation of ArtmCreateMasterModel. Then you serialize protobuf message, and pass it to the method along with the length of the serialized message. In some cases you also pass the master_id, returned by ArtmCreateMasterModel, as described in details futher below on this page. The execute-action method will typically return an error code, with zero value (or ARTM_SUCCESS) indicating successfull execution.

2. To call request-data method is more tricky. First you follow the same procedure as when calling an execute-action method, e.g. create and serialize protobuf message and pass it to your ArtmRequestXxx operation. For example, ArtmRequestTopicModel expects artm.GetTopicModelArgs message. Then the method like ArtmRequestTopicModel will return the size (in bytes) of the memory buffer that needs to be allocated by caller. To fill this buffer with actual data you need to call method

int ArtmCopyRequestedMessage(int length, char* address)

where address give a pointer to the memory buffer, and length must give the length of the buffer (e.g. must match the value returned by ArtmRequestXxx call). After ArtmCopyRequestedMessage the buffer will contain protobuf-serialized message. To deserialize this message you need to know its protobuf type, which will be defined by the documentation of the ArtmRequestXxx method that you are calling. For ArtmRequestTopicModel it will be a artm.TopicModel message.

3. Note that few ArtmRequestXxx methods has a more complex protocol that require two subsequent calls — first, to ArtmCopyRequestedMessage, and then to ArtmCopyRequestedObject. If that’s the case the name of the method will be ArtmRequestXxxExternal (for example ArtmRequestThetaMatrixExternal or ArtmRequestTopicModelExternal). Typically this is used to copy out large objects, such as theta or phi matrices, and store them directly as dense matrices, bypassing protobuf serialization. For more information see cpp_interface.cc.

A side-note on thread safety: in between calls to ArtmRequestXxx and ArtmCopyRequestedMessage the result is stored in a thread local storage. This allows you to call multiple ArtmRequestXxx methods from different threads.

Error handling

All methods in this API return an integer value. Negative return values represent an error code. See error codes for the list of all error codes. To get corresponding error message as string use ArtmGetLastErrorMessage(). Non-negative return values represent a success, and for some API methods might also incorporate some useful information. For example, ArtmCreateMasterModel() returns the ID of newly created master component, and ArtmRequestTopicModel() returns the length of the buffer that should be allocated before calling ArtmCopyRequestedMessage().

MasterId and MasterModel

The concept of Master Model is central in low-level API. Almost any interaction with the low-level API starts by calling method ArtmCreateMasterModel, which creates an instance of so-called Master Model (or Master Component), and returns its master_id – an integer identifier that refers to that instance. You need master_id in the remaining methods of the low-level API, such as ArtmFitOfflineMasterModel. master_id creates a context, or scope, that isolate different models from each other. An operation applied to a specific master_id will not affect other master components. Each master model occupy some memory — potentially a very large amount, depending on the number of topics and tokens in the model. Once you are done with a specific instance of master component you need to dispose its resources by calling ArtmDisposeMasterComponent(master_id). After that master_id is no longer valid, and it must not be used as argument to other methods.

You may use method ArtmRequestMasterComponentInfo to retrieve internal diagnostics information about master component. It will reveal its internal state and tell the config of the master component, the list of scores and regularizers, the list of phi matrices, the list of dictionaries, cache entries, and other informatino that will help to understand how master component is functioning.

Note there might be confusion between terms MasterComponent and MasterModel, throughout this page as well as in the actual naming of the methods. This is due to historical reasons, and for all practical purposes you may think that this terms refer to the same thing.

ArtmConfigureLogging

You may use ArtmConfigureLogging call to set logging parameters, such as verbosity level or directory to output logs. You are not require to call ArtmConfigureLogging, in which case logging is automaticaly initialized to INFO level, and logs are placed in the active working folder.

Note that you can set log directory just one time. Once it is set you can not change it afterwards. Method ArtmConfigureLogging will return error code INVALID_OPERATION if it detects an attempt to change logging folder after logging had been initialized. In order to set log directory the call to ArtmConfigureLogging must happen prior to calling any other methods in low-level C API. (with exception to ArtmSetProtobufMessageFormatToJson, ArtmSetProtobufMessageFormatToBinary and ArtmProtobufMessageFormatIsJson). This is because methods in c_interface may automatically initialize logging into current working directory, which later can not be changed.

Setting log directory require that target folder already exist on disk.

The following parameters can be customized with ArtmConfigureLogging.

message ConfigureLoggingArgs {
  // If specified, logfiles are written into this directory
  // instead of the default logging directory.
  optional string log_dir = 1;

  // Messages logged at a lower level than this
  // do not actually get logged anywhere
  optional int32 minloglevel = 2;

  // Log messages at a level >= this flag are automatically
  // sent to stderr in addition to log files.
  optional int32 stderrthreshold = 3;

  // Log messages go to stderr instead of logfiles
  optional bool logtostderr = 4;

  // color messages logged to stderr (if supported by terminal)
  optional bool colorlogtostderr = 5;

  // log messages go to stderr in addition to logfiles
  optional bool alsologtostderr = 6;

  // Buffer log messages for at most this many seconds
  optional int32 logbufsecs = 7;

  // Buffer log messages logged at this level or lower
  // (-1 means do not buffer; 0 means buffer INFO only; ...)
  optional int32 logbuflevel = 8;

  // approx. maximum log file size (in MB). A value of 0 will be silently overridden to 1.
  optional int32 max_log_size = 9;

  // Stop attempting to log to disk if the disk is full.
  optional bool stop_logging_if_full_disk = 10;
}

We recommend to set logbuflevel = -1 to not buffer log messages. However by default BigARTM does not set this parameter, using the same default as provided by glog.

ArtmReconfigureTopicName

To explain ArtmReconfigureTopicName we need to first start with ArtmReconfigureMasterModel. ArtmReconfigureMasterModel allow user to rename topics, but the number of topics must stay the same, and the order and the content of all existing phi matrices remains unchanged. On contrary, ArtmReconfigureTopicName may change the number of topics by adding or removing topics, as well as reorder columns of existing phi matrices. In ArtmReconfigureMasterModel the list of topic names is treated as new identifiers that should be set for existing columns. In ArtmReconfigureTopicName the list of topic names is matched against previous topic names. New topic names are added to phi matrices, topic names removed from the list are excluded from phi matrices, and topic names present in both old and new lists are re-ordered accordingly to match new topic name list.

Examples for ArtmReconfigureMasterModel:

  • t1, t2, t3 -> t4, t5, t6 sets new topic names for existing columns in phi matrices.

Examples for ArtmReconfigureTopicName:

  • t1, t2, t3 -> t1, t2, t3, t4 adds a new column to phi matrices, initialized with zeros
  • t1, t2, t3 -> t1, t2 removes last column from phi matrices
  • t1, t2, t3 -> t2, t3 removes the first column from phi matrices
  • t1, t2, t3 -> t3, t2 removes the first column from phi matrices and swaps the remaining two columns
  • t1, t2, t3 -> t4, t5, t6 removes all columns from phi matrices and creates three new columns, initialized with zeros

Note that both ArtmReconfigureTopicName and ArtmReconfigureMasterModel only affect phi matrices where set of topic names match the configuration of the master model. User-created matrices with custom set of topic names, for example created via ArtmMergeModel, will stay unchanged.

If you change topic names you should also consider changing your configuration of scores and regularizers. Also take into account that ArtmReconfigureTopicName and ArtmReconfigureMasterModel do not update theta cache. It is a good idea to call ArtmClearThetaCache after changing topic names.

ArtmGetLastErrorMessage

const char* ArtmGetLastErrorMessage()

Retrieves the textual error message, occured during the last failing request.

Error codes

#define ARTM_SUCCESS 0
#define ARTM_STILL_WORKING -1
#define ARTM_INTERNAL_ERROR -2
#define ARTM_ARGUMENT_OUT_OF_RANGE -3
#define ARTM_INVALID_MASTER_ID -4
#define ARTM_CORRUPTED_MESSAGE -5
#define ARTM_INVALID_OPERATION -6
#define ARTM_DISK_READ_ERROR -7
#define ARTM_DISK_WRITE_ERROR -8
ARTM_SUCCESS

The API call succeeded.

ARTM_STILL_WORKING

This error code is applicable only to ArtmAwaitOperation(). It indicates that library is still processing the collection. Try to retrieve results later.

ARTM_INTERNAL_ERROR

The API call failed due to internal error in BigARTM library. Please, collect steps to reproduce this issue and report it with BigARTM issue tracker.

ARTM_ARGUMENT_OUT_OF_RANGE

The API call failed because one or more values of an argument are outside the allowable range of values as defined by the invoked method.

ARTM_INVALID_MASTER_ID

An API call that require master_id parameter failed because MasterComponent with given ID does not exist.

ARTM_CORRUPTED_MESSAGE

Unable to deserialize protocol buffer message.

ARTM_INVALID_OPERATION

The API call is invalid in current state or due to provided parameters.

ARTM_DISK_READ_ERROR

The required files coult not be read from disk.

ARTM_DISK_WRITE_ERROR

The required files could not be writtent to disk.