C Interface¶
This document explains all public methods of the low level BigARTM interface, written in plain C language.
Introduction¶
The goal of low level API is to expose all functionality of the library in a set of simple functions written in plain C language. This makes it easier to consume BigARTM from various programming environments. For example, the Python Interface of BigARTM uses ctypes module to call the low level BigARTM interface. Most programming environments also have similar functionality: PInvoke in C#, loadlibrary in Matlab, etc.
Typical methods of low-level API may look as follows:
int ArtmCreateMasterModel(int length, const char* master_model_config);
int ArtmFitOfflineMasterModel(int master_id, int length, const char* fit_offline_master_model_args);
int ArtmRequestTopicModel(int master_id, int length, const char* get_model_args);
This methods, similarly to most other methods in low level API, accept a serialized binary representation of some Google Protocol Buffer message. From BigARTM v0.8.2 it is also possible to pass JSON-serialized protobuf message. This might be useful if you are planing to use low-level C interface from environment where configuring protobuf libraries would be challenging. Please, refer to Messages for more details about each particular message, and Protobuf documentation regarding JSON mapping.
Note that this documentation is incomplete. For the actual list the methods of low-level C API please refer to c_interface.h. Same is true about messages documentation. It is always recommended to review the messages.proto definition.
If you plan to implement a high-wrapper around low-level API we recoomend to review the source code of existing wrappers cpp_interface.h, cpp_interface.cc (for C++ wrapper) and spec.py, api.py (for python wrapper).
List of all methods with corresponding protobuf types¶
ArtmConfigureLogging ( artm.ConfigureLoggingArgs);
const char* = ArtmGetVersion();
const char* = ArtmGetLastErrorMessage();
artm.CollectionParserInfo = ArtmParseCollection ( artm.CollectionParserConfig);
master_id = ArtmCreateMasterModel ( artm.MasterModelConfig);
ArtmReconfigureMasterModel (master_id, artm.MasterModelConfig);
ArtmReconfigureTopicName (master_id, artm.MasterModelConfig);
ArtmDisposeMasterComponent (master_id);
ArtmImportBatches (master_id, artm.ImportBatchesArgs);
ArtmGatherDictionary (master_id, artm.GatherDictionaryArgs);
ArtmFilterDictionary (master_id, artm.FilterDictionaryArgs);
ArtmCreateDictionary (master_id, artm.DictionaryData);
ArtmImportDictionary (master_id, artm.ImportDictionaryArgs);
ArtmExportDictionary (master_id, artm.ExportDictionaryArgs);
artm.DictionaryData = ArtmRequestDictionary (master_id, artm.GetDictionaryArgs);
ArtmInitializeModel (master_id, artm.InitializeModelArgs);
ArtmExportModel (master_id, artm.ExportModel);
ArtmImportModel (master_id, artm.ImportModel);
ArtmOverwriteTopicModel (master_id, artm.TopicModel);
ArtmFitOfflineMasterModel (master_id, artm.FitOfflineMasterModelArgs);
ArtmFitOnlineMasterModel (master_id, attm.FitOnlineMasterModelArgs);
artm.ThetaMatrix = ArtmRequestTransformMasterModel(master_id, artm.TransformMasterModelArgs);
artm.ThetaMatrix = ArtmRequestTransformMasterModelExternal(master_id, artm.TransformMasterModelArgs);
artm.MasterModelConfig = ArtmRequestMasterModelConfig (master_id);
artm.ThetaMatrix = ArtmRequestThetaMatrix (master_id, artm.GetThetaMatrix);
artm.ThetaMatrix = ArtmRequestThetaMatrixExternal (master_id, artm.GetThetaMatrix);
artm.TopicModel = ArtmRequestTopicModel (master_id, artm.GetTopicModel);
artm.TopicModel = ArtmRequestTopicModelExternal (master_id, artm.GetTopicModel);
artm.ScoreData = ArtmRequestScore (master_id, artm.GetScoreValueArgs);
artm.ScoreArray = ArtmRequestScoreArray (master_id, artm.GetScoreArrayArgs);
artm.MasterComponentInfo = ArtmRequestMasterComponentInfo (master_id, artm.GetMasterComponentInfoArgs);
ArtmDisposeModel (master_id, const char* model_name);
ArtmDisposeDictionary (master_id, const char* dictionary_name);
ArtmDisposeBatch (master_id, const char* batch_name);
ArtmClearThetaCache (master_id, artm.ClearThetaCacheArgs);
ArtmClearScoreCache (master_id, artm.ClearScoreCacheArgs);
ArtmClearScoreArrayCache (master_id, artm.ClearScoreArrayCacheArgs);
ArtmCopyRequestedMessage (int length, char* address);
ArtmCopyRequestedObject (int length, char* address);
ArtmSetProtobufMessageFormatToJson();
ArtmSetProtobufMessageFormatToBinary();
int = ArtmProtobufMessageFormatIsJson();
Below we give a short description of these methods.
ArtmConfigureLogging
allows to configure logging parameters; this method is optional, you may not use itArtmGetVersion
returns the version of BigARTM libraryArtmParseCollection
parse collection in VW or UCI-BOW formats, creates batches and stores them to diskArtmCreateMasterModel
/ArtmReconfigureMasterModel
/ArtmDisposeMasterComponent
create master model / updates its parameters / dispose given instance of master model.ArtmImportBatches
loads batches from disk into memory for quicker processing. This is optional, most methods that require batches can work directly with files on disk.ArtmGatherDictionary
/ArtmFilterDictionary
/ArtmImportDictionary
/ArtmExportDictionary
Main methods to work with dictionaries. Gather initialized the dictionary based on a folder with batches, Filter eliminates tokens based on their frequency, Import/Export save and re-load dictionary to/from disk.- You may also created the dictionary from
artm.DictionaryData
message, that contains the list of all tokens to be included in the dictionary. To do this use methodArtmCreateDictionary
(to create a dictionary) andArtmRequestDictionary
(to retrieveartm.DictionaryData
for an existing dictionary).ArtmInitializeModel
/ArtmExportModel
/ArtmImportModel
handle models (e.g. matrices of size|T|*|W|
such as pwt, nwt or rwt). Initialize* fills the matrix with random0..1
values. Export and Import saves the matrix to disk and re-loads it back.ArtmOverwriteTopicModel
allows to overwite values in topic model (for example to manually specify initial approximation).ArtmFitOfflineMasterModel
— fit the model with offline algorithmArtmFitOnlineMasterModel
— fit the model with online algorithmArtmRequestTransformMasterModel
— apply the model to new dataArtmRequestMasterModelConfig
— retrieve configuration of master modelArtmRequestThetaMatrix
— retrieve cached theta matrixArtmRequestTopicModel
— retrieve a model (e.g. pwt, nwt or rwt matrix)ArtmRequestScore
– retrieve score (such as perplexity, sparsity, etc)ArtmRequestScoreArray
– retrieve historical information for a given scoreArtmRequestMasterComponentInfo
– retrieve diagnostics information and internal state of the master modelArtmDisposeModel
/ArtmDisposeDictionary
/ArtmDisposeBatch
– dispose specific objectsArtmClearThetaCache
/ArtmClearScoreCache
/ArtmClearScoreArrayCache
– clear specific cachesArtmSetProtobufMessageFormatToJson
/ArtmSetProtobufMessageFormatToBinary
/ArtmProtobufMessageFormatIsJson
– configure the low-level API to work with JSON-serialized protobuf messages instead of binary-serialized protobuf messages
The following operations are less important part of low-level BigARTM CLI. In most cases you won’e need them, unless you have a very specific needs.
master_id = ArtmDuplicateMasterComponent (master_id, artm.DuplicateMasterComponentArgs);
ArtmCreateRegularizer (master_id, artm.RegularizerConfig);
ArtmReconfigureRegularizer (master_id, artm.RegularizerConfig);
ArtmDisposeRegularizer (master_id, const char* regularizer_name);
ArtmOverwriteTopicModelNamed (master_id, artm.TopicModel, const char* name);
ArtmCreateDictionaryNamed (master_id, artm.DictionaryData, const char* name);
ArtmAttachModel (master_id, artm.AttachModelArgs, int address_length, char* address);
artm.ProcessBatchesResult = ArtmRequestProcessBatches (master_id, artm.ProcessBatchesArgs);
artm.ProcessBatchesResult = ArtmRequestProcessBatchesExternal(master_id, artm.ProcessBatchesArgs);
ArtmAsyncProcessBatches (master_id, artm.ProcessBatchesArgs);
ArtmMergeModel (master_id, artm.MergeModelArgs);
ArtmRegularizeModel (master_id, artm.RegularizeModelArgs);
ArtmNormalizeModel (master_id, artm.NormalizeModelArgs);
artm.Batch = ArtmRequestLoadBatch ( const char* filename);
ArtmAwaitOperation ( int operation_id, artm.AwaitOperationArgs);
ArtmSaveBatch ( const char* disk_path, artm.Batch);
Protocol for retrieving results¶
The methods in low-level API can be split into two groups —
those that execute certain action, and those that request certain data.
For example ArtmCreateMasterModel
and ArtmFitOfflineMasterModel
just execute an action, while ArtmRequestTopicModel
is a request for data.
Naming convention is that such requests always start with ArtmRequest
prefix.
1. To call execute-action method is fairly straighforward —
first you create a protobuf message that describe the arguments of the operation.
For example, ArtmCreateMasterModel
expects artm.MasterModelConfig
message,
as defined in the documentation of ArtmCreateMasterModel
.
Then you serialize protobuf message,
and pass it to the method along with the length of the serialized message.
In some cases you also pass the master_id, returned by ArtmCreateMasterModel
,
as described in details futher below on this page.
The execute-action method will typically return an error code,
with zero value (or ARTM_SUCCESS
) indicating successfull execution.
2. To call request-data method is more tricky.
First you follow the same procedure as when calling an execute-action method,
e.g. create and serialize protobuf message and pass it to your ArtmRequestXxx
operation.
For example, ArtmRequestTopicModel
expects artm.GetTopicModelArgs
message.
Then the method like ArtmRequestTopicModel
will return the size (in bytes)
of the memory buffer that needs to be allocated by caller.
To fill this buffer with actual data you need to call method
int ArtmCopyRequestedMessage(int length, char* address)
where address
give a pointer to the memory buffer,
and length
must give the length of the buffer
(e.g. must match the value returned by ArtmRequestXxx
call).
After ArtmCopyRequestedMessage
the buffer will contain protobuf-serialized message.
To deserialize this message you need to know its protobuf type,
which will be defined by the documentation of the ArtmRequestXxx
method that you are calling.
For ArtmRequestTopicModel
it will be a artm.TopicModel
message.
3. Note that few ArtmRequestXxx
methods has a more complex protocol that require two subsequent calls —
first, to ArtmCopyRequestedMessage
, and then to ArtmCopyRequestedObject
.
If that’s the case the name of the method will be ArtmRequestXxxExternal
(for example ArtmRequestThetaMatrixExternal
or ArtmRequestTopicModelExternal
).
Typically this is used to copy out large objects, such as theta or phi matrices,
and store them directly as dense matrices, bypassing protobuf serialization.
For more information see
cpp_interface.cc.
A side-note on thread safety:
in between calls to ArtmRequestXxx
and ArtmCopyRequestedMessage
the result is stored in a thread local storage.
This allows you to call multiple ArtmRequestXxx
methods from different threads.
Error handling¶
All methods in this API return an integer value.
Negative return values represent an error code.
See error codes for the list of all error codes.
To get corresponding error message as string use ArtmGetLastErrorMessage()
.
Non-negative return values represent a success, and for some API methods
might also incorporate some useful information.
For example, ArtmCreateMasterModel()
returns the ID of newly created master component,
and ArtmRequestTopicModel()
returns the length of the buffer that should be allocated before
calling ArtmCopyRequestedMessage()
.
MasterId and MasterModel¶
The concept of Master Model is central in low-level API.
Almost any interaction with the low-level API starts by calling
method ArtmCreateMasterModel
, which creates an instance of so-called Master Model (or Master Component),
and returns its master_id
– an integer identifier that refers to that instance.
You need master_id
in the remaining methods of the low-level API,
such as ArtmFitOfflineMasterModel
.
master_id
creates a context, or scope, that isolate different models from each other.
An operation applied to a specific master_id
will not affect other master components.
Each master model occupy some memory — potentially a very large amount,
depending on the number of topics and tokens in the model.
Once you are done with a specific instance of master component you need to dispose its resources
by calling ArtmDisposeMasterComponent(master_id)
.
After that master_id
is no longer valid, and it must not be used as argument to other methods.
You may use method ArtmRequestMasterComponentInfo
to retrieve
internal diagnostics information about master component.
It will reveal its internal state and tell the config of the master component,
the list of scores and regularizers, the list of phi matrices,
the list of dictionaries, cache entries, and other informatino
that will help to understand how master component is functioning.
Note there might be confusion between terms MasterComponent and MasterModel, throughout this page as well as in the actual naming of the methods. This is due to historical reasons, and for all practical purposes you may think that this terms refer to the same thing.
ArtmConfigureLogging¶
You may use ArtmConfigureLogging
call to set logging parameters,
such as verbosity level or directory to output logs.
You are not require to call ArtmConfigureLogging
,
in which case logging is automaticaly initialized to INFO level,
and logs are placed in the active working folder.
Note that you can set log directory just one time.
Once it is set you can not change it afterwards.
Method ArtmConfigureLogging
will return error code INVALID_OPERATION
if it detects an attempt to change logging folder after logging had been initialized.
In order to set log directory the call to ArtmConfigureLogging
must happen prior to calling any other methods in low-level C API.
(with exception to ArtmSetProtobufMessageFormatToJson
,
ArtmSetProtobufMessageFormatToBinary
and ArtmProtobufMessageFormatIsJson
).
This is because methods in c_interface
may automatically initialize logging
into current working directory, which later can not be changed.
Setting log directory require that target folder already exist on disk.
The following parameters can be customized with ArtmConfigureLogging
.
message ConfigureLoggingArgs {
// If specified, logfiles are written into this directory
// instead of the default logging directory.
optional string log_dir = 1;
// Messages logged at a lower level than this
// do not actually get logged anywhere
optional int32 minloglevel = 2;
// Log messages at a level >= this flag are automatically
// sent to stderr in addition to log files.
optional int32 stderrthreshold = 3;
// Log messages go to stderr instead of logfiles
optional bool logtostderr = 4;
// color messages logged to stderr (if supported by terminal)
optional bool colorlogtostderr = 5;
// log messages go to stderr in addition to logfiles
optional bool alsologtostderr = 6;
// Buffer log messages for at most this many seconds
optional int32 logbufsecs = 7;
// Buffer log messages logged at this level or lower
// (-1 means do not buffer; 0 means buffer INFO only; ...)
optional int32 logbuflevel = 8;
// approx. maximum log file size (in MB). A value of 0 will be silently overridden to 1.
optional int32 max_log_size = 9;
// Stop attempting to log to disk if the disk is full.
optional bool stop_logging_if_full_disk = 10;
}
We recommend to set logbuflevel = -1
to not buffer log messages.
However by default BigARTM does not set this parameter, using the same default as provided by glog.
ArtmReconfigureTopicName¶
To explain ArtmReconfigureTopicName
we need to first start with ArtmReconfigureMasterModel
.
ArtmReconfigureMasterModel
allow user to rename topics, but the number of topics must stay the same,
and the order and the content of all existing phi matrices remains unchanged.
On contrary, ArtmReconfigureTopicName
may change the number of topics by adding or removing topics,
as well as reorder columns of existing phi matrices.
In ArtmReconfigureMasterModel
the list of topic names is treated
as new identifiers that should be set for existing columns.
In ArtmReconfigureTopicName
the list of topic names is matched against previous topic names.
New topic names are added to phi matrices, topic names removed from the list are excluded from phi matrices,
and topic names present in both old and new lists are re-ordered accordingly to match new topic name list.
Examples for ArtmReconfigureMasterModel
:
t1, t2, t3
->t4, t5, t6
sets new topic names for existing columns in phi matrices.
Examples for ArtmReconfigureTopicName
:
t1, t2, t3
->t1, t2, t3, t4
adds a new column to phi matrices, initialized with zerost1, t2, t3
->t1, t2
removes last column from phi matricest1, t2, t3
->t2, t3
removes the first column from phi matricest1, t2, t3
->t3, t2
removes the first column from phi matrices and swaps the remaining two columnst1, t2, t3
->t4, t5, t6
removes all columns from phi matrices and creates three new columns, initialized with zeros
Note that both ArtmReconfigureTopicName
and ArtmReconfigureMasterModel
only affect phi matrices
where set of topic names match the configuration of the master model.
User-created matrices with custom set of topic names,
for example created via ArtmMergeModel
, will stay unchanged.
If you change topic names you should also consider changing your configuration of scores and regularizers.
Also take into account that ArtmReconfigureTopicName
and ArtmReconfigureMasterModel
do not update theta cache.
It is a good idea to call ArtmClearThetaCache
after changing topic names.
ArtmGetLastErrorMessage¶
-
const char*
ArtmGetLastErrorMessage
()¶ Retrieves the textual error message, occured during the last failing request.
Error codes¶
#define ARTM_SUCCESS 0
#define ARTM_STILL_WORKING -1
#define ARTM_INTERNAL_ERROR -2
#define ARTM_ARGUMENT_OUT_OF_RANGE -3
#define ARTM_INVALID_MASTER_ID -4
#define ARTM_CORRUPTED_MESSAGE -5
#define ARTM_INVALID_OPERATION -6
#define ARTM_DISK_READ_ERROR -7
#define ARTM_DISK_WRITE_ERROR -8
-
ARTM_SUCCESS
¶ The API call succeeded.
-
ARTM_STILL_WORKING
¶ This error code is applicable only to
ArtmAwaitOperation()
. It indicates that library is still processing the collection. Try to retrieve results later.
-
ARTM_INTERNAL_ERROR
¶ The API call failed due to internal error in BigARTM library. Please, collect steps to reproduce this issue and report it with BigARTM issue tracker.
-
ARTM_ARGUMENT_OUT_OF_RANGE
¶ The API call failed because one or more values of an argument are outside the allowable range of values as defined by the invoked method.
-
ARTM_INVALID_MASTER_ID
¶ An API call that require master_id parameter failed because MasterComponent with given ID does not exist.
-
ARTM_CORRUPTED_MESSAGE
¶ Unable to deserialize protocol buffer message.
-
ARTM_INVALID_OPERATION
¶ The API call is invalid in current state or due to provided parameters.
-
ARTM_DISK_READ_ERROR
¶ The required files coult not be read from disk.
-
ARTM_DISK_WRITE_ERROR
¶ The required files could not be writtent to disk.