Input Data Formats and Datasets
- Formats
This page describes the input data formats compatible with BigARTM. Currently all formats use the Bag-of-words representation, meaning that all linguistic processing (lemmatization, tokenization, detection of n-grams, etc.) needs to be done outside BigARTM.
Vowpal Wabbit is a single-file format, based on the following principles:
- each document is represented in a single line
- all tokens are represented as strings (no need to convert them into an integer identifier)
- token frequency defaults to 1.0, and can be optionally specified after a colon (:)
- namespaces (Batch.class_id) can be identified by a pipe (|)
Example 1
doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann
doc2 Bravo:5 Delta Echo:3 |author Ivan_Ivanov
Example 2
user123 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20
user345 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20
- putting the tokens of each document in their natural order, without specifying token frequencies, will lead to a model with sequential texts (not Bag-of-words)
Example 3
doc1 this text will be processed not as bag of words | Some_Author
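The principles above can be illustrated with a minimal parser sketch. It is not BigARTM's own reader: the function name parse_vw_line is hypothetical, and the sketch assumes that tokens appearing before any pipe (or after a bare pipe) belong to @default_class.

```python
from collections import defaultdict

# Assumption of this sketch: tokens outside any named namespace get this label.
DEFAULT_CLASS = "@default_class"


def parse_vw_line(line):
    """Parse one Vowpal Wabbit line into (doc_title, {namespace: {token: weight}})."""
    fields = line.split()
    title, namespace = fields[0], DEFAULT_CLASS
    counts = defaultdict(lambda: defaultdict(float))
    for field in fields[1:]:
        if field.startswith("|"):
            # A pipe switches the namespace; a bare "|" falls back to the default.
            namespace = field[1:] or DEFAULT_CLASS
            continue
        token, sep, weight = field.rpartition(":")
        if not sep:
            # No colon: the whole field is the token and its frequency is 1.0.
            token, weight = field, "1.0"
        counts[namespace][token] += float(weight)
    return title, {ns: dict(tokens) for ns, tokens in counts.items()}


title, namespaces = parse_vw_line("doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann")
print(title, namespaces)
```

The sketch aggregates repeated tokens into a single weight, so it only covers the Bag-of-words reading of a line; preserving token order for sequential texts would require keeping a list instead of a dictionary.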
UCI Bag-of-words format consists of two files: vocab.*.txt and docword.*.txt. The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:
D
W
NNZ
docID wordID count
docID wordID count
...
docID wordID count
The file must be sorted on docID. Values of wordID must be one-based (not zero-based). In the vocab.*.txt file, line n contains the word with wordID=n. Note that words must not contain spaces or tabs. In the vocab.*.txt file it is also possible to specify the namespace (Batch.class_id) for tokens, as shown in this example:
token1 @default_class
token2 custom_class
token3 @default_class
token4
Use a space or tab to separate a token from its class. Tokens that are not followed by a class label automatically get @default_class as a label (see token4 in the example).
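As a rough illustration of these rules, the following sketch parses both files from lists of lines and checks the stated constraints. The function names parse_vocab and parse_docword are hypothetical helpers, not part of BigARTM, and the sketch assumes integer counts.

```python
DEFAULT_CLASS = "@default_class"


def parse_vocab(lines):
    """Return [(token, class_id)]; list position n-1 holds the token with wordID=n."""
    vocab = []
    for n, line in enumerate(lines, start=1):
        parts = line.split()  # splits on both spaces and tabs
        if len(parts) not in (1, 2):
            raise ValueError(f"vocab line {n}: expected 'token [class]'")
        token = parts[0]
        # Tokens without a class label fall back to the default class.
        class_id = parts[1] if len(parts) == 2 else DEFAULT_CLASS
        vocab.append((token, class_id))
    return vocab


def parse_docword(lines):
    """Return (D, W, NNZ, triples) after checking ordering and wordID range."""
    rows = [line.split() for line in lines if line.strip()]
    d, w, nnz = (int(row[0]) for row in rows[:3])  # the three header lines
    triples = [(int(doc), int(word), int(count)) for doc, word, count in rows[3:]]
    if len(triples) != nnz:
        raise ValueError(f"expected {nnz} triples, found {len(triples)}")
    previous_doc = 0
    for doc_id, word_id, _ in triples:
        if doc_id < previous_doc:
            raise ValueError("docword file must be sorted on docID")
        if not 1 <= word_id <= w:
            raise ValueError("wordID must be one-based and no larger than W")
        previous_doc = doc_id
    return d, w, nnz, triples


vocab = parse_vocab(["token1 @default_class", "token2 custom_class", "token4"])
print(vocab)
```

When reading real files, the lines could come from open(path, encoding="utf-8") so that non-ASCII tokens are decoded correctly.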
Unicode support. For non-ASCII characters, save the vocab.*.txt file in UTF-8 format.
Batches (binary BigARTM-specific format).
This is a compact and efficient format, based on several protobuf messages in the public BigARTM interface (Batch and Item).
- A batch is a collection of several items
- An item is a collection of pairs (token_id, token_weight).
Note that the batch has its local dictionary, batch.token. This dictionary maps token_id into the actual token. To create a batch from textual files, one needs to find all distinct words and map them into sequential indices.
batch.id must be set to a unique GUID in the format 00000000-0000-0000-0000-000000000000.
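The mapping from distinct words to sequential indices can be sketched in plain Python. The Batch and Item classes below are hypothetical dataclass stand-ins that mirror only the fields mentioned above; they are not the real protobuf messages, and build_batch is an illustrative helper rather than a BigARTM function.

```python
import uuid
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Item:  # hypothetical stand-in for the protobuf Item message
    pairs: list = field(default_factory=list)  # [(token_id, token_weight), ...]


@dataclass
class Batch:  # hypothetical stand-in for the protobuf Batch message
    id: str
    token: list = field(default_factory=list)  # local dictionary: token_id -> token
    items: list = field(default_factory=list)


def build_batch(documents):
    """Build a Batch from an iterable of token lists, one list per document."""
    # str(uuid.uuid4()) yields the 8-4-4-4-12 hexadecimal GUID layout.
    batch = Batch(id=str(uuid.uuid4()))
    token_to_id = {}
    for tokens in documents:
        item = Item()
        for token, count in Counter(tokens).items():
            if token not in token_to_id:
                # Assign the next sequential index to each new distinct word.
                token_to_id[token] = len(batch.token)
                batch.token.append(token)
            item.pairs.append((token_to_id[token], float(count)))
        batch.items.append(item)
    return batch


batch = build_batch([["alpha", "bravo", "alpha"], ["bravo", "charlie"]])
print(batch.id, batch.token, [item.pairs for item in batch.items])
```

Because each batch carries its own dictionary, the same word may receive different token_id values in different batches; only the token string in batch.token identifies it.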
- Datasets
Download one of the following datasets to start experimenting with BigARTM. Note that docword.* and vocab.* files indicate the UCI BOW format, while vw.* files indicate the Vowpal Wabbit format.
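The naming convention can be checked with a small helper. This is only a sketch: guess_format is a hypothetical function, and the file names in the usage lines are made up for illustration.

```python
from pathlib import Path


def guess_format(filename):
    """Guess the input format from a dataset file name; return None if unknown."""
    name = Path(filename).name
    if name.startswith(("docword.", "vocab.")):
        return "UCI BOW"
    if name.startswith("vw."):
        return "Vowpal Wabbit"
    return None


# Hypothetical file names, used only to exercise the helper.
for example in ("docword.example.txt", "vocab.example.txt", "vw.example.txt"):
    print(example, "->", guess_format(example))
```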