Input Data Formats and Datasets

This page describes the input data formats compatible with BigARTM. Currently all formats support the bag-of-words representation, which means that all linguistic processing (lemmatization, tokenization, detection of n-grams, etc.) must be done outside BigARTM.

  1. Vowpal Wabbit is a single-file format based on the following principles:

    • each document is represented on a single line
    • all tokens are represented as strings (no need to convert them into an integer identifier)
    • token frequency defaults to 1.0, and can be optionally specified after a colon (:)
    • namespaces (Batch.class_id) are introduced by a pipe (|)

    Example 1

    doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann
    doc2 Bravo:5 Delta Echo:3 |author Ivan_Ivanov

    Example 2

    user123 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20
    user345 |track-like track2 track5 track7 |track-play track1:10 track2:25 track3:2 track7:8 |track-skip track2:3 track8:1 |artist-like artist4:2 artist5:6 |artist-play artist4:100 artist5:20

    • putting the tokens of each document in their natural order, without specifying token frequencies, produces a model of sequential text (not bag-of-words)

    Example 3

    doc1 this text will be processed not as bag of words | Some_Author
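
The principles above can be illustrated with a small parser. This is only a sketch in plain Python, not BigARTM's own parser: the function name parse_vw_line is hypothetical, and treating a bare pipe (|) as a return to @default_class is an assumption made for this example.

```python
def parse_vw_line(line, default_class="@default_class"):
    """Split one line into (doc_title, {class_id: [(token, weight), ...]})."""
    parts = line.split()
    doc_title, rest = parts[0], parts[1:]
    modalities = {}
    current = default_class
    for part in rest:
        if part.startswith("|"):
            # A pipe starts a new namespace. Assumption: a bare "|"
            # switches back to the default namespace.
            current = part[1:] or default_class
            continue
        token, sep, weight = part.rpartition(":")
        if sep:
            # An explicit frequency follows the last colon.
            pair = (token, float(weight))
        else:
            # No colon: the frequency defaults to 1.0.
            pair = (part, 1.0)
        modalities.setdefault(current, []).append(pair)
    return doc_title, modalities


title, modalities = parse_vw_line(
    "doc1 Alpha Bravo:10 Charlie:5 |author Ola_Nordmann")
```

Here the tokens Alpha, Bravo and Charlie end up under @default_class with weights 1.0, 10.0 and 5.0, and Ola_Nordmann ends up under the author namespace. A token that itself contains a colon would need extra handling that this sketch omits.
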
  2. The UCI Bag-of-words format consists of two files, vocab.*.txt and docword.*.txt. The docword.*.txt file consists of 3 header lines followed by NNZ triples:

    docID wordID count
    docID wordID count
    docID wordID count

    The file must be sorted by docID. Values of wordID must be 1-based (not zero-based). In the vocab.*.txt file, line n contains the token with wordID=n. Note that words must not contain spaces or tabs. In the vocab.*.txt file it is also possible to specify the namespace (Batch.class_id) for tokens, as shown in this example:

    token1 @default_class
    token2 custom_class
    token3 @default_class
    token4

    Use a space or a tab to separate a token from its class. Tokens that are not followed by a class label automatically get "@default_class" as a label (see "token4" in the example).

    Unicode support. For non-ASCII characters, save the vocab.*.txt file in UTF-8 encoding.
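
The relationship between the two files can be sketched as follows. This is an illustrative reader in plain Python, not part of BigARTM: the name read_uci is hypothetical, and reading the three header lines as the number of documents, the vocabulary size and NNZ follows the usual UCI convention, which this page does not spell out.

```python
def read_uci(vocab_lines, docword_lines, default_class="@default_class"):
    """Join docword triples with the vocabulary (hypothetical helper)."""
    # Line n of the vocabulary (counting from 1) holds the token with wordID=n,
    # optionally followed by its class label.
    vocab = []
    for line in vocab_lines:
        fields = line.split()
        class_id = fields[1] if len(fields) > 1 else default_class
        vocab.append((fields[0], class_id))

    rows = [line.split() for line in docword_lines if line.strip()]
    # Assumption: the three header lines hold one integer each.
    num_docs, num_words, nnz = (int(row[0]) for row in rows[:3])

    docs = {}
    for doc_id, word_id, count in rows[3:]:
        # wordID is 1-based, Python lists are 0-based.
        token, class_id = vocab[int(word_id) - 1]
        docs.setdefault(int(doc_id), []).append((token, class_id, int(count)))
    return (num_docs, num_words, nnz), docs


vocab = ["token1 @default_class", "token2 custom_class", "token3"]
docword = ["2", "3", "3", "1 1 4", "1 2 1", "2 3 2"]
header, docs = read_uci(vocab, docword)
```

With real files, the lines would typically come from open(path, encoding="utf-8"), so that non-ASCII tokens are decoded correctly.
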

  3. Batches (binary BigARTM-specific format).

    This is a compact and efficient format, based on several protobuf messages in the public BigARTM interface (Batch, Item and Field).

    • A batch is a collection of several items
    • An item is a collection of several fields
    • A field is a collection of pairs (token_id, token_weight).

    The following Python code generates a synthetic batch.

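    import random
    import uuid
    import artm.messages

    num_tokens = 60
    num_items = 100

    # Create the batch and give it a unique GUID.
    batch = artm.messages.Batch()
    batch.id = str(uuid.uuid4())
    # Fill the local dictionary: the position in batch.token is the token_id.
    for token_id in range(num_tokens):
        batch.token.append('token' + str(token_id))

    for item_id in range(num_items):
        item = batch.item.add()
        item.id = item_id
        field = item.field.add()
        for token_id in range(num_tokens):
            # Tokens 40..59 are background noise; tokens 0..39 are topical.
            background_count = random.randint(1, 5) if (token_id >= 40) else 0
            topical_count = 10 if (token_id < 40) and ((token_id % 10) == (item_id % 10)) else 0
            # Store the (token_id, token_weight) pair.
            field.token_id.append(token_id)
            field.token_weight.append(background_count + topical_count)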

    Note that the batch has its own local dictionary, batch.token, which maps each token_id to the actual token. To create a batch from text files, one needs to find all distinct words and map them to sequential indices. batch.id must be set to a unique GUID in the format 00000000-0000-0000-0000-000000000000.
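
The mapping from distinct words to sequential indices can be sketched in plain Python, without the protobuf messages. The function build_local_dictionary is a hypothetical helper for illustration: the returned list plays the role of batch.token, and each document becomes a list of (token_id, token_weight) pairs.

```python
from collections import Counter


def build_local_dictionary(documents):
    """Map distinct words to sequential ids (hypothetical helper)."""
    tokens = []   # index -> token, analogous to batch.token
    index = {}    # token -> index, used only while building
    items = []    # one list of (token_id, token_weight) pairs per document
    for words in documents:
        pairs = []
        for word, count in Counter(words).items():
            if word not in index:
                # First occurrence: assign the next sequential id.
                index[word] = len(tokens)
                tokens.append(word)
            pairs.append((index[word], float(count)))
        items.append(pairs)
    return tokens, items


documents = [["alpha", "bravo", "alpha"], ["bravo", "charlie"]]
tokens, items = build_local_dictionary(documents)
```

The ids are assigned in order of first appearance, so the same word receives the same id in every document of the batch.
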