Typical Python example

This page is obsolete; please use the high-level API described in the ARTM notebook (in Russian or in English).

Examples of the low-level API

The folder C:\BigARTM\python\examples contains several toy examples. None of them takes any parameters, so you may run them without arguments:

C:\BigARTM\python\examples>python example02_parse_collection.py

 No batches found, parsing them from textual collection...  OK.
 Iter#0 : Perplexity = 6885.223 , Phi sparsity = 0.050 , Theta sparsity = 0.012
 Iter#1 : Perplexity = 2409.510 , Phi sparsity = 0.113 , Theta sparsity = 0.063
 Iter#2 : Perplexity = 2075.445 , Phi sparsity = 0.203 , Theta sparsity = 0.174
 Iter#3 : Perplexity = 1855.196 , Phi sparsity = 0.293 , Theta sparsity = 0.261
 Iter#4 : Perplexity = 1728.749 , Phi sparsity = 0.370 , Theta sparsity = 0.302
 Iter#5 : Perplexity = 1661.044 , Phi sparsity = 0.429 , Theta sparsity = 0.317
 Iter#6 : Perplexity = 1621.851 , Phi sparsity = 0.475 , Theta sparsity = 0.327
 Iter#7 : Perplexity = 1596.965 , Phi sparsity = 0.511 , Theta sparsity = 0.331

 Top tokens per topic:
 Topic#1: poll(0.05) iraq(0.04) people(0.02) news(0.02) john(0.01) media(0.01)
 Topic#2: republican(0.02) party(0.02) state(0.02) general(0.01) democrats(0.01)
 Topic#3: dean(0.04) edwards(0.02) percent(0.02) primary(0.02) clark(0.02)
 Topic#4: forces(0.01) baghdad(0.01) iraqis(0.01) coburn(0.01) carson(0.01)
 Topic#5: military(0.01) officials(0.01) intelligence(0.01) american(0.01)
 Topic#6: electoral(0.04) labor(0.02) culture(0.02) exit(0.02) scoop(0.01)
 Topic#7: law(0.01) court(0.01) marriage(0.01) gay(0.01) amendment(0.01)
 Topic#8: president(0.03) administration(0.02) campaign(0.01) million(0.01)
 Topic#9: years(0.01) ballot(0.01) rights(0.01) nader(0.01) life(0.01)
 Topic#10: house(0.08) war(0.03) republicans(0.02) voting(0.02) vote(0.02)

 Snippet of theta matrix:
 Item#3000: 0.432 0.507 0.059 0.000 0.000 0.000 0.000 0.000 0.002 0.000
 Item#2991: 0.249 0.382 0.269 0.000 0.000 0.025 0.016 0.034 0.000 0.026
 Item#2992: 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.851 0.000 0.147
 Item#2993: 0.358 0.058 0.030 0.141 0.152 0.000 0.002 0.248 0.000 0.010
 Item#2994: 0.051 0.142 0.056 0.000 0.000 0.146 0.000 0.000 0.000 0.604
 Item#2995: 0.004 0.593 0.000 0.000 0.128 0.005 0.168 0.040 0.030 0.033
 Item#2996: 0.069 0.063 0.054 0.000 0.000 0.107 0.008 0.004 0.000 0.696
 Item#2997: 0.000 0.194 0.000 0.000 0.043 0.000 0.471 0.228 0.062 0.002
 Item#2998: 0.026 0.085 0.042 0.001 0.180 0.000 0.146 0.485 0.022 0.012
 Item#2999: 0.312 0.547 0.099 0.000 0.000 0.004 0.008 0.017 0.013 0.000

This simple example loads a text collection from disk and infers a topic model by iteratively scanning the collection. It then outputs the top words of each topic and the topic distributions of the last processed documents. The rest of this page walks through the example step by step.

Parse collection step

The following Python script parses the docword.kos.txt and vocab.kos.txt files and converts them into a set of binary-serialized batches stored on disk. In addition, the script creates a dictionary with all unique tokens in the collection and stores it on disk. The script also detects whether it has already been executed; in that case it simply loads the dictionary and saves it in the unique_tokens variable.

The same logic is implemented in the helper method ParseCollectionOrLoadDictionary (see the one-call sketch after the script below).

import artm.messages_pb2, artm.library, sys, glob

data_folder = sys.argv[1] if (len(sys.argv) >= 2) else ''
target_folder = 'kos'
collection_name = 'kos'

batches_found = len(glob.glob(target_folder + "/*.batch"))
if batches_found == 0:
  print "No batches found, parsing them from textual collection...",
  parser_config = artm.messages_pb2.CollectionParserConfig()
  parser_config.format = artm.library.CollectionParserConfig_Format_BagOfWordsUci

  parser_config.docword_file_path = data_folder + 'docword.' + collection_name + '.txt'
  parser_config.vocab_file_path = data_folder + 'vocab.' + collection_name + '.txt'
  parser_config.target_folder = target_folder
  parser_config.dictionary_file_name = 'dictionary'
  unique_tokens = artm.library.Library().ParseCollection(parser_config)
  print " OK."
else:
  print "Found " + str(batches_found) + " batches, using them."
  unique_tokens = artm.library.Library().LoadDictionary(target_folder + '/dictionary')
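
The helper method mentioned above collapses this boilerplate into a single call. A minimal sketch, assuming ParseCollectionOrLoadDictionary takes the docword path, the vocab path, and the target folder (verify the signature in artm.library before relying on it):

unique_tokens = artm.library.Library().ParseCollectionOrLoadDictionary(
  data_folder + 'docword.' + collection_name + '.txt',   # signature assumed
  data_folder + 'vocab.' + collection_name + '.txt',
  target_folder)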

Larger collections are also available for download, either as the original docword and vocab files or as precompiled batches with a dictionary.

MasterComponent

The master component is your main entry point to all BigARTM functionality. The following script creates a master component and configures it with several regularizers and score calculators.

with artm.library.MasterComponent(disk_path = target_folder) as master:
  perplexity_score     = master.CreatePerplexityScore()
  sparsity_theta_score = master.CreateSparsityThetaScore()
  sparsity_phi_score   = master.CreateSparsityPhiScore()
  top_tokens_score     = master.CreateTopTokensScore()
  theta_snippet_score  = master.CreateThetaSnippetScore()

  dirichlet_theta_reg  = master.CreateDirichletThetaRegularizer()
  dirichlet_phi_reg    = master.CreateDirichletPhiRegularizer()
  decorrelator_reg     = master.CreateDecorrelatorPhiRegularizer()

The master component must be configured with a disk path that contains the set of batches produced in the previous step of this tutorial.

Score calculators allow you to retrieve important quality measures of your topic model. Perplexity, sparsity of the Theta and Phi matrices, and lists of tokens with the highest probability within each topic are all examples of such scores. By default BigARTM does not calculate any scores, so you have to create them in the master component. The same is true for regularizers, which allow you to customize your topic model.

For further details about master component refer to MasterComponentConfig.
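
If you need settings beyond the disk path, the master component can also be configured through an explicit MasterComponentConfig message. A minimal sketch, assuming the constructor accepts a pre-built config and that cache_theta and processors_count behave as their names suggest:

master_config = artm.messages_pb2.MasterComponentConfig()
master_config.disk_path = target_folder   # folder with batches from the parse step
master_config.cache_theta = True          # assumed: keep Theta matrix after processing
master_config.processors_count = 2        # assumed: number of processor threads
with artm.library.MasterComponent(config = master_config) as master:
  pass                                    # create scores, regularizers and models as above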

Configure Topic Model

Topic model configuration defines the number of topics in the model, the list of scores to be calculated, and the list of regularizers to apply to the model. For further details about model configuration refer to ModelConfig.

model = master.CreateModel(topics_count = 10, inner_iterations_count = 10)
model.EnableScore(perplexity_score)
model.EnableScore(sparsity_phi_score)
model.EnableScore(sparsity_theta_score)
model.EnableScore(top_tokens_score)
model.EnableScore(theta_snippet_score)
model.EnableRegularizer(dirichlet_theta_reg, -0.1)
model.EnableRegularizer(dirichlet_phi_reg, -0.2)
model.EnableRegularizer(decorrelator_reg, 1000000)
model.Initialize(unique_tokens)    # Setup initial approximation for Phi matrix.

Note that in the last step we configured the initial approximation of the Phi matrix. This step is optional: BigARTM is able to collect all tokens dynamically during the first scan of the collection. However, a deterministic initial approximation helps to reproduce the same results from run to run.
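
The keyword arguments of CreateModel map onto fields of the ModelConfig message. The equivalent explicit form, as a sketch that assumes CreateModel also accepts a pre-built config:

model_config = artm.messages_pb2.ModelConfig()
model_config.topics_count = 10                     # number of topics in the model
model_config.inner_iterations_count = 10           # passes over each document per scan
model = master.CreateModel(config = model_config)  # keyword assumed; see ModelConfig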

Invoke Iterations

The following script performs several scans over the set of batches. Depending on the size of the collection this step might be quite time-consuming, so it is a good idea to output some information after every iteration.

for iteration in range(0, 8):
  master.InvokeIteration(1)        # Invoke one scan of the entire collection...
  master.WaitIdle()                # and wait until it completes.
  model.Synchronize()              # Synchronize topic model.
  print "Iter#" + str(iteration),
  print ": Perplexity = %.3f" % perplexity_score.GetValue(model).value,
  print ", Phi sparsity = %.3f" % sparsity_phi_score.GetValue(model).value,
  print ", Theta sparsity = %.3f" % sparsity_theta_score.GetValue(model).value

If your collection is very large, you may want to use the online algorithm, which updates the topic model several times during each scan of the collection, as demonstrated by the following script:

master.InvokeIteration(1)        # Invoke one scan of the entire collection...
while True:
  done = master.WaitIdle(100)    # Wait up to 100 ms for the scan to complete.
  model.Synchronize(0.9)         # Decay weights in the current topic model by 0.9,
                                 # then append all increments and invoke all regularizers.
  if done:
    break
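
Here 0.9 is the decay weight: the counters accumulated in the topic model so far are multiplied by this factor before the new increments are merged in, so older batches gradually lose influence. Consult the documentation of Synchronize for the exact semantics.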

Retrieve and visualize scores

Finally, retrieve and visualize all the collected scores.

artm.library.Visualizers.PrintTopTokensScore(top_tokens_score.GetValue(model))
artm.library.Visualizers.PrintThetaSnippetScore(theta_snippet_score.GetValue(model))
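
The built-in visualizers simply pretty-print the underlying protobuf messages, so you can also inspect a score directly. A sketch, assuming the TopTokensScore message exposes parallel repeated fields token, weight and topic_index (verify the field names in artm.messages_pb2):

top_tokens = top_tokens_score.GetValue(model)
for i in range(len(top_tokens.token)):             # field names assumed
  print "Topic#%i: %s (%.3f)" % (top_tokens.topic_index[i] + 1,
                                 top_tokens.token[i],
                                 top_tokens.weight[i])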