Apache MADlib / MADLIB-899

LDA (parsed) model table and output table disagree

    Description

      select * from tester;
       docid |            documents             |               words
      -------+----------------------------------+------------------------------------
           2 | Sam ate ham for lunch            | {sam,ate,ham,for,lunch}
           1 | Monday morning. I ate breakfast! | {monday,morning.,i,ate,breakfast!}
      
      
      SELECT madlib.term_frequency('tester','docid','words','my_training',TRUE);
                                           term_frequency
      ----------------------------------------------------------------------------------------
       Term frequency output in table my_training, vocabulary in table my_training_vocabulary
      (1 row)
      
      
      select madlib.lda_train('my_training','my_model','my_outdata',9,5,10,1,0.1);
                  lda_train
      ----------------------------------
       (my_model,"model table")
       (my_outdata,"output data table")
      (2 rows)
      
      madlib-pg93=# select (madlib.lda_parse_model(model, voc_size, topic_num)).* from my_model;
                      model_matrix_part1                 |                      model_matrix_part2                       | total_topic_counts
      ---------------------------------------------------+---------------------------------------------------------------+--------------------
       {{2,0,0,0,0},{0,0,0,0,1},{0,0,1,0,0},{0,0,0,0,1}} | {{0,1,0,0,0},{0,0,1,0,0},{0,1,0,0,0},{0,0,0,0,1},{0,0,0,1,0}} | {2,2,2,1,3}
      (1 row)
      
      madlib-pg93=# select * from my_outdata;
       docid | wordcount |    words    |   counts    | topic_count | topic_assignment
      -------+-----------+-------------+-------------+-------------+------------------
           1 |         5 | {0,1,4,6,7} | {1,1,1,1,1} | {2,1,0,0,2} | {0,4,1,4,0}
           2 |         5 | {8,0,5,2,3} | {1,1,1,1,1} | {0,2,1,1,1} | {1,1,4,3,2}
      (2 rows)
      
      madlib-pg93=# select * from my_model
      madlib-pg93-# ;
       voc_size | topic_num | alpha | beta |                                       model
      ----------+-----------+-------+------+------------------------------------------------------------------------------------
              9 |         5 |     1 |  0.1 | {2,0,0,0,0,1,0,1,0,0,0,1,4294967296,0,0,0,1,0,4294967296,0,0,0,0,1,0,4294967296,0}
      (1 row)
      
      madlib-pg93=# select * from my_training_vocabulary
      madlib-pg93-# ;
       wordid |    word
      --------+------------
            0 | ate
            1 | breakfast!
            2 | for
            3 | ham
            4 | i
            5 | lunch
            6 | monday
            7 | morning.
            8 | sam
      (9 rows)

      The total_topic_counts array from the parsed model ({2,2,2,1,3}) does not match the element-wise sum of the topic_count arrays in the output table ({2,1,0,0,2} + {0,2,1,1,1} = {2,3,1,1,3}).
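
One way to check this directly in the database is to sum topic_count position by position across all rows of my_outdata. The query below is only a sketch: it assumes PostgreSQL 9.0 or later (for array_agg with ORDER BY) and the table and column names shown in the session above. The aliases topic, total, per_doc, and per_topic are illustrative and not part of MADlib.

```sql
-- Sum the per-document topic_count arrays element by element.
SELECT array_agg(total ORDER BY topic) AS summed_topic_count
FROM (
    SELECT topic, sum(topic_count[topic]) AS total
    FROM (
        -- Expand each row into one row per array position (1-based).
        SELECT topic_count,
               generate_subscripts(topic_count, 1) AS topic
        FROM my_outdata
    ) per_doc
    GROUP BY topic
) per_topic;
```

For the two rows of my_outdata shown above this returns {2,3,1,1,3}, which can be compared against the total_topic_counts column returned by madlib.lda_parse_model.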

            People

              riyer Rahul Iyer
              sziegler Steve Ziegler