Apache MADlib / MADLIB-1201

Inconsistent LDA output tables

Details

    Description

      We found an inconsistency in the LDA module between the output data table written by lda_train and the table produced by lda_get_word_topic_count.

      Repro Steps

      DROP TABLE IF EXISTS documents;
      CREATE TABLE documents(docid INT4, contents TEXT);
      INSERT INTO documents VALUES
      (0, ' b a a c'),
      (1, ' d e f f f ');
      
      ALTER TABLE documents ADD COLUMN words TEXT[];
      UPDATE documents SET words = regexp_split_to_array(lower(contents), E'[\\s+\\.\\,]');
      
      DROP TABLE IF EXISTS my_training, my_training_vocabulary;
      SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training', TRUE);
      
      
      DROP TABLE IF EXISTS my_model, my_outdata;
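      SELECT madlib.lda_train( 'my_training',   -- data_table: term-frequency input
                               'my_model',      -- model_table: output model
                               'my_outdata',    -- output_data_table: per-document topic assignments
                               7,               -- voc_size: vocabulary size
                               2,               -- topic_num: number of topics
                               1,               -- iter_num: number of iterations
                               5,               -- alpha: Dirichlet prior for per-document topic multinomial
                               0.01             -- beta: Dirichlet prior for per-topic word multinomial
                             );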
      
      SELECT * FROM my_outdata ORDER BY docid;
      ```
       docid | wordcount |   words   |  counts   | topic_count | topic_assignment
      -------+-----------+-----------+-----------+-------------+------------------
           0 |         5 | {2,1,0,3} | {1,2,1,1} | {2,3}       | {0,1,1,1,0}
           1 |         7 | {4,5,0,6} | {1,1,2,3} | {1,6}       | {1,0,1,1,1,1,1}
      ```
      
      
      DROP TABLE IF EXISTS my_word_topic_count;
      SELECT madlib.lda_get_word_topic_count( 'my_model', 'my_word_topic_count');
      SELECT * FROM my_word_topic_count ORDER BY wordid;
      ```
       wordid | topic_count
      --------+-------------
            0 | {1,2}
            1 | {0,2}
            2 | {1,0}
            3 | {0,1}
            4 | {1,0}
            5 | {0,1}
            6 | {0,3}
      (7 rows)
      ```
      

      In my_outdata, wordid 3 occurs once (the last word of docid 0) and its entry in topic_assignment is 0, so it is assigned only to topic 0. In my_word_topic_count, wordid 3 has topic_count {0,1}, meaning it is assigned only to topic 1. The two tables are therefore inconsistent with each other.
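The comparison above can be repeated for every wordid with a small standalone script. This is only a sketch for checking the numbers outside the database: the rows are copied by hand from the two result tables above, the function name `word_topic_counts` is made up for illustration, and it assumes that topic_assignment holds one topic per token, in the order given by words, with each word repeated counts[i] times (the same reading the description uses for wordid 3).

```python
NUM_TOPICS = 2

# Rows copied from the my_outdata output: docid -> (words, counts, topic_assignment).
outdata = {
    0: ([2, 1, 0, 3], [1, 2, 1, 1], [0, 1, 1, 1, 0]),
    1: ([4, 5, 0, 6], [1, 1, 2, 3], [1, 0, 1, 1, 1, 1, 1]),
}

# Rows copied from the my_word_topic_count output: wordid -> topic_count.
reported = {
    0: [1, 2], 1: [0, 2], 2: [1, 0], 3: [0, 1],
    4: [1, 0], 5: [0, 1], 6: [0, 3],
}


def word_topic_counts(rows, num_topics):
    """Rebuild per-word topic counts from per-document topic assignments.

    Assumption: topic_assignment is laid out token by token, following the
    order of `words`, with words[i] occupying counts[i] consecutive slots.
    """
    derived = {}
    for words, counts, assignment in rows.values():
        pos = 0
        for wordid, n in zip(words, counts):
            tally = derived.setdefault(wordid, [0] * num_topics)
            for topic in assignment[pos:pos + n]:
                tally[topic] += 1
            pos += n
    return derived


derived = word_topic_counts(outdata, NUM_TOPICS)

# Keep only the wordids where the two tables disagree.
mismatches = {
    w: (derived[w], reported[w])
    for w in sorted(reported)
    if derived[w] != reported[w]
}

for w, (from_outdata, from_model) in mismatches.items():
    print(f"wordid {w}: my_outdata implies {from_outdata}, "
          f"my_word_topic_count reports {from_model}")
```

Under that layout assumption, the script reports a disagreement not only for wordid 3 but also for wordids 0, 4 and 5, while wordids 1, 2 and 6 agree.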

            People

              jingyimei Jingyi Mei