Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
Description
We found an inconsistency in the LDA module between the outputs of lda_train and lda_get_word_topic_count.
Repro Steps
DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);
INSERT INTO documents VALUES (0, ' b a a c'), (1, ' d e f f f ');
ALTER TABLE documents ADD COLUMN words TEXT[];
UPDATE documents SET words = regexp_split_to_array(lower(contents), E'[\\s+\\.\\,]');
DROP TABLE IF EXISTS my_training, my_training_vocabulary;
SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training', TRUE);
DROP TABLE IF EXISTS my_model, my_outdata;
SELECT madlib.lda_train( 'my_training', 'my_model', 'my_outdata', 7, 2, 1, 5, 0.01 );
SELECT * FROM my_outdata ORDER BY docid;
```
 docid | wordcount |   words   |  counts   | topic_count | topic_assignment
-------+-----------+-----------+-----------+-------------+------------------
     0 |         5 | {2,1,0,3} | {1,2,1,1} | {2,3}       | {0,1,1,1,0}
     1 |         7 | {4,5,0,6} | {1,1,2,3} | {1,6}       | {1,0,1,1,1,1,1}
```
DROP TABLE IF EXISTS my_word_topic_count;
SELECT madlib.lda_get_word_topic_count( 'my_model', 'my_word_topic_count');
SELECT * FROM my_word_topic_count ORDER BY wordid;
```
 wordid | topic_count
--------+-------------
      0 | {1,2}
      1 | {0,2}
      2 | {1,0}
      3 | {0,1}
      4 | {1,0}
      5 | {0,1}
      6 | {0,3}
(7 rows)
```
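The my_outdata output indicates that wordid 3 is assigned only to topic 0: it appears only in docid 0, as the last entry of words with a count of 1, and the last entry of topic_assignment for that document is 0. The my_word_topic_count output, however, reports topic_count {0,1} for wordid 3, meaning it is assigned only to topic 1. The two outputs are inconsistent with each other.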
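The comparison can be made mechanical with a small offline check. The sketch below is not part of MADlib; it copies the rows shown in the repro steps into Python and rebuilds per-word topic counts from my_outdata. It assumes that topic_assignment lists one topic per token, in the same order as words, with each word repeated counts times. That layout is the reading used in this report and is not confirmed here. The names OUTDATA, REPORTED, and word_topic_counts are made up for illustration.

```python
from collections import defaultdict

# Rows copied from the my_outdata output in the repro steps:
# (docid, words, counts, topic_assignment)
OUTDATA = [
    (0, [2, 1, 0, 3], [1, 2, 1, 1], [0, 1, 1, 1, 0]),
    (1, [4, 5, 0, 6], [1, 1, 2, 3], [1, 0, 1, 1, 1, 1, 1]),
]

# Rows copied from the my_word_topic_count output: wordid -> topic_count
REPORTED = {
    0: [1, 2], 1: [0, 2], 2: [1, 0], 3: [0, 1],
    4: [1, 0], 5: [0, 1], 6: [0, 3],
}


def word_topic_counts(rows, num_topics):
    """Tally topics per wordid from my_outdata-style rows.

    ASSUMPTION: topic_assignment is ordered like words, with each
    word occupying counts[i] consecutive positions.
    """
    tally = defaultdict(lambda: [0] * num_topics)
    for _docid, words, counts, assignment in rows:
        pos = 0
        for wordid, n in zip(words, counts):
            # The next n entries are the topics of this word's occurrences.
            for topic in assignment[pos:pos + n]:
                tally[wordid][topic] += 1
            pos += n
    return dict(tally)


derived = word_topic_counts(OUTDATA, num_topics=2)
mismatched = sorted(w for w in REPORTED if derived.get(w) != REPORTED[w])

for wordid in sorted(REPORTED):
    print(wordid, "derived:", derived.get(wordid), "reported:", REPORTED[wordid])
print("mismatched wordids:", mismatched)
```

Under the stated assumption, the tally for wordid 3 derived from my_outdata is [1, 0], while my_word_topic_count reports {0,1}. The same check also flags wordids 0, 4, and 5, so the disagreement may not be limited to the single word called out above.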