Apache MADlib / MADLIB-1160

Usability changes for LDA

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.14
    • Component/s: Module: Utilities
    • Labels: None

    Description

      Context

      Please see this thread from the user mailing list
      http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E

      Tasks

      1) Term frequency
      http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
      and LDA
      http://madlib.apache.org/docs/latest/group__grp__lda.html
      should both create indexes that start at 1, to make them consistent with other MADlib modules. One or both of these currently create indexes starting at 0.

      2) In the output_data_table, topic_assignment is a dense vector but words is a sparse vector (svec).
      We should change topic_assignment to a sparse vector so that the two columns are consistent.

      Note: the reason sparse vectors were used in the first place (I think) is to keep the model state as small as possible, so the sparse format is preferred to the dense format in this case, although svecs are a bit harder to work with. We have hit the Postgres 1 GB field size limit in some use cases.

      3) The user docs could also use some cleanup at the same time. E.g., helper functions are used in the examples but not described above.

      4) The helper function `madlib.lda_get_topic_desc` should return the top k words (and ties). It seems to be returning the top k-1 words (and ties) now.
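
      The "top k words (and ties)" behavior asked for in task 4 can be sketched as follows. This is an illustrative Python sketch only, not MADlib's implementation: the function name `top_k_with_ties` and the word-to-probability dict it takes are hypothetical, and the real helper works on SQL tables. The point is the cutoff: it is taken from the k-th ranked word, so using the (k-1)-th word instead would reproduce the off-by-one described above.

```python
def top_k_with_ties(word_probs, k):
    """Return (word, prob) pairs for the k most probable words,
    plus any further words tied with the k-th one.

    word_probs is a hypothetical dict mapping word -> probability
    for a single topic.
    """
    if k <= 0 or not word_probs:
        return []
    # Rank by descending probability; break ties alphabetically so the
    # output order is deterministic.
    ranked = sorted(word_probs.items(), key=lambda item: (-item[1], item[0]))
    if k >= len(ranked):
        return ranked
    # The cutoff is the probability of the k-th word (index k - 1).
    cutoff = ranked[k - 1][1]
    # Keep every word at or above the cutoff, which includes ties.
    return [(word, prob) for word, prob in ranked if prob >= cutoff]


# Made-up probabilities for one topic, for illustration only.
example_topic = {"data": 0.4, "model": 0.3, "topic": 0.2, "word": 0.2, "svec": 0.1}
top_three = top_k_with_ties(example_topic, 3)
```

      With k = 3 in this made-up example, four words come back, because "topic" and "word" share the third-highest probability.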

            People

              Assignee: Jingyi Mei (jingyimei)
              Reporter: Frank McQuillan (fmcquillan)