Apache MADlib / MADLIB-933

MADlib LDA term_frequency function bugs

Details

    Description

      1. The madlib.term_frequency() function (http://doc.madlib.net/latest/group__grp__text__utilities.html) takes the docid column and the words column as inputs. This suggests that the columns can be named anything, but the function complains unless they are actually named "docid" and "words".
      2. The function also takes an output table name as input (e.g. documents_tf), but it creates a temp table for the vocabulary, and therefore I can't specify a schema name like vatsan.documents_tf. This is annoying for two reasons:
         a. The user can't immediately tell what the vocabulary table is for, or why it is a temp table while the documents_tf table itself is not.
         b. With a real-world dataset for LDA, my models are going to run for quite some time. I may even terminate one session and run the LDA model in another session, in which case the vocabulary temp table won't be available in the other session (or will already have been dropped).
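The session-scoping problem in point 2b can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in, not MADlib or Greenplum/PostgreSQL: SQLite temp tables are private to one connection, which is analogous to a temp table being private to one database session. The table name "vocabulary" and all column names are hypothetical, chosen only for illustration.

```python
import os
import sqlite3
import tempfile

# Hypothetical file-backed database so that two connections can share it.
db_path = os.path.join(tempfile.mkdtemp(), "lda_demo.db")

# "Session" 1: a permanent output table plus a temp vocabulary table
# (assumed names; this is not what MADlib itself executes).
session1 = sqlite3.connect(db_path)
session1.execute("CREATE TABLE documents_tf (docid INTEGER, word TEXT, count INTEGER)")
session1.execute("CREATE TEMP TABLE vocabulary (wordid INTEGER, word TEXT)")
session1.commit()


def visible_tables(conn):
    """Return the names of permanent and temp tables visible to this connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "UNION SELECT name FROM sqlite_temp_master WHERE type = 'table'"
    ).fetchall()
    return sorted(name for (name,) in rows)


# "Session" 2: a separate connection to the same database.
session2 = sqlite3.connect(db_path)

tables_in_session1 = visible_tables(session1)
tables_in_session2 = visible_tables(session2)
print(tables_in_session1)  # permanent table and temp table
print(tables_in_session2)  # permanent table only

session1.close()
session2.close()
```

The second connection sees documents_tf but not the temp table, which is the situation described above when an LDA run is resumed from a different session.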

          People

            riyer Rahul Iyer
            vatsan Srivatsan Ramanujam
            Votes: 0
            Watchers: 4
