  1. Hivemall
  2. HIVEMALL-146

Implement yet another UDF to generate n-grams from a list of words

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Labels:
      None

      Description

      Hive has an ngrams() function for obtaining n-grams from a list of words: https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-ngrams()andcontext_ngrams():N-gramfrequencyestimation

      While the existing function returns an "estimated" top-k list of frequent n-grams, NLP applications sometimes need an "exact" list of n-grams that includes all 1-, 2-, ..., n-grams. For example, for the input ["machine", "learning"], we might expect the following result: ["machine", "learning", "machine learning"].

      Hence, this ticket requests a new UDF similar to ngrams(). The implementation could be similar to getNgrams() in the Stanford CoreNLP library: https://github.com/stanfordnlp/CoreNLP/blob/d6318a0cb06dba635550477bc843952cc5a5f868/src/edu/stanford/nlp/util/StringUtils.java#L2132-L2142
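
      The requested behavior can be sketched in plain Java as below. This is only an illustration of the exact 1- to n-gram enumeration described above, not the Hivemall UDF or the CoreNLP code; the class name NgramSketch, the method name allNgrams, and its minSize/maxSize parameters are hypothetical. It assumes n-grams are emitted in order of increasing size and joined with a single space, which matches the example in this ticket.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: enumerates every n-gram of size minSize..maxSize.
// Not the actual Hivemall or CoreNLP implementation.
public class NgramSketch {

    static List<String> allNgrams(List<String> words, int minSize, int maxSize) {
        if (minSize < 1 || maxSize < minSize) {
            throw new IllegalArgumentException("require 1 <= minSize <= maxSize");
        }
        List<String> result = new ArrayList<>();
        // Outer loop over n-gram size, so all 1-grams come before all 2-grams, etc.
        for (int n = minSize; n <= maxSize; n++) {
            // Slide a window of n words over the input; sizes larger than the
            // input simply produce no windows.
            for (int start = 0; start + n <= words.size(); start++) {
                result.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return result;
    }
}
```

      Under these assumptions, calling NgramSketch.allNgrams(Arrays.asList("machine", "learning"), 1, 2) yields the list ["machine", "learning", "machine learning"]. A real Hive UDF would additionally need to wrap this logic in Hive's UDF interfaces and handle null inputs.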

              People

              • Assignee:
                takuti Takuya Kitazawa
                Reporter:
                takuti Takuya Kitazawa
