
Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.2.0
    • None
    • SQL

    Description

      This issue implements the `ngrams` aggregate expression, which Hive also provides. It applies the n-gram concept from natural language processing to analyze text.

      Spark currently cannot use the Hive UDAF GenericUDAFnGrams, so this feature is missing.

      An n-gram is a contiguous subsequence of n item(s) drawn from a given sequence. This expression finds the k most frequent n-grams from one or more sequences.
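
The definition above can be illustrated with a minimal Python sketch. It is not the Spark or Hive implementation; `extract_ngrams` is a hypothetical helper name, and the sketch assumes that n-grams are collected within each input sequence and never span two sequences.

```python
from typing import Iterable, List, Sequence, Tuple


def extract_ngrams(sequences: Iterable[Sequence[str]], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n items from each sequence, in order."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    grams: List[Tuple[str, ...]] = []
    for seq in sequences:
        # A sequence shorter than n produces no windows (the range is empty).
        for i in range(len(seq) - n + 1):
            grams.append(tuple(seq[i:i + n]))
    return grams


words = ["abc", "abc", "bcd", "abc", "bcd"]
bigrams = extract_ngrams([words], 2)
print(bigrams)
```

A sequence of length m yields m - n + 1 n-grams, so the five-word input here produces four 2-grams.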

      The expression has the form ngrams(children: Array[Array[String]] (or Array[String]), n: Int, k: Int, accuracy: Int). It can be used in conjunction with `sentences`, which splits a String column into arrays. The parameters are:
      children: the given sequence(s) from which n-grams are collected;
      n: the number of elements in each n-gram; size 1 is referred to as a "unigram", size 2 as a "bigram", size 3 as a "trigram", and so on;
      k: the number of most frequent n-grams to return (top k);
      accuracy: controls the memory used for frequency estimation; more memory gives more accurate frequency counts.

      A simple example:
      `SELECT ngrams(array("abc", "abc", "bcd", "abc", "bcd"), 2, 4);` returns
      `[{["abc","bcd"]:2.0}, {["abc","abc"]:1.0}, {["bcd","abc"]:1.0}]`. The input contains four 2-grams: `["abc", "abc"], ["abc", "bcd"], ["bcd", "abc"], ["abc", "bcd"]`. `["abc", "bcd"]` occurs twice and the other two 2-grams occur once each. Only three distinct 2-grams exist, so three results are returned even though k is 4. Of the two 2-grams with equal counts, `["abc","abc"]` sorts alphabetically before `["bcd","abc"]`, which gives the order shown.
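
The ranking in the example above can be reproduced with a short Python sketch that uses exact counting. This is only an illustration: the proposed expression estimates frequencies under a memory bound set by `accuracy`, which this sketch does not model, and `top_k_ngrams` is a hypothetical name.

```python
from collections import Counter


def top_k_ngrams(sequences, n, k):
    """Exact top-k n-grams: highest count first, ties broken alphabetically."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of width n over each sequence and count every window.
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    # Sort by descending count, then by the n-gram itself in ascending order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(list(gram), float(freq)) for gram, freq in ranked[:k]]


result = top_k_ngrams([["abc", "abc", "bcd", "abc", "bcd"]], 2, 4)
print(result)
# [(['abc', 'bcd'], 2.0), (['abc', 'abc'], 1.0), (['bcd', 'abc'], 1.0)]
```

Asking for k = 4 still returns three entries, because the input contains only three distinct 2-grams.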

          People

            Unassigned Unassigned
            chenzhaoguo-intel Chenzhao Guo