HIVE-1518: context_ngrams() UDAF for estimating top-k contextual n-grams

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.0
    • Component/s: UDF
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Create a new context_ngrams() function that generalizes the ngrams() UDAF to allow the user to specify context around n-grams. The analogy is "fill-in-the-blanks", and is best illustrated with an example:

      SELECT context_ngrams(sentences(tweets), array("i", "love", null), 300) FROM twitter;

      will estimate the top-300 words that follow the phrase "i love" in a database of tweets. Each null marks a position to be filled in, and the words found at those positions form the n-gram that is counted. Nulls can be placed anywhere in the context array. For example:

      SELECT context_ngrams(sentences(tweets), array("i", "love", null, "but", "hate", null), 300) FROM twitter;

      will estimate the top-300 word-pairs that fill in the blanks specified by null.
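
To make the fill-in-the-blanks semantics concrete, here is a minimal Python sketch that counts matches exactly on in-memory data. It is not the UDAF's implementation: the proposed function estimates the top-k, whereas this sketch computes exact counts. The function name context_ngrams_exact, the input layout (a list of sentences, each a list of words), and exact case-sensitive word matching are assumptions made for illustration.

```python
from collections import Counter


def context_ngrams_exact(sentences, context, k):
    """Exact top-k fill-in-the-blank counts (illustrative, not Hive's algorithm).

    sentences: list of sentences, each a list of word strings (assumed layout)
    context:   list of words, with None marking each blank
    k:         number of results to return
    """
    n = len(context)
    blanks = [i for i, word in enumerate(context) if word is None]
    if not blanks:
        raise ValueError("context must contain at least one None")

    counts = Counter()
    for words in sentences:
        # Slide a window of len(context) words across the sentence.
        for start in range(len(words) - n + 1):
            window = words[start:start + n]
            # Every non-blank context word must match the window exactly.
            if all(c is None or c == w for c, w in zip(context, window)):
                # The words at the blank positions form the counted n-gram.
                counts[tuple(window[i] for i in blanks)] += 1
    return counts.most_common(k)
```

With the sentences ["i", "love", "hive"], ["i", "love", "hive", "but", "hate", "pig"] and ["i", "love", "tea"], the context ["i", "love", None] yields ("hive",) twice and ("tea",) once, while ["i", "love", None, "but", "hate", None] yields the single pair ("hive", "pig").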

      POSSIBLE USES:
      1. Pre-computing search lookaheads
      2. Sentiment analysis for products or entities – e.g., querying with context = array("twitter", "is", null)
      3. Navigation path analysis in URL databases

      Attachments

        1. HIVE-1518.5.patch
          102 kB
          Mayank Lahiri
        2. HIVE-1518.4.patch
          103 kB
          Mayank Lahiri
        3. HIVE-1518.3.patch
          99 kB
          Mayank Lahiri
        4. HIVE-1518.2.patch
          98 kB
          Mayank Lahiri
        5. HIVE-1518.1.patch
          98 kB
          Mayank Lahiri

          People

            Assignee: Mayank Lahiri (mayanklahiri)
            Reporter: Mayank Lahiri (mayanklahiri)