[HIVE-1518] context_ngrams() UDAF for estimating top-k contextual n-grams - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.7.0
Fix Version/s: 0.7.0
Component/s: UDF
Labels: None
Hadoop Flags: Reviewed

Description

Create a new context_ngrams() function that generalizes the ngrams() UDAF by letting the user specify context around the n-grams. The idea is "fill in the blanks", and is best illustrated with an example:

SELECT context_ngrams(sentences(tweets), array("i", "love", null), 300) FROM twitter;

This estimates the 300 most frequent words that follow the phrase "i love" in a table of tweets. Each null marks a position from which the n-gram is drawn, and nulls can be placed anywhere in the context array. For example:

SELECT context_ngrams(sentences(tweets), array("i", "love", null, "but", "hate", null), 300) FROM twitter;

This estimates the 300 most frequent word pairs that fill the two blanks marked by null.
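
To make the fill-in-the-blanks matching concrete, here is a minimal Python sketch of the semantics described above. It is not the UDAF's implementation: it counts matches exactly and in memory, whereas context_ngrams() estimates frequencies. The function name context_ngrams_exact and the sample data are hypothetical, and the input is assumed to be already split into sentences of word tokens (the role sentences() plays in the queries).

```python
from collections import Counter

def context_ngrams_exact(sentences, context, k):
    """Exact-count illustration of fill-in-the-blanks matching.

    sentences: list of sentences, each a list of word tokens
    context:   list of words, with None marking each blank
    k:         number of top fills to return
    """
    blanks = [i for i, w in enumerate(context) if w is None]
    n = len(context)
    counts = Counter()
    for words in sentences:
        # Slide a window of len(context) words over each sentence.
        for start in range(len(words) - n + 1):
            window = words[start:start + n]
            # Every non-blank position must match the context word exactly.
            if all(c is None or c == w for c, w in zip(context, window)):
                counts[tuple(window[i] for i in blanks)] += 1
    return counts.most_common(k)

# Hypothetical sample data standing in for sentences(tweets).
tweets = [
    ["i", "love", "coffee"],
    ["i", "love", "coffee", "but", "hate", "mondays"],
    ["i", "love", "tea"],
]
single = context_ngrams_exact(tweets, ["i", "love", None], 2)
pairs = context_ngrams_exact(tweets, ["i", "love", None, "but", "hate", None], 2)
```

With one blank, each result is a one-word tuple; with two blanks, each result is a word pair, matching the two example queries.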

POSSIBLE USES:
1. Pre-computing search lookaheads
2. Sentiment analysis for products or entities – e.g., querying with context = array("twitter", "is", null)
3. Navigation path analysis in URL databases
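
As a rough illustration of use 1, the Python sketch below pre-computes a lookahead table that maps each two-word prefix to its most frequent next words. This is an assumption about how the results might be consumed, not part of the patch: it counts exactly over a small in-memory list, whereas in Hive the equivalent would be a context_ngrams() call with a context such as array("cheap", "flights", null) for each prefix of interest. The name build_lookahead and the sample queries are hypothetical.

```python
from collections import Counter, defaultdict

def build_lookahead(sentences, prefix_len=2, k=3):
    """Map each prefix of prefix_len words to its top-k following words."""
    following = defaultdict(Counter)
    for words in sentences:
        # Every position that has prefix_len words plus one word after them.
        for i in range(len(words) - prefix_len):
            prefix = tuple(words[i:i + prefix_len])
            following[prefix][words[i + prefix_len]] += 1
    return {p: [w for w, _ in c.most_common(k)] for p, c in following.items()}

# Hypothetical search queries, already tokenized.
queries = [
    ["cheap", "flights", "to", "paris"],
    ["cheap", "flights", "to", "rome"],
    ["cheap", "flights", "from", "paris"],
]
table = build_lookahead(queries)
suggestions = table[("cheap", "flights")]
```

A search box could then look up the last two typed words in the table and offer the stored words as completions.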

Attachments

HIVE-1518.5.patch    18/Aug/10 17:34    102 kB    Mayank Lahiri
HIVE-1518.4.patch    17/Aug/10 23:55    103 kB    Mayank Lahiri
HIVE-1518.3.patch    17/Aug/10 21:35    99 kB     Mayank Lahiri
HIVE-1518.2.patch    12/Aug/10 22:41    98 kB     Mayank Lahiri
HIVE-1518.1.patch    11/Aug/10 20:27    98 kB     Mayank Lahiri

People

Assignee: Mayank Lahiri
Reporter: Mayank Lahiri
Votes: 0
Watchers: 1

Dates

Created: 06/Aug/10 23:26
Updated: 16/Dec/11 23:59
Resolved: 18/Aug/10 23:00