[HIVE-1438] sentences() UDF for natural language tokenization - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.7.0
Fix Version/s: 0.7.0
Component/s: UDF
Labels: None
Hadoop Flags: Reviewed

Description

Create a generic UDF that tokenizes free-form natural language text into sentences and words for more advanced processing, while stripping unnecessary punctuation and handling international text correctly. Fortunately, most of this functionality is already built into Java in the form of the i18n BreakIterator class (java.text.BreakIterator), so this UDF will just connect it to Hive. For example:

> SELECT sentences("Hello there! This is a UDF.") FROM somedata LIMIT 1;
[ ["Hello", "there"], ["This", "is", "a", "UDF"] ]

> SELECT sentences("Je m'apelle hive!!!", "fr") FROM somedata LIMIT 1;
[["Je","m'apelle","hive"]]

Notice how punctuation is kept only where it is part of a word (the apostrophe in "m'apelle") and dropped elsewhere (the trailing "!!!" and "."). Breaking at sentences (and thus the nested array return type) is important for tasks like counting the frequency of n-grams in text, which should not cross sentence boundaries.
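
As a rough illustration of the approach described above, the following standalone Java sketch uses java.text.BreakIterator to split text into sentences and then into words, and counts bigrams without crossing sentence boundaries. It is not the code from the attached patches: the class name SentencesSketch, the helper names, and the rule of keeping only tokens that start with a letter or digit are all assumptions made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the Hive UDF itself.
class SentencesSketch {

    // Split text into sentences, then split each sentence into word tokens.
    static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int start = sentenceIt.first();
        for (int end = sentenceIt.next(); end != BreakIterator.DONE;
                start = end, end = sentenceIt.next()) {
            List<String> words = words(text.substring(start, end), locale);
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }

    // Keep only tokens that begin with a letter or digit (assumed filter rule);
    // whitespace and standalone punctuation tokens are dropped.
    static List<String> words(String sentence, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator wordIt = BreakIterator.getWordInstance(locale);
        wordIt.setText(sentence);
        int start = wordIt.first();
        for (int end = wordIt.next(); end != BreakIterator.DONE;
                start = end, end = wordIt.next()) {
            String token = sentence.substring(start, end);
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                words.add(token);
            }
        }
        return words;
    }

    // Count bigrams within each sentence only, never across a sentence boundary.
    static Map<String, Integer> bigramCounts(List<List<String>> sentences) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> s : sentences) {
            for (int i = 0; i + 1 < s.size(); i++) {
                counts.merge(s.get(i) + " " + s.get(i + 1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> tokens =
                sentences("Hello there! This is a UDF.", Locale.ENGLISH);
        System.out.println(tokens);
        System.out.println(bigramCounts(tokens));
    }
}
```

Because the bigram loop runs over one inner list at a time, a pair such as "there This" is never counted, which is the reason the nested array return type matters.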

Attachments

- HIVE-1438.2.patch (21 kB), 12/Jul/10 22:20, Mayank Lahiri
- HIVE-1438.1.patch (13 kB), 06/Jul/10 20:45, Mayank Lahiri

People

Assignee: Mayank Lahiri
Reporter: Mayank Lahiri
Votes: 0
Watchers: 3

Dates

Created: 25/Jun/10 19:54
Updated: 16/Dec/11 23:59
Resolved: 13/Jul/10 00:53