Description
Create a generic UDF that tokenizes free-form natural language text into sentences and words for more advanced processing, stripping unnecessary punctuation and handling international (non-English) text correctly. Fortunately, most of this functionality is already built into Java in the form of the i18n class java.text.BreakIterator, so this UDF will just connect it to Hive. For example:
> SELECT sentences("Hello there! This is a UDF.") FROM somedata LIMIT 1;
[ ["Hello", "there"], ["This", "is", "a", "UDF"] ]
or
> SELECT sentences("Je m'apelle hive!!!", "fr") FROM somedata LIMIT 1;
[["Je","m'apelle","hive"]]
Notice that punctuation is kept only where it is part of a word (the apostrophe in "m'apelle"), while sentence-ending punctuation such as "!" and "." is dropped. Breaking at sentences (and thus the nested array return type) is important for tasks like counting the frequency of n-grams in text, which should not cross sentence boundaries.
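The tokenization described above can be sketched in plain Java with BreakIterator. This is only an illustration of the idea, not the UDF's implementation: the class name SentenceTokenizer and its methods are hypothetical, the Hive wiring is omitted, and the rule "keep a token only if it starts with a letter or digit" is an assumed way to strip standalone punctuation and whitespace.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; the name and structure are assumptions for illustration.
final class SentenceTokenizer {

    // Splits text into sentences, then splits each sentence into words.
    static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int start = sentenceIt.first();
        for (int end = sentenceIt.next(); end != BreakIterator.DONE;
                start = end, end = sentenceIt.next()) {
            List<String> words = words(text.substring(start, end), locale);
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }

    // Splits one sentence at word boundaries and drops tokens that are
    // only punctuation or whitespace (assumed filter: the token must
    // start with a letter or digit).
    private static List<String> words(String sentence, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator wordIt = BreakIterator.getWordInstance(locale);
        wordIt.setText(sentence);
        int start = wordIt.first();
        for (int end = wordIt.next(); end != BreakIterator.DONE;
                start = end, end = wordIt.next()) {
            String token = sentence.substring(start, end);
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Hello there! This is a UDF.", Locale.ENGLISH));
    }
}
```

Returning one inner list per sentence mirrors the nested array return type, so a downstream n-gram counter can process each sentence independently and never build an n-gram that spans two sentences.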