HIVE-1438

sentences() UDF for natural language tokenization

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.0
    • Component/s: UDF
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Create a generic UDF that tokenizes free-form natural language text into sentences and words for more advanced processing, while stripping unnecessary punctuation and being fully internationalization-aware. Fortunately, most of this functionality is already built into Java in the form of the i18n BreakIterator class (java.text.BreakIterator), so this UDF will just connect it to Hive. For example:

      > SELECT sentences("Hello there! This is a UDF.") FROM somedata LIMIT 1;
      [ ["Hello", "there"], ["This", "is", "a", "UDF"] ]

      or

      > SELECT sentences("Je m'apelle hive!!!", "fr") FROM somedata LIMIT 1;
      [["Je","m'apelle","hive"]]

      Notice how punctuation is kept only where it is part of a word (the apostrophe in "m'apelle") and stripped everywhere else. Breaking at sentences (and thus the nested array return type) is important for tasks like counting the frequency of n-grams in text, which should not cross sentence boundaries.
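
The approach described above can be sketched in plain Java, outside of Hive. This is only an illustration under stated assumptions, not the code from the attached patches: the class name SentencesSketch and the methods sentences, words and bigrams are hypothetical, and the rule "keep a token only if it contains a letter or digit" is an assumed way of stripping punctuation. The only library calls used are java.text.BreakIterator's sentence and word instances.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not the code from the HIVE-1438 patches.
class SentencesSketch {

    // Splits text into sentences, then splits each sentence into words.
    static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            List<String> words = words(text.substring(start, end), locale);
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }

    // Splits one sentence at word boundaries. Tokens without any letter or
    // digit (whitespace, standalone punctuation) are dropped; this filter is
    // an assumption made for the sketch.
    static List<String> words(String sentence, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(sentence);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = sentence.substring(start, end);
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(token);
            }
        }
        return words;
    }

    // Builds bigrams one sentence at a time, so no bigram spans two sentences.
    static List<String> bigrams(List<List<String>> sentences) {
        List<String> out = new ArrayList<>();
        for (List<String> sentence : sentences) {
            for (int i = 0; i + 1 < sentence.size(); i++) {
                out.add(sentence.get(i) + " " + sentence.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> english = sentences("Hello there! This is a UDF.", Locale.ENGLISH);
        System.out.println(english);
        System.out.println(bigrams(english));
        System.out.println(sentences("Je m'apelle hive!!!", Locale.FRENCH));
    }
}
```

Because bigrams walks each inner list separately, a pair such as "there This" can never be produced, which is the reason the nested array return type matters for n-gram counting.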

      1. HIVE-1438.1.patch
        13 kB
        Mayank Lahiri
      2. HIVE-1438.2.patch
        21 kB
        Mayank Lahiri

          People

          • Assignee: Mayank Lahiri
          • Reporter: Mayank Lahiri
          • Votes: 0
          • Watchers: 3
