Hive / HIVE-1438

sentences() UDF for natural language tokenization

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.0
    • Component/s: UDF
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Create a generic UDF that tokenizes free-form natural-language text into sentences and words for more advanced processing, stripping unnecessary punctuation and remaining locale-aware. Fortunately, most of this functionality is already built into Java in the form of the i18n BreakIterator class (java.text.BreakIterator), so this UDF will just connect it to Hive. For example:

      > SELECT sentences("Hello there! This is a UDF.") FROM somedata LIMIT 1;
      [ ["Hello", "there"], ["This", "is", "a", "UDF"] ]

      or

      > SELECT sentences("Je m'appelle hive!!!", "fr") FROM somedata LIMIT 1;
      [["Je","m'appelle","hive"]]

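      Notice how punctuation is kept only where it is part of a word (the apostrophe inside the French contraction) and dropped where it merely terminates a sentence. Breaking at sentences, and thus returning a nested array with one inner array of words per sentence, is important for tasks like counting the frequency of n-grams in text, since n-grams should not cross sentence boundaries.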
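
      The approach described above can be sketched in plain Java, outside of Hive. This is only an illustration under stated assumptions, not the code in the attached patches: the class name SentenceTokenizer, the method names, and the rule for dropping punctuation (keep a token only if it starts with a letter or digit) are all hypothetical. The only library API used is java.text.BreakIterator.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical standalone sketch; not the UDF class from the HIVE-1438 patches.
public class SentenceTokenizer {

    // Splits text into sentences, then each sentence into words.
    public static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int start = sentenceIt.first();
        for (int end = sentenceIt.next(); end != BreakIterator.DONE;
                start = end, end = sentenceIt.next()) {
            List<String> words = words(text.substring(start, end), locale);
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }

    // Splits one sentence into words, dropping whitespace and punctuation tokens.
    private static List<String> words(String sentence, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator wordIt = BreakIterator.getWordInstance(locale);
        wordIt.setText(sentence);
        int start = wordIt.first();
        for (int end = wordIt.next(); end != BreakIterator.DONE;
                start = end, end = wordIt.next()) {
            String token = sentence.substring(start, end);
            // Assumed filter: a token counts as a word if it begins with a letter or digit.
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                words.add(token);
            }
        }
        return words;
    }

    // Builds bigrams per sentence, so no bigram spans a sentence boundary.
    public static List<String> bigrams(List<List<String>> sentences) {
        List<String> out = new ArrayList<>();
        for (List<String> s : sentences) {
            for (int i = 0; i + 1 < s.size(); i++) {
                out.add(s.get(i) + " " + s.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> s = sentences("Hello there! This is a UDF.", Locale.ENGLISH);
        System.out.println(s);
        System.out.println(bigrams(s));
    }
}
```

      Because bigrams() walks each inner list separately, the pair "there This" is never produced for the first example, which is the reason the description gives for returning a nested array rather than a flat list of words.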

        Attachments

        1. HIVE-1438.1.patch
          13 kB
          Mayank Lahiri
        2. HIVE-1438.2.patch
          21 kB
          Mayank Lahiri

            People

            • Assignee: Mayank Lahiri (mayanklahiri)
            • Reporter: Mayank Lahiri (mayanklahiri)