  1. Hive
  2. HIVE-1438

sentences() UDF for natural language tokenization

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.0
    • Component/s: UDF
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Create a generic UDF that tokenizes free-form natural-language text into sentences and words for more advanced processing, stripping unnecessary punctuation and respecting locale-specific text boundaries. Fortunately, most of this functionality is already built into Java in the form of the i18n BreakIterator class (java.text.BreakIterator), so this UDF just connects it to Hive. For example:

      > SELECT sentences("Hello there! This is a UDF.") FROM somedata LIMIT 1;
      [ ["Hello", "there"], ["This", "is", "a", "UDF"] ]

      or

      > SELECT sentences("Je m'apelle hive!!!", "fr") FROM somedata LIMIT 1;
      [["Je","m'apelle","hive"]]

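      Notice that punctuation is kept only where it is part of a word (the apostrophe in "m'apelle") and dropped elsewhere (the "!" and "." terminators). Breaking at sentence boundaries, and thus returning a nested array with one inner array of words per sentence, matters for tasks such as counting n-gram frequencies, because n-grams should not cross sentence boundaries.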
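
      To make the approach concrete, here is a minimal Java sketch of sentence-then-word tokenization with java.text.BreakIterator, plus a bigram helper that never crosses a sentence boundary. It is not the code from the attached patches: the class name SentencesSketch, the method names, and the rule "keep a token only if it contains a letter or digit" are assumptions made for illustration, and the Hive GenericUDF wiring is omitted.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the HIVE-1438 implementation.
final class SentencesSketch {

    // Split text into sentences, then split each sentence into words.
    // Returns one inner list of words per non-empty sentence.
    static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int sStart = sentenceIt.first();
        for (int sEnd = sentenceIt.next(); sEnd != BreakIterator.DONE;
                sStart = sEnd, sEnd = sentenceIt.next()) {
            List<String> words = words(text.substring(sStart, sEnd), locale);
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }

    // Split one sentence at word boundaries and drop tokens that are only
    // whitespace or punctuation (assumed filter: no letter or digit).
    private static List<String> words(String sentence, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator wordIt = BreakIterator.getWordInstance(locale);
        wordIt.setText(sentence);
        int wStart = wordIt.first();
        for (int wEnd = wordIt.next(); wEnd != BreakIterator.DONE;
                wStart = wEnd, wEnd = wordIt.next()) {
            String token = sentence.substring(wStart, wEnd);
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(token);
            }
        }
        return words;
    }

    // Build bigrams inside each sentence separately, so that no bigram
    // pairs the last word of one sentence with the first word of the next.
    static List<String> bigrams(List<List<String>> sentences) {
        List<String> out = new ArrayList<>();
        for (List<String> words : sentences) {
            for (int i = 0; i + 1 < words.size(); i++) {
                out.add(words.get(i) + " " + words.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> s =
                sentences("Hello there! This is a UDF.", Locale.ENGLISH);
        System.out.println(s);
        System.out.println(bigrams(s));
    }
}
```

      Because the nested list keeps each sentence separate, the bigram helper cannot produce a pair such as "there This" from the first example. A locale such as Locale.FRENCH can be passed in the same way as the "fr" argument in the second example.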

      Attachments

        1. HIVE-1438.2.patch
          21 kB
          Mayank Lahiri
        2. HIVE-1438.1.patch
          13 kB
          Mayank Lahiri

          People

            Assignee: Mayank Lahiri
            Reporter: Mayank Lahiri