HIVE-1438

sentences() UDF for natural language tokenization

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.0
    • Component/s: UDF
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Create a generic UDF that tokenizes free-form natural language text into sentences and words for more advanced processing, while stripping unnecessary punctuation and being fully internationalization-aware. Fortunately, most of this functionality is already built into Java in the form of the i18n BreakIterator class, so this UDF will just connect it to Hive. For example:

      > SELECT sentences("Hello there! This is a UDF.") FROM somedata LIMIT 1;
      [ ["Hello", "there"], ["This", "is", "a", "UDF"] ]

      or

      > SELECT sentences("Je m'apelle hive!!!", "fr") FROM somedata LIMIT 1;
      [["Je","m'apelle","hive"]]

      Notice how punctuation is maintained only where appropriate: the apostrophe inside "m'apelle" is kept, while the sentence-ending punctuation is stripped. Breaking at sentence boundaries (and thus the nested array return type) is important for tasks like counting the frequency of n-grams in text, since n-grams should not cross sentence boundaries.
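
The approach described above can be sketched in plain Java. This is a minimal illustration of two-level tokenization with java.text.BreakIterator, not the code from the attached patches: the class name SentenceTokenizer, the method signature, and the rule "keep a token only if it contains a letter or digit" are assumptions made for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not the Hive UDF itself.
final class SentenceTokenizer {

    /** Splits text into sentences, then each sentence into words. */
    static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();

        // Outer pass: locale-aware sentence boundaries.
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int sStart = sentenceIt.first();
        for (int sEnd = sentenceIt.next(); sEnd != BreakIterator.DONE;
             sStart = sEnd, sEnd = sentenceIt.next()) {
            String sentence = text.substring(sStart, sEnd);

            // Inner pass: word boundaries within this sentence only.
            List<String> words = new ArrayList<>();
            BreakIterator wordIt = BreakIterator.getWordInstance(locale);
            wordIt.setText(sentence);
            int wStart = wordIt.first();
            for (int wEnd = wordIt.next(); wEnd != BreakIterator.DONE;
                 wStart = wEnd, wEnd = wordIt.next()) {
                String token = sentence.substring(wStart, wEnd);
                // Assumed filter: drop whitespace and punctuation-only tokens.
                if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    words.add(token);
                }
            }
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [[Hello, there], [This, is, a, UDF]]
        System.out.println(sentences("Hello there! This is a UDF.", Locale.ENGLISH));
    }
}
```

Because each sentence becomes its own inner list, a downstream n-gram counter can iterate over the inner lists independently and never form an n-gram that spans a sentence boundary.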

      1. HIVE-1438.2.patch
        21 kB
        Mayank Lahiri
      2. HIVE-1438.1.patch
        13 kB
        Mayank Lahiri

        Activity

        Mayank Lahiri added a comment -

        Patch available for code review. Implements the UDF as described.

        John Sichi added a comment -

        For the test case, it's good that you have non-English text. However, I'm worried that checking in non-ASCII files to Subversion may cause encoding problems on some platforms (I've seen problems from this in the past). Let's think of a way to avoid that while preserving the test coverage.

        Looking at the grammar file (Hive.g), there may be a way to encode Unicode characters as hex in a character string literal.

        Mayank Lahiri added a comment -

        Removed the UTF-8 text from the test cases using hex()/unhex() calls: the input of the sentences() UDF is encoded as UTF-8 hex strings, and only the hex form is displayed in the output.

        For future reference: the Hive CLI on a Mac with the hex() UDF will allow you to paste a UTF-8 string into the terminal and convert it to a hex bytestream. This can be reversed with the unhex() UDF.
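
The hex round trip described above can be illustrated outside Hive. The following is a minimal Java sketch of the idea (UTF-8 bytes to a hex string and back) and is not Hive's hex()/unhex() implementation: the class name Utf8Hex, the method names, and the choice of uppercase digits are assumptions for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper that mirrors the idea of hex()/unhex() on UTF-8 text.
final class Utf8Hex {

    /** Encodes a string's UTF-8 bytes as uppercase hex digits. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes an even-length hex string back into UTF-8 text. */
    static String unhex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The non-ASCII character is written as a Unicode escape, so this
        // source file itself stays pure ASCII.
        String encoded = hex("caf\u00e9");
        System.out.println(encoded); // prints 636166C3A9
        System.out.println(unhex(encoded).equals("caf\u00e9")); // prints true
    }
}
```

Storing only the hex form in test inputs and expected outputs keeps the checked-in files ASCII-only while still exercising non-English text.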

        John Sichi added a comment -

        +1. Will commit if tests pass.

        (Some optimization for the case where the locale is constant would be nice, but we can leave that for a followup.)

        John Sichi added a comment -

        Committed. Thanks Mayank!


          People

          • Assignee:
            Mayank Lahiri
            Reporter:
            Mayank Lahiri
          • Votes:
            0
            Watchers:
            3
