Hive / HIVE-1438

sentences() UDF for natural language tokenization

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.7.0
    • Component/s: UDF
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Create a generic UDF that tokenizes free-form natural language text into sentences and words for more advanced processing, while stripping unnecessary punctuation and being fully internationalization-aware. Fortunately, most of this functionality is already built into Java in the form of the i18n BreakIterator class (java.text.BreakIterator), so this UDF will just connect it to Hive. For example:

      > SELECT sentences("Hello there! This is a UDF.") FROM somedata LIMIT 1;
      [ ["Hello", "there"], ["This", "is", "a", "UDF"] ]

      or

      > SELECT sentences("Je m'apelle hive!!!", "fr") FROM somedata LIMIT 1;
      [["Je","m'apelle","hive"]]

      Notice how punctuation is maintained only where appropriate. Breaking at sentences (and thus the nested array return type) is important for tasks like counting the frequency of n-grams in text, which should not cross sentence boundaries.
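
      To illustrate how the BreakIterator approach described above could work, here is a minimal standalone sketch in plain Java. It is not the code from the attached patches: the class name SentencesSketch, the method signature, and the rule "keep a token only if it contains a letter or digit" are assumptions made for illustration, and the Hive GenericUDF wiring is omitted entirely.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Hive: splits text into sentences and
// each sentence into words using java.text.BreakIterator.
public class SentencesSketch {

    public static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();

        // Outer pass: locate sentence boundaries for the given locale.
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int sStart = sentenceIt.first();
        for (int sEnd = sentenceIt.next(); sEnd != BreakIterator.DONE;
                sStart = sEnd, sEnd = sentenceIt.next()) {
            String sentence = text.substring(sStart, sEnd);

            // Inner pass: locate word boundaries inside this sentence.
            List<String> words = new ArrayList<>();
            BreakIterator wordIt = BreakIterator.getWordInstance(locale);
            wordIt.setText(sentence);
            int wStart = wordIt.first();
            for (int wEnd = wordIt.next(); wEnd != BreakIterator.DONE;
                    wStart = wEnd, wEnd = wordIt.next()) {
                String token = sentence.substring(wStart, wEnd);
                // Assumed filter: drop whitespace and pure-punctuation tokens.
                if (containsLetterOrDigit(token)) {
                    words.add(token);
                }
            }
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }

    private static boolean containsLetterOrDigit(String token) {
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Hello there! This is a UDF.", Locale.ENGLISH));
    }
}
```

      The nested list mirrors the nested array return type above: each inner list is one sentence, so a later n-gram pass can stay within sentence boundaries.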

      1. HIVE-1438.2.patch
        21 kB
        Mayank Lahiri
      2. HIVE-1438.1.patch
        13 kB
        Mayank Lahiri

        Activity

        Mayank Lahiri created issue -
        Mayank Lahiri made changes -
        Field Original Value New Value
        Attachment HIVE-1438.1.patch [ 12448817 ]
        Mayank Lahiri added a comment -

        Patch available for code review. Implements the UDF as described.

        Mayank Lahiri made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        John Sichi added a comment -

        For the test case, it's good that you have non-English text. However, I'm worried that checking in non-ASCII files to Subversion may cause encoding problems on some platforms (I've seen problems from this in the past). Let's think of a way to avoid that while preserving the test coverage.

        Looking at the grammar file (Hive.g), there may be a way to encode Unicode characters as hex in a character string literal.

        Mayank Lahiri added a comment -

        Removed the UTF-8 text from the test cases using some hex()/unhex() calls. Did this by encoding the input of the sentences() UDF as UTF-8 hex strings, and only displaying the hex form in the output.

        For future reference: the Hive CLI on a Mac with the hex() UDF will allow you to paste a UTF-8 string into the terminal and convert it to a hex bytestream. This can be reversed with the unhex() UDF.
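
        For readers without a Hive CLI at hand, the same round trip can be sketched in plain Java. This is only an illustration of the idea of keeping test files ASCII-only. The class HexSketch and its method names are hypothetical, and the uppercase hex output is an assumption rather than a statement about what Hive's hex() emits.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hive: converts a string to the hex form
// of its UTF-8 bytes and back, so non-ASCII text can be stored as ASCII.
public class HexSketch {

    public static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so negative bytes format as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static String unhex(String h) {
        // Assumes an even-length string of valid hex digits.
        byte[] out = new byte[h.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(h.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = hex("caf\u00e9");
        System.out.println(encoded + " -> " + unhex(encoded));
    }
}
```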

        Mayank Lahiri made changes -
        Attachment HIVE-1438.2.patch [ 12449299 ]
        John Sichi added a comment -

        +1. Will commit if tests pass.

        (Some optimization for the case where the locale is constant would be nice, but we can leave that for a followup.)

        John Sichi added a comment -

        Committed. Thanks Mayank!

        John Sichi made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags [Reviewed]
        Resolution Fixed [ 1 ]
        Carl Steinbach made changes -
        Component/s UDF [ 12313585 ]
        Component/s Query Processor [ 12312586 ]
        Carl Steinbach made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition                   | Time In Source Status | Execution Times | Last Executer  | Last Execution Date
        Open -> Patch Available      | 11d 51m               | 1               | Mayank Lahiri  | 06/Jul/10 21:46
        Patch Available -> Resolved  | 6d 4h 6m              | 1               | John Sichi     | 13/Jul/10 01:53
        Resolved -> Closed           | 521d 23h 6m           | 1               | Carl Steinbach | 16/Dec/11 23:59

          People

          • Assignee:
            Mayank Lahiri
            Reporter:
            Mayank Lahiri
          • Votes:
            0
            Watchers:
            3
