Pig
  1. Pig
  2. PIG-3190

Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11
    • Fix Version/s: 0.16.0
    • Component/s: internal-udfs
    • Labels:
      None

      Description

      TOKENIZE is literally useless. The Lucene Standard/Snowball tokenizers in lucene, as used by, varaha is much more useful for actual tasks: https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java

      1. PIG-3190.patch
        12 kB
        Russell Jurney
      2. PIG-3190-2.patch
        339 kB
        Russell Jurney
      3. PIG-3190-3.patch
        336 kB
        Russell Jurney

        Activity

        Russell Jurney created issue -
        Hide
        Russell Jurney added a comment -

        LuceneTokenize()

        Show
        Russell Jurney added a comment - LuceneTokenize()
        Hide
        Russell Jurney added a comment -

        Pulled from varaha project by Jacob Perkins

        Show
        Russell Jurney added a comment - Pulled from varaha project by Jacob Perkins
        Russell Jurney made changes -
        Field Original Value New Value
        Attachment PIG-3190.patch [ 12569757 ]
        Russell Jurney made changes -
        Summary Add LuceneTokenizer to Pig - useful text tokenization Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization
        Description TOKENIZE is literally useless. The Lucene tokenizer in varaha is much more useful for actual tasks: https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java TOKENIZE is literally useless. The Lucene Standard/Snowball tokenizers in lucene, as used by, varaha is much more useful for actual tasks: https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java
        Hide
        Russell Jurney added a comment -

        Initial working patch, sans tests.

        Show
        Russell Jurney added a comment - Initial working patch, sans tests.
        Russell Jurney made changes -
        Attachment PIG-3190.patch [ 12569838 ]
        Russell Jurney made changes -
        Attachment PIG-3190.patch [ 12569757 ]
        Hide
        Russell Jurney added a comment -

        Working patch with unit test.

        Show
        Russell Jurney added a comment - Working patch with unit test.
        Russell Jurney made changes -
        Attachment PIG-3190-2.patch [ 12569849 ]
        Hide
        Jonathan Coveney added a comment -

        Can you throw this in RB? Either way, some initial comments...

        1. ExecException is a subclass of IOException...why do you just catch and rethrow it?
        2. On the TokenStream stream declaration you have too many parens.
        3. I personally don't like functions whose aliases depends on the alias of an input type unless it really makes sense. IMHO this is not one of those cases. I'd just nix it and use the @OutputSchema annotation.

        Many of these apply to the SnowballTokenizer as well.

        Show
        Jonathan Coveney added a comment - Can you throw this in RB? Either way, some initial comments... 1. ExecException is a subclass of IOException...why do you just catch and rethrow it? 2. On the TokenStream stream declaration you have too many parens. 3. I personally don't like functions whose aliases depends on the alias of an input type unless it really makes sense. IMHO this is not one of those cases. I'd just nix it and use the @OutputSchema annotation. Many of these apply to the SnowballTokenizer as well.
        Hide
        Russell Jurney added a comment -

        Changed name from Lucene -> Standard, and general cleanup based on feedback.

        Show
        Russell Jurney added a comment - Changed name from Lucene -> Standard, and general cleanup based on feedback.
        Russell Jurney made changes -
        Attachment PIG-3190-3.patch [ 12573003 ]
        Hide
        Russell Jurney added a comment -
        Show
        Russell Jurney added a comment - https://reviews.apache.org/r/9843/ REview Board
        Russell Jurney made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Daniel Dai added a comment -

        Some comments:

        • TestStandardTokenize fail due to LuceneTokenize not found. Seems you change LuceneTokenize to StandardTokenize but not changing the test
        • How about Jonathan's comment#1?
        • TestStandardTokenize.java includes two copies of Apache License
        • In general, shall we put it into piggybank instead of builtin?
        Show
        Daniel Dai added a comment - Some comments: TestStandardTokenize fail due to LuceneTokenize not found. Seems you change LuceneTokenize to StandardTokenize but not changing the test How about Jonathan's comment#1? TestStandardTokenize.java includes two copies of Apache License In general, shall we put it into piggybank instead of builtin?
        Hide
        Russell Jurney added a comment -

        Thanks for the notes. I'll get it fixed.

        As to piggybank: if this were in Piggybank, it would never get used. The reason being that a user would have to locate and LOAD lucene.jar(s), which is very difficult to do in practice. Since TOKENIZE is builtin, these make sense being builtin too.

        Show
        Russell Jurney added a comment - Thanks for the notes. I'll get it fixed. As to piggybank: if this were in Piggybank, it would never get used. The reason being that a user would have to locate and LOAD lucene.jar(s), which is very difficult to do in practice. Since TOKENIZE is builtin, these make sense being builtin too.
        Hide
        Daniel Dai added a comment -

        I don't against expanding our builtin collection, but traditionally we only take basic UDFs into builtin. How's others think?

        Show
        Daniel Dai added a comment - I don't against expanding our builtin collection, but traditionally we only take basic UDFs into builtin. How's others think?
        Hide
        Alan Gates added a comment -

        Canceling patch until issues around location and build failures are resolved.

        Show
        Alan Gates added a comment - Canceling patch until issues around location and build failures are resolved.
        Alan Gates made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Daniel Dai made changes -
        Fix Version/s 0.13.0 [ 12324971 ]
        Fix Version/s 0.12.0 [ 12323380 ]
        Daniel Dai made changes -
        Fix Version/s 0.14.0 [ 12326954 ]
        Fix Version/s 0.13.0 [ 12324971 ]
        Daniel Dai made changes -
        Fix Version/s 0.15.0 [ 12328760 ]
        Fix Version/s 0.14.0 [ 12326954 ]
        Daniel Dai made changes -
        Fix Version/s 0.16.0 [ 12332168 ]
        Fix Version/s 0.15.0 [ 12328760 ]
        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Patch Available Patch Available
        22d 3m 1 Russell Jurney 11/Mar/13 00:47
        Patch Available Patch Available Open Open
        36d 16h 28m 1 Alan Gates 16/Apr/13 18:15

          People

          • Assignee:
            Russell Jurney
            Reporter:
            Russell Jurney
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:

              Development