PIG-3190

Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11
    • Fix Version/s: 0.16.0
    • Component/s: internal-udfs
    • Labels: None

      Description

      TOKENIZE is practically useless. The Lucene Standard/Snowball tokenizers, as used by varaha, are much more useful for actual tasks: https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java
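
      For context, here is a minimal sketch of the varaha-style approach the link above describes: an EvalFunc that runs the input chararray through Lucene's StandardAnalyzer and returns a bag of (token:chararray) tuples. The class name, field name, and the Lucene 3.x-era analyzer API below are illustrative assumptions, not the attached patch.

      {code:java}
      import java.io.IOException;
      import java.io.StringReader;

      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
      import org.apache.lucene.util.Version;
      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.BagFactory;
      import org.apache.pig.data.DataBag;
      import org.apache.pig.data.Tuple;
      import org.apache.pig.data.TupleFactory;

      // Illustrative sketch (not the patch): tokenize a chararray with Lucene's
      // StandardAnalyzer and return a bag of single-field (token) tuples,
      // mirroring the output shape of the builtin TOKENIZE.
      public class StandardTokenizeSketch extends EvalFunc<DataBag> {

          private static final TupleFactory TUPLES = TupleFactory.getInstance();
          private static final BagFactory BAGS = BagFactory.getInstance();

          @Override
          public DataBag exec(Tuple input) throws IOException {
              if (input == null || input.size() < 1 || input.get(0) == null) {
                  return null;
              }
              String text = input.get(0).toString();
              DataBag tokens = BAGS.newDefaultBag();

              // Lucene 3.x-style usage; later Lucene versions change the
              // analyzer constructor and TokenStream lifecycle slightly.
              StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
              TokenStream stream = analyzer.tokenStream("text", new StringReader(text));
              CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
              stream.reset();
              while (stream.incrementToken()) {
                  tokens.add(TUPLES.newTuple(term.toString()));
              }
              stream.end();
              stream.close();
              analyzer.close();
              return tokens;
          }
      }
      {code}

      In a script it would be called the same way as the builtin TOKENIZE, typically followed by FLATTEN over the returned bag.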

      1. PIG-3190.patch
        12 kB
        Russell Jurney
      2. PIG-3190-2.patch
        339 kB
        Russell Jurney
      3. PIG-3190-3.patch
        336 kB
        Russell Jurney

        Activity

        Russell Jurney added a comment -

        LuceneTokenize()

        Russell Jurney added a comment -

        Pulled from the varaha project by Jacob Perkins.

        Russell Jurney added a comment -

        Initial working patch, sans tests.

        Russell Jurney added a comment -

        Working patch with unit test.

        Jonathan Coveney added a comment -

        Can you throw this in RB? Either way, some initial comments...

        1. ExecException is a subclass of IOException...why do you just catch and rethrow it?
        2. On the TokenStream stream declaration you have too many parens.
        3. I personally don't like functions whose output alias depends on the alias of their input unless it really makes sense. IMHO this is not one of those cases. I'd just nix it and use the @OutputSchema annotation (see the sketch below).

        Many of these apply to the SnowballTokenizer as well.

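        To make point 3 concrete, here is a minimal sketch of declaring a fixed output schema with Pig's @OutputSchema annotation instead of deriving it from the input alias. The schema string and class body are illustrative assumptions, not code from the attached patch.

        {code:java}
        import java.io.IOException;

        import org.apache.pig.EvalFunc;
        import org.apache.pig.builtin.OutputSchema;
        import org.apache.pig.data.DataBag;
        import org.apache.pig.data.Tuple;

        // The annotation declares the output schema statically, so outputSchema()
        // no longer needs to inspect or copy the alias of the input field.
        @OutputSchema("tokens:{(token:chararray)}")
        public class StandardTokenize extends EvalFunc<DataBag> {
            @Override
            public DataBag exec(Tuple input) throws IOException {
                // ... tokenization as in the sketch under the description ...
                return null; // placeholder body for this schema example
            }
        }
        {code}
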
        Russell Jurney added a comment -

        Changed name from Lucene -> Standard, and general cleanup based on feedback.

        Russell Jurney added a comment -

        https://reviews.apache.org/r/9843/ (Review Board)

        Daniel Dai added a comment -

        Some comments:

        • TestStandardTokenize fails because LuceneTokenize is not found. It seems you changed LuceneTokenize to StandardTokenize but did not update the test.
        • How about Jonathan's comment #1?
        • TestStandardTokenize.java includes two copies of the Apache License header.
        • In general, shall we put it into piggybank instead of builtin?

        Russell Jurney added a comment -

        Thanks for the notes. I'll get it fixed.

        As to piggybank: if these were in Piggybank, they would never get used, because a user would have to locate and register the Lucene jar(s), which is very difficult to do in practice. Since TOKENIZE is builtin, these make sense as builtins too.

        Daniel Dai added a comment -

        I'm not against expanding our builtin collection, but traditionally we only take basic UDFs into builtin. What do others think?

        Alan Gates added a comment -

        Canceling patch until issues around location and build failures are resolved.


          People

          • Assignee: Russell Jurney
          • Reporter: Russell Jurney
          • Votes: 0
          • Watchers: 4

            Dates

            • Created:
            • Updated:
