Pig / PIG-681

TextDataParser does not handle non-ASCII UTF-8 characters

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.2.0
    • Fix Version/s: None
    • Component/s: impl
    • Labels: None

      Description

      TextDataParser handles ASCII data but fails on non-ASCII UTF-8 data. Since Pig supports UTF-8 data, the parser should be modified to handle non-ASCII characters as well.
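
      To make "non-ASCII UTF-8 data" concrete: in UTF-8, every ASCII character is a single byte with the high bit clear, while any other character is encoded as two or more bytes that all have the high bit set. The sketch below is a hypothetical helper, not part of Pig, for checking whether a stored record contains such bytes and where the first one sits. The class name, method name, and sample strings are assumptions made for illustration.

      ```java
      import java.nio.charset.StandardCharsets;

      /** Hypothetical diagnostic helper; not part of Pig. */
      final class Utf8Probe {
          /** Returns the index of the first byte outside 7-bit ASCII, or -1 if there is none. */
          static int firstNonAsciiByte(byte[] data) {
              for (int i = 0; i < data.length; i++) {
                  // A set high bit marks a byte of a multi-byte UTF-8 sequence.
                  if ((data[i] & 0x80) != 0) {
                      return i;
                  }
              }
              return -1;
          }

          public static void main(String[] args) {
              // Sample records are made up; \u00e9 is "e with acute accent".
              byte[] ascii = "{(hello)}".getBytes(StandardCharsets.UTF_8);
              byte[] nonAscii = "{(h\u00e9llo)}".getBytes(StandardCharsets.UTF_8);

              System.out.println(firstNonAsciiByte(ascii));       // prints -1
              System.out.println(firstNonAsciiByte(nonAscii));    // prints 3
              System.out.println(nonAscii.length - ascii.length); // prints 1 (the accented letter takes two bytes)
          }
      }
      ```

      Running a check like this over the file that fails to load may help confirm whether the failing record is in fact the first one containing a multi-byte character.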

        Activity

        Santhosh Srinivasan added a comment -

        The query and the exception stack trace from the user:

        phrases = load 'phrases' as (data: chararray, f: int);
        a = group phrases by f;
        b = foreach a generate group as f, phrases.data as data;
        store b into 'grouped';
        
        b = load 'grouped' as (f: int, data: bag{t: tuple(data: chararray)});
        c = foreach b generate f, data;       -- just store in this sample
        store c into 'final';
        
        [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2043: Unexpected error during execution.
        
        org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite loop caused by repeated empty string matches at line 1, column 3.
        	at org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:619)
        	at org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:568)
        	at org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:623)
        	at org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:153)
        	at org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:85)
        	at org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:345)
        	at org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42)
        	at org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:71)
        	at org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:79)
        	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:908)
        	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:244)
        	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:198)
        	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:226)
        	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:187)
        	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:203)
        	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:194)
        	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
        	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        	at org.apache.hadoop.mapred.Child.main(Child.java:158)
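
        The trace shows the failure while converting stored bytes back into a bag, at line 1, column 3. If the stored text looks like `{(data),(data)}` (an assumption about the serialized form), column 3 would be the first character of the first field. One way to keep such a reader independent of the character set is to decode the bytes as UTF-8 first and then treat field content as everything up to the closing delimiter, instead of matching a fixed set of allowed characters. The sketch below illustrates that idea only. It is not Pig's TextDataParser, it handles just bags of single-field tuples, and whether the actual grammar restricts field characters is not stated in this issue. All names are hypothetical.

        ```java
        import java.nio.charset.StandardCharsets;
        import java.util.ArrayList;
        import java.util.List;

        /** Hypothetical sketch; not Pig's TextDataParser. Reads text such as "{(a),(b)}". */
        final class BagTextSketch {
            /** Decodes the raw bytes as UTF-8 before any tokenizing happens. */
            static List<String> parseBagBytes(byte[] raw) {
                return parseBag(new String(raw, StandardCharsets.UTF_8));
            }

            /** Parses a bag of single-field tuples; any character other than ')' counts as field content. */
            static List<String> parseBag(String text) {
                List<String> fields = new ArrayList<>();
                int pos = expect(text, 0, '{');
                while (pos < text.length() && text.charAt(pos) != '}') {
                    pos = expect(text, pos, '(');
                    int end = text.indexOf(')', pos);
                    if (end < 0) {
                        throw new IllegalArgumentException("unterminated tuple at " + pos);
                    }
                    fields.add(text.substring(pos, end));
                    pos = end + 1;
                    // Skip the separator between tuples, if present.
                    if (pos < text.length() && text.charAt(pos) == ',') {
                        pos++;
                    }
                }
                expect(text, pos, '}');
                return fields;
            }

            /** Checks for the expected delimiter and returns the position after it. */
            private static int expect(String text, int pos, char c) {
                if (pos >= text.length() || text.charAt(pos) != c) {
                    throw new IllegalArgumentException("expected '" + c + "' at " + pos);
                }
                return pos + 1;
            }

            public static void main(String[] args) {
                // Made-up sample data containing two-byte and three-byte UTF-8 characters.
                byte[] raw = "{(h\u00e9llo),(\u65e5\u672c\u8a9e)}".getBytes(StandardCharsets.UTF_8);
                System.out.println(parseBagBytes(raw).size()); // prints 2
            }
        }
        ```

        Because the field boundary is defined by the delimiter rather than by a list of permitted characters, the scanner always consumes at least one character per step and never has to decide whether a given code point is "valid" data.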
        

          People

          • Assignee: Unassigned
          • Reporter: Santhosh Srinivasan
          • Votes: 0
          • Watchers: 0

            Dates

            • Created:
              Updated:
