Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-947

Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • data
    • None
    • Pig on Hadoop 18

    Description

      PigStorage parser for bags is not working correctly when a tuple in a bag is proceeded by a space. For example, the following is parsed correctly:

      {(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

      while this is not: (Note the space before the second tuple)

      {(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

      It seems that the parser when it encounters the space, treats the rest of the line as a String. With a schema, this results in a typecast of string to databag which results in exception.

      WARN builtin.PigStorage: Unable to interpret value [B@2c9b42e6 in field being converted to type bag, caught ParseException <Encountered " <STRING> " "" at line 1, column 43.
      Was expecting:
      "(" ...
      > field discarded

      Below is the parser debug output for the parsing of the above error sequence: "2.071200,0), (" from above...

                • FOUND A <DOUBLENUMBER> MATCH (2.071200) ******

      Call: AtomDatum
      Consumed token: <<DOUBLENUMBER>: "2.071200" at line 1 column 31>
      Return: AtomDatum
      Return: Datum
      Matched the empty string as <STRING> token.
      Current character : , (44) at line 1 column 39
      No more string literal token matches are possible.
      Currently matched the first 1 characters as a "," token.

                • FOUND A "," MATCH (,) ******

      Consumed token: <"," at line 1 column 39>
      Call: Datum
      Matched the empty string as <STRING> token.
      Current character : 0 (48) at line 1 column 40
      No string literal matches possible.
      Starting NFA to match one of :

      { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
      Current character : 0 (48) at line 1 column 40
      Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
      Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER>, <LONGINTEGER>, <FLOATNUMBER> }
      Current character : ) (41) at line 1 column 41
      Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
      Putting back 1 characters into the input stream.
      ****** FOUND A <SIGNEDINTEGER> MATCH (0) ******

      Call: AtomDatum
      Consumed token: <<SIGNEDINTEGER>: "0" at line 1 column 40>
      Return: AtomDatum
      Return: Datum
      Matched the empty string as <STRING> token.
      Current character : ) (41) at line 1 column 41
      No more string literal token matches are possible.
      Currently matched the first 1 characters as a ")" token.
      ****** FOUND A ")" MATCH ()) ******

      Return: Tuple
      Consumed token: <")" at line 1 column 41>
      Matched the empty string as <STRING> token.
      Current character : , (44) at line 1 column 42
      No more string literal token matches are possible.
      Currently matched the first 1 characters as a "," token.
      ****** FOUND A "," MATCH (,) ******

      Consumed token: <"," at line 1 column 42>
      Matched the empty string as <STRING> token.
      Current character : (32) at line 1 column 43
      No string literal matches possible.
      Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }

      Current character : (32) at line 1 column 43
      Currently matched the first 1 characters as a <STRING> token.
      Possible kinds of longer matches :

      { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }

      Current character : ( (40) at line 1 column 44
      Currently matched the first 1 characters as a <STRING> token.
      Putting back 1 characters into the input stream.

                • FOUND A <STRING> MATCH ( ) ******

      Return: Bag
      Return: Datum
      Return: Parse

      Attachments

        Activity

          People

            Unassigned Unassigned
            gandul Gandul Azul
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: