[PIG-947] Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: data
Labels: None
Environment: Pig on Hadoop 18

Description

The PigStorage parser for bags does not work correctly when a tuple in a bag is preceded by a space. For example, the following is parsed correctly:

{(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

while this is not (note the space before the second tuple):

{(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.000044,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

It seems that when the parser encounters the space, it treats the rest of the line as a String. With a schema, this leads to a cast from string to databag, which fails with an exception:

WARN builtin.PigStorage: Unable to interpret value [B@2c9b42e6 in field being converted to type bag, caught ParseException <Encountered " <STRING> " "" at line 1, column 43.
Was expecting:
"(" ...
> field discarded
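
The column in the warning can be checked against the failing input: counting from 1, column 43 is the space that follows the comma after the first tuple. The sketch below, written in Python rather than Pig's Java, locates such whitespace and strips it as one possible pre-processing workaround. The helper names `whitespace_columns` and `strip_inter_tuple_whitespace` are hypothetical and not part of Pig, and the regex assumes that no field value contains a comma or open brace followed by whitespace and an open parenthesis.

```python
import re

# Whitespace that sits between a "," or "{" and the "(" that opens a tuple.
# Assumption: no field value legitimately contains this sequence.
_GAP = re.compile(r"(?<=[,{])\s+(?=\()")


def whitespace_columns(bag_text):
    """Return the 1-based columns where whitespace before a tuple starts."""
    return [m.start() + 1 for m in _GAP.finditer(bag_text)]


def strip_inter_tuple_whitespace(bag_text):
    """Remove whitespace before tuples so the text matches the accepted form."""
    return _GAP.sub("", bag_text)


# The two inputs from the report: `good` parses, `bad` does not.
good = ("{(-5.243084,3.142401,0.000138,2.071200,0),"
        "(-6.021349,0.992683,0.000044,0.992683,0),"
        "(-10.426160,20.251774,0.000892,5.691086,0)}")
bad = good.replace("),(", "), (", 1)  # space before the second tuple only

print(whitespace_columns(bad))                    # [43]
print(strip_inter_tuple_whitespace(bad) == good)  # True
```

Cleaning the data this way only sidesteps the problem on the input side; it does not change how PigStorage parses bags.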

Below is the parser debug output for the part of the failing input around "2.071200,0), (":

****** FOUND A <DOUBLENUMBER> MATCH (2.071200) ******

Call: AtomDatum
Consumed token: <<DOUBLENUMBER>: "2.071200" at line 1 column 31>
Return: AtomDatum
Return: Datum
Matched the empty string as <STRING> token.
Current character : , (44) at line 1 column 39
No more string literal token matches are possible.
Currently matched the first 1 characters as a "," token.

****** FOUND A "," MATCH (,) ******

Consumed token: <"," at line 1 column 39>
Call: Datum
Matched the empty string as <STRING> token.
Current character : 0 (48) at line 1 column 40
No string literal matches possible.
Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
Current character : 0 (48) at line 1 column 40
Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER>, <LONGINTEGER>, <FLOATNUMBER> }
Current character : ) (41) at line 1 column 41
Currently matched the first 1 characters as a <SIGNEDINTEGER> token.
Putting back 1 characters into the input stream.
****** FOUND A <SIGNEDINTEGER> MATCH (0) ******

Call: AtomDatum
Consumed token: <<SIGNEDINTEGER>: "0" at line 1 column 40>
Return: AtomDatum
Return: Datum
Matched the empty string as <STRING> token.
Current character : ) (41) at line 1 column 41
No more string literal token matches are possible.
Currently matched the first 1 characters as a ")" token.
****** FOUND A ")" MATCH ()) ******

Return: Tuple
Consumed token: <")" at line 1 column 41>
Matched the empty string as <STRING> token.
Current character : , (44) at line 1 column 42
No more string literal token matches are possible.
Currently matched the first 1 characters as a "," token.
****** FOUND A "," MATCH (,) ******

Consumed token: <"," at line 1 column 42>
Matched the empty string as <STRING> token.
Current character : (32) at line 1 column 43
No string literal matches possible.
Starting NFA to match one of : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }

Current character : (32) at line 1 column 43
Currently matched the first 1 characters as a <STRING> token.
Possible kinds of longer matches : { <STRING>, <SIGNEDINTEGER>, <DOUBLENUMBER> }
Current character : ( (40) at line 1 column 44
Currently matched the first 1 characters as a <STRING> token.
Putting back 1 characters into the input stream.

****** FOUND A <STRING> MATCH ( ) ******

Return: Bag
Return: Datum
Return: Parse
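
The trace suggests a plausible mechanism: once the space is read at column 43, only the catch-all <STRING> token can match, so the lexer emits a STRING where the grammar expects "(". The toy longest-match lexer below imitates that behaviour. It is a Python sketch, not Pig's JavaCC grammar, and its token patterns are simplified assumptions. The `skip_whitespace` flag is hypothetical; it shows how skipping whitespace between tokens would avoid the spurious STRING token, which is one possible direction for a fix rather than a confirmed one.

```python
import re

# Simplified token patterns; assumptions for illustration only.
PUNCTUATION = {"{": "LBRACE", "}": "RBRACE",
               "(": "LPAREN", ")": "RPAREN", ",": "COMMA"}
NUMBER = re.compile(r"[+-]?\d+(?:\.\d+)?")
STRING = re.compile(r"[^{}(),]+")  # catch-all: anything that is not punctuation


def tokenize(text, skip_whitespace=False):
    """Return (kind, image, 1-based column) triples using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if skip_whitespace and ch.isspace():
            pos += 1  # drop whitespace between tokens
            continue
        if ch in PUNCTUATION:
            tokens.append((PUNCTUATION[ch], ch, pos + 1))
            pos += 1
            continue
        number = NUMBER.match(text, pos)
        string = STRING.match(text, pos)  # always matches a non-punctuation char
        # Longest match wins; NUMBER is preferred when both match equally far.
        if number and number.end() >= string.end():
            kind, match = "NUMBER", number
        else:
            kind, match = "STRING", string
        tokens.append((kind, match.group(), pos + 1))
        pos = match.end()
    return tokens


# The failing input from the report (space before the second tuple).
bad = ("{(-5.243084,3.142401,0.000138,2.071200,0), "
       "(-6.021349,0.992683,0.000044,0.992683,0),"
       "(-10.426160,20.251774,0.000892,5.691086,0)}")

strict = tokenize(bad)
first_string = next(t for t in strict if t[0] == "STRING")
print(first_string)  # ('STRING', ' ', 43)

lenient = tokenize(bad, skip_whitespace=True)
print(any(kind == "STRING" for kind, _, _ in lenient))  # False
```

In the strict run, the single space becomes a STRING token at column 43, matching the last match reported in the trace. In the lenient run, the token after the comma is the opening parenthesis of the second tuple.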

People

Assignee: Unassigned

Reporter: Gandul Azul

Votes: 0

Watchers: 1

Dates

Created: 07/Sep/09 21:29

Updated: 23/Jul/10 23:02