Details
Type: Bug
Status: In Progress
Priority: Major
Resolution: Unresolved
Description
After committing PIG-3655, a couple of Spark mode tests (e.g. org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct) started failing with:
java.lang.Error: java.io.IOException: Corrupt data file, expected tuple type byte, but seen 27
    at org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:122)
    at org.apache.pig.test.TestEvalPipeline.testCogroupAfterDistinct(TestEvalPipeline.java:1052)
Caused by: java.io.IOException: Corrupt data file, expected tuple type byte, but seen 27
    at org.apache.pig.impl.io.InterRecordReader.readDataOrEOF(InterRecordReader.java:158)
    at org.apache.pig.impl.io.InterRecordReader.nextKeyValue(InterRecordReader.java:194)
    at org.apache.pig.impl.io.InterStorage.getNext(InterStorage.java:79)
    at org.apache.pig.impl.io.ReadToEndLoader.getNextHelper(ReadToEndLoader.java:238)
    at org.apache.pig.impl.io.ReadToEndLoader.getNext(ReadToEndLoader.java:218)
    at org.apache.pig.backend.hadoop.executionengine.HJob$1.hasNext(HJob.java:115)
This is because InterRecordReader became much stricter after PIG-3655. Previously it simply skipped such bytes, assuming they were garbage at the beginning of the split. Now it expects a proper tuple starting with a tuple type byte, sees these nulls instead, and throws an exception.
As far as I can see, this happens because JoinGroupSparkConverter has to return something even when it shouldn't.
When the POPackage operator returns POStatus.STATUS_NULL, the converter shouldn't return anything, but it can't do better than returning a null. This null then gets written out by Spark.
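One possible direction, sketched here only as an illustration and not as the actual fix, is to drop the null placeholders before they reach the writer. The class below is a hypothetical, Pig-independent helper (NullSkippingIterator and demo are names invented for this sketch); it wraps any iterator and skips null elements, which is the behaviour the converter's output would need if nulls stand in for STATUS_NULL results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical helper (not part of Pig): wraps an iterator and silently
 * drops null elements, so that placeholders emitted for "no result"
 * never reach whatever consumes the iterator.
 */
final class NullSkippingIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private T buffered; // next non-null element, or null if nothing is buffered

    NullSkippingIterator(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        // Advance until a non-null element is buffered or the source is exhausted.
        while (buffered == null && source.hasNext()) {
            buffered = source.next();
        }
        return buffered != null;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = buffered;
        buffered = null;
        return result;
    }

    /** Usage example: strings stand in for tuples, nulls for STATUS_NULL results. */
    static List<String> demo() {
        Iterator<String> raw = Arrays.asList("t1", null, "t2", null).iterator();
        Iterator<String> filtered = new NullSkippingIterator<>(raw);
        List<String> out = new ArrayList<>();
        while (filtered.hasNext()) {
            out.add(filtered.next());
        }
        return out; // [t1, t2]
    }
}
```

Whether such filtering belongs inside the converter or in a later step on the Spark side is an open design question that this sketch does not settle.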
Issue Links
- relates to PIG-3655 BinStorage and InterStorage approach to record markers is broken (Resolved)