Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: None
Labels: None
Description
When using a custom delimiter for TextInputFormat, the resulting records are incorrect under some circumstances: the total record count is wrong and some entries are duplicated.
I have created a reproducible test case:
Generate a File
for i in $(seq 1 10000000); do
  echo -n $i >> long_delimiter-1to10000000-with_newline.txt
  echo "--------------------------------------------" >> long_delimiter-1to10000000-with_newline.txt
done
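Since >> reopens the file on every iteration, the bash loop is slow for ten million records. A buffered Java equivalent is sketched below; the class name GenerateTestFile is made up for illustration, and the output is byte-for-byte the same as the bash version.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class GenerateTestFile {
    public static void main(String[] args) throws IOException {
        // Same dash string as in the test, followed by the newline that echo appends.
        String delimiter = "--------------------------------------------\n";
        try (BufferedWriter out = new BufferedWriter(
                new FileWriter("long_delimiter-1to10000000-with_newline.txt"))) {
            for (int i = 1; i <= 10000000; i++) {
                out.write(Integer.toString(i)); // record payload, written without a newline
                out.write(delimiter);           // multibyte record delimiter
            }
        }
    }
}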
Java test to reproduce the error

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.Assert;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Inside edu.udo.cs.schaefer.testspark.Main:
private static final Logger logger = LoggerFactory.getLogger(Main.class);

public static void longDelimiterBug(JavaSparkContext sc) {
    Configuration hadoopConf = new Configuration();
    String delimitedFile = "long_delimiter-1to10000000-with_newline.txt";
    hadoopConf.set("textinputformat.record.delimiter", "--------------------------------------------\n");
    JavaPairRDD<LongWritable, Text> input = sc.newAPIHadoopFile(
        delimitedFile, TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);
    List<String> values = input.map(t -> t._2.toString()).collect();

    Assert.assertEquals(10000000, values.size());
    for (int i = 0; i < 10000000; i++) {
        boolean correct = values.get(i).equals(Integer.toString(i + 1));
        if (!correct) {
            logger.error("Wrong value for index {}: expected {} -> got {}", i, i + 1, values.get(i));
        } else {
            logger.info("Correct value for index {}: expected {} -> got {}", i, i + 1, values.get(i));
        }
        Assert.assertTrue(correct);
    }
}
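To run the test locally, a minimal driver along these lines can be used; the master URL "local[*]" and the app name are assumptions, not part of the original report:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public static void main(String[] args) {
    // "local[*]" runs Spark in-process on all cores; an assumption for local reproduction.
    SparkConf conf = new SparkConf().setAppName("longDelimiterBug").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    try {
        longDelimiterBug(sc);
    } finally {
        sc.stop();
    }
}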
This example fails with the error
java.lang.AssertionError: expected:<10000000> but was:<10042616>
When the assertion on the collection size is commented out, my log output ends like this:
[main] INFO edu.udo.cs.schaefer.testspark.Main - Correct value for index 663244: expected 663245 -> got 663245
[main] ERROR edu.udo.cs.schaefer.testspark.Main - Wrong value for index 663245: expected 663246 -> got 660111
After the wrong value at index 663245, the values are in order again, continuing with 660112, 660113, .... In other words, the records from 660111 onward are emitted a second time, which is consistent with the duplicated entries and the inflated total of 10042616.
The error is not reproducible with \n as the delimiter, i.e. when no custom delimiter is used.
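As a control, the same test shape with the default newline delimiter passes. The sketch below assumes a companion input file with one integer per line (e.g. seq 1 10000000 > newline_1to10000000.txt); the file and method names are made up, and the imports are the same as in the test above.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.Assert;

public static void defaultDelimiterControl(JavaSparkContext sc) {
    // No textinputformat.record.delimiter is set, so the default "\n" is used.
    Configuration hadoopConf = new Configuration();
    JavaPairRDD<LongWritable, Text> input = sc.newAPIHadoopFile(
        "newline_1to10000000.txt", TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);
    List<String> values = input.map(t -> t._2.toString()).collect();
    // With the single-byte delimiter both the count and the order come back correct.
    Assert.assertEquals(10000000, values.size());
    for (int i = 0; i < 10000000; i++) {
        Assert.assertEquals(Integer.toString(i + 1), values.get(i));
    }
}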
Issue Links
- duplicates MAPREDUCE-6549: multibyte delimiters with LineRecordReader cause duplicate records (Closed)