Hadoop Map/Reduce / MAPREDUCE-6891

TextInputFormat: duplicate records with custom delimiter


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      When using a custom delimiter with TextInputFormat, the records produced are incorrect under some circumstances: the total number of records is too high and some entries are duplicated.

      I have created a reproducible test case:

      Generate a File

      # Write each number followed by a 44-dash delimiter line;
      # redirect once for the whole loop instead of reopening the file per echo.
      for i in $(seq 1 10000000); do
        echo -n "$i"
        echo "--------------------------------------------"
      done > long_delimiter-1to10000000-with_newline.txt
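
      Each record is the decimal value of i immediately followed by the 44-dash delimiter and a newline, so the file must contain exactly 10000000 records. A quick plain-Java sanity check of the generated file, independent of Hadoop (the expectedRecords helper is illustrative, not part of the attached test):

      import java.io.BufferedReader;
      import java.io.FileReader;
      import java.io.IOException;

      // Illustrative helper: count delimiter-terminated records directly,
      // without Hadoop, to confirm the expected total of 10000000.
      public static long expectedRecords(String file) throws IOException {
      	long records = 0;
      	try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
      		String line;
      		while ((line = reader.readLine()) != null) {
      			// Every record ends with the 44-dash delimiter plus '\n',
      			// so each physical line holds exactly one record.
      			if (line.endsWith("--------------------------------------------")) {
      				records++;
      			}
      		}
      	}
      	return records;
      }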
      

      Java test to reproduce the error

      import java.util.List;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.spark.api.java.JavaPairRDD;
      import org.apache.spark.api.java.JavaSparkContext;
      import org.junit.Assert;

      // logger is an SLF4J Logger defined on the enclosing class.
      public static void longDelimiterBug(JavaSparkContext sc) {
      	Configuration hadoopConf = new Configuration();
      	String delimitedFile = "long_delimiter-1to10000000-with_newline.txt";
      	// Use the 44-dash line, including its trailing newline, as the record delimiter.
      	hadoopConf.set("textinputformat.record.delimiter",
      			"--------------------------------------------\n");
      	JavaPairRDD<LongWritable, Text> input = sc.newAPIHadoopFile(delimitedFile,
      			TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);

      	// Materialize each Text as a String before collecting, since Hadoop
      	// record readers reuse the same Text instance.
      	List<String> values = input.map(t -> t._2().toString()).collect();

      	Assert.assertEquals(10000000, values.size());
      	for (int i = 0; i < 10000000; i++) {
      		boolean correct = values.get(i).equals(Integer.toString(i + 1));
      		if (!correct) {
      			logger.error("Wrong value for index {}: expected {} -> got {}",
      					i, i + 1, values.get(i));
      		} else {
      			logger.info("Correct value for index {}: expected {} -> got {}",
      					i, i + 1, values.get(i));
      		}
      		Assert.assertTrue(correct);
      	}
      }
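
      For completeness, a hypothetical local-mode driver for the test above (the app name and the local[*] master are illustrative choices, not from the original report):

      import org.apache.spark.SparkConf;
      import org.apache.spark.api.java.JavaSparkContext;

      // Hypothetical driver: runs the reproduction above in Spark local mode.
      public static void main(String[] args) {
      	SparkConf conf = new SparkConf()
      			.setAppName("textinputformat-delimiter-bug") // illustrative name
      			.setMaster("local[*]");
      	try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      		longDelimiterBug(sc);
      	}
      }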
      

      Running longDelimiterBug fails with the error

      java.lang.AssertionError: expected:<10000000> but was:<10042616>

      When the assertion on the collection size is commented out, my log output ends like this:

      [main] INFO edu.udo.cs.schaefer.testspark.Main - Correct value for index 663244: expected 663245 -> got 663245
      [main] ERROR edu.udo.cs.schaefer.testspark.Main - Wrong value for index 663245: expected 663246 -> got 660111

      After the wrong value at index 663245, the values are in order again, continuing with 660112, 660113, .... The reader has apparently jumped back and re-emitted roughly 3135 records (663246 - 660111), presumably re-reading data across an input split boundary.

      The error is not reproducible with \n as the delimiter, i.e. when no custom delimiter is used.
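
      Since the duplicates appear at what look like split boundaries, the miscount can presumably also be reproduced without Spark by driving TextInputFormat's splits by hand. A minimal sketch under that assumption (the countRecords helper is illustrative, not part of the attached test):

      import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.RecordReader;
      import org.apache.hadoop.mapreduce.TaskAttemptContext;
      import org.apache.hadoop.mapreduce.TaskAttemptID;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

      // Illustrative helper: read the file split by split and count records.
      public static long countRecords(String file) throws IOException, InterruptedException {
      	Configuration conf = new Configuration();
      	conf.set("textinputformat.record.delimiter",
      			"--------------------------------------------\n");
      	Job job = Job.getInstance(conf);
      	FileInputFormat.addInputPath(job, new Path(file));

      	TextInputFormat format = new TextInputFormat();
      	long total = 0;
      	for (InputSplit split : format.getSplits(job)) {
      		TaskAttemptContext ctx =
      				new TaskAttemptContextImpl(job.getConfiguration(), new TaskAttemptID());
      		RecordReader<LongWritable, Text> reader = format.createRecordReader(split, ctx);
      		reader.initialize(split, ctx);
      		long perSplit = 0;
      		while (reader.nextKeyValue()) {
      			perSplit++;
      		}
      		reader.close();
      		System.out.println(split + ": " + perSplit + " records");
      		total += perSplit;
      	}
      	return total;
      }

      The per-split counts should sum to 10000000; any excess localizes the duplicated records to specific split boundaries.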


People

    • Assignee: Unassigned
    • Reporter: Till Schäfer (till.schaefer)
    • Votes: 0
    • Watchers: 2
