Hadoop Map/Reduce / MAPREDUCE-6891

TextInputFormat: duplicate records with custom delimiter


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      When using a custom delimiter with TextInputFormat, the records produced are incorrect under some circumstances: the total number of records is too high and some entries are duplicated.

      I have created a reproducible test case:

      Generate a File

      # Write each number followed by a 44-dash delimiter line;
      # redirect once for the whole loop instead of reopening the file per echo.
      for i in $(seq 1 10000000); do
        echo -n "$i"
        echo "--------------------------------------------"
      done > long_delimiter-1to10000000-with_newline.txt
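
      Each record is the decimal value of i immediately followed by the 44-dash delimiter and a newline, so the file must contain exactly 10000000 records. A quick plain-Java sanity check of the generated file, independent of Hadoop (the expectedRecords helper is illustrative, not part of the attached test):

      import java.io.BufferedReader;
      import java.io.FileReader;
      import java.io.IOException;

      // Illustrative helper: count delimiter-terminated records directly,
      // without Hadoop, to confirm the expected total of 10000000.
      public static long expectedRecords(String file) throws IOException {
      	long records = 0;
      	try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
      		String line;
      		while ((line = reader.readLine()) != null) {
      			// Every record ends with the 44-dash delimiter plus '\n',
      			// so each physical line holds exactly one record.
      			if (line.endsWith("--------------------------------------------")) {
      				records++;
      			}
      		}
      	}
      	return records;
      }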
      

      Java test to reproduce the error

      import java.util.List;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.spark.api.java.JavaPairRDD;
      import org.apache.spark.api.java.JavaSparkContext;
      import org.junit.Assert;

      // logger is an SLF4J Logger defined on the enclosing class.
      public static void longDelimiterBug(JavaSparkContext sc) {
      	Configuration hadoopConf = new Configuration();
      	String delimitedFile = "long_delimiter-1to10000000-with_newline.txt";
      	// Use the 44-dash line, including its trailing newline, as the record delimiter.
      	hadoopConf.set("textinputformat.record.delimiter",
      			"--------------------------------------------\n");
      	JavaPairRDD<LongWritable, Text> input = sc.newAPIHadoopFile(delimitedFile,
      			TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);

      	// Materialize each Text as a String before collecting, since Hadoop
      	// record readers reuse the same Text instance.
      	List<String> values = input.map(t -> t._2().toString()).collect();

      	Assert.assertEquals(10000000, values.size());
      	for (int i = 0; i < 10000000; i++) {
      		boolean correct = values.get(i).equals(Integer.toString(i + 1));
      		if (!correct) {
      			logger.error("Wrong value for index {}: expected {} -> got {}",
      					i, i + 1, values.get(i));
      		} else {
      			logger.info("Correct value for index {}: expected {} -> got {}",
      					i, i + 1, values.get(i));
      		}
      		Assert.assertTrue(correct);
      	}
      }
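
      For completeness, a hypothetical local-mode driver for the test above (the app name and the local[*] master are illustrative choices, not from the original report):

      import org.apache.spark.SparkConf;
      import org.apache.spark.api.java.JavaSparkContext;

      // Hypothetical driver: runs the reproduction above in Spark local mode.
      public static void main(String[] args) {
      	SparkConf conf = new SparkConf()
      			.setAppName("textinputformat-delimiter-bug") // illustrative name
      			.setMaster("local[*]");
      	try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      		longDelimiterBug(sc);
      	}
      }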
      

      Running longDelimiterBug fails with the error

      java.lang.AssertionError: expected:<10000000> but was:<10042616>

      When the assertion on the collection size is commented out, my log output ends like this:

      [main] INFO edu.udo.cs.schaefer.testspark.Main - Correct value for index 663244: expected 663245 -> got 663245
      [main] ERROR edu.udo.cs.schaefer.testspark.Main - Wrong value for index 663245: expected 663246 -> got 660111

      After the wrong value at index 663245, the values are in order again, continuing with 660112, 660113, .... The reader has apparently jumped back and re-emitted roughly 3135 records (663246 - 660111), presumably re-reading data across an input split boundary.

      The error is not reproducible with \n as the delimiter, i.e. when no custom delimiter is used.
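
      Since the duplicates appear at what look like split boundaries, the miscount can presumably also be reproduced without Spark by driving TextInputFormat's splits by hand. A minimal sketch under that assumption (the countRecords helper is illustrative, not part of the attached test):

      import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.RecordReader;
      import org.apache.hadoop.mapreduce.TaskAttemptContext;
      import org.apache.hadoop.mapreduce.TaskAttemptID;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

      // Illustrative helper: read the file split by split and count records.
      public static long countRecords(String file) throws IOException, InterruptedException {
      	Configuration conf = new Configuration();
      	conf.set("textinputformat.record.delimiter",
      			"--------------------------------------------\n");
      	Job job = Job.getInstance(conf);
      	FileInputFormat.addInputPath(job, new Path(file));

      	TextInputFormat format = new TextInputFormat();
      	long total = 0;
      	for (InputSplit split : format.getSplits(job)) {
      		TaskAttemptContext ctx =
      				new TaskAttemptContextImpl(job.getConfiguration(), new TaskAttemptID());
      		RecordReader<LongWritable, Text> reader = format.createRecordReader(split, ctx);
      		reader.initialize(split, ctx);
      		long perSplit = 0;
      		while (reader.nextKeyValue()) {
      			perSplit++;
      		}
      		reader.close();
      		System.out.println(split + ": " + perSplit + " records");
      		total += perSplit;
      	}
      	return total;
      }

      The per-split counts should sum to 10000000; any excess localizes the duplicated records to specific split boundaries.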


People

    • Assignee: Unassigned
    • Reporter: Till Schäfer (till.schaefer)
    • Votes: 0
    • Watchers: 2
