Hadoop Map/Reduce
MAPREDUCE-577: Duplicate Mapper input when using StreamXmlRecordReader

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: contrib/streaming
    • Labels: None
    • Environment: Hadoop 0.17.0, Java 6.0
    • Hadoop Flags: Reviewed

      Description

      I have an XML file with 93626 rows. A row is marked by <row>...</row>.

      I've confirmed this count with grep and with the Grep example program included with Hadoop.

      Here is the grep output:

      93626 <row>

      I've setup my job configuration as follows:

      conf.set("stream.recordreader.class", "org.apache.hadoop.streaming.StreamXmlRecordReader");
      conf.set("stream.recordreader.begin", "<row>");
      conf.set("stream.recordreader.end", "</row>");

      conf.setInputFormat(StreamInputFormat.class);
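
      For reference, here is a minimal driver that assembles these settings (a hypothetical sketch using the old org.apache.hadoop.mapred API; the class name XmlCountDriver and the argument handling are invented, and TestMapper is the mapper shown below):

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.streaming.StreamInputFormat;

      public class XmlCountDriver {
        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(XmlCountDriver.class);

          // Tell StreamInputFormat to carve <row>...</row> records out of the XML stream.
          conf.set("stream.recordreader.class",
                   "org.apache.hadoop.streaming.StreamXmlRecordReader");
          conf.set("stream.recordreader.begin", "<row>");
          conf.set("stream.recordreader.end", "</row>");
          conf.setInputFormat(StreamInputFormat.class);

          conf.setMapperClass(TestMapper.class); // the mapper shown below
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);

          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          JobClient.runJob(conf);
        }
      }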

      I have a fairly simple test Mapper.

      Here's the map method.

      public void map(Text key, Text value, OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        try {
          output.collect(totalWord, one);
          if (key != null && key.toString().indexOf("01852") != -1) {
            output.collect(new Text("01852"), one);
          }
        } catch (Exception ex) {
          Logger.getLogger(TestMapper.class.getName()).log(Level.SEVERE, null, ex);
          System.out.println(value);
        }
      }

      For totalWord ("TOTAL"), I get:

      TOTAL 140850

      and for 01852 I get:

      01852 86

      There are 43 instances of 01852 in the file.

      I have the following setting in my config.

      conf.setNumMapTasks(1);

      I have a total of six machines in my cluster.

      If I run without this, the result is 12x the actual value, not 2x.

      Here's some info from the cluster web page.

      Maps  Reduces  Total Submissions  Nodes  Map Task Capacity  Reduce Task Capacity  Avg. Tasks/Node
      0     0        1                  6      12                 12                    4.00

      I've also noticed something really strange in the job's output. It looks like it's starting over or redoing things.
      This was run using all six nodes and no limitations on map or reduce tasks. I haven't seen this behavior in any other case.

      08/06/03 10:50:35 INFO mapred.FileInputFormat: Total input paths to process : 1
      08/06/03 10:50:36 INFO mapred.JobClient: Running job: job_200806030916_0018
      08/06/03 10:50:37 INFO mapred.JobClient: map 0% reduce 0%
      08/06/03 10:50:42 INFO mapred.JobClient: map 2% reduce 0%
      08/06/03 10:50:45 INFO mapred.JobClient: map 12% reduce 0%
      08/06/03 10:50:47 INFO mapred.JobClient: map 31% reduce 0%
      08/06/03 10:50:48 INFO mapred.JobClient: map 49% reduce 0%
      08/06/03 10:50:49 INFO mapred.JobClient: map 68% reduce 0%
      08/06/03 10:50:50 INFO mapred.JobClient: map 100% reduce 0%
      08/06/03 10:50:54 INFO mapred.JobClient: map 87% reduce 0%
      08/06/03 10:50:55 INFO mapred.JobClient: map 100% reduce 0%
      08/06/03 10:50:56 INFO mapred.JobClient: map 0% reduce 0%
      08/06/03 10:51:00 INFO mapred.JobClient: map 0% reduce 1%
      08/06/03 10:51:05 INFO mapred.JobClient: map 28% reduce 2%
      08/06/03 10:51:07 INFO mapred.JobClient: map 80% reduce 4%
      08/06/03 10:51:08 INFO mapred.JobClient: map 100% reduce 4%
      08/06/03 10:51:09 INFO mapred.JobClient: map 100% reduce 7%
      08/06/03 10:51:10 INFO mapred.JobClient: map 90% reduce 9%
      08/06/03 10:51:11 INFO mapred.JobClient: map 100% reduce 9%
      08/06/03 10:51:12 INFO mapred.JobClient: map 100% reduce 11%
      08/06/03 10:51:13 INFO mapred.JobClient: map 90% reduce 11%
      08/06/03 10:51:14 INFO mapred.JobClient: map 97% reduce 11%
      08/06/03 10:51:15 INFO mapred.JobClient: map 63% reduce 11%
      08/06/03 10:51:16 INFO mapred.JobClient: map 48% reduce 11%
      08/06/03 10:51:17 INFO mapred.JobClient: map 21% reduce 11%
      08/06/03 10:51:19 INFO mapred.JobClient: map 0% reduce 11%
      08/06/03 10:51:20 INFO mapred.JobClient: map 15% reduce 12%
      08/06/03 10:51:21 INFO mapred.JobClient: map 27% reduce 13%
      08/06/03 10:51:22 INFO mapred.JobClient: map 67% reduce 13%
      08/06/03 10:51:24 INFO mapred.JobClient: map 22% reduce 16%
      08/06/03 10:51:25 INFO mapred.JobClient: map 46% reduce 16%
      08/06/03 10:51:26 INFO mapred.JobClient: map 70% reduce 16%
      08/06/03 10:51:27 INFO mapred.JobClient: map 73% reduce 18%
      08/06/03 10:51:28 INFO mapred.JobClient: map 85% reduce 19%
      08/06/03 10:51:29 INFO mapred.JobClient: map 7% reduce 19%
      08/06/03 10:51:32 INFO mapred.JobClient: map 100% reduce 20%
      08/06/03 10:51:35 INFO mapred.JobClient: map 100% reduce 22%
      08/06/03 10:51:37 INFO mapred.JobClient: map 100% reduce 23%
      08/06/03 10:51:38 INFO mapred.JobClient: map 100% reduce 46%
      08/06/03 10:51:39 INFO mapred.JobClient: map 100% reduce 58%
      08/06/03 10:51:40 INFO mapred.JobClient: map 100% reduce 80%
      08/06/03 10:51:42 INFO mapred.JobClient: map 100% reduce 90%
      08/06/03 10:51:43 INFO mapred.JobClient: map 100% reduce 100%
      08/06/03 10:51:44 INFO mapred.JobClient: Job complete: job_200806030916_0018
      08/06/03 10:51:44 INFO mapred.JobClient: Counters: 17
      08/06/03 10:51:44 INFO mapred.JobClient: File Systems
      08/06/03 10:51:44 INFO mapred.JobClient: Local bytes read=1705
      08/06/03 10:51:44 INFO mapred.JobClient: Local bytes written=29782
      08/06/03 10:51:44 INFO mapred.JobClient: HDFS bytes read=1366064660
      08/06/03 10:51:44 INFO mapred.JobClient: HDFS bytes written=23
      08/06/03 10:51:44 INFO mapred.JobClient: Job Counters
      08/06/03 10:51:44 INFO mapred.JobClient: Launched map tasks=37
      08/06/03 10:51:44 INFO mapred.JobClient: Launched reduce tasks=10
      08/06/03 10:51:44 INFO mapred.JobClient: Data-local map tasks=22
      08/06/03 10:51:44 INFO mapred.JobClient: Rack-local map tasks=15
      08/06/03 10:51:44 INFO mapred.JobClient: Map-Reduce Framework
      08/06/03 10:51:44 INFO mapred.JobClient: Map input records=942105
      08/06/03 10:51:44 INFO mapred.JobClient: Map output records=942621
      08/06/03 10:51:44 INFO mapred.JobClient: Map input bytes=1365761556
      08/06/03 10:51:44 INFO mapred.JobClient: Map output bytes=9426210
      08/06/03 10:51:44 INFO mapred.JobClient: Combine input records=942621
      08/06/03 10:51:44 INFO mapred.JobClient: Combine output records=49
      08/06/03 10:51:44 INFO mapred.JobClient: Reduce input groups=2
      08/06/03 10:51:44 INFO mapred.JobClient: Reduce input records=49
      08/06/03 10:51:44 INFO mapred.JobClient: Reduce output records=2

      1. 0001-test-to-demonstrate-HADOOP-3484.patch
        4 kB
        Bo Adler
      2. 0002-patch-for-HADOOP-3484.patch
        1 kB
        Bo Adler
      3. HADOOP-3484.combined.patch
        5 kB
        Bo Adler
      4. HADOOP-3484.try3.patch
        30 kB
        Bo Adler
      5. 577.patch
        8 kB
        Ravi Gummadi
      6. 577.20S.patch
        9 kB
        Ravi Gummadi
      7. 577.v1.patch
        9 kB
        Ravi Gummadi
      8. 577.v2.patch
        7 kB
        Ravi Gummadi
      9. 577.v3.patch
        12 kB
        Ravi Gummadi
      10. 577.v4.patch
        9 kB
        Ravi Gummadi

        Activity

        Bo Adler added a comment -

        I ran into a similar problem while trying to debug/implement support for compressed input files. I believe that the problem is in StreamXmlRecordReader.java in the fast-match code. When the bytestream doesn't completely match the begin/end pattern, the code does "pos_ += m" to increment the position in the stream by the number of characters matching the pattern... but it forgets the "c" character which didn't match. Ultimately, the code ends up reading significantly past the end of the split, resulting in the same record (at least the first one right after the split) being detected by multiple StreamXmlRecordReader instances.

        The fix is simple. The line mentioned above should be: pos_ += m + 1;

        I'm planning to submit this and my gzip-support as soon as I can figure out how to submit a patch.
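
        To make the off-by-one concrete, here is a minimal standalone sketch of the position accounting (a hypothetical rewrite for illustration, not the actual StreamXmlRecordReader code):

          import java.io.ByteArrayInputStream;
          import java.io.IOException;
          import java.io.InputStream;

          public class MatchPosDemo {
            public static void main(String[] args) throws IOException {
              byte[] pat = "<row>".getBytes("UTF-8");
              InputStream in = new ByteArrayInputStream("xx<r0y<row>".getBytes("UTF-8"));
              long pos = 0; // bytes consumed from the stream so far
              int m = 0;    // number of pattern bytes matched so far
              int b;
              while ((b = in.read()) != -1) {
                if ((byte) b == pat[m]) {
                  if (++m == pat.length) { pos += m; break; } // full match of "<row>"
                } else {
                  // m matched bytes plus the mismatching byte were all consumed
                  pos += m + 1;
                  m = 0;
                }
              }
              // Prints "pos = 11", the true offset just past "<row>". With the
              // buggy "pos += m" the loop would report 7, silently losing one
              // byte per mismatch, so the reader believes it is still inside
              // its split long after it has read past the boundary.
              System.out.println("pos = " + pos);
            }
          }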

        David Campbell added a comment -

        Great job!

        I'm sure a lot of people (definitely including me) will be pleased to see that get fixed!

        Thanks!

        Dave

        Bo Adler added a comment -

        Here is a patch to create a test case that demonstrates the problem. In my personal testing, it was very easy to run into the problem, but I had to tweak the block size to trigger the error in the junit tests.

        (sorry about the duplicate comment)

        Bo Adler added a comment -

        Here is the patch that fixes the problem.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12385056/0002-patch-for-HADOOP-3484.patch
        against trunk revision 673215.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2777/console

        This message is automatically generated.

        Bo Adler added a comment -

        Combined patches and used "--no-prefix" to strip the leading directory names.

        This patch includes both the test and the fix in a single file.

        Lohit Vijayarenu added a comment -

        Bo, this looks good.
        A few minor changes:

        • The test depends on perl. In the testcase, would it be possible to check for perl and terminate if it isn't present on the system running the test? Otherwise it might be difficult to diagnose the test failure.
        • Could we move the testcase into the already existing test file TestStreamXmlRecordReader.java instead of creating a new one?

        Wei-Ming Chen added a comment -

        Hi,

        I also found the missing "+1" bug a few days ago. However, there are still duplicates even if the proposed patch (pos_ += m + 1) is applied.

        The duplicates happen when the assigned split boundary (end_) falls between an end tag and the next beginning tag (for example, when there are several newlines between two records). In this case, after reading the last record in its split, pos_ is still smaller than end_, so the reader keeps trying to find another beginning tag, which may actually reside in another split. Both splits then return that record, causing duplicates.

        In addition, pos_ does not reflect the real position when there are extra bytes between records, even with the proposed patch applied, because pos_ is only updated while searching for the end tag. If bytes are skipped while searching for the beginning tag without updating pos_, the position will be wrong.

        My solution should solve both problems. If you agree with me and think my solution works, feel free to use my code. I will be glad if the problem gets fixed in the next release.

        StreamXmlRecordReader.java
        
          boolean fastReadUntilMatch(String textPat, boolean includePat, DataOutputBuffer outBufOrNull) throws IOException {
            byte[] cpat = textPat.getBytes("UTF-8");
            int m = 0;
            boolean match = false;
            int msup = cpat.length;
            int LL = 120000 * 10;
        
            bin_.mark(LL); // large number to invalidate mark
            while (true) {
              int b = bin_.read();
              if (b == -1) break;
        
              byte c = (byte) b; // this assumes eight-bit matching. OK with UTF-8
              if (c == cpat[m]) {
                m++;
                if (m == msup) {
                  match = true;
                  break;
                }
              } else {
                bin_.mark(LL); // reset mark so we could jump back if we found a match
                if (outBufOrNull != null) {
                  outBufOrNull.write(cpat, 0, m);
                  outBufOrNull.write(c);
        -         pos_ += m;
        +         pos_ += m + 1;
        +       } else {
        +         pos_ += m + 1;
        +         if (pos_ >= end_)
        +           break;
                }
                m = 0;
              }
            }
            if (!includePat && match) {
              bin_.reset();
            } else if (outBufOrNull != null) {
              outBufOrNull.write(cpat);
              pos_ += msup;
            }
            return match;
          }
        
        Bo Adler added a comment -

        Hmm, yeah, that seems plausible.

        Unfortunately, I'm working on a paper deadline, so I won't get to this for another week. But I think there's something still wrong here. This new change assumes that we want to stop reading at the split boundary, but that's not true. If the split boundary falls in the middle of a record, then the preceding split needs to read past the boundary to "see" the whole record. Additionally, this only fixes the fast match, but I suspect (haven't checked) that a similar issue exists with the slow match... so this check needs to happen in next(), not inside the match routines.

        I think that means we should have a few more test cases, to check all these variations. Lohit: I was hoping someone else would take up the charge of doing "the right thing" with my patches. I can certainly take a stab at creating new tests for Wei-Ming's case and merging all these new tests into TestStreamXmlRecordReader.java, if no one beats me to it.
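
        A sketch of the structure Bo is describing (hypothetical: the helper names readUntilMatchBegin/readUntilMatchEnd and the exact flow are illustrative, not the actual patch):

          public boolean next(Text key, Text value) throws IOException {
            // Do the split-boundary check once here, so it covers both the
            // slow and the fast matcher.
            if (pos_ >= end_) {
              return false; // past the split: the next record belongs to another reader
            }
            // Seek the next begin tag; a record that *starts* inside this
            // split may legitimately extend past end_, so no boundary check
            // happens while reading through the matching end tag.
            if (!readUntilMatchBegin()) {
              return false;
            }
            return readUntilMatchEnd(key, value);
          }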

        Wei-Ming Chen added a comment -

        My solution should stop reading only when trying to find a beginning tag. When it tries to find an end tag, outBufOrNull will not be null and the boundary-checking code will not be reached; therefore, the whole record can be returned. However, all of this assumes that outBufOrNull can be used to discern whether it is trying to find a beginning tag or an end tag.

        I have not checked the slow match either. However, if the boundary checking can be done inside the match function, it should yield some performance improvement.

        Bo Adler added a comment -

        Ah, okay, I see what you're doing. I agree that the performance would be slightly faster inside the match function. But I also feel like the code will be harder to maintain, because this "check for split boundary" has to happen in both slow and fast match cases.

        On the issue of test cases: I managed to create a test case that demonstrates the bug that Wei-Ming addresses, but I'm having a problem combining the test cases into a single file. When I do that, the tests pass, as if a single jobconf were being used for all the tests. Any ideas on what I might be doing wrong?

        Bo Adler added a comment -

        Turns out that the problem is with org.apache.hadoop.fs.FileSystem. There is a cache object inside of it, which remembers the LocalFileSystem that is created; even though I set "-jobconf fs.local.block.size=xx" in subsequent jobs, the original object is kept so the override doesn't happen.

        There is a FileSystem.closeAll() method, which seems to erase the cache. I tried calling this at the end of each job, so that the LocalFileSystem is recreated with the new blocksize (I have three blocksizes that I test). This works the first time, but not the second, so that the third blocksize is the same as the second. At this point, I'd say that the easiest way forward is to have a separate file for each test, since I need to change the blocksize to elicit the two errors.
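
        A minimal sketch of the caching behavior described above (FileSystem.getLocal and FileSystem.closeAll are real methods of org.apache.hadoop.fs.FileSystem, but the flow is an illustration, not the actual test code):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;

          public class FsCacheDemo {
            public static void main(String[] args) throws Exception {
              Configuration conf1 = new Configuration();
              conf1.setLong("fs.local.block.size", 60);
              FileSystem fs1 = FileSystem.getLocal(conf1); // creates and caches the instance

              Configuration conf2 = new Configuration();
              conf2.setLong("fs.local.block.size", 80);
              FileSystem fs2 = FileSystem.getLocal(conf2); // cache hit: the cache is keyed by
                                                           // scheme/authority, not by conf, so
                                                           // the new block size is ignored
              System.out.println(fs1 == fs2);              // true

              FileSystem.closeAll(); // drops the cached instances; the next getLocal(conf)
                                     // builds a fresh FileSystem that sees its conf
            }
          }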

        Bo Adler added a comment -

        Here is my latest patch that implements tests demonstrating the problem, and how I think it should be fixed (including the info from Wei-Ming). I did not implement the early-quit discussed above, but it's easy to do for the person doing the actual commit.

        The tests are done as separate files, to avoid the FileSystem caching that I mentioned earlier.

        Jean-Daniel Cryans added a comment -

        I've just tried the fix described by Wei-Ming in his July 11 post to process the Wikipedia articles dump and it works perfectly (instead of having maps that never end even though they are at 1600%). This really should be committed.

        Paul Dlug added a comment -

        This problem still exists in 0.20.0. I was also successful in applying Wei-Ming's fix.

        Tom White added a comment -

        Is the latest patch the one with the desired fix? If so then please mark "Patch Available" to run it through Hudson, otherwise please post up a new patch first. Thanks.

        Ravi Gummadi added a comment -

        Attaching a patch for trunk. This patch is the same as the earlier HADOOP-3484.try3.patch, except it fixes TestStreamXmlMultiInner.java to use fs.local.block.size=59 instead of 80 so that the testcase fails without the fix in this patch.

        Ravi Gummadi added a comment -

        Looks like the testcase TestStreamXmlMultiOuter is failing in trunk but passing in 0.20. Will investigate.

        Ravi Gummadi added a comment -

        It was an issue with the input to the testcases. The perl scripts depend on the number of occurrences of the word "is", and my earlier patch changed the input, causing the testcase failure.

        Here is the patch for the Y! 20S distribution with the correct input to the testcases. Both testcases pass on my local machine in Y! 20S with the patch and fail without its fix.
        ----------------------

        The same patch has an issue in trunk: the number of splits seems to be 1 in TestStreamXmlMultiInner.java even though fs.local.block.size=59. Will investigate.

        Ravi Gummadi added a comment -

        In trunk, the testcases were not picking up the block size because TestStreaming (the base class of the two tests in this patch) creates the input file by creating a FileSystem object. Since we were setting fs.local.block.size afterwards, the setting had no effect on that FileSystem, causing a single split in both tests.

        Ravi Gummadi added a comment -

        Attaching a patch for trunk that fixes the testcases so that the configuration used when the FileSystem object is created has fs.local.block.size set to the proper value.

        Both testcases fail without the fix of the patch and pass with the fix.

        Amareshwari Sriramadasu added a comment -

        Some comments on the testcases:
        1. Can you put both tests into a single file?
        2. Both tests look similar except for changes in input and some configuration parameters. Can you reuse the code by passing those as parameters?
        3. Minor: Can you put the perl scripts for the mapper and reducer in a String and use that?
        4. Minor: In UtilsForTest, change LOG(..., StringUtils.stringifyException(e)) to LOG(..., e).

        Ravi Gummadi added a comment -

        Attaching new patch incorporating review comments.

        Ravi Gummadi added a comment -

        This patch is on top of the MAPREDUCE-1888 patch, because the test cases were refactored in MAPREDUCE-1888.

        Ravi Gummadi added a comment -

        After merging, the file system block size is not updated properly, so I am adding a FileSystem.closeAll() call at the beginning of each test case. Will upload a patch soon.

        Ravi Gummadi added a comment -

        Attaching a new patch that fixes the issue with the FileSystem block size setting by using FileSystem.closeAll() in the test cases. With this, the block size is properly set to 60, 80, 60 and 80 in the four test cases respectively.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12448380/577.v3.patch
        against trunk revision 959193.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 14 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/275/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/275/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/275/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/275/console

        This message is automatically generated.

        Ravi Gummadi added a comment -

        Some of the code changes to test cases in this patch have gone into MAPREDUCE-1888, so I am regenerating this patch.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12448668/577.v4.patch
        against trunk revision 960446.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 8 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/285/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/285/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/285/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/285/console

        This message is automatically generated.

        Ravi Gummadi added a comment -

        The contrib test failure is because of MAPREDUCE-1834.
        All other tests passed.

        Amareshwari Sriramadasu added a comment -

        +1

        I just committed this. Thanks Ravi!

        Amareshwari Sriramadasu added a comment -

        Thanks Bo Adler for the earlier patches.

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #523 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/523/)

        Ming Jin added a comment -

        Hi everyone,

        I found the exact same issue in Hadoop v1.0.3 (http://fossies.org/dox/hadoop-1.0.3/StreamXmlRecordReader_8java_source.html).

        Is there any plan to fix it in v1.0.3?

        Clark Mobarry added a comment -

        I found the exact same issue in Hadoop v2.0.0 (via Cloudera CDH 4.1.2).

        Clark Mobarry added a comment -

        I found the exact same issue in Hadoop 0.20.2 via Cloudera CDH 4.1.2 MRv1. I did not attempt this with MRv2.

        Pierre-Francois Laquerre added a comment -

        This is still broken in 1.1.2.


          People

          • Assignee: Ravi Gummadi
          • Reporter: David Campbell
          • Votes: 3
          • Watchers: 10
