Hadoop Common
HADOOP-3630

CompositeRecordReader: key and values can be in an uninitialized state if files being joined have no records

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      I am using org.apache.hadoop.mapred.join.CompositeInputFormat to do an outer-join across a number of SequenceFiles. This works fine in most circumstances, but I get NullPointerExceptions/uninitialized data (where Writable#readFields() has not been called) when some of the files being joined have no records in them.
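
      To make the failure mode concrete, below is a minimal illustrative sketch (the class name is hypothetical, not from the attached patches): a typical Writable key is populated only by readFields(), so when one of the joined files is empty, readFields() is never called and the first comparison of that key dereferences null.

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;

      import org.apache.hadoop.io.WritableComparable;

      // Illustrative only: a Writable whose field is set solely by
      // readFields(). For an empty input, readFields() never runs, so
      // 'str' stays null and compareTo() throws NullPointerException
      // the first time the join machinery compares this key.
      public class JoinKey implements WritableComparable {
        String str; // null until readFields() populates it

        public void readFields(DataInput in) throws IOException {
          str = in.readUTF();
        }

        public void write(DataOutput out) throws IOException {
          out.writeUTF(str);
        }

        public int compareTo(Object o) {
          return str.compareTo(((JoinKey) o).str); // NPE on an uninitialized key
        }
      }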

      Attachments

      1. 3630-0.patch
         6 kB
         Chris Douglas
      2. 3630-1.patch
         7 kB
         Chris Douglas

        Activity

        Hudson added a comment -

        Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/)

        Chris Douglas added a comment -

        I just committed this.

        Runping Qi added a comment -

        +1.

        Looks good.

        Runping

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12385249/3630-1.patch
        against trunk revision 673869.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 520 javac compiler warnings (more than the trunk's current 519 warnings).

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2796/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2796/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2796/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2796/console

        This message is automatically generated.

        Chris Douglas added a comment -

        Fixed another possible error in skip(K)

        Chris Douglas added a comment -

        Is it too late for this to make it into 0.18?

        Chris Douglas added a comment -

        The warning is from InputFormat::validateInput, which has been deprecated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12384649/3630-0.patch
        against trunk revision 671385.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        -1 javac. The applied patch generated 443 javac compiler warnings (more than the trunk's current 442 warnings).

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2731/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2731/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2731/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2731/console

        This message is automatically generated.

        Chris Douglas added a comment -

        "I think this occurs because CompositeRecordReader#add() [Hadoop 0.17.0, line 138] doesn't call rr.hasNext() to check if the RecordReader has any records before it adds it to the PriorityQueue. Is this a bug or expected behaviour?"

        This is definitely a bug, and your diagnosis is exactly right.

        Attaching a fix and a testcase.
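
        For reference, here is a minimal sketch of the guard this diagnosis suggests (illustrative only, with simplified hypothetical types; it is not the committed patch, whose real interfaces live in org.apache.hadoop.mapred.join). The idea: skip readers that have no records before offering them to the priority queue, so the queue's comparator never touches an uninitialized key. The sketch assumes readers read ahead, so key() is valid whenever hasNext() returns true.

        import java.util.Comparator;
        import java.util.PriorityQueue;

        // Sketch only -- a simplified stand-in for ComposableRecordReader.
        interface SimpleReader<K extends Comparable<K>> {
          boolean hasNext(); // true iff the source still has a record
          K key();           // current key; valid only while hasNext() is true
        }

        class SketchCompositeReader<K extends Comparable<K>> {
          private final PriorityQueue<SimpleReader<K>> q =
              new PriorityQueue<SimpleReader<K>>(11,
                  new Comparator<SimpleReader<K>>() {
                    public int compare(SimpleReader<K> a, SimpleReader<K> b) {
                      // Safe: only readers that hold a record are enqueued.
                      return a.key().compareTo(b.key());
                    }
                  });

          // The guard: an empty reader is never added, so its
          // uninitialized key is never compared.
          void add(SimpleReader<K> rr) {
            if (rr.hasNext()) {
              q.add(rr);
            }
          }
        }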

        Jingkei Ly added a comment (edited) -

        I've pasted an example program below that demonstrates this:

        // Reproduction for HADOOP-3630, written against the Hadoop 0.17 mapred API.
        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;

        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.WritableComparable;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.SequenceFileInputFormat;
        import org.apache.hadoop.mapred.SequenceFileOutputFormat;
        import org.apache.hadoop.mapred.join.CompositeInputFormat;
        import org.apache.hadoop.mapred.join.TupleWritable;
        import org.apache.hadoop.mapred.lib.IdentityMapper;
        import org.apache.hadoop.mapred.lib.IdentityReducer;

        public class TestJoin {

          public static void main(String[] args) throws Exception {
            String input = args[0], output = args[1];

            JobConf conf = new JobConf(TestJoin.class);

            FileSystem fs = FileSystem.get(conf);
            Path inputPath = new Path(input);

            Path[] src = createInputFiles(inputPath, fs, conf);

            // Outer-join the ten SequenceFiles created below.
            String expr = CompositeInputFormat.compose("outer",
                SequenceFileInputFormat.class, src);
            conf.set("mapred.join.expr", expr);

            conf.addInputPath(inputPath);
            conf.setOutputPath(new Path(output));

            conf.setInputFormat(CompositeInputFormat.class);
            conf.setOutputFormat(SequenceFileOutputFormat.class);

            conf.setOutputKeyClass(ExampleWritable.class);
            conf.setOutputValueClass(TupleWritable.class);

            conf.setNumReduceTasks(0);

            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);

            JobClient.runJob(conf);
          }

          public static class ExampleWritable implements WritableComparable {
            String str = null;

            public void readFields(DataInput in) throws IOException {
              str = in.readUTF();
            }

            public void write(DataOutput out) throws IOException {
              out.writeUTF(str);
            }

            public int compareTo(Object o) {
              ExampleWritable other = (ExampleWritable) o;
              // NullPointerException here when readFields() was never called.
              return str.compareTo(other.str);
            }
          }

          private static Path[] createInputFiles(Path inputPath, FileSystem fs,
              JobConf conf) throws IOException {
            Path[] src = new Path[10];

            for (int i = 0; i < src.length; ++i) {
              src[i] = new Path(inputPath, Integer.toString(i + 10, 36));
            }

            SequenceFile.Writer[] out = new SequenceFile.Writer[10];
            for (int i = 0; i < src.length; ++i) {
              out[i] = new SequenceFile.Writer(fs, conf, src[i],
                  ExampleWritable.class, ExampleWritable.class);
            }

            ExampleWritable key = new ExampleWritable();
            ExampleWritable val = new ExampleWritable();

            // Write records to only one of the ten sequence files; the
            // other nine are left empty, which is what triggers the bug.
            for (int i = 0; i < 10; i++) {
              key.str = String.valueOf(i);
              val.str = String.valueOf(i);
              out[1].append(key, val);
            }

            for (SequenceFile.Writer o : out) {
              o.close();
            }

            return src;
          }
        }

        (I hope that still compiles and makes sense :S)

        Running it gives me the following exception:

        java.lang.NullPointerException
                at com.detica.netreveal.hadoop.TestJoin$ExampleWritable.compareTo(TestJoin.java:81)
                at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:109)
                at org.apache.hadoop.mapred.join.CompositeRecordReader$2.compare(CompositeRecordReader.java:134)
                at org.apache.hadoop.mapred.join.CompositeRecordReader$2.compare(CompositeRecordReader.java:132)
                at java.util.PriorityQueue.fixUp(PriorityQueue.java:544)
                at java.util.PriorityQueue.offer(PriorityQueue.java:304)
                at java.util.PriorityQueue.add(PriorityQueue.java:327)
                at org.apache.hadoop.mapred.join.CompositeRecordReader.add(CompositeRecordReader.java:138)
                at org.apache.hadoop.mapred.join.Parser$CNode.getRecordReader(Parser.java:419)
                at org.apache.hadoop.mapred.join.CompositeInputFormat.getRecordReader(CompositeInputFormat.java:142)
                at org.apache.hadoop.mapred.join.CompositeInputFormat.getRecordReader(CompositeInputFormat.java:49)
                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:211)
                at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

        I think this occurs because CompositeRecordReader#add() [Hadoop 0.17.0, line 138] doesn't call rr.hasNext() to check if the RecordReader has any records before it adds it to the PriorityQueue. Is this a bug or expected behaviour?


          People

          • Assignee: Chris Douglas
          • Reporter: Jingkei Ly
          • Votes: 0
          • Watchers: 0

            Dates

            • Created:
              Updated:
              Resolved:
