Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
0.17.0
-
None
-
None
-
Reviewed
Description
Problem
The ValuesIterator.next() doesn't return a new object representing the next element.
It is expected that this is the case. Example from java.lang.String source code:
public boolean equals(Object anObject) { if (this == anObject) { return true; } if (anObject instanceof String) { // custom code here } }
Changing the private fields of the object and returning it from next() will make this object fail the equals() test, because this==anObject will always be true.
What is affected
The reduce() method presents (a subclass of) ValuesIterator to the user. So every user-defined class can be affected.
Manifestations of this bug in 0.17.0:
- The Text class checks for equality similarly to String.equals(), which is shown above.
- The contrib/data_join breaks because it stores tags in a Map. The behavior of the next() method makes Object.equals() be true for all tags.
JUnit test
Patch against hadoop-0.17.0:
Index: src/test/org/apache/hadoop/mapred/TestReduceTask.java =================================================================== --- src/test/org/apache/hadoop/mapred/TestReduceTask.java (revision 665935) +++ src/test/org/apache/hadoop/mapred/TestReduceTask.java (working copy) @@ -122,4 +122,52 @@ runValueIterator(tmpDir, testCase, conf); } } + + public void runValueIterator2(Path tmpDir, Pair[] vals, + Configuration conf) throws IOException { + FileSystem fs = tmpDir.getFileSystem(conf); + Path path = new Path(tmpDir, "data.in"); + SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, + Text.class, + Text.class); + for(Pair p: vals) { + writer.append(new Text(p.key), new Text(p.value)); + } + writer.close(); + SequenceFile.Sorter sorter = new SequenceFile.Sorter(fs, Text.class, + Text.class, conf); + SequenceFile.Sorter.RawKeyValueIterator rawItr = + sorter.merge(new Path[]{path}, false, tmpDir); + @SuppressWarnings("unchecked") // WritableComparators are not generic + ReduceTask.ValuesIterator<Text,Text> valItr = + new ReduceTask.ValuesIterator<Text,Text>(rawItr, + WritableComparator.get(Text.class), Text.class, Text.class, + conf, new NullProgress()); + while (valItr.more()) { + Object key = valItr.getKey(); + String keyString = key.toString(); + // make sure it matches! + assertEquals(vals[0].key, keyString); + // must have at least 1 value! + assertTrue(valItr.hasNext()); + Text value1 = valItr.next(); + // must have at least 2 values! + assertTrue(valItr.hasNext()); + Text value2 = valItr.next(); + // do test + assertNotSame(value1, value2); + assertTrue(!value1.equals(value2)); + assertTrue(!value2.equals(value1)); + // make sure the key hasn't changed under the hood + assertEquals(keyString, valItr.getKey().toString()); + valItr.nextKey(); + } + } + + public void testValueIterator2() throws Exception { + Path tmpDir = new Path("build/test/test.reduce.task"); + Configuration conf = new Configuration(); + Pair[] test = new Pair[]{ new Pair("k1", "v1"), new Pair("k1", "v2") }; + runValueIterator2(tmpDir, test, conf); + } }
A program with a bug
I would imagine that a programmer would be really confused when everything is equal in the example below, for any text input:
package test; import java.io.*; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.util.*; import org.apache.hadoop.mapred.*; import org.apache.hadoop.mapred.lib.*; public class TestNewReducer extends Configured implements Tool { public int run(String[] args) throws Exception { if (args.length < 2) { System.out.println("TestNewReducer <inDir> <outDir>"); return -1; } JobConf conf = new JobConf(getConf(), TestNewReducer.class); conf.setJobName("test-newreducer"); conf.setInputFormat(TextInputFormat.class); conf.setMapperClass(Map.class); conf.setReducerClass(Reduce.class); conf.setOutputFormat(TextOutputFormat.class); conf.setOutputKeyClass(LongWritable.class); conf.setOutputValueClass(Text.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); return 0; } public static void main(String[] args) throws Exception { int res = ToolRunner.run(new Configuration(), new TestNewReducer(), args); System.exit(res); } public static class Map extends MapReduceBase implements Mapper <LongWritable, Text, LongWritable, Text> { private final static LongWritable nill = new LongWritable(1); public void map (LongWritable key, Text value, OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException { output.collect(nill, value); } } public static class Reduce extends MapReduceBase implements Reducer<LongWritable, Text, LongWritable, Text> { private final static LongWritable nill = new LongWritable(1); public void reduce(LongWritable key, Iterator<Text> values, OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException { Text t1 = null; if (values.hasNext()) t1 = values.next(); // t1 is the first element in values while(values.hasNext()) { // is this element equal to t1? Text t = values.next(); output.collect(nill, t.equals(t1) ? new Text("Equal") : new Text("Not equal")); } } } }
Possible fixes
- Return a new object each time for next(). This might have significant overhead.
- Return a new object for unknown object types and reuse the same object for known types (like Text). Remove "if (this==anObject) return true;" check from all equals() methods for known objects.
- Document clearly that all user-defined classes must implement an equals() method, which doesn't do the "if (this==anObject) return true;" check (ie. push the problem to the user).