Hadoop Map/Reduce
MAPREDUCE-252

Create an InputFormat for reading lines of text as Java Strings

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Such a StringInputFormat would be like TextInputFormat but with input types of Long and String, rather than LongWritable and Text. This would allow users to write MapReduce programs that used only Java native types (i.e. no Writables).

      This is currently not possible without changes to Hadoop, due to a limitation in the RecordReader interface explained here: https://issues.apache.org/jira/browse/HADOOP-3413?focusedCommentId=12597935#action_12597935
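      For context, the obstacle is the old-API reader contract, which looks roughly like this (a sketch of org.apache.hadoop.mapred.RecordReader as it stood at the time, not part of any patch here):

          public interface RecordReader<K, V> {
            // The caller allocates key/value objects via createKey()/createValue()
            // and next() fills them in place. Immutable types such as String and
            // Long cannot be mutated this way, hence the limitation above.
            boolean next(K key, V value) throws IOException;
            K createKey();
            V createValue();
            long getPos() throws IOException;
            float getProgress() throws IOException;
            void close() throws IOException;
          }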

        Attachments

      1. hadoop-3566-v4.patch
        12 kB
        Tom White
      2. hadoop-3566-v3.patch
        14 kB
        Tom White
      3. hadoop-3566-v2.patch
        14 kB
        Tom White
      4. hadoop-3566.patch
        19 kB
        Tom White

          Activity

          Allen Wittenauer added a comment -

          Closing this as stale.

          Pete Wyckoff added a comment -

          If we had a LineReader Serialization implementation (HADOOP-4203) that returns Strings and could be plugged into a flat file input format (HADOOP-4065), you would have this InputFormat, albeit with the important problem that the signature would be <LongWritable, String>.

          Could we not modify HADOOP-4065 to accommodate this use case, or am I missing something?

          Tom White added a comment -

          Tentative patch that uses the new API being developed in HADOOP-1230. It's not ready to be used until HADOOP-1230 is done.

          Tom White added a comment -

          The patch should be updated to use the new API in org.apache.hadoop.mapreduce which has a RecordReader that is compatible with this approach, so there is no need to introduce NewInstanceRecordReader.
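          For reference, that new-API reader has roughly this shape (a sketch of org.apache.hadoop.mapreduce.RecordReader; the reader owns the current record, so immutable types work):

              public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
                public abstract void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException;
                // Advance to the next record, then expose it; no caller-allocated objects.
                public abstract boolean nextKeyValue() throws IOException, InterruptedException;
                public abstract KEYIN getCurrentKey() throws IOException, InterruptedException;
                public abstract VALUEIN getCurrentValue() throws IOException, InterruptedException;
                public abstract float getProgress() throws IOException, InterruptedException;
                public abstract void close() throws IOException;
              }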

          Tom White added a comment -

          "Now that StringInputFormat is a FileInputFormat<String, Void>, wouldn't FileInputFormat<Void, String> be more appropriate?"

          By making it <String, Void> you get the sort that Owen mentioned.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12386380/hadoop-3566-v3.patch
          against trunk revision 677872.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2902/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2902/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2902/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2902/console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

          Now that StringInputFormat is a FileInputFormat<String, Void>, wouldn't FileInputFormat<Void, String> be more appropriate?

          Tom White added a comment -

          Merged with trunk.

          Tom White added a comment -

          A new patch with Owen's suggestions. StringInputFormat is now a FileInputFormat<String, Void>, so keys are the lines of text and values are nulls.
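          A hypothetical consumer of that signature might look like this (illustrative only; LineLengthMapper is not from the patch):

              // Keys are the lines themselves; the Void value is always null.
              static class LineLengthMapper extends MapReduceBase
                  implements Mapper<String, Void, String, Long> {
                public void map(String line, Void ignored,
                    OutputCollector<String, Long> output, Reporter reporter)
                    throws IOException {
                  output.collect(line, (long) line.length());
                }
              }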

          Owen O'Malley added a comment -

          Could we use

             K getKey() throws IOException;
             V getValue() throws IOException;
          

          We should probably change the context objects for InputFormats this way too.

          The whole Long, String signature for TextInputFormat has always kind of annoyed me. No one ever uses those offsets, and they mean that a TextInputFormat, IdentityMapper, IdentityReducer, TextOutputFormat pipeline doesn't do the "sort" that people expect.
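          To make that concrete, an illustrative sketch of the classic identity job (not from this issue's patches):

              JobConf conf = new JobConf();
              conf.setInputFormat(TextInputFormat.class);   // map keys are LongWritable byte offsets
              conf.setMapperClass(IdentityMapper.class);    // passes (offset, line) straight through
              conf.setReducerClass(IdentityReducer.class);
              conf.setOutputKeyClass(LongWritable.class);
              conf.setOutputValueClass(Text.class);
              conf.setOutputFormat(TextOutputFormat.class);
              // The shuffle sorts on the offset keys, not the Text lines, so the
              // output comes back in file order, not lexicographic line order.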

          Tom White added a comment -

          Patch implementing the proposal. The unit test TestJavaSerialization is now Writable-free. So you can write mappers and reducers like this:

              static class WordCountMapper extends MapReduceBase
                  implements Mapper<Long, String, String, Long> {

                public void map(Long key, String value,
                    OutputCollector<String, Long> output, Reporter reporter)
                    throws IOException {
                  StringTokenizer st = new StringTokenizer(value);
                  while (st.hasMoreTokens()) {
                    output.collect(st.nextToken(), 1L);
                  }
                }
              }

              static class SumReducer<K> extends MapReduceBase
                  implements Reducer<K, Long, K, Long> {

                public void reduce(K key, Iterator<Long> values,
                    OutputCollector<K, Long> output, Reporter reporter)
                    throws IOException {
                  long sum = 0;
                  while (values.hasNext()) {
                    sum += values.next();
                  }
                  output.collect(key, sum);
                }
              }
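          For completeness, a hedged sketch of how such a job might be wired up, assuming the patch's StringInputFormat plus the pluggable serialization framework from HADOOP-1986 (the wiring is illustrative; the patch may differ in detail):

              JobConf conf = new JobConf();
              // Register plain-Java serialization alongside the default Writable one.
              conf.setStrings("io.serializations",
                  "org.apache.hadoop.io.serializer.JavaSerialization",
                  "org.apache.hadoop.io.serializer.WritableSerialization");
              conf.setInputFormat(StringInputFormat.class);  // from the attached patch
              conf.setMapperClass(WordCountMapper.class);
              conf.setReducerClass(SumReducer.class);
              conf.setOutputKeyClass(String.class);
              conf.setOutputValueClass(Long.class);
              JobClient.runJob(conf);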
          
          Tom White added a comment -

          I propose a subinterface of RecordReader that supports retrieval of the current key and value:

          public interface NewInstanceRecordReader<K, V> extends RecordReader<K, V> {
            
            K getCurrentKey();
            
            V getCurrentValue();
          }
          

          With this change, the only framework changes needed would be in MapTask, to support two types of TrackedRecordReader, and in MapRunner, to check the type of the RecordReader and retrieve the current key and value if necessary.
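          A sketch of the kind of check MapRunner might make (illustrative; the variable names are hypothetical):

              K1 key = input.createKey();
              V1 value = input.createValue();
              if (input instanceof NewInstanceRecordReader) {
                NewInstanceRecordReader<K1, V1> reader =
                    (NewInstanceRecordReader<K1, V1>) input;
                while (input.next(key, value)) {
                  // Ignore the caller-allocated objects and hand the mapper the
                  // fresh immutable instances held by the reader.
                  mapper.map(reader.getCurrentKey(), reader.getCurrentValue(),
                      output, reporter);
                }
              } else {
                while (input.next(key, value)) {
                  mapper.map(key, value, output, reporter);
                }
              }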

          Longer term, in HADOOP-1230, we could simply add these new methods to the new RecordReader interface (which will be in a new package, org.apache.hadoop.mapreduce).

          Thoughts?


            People

            • Assignee: Tom White
            • Reporter: Tom White
            • Votes: 0
            • Watchers: 7
