Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.16.2
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None
    • Environment: All
    • Hadoop Flags: Reviewed
    • Release Note:
      Added org.apache.hadoop.mapred.lib.NLineInputFormat, which splits N lines of input as one split. N can be specified by the configuration property "mapred.line.input.format.linespermap", which defaults to 1.
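
      A minimal usage sketch against the old mapred API (the driver class MyJob and the value 10 are hypothetical placeholders, not part of the patch):

          JobConf conf = new JobConf(MyJob.class);  // MyJob is a placeholder driver class
          conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
          // Optional: override the default of one line per map task.
          conf.setInt("mapred.line.input.format.linespermap", 10);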

      Description

      In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters.
      (Referred to as "parameter sweeps").

      One way to achieve this is to specify a set of parameters (one set per line) as input in a control file (which is the input path to the map-reduce application, whereas the input dataset is specified via a config variable in JobConf).

      It would be great to have an InputFormat that splits the input file such that, by default, one line is fed as the value to one map task, and the key could be the line number, i.e. (k, v) is (LongWritable, Text).

      If the user specifies the number of maps explicitly, each mapper should get a contiguous chunk of lines (so as to load-balance between the mappers).

      The location hints for the splits should not be derived from the input file, but rather, should span the whole mapred cluster.

      (Is there a way to do this without having to return an array of nSplits*nTaskTrackers ?)

      Increasing the replication of the "real" input dataset (since it will be fetched by all the nodes) is orthogonal, and one can use DistributedCache for that.
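
      For illustration, a hedged sketch of the DistributedCache route (the dataset URI is a placeholder):

          import java.net.URI;
          import org.apache.hadoop.filecache.DistributedCache;
          import org.apache.hadoop.mapred.JobConf;

          JobConf conf = new JobConf();
          // Ship the shared "real" dataset to every task node once,
          // instead of having every map fetch it from HDFS independently.
          DistributedCache.addCacheFile(new URI("hdfs://namenode/data/dataset.txt#dataset"), conf);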

      (P.S. Please choose a better name for this InputFormat. I am not in love with the "LineBasedText" name.)

      1. patch-3221-2.txt
        10 kB
        Amareshwari Sriramadasu
      2. patch-3221-1.txt
        13 kB
        Amareshwari Sriramadasu
      3. patch-3221.txt
        13 kB
        Amareshwari Sriramadasu

        Activity

        Arkady Borkovsky added a comment -

        I think we already have it in 0.16.2 as OneLineInputFormat

        Milind Bhandarkar added a comment -

        Arkady,

        OneLineInputFormat is not checked into Hadoop yet. I created this JIRA so that we can make the necessary modifications (e.g. location hints, adjusting to numMappers, etc.) and contribute it.

        Amareshwari Sriramadasu added a comment -

        Here is a design for the proposed LineBasedTextInputFormat:

        We can have an NLineInputFormat that splits the input file such that N lines form one split, where N defaults to 1. N can be derived from the number of maps as total_number_of_lines / number_of_maps.

        The location hints for the splits should not be derived from the input file, but rather should span the whole mapred cluster.

        I think this can be done by returning an empty array from InputSplit.getLocations().

        To make the split contain the actual lines themselves instead of <filename, start-offset, length>, the InputSplit.write and readFields methods can be overridden. A RecordReader should then be implemented to read the contents of the split.

        Thoughts?
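
        A minimal sketch of that idea against the old mapred API (illustrative only; this is not the attached patch):

            import java.io.DataInput;
            import java.io.DataOutput;
            import java.io.IOException;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapred.InputSplit;

            // Hypothetical split that carries its lines inline, so task nodes need no file access.
            public class LineSplit implements InputSplit {
              private Text[] lines = new Text[0];

              public LineSplit() {}                            // required for Writable deserialization
              public LineSplit(Text[] lines) { this.lines = lines; }

              public long getLength() {                        // approximate payload size in bytes
                long len = 0;
                for (Text line : lines) len += line.getLength();
                return len;
              }

              public String[] getLocations() {                 // empty array: no placement hints
                return new String[0];
              }

              public void write(DataOutput out) throws IOException {
                out.writeInt(lines.length);
                for (Text line : lines) line.write(out);
              }

              public void readFields(DataInput in) throws IOException {
                lines = new Text[in.readInt()];
                for (int i = 0; i < lines.length; i++) {
                  lines[i] = new Text();
                  lines[i].readFields(in);
                }
              }
            }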

        Amareshwari Sriramadasu added a comment -

        Here is a patch implementing this in the following way:
        1. Adds classes org.apache.hadoop.mapred.LineSplit, org.apache.hadoop.mapred.NLineInputFormat and
        org.apache.hadoop.mapred.NLineInputFormat.NLineRecordReader, plus a test, org.apache.hadoop.mapred.TestLineInputFormat.

        2. LineSplit implements InputSplit. LineSplit holds the number of lines and the lines themselves, so that all the mappers do not have to fetch the same file simultaneously.

        3. NLineInputFormat extends FileInputFormat. It splits the input into N lines per split, where N is derived from the number of map tasks specified (through mapred.map.tasks in JobConf). The value of N defaults to 1.

        4. NLineRecordReader reads one line at a time from the LineSplit. The (key, value) is (LongWritable, Text), where the key is the line number and the value is the line.

        Thoughts?

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12381742/patch-3221.txt
        against trunk revision 654315.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 3 new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2437/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2437/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2437/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2437/console

        This message is automatically generated.

        Amareshwari Sriramadasu added a comment -

        Fixed findbugs warnings

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12381750/patch-3221-1.txt
        against trunk revision 654315.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2439/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2439/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2439/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2439/console

        This message is automatically generated.

        Chris Douglas added a comment -

        This implements something slightly different from the requirements as stated, i.e. it takes input file(s) and encodes each line (or a subset of lines) as a split, rather than specifying a partition of a resource with one split per line. This has some clear advantages for the issue at hand, i.e. one map per line of text, where a vanilla FileSplit is likely as large (path + offsets + locations) as the relevant line of text, and placement avoids being misled.

        That said, slurping all the input files and writing their contents into the splits may not be the best approach. The result is likely to be close to guessing even offsets into each input (without reading each file), and while there's a possible space savings if both the line length and N are small, it's close enough that the value added may not distinguish it from an InputFormat returning closely cropped FileSplits, stripped of locations. The use and purpose of this new InputFormat might be clearer (though not what this patch implements) if one set a property that governs how many lines are in each split (defaulting to 1).* Since the JobTracker has to read in all the splits (and hold them in memory for the duration of the job), limiting the size of the file the user points this at would be a good idea (via a property that, if said user felt daring or malicious, he could cast off). If you felt daring, you could even mix stripped-down FileSplits with LineSplits based on the length of each section, since the classname of each split is encoded into job.splits.

        A few nits:

        • This should be in o.a.h.mapred.lib, not o.a.h.mapred
        • Since the map expects Text, LineSplit might as well keep Text[] rather than String[]
        • It might be worthwhile to use LineRecordReader instead of InputStreamReader
        • I'm fairly certain that "line number" should not be local to the split, but either the line number in the original input file or an offset into that file.

        * Semantically, it's not clear how to regard files with a number of lines not evenly divisible by N; the current patch would group lines from different files into the same split, which might not be what users would expect, but the particular choice is not critical as long as it's documented.

        Devaraj Das added a comment -

        I agree with Chris that the JobTracker shouldn't load the lines into memory. I think we should make this work with FileSplit (minus the locations info). A pass over the input files containing the lines will tell us how many lines there are. The number of maps that the user desires will give us the number of lines per map (goalsize). The offsets in the input files can then be derived in a second pass over the input files (with the pass breaking at file boundaries just like the FileSplit case). Would this satisfy the requirements?

        Chris Douglas added a comment -

        A pass over the input files containing the lines will tell us how many lines there are. The number of maps that the user desires will give us the number of lines per map (goalsize). The offsets in the input files can then be derived in a second pass over the input files (with the pass breaking at file boundaries just like the FileSplit case).

        For applications with one map per line of text (depressingly many, particularly for prototypes and research projects), the approach this patch takes makes some sense. For a line length of 40 to 100 characters, a FileSplit - even sans location information - is likely no smaller than the data it describes. Given this potential advantage, there are at least two cases that this implementation includes that work against that model. The first, obviously, is large files; a property defining the maximum aggregate file size is pretty much required to prevent accidents. The second is specifying the number of maps and getting splits with an even number of lines. That adds little value over the default, since in practice most inputs will have fairly uniform line lengths; the estimates should be very close, so the second pass has limited value. If one wants to use multiple lines per map for load balancing only, then generating splits in the usual way is sufficient, unless the "line number as key" is a requirement and the offset isn't enough.

        The purpose of this class would be much clearer if the user were required to provide N. I think it's OK to read the lines into the splits, as long as the total size is kept low. Ideally, this would mix stripped-down FileSplits with LineSplits (line literals) based on size, but that's probably overdoing it. It's probably sufficient to add a (starting) line number to LineSplit, add safety checks for maximum input size, and change its behavior to be N lines per split, rather than the current behavior. Thoughts? I think this should satisfy the requirements and, at least to me, clarifies and narrows where this new InputFormat may be used.

        Devaraj Das added a comment -

        I am tending to think that the FileSplit-based approach is the better one. The reasons:
        1) We don't invent brand new input formats. We reuse what exists, and the amount of new code is minimal (at a high level, it seems like only FileInputFormat.getSplits and FileSplit.getLocations need to be overridden).
        2) We are better at handling the cases of large files. Granted, with 1 line per map we might have the same problem with FileSplit. But we could work around that by having a larger N.
        3) We don't make assumptions about the line lengths, etc. Just make one pass over the files and arrive at the splits.

        The only issue is that we might end up in a situation where a couple of datanodes in the cluster become a bottleneck for the split serving. But that could be handled by having a higher replication factor for such files (just like we handle job.jar, etc.).

        Thoughts?

        Chris Douglas added a comment -

        We don't invent brand new input formats. We reuse what exists and the amount of new code is minimal

        Which is why this would reuse LineRecordReader to handle compression for the split generation, etc.

        We are better at handling the cases of large files. Granted that with 1 line per map, we might have the same problem with FileSplit. But we could work around that by having a larger N.

        That's why this was requested. Our model handles large files, but users want to create maps initialized with a handful of parameters defined in a text file and executed at arbitrary points on the cluster. I'm skeptical of this model, but it's an idiom used often enough to justify a new InputFormat. It only makes sense when N is small (in practice, N=1 most of the time) and specified by the user and when the file is small. The existing code covers the other cases.

        The only issue is that we might end up in a situation where a couple of datanodes in the cluster becomes a bottleneck for the split serving

        That's not likely to be a bottleneck for these jobs. The optimization isn't just for split serving, but also potentially for the size of the split. Doing this with FileSplits sans locations will probably end up with an average 70-120 bytes per split, right? If the lines are shorter, then embedding them in the split is a win. If it's within 10-20% of that size, it's probably still worth doing. It becomes less attractive as it converges to the cases we already cover.

        We don't make assumptions about the line lengths, etc. Just make one pass over the files and arrive at the splits.

        Both require a pass for the line numbers, if that's a requirement.

        A lot seems to hinge on this. If it is a requirement that the path be included, then there's no longer any real advantage to embedding the line with the split. If users don't need that context, then there are some potential advantages to the core approach in the current patch.

        Milind Bhandarkar added a comment -

        The FileSplit approach will work (although the replication factor for the parameter list has to be increased to 10, similar to job.jar), as Devaraj describes it. Each map should get exactly one line, no more, no less. So the file offsets in the split have to be exact for that case (not file-length / 80 or something). Having exact offsets pointing to each \n will make LineRecordReader reusable in this case, right? The unit test needs to test this. The current OneLineInputFormat that Lohit built uses this approach, and users have been happy with it.

        In the case of N lines per mapper, the same approach should work, but it will require two passes over the input file: first to calculate the number of lines, and then to compute the splits. If the number of lines is not divisible by the number of mappers, it's OK to have the last mapper consume fewer lines (although dividing the slack among more than one mapper would be better).

        Milind Bhandarkar added a comment -

        Just talked with Amareshwari. She suggested specifying the number of lines per mapper as a configuration variable that defaults to 1. The name of the config variable could be: mapred.line.input.format.linespermap.

        With this, the splits could be computed in a single pass over the parameter file (input file).

        This is a better approach, IMHO. Since the parameter file is small, the user could easily do:

        hadoop dfs -cat /path/to/param/list | wc -l

        And do the necessary calculations before specifying the config variable. Or just let it default to one.

        Amareshwari Sriramadasu added a comment - - edited

        Here is a patch adding org.apache.hadoop.mapred.lib.NLineInputFormat, which splits N lines of the input file into one split.
        N is specified using the config variable "mapred.line.input.format.linespermap" and defaults to 1. In files with a number of lines not evenly divisible by N, the last split constructed from that file will have fewer than N lines.

        NLineInputFormat constructs FileSplits containing N lines each, and uses LineRecordReader to read the lines from the split.
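
        A sketch of a single-pass getSplits consistent with that description (simplified and hedged; the committed code may differ in details):

            import java.io.IOException;
            import java.util.ArrayList;
            import java.util.List;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapred.FileSplit;
            import org.apache.hadoop.mapred.InputSplit;
            import org.apache.hadoop.mapred.JobConf;
            import org.apache.hadoop.mapred.LineRecordReader;

            public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
              int n = job.getInt("mapred.line.input.format.linespermap", 1);
              List<FileSplit> splits = new ArrayList<FileSplit>();
              for (Path file : getInputPaths(job)) {           // inherited from FileInputFormat
                FileSystem fs = file.getFileSystem(job);
                LineRecordReader.LineReader reader =
                    new LineRecordReader.LineReader(fs.open(file), job);
                try {
                  Text line = new Text();
                  long begin = 0, length = 0;
                  int numLines = 0, bytes;
                  // One pass: cut a FileSplit (with no location hints) every n lines.
                  while ((bytes = reader.readLine(line)) > 0) {
                    length += bytes;
                    if (++numLines == n) {
                      splits.add(new FileSplit(file, begin, length, new String[0]));
                      begin += length;
                      length = 0;
                      numLines = 0;
                    }
                  }
                  // A trailing split may carry fewer than n lines.
                  if (numLines != 0) {
                    splits.add(new FileSplit(file, begin, length, new String[0]));
                  }
                } finally {
                  reader.close();
                }
              }
              return splits.toArray(new FileSplit[splits.size()]);
            }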

        Lohit Vijayarenu added a comment -

        N is specified using the config variable "mapred.line.input.format.linespermap" and defaults to 1. In files with a number of lines not evenly divisible by N, the last split constructed from that file will have fewer than N lines.

        +1 on this approach

        Chris Douglas added a comment -

        +1 Patch looks good.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12382090/patch-3221-2.txt
        against trunk revision 656491.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2477/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2477/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2477/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2477/console

        This message is automatically generated.

        Devaraj Das added a comment -

        I just committed this. Thanks, Amareshwari!

        Hudson added a comment -

        Integrated in Hadoop-trunk #493 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/493/ )

          People

          • Assignee:
            Amareshwari Sriramadasu
          • Reporter:
            Milind Bhandarkar
          • Votes:
            0
          • Watchers:
            2
