Hadoop Common / HADOOP-3221

Need a "LineBasedTextInputFormat"


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.16.2
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels: None
    • Environment: All

    • Hadoop Flags: Reviewed
    • Release Note: Added org.apache.hadoop.mapred.lib.NLineInputFormat, which splits N lines of input as one split. N can be specified by the configuration property "mapred.line.input.format.linespermap", which defaults to 1.
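
    The N-lines-per-split behaviour noted above can be illustrated without a Hadoop dependency. What follows is a minimal sketch, not the code from the attached patches: the class name NLineSplitSketch and the method splitByLines are hypothetical, and the control file is modelled as an in-memory list rather than a file on HDFS. It groups consecutive lines into splits of at most N lines; N = 1 corresponds to the stated default of the property.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not the NLineInputFormat implementation.
final class NLineSplitSketch {

    // Groups consecutive lines into splits of at most linesPerSplit lines each.
    static List<List<String>> splitByLines(List<String> lines, int linesPerSplit) {
        if (linesPerSplit < 1) {
            throw new IllegalArgumentException("linesPerSplit must be >= 1");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerSplit) {
            int end = Math.min(start + linesPerSplit, lines.size());
            // Copy the sub-list so each split is independent of the input list.
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Each line stands for one parameter set in a control file (made-up values).
        List<String> control = Arrays.asList("alpha=0.1", "alpha=0.2", "alpha=0.3");
        // With N = 1, every parameter line becomes its own split.
        System.out.println(splitByLines(control, 1));
        // With N = 2, the last split holds the single remaining line.
        System.out.println(splitByLines(control, 2));
    }
}
```

    In this sketch the final split may be shorter than N when the number of lines is not a multiple of N.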

    Description

      In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters.
      (These are referred to as "parameter sweeps".)

      One way to achieve this is to specify a set of parameters (one set per line) as input in a control file. The control file is the input path to the map-reduce application, whereas the input dataset is specified via a config variable in JobConf.

      It would be great to have an InputFormat that splits the input file such that, by default, one line is fed as a value to one map task, and the key could be the line number, i.e. (k,v) is (LongWritable, Text).

      If the user specifies the number of maps explicitly, each mapper should get a contiguous chunk of lines (so as to balance the load between the mappers).
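
      The issue does not say how the chunk boundaries should be chosen. One possible scheme, sketched below under that assumption, gives every mapper either floor(lines/maps) or one more line, so chunk sizes differ by at most one. The class name ContiguousChunkSketch and the method chunkRanges are hypothetical and do not come from the patches.

```java
import java.util.Arrays;

// Hypothetical illustration of balanced, contiguous line ranges per mapper.
final class ContiguousChunkSketch {

    // Returns numMaps ranges as {startLine, endLineExclusive}, covering
    // lines 0..totalLines-1 in order, with sizes differing by at most one.
    static long[][] chunkRanges(long totalLines, int numMaps) {
        if (numMaps < 1 || totalLines < 0) {
            throw new IllegalArgumentException("need numMaps >= 1 and totalLines >= 0");
        }
        long[][] ranges = new long[numMaps][2];
        long base = totalLines / numMaps;   // minimum lines per mapper
        long extra = totalLines % numMaps;  // the first 'extra' mappers get one more line
        long start = 0;
        for (int i = 0; i < numMaps; i++) {
            long size = base + (i < extra ? 1 : 0);
            ranges[i][0] = start;
            ranges[i][1] = start + size;
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 10 control-file lines spread over 3 mappers.
        for (long[] range : chunkRanges(10, 3)) {
            System.out.println(Arrays.toString(range));
        }
    }
}
```

      If more maps are requested than there are lines, this sketch returns empty ranges for the surplus mappers; a real InputFormat would more likely drop them.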

      The location hints for the splits should not be derived from the input file, but rather, should span the whole mapred cluster.

      (Is there a way to do this without having to return an array of nSplits*nTaskTrackers?)

      Increasing the replication of the "real" input dataset (since it will be fetched by all the nodes) is orthogonal, and one can use DistributedCache for that.

      (P.S. Please choose a better name for this InputFormat. I am not in love with the "LineBasedText" name.)

      Attachments

        1. patch-3221-2.txt
          10 kB
          Amareshwari Sriramadasu
        2. patch-3221-1.txt
          13 kB
          Amareshwari Sriramadasu
        3. patch-3221.txt
          13 kB
          Amareshwari Sriramadasu

          People

            Assignee: Amareshwari Sriramadasu (amareshwari)
            Reporter: Milind Barve (milindb)