  1. Hadoop Map/Reduce
  2. MAPREDUCE-6208

There should be an input format for MapFiles which can be configured so that only a fraction of the input data is used for the MR process

Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved

    Description

      In some cases, large amounts of data are organized in MapFiles, e.g., as the output of previous MapReduce jobs, and only a fraction of that data is to be processed in a given MR job. The current approach, as I understand it, is to reorganize the data into a suitable partitioning using folders on HDFS, use only the relevant folders as input paths, and possibly do some additional filtering in the Map task. However, sometimes the input data cannot easily be partitioned that way. One example is processing large amounts of measured data, where additional data for a time period that is already stored in HDFS arrives later.

      There should be an input format that accepts folders containing MapFiles, with an option to specify an input key range so that InputSplits are generated only for data that falls within that range.
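
      A minimal sketch of the split-selection idea, in plain Java without any Hadoop dependency. It assumes that the first and last key of each MapFile are already known (MapFiles are sorted by key, so these two keys bound the file's contents) and keeps only the files whose key span overlaps the requested range. The names KeyRangeSplitFilter, FileRange and select are hypothetical and are not taken from Hadoop or from the attached patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not part of Hadoop or of the attached patches. */
final class KeyRangeSplitFilter {

    /** Key span covered by one MapFile (hypothetical descriptor). */
    static final class FileRange {
        final String path;
        final String firstKey;
        final String lastKey;

        FileRange(String path, String firstKey, String lastKey) {
            this.path = path;
            this.firstKey = firstKey;
            this.lastKey = lastKey;
        }
    }

    /**
     * Returns the paths of the files whose [firstKey, lastKey] span overlaps
     * the requested inclusive range [minKey, maxKey]. A null bound is open.
     */
    static List<String> select(List<FileRange> files, String minKey, String maxKey) {
        List<String> selected = new ArrayList<>();
        for (FileRange f : files) {
            // The file lies entirely below the requested range.
            boolean endsBeforeRange = minKey != null && f.lastKey.compareTo(minKey) < 0;
            // The file lies entirely above the requested range.
            boolean startsAfterRange = maxKey != null && f.firstKey.compareTo(maxKey) > 0;
            if (!endsBeforeRange && !startsAfterRange) {
                selected.add(f.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Example data: three files keyed by ISO date strings (made up for illustration).
        List<FileRange> files = Arrays.asList(
            new FileRange("part-00000", "2014-01-01", "2014-03-31"),
            new FileRange("part-00001", "2014-04-01", "2014-06-30"),
            new FileRange("part-00002", "2014-07-01", "2014-09-30"));
        // Prints [part-00000, part-00001]
        System.out.println(select(files, "2014-03-15", "2014-05-01"));
    }
}
```

      In a real input format, the selected files would still contain keys outside the range at their edges, so the record reader (or the Map task) would need to skip those records; the sketch only covers the decision of which files produce splits at all.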

      Attachments

        1. MAPREDUCE-6208.001.patch
          27 kB
          Jens Rabe
        2. MAPREDUCE-6208.002.patch
          28 kB
          Jens Rabe

          People

            Assignee: Jens Rabe (rabe-jens)
            Reporter: Jens Rabe (rabe-jens)
            Votes: 0
            Watchers: 1

              Time Tracking

                Original Estimate: 24h
                Remaining Estimate: 24h
                Time Spent: Not Specified