  1. Apache Tez
  2. TEZ-3391

Optimize single split MR split reader

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0, 0.9.3
    • Component/s: None
    • Labels: None

    Description

      During initialization, each task creates an array of TaskSplitMetaInfo objects (TaskSplitMetaInfo[]) covering every split. This is unnecessary space and time overhead, because each task needs only the split object that corresponds to it. Besides having n^2 space complexity, the current implementation also leaks the input stream.

      We need to optimize that implementation by returning only a single object instead of an entire array. 

      Rohini suggested the following:

      In the vertex, construct a TaskSplitMetaInfo only for the split of that task instead of constructing one for every split, i.e. change
      public static TaskSplitMetaInfo[] readSplitMetaInfo(Configuration conf, FileSystem fs)
      to
      public static TaskSplitMetaInfo getSplitMetaInfo(Configuration conf, FileSystem fs, int index)
      and skip over the splits below the index, stopping once the requested split has been read. If there are 1000 splits, the first task will read 1 split, the second task will read 2 splits, and so on, instead of each task reading all 1000 splits as happens now.
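
      The suggestion can be sketched roughly as below. This is a minimal illustration and not the attached patch: the class SplitMetaSketch, the SplitMeta type, and the file layout (an int count followed by fixed-size records of two longs) are assumptions made for the example and do not reflect the real split meta-info file format or the Hadoop/Tez APIs. It shows two points from the description: only the records up to the requested index are consumed, and the stream is closed through try-with-resources so it cannot leak.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class SplitMetaSketch {

    /** Hypothetical, simplified stand-in for TaskSplitMetaInfo. */
    static final class SplitMeta {
        final long startOffset;
        final long inputLength;

        SplitMeta(long startOffset, long inputLength) {
            this.startOffset = startOffset;
            this.inputLength = inputLength;
        }
    }

    /** Builds a hypothetical meta file: an int count, then one (long, long) record per split. */
    static byte[] writeMetaFile(long[][] splits) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(splits.length);
            for (long[] split : splits) {
                out.writeLong(split[0]);
                out.writeLong(split[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns only the record at the given index; later records are never read. */
    static SplitMeta getSplitMetaInfo(InputStream raw, int index) {
        // try-with-resources closes the stream on every path, including errors.
        try (DataInputStream in = new DataInputStream(raw)) {
            int numSplits = in.readInt();
            if (index < 0 || index >= numSplits) {
                throw new IllegalArgumentException(
                        "index " + index + " out of range, numSplits=" + numSplits);
            }
            // Consume and discard the records below the requested index.
            for (int i = 0; i < index; i++) {
                in.readLong();
                in.readLong();
            }
            // Materialize a single object instead of an array of numSplits objects.
            return new SplitMeta(in.readLong(), in.readLong());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

      Under these assumptions, the task with index i reads i + 1 records and allocates one object. Across n tasks that is n(n+1)/2 record reads in total rather than n^2, and each task holds one split object instead of n.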

      Attachments

        1. TEZ-3391.002.patch
          8 kB
          Ahmed Hussein
        2. TEZ-3391.001.patch
          9 kB
          Ahmed Hussein

        Activity

          People

            ahussein Ahmed Hussein
            rohini Rohini Palaniswamy
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated:
              Resolved: