Hadoop Map/Reduce / MAPREDUCE-1946

enhance FileInputFormat.setInputPaths()


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.20.2
    • Fix Version/s: None
    • Component/s: job submission
    • Labels: None

    Description

      FileInputFormat.setInputPaths(Job job, Path... inputPaths) can be enhanced in the following three ways:
      1) When the input paths are known only at runtime, we need another overload that accepts a Collection<Path> as the second parameter, e.g. Set<Path> inputPaths.
      2) Use StringBuilder instead of StringBuffer, because StringBuilder doesn't incur synchronization cost.
      3) The biggest performance boost comes from calling the following constructor of StringBuilder:
      public StringBuilder(int capacity)
      capacity can be a third parameter to setInputPaths(). This would avoid excessive calls to Arrays.copyOf().
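
The three suggestions above can be sketched roughly as follows. This is a self-contained illustration, not Hadoop's code: it uses plain String paths in place of org.apache.hadoop.fs.Path so that it runs without Hadoop on the classpath, the names InputPathJoiner, estimateCapacity and joinInputPaths are hypothetical, and it leaves out the escaping of path strings and the write to the job configuration that a real setInputPaths() would need.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

public class InputPathJoiner {

    /** Rough capacity estimate: every path's length plus one separator each. */
    static int estimateCapacity(Collection<String> paths) {
        int capacity = 0;
        for (String path : paths) {
            capacity += path.length() + 1;
        }
        return capacity;
    }

    /**
     * Joins the paths into one comma-separated string.
     * Accepts any Collection (suggestion 1), uses the unsynchronized
     * StringBuilder (suggestion 2), and presizes it with the caller's
     * capacity hint so the buffer rarely has to grow (suggestion 3).
     */
    static String joinInputPaths(Collection<String> paths, int capacity) {
        // Guard against zero or negative hints; 16 is StringBuilder's default.
        StringBuilder sb = new StringBuilder(Math.max(capacity, 16));
        boolean first = true;
        for (String path : paths) {
            if (!first) {
                sb.append(',');
            }
            sb.append(path);
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical paths, collected at runtime into a Set.
        Set<String> paths = new LinkedHashSet<>();
        paths.add("/data/2010/07/01");
        paths.add("/data/2010/07/02");
        System.out.println(joinInputPaths(paths, estimateCapacity(paths)));
    }
}
```

The capacity hint is only an optimization: if it is too small, StringBuilder still grows as needed, so an inaccurate estimate affects speed but not the result.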

      The following stack trace was observed when our code called FileInputFormat.addInputPath() many times because a large number of files were eligible for processing:
      java.lang.Thread.State: RUNNABLE
      at java.util.Arrays.copyOf(Arrays.java:2882)
      at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
      at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
      at java.lang.StringBuilder.append(StringBuilder.java:119)
      at org.apache.hadoop.mapred.FileInputFormat.addInputPath(FileInputFormat.java:330)
      at com.carrieriq.m2m.platform.mmp2.input.PackageInput.configureJobConf(PackageInput.java:336)

      After incorporating all three optimizations, the total time spent in our customized setInputPaths(JobConf conf, Set<Path> inputPaths) was 2 seconds. By comparison, the combined time spent in the repeated FileInputFormat.addInputPath() calls was over 80 minutes.
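
A gap of this size is consistent with quadratic copying. The stack trace shows a StringBuilder being grown inside addInputPath(), which suggests that each call rebuilds the whole accumulated path list. The sketch below is a rough cost model under that assumption, not a measurement and not Hadoop's code. The class AppendCostSketch and its methods are hypothetical, and it counts only the characters in each resulting string, ignoring any other per-call overhead.

```java
public class AppendCostSketch {

    /**
     * Characters written when every call rebuilds "existing + ',' + next",
     * assuming all paths have the same length.
     */
    static long copiedByRepeatedConcat(int pathCount, int pathLength) {
        long copied = 0;
        long currentLength = 0;
        for (int i = 0; i < pathCount; i++) {
            // The first call stores just the path; each later call writes
            // the old value, a comma, and the new path into a fresh string.
            currentLength = (i == 0) ? pathLength : currentLength + 1 + pathLength;
            copied += currentLength;
        }
        return copied;
    }

    /** Characters written when one presized builder receives every path once. */
    static long copiedBySingleBuilder(int pathCount, int pathLength) {
        if (pathCount == 0) {
            return 0;
        }
        return (long) pathCount * pathLength + (pathCount - 1);
    }

    public static void main(String[] args) {
        int pathCount = 100_000; // hypothetical number of input files
        int pathLength = 80;     // hypothetical average path length
        System.out.println("repeated concat: " + copiedByRepeatedConcat(pathCount, pathLength));
        System.out.println("single builder:  " + copiedBySingleBuilder(pathCount, pathLength));
    }
}
```

Under this model the repeated-append cost grows with the square of the number of paths, while the single-builder cost grows linearly.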

          People

            Assignee: Unassigned
            Reporter: Ted Yu (ted_yu)
            Votes: 1
            Watchers: 3