  1. Hadoop Common
  2. HADOOP-6502

DistributedFileSystem#listStatus is very slow when listing a directory with about 1300 children

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.23.2
    • Component/s: util
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      Listing a directory with around 1300 children takes hundreds of milliseconds. The slowdown is caused by the change made in HADOOP-4187. listStatus returns an array of FileStatus. When each element of the array is deserialized, ReflectionUtils#newInstance(Class<T>, Configuration) is called, which calls setConf, which in turn calls setJobConf. setJobConf checks whether JobConf is on the class path by calling Configuration#getClassByName. Configuration#getClassByName tries to speed up lookups with a cached map, but because JobConf is not on the class path, it never enters the cache. Every check therefore ends up calling Class.forName, which is very expensive. Deserializing an array of 1300 entries calls Class#forName 1300 times!
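
      To illustrate the cost pattern described above, here is a minimal sketch of a lookup cache that also remembers misses, so a class that is absent from the class path triggers Class.forName only once. This is an assumption about one possible remedy, not the content of the attached patches. The class CachingClassLookup and its methods findClassOrNull and forNameCallCount are hypothetical names, not Hadoop APIs.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of Hadoop): caches class lookups, including failed ones. */
final class CachingClassLookup {
    // Maps class name -> lookup result; Optional.empty() records a known miss.
    private final Map<String, Optional<Class<?>>> cache = new ConcurrentHashMap<>();
    // Counts real Class.forName calls so the saving is observable.
    private final AtomicInteger forNameCalls = new AtomicInteger();

    /** Returns the class, or null if it cannot be found on the class path. */
    Class<?> findClassOrNull(String name) {
        return cache.computeIfAbsent(name, this::load).orElse(null);
    }

    // Performs the expensive lookup; runs at most once per distinct name.
    private Optional<Class<?>> load(String name) {
        forNameCalls.incrementAndGet();
        try {
            Class<?> found = Class.forName(name);
            return Optional.<Class<?>>of(found);
        } catch (ClassNotFoundException e) {
            // Remember the miss so later lookups skip Class.forName.
            return Optional.empty();
        }
    }

    int forNameCallCount() {
        return forNameCalls.get();
    }

    public static void main(String[] args) {
        CachingClassLookup lookup = new CachingClassLookup();
        // Simulate deserializing 1300 entries, each checking for JobConf.
        for (int i = 0; i < 1300; i++) {
            lookup.findClassOrNull("org.apache.hadoop.mapred.JobConf");
        }
        // Prints "Class.forName calls: 1" whether or not JobConf is present.
        System.out.println("Class.forName calls: " + lookup.forNameCallCount());
    }
}
```

      Without the cached miss, the loop in main would call Class.forName 1300 times whenever JobConf is absent, which matches the behavior reported here.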

        Attachments

        1. 6502.patch
          1 kB
          Sharad Agarwal
        2. 6502_v2.patch
          1 kB
          Sharad Agarwal
        3. hadoop-6502-trunk.txt
          1 kB
          Todd Lipcon
        4. hadoop-6502-trunk.txt
          4 kB
          Todd Lipcon

              People

              • Assignee:
                sharadag Sharad Agarwal
                Reporter:
                hairong Hairong Kuang
              • Votes:
                0
                Watchers:
                17

                Dates

                • Created:
                  Updated:
                  Resolved: