  1. Hadoop Common
  2. HADOOP-15471

Hdfs recursive listing operation is very slow

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.1
    • Fix Version/s: 2.7.1
    • Component/s: fs
    • Labels:
      None
    • Environment:

      HCFS file system where HDP 2.6.1 is connected to ECS (Object Store).

    • Target Version/s:

      Description

      The hdfs dfs -ls -R command walks the directory tree sequentially, which is very slow on an HCFS file system. We have seen a recursive listing of a tree of about 40K directories and files take around 6 minutes.

      The proposal is to use a multithreaded approach to speed up the recursive list, du and count operations.

      We have tried a ForkJoinPool implementation to improve the performance of the recursive listing operation:

      https://github.com/jasoncwik/hadoop-release/tree/parallel-fs-cli

      commit id: 82387c8cd76c2e2761bd7f651122f83d45ae8876
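
      For readers who want a feel for the ForkJoinPool approach, here is a minimal sketch. It is not the code from the branch linked above. It assumes java.nio.file as a local stand-in for the Hadoop FileSystem listing call, and the class name ForkJoinLister and the method listRecursively are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: java.nio.file stands in for the Hadoop FileSystem API.
class ForkJoinLister extends RecursiveTask<List<Path>> {
    private final Path dir;

    ForkJoinLister(Path dir) {
        this.dir = dir;
    }

    @Override
    protected List<Path> compute() {
        List<Path> found = new ArrayList<>();
        List<ForkJoinLister> subtasks = new ArrayList<>();
        // One listing call per directory. On an object store this is the
        // slow remote round trip that is worth overlapping.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                found.add(entry);
                // Do not follow symlinks, to avoid cycles in the walk.
                if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                    ForkJoinLister task = new ForkJoinLister(entry);
                    task.fork(); // list the subdirectory asynchronously
                    subtasks.add(task);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collect the results of every forked subdirectory listing.
        for (ForkJoinLister task : subtasks) {
            found.addAll(task.join());
        }
        return found;
    }

    // Lists everything below root using a pool with the given parallelism.
    static List<Path> listRecursively(Path root, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ForkJoinLister(root));
        } finally {
            pool.shutdown();
        }
    }
}
```

      Entries come back grouped by directory rather than in the order a sequential -ls -R would print them, so a caller that needs stable output would have to sort the result.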

      Another implementation uses the Java ExecutorService to run the listing operations on multiple threads in parallel. This reduced the listing time from around 6 minutes to 40 seconds.
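
      The ExecutorService variant can be sketched as follows. This is an illustration under stated assumptions, not the attached patch: java.nio.file stands in for the Hadoop FileSystem API, the names ExecutorLister, listOne and listRecursively are hypothetical, and the level-by-level traversal is one possible design among several.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: java.nio.file stands in for the Hadoop FileSystem API.
class ExecutorLister {
    // Lists the direct children of one directory (a single listing call).
    static List<Path> listOne(Path dir) throws IOException {
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                children.add(entry);
            }
        }
        return children;
    }

    // Walks the tree one level at a time; all directories on the same
    // level are listed in parallel on a fixed-size thread pool.
    static List<Path> listRecursively(Path root, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Path> found = new ArrayList<>();
        try {
            List<Path> frontier = Collections.singletonList(root);
            while (!frontier.isEmpty()) {
                List<Callable<List<Path>>> jobs = new ArrayList<>();
                for (Path dir : frontier) {
                    jobs.add(() -> listOne(dir));
                }
                List<Path> next = new ArrayList<>();
                // invokeAll blocks until every listing on this level is done.
                for (Future<List<Path>> result : pool.invokeAll(jobs)) {
                    for (Path child : result.get()) {
                        found.add(child);
                        // Do not follow symlinks, to avoid cycles in the walk.
                        if (Files.isDirectory(child, LinkOption.NOFOLLOW_LINKS)) {
                            next.add(child);
                        }
                    }
                }
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
        return found;
    }
}
```

      Submitting one batch per level keeps worker threads from blocking on tasks they spawned themselves, which could otherwise starve a fixed-size pool. The trade-off is that each level waits for its slowest directory before the next level starts.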

        Attachments

        1. parallelfsPatch
          7 kB
          Ajay Sachdev
        2. HDFS-13398.001.patch
          9 kB
          Ajay Sachdev
        3. HDFS-13398.002.patch
          4 kB
          Ajay Sachdev
        4. HDFS-13398.003.patch
          25 kB
          Ajay Sachdev

            People

            • Assignee:
              ajaysachdev Ajay Sachdev
              Reporter:
              ajaysachdev Ajay Sachdev
            • Votes:
              1
              Watchers:
              17

              Dates

              • Created:
                Updated: