Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None

Description

I noticed that I get 150 MB/s when I use the AWS CLI:

aws s3 cp s3://<bucket>/<key> - > /dev/null

vs. 50 MB/s when I use the S3AFileSystem:

hadoop fs -cat s3://<bucket>/<key> > /dev/null

Looking into the AWS CLI code, the download logic is quite clever: it downloads the next several parts in parallel using range requests, then buffers them in memory so it can reorder them and expose a single contiguous stream. I translated that logic to Java and modified the S3AFileSystem to do the same, and I am able to achieve 150 MB/s download speeds as well. It is mostly done, but I have some things to clean up first. The PR is here: https://github.com/palantir/hadoop/pull/47/files

It would be great to get some other eyes on it to see what we need to do to get it merged.
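
To make the approach concrete, here is a minimal sketch of the idea, assuming the AWS SDK for Java v1 (which S3A uses). This is not the code from the PR; ParallelRangeDownload, PART_SIZE, and PARALLELISM are illustrative names and values, not actual S3A classes or configuration keys. It keeps a fixed number of ranged GETs in flight and drains completed parts in submission order, so the caller sees one contiguous stream.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectInputStream;
import com.amazonaws.util.IOUtils;

import java.io.OutputStream;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRangeDownload {
    private static final long PART_SIZE = 8L * 1024 * 1024; // 8 MB per range request
    private static final int PARALLELISM = 4;               // parts fetched concurrently

    /** Streams s3://bucket/key to out, fetching up to PARALLELISM parts at a time. */
    public static void download(AmazonS3 s3, String bucket, String key,
                                long contentLength, OutputStream out) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(PARALLELISM);
        Queue<Future<byte[]>> pending = new ArrayDeque<>();
        try {
            long submitted = 0;
            while (submitted < contentLength || !pending.isEmpty()) {
                // Keep up to PARALLELISM ranged GETs in flight.
                while (submitted < contentLength && pending.size() < PARALLELISM) {
                    final long start = submitted;
                    final long end = Math.min(start + PART_SIZE, contentLength) - 1;
                    submitted = end + 1;
                    pending.add(pool.submit(() -> {
                        GetObjectRequest req = new GetObjectRequest(bucket, key)
                                .withRange(start, end); // HTTP Range GET for one part
                        try (S3ObjectInputStream in = s3.getObject(req).getObjectContent()) {
                            return IOUtils.toByteArray(in);
                        }
                    }));
                }
                // The queue head is the next part in byte order; block until it
                // arrives, then write it so the output stays contiguous.
                out.write(pending.remove().get());
            }
        } finally {
            pool.shutdownNow();
        }
    }
}

With these illustrative values, at most PARALLELISM * PART_SIZE = 32 MB is buffered at once; the two knobs trade memory for throughput, which is presumably what the PR's configuration exposes.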

Attachments

1. seek-logs-parquet.txt (6 kB, Justin Uang)
2. HADOOP-16132.005.patch (54 kB, Justin Uang)
3. HADOOP-16132.004.patch (54 kB, Justin Uang)
4. HADOOP-16132.003.patch (52 kB, Justin Uang)
5. HADOOP-16132.002.patch (44 kB, Justin Uang)
6. HADOOP-16132.001.patch (44 kB, Justin Uang)

People

• Assignee: Unassigned
• Reporter: Justin Uang (justin.uang)
• Votes: 1
• Watchers: 10
