Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.8.0
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels:
      None

      Description

      Currently, localhost is passed as locality for each block, causing all blocks involved in job to initially target the same node (RM), before being moved by the scheduler (to a rack-local node). This reduces parallelism for jobs (with short-lived mappers).

      We should mimic Azures implementation: a config setting fs.s3a.block.location.impersonatedhost where the user can enter the list of hostnames in the cluster to return to getFileBlockLocations.

      Possible optimization: for larger systems, it might be better to return N (5?) random hostnames to prevent passing a huge array (the downstream code assumes size = O(3)).

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Thomas Demoor Thomas Demoor
                Reporter:
                Thomas Demoor Thomas Demoor
              • Votes:
                1 Vote for this issue
                Watchers:
                10 Start watching this issue

                Dates

                • Created:
                  Updated: