IMPALA / IMPALA-3173 Reduce catalog's memory footprint / IMPALA-3653

Consider using the listLocatedStatus() API to get file status and block locations in one RPC call

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Impala 2.5.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Catalog

      Description

      Currently, Impala needs to make 2-3 RPC calls to load the HDFS block metadata for each file.
      DistributedFileSystem.listLocatedStatus() has been available for a long time, and with https://issues.apache.org/jira/browse/HDFS-8887 (already backported to CDH 5.5) this API can return the status and block locations of all files under a directory in one RPC call. That can greatly reduce HDFS metadata loading time: for example, for a directory with 200K files, the metadata loading time dropped from 40s to under 15s.
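
      A minimal sketch of the call (not Impala's actual loader; the table path below is a hypothetical placeholder), showing how a single listLocatedStatus() returns both the FileStatus and the BlockLocations for every file in a directory:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.BlockLocation;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.LocatedFileStatus;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.fs.RemoteIterator;

          public class ListLocatedStatusDemo {
            public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              Path dir = new Path("/warehouse/some_table"); // hypothetical path
              // Status and block locations arrive together, so no per-file
              // getFileBlockLocations() round trip is needed.
              RemoteIterator<LocatedFileStatus> it = fs.listLocatedStatus(dir);
              while (it.hasNext()) {
                LocatedFileStatus stat = it.next();
                for (BlockLocation loc : stat.getBlockLocations()) {
                  System.out.println(stat.getPath() + " hosts=" +
                      String.join(",", loc.getHosts()));
                }
              }
            }
          }

      (For very large directories the iterator may fetch results in several batches, i.e. one RPC per batch, but that is still far fewer calls than one per file.)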

      One concern is memory usage. A storageID is a UUID string and a diskID is an int32, and this information is needed for each replica. If we simply store the storageID in the catalog metadata, it will increase both the catalog metadata size and the thrift object size (impacting catalog topic updates and plan fragments). We should do the mapping in the frontend to make sure memory usage does not increase.
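
      A sketch of that mapping idea (the class and method names are hypothetical, not Impala code): intern each storageID into a small per-host integer so the catalog stores an int per replica instead of a ~36-character UUID string.

          import java.util.HashMap;
          import java.util.Map;

          // Not thread-safe; a sketch of the interning idea only.
          class PerHostDiskIndex {
            // host -> (storageID -> small 0-based disk index on that host)
            private final Map<String, Map<String, Integer>> index = new HashMap<>();

            int diskIdFor(String host, String storageId) {
              Map<String, Integer> disks =
                  index.computeIfAbsent(host, h -> new HashMap<>());
              // First time this storageID is seen on this host, assign the
              // next free index; disks.size() is evaluated before insertion.
              return disks.computeIfAbsent(storageId, s -> disks.size());
            }
          }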

      Also, does it make sense to have a global map for the host/TNetworkAddress mapping? Currently Impala keeps one map per HdfsTable; with more tables and more nodes, this can still take quite a bit of memory.
      If we used a global map, we could link the storageID for each host there as well, and all impalads and the catalog would share the same global map key for the host index, so there would be no need to send the full value in thrift update objects or plan fragments.
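
      A minimal sketch of such a global index (hypothetical; as noted above, Impala currently keeps a per-HdfsTable index): hosts are interned once process-wide, so thrift objects could carry a small integer instead of a full address.

          import java.util.ArrayList;
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          class GlobalHostIndex {
            private final Map<String, Integer> hostToId = new HashMap<>();
            private final List<String> idToHost = new ArrayList<>();

            // Returns a stable, process-wide id for this host:port string.
            synchronized int intern(String hostPort) {
              Integer id = hostToId.get(hostPort);
              if (id == null) {
                id = idToHost.size();
                idToHost.add(hostPort);
                hostToId.put(hostPort, id);
              }
              return id;
            }

            synchronized String lookup(int id) { return idToHost.get(id); }
          }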


          Activity

          Juan Yu added a comment -

          FYI, volumeID and its related APIs are removed in HDFS 3.0; a global storage ID is used instead. Better to switch to the new APIs sooner rather than later.

          Alexander Behm added a comment -

          bharath v, did you check out this API call? Is it really just one RPC?

          Laszlo Gaal added a comment -

          Additional details on storageID: it is UUID-based but longer; each spindle has one, declared in the following file:

          [root@vb0220 systest]# cat /data/4/dfs/dn/current/VERSION 
          #Sun Oct 30 09:10:46 PDT 2016
          storageID=DS-37caad0c-0431-4453-8ea5-b45619841e47
          clusterID=cluster2
          cTime=0
          datanodeUuid=c5773580-e3c6-49b8-a4e4-a5b9c1ff6d44
          storageType=DATA_NODE
          layoutVersion=-56
          

          Spindles are mapped to directories like this:

          [root@vb0220 systest]# mount
          /dev/sda2 on / type ext4 (rw,noatime,nodiratime)
          proc on /proc type proc (rw)
          sysfs on /sys type sysfs (rw)
          devpts on /dev/pts type devpts (rw,gid=5,mode=620)
          tmpfs on /dev/shm type tmpfs (rw)
          /dev/sdb1 on /data/1 type ext4 (rw,noatime,nodiratime)
          /dev/sdk1 on /data/10 type ext4 (rw,noatime,nodiratime)
          /dev/sdl1 on /data/11 type ext4 (rw,noatime,nodiratime)
          /dev/sdc1 on /data/2 type ext4 (rw,noatime,nodiratime)
          /dev/sdd1 on /data/3 type ext4 (rw,noatime,nodiratime)
          /dev/sde1 on /data/4 type ext4 (rw,noatime,nodiratime)
          /dev/sdf1 on /data/5 type ext4 (rw,noatime,nodiratime)
          /dev/sdg1 on /data/6 type ext4 (rw,noatime,nodiratime)
          /dev/sdh1 on /data/7 type ext4 (rw,noatime,nodiratime)
          /dev/sdi1 on /data/8 type ext4 (rw,noatime,nodiratime)
          /dev/sdj1 on /data/9 type ext4 (rw,noatime,nodiratime)
          /dev/sda1 on /home type ext4 (rw,noatime,nodiratime)
          /dev/sda5 on /var type ext4 (rw,noatime,nodiratime)
          none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
          cm_processes on /var/run/cloudera-scm-agent/process type tmpfs (rw,mode=0751)
          

          The listings were taken from the perf cluster, host vb0220.halxg.cloudera.com

          bharath v added a comment -

          Per my understanding, the ask here is fixed in https://github.com/apache/incubator-impala/commit/6662c55364b1c429340edc1ffd14323167f7b561

          Juan, please reopen if you think something is missing.

          Alexander Behm added a comment -

          commit 6662c55364b1c429340edc1ffd14323167f7b561
          Author: Bharath Vissapragada <bharathv@cloudera.com>
          Date: Sun Nov 13 22:15:41 2016 -0800

          IMPALA-4172/IMPALA-3653: Improvements to block metadata loading

          This patch improves the block metadata loading (locations and disk
          storage IDs) for partitioned and un-partitioned tables in the Catalog
          server.

          Without this patch:
          ------------------
          We loop through each and every file in the table/partition directories
          and call getFileBlockLocations() on it to obtain the block metadata.
          This results in large number of RPC calls to the Namenode, especially
          for tables with large no. of files/partitions.

          With this patch:
          ---------------
          We move the block metadata querying to use listStatus() call which
          accepts a directory as input and fetches the 'BlockLocation' objects
          for every file recursively in that directory. This improves the
          metadata loading in the following ways.

          • For non-partitioned tables, we query all the BlockLocations in a
            single RPC call in the base table directory and load the corresponding
            disk IDs.
          • For partitioned tables, we query the BlockLocations for all the
            partitions residing under the base table directories in a single RPC
            and then load every partition with non-default partition directory
            separately.
          • REFRESH on a table reloads the block metadata from scratch for
            every data file every time. So it can be used as a replacement for
            invalidate in situations like HDFS block rebalancing which needs
            block metadata update.

          Also, this patch does away with VolumeIds returned by the HDFS NN
          and uses the new StorageIDs returned by the BlockLocation class.
          These StorageIDs are UUID strings and hence are mapped to a
          per-node 0-based index as expected by the backend. In the upcoming
          versions of Hadoop APIs, getFileBlockStorageLocations() is deprecated
          and instead the listStatus() returns BlockLocations with storage IDs
          embedded. This patch makes use of this improvement to reduce an
          additional RPC to the NN to fetch the storage locations.

          Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
          Reviewed-on: http://gerrit.cloudera.org:8080/5148
          Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
          Tested-by: Internal Jenkins
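
          For reference, a hedged sketch (not the actual Impala code) of the recursive approach the commit describes, assuming a Hadoop version with HDFS-8887 so that BlockLocation.getStorageIds() is available; the table path is a placeholder:

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.fs.BlockLocation;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.LocatedFileStatus;
              import org.apache.hadoop.fs.Path;
              import org.apache.hadoop.fs.RemoteIterator;

              public class RecursiveBlockMetadataDemo {
                public static void main(String[] args) throws Exception {
                  FileSystem fs = FileSystem.get(new Configuration());
                  Path tableDir = new Path("/warehouse/some_table"); // placeholder
                  // One recursive listing returns every file's status, its block
                  // locations, and the storage IDs embedded in them.
                  RemoteIterator<LocatedFileStatus> it = fs.listFiles(tableDir, true);
                  while (it.hasNext()) {
                    LocatedFileStatus stat = it.next();
                    for (BlockLocation loc : stat.getBlockLocations()) {
                      // UUID strings; Impala maps these to per-node 0-based ints.
                      System.out.println(stat.getPath() + " storageIds=" +
                          String.join(",", loc.getStorageIds()));
                    }
                  }
                }
              }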


            People

            • Assignee: bharath v
            • Reporter: Juan Yu
            • Votes: 0
            • Watchers: 7