
Loading metadata for partitioned tables is slow due to usage of an ArrayList, potential 4x speedup

    Details

      Description

      Loading metadata for partitions with custom paths is 4x slower than for partitions without custom paths. The slowdown is due to O(N^2) lookups that check whether a partition directory has already been added to the list of directories to load.

      The List (dirsToLoad) should ideally be replaced with a Set, so that each membership check is O(1) on average instead of a linear scan.
      From https://github.com/apache/incubator-impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

        List<Path> dirsToLoad = Lists.newArrayList(tblLocation);
        if (!dirsToLoad.contains(partDir) &&
            !FileSystemUtil.isDescendantPath(partDir, tblLocation)) {
          // This partition has a custom filesystem location. Load its file/block
          // metadata separately by adding it to the list of dirs to load.
          dirsToLoad.add(partDir);
        }
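
      A minimal sketch of the suggested change, not the actual Impala patch. To keep it self-contained, java.net.URI stands in for org.apache.hadoop.fs.Path, and isDescendant() is a hypothetical, simplified stand-in for FileSystemUtil.isDescendantPath(). The class and method names are illustrative only.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: URI stands in for org.apache.hadoop.fs.Path, and isDescendant()
// is a hypothetical stand-in for FileSystemUtil.isDescendantPath().
class DirsToLoadSketch {
  // Simplified prefix check that assumes normalized URIs.
  static boolean isDescendant(URI dir, URI ancestor) {
    String base = ancestor.toString();
    if (!base.endsWith("/")) base += "/";
    return dir.toString().startsWith(base);
  }

  // Returns the table location plus every distinct custom partition directory.
  static Set<URI> collectDirsToLoad(URI tblLocation, List<URI> partDirs) {
    // A hash-based set makes contains() an expected O(1) lookup
    // instead of a linear scan over everything added so far.
    Set<URI> dirsToLoad = new HashSet<>();
    dirsToLoad.add(tblLocation);
    for (URI partDir : partDirs) {
      if (!dirsToLoad.contains(partDir) && !isDescendant(partDir, tblLocation)) {
        // Custom filesystem location: its metadata is loaded separately.
        dirsToLoad.add(partDir);
      }
    }
    return dirsToLoad;
  }

  public static void main(String[] args) {
    URI tbl = URI.create("hdfs://nn/warehouse/t");
    List<URI> parts = Arrays.asList(
        URI.create("hdfs://nn/warehouse/t/p=1"),  // under the table location
        URI.create("hdfs://nn/custom/p=2"),       // custom location
        URI.create("hdfs://nn/custom/p=2"));      // same custom location again
    System.out.println(collectDirsToLoad(tbl, parts));
  }
}
```

      Since Set.add() already ignores duplicates, the explicit contains() call could be dropped; it is kept here only to mirror the shape of the original snippet.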
      

      Profile from Java Mission Control, showing most samples in ArrayList.contains():

      Stack Trace	Sample Count	Percentage(%)
      java.lang.Thread.run()	73,611	97.157
         java.util.concurrent.ThreadPoolExecutor$Worker.run()	73,611	97.157
            java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)	73,611	97.157
               java.util.concurrent.FutureTask.run()	73,595	97.136
                  org.apache.impala.catalog.TableLoadingMgr$2.call()	73,555	97.083
                     org.apache.impala.catalog.TableLoadingMgr$2.call()	73,555	97.083
                        org.apache.impala.catalog.TableLoader.load(Db, String)	73,555	97.083
                           org.apache.impala.catalog.HdfsTable.load(boolean, IMetaStoreClient, Table)	73,555	97.083
                              org.apache.impala.catalog.HdfsTable.load(boolean, IMetaStoreClient, Table, boolean, boolean, Set)	73,555	97.083
                                 org.apache.impala.catalog.HdfsTable.loadAllPartitions(List, Table)	73,508	97.021
                                    java.util.ArrayList.contains(Object)	70,094	92.515
                                       java.util.ArrayList.indexOf(Object)	70,094	92.515
                                          org.apache.hadoop.fs.Path.equals(Object)	69,462	91.681
                                             java.net.URI.equals(Object)	69,462	91.681
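
      The profile is consistent with a quadratic number of equals() calls. As a rough illustration, assuming every partition has a distinct custom path, the k-th contains() call scans k entries, so N partitions cost N(N+1)/2 comparisons, or about 5 x 10^9 for N = 100,000. The sketch below counts equals() calls for the check-then-add pattern; Dir is a hypothetical key type, not an Impala or Hadoop class.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;

class ContainsCostSketch {
  static long equalsCalls = 0;

  // Hypothetical key type that counts equals() invocations.
  static final class Dir {
    final String path;
    Dir(String path) { this.path = path; }
    @Override public boolean equals(Object o) {
      equalsCalls++;
      return o instanceof Dir && ((Dir) o).path.equals(path);
    }
    @Override public int hashCode() { return path.hashCode(); }
  }

  // Runs the check-then-add pattern for n distinct directories and
  // returns how many equals() calls it triggered on the given collection.
  static long countEqualsCalls(Collection<Dir> dirsToLoad, int n) {
    equalsCalls = 0;
    dirsToLoad.add(new Dir("/tbl"));
    for (int i = 0; i < n; i++) {
      Dir partDir = new Dir("/custom/p=" + i);
      if (!dirsToLoad.contains(partDir)) dirsToLoad.add(partDir);
    }
    return equalsCalls;
  }

  public static void main(String[] args) {
    int n = 2000;
    // ArrayList.contains() compares against every stored element: n(n+1)/2 calls.
    System.out.println("ArrayList equals() calls: "
        + countEqualsCalls(new ArrayList<>(), n));
    // HashSet only calls equals() when two hash codes match.
    System.out.println("HashSet equals() calls:   "
        + countEqualsCalls(new HashSet<>(), n));
  }
}
```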
      


          Activity

          bharath v (bharathv) added a comment:

          IMPALA-5042: Use a HashSet instead of ArrayList for O(1) lookups

          Testing: Ran the metadata perf benchmark. No regressions and
          found good gains in the following cases.

          100K-PARTITIONS-1M-FILES-CUSTOM-05-QUERY-AFTER-INVALIDATE ~81.3%
          100K-PARTITIONS-1M-FILES-CUSTOM-07-REFRESH ~81.3%
          100K-PARTITIONS-1M-FILES-CUSTOM-10-REFRESH-AFTER-ADD-PARTITION ~81.7%

          Change-Id: Ia9eccfe853583a0b78a5280f1b9525ce97f88cb5
          Reviewed-on: http://gerrit.cloudera.org:8080/6319
          Reviewed-by: Alex Behm <alex.behm@cloudera.com>
          Tested-by: Impala Public Jenkins


            People

            • Assignee: bharath v (bharathv)
            • Reporter: Mostafa Mokhtar (mmokhtar)
