IMPALA-4788

Partition recovery is very slow as it uses an ArrayList to check if a partition already exists

    Details

      Description

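      Running "ALTER TABLE foo RECOVER PARTITIONS" against a table with a large number of partitions is very slow because an ArrayList is used to check whether each partition already exists. Every contains() call scans the list linearly, so each partition directory examined costs time proportional to the number of partitions already known.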

      https://github.com/apache/incubator-impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L1677

      In the profile below, java.util.ArrayList.contains(Object) accounts for about 99% of the CPU samples. The call sits in the base case of this method:

        private void getAllPartitionsNotInHms(Path path, List<String> partitionKeys,
            int depth, FileSystem fs, List<String> partitionValues,
            List<LiteralExpr> partitionExprs, List<List<LiteralExpr>> existingPartitions,
            List<List<String>> partitionsNotInHms) throws IOException {
          if (depth == partitionKeys.size()) {
            if (existingPartitions.contains(partitionExprs)) {
              if (LOG.isTraceEnabled()) {
                LOG.trace(String.format("Skip recovery of path '%s' because it already "
                    + "exists in metastore", path.toString()));
              }
            } else {
              partitionsNotInHms.add(partitionValues);
              existingPartitions.add(partitionExprs);
            }
            return;
          }
      
      Stack Trace	Sample Count	Percentage(%)
      org.apache.impala.service.JniCatalog.execDdl(byte[])	25,561	99.98
         org.apache.impala.service.CatalogOpExecutor.execDdlRequest(TDdlExecRequest)	25,561	99.98
            org.apache.impala.service.CatalogOpExecutor.alterTable(TAlterTableParams, TDdlExecResponse)	25,561	99.98
               org.apache.impala.service.CatalogOpExecutor.alterTableRecoverPartitions(Table)	25,561	99.98
                  org.apache.impala.catalog.HdfsTable.getPathsWithoutPartitions()	25,561	99.98
                     org.apache.impala.catalog.HdfsTable.getAllPartitionsNotInHms(Path, List, List, List)	25,561	99.98
                        org.apache.impala.catalog.HdfsTable.getAllPartitionsNotInHms(Path, List, int, FileSystem, List, List, List, List)	25,561	99.98
                           org.apache.impala.catalog.HdfsTable.getAllPartitionsNotInHms(Path, List, int, FileSystem, List, List, List, List)	25,561	99.98
                              org.apache.impala.catalog.HdfsTable.getAllPartitionsNotInHms(Path, List, int, FileSystem, List, List, List, List)	25,427	99.456
                                 java.util.ArrayList.contains(Object)	25,334	99.093
                                    java.util.ArrayList.indexOf(Object)	25,334	99.093
                                       java.util.AbstractList.equals(Object)	24,755	96.828
                                          java.util.ArrayList.listIterator()	8,190	32.035
                                             java.util.ArrayList$ListItr.<init>(ArrayList, int)	8,184	32.011
                                                java.util.ArrayList$Itr.<init>(ArrayList, ArrayList$1)	8,184	32.011
                                                   java.util.ArrayList$Itr.<init>(ArrayList)	8,184	32.011
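
To illustrate why the profile is dominated by list equality checks, here is a minimal, self-contained sketch. The class MembershipCostDemo and the Key type are hypothetical stand-ins (Key plays the role of a partition value and simply counts equals() calls); none of this is Impala code. It probes a List<List<Key>> and a HashSet<List<Key>> with the same n one-element lists and returns how many element comparisons each approach makes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MembershipCostDemo {
  /** Hypothetical stand-in for a partition value; counts equals() calls. */
  static final class Key {
    static long equalsCalls = 0;
    final int value;

    Key(int value) { this.value = value; }

    @Override public boolean equals(Object o) {
      equalsCalls++;
      return o instanceof Key && ((Key) o).value == value;
    }

    @Override public int hashCode() { return Integer.hashCode(value); }
  }

  /** Builds a one-column "partition" whose only value is i. */
  static List<Key> partition(int i) {
    List<Key> p = new ArrayList<>();
    p.add(new Key(i));
    return p;
  }

  /** Key.equals() calls made by n contains() probes against an ArrayList. */
  static long probeList(int n) {
    List<List<Key>> existing = new ArrayList<>();
    for (int i = 0; i < n; i++) existing.add(partition(i));
    Key.equalsCalls = 0;
    // Each probe walks the list from the front until it finds a match.
    for (int i = 0; i < n; i++) existing.contains(partition(i));
    return Key.equalsCalls;
  }

  /** Key.equals() calls made by n contains() probes against a HashSet. */
  static long probeSet(int n) {
    Set<List<Key>> existing = new HashSet<>();
    for (int i = 0; i < n; i++) existing.add(partition(i));
    Key.equalsCalls = 0;
    // Each probe hashes the list and compares only within one bucket.
    for (int i = 0; i < n; i++) existing.contains(partition(i));
    return Key.equalsCalls;
  }

  public static void main(String[] args) {
    int n = 2000;
    System.out.println("list probes: " + probeList(n) + " comparisons");
    System.out.println("set probes:  " + probeSet(n) + " comparisons");
  }
}
```

With n partitions the list version performs on the order of n^2/2 element comparisons, while the hash-based version performs roughly one per probe, provided hashCode() is consistent with equals() for the element type.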
      
      

        Activity

        jbapple Jim Apple added a comment - For review: http://gerrit.cloudera.org:8080/5745
        jbapple Jim Apple added a comment -

        Now uses HashSet instead: http://gerrit.cloudera.org:8080/5745
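
As a rough illustration of the idea only (this is not the patch itself, whose contents are at the review link above), the check-then-add step from the description could be expressed against a Set. The names RecoverPartitionsSketch and partitionsNotInHms are hypothetical, and List<String> stands in for List<LiteralExpr>; the sketch assumes the real element type has a hashCode() consistent with equals().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RecoverPartitionsSketch {
  /**
   * Returns the candidates that are not yet in 'existing', and records each
   * of them in 'existing' so that later duplicates are skipped as well.
   */
  static List<List<String>> partitionsNotInHms(
      Set<List<String>> existing, List<List<String>> candidates) {
    List<List<String>> missing = new ArrayList<>();
    for (List<String> values : candidates) {
      // Set.add() returns false when an equal element is already present,
      // so the membership check and the insert share one hash lookup.
      if (existing.add(values)) {
        missing.add(values);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    Set<List<String>> existing = new HashSet<>();
    existing.add(Arrays.asList("2017", "01"));
    List<List<String>> found = Arrays.asList(
        Arrays.asList("2017", "01"),
        Arrays.asList("2017", "02"));
    System.out.println(partitionsNotInHms(existing, found));
  }
}
```

Relying on the return value of add() keeps the "already exists" branch and the "newly recorded" branch in the same place as in the original base case, without a separate contains() call.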

          People

          • Assignee: Jim Apple (jbapple)
          • Reporter: Mostafa Mokhtar (mmokhtar)