IMPALA-8167

REFRESH on non-partitioned tables always re-reads the block locations of all files, which takes too long on big tables.


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.12.0
    • Fix Version/s: None
    • Component/s: Catalog
    • Labels: None

    Description

      REFRESH on a non-partitioned table always fetches block locations by calling "getFileBlockLocations" on every file, whether or not any files were added or changed.

      We think the problem is located in the method "updateUnpartitionedTableFileMd".

      This method always resets the partitions and adds a new one with no file descriptors, so refreshPartitionFileMetadata(part) must read every file of the new partition to rebuild that information. As a result, getFileBlockLocations is called for all files, whether they are new or old.

      This is confirmed by looking at the code:

       

        private void updateUnpartitionedTableFileMd() throws Exception {
          if (LOG.isTraceEnabled()) {
            LOG.trace("update unpartitioned table: " + getFullName());
          }
          resetPartitions();  // Drops the partition that holds the previous file descriptor info.
          org.apache.hadoop.hive.metastore.api.Table msTbl = getMetaStoreTable();
          Preconditions.checkNotNull(msTbl);
          addDefaultPartition(msTbl.getSd());
          HdfsPartition part = createPartition(msTbl.getSd(), null);  // Creates a new partition.
          addPartition(part);
          if (isMarkedCached_) part.markCached();

          LOG.info("Refreshing-updateUnpartitionedTableFileMd(): " + getFullName() +
                    " Location: " + part.getLocation() +
                    " FileDescriptors: " + part.getFileDescriptors().size());

          refreshPartitionFileMetadata(part);

          LOG.info("Refreshed-updateUnpartitionedTableFileMd(): " + getFullName() +
                   " Location: " + part.getLocation() +
                   " FileDescriptors: " + part.getFileDescriptors().size());
        }
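
      One possible direction, sketched below, is to keep the previous file descriptors and fetch block locations only for files that are new or whose length or modification time changed. This is a minimal, self-contained illustration and not Impala code: the class IncrementalRefreshSketch and the types FileStatusLite, FileDesc and BlockLocationFetcher are hypothetical stand-ins, and the fetcher merely simulates the expensive per-file block location call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Impala.
class IncrementalRefreshSketch {
  /** What a directory listing reports for one file. */
  static final class FileStatusLite {
    final String name;
    final long length;
    final long modificationTime;

    FileStatusLite(String name, long length, long modificationTime) {
      this.name = name;
      this.length = length;
      this.modificationTime = modificationTime;
    }
  }

  /** Cached per-file metadata, including the expensive block locations. */
  static final class FileDesc {
    final long length;
    final long modificationTime;
    final List<String> blockHosts;

    FileDesc(long length, long modificationTime, List<String> blockHosts) {
      this.length = length;
      this.modificationTime = modificationTime;
      this.blockHosts = blockHosts;
    }
  }

  /** Stand-in for the expensive per-file block location lookup. */
  interface BlockLocationFetcher {
    List<String> fetch(FileStatusLite file);
  }

  /**
   * Builds the new descriptor map from the current listing. Unchanged files
   * reuse their cached descriptor; only new or modified files hit the fetcher.
   * Files that are no longer listed are dropped from the result.
   */
  static Map<String, FileDesc> refresh(Map<String, FileDesc> cached,
      List<FileStatusLite> listing, BlockLocationFetcher fetcher) {
    Map<String, FileDesc> result = new HashMap<>();
    for (FileStatusLite f : listing) {
      FileDesc old = cached.get(f.name);
      boolean unchanged = old != null
          && old.length == f.length
          && old.modificationTime == f.modificationTime;
      if (unchanged) {
        result.put(f.name, old);
      } else {
        result.put(f.name,
            new FileDesc(f.length, f.modificationTime, fetcher.fetch(f)));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    int[] calls = {0};  // counts simulated block location lookups
    BlockLocationFetcher fetcher = f -> {
      calls[0]++;
      return Collections.singletonList("host-1");
    };
    List<FileStatusLite> listing = new ArrayList<>();
    listing.add(new FileStatusLite("a.parq", 100L, 1L));
    listing.add(new FileStatusLite("b.parq", 200L, 1L));

    Map<String, FileDesc> first = refresh(new HashMap<>(), listing, fetcher);
    refresh(first, listing, fetcher);  // nothing changed, so no new lookups
    System.out.println("lookups after two refreshes: " + calls[0]);  // 2
  }
}
```

      With this approach the second of two back-to-back refreshes would only pay for the directory listing. Whether comparing length and modification time is a sufficient change check for a given deployment is an assumption of the sketch.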

       

      Running examples:

      1) First run, with no files added or changed.

      [vera05.claro.amx:21000] > refresh prod_ar.aux_tas_call_details_rt02; 

      LOG:


      I0206 11:18:16.581826 34494 HdfsTable.java:1333] Refreshing-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02 Location: hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux FileDescriptors: 0

      I0206 11:25:35.748185 34494 HdfsTable.java:1340] Refreshed-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02 Location: hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux FileDescriptors: 148398

       

      2) Second run, 2 minutes after the first, with no files added or changed in between. The partition again starts with no file descriptors because of resetPartitions(), so all the files have to be read again.

      [vera05.claro.amx:21000] > refresh prod_ar.aux_tas_call_details_rt02;


      LOG:


       

      I0206 11:27:54.086167 33902 HdfsTable.java:1333] Refreshing-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02 Location: hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux FileDescriptors: 0

      I0206 11:36:35.344233 33902 HdfsTable.java:1340] Refreshed-updateUnpartitionedTableFileMd(): prod_ar.aux_tas_call_details_rt02 Location: hdfs://nn-hdfs.scs.claro.amx:8020/data/prod_ar/staging/cdrs/voz/tas/rt02/aux FileDescriptors: 148398
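
      To put numbers on the two runs, the elapsed time can be computed from the log timestamps above. The sketch below is only an illustration: the class and method names are hypothetical, and the timestamps and the file count (148398) are copied from the log lines. It gives 439 s for the first refresh and 521 s for the second, roughly 3 to 3.5 ms per file.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper for reading the timings out of the log lines above.
class RefreshTimingSketch {
  /** Elapsed time between two same-day timestamps such as "11:18:16.581826". */
  static Duration elapsed(String start, String end) {
    return Duration.between(LocalTime.parse(start), LocalTime.parse(end));
  }

  /** Average refresh cost per file in milliseconds. */
  static double millisPerFile(Duration d, int fileCount) {
    return d.toMillis() / (double) fileCount;
  }

  public static void main(String[] args) {
    int files = 148398;  // FileDescriptors count from the "Refreshed" lines
    Duration first = elapsed("11:18:16.581826", "11:25:35.748185");
    Duration second = elapsed("11:27:54.086167", "11:36:35.344233");
    System.out.println("first refresh:  " + first.getSeconds() + " s, "
        + millisPerFile(first, files) + " ms/file");
    System.out.println("second refresh: " + second.getSeconds() + " s, "
        + millisPerFile(second, files) + " ms/file");
  }
}
```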

       

People

    • Assignee: Unassigned
    • Reporter: Gabriel Gatto (ggatto)
    • Votes: 1
    • Watchers: 3