IMPALA-8630

Consistent remote placement should include partition information when calculating placement

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 3.2.0
    • Fix Version/s: Impala 3.3.0
    • Component/s: Backend
    • Labels: None

    Description

      For partitioned tables, the filenames within partitions may have little entropy. Impala includes information in the filenames it writes that differs across partitions, but low-entropy filenames are common for tables written by the current CDH version of Hive. For example, in our minicluster, the TPC-DS store_sales table has many partitions, but the filenames within the partitions are very simple and repeat from one partition to the next:

      hdfs dfs -ls /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452642
      Found 1 items
      -rwxr-xr-x 3 joe supergroup 379535 2019-06-05 15:16 /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452642/000000_0
      
      hdfs dfs -ls /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452640
      Found 1 items
      -rwxr-xr-x 3 joe supergroup 412959 2019-06-05 15:16 /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452640/000000_0

      Right now, consistent remote placement uses the filename+offset without the partition id.

      uint32_t hash = HashUtil::Hash(hdfs_file_split->relative_path.data(),
          hdfs_file_split->relative_path.length(), 0);

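      When filenames have little entropy, this produces a poor balance of files across nodes: files with the same name in different partitions (such as 000000_0 above) hash to the same value at a given offset and so are placed on the same node. The hash should be amended to include the partition id, which is already accessible on the THdfsFileSplit.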
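      To illustrate the proposed change, here is a minimal, self-contained sketch. It is not the Impala implementation: FileSplit is a hypothetical stand-in for the relevant THdfsFileSplit fields, Fnv1a is a simple hash substituted for HashUtil::Hash, and the way the offset and partition id are mixed in is an assumption made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the THdfsFileSplit fields relevant to placement.
struct FileSplit {
  std::string relative_path;  // e.g. "000000_0"
  int64_t offset;             // byte offset of the split within the file
  int64_t partition_id;       // id of the partition that owns the file
};

// FNV-1a, used here only as a substitute for HashUtil::Hash (an assumption).
// The seed is folded into the initial state so that hashes can be chained.
inline uint32_t Fnv1a(const void* data, size_t len, uint32_t seed) {
  const unsigned char* bytes = static_cast<const unsigned char*>(data);
  uint32_t h = 2166136261u ^ seed;
  for (size_t i = 0; i < len; ++i) {
    h ^= bytes[i];
    h *= 16777619u;
  }
  return h;
}

// Behavior described in this issue: only the filename and offset are hashed.
inline uint32_t HashWithoutPartition(const FileSplit& s) {
  uint32_t h = Fnv1a(s.relative_path.data(), s.relative_path.length(), 0);
  return Fnv1a(&s.offset, sizeof(s.offset), h);
}

// Proposed behavior: the partition id is mixed in as well.
inline uint32_t HashWithPartition(const FileSplit& s) {
  uint32_t h = Fnv1a(s.relative_path.data(), s.relative_path.length(), 0);
  h = Fnv1a(&s.partition_id, sizeof(s.partition_id), h);
  return Fnv1a(&s.offset, sizeof(s.offset), h);
}
```

      With two splits that share the filename 000000_0 and offset 0 but belong to different partitions, HashWithoutPartition returns the same value for both, while HashWithPartition returns different values, so the splits can be spread across nodes.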

          People

            Assignee: Joe McDonnell
            Reporter: Joe McDonnell