Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-22913

Hive Partition Pruning, Fractional and Timestamp types

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Won't Fix
    • 2.3.0
    • None
    • SQL
    • None

    Description

      Spark currently pushes the predicates it has in the SQL query to Hive Metastore. This only applies to predicates that are placed on top of partitioning columns. As more and more hive metastore implementations come around, this is an important optimization to allow data to be prefiltered to only relevant partitions. Consider the following example:

      Table:
      create external table data (key string, quantity long)
      partitioned by (processing-date timestamp)

      Query:
      select * from data where processing-date = '2017-10-23 00:00:00'

      Currently, no filters will be pushed to the hive metastore for the above query. The reason is that the code that tries to compute predicates to be sent to hive metastore, only deals with integral and string column types. It doesn't know how to handle fractional and timestamp columns.

      I have tables in my metastore (AWS Glue) with millions of partitions of type timestamp and double. In my specific case, it takes Spark's master node about 6.5 minutes to download all partitions for the table, and then filter the partitions client-side. The actual processing time of my query is only 6 seconds. In other words, without partition pruning, I'm looking at 6.5 minutes of processing and with partition pruning, I'm looking at 6 seconds only.

      I have a fix for this developed locally that I'll provide shortly as a pull request.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              ameen.tayyebi@gmail.com Ameen Tayyebi
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: