Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.0.2, 2.1.1
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels: None

Description

      I found an issue where filtering partitions using a DataFrame results in very poor performance.

      To reproduce:
      Create a Hive table with a lot of partitions and write a Spark query on that table that filters on the partition column.
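      As a rough sketch of the setup (the table and column names are placeholders, not from the original report), something like the following in spark-shell produces a Hive table with many small partitions:

      {code:scala}
      // Hypothetical setup: a Hive table partitioned on one column,
      // registered with a large number of (empty) partitions.
      spark.sql("CREATE TABLE logs (msg STRING) PARTITIONED BY (dt STRING) STORED AS PARQUET")

      (1 to 30000).foreach { i =>
        spark.sql(s"ALTER TABLE logs ADD IF NOT EXISTS PARTITION (dt = 'd$i')")
      }
      {code}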

      In my use case I've got a table with about 30k partitions.
      I filter the partitions using some Scala via spark-shell:
      table.filter("partition=x or partition=y")
      This results in a Hive Thrift API call, {{#get_partitions('db', 'table', -1)}}, which is very slow (minutes) and loads all metastore partitions into memory.
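      For illustration, the kind of query that triggers the slow path could look like this in spark-shell (using the placeholder table and partition column from the sketch above):

      {code:scala}
      // Disjunctive partition filter: this falls back to #get_partitions,
      // pulling metadata for all ~30k partitions into memory before pruning.
      val table = spark.table("logs")
      table.filter("dt = 'd1' or dt = 'd2'").count()
      {code}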

      Doing a simpler filter:
      table.filter("partition=x")
      results in a Hive Thrift API call, {{#get_partitions_by_filter('db', 'table', 'partition = "x"', -1)}}, which is very fast (seconds) and only fetches partition x into memory.
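      To make the difference concrete, the two Thrift calls roughly correspond to these Hive metastore client calls (a sketch only; the client setup and the db/table names are placeholders):

      {code:scala}
      import org.apache.hadoop.hive.conf.HiveConf
      import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

      val client = new HiveMetaStoreClient(new HiveConf())

      // Slow path: fetches metadata for every partition (-1 = no limit).
      val all = client.listPartitions("db", "table", -1.toShort)

      // Fast path: the metastore evaluates the predicate server-side and
      // returns only the matching partitions.
      val matching = client.listPartitionsByFilter("db", "table", "partition = \"x\"", -1.toShort)
      {code}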

      If possible, Spark should translate the filter into the more performant Thrift call, or fall back to a more scalable solution that filters out partitions without having to load them all into memory first (for instance, fetching the partitions in batches).
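      As a rough sketch of the batched-fetching idea (placeholder names, plain Hive client API rather than Spark internals): list only the partition names first, prune them client-side, and then load the surviving partitions in fixed-size batches:

      {code:scala}
      import scala.collection.JavaConverters._
      import org.apache.hadoop.hive.conf.HiveConf
      import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

      val client = new HiveMetaStoreClient(new HiveConf())

      // Partition names are cheap to list compared to full partition objects.
      val names = client.listPartitionNames("db", "table", -1.toShort).asScala

      // Prune at the name level, e.g. keep only the requested partitions.
      val wanted = names.filter(n => n == "partition=x" || n == "partition=y")

      // Fetch the surviving partitions in batches instead of all at once.
      val partitions = wanted.grouped(500).flatMap { batch =>
        client.getPartitionsByNames("db", "table", batch.asJava).asScala
      }.toSeq
      {code}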

      I've posted my original question on Stack Overflow.

People

    • Assignee: Unassigned
    • Reporter: patduin Patrick Duin
    • Votes: 0
    • Watchers: 4
