Spark / SPARK-6984

Operations on tables with many partitions _very_ slow


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version: 1.2.1
    • Fix Version: 1.5.0
    • Component: SQL
    • Labels: None
    • Environment: External Hive metastore, table with 30K partitions

    Description

      I have a table with _many_ partitions (30K). Users never query all of them at once, but they are all in the metastore. Querying this table is extremely slow even when asking for a single partition.
      "describe sometable" also performs very poorly.

      Spark produces the following times:
      Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189

      Whereas Hive over the same metastore shows:
      Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236

      I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (screenshot attached). Should this value be lazy? "describe table" should be purely a metastore operation IMO (i.e., query Postgres and return the types).
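The laziness suggested above can be sketched in plain Java. `Partition`, `Table`, and `fetchPartition` below are illustrative stand-ins, not Spark's actual `HiveMetastoreCatalog` types; the point is only that deferring per-partition object construction keeps metadata-only operations like "describe table" cheap regardless of partition count:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class LazyPartitionsSketch {
    // Illustrative stand-in for per-partition metadata; not Spark's actual class.
    record Partition(Map<String, String> values) {}

    // Counts how many partition objects have actually been constructed.
    static int constructed = 0;

    // Simulates fetching one partition's metadata from the metastore.
    static Partition fetchPartition(int i) {
        constructed++;
        return new Partition(Map.of("dt", "day-" + i));
    }

    static class Table {
        private final int numPartitions;
        private List<Partition> cached; // populated only on first access

        Table(int numPartitions) { this.numPartitions = numPartitions; }

        // Lazy: partition objects are built only when a caller actually
        // needs them, so the O(numPartitions) cost is paid at most once,
        // and never by metadata-only operations.
        List<Partition> partitions() {
            if (cached == null) {
                cached = IntStream.rangeClosed(1, numPartitions)
                        .mapToObj(LazyPartitionsSketch::fetchPartition)
                        .collect(Collectors.toList());
            }
            return cached;
        }

        // Metastore-only op: returns the schema without touching partitions.
        String describe() { return "dt: string"; }
    }

    public static void main(String[] args) {
        Table t = new Table(30_000);
        System.out.println(t.describe());
        System.out.println(constructed); // prints 0: describe built no partitions
        t.partitions();
        System.out.println(constructed); // prints 30000: paid only when asked
    }
}
```

With an eager design, the equivalent of `partitions()` would run inside the `Table` constructor, so even `describe()` would pay the 30K-object cost up front, which matches the behavior reported here.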

      The issue is a blocker for me, but I am leaving it at the default priority until someone can confirm it is a bug. "describe table" itself is not so interesting, but I think this affects all query paths. I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html

      Attachments

        1. 7282_partitions_stack.png
          234 kB
          Yana Kadiyska


            People

              Assignee: Unassigned
              Reporter: Yana Kadiyska (yanakad)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: