Spark / SPARK-9926

Parallelize file listing for partitioned Hive table


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1, 1.5.0
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

      Description

      In Spark SQL, short queries like select * from table limit 10 run very slowly against partitioned Hive tables because of file listing. In particular, when a large number of partitions are scanned on storage like S3, the queries run extremely slowly. Here are some example benchmarks from my environment:

      • Parquet-backed Hive table
      • Partitioned by dateint and hour
      • Stored on S3
      | # of partitions | # of files | runtime | query                                                                          |
      | 1               | 972        | 30 secs | select * from nccp_log where dateint=20150601 and hour=0 limit 10;             |
      | 24              | 13646      | 6 mins  | select * from nccp_log where dateint=20150601 limit 10;                        |
      | 240             | 136222     | 1 hour  | select * from nccp_log where dateint>=20150601 and dateint<=20150610 limit 10; |

      The problem is that TableReader constructs a separate HadoopRDD per Hive partition path and groups them into a UnionRDD. Then, all the input files are listed sequentially. In other tools such as Hive and Pig, this can be solved by setting mapreduce.input.fileinputformat.list-status.num-threads high. But in Spark, since each HadoopRDD lists only one partition path, setting this property doesn't help.
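      For illustration, here is a minimal sketch of what parallel listing could look like. This is not the code from the actual fix; the helper name, the thread-pool size, and the partitionPaths input are assumptions. The idea is the same one that mapreduce.input.fileinputformat.list-status.num-threads enables in Hive and Pig, but applied across all partition paths at once instead of within a single HadoopRDD.

{code:scala}
// Illustrative sketch only, not the actual Spark patch: list every Hive
// partition path on a shared thread pool instead of letting each HadoopRDD
// list its single path sequentially. The helper name, pool size, and
// partitionPaths input are assumptions for this example.
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

object ParallelListing {
  def listPartitionsInParallel(
      partitionPaths: Seq[Path],
      conf: Configuration,
      numThreads: Int = 20): Seq[FileStatus] = {
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // Kick off one listStatus call per partition path; they run concurrently.
      val futures = partitionPaths.map { path =>
        Future {
          val fs = path.getFileSystem(conf)
          fs.listStatus(path).toSeq
        }
      }
      // Wait for all listings and flatten them into a single file list.
      Await.result(Future.sequence(futures), Duration.Inf).flatten
    } finally {
      pool.shutdown()
    }
  }
}
{code}

      On high-latency storage such as S3, this turns N sequential listStatus round trips into roughly N / numThreads, which is where queries over many partitions spend most of their time today.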


              People

              • Assignee: Ryan Blue (rdblue)
              • Reporter: Cheolsoo Park (cheolsoo)
              • Votes: 0
              • Watchers: 10
