SPARK-9926: Parallelize file listing for partitioned Hive table


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.1, 1.5.0
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      In Spark SQL, short queries like select * from table limit 10 run very slowly against partitioned Hive tables because of file listing. In particular, if a large number of partitions are scanned on storage like S3, the queries run extremely slowly. Here are some example benchmarks from my environment:

      • Parquet-backed Hive table
      • Partitioned by dateint and hour
      • Stored on S3
      # of partitions   # of files   runtime   query
      1                 972          30 secs   select * from nccp_log where dateint=20150601 and hour=0 limit 10;
      24                13646        6 mins    select * from nccp_log where dateint=20150601 limit 10;
      240               136222       1 hour    select * from nccp_log where dateint>=20150601 and dateint<=20150610 limit 10;

      The problem is that TableReader constructs a separate HadoopRDD per Hive partition path and then groups them into a UnionRDD, so the input files of all partitions end up being listed sequentially. In other tools such as Hive and Pig, this can be solved by setting mapreduce.input.fileinputformat.list-status.num-threads to a high value, but in Spark, since each HadoopRDD lists only one partition path, setting this property doesn't help.
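      The general idea is to overlap the per-partition listStatus calls instead of issuing them one after another. The sketch below is only illustrative and is not the actual patch that resolved this issue; the listFilesInParallel helper, its parameters, and the use of a fixed thread pool are assumptions for the example, roughly analogous to what mapreduce.input.fileinputformat.list-status.num-threads does for a single FileInputFormat.

{code:scala}
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

object ParallelPartitionListing {

  // Hypothetical helper: lists the files under every partition directory
  // using a shared fixed-size thread pool, so slow S3 listings overlap
  // instead of running one partition at a time.
  def listFilesInParallel(
      partitionPaths: Seq[Path],
      hadoopConf: Configuration,
      numThreads: Int = 8): Seq[FileStatus] = {
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // One listing task per partition directory; up to numThreads of them
      // run concurrently.
      val futures = partitionPaths.map { path =>
        Future {
          val fs = path.getFileSystem(hadoopConf)
          fs.listStatus(path).toSeq
        }
      }
      Await.result(Future.sequence(futures), Duration.Inf).flatten
    } finally {
      pool.shutdown()
    }
  }
}
{code}

      With something along these lines, the 240-partition case above would issue its directory listings concurrently rather than back to back, which is where most of the observed runtime goes.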


            People

              Assignee: Ryan Blue (rdblue)
              Reporter: Cheolsoo Park (cheolsoo)
              Votes: 0
              Watchers: 9
