Spark / SPARK-19351

Support for obtaining file splits from underlying InputFormat


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      This is a feature request to enable Spark SQL to obtain the files for a Hive partition by calling inputFormat.getSplits(), as opposed to listing files directly, while still using Spark's optimized Parquet readers for the actual IO. (Note that the difference between this and falling back entirely to Hive via spark.sql.hive.convertMetastoreParquet=false is that we still get benefits such as the new Parquet reader, schema merging, etc. in Spark SQL.)

      Some background on the context, using our use-case at Uber: we have Hive tables where each partition contains versioned files (whenever records in a file change, we produce a new version, to speed up database ingestion), and such tables are registered with a custom InputFormat that filters out old versions and returns only the latest version of each file to the query.
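      For illustration, the core of the latest-version filtering such a custom InputFormat performs might look like the sketch below. The `<fileId>_<version>.parquet` naming scheme and the `LatestVersionFilter` class are hypothetical, chosen only to make the idea concrete; they are not the actual file layout or code.

```java
import java.util.*;

// Hypothetical sketch: given files named "<fileId>_<version>.parquet",
// keep only the highest-version file per fileId, as the custom
// InputFormat described above would do before returning splits.
public class LatestVersionFilter {

    // Return only the latest version of each file, sorted by name.
    public static List<String> latestOnly(List<String> files) {
        Map<String, Integer> bestVersion = new HashMap<>();
        Map<String, String> latest = new HashMap<>();
        for (String f : files) {
            String base = f.substring(0, f.length() - ".parquet".length());
            int sep = base.lastIndexOf('_');
            String fileId = base.substring(0, sep);
            int version = Integer.parseInt(base.substring(sep + 1));
            // Keep this file only if it is newer than what we have seen.
            if (version > bestVersion.getOrDefault(fileId, -1)) {
                bestVersion.put(fileId, version);
                latest.put(fileId, f);
            }
        }
        List<String> out = new ArrayList<>(latest.values());
        Collections.sort(out);
        return out;
    }
}
```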

      We have had this working for 5 months now across Hive/Spark/Presto, as follows:

      • Hive: Works out of the box by calling inputFormat.getSplits(), so we are good there.
      • Presto: We made the fix in Presto, similar to what is proposed here.
      • Spark: We set convertMetastoreParquet=false. Performance is actually comparable for our use-cases, but we run into schema merging issues now and then.
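      The Spark workaround in the list above amounts to the following session-level setting (the existing spark.sql.hive.convertMetastoreParquet conf, shown here as a Spark SQL command for concreteness):

```
-- Fall back to Hive's input format path for metastore Parquet tables,
-- honoring the table's registered InputFormat at the cost of
-- Spark's optimized Parquet reader.
SET spark.sql.hive.convertMetastoreParquet=false;
```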

      We have explored a few approaches here and would like to get more feedback from you all before we go further.


            People

              Assignee: Unassigned
              Reporter: vinothchandar (Vinoth Chandar)
              Votes: 0
              Watchers: 10
