Spark / SPARK-19351

Support for obtaining file splits from underlying InputFormat


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      This is a feature request to enable Spark SQL to obtain the files for a Hive partition by calling inputFormat.getSplits(), as opposed to listing files directly, while still using Spark's optimized Parquet readers for the actual IO. (Note that the difference between this and falling back entirely to Hive via spark.sql.hive.convertMetastoreParquet=false is that we still get benefits such as the new Parquet reader, schema merging, etc. in Spark SQL.)

      Some background on the context, using our use-case at Uber: we have Hive tables where each partition contains versioned files (whenever records in a file change, we produce a new version, to speed up database ingestion), and such tables are registered with a custom InputFormat that filters out old versions and returns only the latest version of each file to the query.
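      For illustration, the core of the latest-version filtering such a custom InputFormat performs might look like the sketch below. The `<fileId>_<version>.parquet` naming scheme and the `LatestVersionFilter` class are hypothetical, chosen only to make the idea concrete; they are not the actual file layout or code.

```java
import java.util.*;

// Hypothetical sketch: given files named "<fileId>_<version>.parquet",
// keep only the highest-version file per fileId, as the custom
// InputFormat described above would do before returning splits.
public class LatestVersionFilter {

    // Return only the latest version of each file, sorted by name.
    public static List<String> latestOnly(List<String> files) {
        Map<String, Integer> bestVersion = new HashMap<>();
        Map<String, String> latest = new HashMap<>();
        for (String f : files) {
            String base = f.substring(0, f.length() - ".parquet".length());
            int sep = base.lastIndexOf('_');
            String fileId = base.substring(0, sep);
            int version = Integer.parseInt(base.substring(sep + 1));
            // Keep this file only if it is newer than what we have seen.
            if (version > bestVersion.getOrDefault(fileId, -1)) {
                bestVersion.put(fileId, version);
                latest.put(fileId, f);
            }
        }
        List<String> out = new ArrayList<>(latest.values());
        Collections.sort(out);
        return out;
    }
}
```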

      We have had this working for 5 months now across Hive/Spark/Presto, as follows:

      • Hive: Works out of the box by calling inputFormat.getSplits(), so we are good there.
      • Presto: We made the fix in Presto, similar to what is proposed here.
      • Spark: We set convertMetastoreParquet=false. Performance is actually comparable for our use-cases, but we run into schema merging issues now and then.
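      The Spark workaround in the list above amounts to the following session-level setting (the existing spark.sql.hive.convertMetastoreParquet conf, shown here as a Spark SQL command for concreteness):

```
-- Fall back to Hive's input format path for metastore Parquet tables,
-- honoring the table's registered InputFormat at the cost of
-- Spark's optimized Parquet reader.
SET spark.sql.hive.convertMetastoreParquet=false;
```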

      We have explored a few approaches here and would like to get more feedback from you all before we go further.


            People

              Assignee: Unassigned
              Reporter: vinothchandar (Vinoth Chandar)
              Votes: 0
              Watchers: 10
