[SPARK-31822] Cost too much resources when read orc hive table for infer schema - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.3
Fix Version/s: None
Component/s: Input/Output, SQL
Labels:
- HiveMetastoreCatalog
- orc

Description

When Spark reads a partitioned Hive ORC table that has no Spark schema properties, it reads all partitions and all files in order to infer the schema.

Other settings: native ORC mode; convertMetastoreOrc = true.
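
For reference, the settings above presumably correspond to the following session configuration. This is a sketch and not taken from the reporter's environment: the app name is a placeholder, and it assumes a Spark build with Hive support on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Assumed reproduction setup; the app name is a placeholder.
val spark = SparkSession.builder()
  .appName("orc-infer-schema-repro")
  // Use Spark's native ORC reader rather than the Hive one.
  .config("spark.sql.orc.impl", "native")
  // Convert metastore ORC tables to Spark's data source relation.
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  .enableHiveSupport()
  .getOrCreate()
```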

I think this can be improved by passing partitionFilters to fileIndex.listFiles.

// org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
val inferredSchema = fileFormat
  .inferSchema(
    sparkSession,
    options,
    fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
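
To illustrate the idea, here is a minimal, self-contained sketch of how passing partition filters shrinks the set of files handed to schema inference. It is a toy model and not Spark code: `ToyFileIndex`, `PartitionDir`, `DataFile`, `PartitionFilter`, and `filesForInference` are all hypothetical names, and the `dt` partition column and file paths are invented for the example. Only the shape of `listFiles` (an empty filter list keeps every partition) is meant to mirror the call above.

```scala
object InferSchemaSketch {
  // Hypothetical stand-ins for a data file and a partition directory.
  final case class DataFile(path: String)
  final case class PartitionDir(values: Map[String, String], files: Seq[DataFile])

  // A partition filter is modelled as a predicate over partition column values.
  type PartitionFilter = Map[String, String] => Boolean

  final class ToyFileIndex(partitions: Seq[PartitionDir]) {
    // An empty filter list keeps every partition, like listFiles(Nil, Nil).
    def listFiles(partitionFilters: Seq[PartitionFilter]): Seq[PartitionDir] =
      partitions.filter(p => partitionFilters.forall(f => f(p.values)))
  }

  // Files that would be passed on to schema inference.
  def filesForInference(index: ToyFileIndex, filters: Seq[PartitionFilter]): Seq[DataFile] =
    index.listFiles(filters).flatMap(_.files)

  // Invented example data: three partitions, four files in total.
  val index = new ToyFileIndex(Seq(
    PartitionDir(Map("dt" -> "2020-05-01"),
      Seq(DataFile("dt=2020-05-01/part-0.orc"), DataFile("dt=2020-05-01/part-1.orc"))),
    PartitionDir(Map("dt" -> "2020-05-02"), Seq(DataFile("dt=2020-05-02/part-0.orc"))),
    PartitionDir(Map("dt" -> "2020-05-03"), Seq(DataFile("dt=2020-05-03/part-0.orc")))
  ))

  val onlyMay2: PartitionFilter = _.get("dt").contains("2020-05-02")

  def main(args: Array[String]): Unit = {
    println(filesForInference(index, Nil).size)           // 4: every file in every partition
    println(filesForInference(index, Seq(onlyMay2)).size) // 1: only the matching partition
  }
}
```

One trade-off to keep in mind: a schema inferred from a pruned set of partitions reflects only the files in those partitions, so any proposed change would need to decide whether that is acceptable when partitions differ in their file schemas.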