[SPARK-31822] Cost too much resources when read orc hive table for infer schema - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.3
Fix Version/s: None
Component/s: Input/Output, SQL
Labels:
- HiveMetastoreCatalog
- orc

Description

When reading a partitioned Hive ORC table that has no Spark schema properties, Spark reads all partitions and all files in order to infer the schema.

Other settings: native ORC mode; convertMetastoreOrc = true.

I think this can be improved by passing partitionFilters to fileIndex.listFiles.

// org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
val inferredSchema = fileFormat
  .inferSchema(
    sparkSession,
    options,
    fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
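
In the snippet above, the first argument of `fileIndex.listFiles` is the list of partition filters, and `Nil` there means every partition is listed. The following is only a toy model of the proposed change, not a patch against Spark: `PartitionDir`, `ToyFileIndex` and `filesForInference` are hypothetical stand-ins invented for illustration, and the partition column `dt` and the file names are made up. It shows how a non-empty filter list narrows the set of files handed to schema inference; how real filters would reach this code path is not specified in the report.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark classes.
object PrunedSchemaInference {
  type PartitionValues = Map[String, String]

  // One partition directory: its partition values and the files it holds.
  final case class PartitionDir(values: PartitionValues, files: Seq[String])

  // Hypothetical file index; a real one would list directories on the file system.
  final class ToyFileIndex(partitions: Seq[PartitionDir]) {
    // An empty filter list matches every partition, like listFiles(Nil, Nil).
    def listFiles(partitionFilters: Seq[PartitionValues => Boolean]): Seq[PartitionDir] =
      partitions.filter(p => partitionFilters.forall(f => f(p.values)))
  }

  // Files that would be passed on to schema inference for the given filters.
  def filesForInference(
      index: ToyFileIndex,
      partitionFilters: Seq[PartitionValues => Boolean]): Seq[String] =
    index.listFiles(partitionFilters).flatMap(_.files)

  def main(args: Array[String]): Unit = {
    val index = new ToyFileIndex(Seq(
      PartitionDir(Map("dt" -> "2020-05-25"), Seq("a.orc", "b.orc")),
      PartitionDir(Map("dt" -> "2020-05-26"), Seq("c.orc"))))

    // Assumed filter: keep only the partition with dt = 2020-05-26.
    val onlyLatest: PartitionValues => Boolean = _.get("dt").contains("2020-05-26")

    println(filesForInference(index, Nil).size)             // prints 3
    println(filesForInference(index, Seq(onlyLatest)).size) // prints 1
  }
}
```

With no filters, all three files are scanned; with the filter, only one file is scanned. The trade-off is that a schema inferred from a subset of partitions may miss columns that appear only in the skipped files.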

People

Assignee: Unassigned

Reporter: lithiumlee-_-

Votes: 2

Watchers: 3

Dates

Created: 26/May/20 09:23

Updated: 26/May/20 14:20