Spark / SPARK-35191

All columns are read even if column pruning applies when Spark 3.0 reads a table written by Spark 2.2


Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Environment: Spark 3.0

      spark.sql.hive.convertMetastoreOrc=true (default value in Spark 3.0)

      spark.sql.orc.impl=native (default value in Spark 3.0)

    Description

      Before I describe the issue, some background: the Spark version we currently use is 2.2, and we plan to migrate to Spark 3.0 in the near future. Before migrating, we ran some queries in both Spark 2.2 and Spark 3.0 to check for potential issues. The source tables for these queries are in ORC format and were written by Spark 2.2.

       

      I found that even when column pruning is applied, Spark 3.0's native ORC reader still reads all columns.
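
      One way to observe this is to run a query that only needs a single column and compare what the scan actually reads. Below is a minimal sketch; the table and column names are hypothetical:

      // query that should only need col_a after pruning
      val df = spark.sql("SELECT col_a FROM orc_table_written_by_2_2")
      df.explain(true)   // the FileScan node's ReadSchema lists only col_a, i.e. pruning is planned
      df.collect()
      // comparing the scan's input-size metric (Spark UI, stage "Input Size / Records") against a
      // full SELECT * run shows whether the other columns were actually skipped on disk; in the
      // problematic case both runs read roughly the same amount of data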

       

      Then I did some remote debugging. In OrcUtils.scala's requestedColumnIds method, Spark checks whether the physical field names start with "_col". In my case they do (e.g. "_col1", "_col2"), so pruneCols is not done. The code is below:

       

      if (orcFieldNames.forall(_.startsWith("_col"))) {
        // This is a ORC file written by Hive, no field names in the physical schema, assume the
        // physical schema maps to the data scheme by index.
        assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
          s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
          "no idea which columns were dropped, fail to read.")
        // for ORC file written by Hive, no field names
        // in the physical schema, there is a need to send the
        // entire dataSchema instead of required schema.
        // So pruneCols is not done in this case
        Some(requiredSchema.fieldNames.map { name =>
          val index = dataSchema.fieldIndex(name)
          if (index < orcFieldNames.length) {
            index
          } else {
            -1
          }
        }, false)

      Although the code comment explains the reason, I still do not fully understand it. This issue only happens in one case: Spark 3.0 using the native reader to read a table written by Spark 2.2.
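
      To confirm which case a table falls into, one can check the physical field names stored in the ORC file footers, independently of the metastore schema. Below is a minimal sketch; the file path is hypothetical:

      // read one data file directly, so the ORC footer schema is used instead of the metastore schema
      val physical = spark.read.orc("/warehouse/mydb.db/my_table/part-00000")
      physical.printSchema()
      // files written through the Hive ORC SerDe typically show _col0, _col1, ... here,
      // while files written by the native ORC writer keep the real column names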

       

      In other cases there is no such issue. I ran two more tests (a sketch of both setups follows below):

      Test 1: using Spark 3.0's Hive reader (with spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read the same table, only the pruned columns are read.

      Test 2: writing a table with Spark 3.0 and then reading this new table with Spark 3.0's native reader, only the pruned columns are read.
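
      A minimal sketch of switching between the two setups at runtime; the table names are hypothetical, and the configs can equally be passed with --conf at launch:

      // Test 1 setup: fall back to the Hive ORC reader
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
      spark.conf.set("spark.sql.orc.impl", "hive")
      spark.sql("SELECT col_a FROM orc_table_written_by_2_2").collect()

      // Test 2 setup: rewrite the data with Spark 3.0, then read it back with the native reader (the defaults)
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
      spark.conf.set("spark.sql.orc.impl", "native")
      spark.table("orc_table_written_by_2_2").write.mode("overwrite").saveAsTable("orc_table_written_by_3_0")
      spark.sql("SELECT col_a FROM orc_table_written_by_3_0").collect()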

       

      The issue I mentioned is a blocker for us in adopting the native reader in Spark 3.0. Does anyone know the underlying reason, or can anyone suggest a solution?

People

    • Assignee: Unassigned
    • Reporter: expxiaoli xiaoli
    • Votes: 0
    • Watchers: 2
