Spark / SPARK-35191

All columns are read even if column pruning applies when Spark 3.0 reads a table written by Spark 2.2


Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Environment: Spark 3.0

      spark.sql.hive.convertMetastoreOrc=true (default value in Spark 3.0)

      spark.sql.orc.impl=native (default value in Spark 3.0)

    Description

      Before I describe the issue, some background: the Spark version we currently use is 2.2, and we plan to migrate to Spark 3.0 in the near future. Before migrating, we ran some queries in both Spark 2.2 and Spark 3.0 to check for potential issues. The source tables for these queries are in ORC format and were written by Spark 2.2.

       

      I found that even when column pruning is applied, Spark 3.0's native ORC reader still reads all columns.
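
      One way to observe this is to run a query that only needs a single column and compare what the scan actually reads. Below is a minimal sketch; the table and column names are hypothetical:

      // query that should only need col_a after pruning
      val df = spark.sql("SELECT col_a FROM orc_table_written_by_2_2")
      df.explain(true)   // the FileScan node's ReadSchema lists only col_a, i.e. pruning is planned
      df.collect()
      // comparing the scan's input-size metric (Spark UI, stage "Input Size / Records") against a
      // full SELECT * run shows whether the other columns were actually skipped on disk; in the
      // problematic case both runs read roughly the same amount of data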

       

      Then I did some remote debugging. In OrcUtils.scala's requestedColumnIds method, Spark checks whether the physical field names start with "_col". In my case they do (e.g. "_col1", "_col2"), so pruneCols is not done. The code is below:

       

      if (orcFieldNames.forall(_.startsWith("_col"))) {
        // This is a ORC file written by Hive, no field names in the physical schema, assume the
        // physical schema maps to the data scheme by index.
        assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
          s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
          "no idea which columns were dropped, fail to read.")
        // for ORC file written by Hive, no field names
        // in the physical schema, there is a need to send the
        // entire dataSchema instead of required schema.
        // So pruneCols is not done in this case
        Some(requiredSchema.fieldNames.map { name =>
          val index = dataSchema.fieldIndex(name)
          if (index < orcFieldNames.length) {
            index
          } else {
            -1
          }
        }, false)

      Although the code comment explains the reason, I still do not fully understand it. This issue only happens in one case: Spark 3.0 using the native reader to read a table written by Spark 2.2.
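
      To confirm which case a table falls into, one can check the physical field names stored in the ORC file footers, independently of the metastore schema. Below is a minimal sketch; the file path is hypothetical:

      // read one data file directly, so the ORC footer schema is used instead of the metastore schema
      val physical = spark.read.orc("/warehouse/mydb.db/my_table/part-00000")
      physical.printSchema()
      // files written through the Hive ORC SerDe typically show _col0, _col1, ... here,
      // while files written by the native ORC writer keep the real column names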

       

      In other cases there is no such issue. I ran two more tests (a sketch of both setups follows below):

      Test 1: using Spark 3.0's Hive reader (with spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read the same table, only the pruned columns are read.

      Test 2: writing a table with Spark 3.0 and then reading this new table with Spark 3.0's native reader, only the pruned columns are read.
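
      A minimal sketch of switching between the two setups at runtime; the table names are hypothetical, and the configs can equally be passed with --conf at launch:

      // Test 1 setup: fall back to the Hive ORC reader
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
      spark.conf.set("spark.sql.orc.impl", "hive")
      spark.sql("SELECT col_a FROM orc_table_written_by_2_2").collect()

      // Test 2 setup: rewrite the data with Spark 3.0, then read it back with the native reader (the defaults)
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
      spark.conf.set("spark.sql.orc.impl", "native")
      spark.table("orc_table_written_by_2_2").write.mode("overwrite").saveAsTable("orc_table_written_by_3_0")
      spark.sql("SELECT col_a FROM orc_table_written_by_3_0").collect()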

       

      The issue I mentioned is a blocker for us in adopting the native reader in Spark 3.0. Does anyone know the underlying reason, or can anyone suggest a solution?

People

    • Assignee: Unassigned
    • Reporter: expxiaoli xiaoli
    • Votes: 0
    • Watchers: 2
