Spark / SPARK-31822

Schema inference consumes too many resources when reading a Hive ORC table

    Details

      Description

      When reading a partitioned Hive ORC table that has no Spark schema properties, Spark reads all partitions and all files to infer the schema.

      Other settings: native ORC implementation (spark.sql.orc.impl = native); spark.sql.hive.convertMetastoreOrc = true.

       

      I think this can be improved by passing partitionFilters to fileIndex.listFiles.

      // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
      val inferredSchema = fileFormat
        .inferSchema(
          sparkSession,
          options,
          // listFiles(Nil, Nil) passes no partition or data filters,
          // so the files of every partition are listed for inference.
          fileIndex.listFiles(Nil, Nil).flatMap(_.files))
        .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
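
      The effect of the proposal can be illustrated with a small, self-contained toy model. This is only a sketch: ToyFile, ToyPartition, ToyFileIndex and the "day" partition column are hypothetical stand-ins invented for illustration, not Spark classes, and the toy listFiles takes plain predicates rather than Spark expressions. It shows only that an empty filter list selects every partition, while a partition filter narrows the files handed to inference.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark's classes.
final case class ToyFile(path: String)
final case class ToyPartition(values: Map[String, String], files: Seq[ToyFile])

final class ToyFileIndex(partitions: Seq[ToyPartition]) {
  // An empty filter list matches every partition, like calling listFiles with Nil.
  def listFiles(partitionFilters: Seq[Map[String, String] => Boolean]): Seq[ToyPartition] =
    partitions.filter(p => partitionFilters.forall(f => f(p.values)))
}

object PrunedInferenceSketch {
  // 100 made-up partitions, one ORC file each.
  val partitions: Seq[ToyPartition] = (1 to 100).map { day =>
    ToyPartition(
      Map("day" -> day.toString),
      Seq(ToyFile(s"/warehouse/t/day=$day/part-0.orc")))
  }

  // Behaviour described in the report: no filters, so every file is an inference input.
  def filesWithoutFilters(): Seq[ToyFile] =
    new ToyFileIndex(partitions).listFiles(Nil).flatMap(_.files)

  // Proposed behaviour: only files from partitions matching the filter are listed.
  def filesWithFilters(): Seq[ToyFile] = {
    val onlyDay7: Map[String, String] => Boolean = _.get("day").contains("7")
    new ToyFileIndex(partitions).listFiles(Seq(onlyDay7)).flatMap(_.files)
  }

  def main(args: Array[String]): Unit = {
    println(s"without filters: ${filesWithoutFilters().size} files") // 100
    println(s"with filters:    ${filesWithFilters().size} files")    // 1
  }
}
```

      One trade-off to keep in mind: a schema inferred from a subset of files reflects only those files, so whether that is acceptable depends on how uniform the table's files are.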
      
      

       

       

            People

            • Assignee: Unassigned
            • Reporter: lithiumlee-_-
            • Votes: 2
            • Watchers: 3
