Spark / SPARK-31822

Reading a Hive ORC table costs too many resources during schema inference

Details

    Description

      When Spark reads a partitioned Hive ORC table that has no Spark schema properties, it lists every partition and every file in order to infer the schema.

      Other settings: native ORC mode; convertMetastoreOrc = true.
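
      The two settings named above presumably correspond to the Spark SQL options `spark.sql.orc.impl` and `spark.sql.hive.convertMetastoreOrc`. That mapping is an assumption, since the report does not spell out the keys. A session configured this way might look like the sketch below. The application name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumed mapping of the settings named in the report to Spark SQL option keys.
val spark = SparkSession.builder()
  .appName("orc-schema-inference-repro") // hypothetical application name
  .enableHiveSupport()
  .config("spark.sql.orc.impl", "native")               // "native orc mode"
  .config("spark.sql.hive.convertMetastoreOrc", "true") // "convertMetastoreOrc = true"
  .getOrCreate()
```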

      I think this can be improved by passing partitionFilters to fileIndex.listFiles, so that only the files of the matching partitions are listed.

      // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
      // listFiles(Nil, Nil) applies no partition or data filters,
      // so the files of every partition are passed to inferSchema.
      val inferredSchema = fileFormat
        .inferSchema(
          sparkSession,
          options,
          fileIndex.listFiles(Nil, Nil).flatMap(_.files))
        .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
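
      To illustrate why the empty filter lists are costly, here is a minimal, self-contained sketch. It is a toy model, not Spark code: `ToyFileIndex`, `Partition`, `DataFile`, and `filesToInspect` are hypothetical names, and plain predicate functions stand in for the Catalyst expressions that the real `listFiles` accepts. It only shows how the number of files handed to inference shrinks once a partition filter is supplied.

```scala
// Toy model only: none of these types are Spark classes.
object PartitionPruningSketch {
  // A data file inside one partition directory.
  final case class DataFile(path: String)

  // One partition: its column values plus the files it holds.
  final case class Partition(values: Map[String, String], files: Seq[DataFile])

  // Hypothetical predicate over partition column values.
  type PartitionFilter = Map[String, String] => Boolean

  // Hypothetical stand-in for a file index over a partitioned table.
  final class ToyFileIndex(partitions: Seq[Partition]) {
    // An empty filter list matches every partition, which mirrors
    // the listFiles(Nil, Nil) call in the reported snippet.
    def listFiles(partitionFilters: Seq[PartitionFilter]): Seq[Partition] =
      partitions.filter(p => partitionFilters.forall(f => f(p.values)))
  }

  // Stand-in for schema inference: counts the files it would have to open.
  def filesToInspect(index: ToyFileIndex, filters: Seq[PartitionFilter]): Int =
    index.listFiles(filters).flatMap(_.files).size

  // 30 daily partitions with 10 ORC files each (made-up layout).
  val index: ToyFileIndex = new ToyFileIndex(
    (1 to 30).map { day =>
      val dt = f"2020-05-$day%02d"
      Partition(
        Map("dt" -> dt),
        (1 to 10).map(i => DataFile(s"/warehouse/t/dt=$dt/part-$i.orc")))
    })

  val noFilters: Seq[PartitionFilter] = Nil
  val oneDay: Seq[PartitionFilter] =
    Seq((values: Map[String, String]) => values.get("dt").contains("2020-05-26"))

  def main(args: Array[String]): Unit = {
    println(filesToInspect(index, noFilters)) // prints 300
    println(filesToInspect(index, oneDay))    // prints 10
  }
}
```

      In this model, the unfiltered call touches all 300 files, while the filtered call touches only the 10 files of the selected partition. Whether the real fix can reuse the query's partition predicates at this call site is left open here.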

          People

            Assignee: Unassigned
            Reporter: lithiumlee-_-
            Votes: 2
            Watchers: 3
