SPARK-34075: Hidden directories are being listed for partition inference


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.1
    • Component/s: SQL
    • Labels: None

    Description

      Marking this as a blocker since it seems to be a regression. We are running Delta's tests against Spark 3.1 as part of QA here: https://github.com/delta-io/delta/pull/579


      We have noticed that one of our tests regressed with:

      java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
      [info] 	file:/private/var/folders/_2/xn1c9yr11_93wjdk2vkvmwm00000gp/t/spark-18706bcc-23ea-4853-b8bc-c4cc2a5ed551
      [info] 	file:/private/var/folders/_2/xn1c9yr11_93wjdk2vkvmwm00000gp/t/spark-18706bcc-23ea-4853-b8bc-c4cc2a5ed551/_delta_log
      [info] 
      [info] If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
      [info]   at scala.Predef$.assert(Predef.scala:223)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:172)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:104)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:158)
      [info]   at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:73)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
      [info]   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:167)
      [info]   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:418)
      [info]   at org.apache.spark.sql.execution.datasources.ResolveSQLOnFile$$anonfun$apply$1.applyOrElse(rules.scala:62)
      [info]   at org.apache.spark.sql.execution.datasources.ResolveSQLOnFile$$anonfun$apply$1.applyOrElse(rules.scala:45)
      [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
      [info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
      [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
      [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
      [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
      [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
      [info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
      [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
      [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
      [info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
      [info]   at org.apache.spark.sql.execution.datasources.ResolveSQLOnFile.apply(rules.scala:45)
      [info]   at org.apache.spark.sql.execution.datasources.ResolveSQLOnFile.apply(rules.scala:40)
      [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
      [info]   at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
      [info]   at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
      [info]   at scala.collection.immutable.List.foldLeft(List.scala:89)
      [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
      [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
      [info]   at scala.collection.immutable.List.foreach(List.scala:392)
      [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
      [info]   at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:195)
      [info]   at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:189) 

      It seems that a hidden directory is not being filtered out when it should be.
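      For context, Spark's file listing normally excludes "hidden" paths whose names start with `_` or `.` (modulo a few exceptions such as partition directories like `col=value` and Parquet summary files), which is why `_delta_log` should never reach partition inference. A minimal Python sketch of that kind of predicate (modeled loosely on `InMemoryFileIndex.shouldFilterOut`; the exact exceptions are an assumption, not a quote of the Spark source):

      ```python
      # Sketch of a hidden-path filtering predicate as used during Spark's
      # file listing. Names and the exact exception list are approximate.
      def should_filter_out(path_name: str) -> bool:
          """Return True if the path name should be excluded from listing."""
          # Names starting with "." are hidden; names starting with "_" are
          # hidden unless they contain "=" (a partition directory like
          # "year=2021") or are Parquet summary files.
          hidden = (path_name.startswith("_") and "=" not in path_name) or \
              path_name.startswith(".")
          summary_file = path_name.startswith("_common_metadata") or \
              path_name.startswith("_metadata")
          return hidden and not summary_file

      # The _delta_log directory from the stack trace above should be
      # filtered out, while an ordinary partition directory is kept.
      print(should_filter_out("_delta_log"))  # True  (excluded)
      print(should_filter_out("year=2021"))   # False (kept)
      ```

      Under such a predicate the `_delta_log` path in the error above would be dropped before `PartitioningUtils.parsePartitions` runs, so the "Conflicting directory structures" assertion suggests the filter is being skipped on the partition-inference path in 3.1.0.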

          People

            Assignee: Gengliang Wang
            Reporter: Burak Yavuz
            Votes: 0
            Watchers: 4
