SPARK-28563: Spark 2.4 | Reading all the data inside a partition-like directory


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Not A Bug
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

      Description

      We have upgraded our cluster from Spark 2.3 to 2.4 and are now observing different behavior when reading data.

       

      In Spark 2.3:
            spark.read.option('basePath', 'output/model').orc('output/model/abc=4')

      Expected: the "abc" column will be part of the resulting schema.

      Similarly:

            spark.read.option('basePath', 'output/model/abc=4').orc('output/model/abc=4')

      Expected: only the data inside partition abc=4 will be read, and abc will not be part of the schema, even though "output/model" contains files with different schemas.
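
      For reference, a minimal PySpark sketch of this layout and the two read patterns (the output/model path and the single abc=4 partition are illustrative; it does not reproduce the mixed-schema sibling directories mentioned above):

          from pyspark.sql import SparkSession
          from pyspark.sql import functions as F

          spark = SparkSession.builder.appName("basePath-repro").getOrCreate()

          # Write a tiny ORC dataset partitioned by "abc" under output/model.
          spark.range(10).withColumn("abc", F.lit(4)) \
              .write.partitionBy("abc").mode("overwrite").orc("output/model")

          # basePath = parent directory: "abc" is recovered as a partition column.
          parent_base = spark.read.option("basePath", "output/model").orc("output/model/abc=4")
          print(parent_base.schema)   # includes "abc"

          # basePath = the leaf directory itself: only files under abc=4 are listed
          # and "abc" is not part of the schema (the Spark 2.3 behavior described above).
          leaf_base = spark.read.option("basePath", "output/model/abc=4").orc("output/model/abc=4")
          print(leaf_base.schema)     # no "abc" column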

      In Spark 2.4:

            spark.read.option('basePath', 'output/model/abc=4').orc('output/model/abc=4')

      It tries to infer the schema from "output/model/" instead of "output/model/abc=4", and the job fails because the schemas differ:

      For partitioned table directories, data files should only live in leaf directories.
      And directories at the same level should have the same partition column name.
      Please check the following directories for unexpected files or inconsistent partition column names:
      
      at scala.Predef$.assert(Predef.scala:170)
       at org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:364)
       at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:165)
       at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:100)
       at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:131)
       at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:71)
       at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
       at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:144)
       at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
       at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
       at org.apache.spark.sql.DataFrameReader.orc(DataFrameReader.scala:662)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
       at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
       at py4j.Gateway.invoke(Gateway.java:282)
       at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
       at py4j.commands.CallCommand.execute(CallCommand.java:79)
       at py4j.GatewayConnection.run(GatewayConnection.java:238)
       at java.lang.Thread.run(Thread.java:745)
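
      A possible workaround, sketched below on the assumption that the data inside the partition directory is self-contained: read the leaf directory directly, without the basePath option, so partition discovery starts at abc=4 itself and never walks up into output/model; the partition value can be added back as a literal column if it is still needed.

          from pyspark.sql import functions as F

          # (Reuses the `spark` session from the sketch above.)
          # Read only the leaf directory; with no basePath set, partition discovery
          # does not look above output/model/abc=4 and "abc" is not inferred.
          leaf_only = spark.read.orc("output/model/abc=4")

          # Re-attach the partition value explicitly if the "abc" column is required.
          leaf_only = leaf_only.withColumn("abc", F.lit(4))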
            People

            • Assignee: Unassigned
            • Reporter: Vishal Donderia (vishaldonderia@gmail.com)