SPARK-17153: [Structured streams] readStream ignores partition columns



      Description

      When Parquet files are persisted with partitioning, Spark's `readStream` returns data with all `null`s for the partition columns.

      For example:

      case class A(id: Int, value: Int)
      
      // spark.implicits._ provides the Encoder that createDataset needs for case classes
      import spark.implicits._
      
      val data = spark.createDataset(Seq(
        A(1, 1),
        A(2, 2),
        A(2, 3)
      ))
      
      val url = "/mnt/databricks/test"
      data.write.partitionBy("id").parquet(url)
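
      For context, `partitionBy("id")` stores the partition values in directory names rather than inside the Parquet data files, so the on-disk layout looks roughly like this (file names are illustrative):

      /mnt/databricks/test/id=1/part-00000-<uuid>.snappy.parquet
      /mnt/databricks/test/id=2/part-00000-<uuid>.snappy.parquet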
      

      When the data is read back as a stream:

      spark.readStream.schema(spark.read.load(url).schema).parquet(url)
      

      it returns:

      id, value
      null, 1
      null, 2
      null, 3
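
      To actually observe this, the streaming DataFrame needs a sink attached to it. A minimal sketch using the memory sink (the query name `partition_test` is just an illustrative choice, and `spark`/`url` are as above):

      val streamed = spark.readStream
        .schema(spark.read.load(url).schema)
        .parquet(url)
      
      val query = streamed.writeStream
        .format("memory")            // collect results into an in-memory table
        .queryName("partition_test") // illustrative name for that table
        .start()
      
      query.processAllAvailable()    // wait for the available files to be processed
      spark.table("partition_test").show()
      // per this report, the id column comes back null for every row
      query.stop()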
      

      A possible reason is that `readStream` reads the Parquet files directly, but when partitioned data is written the partition columns are excluded from the files themselves; their values are encoded in the directory names instead. In the given example the Parquet files contain only the `value` column, since `id` is the partition column.
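
      For comparison, a plain batch read of the same path goes through partition discovery and does recover the `id` values, which suggests the problem is specific to the streaming file source (sketch, same `spark` session and `url` as above):

      // Batch read: partition discovery rebuilds id from the id=<value> directory names.
      // Partition columns are typically appended after the data columns in the result.
      spark.read.parquet(url).show()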


            People

            • Assignee: L. C. Hsieh (viirya)
            • Reporter: Dmitri Carpov (dcarpov)
            • Votes: 0
            • Watchers: 5
