SPARK-17153

[Structured Streaming] readStream ignores partition columns



    Description

      When Parquet files are written with partitioning, Spark's `readStream` returns all `null`s for the partition columns.

      For example:

      // Assumes an active SparkSession named `spark` (e.g. in spark-shell)
      import spark.implicits._

      case class A(id: Int, value: Int)

      val data = spark.createDataset(Seq(
        A(1, 1),
        A(2, 2),
        A(2, 3)
      ))

      // Partition by `id`: the partition values are encoded in the directory
      // names (id=1/, id=2/), not inside the Parquet files themselves.
      val url = "/mnt/databricks/test"
      data.write.partitionBy("id").parquet(url)
      

      When the data is read back as a stream:

      // Take the schema from a batch read, which includes the partition column `id`
      spark.readStream.schema(spark.read.load(url).schema).parquet(url)
      

      the stream returns:

      id, value
      null, 1
      null, 2
      null, 3
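
      In contrast, a plain batch read of the same path resolves the partition column from the directory names (a minimal sketch, assuming the data written above; row order may vary):

      spark.read.parquet(url).show()
      // +-----+---+
      // |value| id|
      // +-----+---+
      // |    1|  1|
      // |    2|  2|
      // |    3|  2|
      // +-----+---+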
      

      A possible reason: `readStream` reads the Parquet files directly, but when the data is written with `partitionBy`, the partition columns are excluded from the files themselves and encoded only in the directory names. In this example the Parquet files contain only the `value` column, since `id` is the partition column.
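
      The on-disk layout produced by `partitionBy` is consistent with this (a sketch; the actual part-file names will differ):

      // /mnt/databricks/test/id=1/part-....parquet
      // /mnt/databricks/test/id=2/part-....parquet

      // Reading a single partition directory directly (without the base path)
      // exposes only the columns physically stored in the files:
      spark.read.parquet(url + "/id=1").columns   // Array(value)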

    People

      L. C. Hsieh (viirya)
      Dmitri Carpov (dcarpov)