SPARK-10287

After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.0
    • Component/s: SQL

    Description

      I have a partitioned JSON table with 1824 partitions.

      val df = sqlContext.read.format("json").load("aPartitionedJsonData")
      val columnStr = df.schema.map(_.name).mkString(",")
      println(s"columns: $columnStr")
      val hash = df
        .selectExpr(s"hash($columnStr) as hashValue")
        .groupBy()
        .sum("hashValue")
        .head()
        .getLong(0)
      

      It looks like, for JSON, we refresh the table's metadata every time buildScan is called. For a partitioned table, buildScan is called once per partition, so a single query over this table appears to refresh its metadata 1824 times.
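
      To make the suspected cost concrete, here is a minimal toy model of the pattern described above. It is only a sketch: ToyRelation, buildScanRefreshing, buildScanOnly and the two query methods are hypothetical names invented for illustration, not Spark's actual classes, and refreshing once per query is just one possible way to avoid the repeated work, not necessarily how the issue was fixed.

```scala
// Toy model only: every name here is hypothetical and is NOT Spark's real API.
class ToyRelation(partitions: Seq[String]) {
  // Counts how many times the (notionally expensive) metadata refresh ran.
  var refreshCount = 0

  // Stands in for the metadata refresh, e.g. re-listing the table's files.
  def refresh(): Unit = refreshCount += 1

  // Pattern described in the report: each per-partition scan refreshes first.
  def buildScanRefreshing(partition: String): String = {
    refresh()
    s"scan($partition)"
  }

  // Alternative: the per-partition scan never triggers a refresh itself.
  def buildScanOnly(partition: String): String = s"scan($partition)"

  // One query that scans every partition, refreshing inside each scan.
  def queryRefreshingPerPartition(): Seq[String] =
    partitions.map(buildScanRefreshing)

  // One query that refreshes a single time, then scans every partition.
  def queryRefreshingOnce(): Seq[String] = {
    refresh()
    partitions.map(buildScanOnly)
  }
}

object RefreshCountDemo {
  def main(args: Array[String]): Unit = {
    // Same partition count as the table in this report.
    val parts = (1 to 1824).map(i => s"part=$i")

    val perPartition = new ToyRelation(parts)
    perPartition.queryRefreshingPerPartition()
    println(s"refresh per buildScan: ${perPartition.refreshCount}") // prints 1824

    val once = new ToyRelation(parts)
    once.queryRefreshingOnce()
    println(s"refresh once per query: ${once.refreshCount}") // prints 1
  }
}
```

      In this model the refresh cost grows linearly with the number of partitions when it lives inside the per-partition scan, and stays constant when it is hoisted out to the query level.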

          People

            Assignee: yhuai Yin Huai
            Reporter: yhuai Yin Huai