Spark / SPARK-6921

Spark SQL API "saveAsParquetFile" will output tachyon file with different block size


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 1.3.0, 1.3.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      I ran the code below in the Spark shell to access Parquet files in Tachyon.
      1. First, created a DataFrame by loading a bunch of Parquet files from Tachyon:
      val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m");
      2. Second, set "fs.local.block.size" to 256 MB to make sure that the block size of the output files in Tachyon is 256 MB:
      sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
      3. Third, saved the above DataFrame as Parquet files stored in Tachyon:
      ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test");
      After this code ran successfully, the output Parquet files were stored in Tachyon, but the files have different block sizes. Below is the information for those files in the path "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
      File Name            | Size      | Block Size | In-Memory | Pin | Creation Time
      _SUCCESS             | 0.00 B    | 256.00 MB  | 100%      | NO  | 04-13-2015 17:48:23:519
      _common_metadata     | 1088.00 B | 256.00 MB  | 100%      | NO  | 04-13-2015 17:48:23:741
      _metadata            | 22.71 KB  | 256.00 MB  | 100%      | NO  | 04-13-2015 17:48:23:646
      part-r-00001.parquet | 177.19 MB | 32.00 MB   | 100%      | NO  | 04-13-2015 17:46:44:626
      part-r-00002.parquet | 177.21 MB | 32.00 MB   | 100%      | NO  | 04-13-2015 17:46:44:636
      part-r-00003.parquet | 177.02 MB | 32.00 MB   | 100%      | NO  | 04-13-2015 17:46:45:439
      part-r-00004.parquet | 177.21 MB | 32.00 MB   | 100%      | NO  | 04-13-2015 17:46:44:845
      part-r-00005.parquet | 177.40 MB | 32.00 MB   | 100%      | NO  | 04-13-2015 17:46:44:638
      part-r-00006.parquet | 177.33 MB | 32.00 MB   | 100%      | NO  | 04-13-2015 17:46:44:648

      It seems that the API saveAsParquetFile does not distribute/broadcast the Hadoop configuration to the executors the way other APIs such as saveAsTextFile do. The configuration "fs.local.block.size" only takes effect on the driver.
      If I set that configuration before loading the Parquet files, the problem goes away.
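      The ordering-sensitive workaround above can be sketched as a spark-shell session. This is a reproduction sketch, not a fix: the Tachyon paths and the 256 MB value are taken from this report, and `sc`/`sqlContext` are the spark-shell defaults in Spark 1.3.x. It requires a running Spark shell with a reachable Tachyon master, so it is shown here only to illustrate the configuration ordering:

      ```scala
      // Set the block size BEFORE the first Parquet read, so the Hadoop
      // configuration that gets captured for the job already carries
      // fs.local.block.size when it reaches the executors.
      sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456L) // 256 MB

      // With the setting in place up front, both the load and the save
      // observe the 256 MB block size.
      val ta3 = sqlContext.parquetFile(
        "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
      ta3.saveAsParquetFile(
        "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
      ```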


          People

            Assignee: Unassigned
            Reporter: zhangxiongfei
            Votes: 1
            Watchers: 6
