Spark / SPARK-21661

SparkSQL can't merge load table from Hadoop


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 2.2.0
    • Fix Version: 2.3.0
    • Component: SQL
    • Labels: None

    Description

      Here is the original file listing of the external table on HDFS:

      Permission	Owner	Group	Size	Last Modified	Replication	Block Size	Name
      -rw-r--r--	root	supergroup	0 B	8/6/2017, 11:43:03 PM	3	256 MB	income_band_001.dat
      -rw-r--r--	root	supergroup	0 B	8/6/2017, 11:39:31 PM	3	256 MB	income_band_002.dat
      ...
      -rw-r--r--	root	supergroup	327 B	8/6/2017, 11:44:47 PM	3	256 MB	income_band_530.dat
      

      After SparkSQL loads the table, every input file produces an output file, even when the input file is 0 B. When the same load is done through Hive, the data files are merged according to the total data size of the original files.
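The Hive merge behavior described above can be sketched as size-based packing: small input files are grouped into splits until a target size is reached, so many tiny (or empty) files collapse into a few output files. Below is a minimal, hypothetical illustration of that idea; it is not Hive's actual merge implementation (which is driven by CombineFileInputFormat and `hive.merge.*` settings), just a greedy grouping by size:

```python
# Simplified sketch of size-based file merging: greedily group input files
# into splits so each split holds at most `target_size` bytes.
# The function and file list are hypothetical, for illustration only.

def merge_by_size(file_sizes, target_size):
    """Greedily pack (name, size) pairs into splits of at most target_size.

    Zero-byte files add nothing to a split, so they are absorbed into the
    currently open split instead of each producing its own output file.
    """
    splits = []
    current, current_size = [], 0
    for name, size in file_sizes:
        if current and current_size + size > target_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

# Mirrors the listing above: two empty files and one 327 B file.
files = [("income_band_001.dat", 0),
         ("income_band_002.dat", 0),
         ("income_band_530.dat", 327)]

# With a 256 MB target (the block size above), all three files fit into
# a single split, hence one merged output file instead of three.
print(merge_by_size(files, target_size=256 * 1024 * 1024))
```

Spark SQL, by contrast, schedules one task per input split and writes one output file per task, which is why the empty inputs each yield an empty output.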

      Reproduce:

      CREATE EXTERNAL TABLE t1 (a int, b string) STORED AS TEXTFILE LOCATION "hdfs://xxx:9000/data/t1";
      CREATE TABLE t2 STORED AS PARQUET AS SELECT * FROM t1;
      

      The resulting table t2 has many small files containing no data.

            People

              Assignee: Yuanjian Li (XuanYuan)
              Reporter: Dapeng Sun (dapengsun)
