Spark / SPARK-21661

SparkSQL can't merge load table from Hadoop


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 2.2.0
    • Fix Version: 2.3.0
    • Component: SQL
    • Labels: None

    Description

      Here is the original file listing of the external table on HDFS:

      Permission	Owner	Group	Size	Last Modified	Replication	Block Size	Name
      -rw-r--r--	root	supergroup	0 B	8/6/2017, 11:43:03 PM	3	256 MB	income_band_001.dat
      -rw-r--r--	root	supergroup	0 B	8/6/2017, 11:39:31 PM	3	256 MB	income_band_002.dat
      ...
      -rw-r--r--	root	supergroup	327 B	8/6/2017, 11:44:47 PM	3	256 MB	income_band_530.dat
      

      After SparkSQL loads the table, every input file produces an output file, even when the input file is 0 B. When the same load is done through Hive, the data files are merged according to the total data size of the original files.
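The Hive merge behavior described above can be sketched as size-based packing: small input files are grouped into splits until a target size is reached, so many tiny (or empty) files collapse into a few output files. Below is a minimal, hypothetical illustration of that idea; it is not Hive's actual merge implementation (which is driven by CombineFileInputFormat and `hive.merge.*` settings), just a greedy grouping by size:

```python
# Simplified sketch of size-based file merging: greedily group input files
# into splits so each split holds at most `target_size` bytes.
# The function and file list are hypothetical, for illustration only.

def merge_by_size(file_sizes, target_size):
    """Greedily pack (name, size) pairs into splits of at most target_size.

    Zero-byte files add nothing to a split, so they are absorbed into the
    currently open split instead of each producing its own output file.
    """
    splits = []
    current, current_size = [], 0
    for name, size in file_sizes:
        if current and current_size + size > target_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

# Mirrors the listing above: two empty files and one 327 B file.
files = [("income_band_001.dat", 0),
         ("income_band_002.dat", 0),
         ("income_band_530.dat", 327)]

# With a 256 MB target (the block size above), all three files fit into
# a single split, hence one merged output file instead of three.
print(merge_by_size(files, target_size=256 * 1024 * 1024))
```

Spark SQL, by contrast, schedules one task per input split and writes one output file per task, which is why the empty inputs each yield an empty output.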

      Reproduce:

      CREATE EXTERNAL TABLE t1 (a int, b string) STORED AS TEXTFILE LOCATION "hdfs://xxx:9000/data/t1";
      CREATE TABLE t2 STORED AS PARQUET AS SELECT * FROM t1;
      

      The resulting table t2 has many small files containing no data.

            People

              Assignee: Yuanjian Li (XuanYuan)
              Reporter: Dapeng Sun (dapengsun)
