Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
0.12.0, 0.13.0
-
None
-
None
Description
My lab Env:
Hive 0.13
If the source table has .deflate compressed files and there is where condition,
Hive did not start small file merge feature.
If we have one partition table, and if we run SQL like:
INSERT OVERWRITE TABLE target
select xxx from source where...;
After that, "target" table has many empty files, and the number of files = the
number of mappers.
I can reproduce it in house, and here is minimum reproduce.
Is it by design or do we need to fix it?
----------------------------------------
--------------Reproduce---------------
----------------------------------------
1. Create a source tables – "source_support" and "source_support2" with the
same DDL.
"source_support" is to store normal text files, "source_support2" will have
.deflate compressed files.
CREATE TABLE source_support(
onecol string
)
PARTITIONED BY (
partcol string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
CREATE TABLE source_support2(
onecol string
)
PARTITIONED BY (
partcol string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
2. Create a one-row data file:
- cat /root/hao/000000_0
'abc'
3. Loading to 3 partitions of "source_support":
LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support
PARTITION(partcol='2015-01-01');
LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support
PARTITION(partcol='2015-01-02');
LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support
PARTITION(partcol='2015-01-03');
hive> select * from source_support;
OK
'abc' 2015-01-01
'abc' 2015-01-02
'abc' 2015-01-03
Time taken: 0.836 seconds, Fetched: 3 row(s)
4. Loading to "source_support2" from "source_support" to generate deflate
files.
set hive.exec.compress.output=true;
INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-01')
select onecol from source_support where
partcol='2015-01-01';
set hive.exec.compress.output=true;
INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-02')
select onecol from source_support where
partcol='2015-01-02';
set hive.exec.compress.output=true;
INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-03')
select onecol from source_support where
partcol='2015-01-03';
5. Source has .deflate files even though the small file merge is enabled.
drop table testbysupport2;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
create table testbysupport2 as
SELECT 'policy-sale' data_source
FROM source_support2
WHERE onecol = '2015.01.04' and partcol in
('2015-01-01','2015-01-02','2015-01-03');
[root@n3a warehouse]# ls -altr testbysupport2
total 1
drwxr-xr-x 42 xxx xxx 42 Jan 13 14:34 ..
-rwxr-xr-x 1 root root 0 Jan 13 14:34 000002_0
-rwxr-xr-x 1 root root 0 Jan 13 14:34 000001_0
-rwxr-xr-x 1 root root 0 Jan 13 14:34 000000_0
drwxr-xr-x 2 root root 3 Jan 13 14:34 .
6. If we remove the where condition "onecol = '2015.01.04'",
small file merge is now enabled.
drop table testbysupport2;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
create table testbysupport2 as
SELECT 'policy-sale' data_source
FROM source_support2
WHERE partcol in ('2015-01-01','2015-01-02','2015-01-03');
[root@n3a warehouse]# ls -altr testbysupport2
total 2
drwxr-xr-x 42 xxx xxx 42 Jan 13 14:37 ..
-rwxr-xr-x 1 root root 36 Jan 13 14:37 000000_0
drwxr-xr-x 2 root root 1 Jan 13 14:37 .
----------------------------------------
--------------Reproduce---------------
----------------------------------------