Details
- Bug
- Status: Closed
- Blocker
- Resolution: Fixed
- 0.5
- None
- None
RH 5.8 (on AWS)
Hadoop 1.1.2.17 (build)
HCat 0.5 (build)
Description
The optimizations brought in by HCATALOG-538 break dynamic partitioning in the e2e tests. They assume that if the first child in a directory is a directory, then the rest are directories too, and that if the first child is a file, then the rest are files. That assumption is incorrect.
(Admittedly, one half of it, namely that a directory whose first child is a file is a leaf directory, is not necessarily a bad premise, although it is still incorrect.)
The problem is that the underlying FileOutputCommitter and OutputFormat behaviour determines whether you get files or directories, and whether any _temporary directories are left behind, for example.
In the case I tested, a "leaf" directory contains a _temporary directory followed by part files. The optimization sees the _temporary directory first, finds nothing inside it, and so doesn't mkdir any parent. It then decides that the remaining children are directories too, moves on to the part file, and tries to rename it directly without mkdir-ing its parent directory.
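The failure described above can be reproduced on a local filesystem with a small sketch. This is not the HCatalog code: the function names `make_partition`, `move_assuming_uniform` and `move_checking_each_child`, the partition value `ds=20110924` and the file name `part-m-00000` are all hypothetical, and the second function is only one possible correction, not necessarily what the committed patch does.

```python
import os
import tempfile


def make_partition(root):
    """Build the layout from the report: a leaf directory holding an
    empty _temporary directory followed by a part file."""
    part = os.path.join(root, "ds=20110924")  # hypothetical partition name
    os.makedirs(os.path.join(part, "_temporary"))
    with open(os.path.join(part, "part-m-00000"), "w") as f:
        f.write("alice\t20\n")
    return part


def move_assuming_uniform(src, dst):
    """Sketch of the flawed shortcut: only the first child decides
    whether the destination directory gets created."""
    if os.path.isfile(src):
        os.rename(src, dst)  # assumes the caller already created the parent
        return
    children = sorted(os.listdir(src))
    if not children:
        return  # empty directory: nothing to move, nothing created
    if os.path.isfile(os.path.join(src, children[0])):
        os.makedirs(dst, exist_ok=True)  # "leaf" directory: mkdir once
    for name in children:
        move_assuming_uniform(os.path.join(src, name), os.path.join(dst, name))


def move_checking_each_child(src, dst):
    """One possible correction: inspect every child and make sure the
    parent exists before each rename."""
    if os.path.isfile(src):
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        os.rename(src, dst)
        return
    for name in sorted(os.listdir(src)):
        move_checking_each_child(os.path.join(src, name), os.path.join(dst, name))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        src = make_partition(os.path.join(root, "scratch"))
        dst = os.path.join(root, "table", "ds=20110924")
        try:
            move_assuming_uniform(src, dst)
        except FileNotFoundError as err:
            print("shortcut failed:", err)
        move_checking_each_child(src, dst)
        print("moved:", os.listdir(dst))
```

Because `_temporary` sorts before the part file, the first function never creates the destination directory, and the rename of the part file fails. The second function pays for an extra existence check per file, which is the cost the original optimization was presumably trying to avoid.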
The e2e test conf in question is Pig_Checkin_7:
    {
     'num' => 7
     ,'hcat_prep'=>q\drop table if exists pig_checkin_7;
    create table pig_checkin_7 (name string, age int) partitioned by (ds string) STORED AS TEXTFILE;\
     ,'pig' => q\a = load 'studentparttab30k' using org.apache.hcatalog.pig.HCatLoader();
    b = foreach a generate name, age, ds;
    store b into 'pig_checkin_7' using org.apache.hcatalog.pig.HCatStorer();\,
     ,'result_table' => 'pig_checkin_7',
     ,'sql' => "select name, age, ds from studentparttab30k;",
     ,'floatpostprocess' => 1
     ,'delimiter' => ' '
    }
Attachments
Issue Links
- is related to: HCATALOG-538 HCatalogStorer fails for 100GB of data with dynamic partitioning (number of partition is 300) (Closed)