Details
Description
When building a cube, I noticed that the "build dictionary" and "calculate cube" steps started a very large number of mappers (more than 10,000). The logs showed that many of these mappers had zero or very few records to process, which confused me.
I then checked the storage location of the flat table and found it contained many files; a count showed the number of files matched the number of mappers.
Too many mappers cause significant overhead and degrade the cluster's performance; Kylin should ask Hive to merge these small files during the "create flat table" step.
In my Hadoop cluster, hive.merge.mapredfiles was set to false (the default value). After changing it to true for Kylin's job, the intermediate table's file count was reduced to 4, each file up to 256 MB, which looks good. See the Hive configuration reference at: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration
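For reference, the merge behavior described above can be enabled per-session with Hive SET statements such as the following sketch; the size values shown are Hive's documented defaults, and they should be tuned for the target cluster:

```sql
-- Merge the small output files of map-reduce jobs (default: false)
SET hive.merge.mapredfiles=true;
-- Merge the outputs of map-only jobs as well (default: true)
SET hive.merge.mapfiles=true;
-- Target size of each merged file, in bytes (default: 256 MB)
SET hive.merge.size.per.task=256000000;
-- Trigger a merge pass when the average output file size
-- falls below this threshold, in bytes (default: 16 MB)
SET hive.merge.smallfiles.avgsize=16000000;
```

Setting these only for the session that creates Kylin's intermediate flat table avoids changing cluster-wide Hive defaults.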