Description
When repairing a table with thousands of partitions, the operation can take hundreds of seconds. The Hive metastore can add only a few partitions per second, because it lists all the files in each partition to gather the fast stats (number of files, total size of files).
We could improve this by listing the files in Spark in parallel, then sending the fast stats to the Hive metastore so that it can skip this sequential listing.
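The idea can be illustrated with a minimal sketch. It is not Spark's implementation: it assumes partitions are plain directories on a local filesystem, uses a thread pool in place of a Spark job, and the function names `gather_fast_stats` and `gather_all_fast_stats` are hypothetical. The keys `numFiles` and `totalSize` are used here as labels for the two fast stats named above.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def gather_fast_stats(partition_dir):
    """Count the regular files directly under one partition directory
    and sum their sizes in bytes."""
    num_files = 0
    total_size = 0
    with os.scandir(partition_dir) as entries:
        for entry in entries:
            if entry.is_file():
                num_files += 1
                total_size += entry.stat().st_size
    return {"numFiles": num_files, "totalSize": total_size}


def gather_all_fast_stats(partition_dirs, max_workers=8):
    """List many partition directories concurrently instead of one by one.

    Returns a dict mapping each partition directory to its fast stats.
    """
    partition_dirs = list(partition_dirs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the paths.
        stats = list(pool.map(gather_fast_stats, partition_dirs))
    return dict(zip(partition_dirs, stats))
```

With the stats computed up front, they could be attached to each partition before it is sent to the metastore, so the metastore would not need to list the files again. How the metastore is told to trust the supplied values is outside the scope of this sketch.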
Issue Links
- Parent Feature: SPARK-20697 MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables. (Resolved)