Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
Catalogd shows the top-25 largest tables in its WebUI at the "/catalog" endpoint. The estimated metadata size is computed in HdfsTable#getTHdfsTable():
https://github.com/apache/impala/blob/0d49c9d6cc7fc0903d60a78d8aaa996af0249c06/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L2414-L2451
The current formula is
- memUsageEstimate = numPartitions * 2KB + numFiles * 500B + numBlocks * 150B + (optional) incrementalStats
- (optional) incrementalStats = numPartitions * numColumns * 200B
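For reference, the formula above can be written out as a minimal sketch. This is not the actual HdfsTable code: the class, method, and constant names are hypothetical, and "2KB" is assumed to mean 2048 bytes.

```java
/** Hypothetical sketch of the estimate described above; not the real HdfsTable code. */
public class MemUsageEstimateSketch {
    // Per-item costs taken from the formula. Assumption: 2KB means 2048 bytes.
    static final long PER_PARTITION_BYTES = 2048;
    static final long PER_FILE_BYTES = 500;
    static final long PER_BLOCK_BYTES = 150;
    static final long PER_COLUMN_INCREMENTAL_STATS_BYTES = 200;

    /** Returns the estimated metadata size in bytes for one table. */
    static long estimateBytes(long numPartitions, long numFiles, long numBlocks,
            long numColumns, boolean hasIncrementalStats) {
        long estimate = numPartitions * PER_PARTITION_BYTES
                + numFiles * PER_FILE_BYTES
                + numBlocks * PER_BLOCK_BYTES;
        if (hasIncrementalStats) {
            // Optional term: one stats entry per partition per column.
            estimate += numPartitions * numColumns * PER_COLUMN_INCREMENTAL_STATS_BYTES;
        }
        return estimate;
    }

    public static void main(String[] args) {
        // 10 partitions, 100 files, 200 blocks, 5 columns, with incremental stats.
        System.out.println(estimateBytes(10, 100, 200, 5, true)); // prints 110480
    }
}
```

Note that none of the inputs depend on string lengths, so two tables with the same partition, file, block, and column counts always get the same estimate.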
This formula is fine for comparing tables against each other, but it can't be used to estimate the max heap size that catalogd needs. E.g. it ignores column comments and tblproperties, which can contain long strings. Column names should also be counted, since they add up when the table is wide.
We can compare the estimated sizes with the results from ehcache-sizeof or jamm and update the formula accordingly. Alternatively, we could use these libraries to compute the sizes directly, if that doesn't hurt performance.
CC MikaelSmith