Suresh is willing to do the performance benchmark, but I am trying to understand where you are coming from. Yahoo and FB create very large namespaces by simply buying more memory and increasing the size of the heap.
This is not always possible. Some of our namenodes are running at the maximum configuration for the box (maximum memory, maximum heap, near maximum namespace). For these clusters, upgrading to this feature will require new boxes.
Do you worry about cache pollution when you create 50K more files?
I don't worry about cache pollution when I create 50K more files. What's important is the size of the working set. Inodes are a very popular object within the NN; if they make up a significant part of our working set, then the larger inode matters. I don't know whether this is the case, which is why I think it makes sense to run some benchmarks to make sure we don't see any ill effects. With the introduction of YARN, the central RM is rarely the bottleneck. Now it's much more common for the NN to be the bottleneck of the cluster, and slowing down the bottleneck always needs to be looked at carefully.
Given that the NN heap (many GBs) is so much larger than the cache, does the additional inode and inode-map size impact the overall system performance?
Good question. Let's find out.
Suresh has argued that a 24GB heap grows by 625MB.
I was using the numbers Todd gathered, where a 7G heap grew by 600MB. When we looked at one of our key clusters, we calculated an increase of roughly 7.5%.
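To put the two sets of figures on the same footing, here is a rough sketch of the arithmetic. It assumes the growth is measured against the original heap size and that 1 GB = 1024 MB; the function name is only for illustration, and the 7.5% figure from our own cluster is not reproduced here because the underlying heap numbers are not given in this thread.

```python
def heap_growth_percent(growth_mb: float, heap_gb: float) -> float:
    """Return growth as a percentage of the original heap.

    Assumes binary units (1 GB = 1024 MB); with decimal units the
    results come out slightly higher.
    """
    heap_mb = heap_gb * 1024
    return 100.0 * growth_mb / heap_mb


# Figures quoted in this thread: (growth in MB, original heap in GB).
quoted = {
    "24GB heap, +625MB": (625, 24),
    "7G heap, +600MB": (600, 7),
}

for label, (growth_mb, heap_gb) in quoted.items():
    pct = heap_growth_percent(growth_mb, heap_gb)
    print(f"{label}: {pct:.1f}% growth")
```

Under those assumptions the 24GB case works out to about 2.5% and the 7G case to about 8.4%, so which heap you measure against changes the picture considerably.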
Looking at the growth in memory of this feature as a percentage of the total heap size is a more realistic way of looking at the impact of the growth than the growth of an individual data structure like the inode.
IMHO, not having an inode-map and inode number was a serious limitation in the original implementation of the NN. I am willing to pay for the extra memory given the value inode-id and inode-map bring (as described by Suresh at the beginning of this Jira). Permissions, access time, etc. added to the memory cost of the NN and were accepted because of the value they bring.
Certainly agree it is a limitation. We just need to make sure we fully quantify all of the costs.