Thejas M Nair/Hari Sankar Sivarama Subramaniyan : I have a few thoughts on moving JDOException retries solely to the metastore:
a) First, we have seen cases where a JDOException invalidates the connection on the metastore side, and retrying from the metastore has not helped. Retrying from the client side, though, triggers a fresh openTransaction() that clears the connection and all history, sometimes by hitting a different HMSHandler, so a retry from the client is more likely to succeed than a retry from the server. Admittedly, this is most likely because the metastore code needs cleaning up so that a metastore-side retry handles this properly, which is something we should attempt to improve.
b) Second, on a loaded metastore, having a metastore thread do the retries uses up valuable metastore resources and time, which is more wasteful than having the client retry. We therefore keep the number of metastore-side retries low. Having client-side retries as well lets us fail fast on the metastore while still retrying many times in particular clients if we find the need to do so. In HA configurations in particular, I've seen a large number of retries with longer retry intervals on the client side allow a connection to go through despite metastore HUPs.
c) Third, speaking of HA, retrying on the client side also lets us hit alternate metastores, if configured, when one metastore is getting bogged down. As you mention, the client should ideally retry only connection exceptions, but JDOExceptions are frequently the result of connection exceptions raised by the connection pool between the metastore and the db.
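To make points (a) through (c) concrete, here is a rough sketch of what I mean by a client-side retry that starts each attempt fresh and rotates across the configured metastores. This is only an illustration, not the actual Hive client code: the class, the MetastoreCall interface, TransientStoreException (standing in for a JDOException caused by a lost db connection), and the endpoint strings are all made-up names.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; none of these names come from the Hive code base.
class ClientSideRetrySketch {

    /** Stand-in for one metastore operation run against a given endpoint. */
    public interface MetastoreCall<T> {
        T run(String endpoint) throws Exception;
    }

    /** Stand-in for a JDOException that wraps a lost db connection (assumption). */
    public static class TransientStoreException extends RuntimeException {
        public TransientStoreException(String message) {
            super(message);
        }
    }

    /** Assumption: only connection-like failures are worth another attempt. */
    public static boolean isRetriable(Exception e) {
        return e instanceof IOException || e instanceof TransientStoreException;
    }

    /**
     * Tries the call up to maxAttempts times, moving to the next endpoint on
     * each attempt so that a bogged-down metastore is skipped. Every attempt
     * invokes the call from scratch, so no state from a failed attempt is reused.
     */
    public static <T> T callWithRetries(List<String> endpoints, int maxAttempts,
                                        long delayMillis, MetastoreCall<T> call)
            throws Exception {
        if (endpoints.isEmpty() || maxAttempts < 1) {
            throw new IllegalArgumentException("need at least one endpoint and one attempt");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String endpoint = endpoints.get(attempt % endpoints.size());
            try {
                return call.run(endpoint);
            } catch (Exception e) {
                if (!isRetriable(e)) {
                    throw e; // not a connection-like failure: do not retry
                }
                last = e;
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(delayMillis); // wait before the next attempt
                }
            }
        }
        throw last; // every attempt failed with a retriable error
    }

    public static void main(String[] args) throws Exception {
        List<String> endpoints = Arrays.asList("thrift://ms1:9083", "thrift://ms2:9083");
        // The first endpoint always fails here, so the second attempt lands on ms2.
        String result = callWithRetries(endpoints, 5, 100L, endpoint -> {
            if (endpoint.contains("ms1")) {
                throw new TransientStoreException("db connection lost");
            }
            return "ok from " + endpoint;
        });
        System.out.println(result);
    }
}
```

The attempt count and delay live entirely on the client here, which is the point of (b): the metastore itself can stay fail-fast while individual clients choose how patient to be.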
There is definitely scope for refactoring and improvement in all this, and I will look into it further, but for now, this is a simpler bugfix to enable the already-existing regex to work correctly.