We are not concerned about the task attempt. The problem here is the TaskTracker's availability.
Have you actually experienced TTs crashing because conf objects were too large? Or where conf objects were taking up a substantial portion of the available heap space?
The way conf was designed has its own benefits. At the same time, it comes with some disadvantages. What if a task attempt can run for a day or more? This is not uncommon in our clusters.
I would conjecture that such a task attempt is likely using many MBs or GBs of memory for the actual work it's doing. Is this patch, which saves a few hundred KBs at the extreme end, really going to move the needle?
1. With UGI, conf will be created per user in TT. (Security folks?)
But presumably only for every user who is concurrently running a task attempt on that TT, so not that many, right? Unless I'm missing something, which is certainly possible.
2. PIG or any other job can store arbitrary data. Hadoop framework should be able to deal with it as far as it can.
No disagreement there.
3. Last but not least, API should not hold on to client's data.
I see no principled reason the DFSClient "should not hold on to client's data" in the form of the conf object. If this is actually negatively impacting performance or availability, then we should certainly fix that, but you haven't demonstrated that yet.
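To make the design point concrete, here is a minimal, self-contained sketch (invented classes, not the actual DFSClient or Hadoop Configuration code) contrasting a client that retains a reference to the caller's entire conf with one that copies out only the values it uses. While the retaining client is alive, everything in the caller's conf stays reachable, including arbitrary extra data a framework like Pig may have stuffed into it:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: two ways an API object can treat a caller's conf.
public class ConfRetention {

    // Holds on to the client's data: the full conf map stays on the heap
    // for the lifetime of this object.
    static class RetainingClient {
        final Map<String, String> conf;
        RetainingClient(Map<String, String> conf) { this.conf = conf; }
    }

    // Copies out only the settings it actually uses, so the caller's conf
    // can be garbage-collected once the caller drops it.
    static class CopyingClient {
        final String blockSize;
        final String replication;
        CopyingClient(Map<String, String> conf) {
            this.blockSize = conf.get("dfs.block.size");
            this.replication = conf.get("dfs.replication");
        }
    }

    public static int retainedEntries(RetainingClient c) {
        return c.conf.size();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("dfs.block.size", "67108864");
        conf.put("dfs.replication", "3");
        // A job framework might store arbitrary extra data in the conf:
        for (int i = 0; i < 1000; i++) {
            conf.put("user.payload." + i, "x");
        }
        RetainingClient r = new RetainingClient(conf);
        CopyingClient c = new CopyingClient(conf);
        System.out.println("retaining client keeps " + retainedEntries(r) + " entries");
        System.out.println("copying client keeps 2 values: "
                + c.blockSize + ", " + c.replication);
    }
}
```

The trade-off, of course, is that the retaining style lets the client see later conf changes and avoids a copy, which is part of why the question of demonstrated impact matters.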
As every job is different, workloads can differ as well, so one can't anticipate every problem.
Certainly, but we can validate this issue with some testing. Can you please describe what you did to gather these measurements? What exactly are they actually measuring?
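For example, one cheap way to put a lower bound on a conf's footprint is to sum the key and value string lengths, as if serialized one "key=value" per line. This helper is invented for illustration, not Hadoop API; the actual on-heap cost is higher (char arrays, String and HashMap entry overheads), so a heap histogram from a profiler or jmap would be needed for a real number:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: a rough lower bound on a conf-like map's size.
public class ConfSize {

    // Counts bytes as if the map were written as "key=value\n" lines.
    public static int approxSerializedBytes(Map<String, String> conf) {
        int total = 0;
        for (Map.Entry<String, String> e : conf.entrySet()) {
            // key + '=' + value + '\n'
            total += e.getKey().length() + e.getValue().length() + 2;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // ~1000 invented entries mimicking a job conf carrying extra payload
        for (int i = 0; i < 1000; i++) {
            conf.put("user.payload." + i, "some-moderately-long-value-" + i);
        }
        System.out.println("lower bound: " + approxSerializedBytes(conf) + " bytes");
    }
}
```

Numbers gathered this way would at least let us compare conf footprint against the task attempt's overall heap usage and judge whether the savings are material.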
My issue here is that this change is being done purely as an optimization, but it's unclear to me that negative issues exist without this patch, or that this patch necessarily addresses those issues. If you can demonstrate those, I'll shut up immediately.