Karam Singh has been extremely helpful in running various tests to hunt this down, and after a couple of weeks of hard work we finally have some results.
It turns out that most of the issues stem from our switch from 32-bit to 64-bit JVMs. Using compressed references (compressed oops) dramatically increased the AM's speed, and the job now finishes in around 30-35 mins. That is still a regression, but at least the job finishes with the compressed-oops setting and/or after changing the JVM back to 32-bit.
Giving more heap to the 32-bit JVM, around 3 GB, lets the job finish in around 7-8 mins, but that isn't something we want to do for all jobs. The fact that extra heap brings us back to the original speed definitely means that the AM is wasting time in GC. Some of the observations Sid made above may hint at the root cause.
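For anyone who wants to sanity-check the GC theory on their own runs, here is a rough sketch of how the time spent in GC could be read from inside a JVM using the standard java.lang.management beans. The class name GcOverhead and its helper methods are made up for illustration and are not part of Hadoop or the AM code. The collection time the JVM reports is approximate, so treat the number as an indicator rather than a measurement.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Hadoop: reports roughly how much of the
// JVM's uptime has been spent in garbage collection.
final class GcOverhead {

    // Sum the accumulated collection time (ms) over all collectors.
    // getCollectionTime() returns -1 when a collector does not support it,
    // so only positive values are added.
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Fraction of uptime spent in GC, clamped to the range [0, 1].
    public static double gcFraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        double fraction = (double) gcMillis / uptimeMillis;
        return Math.min(1.0, Math.max(0.0, fraction));
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        System.out.printf("GC time: %d ms of %d ms uptime (%.1f%%)%n",
                gc, uptime, 100.0 * gcFraction(gc, uptime));
    }
}
```

A high percentage that drops sharply when the heap is raised would be consistent with the observation above, though it would not by itself say which allocations are responsible.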
Will file separate tickets to fix the inefficiencies.