The in-memory representation of HDFS files in the catalog is highly inefficient and can be significantly improved. Currently, the catalog uses ~400-500 bytes per THdfsFileDescriptor object, which essentially includes: a) the file name and b) a list of THdfsFileBlocks. Every file block stores information about its replicas, their disk IDs, and whether each replica is cached. All of that information is currently stored in Thrift objects and can be significantly compressed.
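To illustrate the kind of compression that seems possible, here is a minimal sketch that packs one replica's metadata into a single 32-bit integer and one file block into a flat byte string. The bit layout (16-bit host index, 15-bit disk ID, 1-bit cached flag), the function names, and the assumption that hosts are referenced by an index into a shared host list are all hypothetical. None of this is taken from the Impala code base.

```python
import struct

# Hypothetical layout of one replica in a 32-bit unsigned integer:
#   bits 0-15  : index of the host in a shared host list (assumed to exist)
#   bits 16-30 : disk ID
#   bit  31    : 1 if the replica is cached, 0 otherwise
MAX_HOST_IDX = 0xFFFF
MAX_DISK_ID = 0x7FFF

# Block header: offset (int64), length (int64), replica count (uint16).
# "<" selects little-endian with no padding, so the header is 18 bytes.
BLOCK_HEADER = struct.Struct("<qqH")


def pack_replica(host_idx, disk_id, is_cached):
    """Pack one replica's metadata into a single 32-bit integer."""
    if not 0 <= host_idx <= MAX_HOST_IDX:
        raise ValueError("host index out of range")
    if not 0 <= disk_id <= MAX_DISK_ID:
        raise ValueError("disk ID out of range")
    return (int(bool(is_cached)) << 31) | (disk_id << 16) | host_idx


def unpack_replica(packed):
    """Return (host_idx, disk_id, is_cached) from a packed replica."""
    return packed & MAX_HOST_IDX, (packed >> 16) & MAX_DISK_ID, bool(packed >> 31)


def pack_block(offset, length, replicas):
    """Pack a file block: header followed by 4 bytes per replica."""
    packed = [pack_replica(*r) for r in replicas]
    return BLOCK_HEADER.pack(offset, length, len(packed)) + struct.pack(
        "<%dI" % len(packed), *packed
    )


def unpack_block(buf):
    """Return (offset, length, replicas) from a packed file block."""
    offset, length, count = BLOCK_HEADER.unpack_from(buf, 0)
    packed = struct.unpack_from("<%dI" % count, buf, BLOCK_HEADER.size)
    return offset, length, [unpack_replica(p) for p in packed]


# Example: a 128 MB block with three replicas fits in 18 + 3 * 4 bytes.
example_replicas = [(0, 1, False), (7, 3, True), (42, 0, False)]
example_block = pack_block(0, 128 * 1024 * 1024, example_replicas)
```

Under these assumptions, a block with three replicas takes 30 bytes. That figure describes only this illustrative layout and is not a measurement of what the catalog would actually save.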
Also, the catalog and the Impalad services spend a lot of time (and memory) serializing and deserializing Thrift objects. Using a more efficient serialization library (e.g. FlatBuffers) can significantly improve memory efficiency and speed when processing catalog updates.