Yes, you understood my meaning correctly. As a first step, we can build a pure in-memory data structure; that's fast and straightforward. For the long term, though, we should think about the following aspects:
1. Memory is expensive. As the proverb goes, we should put our limited funds where they do the most good. We can cut the footprint through compression, an LRU cache, and multi-layer storage: memory -> SSD -> hard disk.
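A minimal sketch of the in-memory LRU layer, assuming Java and `LinkedHashMap`'s access-order mode; the class name and the demotion comment are illustrative, not an existing Tajo class:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded LRU cache: LinkedHashMap with accessOrder=true keeps entries
// ordered by recency of access, so the eldest entry is the LRU victim.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true -> LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Hook point: instead of simply dropping the eldest entry, a
        // tiered cache would demote it to the next layer (SSD / disk) here.
        return size() > capacity;
    }
}
```

With capacity 2, putting `a`, `b`, touching `a`, then putting `c` evicts `b`, since `b` is the least recently used entry at that point.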
2. I am not familiar with Tajo's YARN mode. As far as I know, workers are spawned by the NodeManager on demand. Since the workers are not always up, they can't keep data in memory to share with subsequent queries. One solution is to run this cache manager as an auxiliary service inside the NodeManager, like the shuffle service in Hadoop MapReduce.
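For the NodeManager side, the wiring would look roughly like the MapReduce shuffle handler's registration in `yarn-site.xml`. This is a sketch only: the service name `tajo_cache` and the class name below are hypothetical placeholders, not an existing Tajo implementation.

```xml
<!-- yarn-site.xml: register the cache manager as a NodeManager aux service.
     "tajo_cache" and the class below are hypothetical placeholders. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,tajo_cache</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.tajo_cache.class</name>
  <value>org.apache.tajo.cache.CacheAuxService</value>
</property>
```

The service class would extend Hadoop's `AuxiliaryService`, so it lives for the lifetime of the NodeManager rather than of any single worker container.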
3. If we embed a cache service in the NodeManager, we should think about how the workers can share cached data with that service. That is inter-process sharing, and the most efficient way to achieve it on Unix systems is mmap (see http://en.wikipedia.org/wiki/Mmap). In Java, FileChannel can map a file into memory; its backend is mmap. Better still, if we use that kind of direct memory rather than allocating from the heap, we get zero GC overhead. See https://gist.github.com/coderplay/8262536 for an example.
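A minimal sketch of the mmap idea via `FileChannel.map`: one mapping writes a record, a second independent mapping of the same region reads it back. A mapping created by another process (e.g. a worker reading a segment the cache service wrote) sees the same physical pages in the same way. The class and file names are illustrative:

```java
import java.io.File;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.StandardOpenOption;

public class MmapShareDemo {
    // Writes a length-prefixed record through one mapping of a file and
    // reads it back through a second, independent mapping of the region.
    public static String roundTrip() throws Exception {
        File f = File.createTempFile("cache-segment", ".bin");
        f.deleteOnExit();
        try (FileChannel ch = FileChannel.open(f.toPath(),
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapped pages live outside the Java heap, so the GC
            // never scans or copies them -- the "zero GC overhead" point.
            MappedByteBuffer writer = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            byte[] payload = "cached-row".getBytes(StandardCharsets.UTF_8);
            writer.putInt(payload.length);
            writer.put(payload);
            writer.force(); // flush dirty pages to the backing file

            // A second mapping of the same region sees the same pages;
            // a mapping made from another process would as well.
            MappedByteBuffer reader = ch.map(FileChannel.MapMode.READ_ONLY, 0, 4096);
            byte[] out = new byte[reader.getInt()];
            reader.get(out);
            return new String(out, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip());
    }
}
```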
Thanks for the feedback; it's really encouraging.