That's an interesting idea. Do you mean that Tajo will use Parquet as the default storage format, or that all storage formats will deserialize into a representation that follows the Dremel model? Parquet doesn't really have its own in-memory representation; each of the Parquet packages deserializes into a given in-memory representation using its readers and writers. For example, parquet-avro deserializes into Avro GenericRecords (or SpecificRecords), parquet-pig deserializes into Pig Tuples, and my code deserializes into Tajo Tuples.
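To sketch that pattern, here is a minimal, hedged illustration of how each binding supplies its own materializer that turns the same low-level column values into its framework's record type. The interface and class names below are simplified stand-ins invented for this example, not the actual parquet-mr API:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in for the pattern used by the parquet-* bindings:
// the same raw column values are materialized into different in-memory
// record types depending on the binding. Not the real parquet-mr API.
interface RecordMaterializer<T> {
    T materialize(List<Object> columnValues);
}

// A binding like parquet-pig materializes into Pig Tuples; here we
// fake a "tuple" as a plain List.
class ListTupleMaterializer implements RecordMaterializer<List<Object>> {
    public List<Object> materialize(List<Object> columnValues) {
        return columnValues;
    }
}

// A binding like parquet-avro materializes into GenericRecords; here
// we fake a record as a comma-joined string.
class StringRecordMaterializer implements RecordMaterializer<String> {
    public String materialize(List<Object> columnValues) {
        StringBuilder sb = new StringBuilder();
        for (Object v : columnValues) {
            if (sb.length() > 0) sb.append(",");
            sb.append(v);
        }
        return sb.toString();
    }
}

public class MaterializerDemo {
    public static void main(String[] args) {
        List<Object> raw = Arrays.asList((Object) 1, "tajo");
        System.out.println(new ListTupleMaterializer().materialize(raw));
        System.out.println(new StringRecordMaterializer().materialize(raw));
    }
}
```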
My changes are currently in the parquet branch in my fork on GitHub: https://github.com/davidzchen/tajo/tree/parquet
They are almost ready. During further testing, I found a few more issues, most of which I have now fixed. One thing I noticed was that when reading a projection, the resulting Tuple still has all the columns of the table schema, but the non-projected fields are simply null. What is the motivation for retaining all the columns in the Tuple rather than having the Tuple contain only the projected columns?
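To make the behavior concrete, here is a small sketch of the full-width projection described above; the schema, values, and method name are invented for this example and are not Tajo code:

```java
import java.util.Arrays;

// Illustrates the observed behavior: reading a projection yields a row
// that keeps the full schema width, with non-projected columns nulled
// out rather than dropped.
public class ProjectionDemo {
    // Keep full width; null out columns not in the projection mask.
    static Object[] projectFullWidth(Object[] row, boolean[] projected) {
        Object[] out = new Object[row.length];
        for (int i = 0; i < row.length; i++) {
            out[i] = projected[i] ? row[i] : null;
        }
        return out;
    }

    public static void main(String[] args) {
        Object[] row = {1, "alice", 9.5};       // schema: id, name, score
        boolean[] mask = {false, true, false};  // project only "name"
        System.out.println(Arrays.toString(projectFullWidth(row, mask)));
        // prints [null, alice, null]
    }
}
```

The alternative design would return a narrow tuple containing only the projected column (`[alice]`), which is the trade-off the question above is asking about.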
There is one last failing test, caused by the fact that I am not handling the NULL_TYPE data type when converting the Tajo schema to a Parquet schema on write. What is NULL_TYPE used for? I wasn't able to find much documentation on its use. I can always write this as a placeholder column or special-case it. Once I fix this, I will post a review request.
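For illustration, the placeholder/special-case idea could look like the sketch below. The enum, method, and type-name strings are all hypothetical stand-ins, not Tajo's or Parquet's actual API:

```java
// Hedged sketch of special-casing a NULL-typed column when converting
// a table schema to a storage-format schema. All names are invented
// for this example.
enum ColumnType { INT4, TEXT, NULL_TYPE }

public class SchemaConvertDemo {
    // Map a column type to a storage-format type name, special-casing
    // NULL_TYPE as a placeholder instead of failing on it.
    static String toStorageType(ColumnType t) {
        switch (t) {
            case INT4:      return "int32";
            case TEXT:      return "binary (UTF8)";
            case NULL_TYPE: return "binary (placeholder)"; // special case
            default:
                throw new IllegalArgumentException("unknown type: " + t);
        }
    }

    public static void main(String[] args) {
        System.out.println(toStorageType(ColumnType.NULL_TYPE));
        // prints binary (placeholder)
    }
}
```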
There are some follow-up work items that I plan to do, most likely as review changes:
- Add TableStats to ParquetAppender.
- Figure out if ParquetAppender.flush() is needed.
- Additional end-to-end testing.
- Add some documentation.
Edit: Update GitHub link.