I cut the related part from our code base. This patch is pretty far from a ready-to-merge state, but I guess it won't be merged anyway; it just shows the idea and the problem.
We are using the Hive metastore to represent the data schema information of HTables. The metastore holds 3 kinds of information:
- WHAT the logical fields look like. From the user's perspective, a record may have fields of different types, such as integer, date, boolean or text. This also indicates how Drill would process the fields in memory.
- WHERE the logical fields are stored in the HTable. There are many places in an HTable where information can be stored: the rowkey; the qualifier name of a particular CF; the value under a particular CF:qualifier; or even the version number of a particular cell. The value of a logical field can be stored in any of the above. Further, the rowkey may contain multiple logical fields. This depends heavily on how users design their storage schema.
- HOW the logical fields are stored. HBase basically provides storage for bytes, so the HTable scanner needs to know how fields like integer, date and boolean are serialized to bytes. For example, 255 could be serialized to \xFF as BINARY:1byte, or [FF 00 00 00] as BINARY:4byte, or "255" as TEXT (with variable length), or "00000255" as TEXT (with fixed length:8). Another example would be a logical DATE serialized (first to the integer 1375483564, then) to [AC 36 FC 51] as BINARY:4byte, or to "20130803" as TEXT (with fixed length:8).
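To make the HOW part concrete, here is a minimal sketch that reproduces the byte layouts from the examples above. The class, enum and method names are hypothetical and are not taken from HBaseFieldInfo. It assumes non-negative values, ASCII text, and little-endian byte order for BINARY, since that is what [FF 00 00 00] and [AC 36 FC 51] imply.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch; names are illustrative, not the ones used in the patch.
final class FieldEncodingSketch {
    enum Format { BINARY, TEXT }

    /**
     * Serializes a non-negative int.
     * BINARY: width is 1..4 bytes, little-endian (assumed from the examples).
     * TEXT: width > 0 means zero-padded fixed length, width <= 0 means variable length.
     */
    static byte[] encode(int value, Format format, int width) {
        if (format == Format.BINARY) {
            if (width < 1 || width > 4) {
                throw new IllegalArgumentException("BINARY width must be 1..4");
            }
            ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
            buf.putInt(value);
            // Little-endian puts the low-order bytes first, so keep the first `width` bytes.
            return Arrays.copyOf(buf.array(), width);
        }
        String text = width > 0
                ? String.format(Locale.ROOT, "%0" + width + "d", value)
                : Integer.toString(value);
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    /** Renders bytes as upper-case hex, for display only. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format(Locale.ROOT, "%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(255, Format.BINARY, 1)));        // FF
        System.out.println(hex(encode(255, Format.BINARY, 4)));        // FF000000
        System.out.println(hex(encode(1375483564, Format.BINARY, 4))); // AC36FC51
        System.out.println(new String(encode(255, Format.TEXT, 8), StandardCharsets.US_ASCII)); // 00000255
    }
}
```

The scanner needs the same (format, width) pair to decode a stored value and to encode filter constants before comparing them against stored bytes.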
The meta definition is in com.xingcloud.meta.HBaseFieldInfo.java.
This information is used in the HBase scanner to generate the most efficient scan (mapping a logical Filter to HBase's filter classes, and choosing the startKey and endKey so that the least data is scanned), and in the conversion from LogicalPlan to PhysicalPlan, to generate the correct ReadEntry for HBase.
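As a rough illustration of the startKey/endKey part, the sketch below turns a date range into a rowkey range. It is not the code from the patch: the names are hypothetical, and it assumes the rowkey starts with the date stored as fixed-length TEXT (8 bytes, like "20130803"), so that byte order and date order agree. The end key is treated as exclusive, as HBase scans do with the stop row.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; assumes the rowkey begins with a fixed-length TEXT date.
final class RowKeyRangeSketch {
    /**
     * Returns the smallest key that sorts after every key starting with `prefix`.
     * An empty array means "no upper bound" (scan to the end of the table).
     */
    static byte[] stopKeyForPrefix(byte[] prefix) {
        for (int i = prefix.length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xFF) {
                // Drop trailing 0xFF bytes, then increment the last remaining byte.
                byte[] stop = Arrays.copyOf(prefix, i + 1);
                stop[i]++;
                return stop;
            }
        }
        return new byte[0];
    }

    /** Maps "fromDate <= date <= toDate" to {startKey, endKey}, with endKey exclusive. */
    static byte[][] rangeForDates(String fromDate, String toDate) {
        byte[] start = fromDate.getBytes(StandardCharsets.US_ASCII);
        byte[] stop = stopKeyForPrefix(toDate.getBytes(StandardCharsets.US_ASCII));
        return new byte[][] { start, stop };
    }

    public static void main(String[] args) {
        byte[][] range = rangeForDates("20130801", "20130803");
        System.out.println(new String(range[0], StandardCharsets.US_ASCII)); // 20130801
        System.out.println(new String(range[1], StandardCharsets.US_ASCII)); // 20130804
    }
}
```

This only works because the HOW is known: if the same date were stored as a little-endian BINARY:4byte value, byte order would not follow numeric order, and the range could not be pushed down to startKey and endKey this way.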
I do understand that a strong schema is not Drill's primary concern. However, I think other approaches to the HBase scanner would also have to solve the problems above to work correctly. Plus, I guess most HBase users do have a carefully designed schema in mind...