-*- mode: markdown; coding: utf-8; fill-column: 78 -*-

# Introduce primitive types for HBase

This proposal outlines an improvement to HBase that provides a set of types, above and beyond the existing "byte-bucket" strategy. It is intended to reduce user-level duplication of effort, provide better support for third-party integration, and provide an overall improved experience for developers using HBase.

## Motivation

HBase has thus far provided an interface for storing and retrieving byte[] values. The Bytes utility provides some assistance to the application developer for converting common Java types to and from byte[]. The resulting byte[] values do not maintain the sort order of the original values, an unfortunate anti-feature given the semantics exposed by HBase.

Users who wish to use, say, a Java Long value for a rowkey can do so easily with the existing Bytes utility. However, because of this anti-feature, scanning over values both positive and negative yields unexpected results. To do this "correctly", users are forced to implement their own solution: store only positive Long values, convert to a zero-padded string representation, or implement their own order-preserving serialization format. This activity must be repeated for each value that is used in a rowkey or column qualifier.

The HBase client API provides a simple, clean interface for using byte[]. In practice, however, users write applications that store and retrieve Strings, Integers, Doubles, &c. Thus, their code surrounding the HBase API is littered with serialization details. This is ugly at best and at worst is often a source of bugs, particularly on larger teams.

Third-party tools interacting with HBase have an experience similar to that of user code. While storing byte[] is technically sufficient, each tool ends up repeating the implementation of these serialization details.
Worse, because the tools are built independently, each one implements its own methods, which results in incompatibility. For instance, a user cannot easily persist data with Pig and then read it back again from Hive or Phoenix.

Finally, new users to HBase often struggle with schema design. Compound rowkeys are a fundamental part of the schema design process, and yet HBase provides no API-level support for these concerns. Again, this must be re-implemented with each application.

### TL;DR:

- order-preserving serialization is tricky to implement correctly (bit manipulations are not the domain of the average Java developer).
- users are forced to build their own in the common case.
- user code is often littered with serialization details.
- an HBase implementation means better interoperability for external tools (Pig, Hive, Cascading, Phoenix, Panthera).
- best practices for compound rowkeys are a common question for new users.

The desire is for HBase to ship with a set of well-understood, well-integrated types upon which users and other projects can develop. Just as an RDBMS provides support for a handful of primitive types, so too should HBase.

This integration can and should exist at the client level. Keeping it restricted to that side of the wire will have a number of useful side-effects: it remains optional for users uninterested in the new functionality; can evolve without impact on existing storefile formats; has minimal (no?) impact on Master, RegionServer, and RPC implementation; and makes a user-extensible implementation possible.

## Considerations

HBase today is primarily a tool for Java developers. This won't always be the case. The supported types should make sense outside of the Java context, and serialization should be based on a specification rather than a single reference implementation. This will enable the implementation of native clients in other languages.

The following features are considered necessary in a solution:

- order preserving
- ascending and descending order
- compound 'struct' type
- language-agnostic specification
- Java implementation
- fixed-width encoding
- variable-width encoding (for Strings)

Additionally, these are nice to have:

- variable-width encoding (for non-String types)
- compound 'union' type
- on-disk schema version management (see avro, protobuf)
- self-identifying serialized values (see protobuf)
- single implementation for all languages (see protobuf)

## Types

This implementation should support all the types already supported by the existing Bytes utility. These types include:

- BOOLEAN (single byte, 0 or 1),
- SHORT (signed whole number on 2 bytes),
- INT (signed whole number on 4 bytes),
- LONG (signed whole number on 8 bytes),
- VARINT (whole number encoded to a variable length),
- FLOAT (decimal number, on 4 bytes),
- DOUBLE (decimal number, on 8 bytes),
- DECIMAL (decimal number encoded to a variable length),
- CHAR (fixed-length String),
- VARCHAR (variable-length String),
- DATETIME (an instant in time),
- BYTE (fixed-length byte[]),
- VARBYTE (variable-length byte[]).

Compound STRUCT types will also be supported. Their implementation is not yet specified. Features to consider include: schema storage; schema version management; component order; support for optional, repeated members; a packed UNION type. Also important for the STRUCT type is efficient support for both variable- and fixed-width encoded components.

A specification for null values is TBD. HBase provides efficient columnar storage, so there is no internal penalty imposed on null values. However, proper handling of null values within compound types must be considered.

## Serialization strategies

### BOOLEAN.

BOOLEAN values are stored in a single byte: all 1's for a TRUE value, 0x01 for a FALSE value, or all 0's for a NULL. When the sort order is specified as DESCENDING, the encoded value is logically inverted.

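
As a rough illustration of the BOOLEAN mapping just described, the following Java sketch encodes a nullable value into one byte and compares encoded bytes as unsigned values. The class and method names (`BooleanSketch`, `encode`, `compareUnsigned`) are hypothetical and not part of any HBase API. Comparing as unsigned bytes is an assumption, chosen to match how HBase orders byte[]. Under that assumption, the ascending encodings order as NULL, FALSE, TRUE, and inverting the byte reverses that order.

```java
final class BooleanSketch {

    /** Encodes a nullable Boolean into a single byte (hypothetical helper). */
    static byte encode(Boolean value, boolean descending) {
        byte b;
        if (value == null) {
            b = 0x00;         // NULL: all 0's
        } else if (value) {
            b = (byte) 0xFF;  // TRUE: all 1's
        } else {
            b = 0x01;         // FALSE
        }
        // DESCENDING: logically invert the encoded value.
        return descending ? (byte) ~b : b;
    }

    /** Compares two encoded bytes as unsigned values. */
    static int compareUnsigned(byte a, byte b) {
        return Integer.compare(a & 0xFF, b & 0xFF);
    }

    public static void main(String[] args) {
        for (boolean desc : new boolean[] {false, true}) {
            System.out.printf("descending=%b NULL=%02x FALSE=%02x TRUE=%02x%n",
                desc, encode(null, desc), encode(false, desc), encode(true, desc));
        }
    }
}
```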
In the Java implementation, BOOLEAN maps to Java's java.lang.Boolean type.

### SHORT. INT. LONG.

Fixed-length whole number types (SHORT, INT, LONG) are stored as signed, 2's complement numbers in Big-Endian byte order. When the sort order is specified as DESCENDING, the encoded value is logically inverted.

In the Java implementation, SHORT maps to java.lang.Short; INT to java.lang.Integer; and LONG to java.lang.Long.

### VARINT.

Variable-length whole numbers are represented by the VARINT type. Serialization strategy is TBD.

In the Java implementation, VARINT maps to the java.math.BigInteger type.

### FLOAT. DOUBLE.

Fixed-length decimal number types (FLOAT, DOUBLE) adhere closely to the IEEE 754 standard. The number's bits are first interpreted as a whole number. Then the sign bit is inverted; if the original value was negative, the exponent and significand bits are inverted as well. The resulting bytes are stored in Big-Endian order. When sort order is specified as DESCENDING, the encoded value is logically inverted.

In the Java implementation, FLOAT maps to java.lang.Float and DOUBLE maps to java.lang.Double.

### DECIMAL.

Variable-length decimal numbers are represented by the DECIMAL type. Serialization strategy is TBD.

In the Java implementation, DECIMAL values map to the java.math.BigDecimal type.

### VARCHAR.

The general-purpose String type for HBase is the VARCHAR. It stores an arbitrary number of Unicode characters. If you're not sure which type to choose when storing a String, choose this one.

The value is first encoded to UTF-8, and then each byte is incremented by 2. Finally, a termination byte is appended. The increment allows us to take advantage of a detail highlighted in RFC 2279, namely that "The octet values FE and FF never appear". By incrementing all encoded values, 0x00 and 0x01 are available for use as NULL and TERMINATION values, respectively. Because of this, VARCHAR supports encoding both empty Strings and NULL values.
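
The VARCHAR steps above (UTF-8 encode, add 2 to each byte, append the terminator) can be sketched in Java roughly as follows. This is a minimal sketch, not a reference implementation: `VarcharSketch` and its methods are hypothetical names, only ascending order and non-null Strings are handled, and the exact byte layout of a NULL value is left out because the text does not pin it down. The unsigned comparison helper is included only to show that a shorter prefix sorts before its extensions, since the 0x01 terminator is lower than any shifted content byte.

```java
import java.nio.charset.StandardCharsets;

final class VarcharSketch {
    static final byte TERMINATOR = 0x01;

    /** Encodes a non-null String: UTF-8 bytes, each shifted by +2, then the terminator. */
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[utf8.length + 1];
        for (int i = 0; i < utf8.length; i++) {
            // UTF-8 never produces 0xFE or 0xFF, so this shift cannot wrap to 0x00/0x01.
            out[i] = (byte) (utf8[i] + 2);
        }
        out[utf8.length] = TERMINATOR;
        return out;
    }

    /** Reverses encode(): checks the terminator, then shifts each byte back by 2. */
    static String decode(byte[] encoded) {
        if (encoded.length == 0 || encoded[encoded.length - 1] != TERMINATOR) {
            throw new IllegalArgumentException("missing termination byte");
        }
        byte[] utf8 = new byte[encoded.length - 1];
        for (int i = 0; i < utf8.length; i++) {
            utf8[i] = (byte) (encoded[i] - 2);
        }
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Lexicographic comparison treating bytes as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "a", "ab", "b"}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode(s)) {
                hex.append(String.format("%02x ", b));
            }
            System.out.println("\"" + s + "\" -> " + hex.toString().trim());
        }
    }
}
```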
When sort order is specified as DESCENDING, the complete encoded value is logically inverted.

In the Java implementation, VARCHAR values map to the java.lang.String type.

### CHAR.

The CHAR datatype is used to store a fixed-byte-length sequence of characters. Encoding is performed identically to VARCHAR encoding. The resulting value is then validated against the byte-width constraint. Note that, just as with VARCHAR, a length of 1 byte is required in order to represent an empty String and a length of 2 bytes is necessary to represent NULL.

In the Java implementation, CHAR values map to the java.lang.String type.

### DATETIME.

The DATETIME type is used to represent an instant in time. It is stored as a LONG value representing milliseconds from the epoch.

### BYTE.

Stores a fixed-length sequence of bytes. Serialization strategy is TBD.

### VARBYTE.

Stores a variable-length sequence of bytes. Serialization strategy is TBD.

### STRUCT

The STRUCT type is defined as the ordered composition of multiple types. The sort order of a STRUCT is dictated by the sort orders of its members, taken in the order in which they are composed. STRUCTs can contain STRUCTs.

STRUCT will compensate for types which do not provide explicit NULL support. It does so by first serializing a BOOLEAN value to indicate whether the instance is NULL. When the instance is not NULL, it immediately follows the BOOLEAN.

In the Java implementation, STRUCT maps to the java.util.List type.

## Out of scope

This improvement is intended to drive development of this functionality. Integration into the existing Java client API will be an exercise left to future endeavor, letting us focus first on on-disk formats.

Following this implementation, the existing Bytes utility class should be deprecated for use from the Client API. How best to manage that transition is not covered as part of this improvement.

This improvement is explicitly NOT interested in schema management.
An implementation for this context does not depend on any kind of type management, mapping between rowkeys/column qualifiers and types, or constraint checking. Just as HBase today is a "bring your own types" system, HBase following completion of this improvement will be a "bring your own schema management" system.