HCatalog provides a data transfer API for parallel input and output without using MapReduce. The API offers a way to read data from a Hadoop cluster or write data into a Hadoop cluster, using a basic storage abstraction of tables and rows.
The data transfer API has three essential classes:

- HCatReader – reads data from a Hadoop cluster
- HCatWriter – writes data into a Hadoop cluster
- DataTransferFactory – generates HCatReader and HCatWriter instances

Auxiliary classes in the data transfer API include:

- ReadEntity
- ReaderContext
- WriteEntity
- WriterContext
The HCatalog data transfer API is designed to facilitate integration of external systems with Hadoop.
Reading is a two-step process in which the first step occurs on the master node of an external system. The second step is done in parallel on multiple slave nodes.
Reads are done on a “ReadEntity”. Before you start to read, you need to define a ReadEntity from which to read. This can be done through ReadEntity.Builder, where you can specify a database name, table name, partition, and filter string. For example:
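The following is a minimal sketch of this step, assuming the ReadEntity.Builder API from the org.apache.hcatalog.data.transfer package used by the example program linked at the end of this section:

    // Build a ReadEntity for table "mytbl" in database "mydb".
    // No partition or filter is specified, so all rows of the table are read.
    ReadEntity.Builder builder = new ReadEntity.Builder();
    ReadEntity entity = builder.withDatabase("mydb").withTable("mytbl").build();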
The code snippet above defines a ReadEntity object ("entity"), comprising a table named “mytbl” in a database named “mydb”, which can be used to read all the rows of this table. Note that this table must exist in HCatalog prior to the start of this operation.
After defining a ReadEntity, you obtain an instance of HCatReader using the ReadEntity and cluster configuration:
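A sketch of this call; config is assumed here to be a Map<String, String> of cluster configuration properties, and DataTransferFactory is the factory class listed above:

    // On the master node: obtain an HCatReader for the ReadEntity.
    Map<String, String> config = new HashMap<String, String>();
    HCatReader reader = DataTransferFactory.getHCatReader(entity, config);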
The next step is to obtain a ReaderContext from the reader as follows:
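For example, assuming prepareRead() on HCatReader is the call that produces the context:

    // Still on the master node: prepare the read and get a serializable ReaderContext.
    ReaderContext cntxt = reader.prepareRead();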
All of the above steps occur on the master node. The master node then serializes this ReaderContext object and sends it to all the slave nodes. Slave nodes then use this reader context to read data.
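A sketch of the slave-side read; the exact DataTransferFactory overload has changed across HCatalog releases, so this assumes a form that takes the deserialized ReaderContext and a slave number, as in the linked TestReaderWriter.java example:

    // On each slave node: rebuild a reader from the ReaderContext and this slave's number,
    // then iterate over the records assigned to it.
    HCatReader slaveReader = DataTransferFactory.getHCatReader(cntxt, slaveNum);
    Iterator<HCatRecord> itr = slaveReader.read();
    while (itr.hasNext()) {
        HCatRecord record = itr.next();
        // process the record
    }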
Similar to reading, writing is also a two-step process in which the first step occurs on the master node. Subsequently, the second step occurs in parallel on slave nodes.
Writes are done on a “WriteEntity” which can be constructed in a fashion similar to reads:
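A sketch using WriteEntity.Builder, paralleling the ReadEntity example above:

    // Build a WriteEntity for table "mytbl" in database "mydb".
    WriteEntity.Builder builder = new WriteEntity.Builder();
    WriteEntity entity = builder.withDatabase("mydb").withTable("mytbl").build();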
The code above creates a WriteEntity object ("entity") which can be used to write into a table named “mytbl” in the database “mydb”.
After creating a WriteEntity, the next step is to obtain a WriterContext:
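A sketch of this step; as with reading, config is assumed to be a Map<String, String> of cluster configuration properties, and prepareWrite() produces the serializable WriterContext on the master node:

    // On the master node: obtain an HCatWriter and prepare the write.
    HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);
    WriterContext info = writer.prepareWrite();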
All of the above steps occur on the master node. The master node then serializes the WriterContext object and makes it available to all the slaves.
On slave nodes, you need to obtain an HCatWriter using the WriterContext as follows:
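For instance, with info being the WriterContext received from the master node:

    // On each slave node: rebuild a writer from the deserialized WriterContext.
    HCatWriter slaveWriter = DataTransferFactory.getHCatWriter(info);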
Then, the writer takes an iterator as the argument for the write method:
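For example, where hCatRecordItr is assumed to be an Iterator<HCatRecord> supplied by the external system:

    // Write out every record produced by the iterator.
    slaveWriter.write(hCatRecordItr);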
The writer then calls next() on this iterator in a loop and writes out all the records attached to the iterator.
A complete Java program for the reader and writer examples above can be found at: https://svn.apache.org/repos/asf/incubator/hcatalog/trunk/src/test/org/apache/hcatalog/data/TestReaderWriter.java.