Index: src/java/org/apache/hcatalog/mapreduce/HCatStorageHandler.java =================================================================== --- src/java/org/apache/hcatalog/mapreduce/HCatStorageHandler.java (revision 1311550) +++ src/java/org/apache/hcatalog/mapreduce/HCatStorageHandler.java (working copy) @@ -28,25 +28,24 @@ import org.apache.hadoop.mapred.OutputFormat; /** - * The abstract Class HCatStorageHandler would server as the base class for all + * The abstract Class HCatStorageHandler serves as the base class for all * the storage handlers required for non-native tables in HCatalog. */ public abstract class HCatStorageHandler implements HiveStorageHandler { //TODO move this to HiveStorageHandler /** - * This method is called to allow the StorageHandlers the chance - * to populate the JobContext.getConfiguration() with properties that - * maybe be needed by the handler's bundled artifacts (ie InputFormat, SerDe, etc). - * Key value pairs passed into jobProperties is guaranteed to be set in the job's - * configuration object. User's can retrieve "context" information from tableDesc. - * User's should avoid mutating tableDesc and only make changes in jobProperties. + * This method is called to create a configuration for input; it enables the + * StorageHandlers to populate the JobContext.getConfiguration() with properties that + * may be needed by the handler's bundled artifacts (such as InputFormat, SerDe, etc.). + * Key-value pairs passed into jobProperties are guaranteed to be set in the job's + * configuration object. Users can retrieve "context" information from tableDesc. + * Users should avoid mutating tableDesc and only make changes in jobProperties. * This method is expected to be idempotent such that a job called with the * same tableDesc values should return the same key-value pairs in jobProperties. * Any external state set by this method should remain the same if this method is - * called again. It is up to the user to determine how best guarantee this invariant. + * called again. It is up to the user to determine how best to guarantee this invariant. * - * This method in particular is to create a configuration for input. * @param tableDesc * @param jobProperties */ @@ -54,18 +53,17 @@ //TODO move this to HiveStorageHandler /** - * This method is called to allow the StorageHandlers the chance - * to populate the JobContext.getConfiguration() with properties that - * maybe be needed by the handler's bundled artifacts (ie InputFormat, SerDe, etc). - * Key value pairs passed into jobProperties is guaranteed to be set in the job's - * configuration object. User's can retrieve "context" information from tableDesc. - * User's should avoid mutating tableDesc and only make changes in jobProperties. + * This method is called to create a configuration for output; it enables the + * StorageHandlers to populate the JobContext.getConfiguration() with properties that + * may be needed by the handler's bundled artifacts (such as OutputFormat, SerDe, etc.). + * Key-value pairs passed into jobProperties are guaranteed to be set in the job's + * configuration object. Users can retrieve "context" information from tableDesc. + * Users should avoid mutating tableDesc and only make changes in jobProperties. * This method is expected to be idempotent such that a job called with the * same tableDesc values should return the same key-value pairs in jobProperties. * Any external state set by this method should remain the same if this method is - * called again. 
It is up to the user to determine how best guarantee this invariant. + * called again. It is up to the user to determine how best to guarantee this invariant. * - * This method in particular is to create a configuration for output. * @param tableDesc * @param jobProperties */ Index: src/java/org/apache/hcatalog/mapreduce/HCatTableInfo.java =================================================================== --- src/java/org/apache/hcatalog/mapreduce/HCatTableInfo.java (revision 1311550) +++ src/java/org/apache/hcatalog/mapreduce/HCatTableInfo.java (working copy) @@ -28,8 +28,8 @@ /** * - * HCatTableInfo - class to communicate table information to {@link HCatInputFormat} - * and {@link HCatOutputFormat} + * The HCatTableInfo class communicates table information to {@link HCatInputFormat} + * and {@link HCatOutputFormat}. * */ public class HCatTableInfo implements Serializable { @@ -97,7 +97,7 @@ } /** - * @return return schema of data columns as defined in meta store + * @return schema of data columns as defined in meta store */ public HCatSchema getDataColumns() { return dataColumns; @@ -117,6 +117,10 @@ return storerInfo; } + /** + * Gets the table location + * @return the table location + */ public String getTableLocation() { return table.getSd().getLocation(); } Index: src/java/org/apache/hcatalog/mapreduce/HCatBaseInputFormat.java =================================================================== --- src/java/org/apache/hcatalog/mapreduce/HCatBaseInputFormat.java (revision 1311550) +++ src/java/org/apache/hcatalog/mapreduce/HCatBaseInputFormat.java (working copy) @@ -61,6 +61,11 @@ private Class inputFileFormatClass; // TODO needs to go in InitializeInput? as part of InputJobInfo + /** + * Get the schema for the HCatRecord data returned by HCatInputFormat. + * @param context the jobContext + * @return the schema + */ public static HCatSchema getOutputSchema(JobContext context) throws IOException { String os = context.getConfiguration().get( @@ -93,7 +98,7 @@ /** * Logically split the set of input files for the job. Returns the - * underlying InputFormat's splits + * underlying InputFormat's splits. * @param jobContext the job context object * @return the splits, an HCatInputSplit wrapper over the storage * handler InputSplits @@ -164,9 +169,9 @@ /** * Create the RecordReader for the given InputSplit. Returns the underlying - * RecordReader if the required operations are supported and schema matches - * with HCatTable schema. Returns an HCatRecordReader if operations need to - * be implemented in HCat. + * RecordReader if the required operations are supported and the schema + * matches HCatTable schema. Returns an HCatRecordReader if operations need + * to be implemented in HCatalog. * @param split the split * @param taskContext the task attempt context * @return the record reader instance, either an HCatRecordReader(later) or @@ -249,7 +254,7 @@ } /** - * Gets the HCatTable schema for the table specified in the HCatInputFormat.setInput call + * Get the HCatTable schema for the table specified in the HCatInputFormat.setInput call * on the specified job context. This information is available only after HCatInputFormat.setInput * has been called for a JobContext. 
* @param context the context Index: src/java/org/apache/hcatalog/mapreduce/InputJobInfo.java =================================================================== --- src/java/org/apache/hcatalog/mapreduce/InputJobInfo.java (revision 1311550) +++ src/java/org/apache/hcatalog/mapreduce/InputJobInfo.java (working copy) @@ -23,7 +23,7 @@ import java.util.List; import java.util.Properties; -/** The class used to serialize and store the information read from the metadata server */ +/** The class used to serialize and store the information read from the metadata server. */ public class InputJobInfo implements Serializable{ /** The serialization version */ @@ -48,9 +48,10 @@ /** * Initializes a new InputJobInfo * for reading data from a table. - * @param databaseName the db name + * @param databaseName the database name * @param tableName the table name * @param filter the partition filter + * @return a new InputJobInfo */ public static InputJobInfo create(String databaseName, @@ -71,7 +72,7 @@ } /** - * Gets the value of databaseName + * Gets the value of databaseName. * @return the databaseName */ public String getDatabaseName() { @@ -79,7 +80,7 @@ } /** - * Gets the value of tableName + * Gets the value of tableName. * @return the tableName */ public String getTableName() { @@ -87,7 +88,7 @@ } /** - * Gets the table's meta information + * Gets the table's meta information. * @return the HCatTableInfo */ public HCatTableInfo getTableInfo() { @@ -105,7 +106,7 @@ } /** - * Gets the value of partition filter + * Gets the value of the partition filter. * @return the filter string */ public String getFilter() { @@ -127,9 +128,9 @@ } /** - * Set/Get Property information to be passed down to *StorageHandler implementation - * put implementation specific storage handler configurations here - * @return the implementation specific job properties + * Gets Property information to be passed down to *StorageHandler implementation. + * This is for implementation-specific storage handler configurations. + * @return the implementation-specific job properties */ public Properties getProperties() { return properties; Index: src/java/org/apache/hcatalog/mapreduce/HCatBaseOutputFormat.java =================================================================== --- src/java/org/apache/hcatalog/mapreduce/HCatBaseOutputFormat.java (revision 1311550) +++ src/java/org/apache/hcatalog/mapreduce/HCatBaseOutputFormat.java (working copy) @@ -43,9 +43,9 @@ /** * Gets the table schema for the table specified in the HCatOutputFormat.setOutput call * on the specified job context. - * @param context the context + * @param context the job context * @return the table schema - * @throws IOException if HCatOutputFromat.setOutput has not been called for the passed context + * @throws IOException if HCatOutputFormat.setOutput has not been called for the passed context */ public static HCatSchema getTableSchema(JobContext context) throws IOException { OutputJobInfo jobInfo = getJobInfo(context); @@ -155,11 +155,11 @@ } /** - * Configure the output storage handler, with allowing specification - * of partvals from which it picks the dynamic partvals + * Configure the output storage handler, allowing specification + * of partvals from which it picks the dynamic partvals. * @param context the job context * @param jobInfo the output job info - * @param fullPartSpec + * @param fullPartSpec the full specification for a partition * @throws IOException */ @@ -178,6 +178,13 @@ } } + /** + * Set the partition details. 
+ * @param jobInfo the job information + * @param schema the schema + * @param partMap the partition map + * @throws HCatException, IOException + */ protected static void setPartDetails(OutputJobInfo jobInfo, final HCatSchema schema, Map partMap) throws HCatException, IOException { Index: src/java/org/apache/hcatalog/mapreduce/OutputJobInfo.java =================================================================== --- src/java/org/apache/hcatalog/mapreduce/OutputJobInfo.java (revision 1311550) +++ src/java/org/apache/hcatalog/mapreduce/OutputJobInfo.java (working copy) @@ -28,7 +28,7 @@ import org.apache.hadoop.hive.metastore.MetaStoreUtils; import org.apache.hcatalog.data.schema.HCatSchema; -/** The class used to serialize and store the output related information */ +/** The class used to serialize and store the output-related information. */ public class OutputJobInfo implements Serializable { /** The db and table names. */ @@ -67,14 +67,15 @@ /** * Initializes a new OutputJobInfo instance * for writing data from a table. - * @param databaseName the db name + * @param databaseName the database name * @param tableName the table name - * @param partitionValues The partition values to publish to, can be null or empty Map to - * work with hadoop security, the kerberos principal name of the server - else null + * @param partitionValues The partition values to publish to, which can be null or + * an empty Map - to work with Hadoop security, the Kerberos principal name of the + * server - else null. * The principal name should be of the form: - * /_HOST@ like "hcat/_HOST@myrealm.com" + * <servicename>/_HOST@<realm> (such as "hcat/_HOST@myrealm.com"). * The special string _HOST will be replaced automatically with the correct host name - * indicate write to a unpartitioned table. For partitioned tables, this map should + * to indicate write to a unpartitioned table. For partitioned tables, this map should * contain keys for all partition columns with corresponding values. */ public static OutputJobInfo create(String databaseName, @@ -173,7 +174,7 @@ } /** - * Gets the value of partitionValues + * Gets the value of partitionValues. * @return the partitionValues */ public Map getPartitionValues() { @@ -205,16 +206,16 @@ } /** - * Set/Get Property information to be passed down to *StorageHandler implementation - * put implementation specific storage handler configurations here - * @return the implementation specific job properties + * Gets Property information to be passed down to *StorageHandler implementation. + * This is for implementation-specific storage handler configurations. + * @return the implementation-specific job properties */ public Properties getProperties() { return properties; } /** - * Set maximum number of allowable dynamic partitions + * Sets the maximum number of allowable dynamic partitions. * @param maxDynamicPartitions */ public void setMaximumDynamicPartitions(int maxDynamicPartitions){ @@ -222,7 +223,7 @@ } /** - * Returns maximum number of allowable dynamic partitions + * Returns the maximum number of allowable dynamic partitions. * @return maximum number of allowable dynamic partitions */ public int getMaxDynamicPartitions() { @@ -230,7 +231,7 @@ } /** - * Sets whether or not hadoop archiving has been requested for this job + * Sets whether or not Hadoop archiving is requested for this job. 
* @param harRequested */ public void setHarRequested(boolean harRequested){ @@ -238,29 +239,32 @@ } /** - * Returns whether or not hadoop archiving has been requested for this job - * @return whether or not hadoop archiving has been requested for this job + * Returns whether or not Hadoop archiving has been requested for this job. + * @return true if Hadoop archiving has been requested for this job, otherwise false */ public boolean getHarRequested() { return this.harRequested; } /** - * Returns whether or not Dynamic Partitioning is used - * @return whether or not dynamic partitioning is currently enabled and used + * Returns whether or not Dynamic Partitioning is used. + * @return true if dynamic partitioning is currently enabled and used, otherwise false */ public boolean isDynamicPartitioningUsed() { return !((dynamicPartitioningKeys == null) || (dynamicPartitioningKeys.isEmpty())); } /** - * Sets the list of dynamic partitioning keys used for outputting without specifying all the keys + * Sets the list of dynamic partitioning keys used for outputting without specifying all the keys. * @param dynamicPartitioningKeys */ public void setDynamicPartitioningKeys(List dynamicPartitioningKeys) { this.dynamicPartitioningKeys = dynamicPartitioningKeys; } + /** + * @return a list of the dynamic partition keys + */ public List getDynamicPartitioningKeys(){ return this.dynamicPartitioningKeys; } Index: src/java/org/apache/hcatalog/mapreduce/HCatInputFormat.java =================================================================== --- src/java/org/apache/hcatalog/mapreduce/HCatInputFormat.java (revision 1311550) +++ src/java/org/apache/hcatalog/mapreduce/HCatInputFormat.java (working copy) @@ -22,16 +22,16 @@ import org.apache.hadoop.mapreduce.Job; -/** The InputFormat to use to read data from HCat */ +/** The InputFormat to use to read data into HCatalog. */ public class HCatInputFormat extends HCatBaseInputFormat { /** - * Set the input to use for the Job. This queries the metadata server with - * the specified partition predicates, gets the matching partitions, puts - * the information in the conf object. The inputInfo object is updated with - * information needed in the client context + * Set the input information to use for the job. This queries the metadata server + * with the specified partition predicates, gets the matching partitions, and + * puts the information in the conf object. The inputInfo object is updated + * with information needed in the client context. * @param job the job object - * @param inputJobInfo the input info for table to read + * @param inputJobInfo the input information for the table to read * @throws IOException the exception in communicating with the metadata server */ public static void setInput(Job job, Index: src/java/org/apache/hcatalog/mapreduce/PartInfo.java =================================================================== --- src/java/org/apache/hcatalog/mapreduce/PartInfo.java (revision 1311550) +++ src/java/org/apache/hcatalog/mapreduce/PartInfo.java (working copy) @@ -27,7 +27,7 @@ import org.apache.hcatalog.data.schema.HCatSchema; import org.apache.hcatalog.mapreduce.HCatStorageHandler; -/** The Class used to serialize the partition information read from the metadata server that maps to a partition */ +/** The Class used to serialize the partition information read from the metadata server that maps to a partition. 
*/ public class PartInfo implements Serializable { /** The serialization version */ @@ -63,6 +63,8 @@ * @param storageHandler the storage handler * @param location the location * @param hcatProperties hcat-specific properties at the partition + * @param jobProperties the job properties + * @param tableInfo the table information */ public PartInfo(HCatSchema partitionSchema, HCatStorageHandler storageHandler, String location, Properties hcatProperties, @@ -116,7 +118,7 @@ } /** - * Gets the value of hcatProperties. + * Gets the input storage handler properties. * @return the hcatProperties */ public Properties getInputStorageHandlerProperties() { @@ -147,10 +149,18 @@ return partitionValues; } + /** + * Gets the job properties. + * @return a map of the job properties + */ public Map getJobProperties() { return jobProperties; } + /** + * Gets the HCatalog table information. + * @return the table information + */ public HCatTableInfo getTableInfo() { return tableInfo; } Index: src/java/org/apache/hcatalog/mapreduce/StorerInfo.java =================================================================== --- src/java/org/apache/hcatalog/mapreduce/StorerInfo.java (revision 1311550) +++ src/java/org/apache/hcatalog/mapreduce/StorerInfo.java (working copy) @@ -19,7 +19,7 @@ import java.io.Serializable; import java.util.Properties; -/** Info about the storer to use for writing the data */ +/** Information about the storer to use for writing the data. */ public class StorerInfo implements Serializable { /** The serialization version */ @@ -37,7 +37,7 @@ private String storageHandlerClass; /** - * Initialize the storer info + * Initialize the storer information. * @param ifClass * @param ofClass * @param serdeClass @@ -53,28 +53,43 @@ this.properties = properties; } + /** + * @return the ifClass + */ public String getIfClass() { return ifClass; } + /** + * @param ifClass + */ public void setIfClass(String ifClass) { this.ifClass = ifClass; } + /** + * @return the ofClass + */ public String getOfClass() { return ofClass; } + /** + * @return the serdeClass + */ public String getSerdeClass() { return serdeClass; } + /** + * @return the storageHandlerClass + */ public String getStorageHandlerClass() { return storageHandlerClass; } /** - * @return the properties + * @return the storer properties */ public Properties getProperties() { return properties; Index: src/java/org/apache/hcatalog/common/ErrorType.java =================================================================== --- src/java/org/apache/hcatalog/common/ErrorType.java (revision 1311550) +++ src/java/org/apache/hcatalog/common/ErrorType.java (working copy) @@ -18,40 +18,62 @@ package org.apache.hcatalog.common; /** - * Enum type representing the various errors throws by HCat. + * Enum type representing the various errors thrown by HCatalog. 
*/ public enum ErrorType { /* HCat Input Format related errors 1000 - 1999 */ + /** 1000, "Error initializing database session" */ ERROR_DB_INIT (1000, "Error initializing database session"), + /** 1001, "Query result exceeded maximum number of partitions allowed" */ ERROR_EXCEED_MAXPART (1001, "Query result exceeded maximum number of partitions allowed"), - + /** 1002, "Error setting input information" */ ERROR_SET_INPUT (1002, "Error setting input information"), /* HCat Output Format related errors 2000 - 2999 */ + /** 2000, "Table specified does not exist" */ ERROR_INVALID_TABLE (2000, "Table specified does not exist"), + /** 2001, "Error setting output information" */ ERROR_SET_OUTPUT (2001, "Error setting output information"), + /** 2002, "Partition already present with given partition key values" */ ERROR_DUPLICATE_PARTITION (2002, "Partition already present with given partition key values"), + /** 2003, "Non-partitioned table already contains data" */ ERROR_NON_EMPTY_TABLE (2003, "Non-partitioned table already contains data"), + /** 2004, "HCatOutputFormat not initialized, setOutput has to be called" */ ERROR_NOT_INITIALIZED (2004, "HCatOutputFormat not initialized, setOutput has to be called"), + /** 2005, "Error initializing output storage driver instance" */ ERROR_INIT_STORAGE_HANDLER (2005, "Error initializing storage handler instance"), + /** 2006, "Error adding partition to metastore" */ ERROR_PUBLISHING_PARTITION (2006, "Error adding partition to metastore"), + /** 2007, "Invalid column position in partition schema" */ ERROR_SCHEMA_COLUMN_MISMATCH (2007, "Invalid column position in partition schema"), + /** 2008, "Partition key cannot be present in the partition data" */ ERROR_SCHEMA_PARTITION_KEY (2008, "Partition key cannot be present in the partition data"), + /** 2009, "Invalid column type in partition schema" */ ERROR_SCHEMA_TYPE_MISMATCH (2009, "Invalid column type in partition schema"), + /** 2010, "Invalid partition values specified" */ ERROR_INVALID_PARTITION_VALUES (2010, "Invalid partition values specified"), + /** 2011, "Partition key value not provided for publish" */ ERROR_MISSING_PARTITION_KEY (2011, "Partition key value not provided for publish"), + /** 2012, "Moving of data failed during commit" */ ERROR_MOVE_FAILED (2012, "Moving of data failed during commit"), + /** 2013, "Attempt to create too many dynamic partitions" */ ERROR_TOO_MANY_DYNAMIC_PTNS (2013, "Attempt to create too many dynamic partitions"), + /** 2014, "Error initializing Pig loader" */ ERROR_INIT_LOADER (2014, "Error initializing Pig loader"), + /** 2015, "Error initializing Pig storer" */ ERROR_INIT_STORER (2015, "Error initializing Pig storer"), + /** 2016, "Error operation not supported" */ ERROR_NOT_SUPPORTED (2016, "Error operation not supported"), /* Authorization Errors 3000 - 3999 */ + /** 3000, "Permission denied" */ ERROR_ACCESS_CONTROL (3000, "Permission denied"), /* Miscellaneous errors, range 9000 - 9998 */ + /** 9000, "Functionality currently unimplemented" */ ERROR_UNIMPLEMENTED (9000, "Functionality currently unimplemented"), + /** 9001, "Exception occurred while processing HCat request" */ ERROR_INTERNAL_EXCEPTION (9001, "Exception occurred while processing HCat request"); /** The error code. 
*/ Index: src/java/org/apache/hcatalog/common/HCatException.java =================================================================== --- src/java/org/apache/hcatalog/common/HCatException.java (revision 1311550) +++ src/java/org/apache/hcatalog/common/HCatException.java (working copy) @@ -20,7 +20,7 @@ import java.io.IOException; /** - * Class representing exceptions thrown by HCat. + * Class representing exceptions thrown by HCatalog. */ public class HCatException extends IOException { @@ -92,7 +92,8 @@ /** - * Builds the error message string. The error type message is appended with the extra message. If appendCause + * Builds the error message string. The error type message is appended with the extra message. + * If appendCauseMessage * is true for the error type, then the message of the cause also is added to the message. * @param type the error type * @param extraMessage the extra message string Index: src/java/org/apache/hcatalog/data/transfer/HCatReader.java =================================================================== --- src/java/org/apache/hcatalog/data/transfer/HCatReader.java (revision 1311550) +++ src/java/org/apache/hcatalog/data/transfer/HCatReader.java (working copy) @@ -28,7 +28,7 @@ import org.apache.hcatalog.data.transfer.state.StateProvider; /** This abstract class is internal to HCatalog and abstracts away the notion of - * underlying system from which reads will be done. + * an underlying system from which reads will be done. */ public abstract class HCatReader{ @@ -40,14 +40,14 @@ */ public abstract ReaderContext prepareRead() throws HCatException; - /** This should be called at slave nodes to read {@link HCatRecord}s + /** This should be called at slave nodes to read {@link HCatRecord}s. * @return {@link Iterator} of {@link HCatRecord} * @throws HCatException */ public abstract Iterator read() throws HCatException; /** This constructor will be invoked by {@link DataTransferFactory} at master node. - * Don't use this constructor. Instead, use {@link DataTransferFactory} + * Don't use this constructor. Instead, use {@link DataTransferFactory}. * @param re * @param config */ @@ -57,8 +57,7 @@ } /** This constructor will be invoked by {@link DataTransferFactory} at slave nodes. - * Don't use this constructor. Instead, use {@link DataTransferFactory} - * @param re + * Don't use this constructor. Instead, use {@link DataTransferFactory}. * @param config * @param sp */ @@ -82,7 +81,10 @@ } this.conf = conf; } - + + /** + * @return the configuration + */ public Configuration getConf() { if (null == conf) { throw new IllegalStateException("HCatReader is not constructed correctly."); Index: src/java/org/apache/hcatalog/data/transfer/DataTransferFactory.java =================================================================== --- src/java/org/apache/hcatalog/data/transfer/DataTransferFactory.java (revision 1311550) +++ src/java/org/apache/hcatalog/data/transfer/DataTransferFactory.java (working copy) @@ -78,7 +78,7 @@ } /** This should be called at slave nodes to obtain an instance of {@link HCatWriter} - * @param info {@link WriterContext} obtained at master node. + * @param cntxt {@link WriterContext} obtained at master node. * @return {@link HCatWriter} */ public static HCatWriter getHCatWriter(final WriterContext cntxt) { @@ -89,7 +89,7 @@ /** This should be called at slave nodes to obtain an instance of {@link HCatWriter} * If external system has some mechanism for providing state to HCatalog, this constructor * can be used. 
- * @param info {@link WriterContext} obtained at master node. + * @param cntxt {@link WriterContext} obtained at master node. * @param sp {@link StateProvider} * @return {@link HCatWriter} */ Index: src/java/org/apache/hcatalog/data/transfer/HCatWriter.java =================================================================== --- src/java/org/apache/hcatalog/data/transfer/HCatWriter.java (revision 1311550) +++ src/java/org/apache/hcatalog/data/transfer/HCatWriter.java (working copy) @@ -45,28 +45,30 @@ */ public abstract WriterContext prepareWrite() throws HCatException; - /** This method should be used at slave needs to perform writes. - * @param {@link Iterator} records to be written into HCatalog. - * @throws {@link HCatException} + /** This method should be used at slave nodes to perform writes. + * @param recordItr {@link Iterator} records to be written into HCatalog + * @throws HCatException */ public abstract void write(final Iterator recordItr) throws HCatException; /** This method should be called at master node. Primary purpose of this is to do metadata commit. - * @throws {@link HCatException} + * @param context the writer context + * @throws HCatException */ public abstract void commit(final WriterContext context) throws HCatException; /** This method should be called at master node. Primary purpose of this is to do cleanups in case * of failures. - * @throws {@link HCatException} * + * @param context the writer context + * @throws HCatException */ public abstract void abort(final WriterContext context) throws HCatException; /** * This constructor will be used at master node * @param we WriteEntity defines where in storage records should be written to. - * @param config Any configuration which external system wants to communicate to HCatalog - * for performing writes. + * @param config any configuration which external system wants to communicate to HCatalog + * for performing writes */ protected HCatWriter(final WriteEntity we, final Map config) { this(config); @@ -74,7 +76,8 @@ } /** This constructor will be used at slave nodes. - * @param config + * @param config the configuration + * @param sp the state provider */ protected HCatWriter(final Configuration config, final StateProvider sp) { this.conf = config; Index: src/docs/overview.html =================================================================== --- src/docs/overview.html (revision 1311550) +++ src/docs/overview.html (working copy) @@ -1,29 +1,46 @@ - - - + + + Overview + + + + + + - + + +

Overview

@@ -52,54 +69,50 @@

HCatalog

-

HCatalog is a table management and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, Hive, Streaming – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, sequence files.

-

(Note: In this release, Streaming is not supported. Also, HCatalog supports only writing RCFile formatted files and only reading PigStorage formated text files.)

+

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, or sequence files.

+

HCatalog supports reading and writing files in any format for which a SerDe can be written. By default, HCatalog supports RCFile, CSV, JSON, and sequence file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.


HCatalog Architecture

-

HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and a command line interface for data definitions.

-

(Note: HCatalog notification is not available in this release.)

+

HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive's command line interface for issuing data definition and metadata exploration commands.

Interfaces

-

The HCatalog interface for Pig – HCatLoader and HCatStorer – is an implementation of the Pig load and store interfaces. HCatLoader accepts a table to read data from; you can indicate which partitions to scan by immediately following the load statement with a partition filter statement. HCatStorer accepts a table to write to and a specification of partition keys to create a new partition. Currently HCatStorer only supports writing to one partition. HCatLoader and HCatStorer are implemented on top of HCatInputFormat and HCatOutputFormat respectively

-

The HCatalog interface for MapReduce – HCatInputFormat and HCatOutputFormat – is an implementation of Hadoop InputFormat and OutputFormat. HCatInputFormat accepts a table to read data from and a selection predicate to indicate which partitions to scan. HCatOutputFormat accepts a table to write to and a specification of partition keys to create a new partition. Currently HCatOutputFormat only supports writing to one partition.

-

-Note: Currently there is no Hive-specific interface. Since HCatalog uses Hive's metastore, Hive can read data in HCatalog directly as long as a SerDe for that data already exists. In the future we plan to write a HCatalogSerDe so that users won't need storage-specific SerDes and so that Hive users can write data to HCatalog. Currently, this is supported - if a Hive user writes data in the RCFile format, it is possible to read the data through HCatalog.

-

Data is defined using HCatalog's command line interface (CLI). The HCatalog CLI supports most of the DDL portion of Hive's query language, allowing users to create, alter, drop tables, etc. The CLI also supports the data exploration part of the Hive command line, such as SHOW TABLES, DESCRIBE TABLE, etc.

+

The HCatalog interface for Pig – HCatLoader and HCatStorer – is an implementation of the Pig load and store interfaces. HCatLoader accepts a table to read data from; you can indicate which partitions to scan by immediately following the load statement with a partition filter statement. HCatStorer accepts a table to write to and optionally a specification of partition keys to create a new partition. You can write to a single partition by specifying the partition key(s) and value(s) in the STORE clause; and you can write to multiple partitions if the partition key(s) are columns in the data being stored. HCatLoader and HCatStorer are implemented on top of HCatInputFormat and HCatOutputFormat, respectively (see HCatalog Load and Store).

+

The HCatalog interface for MapReduce – HCatInputFormat and HCatOutputFormat – is an implementation of Hadoop InputFormat and OutputFormat. HCatInputFormat accepts a table to read data from and optionally a selection predicate to indicate which partitions to scan. HCatOutputFormat accepts a table to write to and optionally a specification of partition keys to create a new partition. You can write to a single partition by specifying the partition key(s) and value(s) in the setOutput call; and you can write to multiple partitions if the partition key(s) are columns in the data being stored. (See HCatalog Input and Output.)
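As a rough illustration of how a MapReduce driver wires these classes together, the sketch below configures a job that reads one partition of the rawevents table used in the data flow example and writes a single partition of processedevents. The InputJobInfo.create, OutputJobInfo.create, HCatInputFormat.setInput, and HCatOutputFormat.setOutput/getTableSchema calls follow the signatures documented in the classes patched above; the CopyRawEvents class name, the ds partition column, the use of HCatOutputFormat.setSchema, and the map-only identity pass are illustrative assumptions rather than part of this patch.

 // A minimal driver sketch, assuming the default database and a "ds" partition column.
 import java.util.HashMap;
 import java.util.Map;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.WritableComparable;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hcatalog.data.DefaultHCatRecord;
 import org.apache.hcatalog.data.schema.HCatSchema;
 import org.apache.hcatalog.mapreduce.HCatInputFormat;
 import org.apache.hcatalog.mapreduce.HCatOutputFormat;
 import org.apache.hcatalog.mapreduce.InputJobInfo;
 import org.apache.hcatalog.mapreduce.OutputJobInfo;
 
 public class CopyRawEvents {
   public static void main(String[] args) throws Exception {
     Job job = new Job(new Configuration(), "copy-rawevents");
     job.setJarByClass(CopyRawEvents.class);
 
     // Read only the 20100819 partition of rawevents by passing a partition filter.
     HCatInputFormat.setInput(job,
         InputJobInfo.create("default", "rawevents", "ds=\"20100819\""));
     job.setInputFormatClass(HCatInputFormat.class);
 
     // Write a single partition of processedevents by supplying the full
     // partition key/value map up front.
     Map<String, String> partitionValues = new HashMap<String, String>();
     partitionValues.put("ds", "20100819");
     HCatOutputFormat.setOutput(job,
         OutputJobInfo.create("default", "processedevents", partitionValues));
     job.setOutputFormatClass(HCatOutputFormat.class);
 
     // Declare that the records written out use the output table's schema.
     HCatSchema schema = HCatOutputFormat.getTableSchema(job);
     HCatOutputFormat.setSchema(job, schema);
 
     job.setOutputKeyClass(WritableComparable.class);
     job.setOutputValueClass(DefaultHCatRecord.class);
 
     // A real job would set mapper and reducer classes here; a map-only
     // identity pass is enough to illustrate the input/output wiring.
     job.setNumReduceTasks(0);
     System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
 }

On the map side, the schemas of the HCatRecord values being read and written can be obtained from HCatInputFormat.getTableSchema and HCatOutputFormat.getTableSchema as documented above.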

+

Note: There is no Hive-specific interface. Since HCatalog uses Hive's metastore, Hive can read data in HCatalog directly.

+

Data is defined using HCatalog's command line interface (CLI). The HCatalog CLI supports all Hive DDL that does not require MapReduce to execute, allowing users to create, alter, drop tables, etc. (Unsupported Hive DDL includes import/export, CREATE TABLE AS SELECT, ALTER TABLE options REBUILD and CONCATENATE, and ANALYZE TABLE ... COMPUTE STATISTICS.) The CLI also supports the data exploration part of the Hive command line, such as SHOW TABLES, DESCRIBE TABLE, etc. (see the HCatalog Command Line Interface).

Data Model

-

HCatalog presents a relational view of data in HDFS. Data is stored in tables and these tables can be placed in databases. Tables can also be hash partitioned on one or more keys; that is, for a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values). For example, if a table is partitioned on date and there are three days of data in the table, there will be three partitions in the table. New partitions can be added to a table, and partitions can be dropped from a table. Partitioned tables have no partitions at create time. Unpartitioned tables effectively have one default partition that must be created at table creation time. There is no guaranteed read consistency when a partition is dropped.

-

Partitions contain records. Once a partition is created records cannot be added to it, removed from it, or updated in it. (In the future some ability to integrate changes to a partition will be added.) Partitions are multi-dimensional and not hierarchical. Records are divided into columns. Columns have a name and a datatype. HCatalog supports the same datatypes as Hive.

+

HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed in databases. Tables can also be partitioned on one or more keys; that is, for a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values). For example, if a table is partitioned on date and there are three days of data in the table, there will be three partitions in the table. New partitions can be added to a table, and partitions can be dropped from a table. Partitioned tables have no partitions at create time. Unpartitioned tables effectively have one default partition that must be created at table creation time. There is no guaranteed read consistency when a partition is dropped.

+

Partitions contain records. Once a partition is created, records cannot be added to it, removed from it, or updated in it. Partitions are multi-dimensional and not hierarchical. Records are divided into columns. Columns have a name and a datatype. HCatalog supports the same datatypes as Hive (see HCatalog Load and Store).

Data Flow Example

-

This simple data flow example shows how HCatalog is used to move data from the grid into a database. From the database, the data can then be analyzed using Hive.

+

This simple data flow example shows how HCatalog can help grid users share and access data.

First, Joe in data acquisition uses distcp to get data onto the grid.

 hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
 
-hcat "alter table rawevents add partition 20100819 hdfs://data/rawevents/20100819/data"
+hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"
 

Second, Sally in data processing uses Pig to cleanse and prepare the data.

-

Without HCatalog, Sally must be manually informed by Joe that data is available, or use Oozie and poll on HDFS.

+

Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.

 A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
 B = filter A by bot_finder(zeta) == 0;
 …
 store Z into 'data/processedevents/20100819/data';
 
-

With HCatalog, Oozie will be notified by HCatalog data is available and can then start the Pig job

+

With HCatalog, a JMS message is sent when the data is available, and the Pig job can then be started.

 A = load 'rawevents' using HCatLoader;
 B = filter A by date = '20100819' and by bot_finder(zeta) = 0;
@@ -115,21 +128,36 @@
 select advertiser_id, count(clicks)
 from processedevents
 where date = '20100819' 
-group by adverstiser_id;
+group by advertiser_id;
 

With HCatalog, Robert does not need to modify the table structure.

 select advertiser_id, count(clicks)
 from processedevents
 where date = '20100819' 
-group by adverstiser_id;
+group by advertiser_id;
 