
Hive: add a RegularExpressionDeserializer

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We need a RegularExpressionDeserializer to read data based on a regex. This will be very useful for reading files like apache log.
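
      For illustration only (plain Python, not the proposed Hive class), this is the kind of regex-driven column extraction the feature would enable, applied to an Apache common-log line. The pattern and field layout below are assumptions for the example, not part of the patch:

      ```python
      import re

      # A pattern for the Apache common log format: host, identity, user,
      # timestamp, request line, status code, and response size.
      LOG_PATTERN = re.compile(
          r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d+) (\S+)'
      )

      def deserialize(line):
          """Split one log line into columns via the regex's capture groups."""
          m = LOG_PATTERN.match(line)
          return list(m.groups()) if m else None

      row = deserialize(
          '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326'
      )
      ```

      Each capture group becomes one column of the row, which is exactly the mapping a RegularExpressionDeserializer would perform.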

      Attachments

      1. HIVE-167.2.patch
        25 kB
        Zheng Shao
      2. HIVE-167.1.patch
        20 kB
        Zheng Shao

        Issue Links

          Activity

          Zheng Shao added a comment -

          Here is some information from src/contrib/hive/serde/README (not committed yet)

          What is SerDe
          -----------
          SerDe is a short name for Serializer and Deserializer.
          Hive uses SerDe (and FileFormat) to read from/write to tables.

          • HDFS files --(InputFileFormat)--> <key, value> --(Deserializer)--> Row object
          • Row object --(Serializer)--> <key, value> --(OutputFileFormat)--> HDFS files

          Note that the "key" part is ignored when reading, and is always a constant when
          writing. Basically the row object is only stored into the "value".

          One principle of Hive is that Hive does not own the HDFS file format: users
          should be able to read the HDFS files in Hive tables directly using other
          tools, or use other tools to write HDFS files that Hive can read, either
          through "CREATE EXTERNAL TABLE" or through "LOAD DATA INPATH", which just
          moves the file into the Hive table directory.

          Note that org.apache.hadoop.hive.serde is the deprecated old serde library.
          Please look at org.apache.hadoop.hive.serde2 for the latest version.

          Existing FileFormats and SerDe classes
          ------------------------
          Hive currently uses these FileFormats to read from/write to files:

          • TextInputFormat/NoKeyTextOutputFormat
            These two classes read/write data in plain text file format.
          • SequenceFileInputFormat/SequenceFileOutputFormat
            These two classes read/write data in the Hadoop SequenceFile format.

          Hive currently uses these SerDe classes to serialize and deserialize data:

          • MetadataTypedColumnsetSerDe
            This serde is used to read/write delimited records such as CSV,
            tab-separated, or Control-A-separated records.
          • ThriftSerDe
            This serde is used to read/write Thrift-serialized objects. The class file
            for the Thrift object must be loaded first.
          • DynamicSerDe
            This serde also reads/writes Thrift-serialized objects, but it understands
            Thrift DDL, so the schema of the object can be provided at runtime. It also
            supports many different protocols, including TBinaryProtocol, TJSONProtocol,
            and TCTLSeparatedProtocol (which writes data in delimited records).
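
          As a rough sketch of the multi-level delimited records mentioned above (plain Python, not the real TCTLSeparatedProtocol implementation), the delimiter names and split strategy here are illustrative assumptions: Control-A (\x01) separates fields, and Control-B (\x02) separates items within a collection field:

          ```python
          # Illustrative only: split one text line into a row, treating any field
          # that contains the item separator as a list-valued collection.
          def parse_record(line, field_sep="\x01", item_sep="\x02"):
              fields = line.split(field_sep)
              return [f.split(item_sep) if item_sep in f else f for f in fields]

          # name, a list of fruits, city
          record = parse_record("john\x01apple\x02banana\x01ny")
          ```

          Each extra delimiter level nests the structure one level deeper, which is why a protocol-aware serde like DynamicSerDe is needed once records go beyond a flat column set.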

          How to load data into Hive
          ------------------------
          In order to load data into Hive, we need to tell Hive the format of the data
          through the "CREATE TABLE" statement:

          • FileFormat: the data has to be in Text or SequenceFile format.
          • Format of the row:
            • If the data is in delimited format, use MetadataTypedColumnsetSerDe.
            • If the data is in delimited format and has more than one level of
              delimiters, use DynamicSerDe with TCTLSeparatedProtocol.
            • If the data is a serialized Thrift object, use ThriftSerDe.

          The steps to load the data:
          1. Create a table:

          CREATE TABLE t (foo STRING, bar STRING)
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t'
          STORED AS TEXTFILE;

          CREATE TABLE t2 (foo STRING, bar ARRAY<STRING>)
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t'
          COLLECTION ITEMS TERMINATED BY ','
          STORED AS TEXTFILE;

          CREATE TABLE t3 (foo STRING, bar MAP<STRING,STRING>)
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t'
          COLLECTION ITEMS TERMINATED BY ','
          MAP KEYS TERMINATED BY ':'
          STORED AS TEXTFILE;
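
          The delimiters declared for t3 above carve one text line into a row: fields split on '\t', collection items on ',', and map keys from values on ':'. A sketch in plain Python (not Hive code) of that parsing:

          ```python
          # Parse one line of t3's text format into {"foo": str, "bar": dict}.
          def parse_t3_line(line):
              foo, bar_raw = line.split("\t")
              bar = dict(item.split(":", 1) for item in bar_raw.split(","))
              return {"foo": foo, "bar": bar}

          row = parse_t3_line("key1\ta:1,b:2")
          ```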

          CREATE TABLE t4 (foo STRING, bar MAP<STRING,STRING>)
          ROW FORMAT SERIALIZER 'org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe'
          WITH SERDEPROPERTIES ('columns'='foo,bar','SERIALIZATION.FORMAT'='9');

          (RegexDeserializer is not done yet)
          CREATE TABLE t5 (foo STRING, bar STRING)
          ROW FORMAT SERIALIZER 'org.apache.hadoop.hive.serde2.RegexDeserializer'
          WITH SERDEPROPERTIES ('regex'='([a-z]) *([a-z])');

          2. Load the data:
          LOAD DATA LOCAL INPATH '../examples/files/kv1.txt' OVERWRITE INTO TABLE t;
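
          The regex declared for t5 above can be tried out in plain Python to see which two columns it extracts (the sample input line is an assumption for the example):

          ```python
          import re

          # The exact pattern from the t5 example: two lowercase letters
          # separated by zero or more spaces, each letter a capture group.
          pattern = re.compile(r'([a-z]) *([a-z])')

          m = pattern.match("a   b")
          foo, bar = m.groups()
          ```

          Each capture group maps to one declared column, so the number of groups must match the number of columns in the table.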

          How to read data from Hive tables
          ------------------------
          In order to read data from Hive tables, we need to know the same two things:

          • File Format
          • Row Format

          Then we just need to directly open the HDFS file and read the data.

          How to write your own SerDe
          ------------------------

          In most cases, users want to write a Deserializer instead of a SerDe.
          For example, the RegexDeserializer will deserialize the data using the
          configuration parameter 'regex', and possibly a list of column names (see
          serde2.MetadataTypedColumnsetSerDe).

          Please see serde2/Deserializer.java for details.
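
          A Python analogy of that contract (not the actual Java serde2.Deserializer API; class and method names here are illustrative): initialize() reads the 'regex' configuration parameter, and deserialize() turns one raw line into a row object:

          ```python
          import re

          class RegexDeserializerSketch:
              def initialize(self, properties):
                  # Compile the pattern from the table's serde properties.
                  self.pattern = re.compile(properties["regex"])

              def deserialize(self, line):
                  # One capture group per column; None if the line doesn't match.
                  m = self.pattern.match(line)
                  return list(m.groups()) if m else None

          d = RegexDeserializerSketch()
          d.initialize({"regex": r"([a-z]+)=([0-9]+)"})
          row = d.deserialize("count=42")
          ```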

          Hide
          Jeff Hammerbacher added a comment -

          Adding to "Serializers/Deserializers" component.

          Hide
          Zheng Shao added a comment -

          First cut: the RegexSerDe class, together with a test case and an example, is added to the contrib directory.

          Hide
          Namit Jain added a comment -

          Is serde.initialize() called at CREATE TABLE time? Otherwise a CREATE TABLE
          with non-string columns will go through.
          Can you add a negative test case to confirm that?

          Hide
          Zheng Shao added a comment -

          HIVE-167.2.patch: Talked with Namit offline. Added a negative test case.

          Hide
          Namit Jain added a comment -

          Committed. Thanks Zheng


            People

            • Assignee:
              Zheng Shao
              Reporter:
              Zheng Shao
            • Votes:
              0
              Watchers:
              1
