HCATALOG-49
Support Avro Data File Format in HCatalog

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Add input and output drivers for Avro.

        Activity

        Tom White added a comment -

        Here is an initial attempt to support Avro in HCatalog.

        Some notes:

        • For output, an Avro schema is computed from the HCatalog schema by the Avro output storage driver. The current patch does not allow you to specify a custom Avro schema - this would be a natural extension.
        • Avro map keys must be strings, whereas they can be any type in HCatalog. The current implementation assumes that HCatalog maps have string keys, and fails if this is not true. It might be possible to relax this restriction in the future by doing type conversion.
        • In HCatalog, values can be null, whereas this is not true for simple schemas in Avro. It would be possible to generate null unions in Avro, but this isn't done here. This could be a future enhancement (a minimal sketch of such a union schema follows this list).
        • For the Avro input storage driver, the Avro schema in the Avro Data File is checked for compatibility with the HCatalog schema, and an exception is thrown if there's a mismatch (a simplified sketch of this kind of check is also shown below).
        • Byte arrays cannot be represented in HCatalog, so there is no way to read byte arrays from Avro files. (Pig has the same limitation.)
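
        The null-union idea above can be illustrated with a minimal, hypothetical sketch using Avro's SchemaBuilder; it is not part of the attached patch, and the record and field names are made up:

        import org.apache.avro.Schema;
        import org.apache.avro.SchemaBuilder;

        // Hypothetical example: build an Avro record whose fields are [null, type]
        // unions, so that null HCatalog values remain representable in Avro.
        public class NullableAvroSchemaExample {
            public static void main(String[] args) {
                Schema record = SchemaBuilder.record("example_row")
                        .namespace("org.example")   // made-up namespace
                        .fields()
                        .optionalLong("id")         // union of null and long
                        .optionalString("name")     // union of null and string
                        .endRecord();
                System.out.println(record.toString(true));
            }
        }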
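
        Similarly, a simplified sketch of the kind of writer-schema compatibility check described for the input driver (assumed behavior, not the patch's actual logic; the expected column names are whatever the HCatalog schema supplies):

        import java.io.File;
        import java.io.IOException;

        import org.apache.avro.Schema;
        import org.apache.avro.file.DataFileReader;
        import org.apache.avro.generic.GenericDatumReader;
        import org.apache.avro.generic.GenericRecord;

        // Illustrative check: read the writer schema from the Avro data file header
        // and fail fast if an expected column is missing from it.
        public class AvroSchemaCheck {
            public static void checkColumns(File avroFile, String... expectedColumns)
                    throws IOException {
                try (DataFileReader<GenericRecord> reader =
                         new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
                    Schema fileSchema = reader.getSchema();   // schema the file was written with
                    for (String col : expectedColumns) {
                        if (fileSchema.getField(col) == null) {
                            throw new IOException(
                                "Avro file schema has no field '" + col + "': " + fileSchema);
                        }
                    }
                }
            }
        }
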
        Thejas M Nair added a comment -

        Comments on HCATALOG-49.patch

        • AvroInputStorageDriver.getTypedObj and AvroOutputStorageDriver.getTypedObj should be called recursively for map and list values. Complex types are supported in the schema validation done in TypeConverter.
        • AvroInputStorageDriver.convertToHCatRecord converts the field names to lower case before looking them up in the Avro schema, but TypeConverter.check doesn't. Does Avro's Schema.getField do a case-sensitive comparison of field names? (A quick standalone check is sketched just after this list.)
        • AvroOutputStorageDriver.convertValue - it would be a bit more efficient to loop on the position (for (int i = 0; i < outputSchema. ...), so that the column name does not have to be looked up in outputSchema (i.e. HCatRecord.get can be used instead of HCatRecord.get(name, schema)). (A sketch of this loop follows at the end of this comment.)
        • TestAvroInputStorageDriver - I think it would be useful to have some test cases for the case where only some of the fields are requested, and the case where some of the fields are partition keys.
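
        The case-sensitivity question above can be answered empirically with a standalone check (illustrative only, not from the patch; the field name is made up):

        import org.apache.avro.Schema;
        import org.apache.avro.SchemaBuilder;

        // Look up a mixed-case Avro field with a lower-cased key and compare the
        // results to see whether Schema.getField matches names case-sensitively.
        public class GetFieldCaseCheck {
            public static void main(String[] args) {
                Schema s = SchemaBuilder.record("r").fields()
                        .requiredString("UserName")   // hypothetical mixed-case field
                        .endRecord();
                System.out.println(s.getField("UserName")); // exact-case lookup
                System.out.println(s.getField("username")); // lower-cased lookup
            }
        }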

        (FYI, I am not a committer on HCatalog.)
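
        A rough sketch of the positional loop suggested above, assuming HCatalog APIs of the era (HCatSchema.size(), HCatSchema.get(int).getName(), HCatRecord.get(int)); the class and method names here are illustrative:

        import org.apache.hcatalog.data.HCatRecord;
        import org.apache.hcatalog.data.schema.HCatSchema;

        // Illustrative only: iterate by position so the column name never has to be
        // looked up in outputSchema for each value.
        public class PositionalLoopSketch {
            static void convertByPosition(HCatRecord record, HCatSchema outputSchema)
                    throws Exception {
                for (int i = 0; i < outputSchema.size(); i++) {
                    String name = outputSchema.get(i).getName(); // name derived from the position
                    Object value = record.get(i);                // positional access, no name lookup
                    // ... convert 'value' to its Avro representation here ...
                }
            }
        }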

        Jakob Homan added a comment -

        Hey Tom - I had planned on looking at what it would take to convert the haivvreo code to work with hcatalog. I've not had a chance to go through your code. Do you know if one is more feature-full than the other?

        Jakob Homan added a comment -

        Now that HCat is using SerDes, this work isn't necessary. HCat can just use the AvroSerDe from HIVE-895. Resolving as won't fix.


          People

          • Assignee: Tom White
          • Reporter: Tom White
          • Votes: 1
          • Watchers: 7
