HCATALOG-237 (project: HCatalog)

Switch from using StorageDrivers to SerDes to do data (de)serialization

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.4
    • Component/s: None
    • Labels: None

      Description

      HCatalog started by creating its own classes, InputStorageDriver and OutputStorageDriver, to do data conversion between the storage layer Input/OutputFormats and the HCatInput/OutputFormats. These provide very similar functionality to Hive's SerDe class, though with a much simpler interface.

      Using separate classes has caused several problems for HCatalog. First, it cannot make use of existing Hive SerDes. Second, it has forced us to write HCat-specific extensions of Hive interfaces (such as the StorageHandler) to provide the StorageDescriptors. Third, users who already have Hive installed cannot use HCatalog without first updating every partition in their metastore with storage driver information.

      I propose we switch to using SerDes for this. To offset the more complicated SerDe interface, we can provide adaptor classes that make writing new SerDes easy in simple cases.
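To make the adaptor idea concrete, here is a minimal, self-contained sketch of what such a class might look like. It is an illustration only: `RecordSerDe`, `SimpleSerDeAdaptor`, and `TabDelimitedSerDe` are hypothetical names invented for this example, the `columns` property key is an assumption, and the interface is a simplified stand-in rather than Hive's actual SerDe interface. The point it shows is that the adaptor absorbs the initialization and validation boilerplate, so a simple format only has to implement `decode` and `encode`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Small demo entry point; all types below are hypothetical, not Hive or HCatalog classes.
class SerDeAdaptorDemo {
    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("columns", "id,name"); // assumed property key for this sketch
        TabDelimitedSerDe serde = new TabDelimitedSerDe();
        serde.initialize(props);
        List<Object> row = serde.deserialize("7\talice");
        System.out.println(serde.columnNames() + " -> " + row);
        System.out.println(serde.serialize(row));
    }
}

// Simplified stand-in for a "full" SerDe contract (NOT Hive's real interface).
interface RecordSerDe<S> {
    void initialize(Map<String, String> tableProperties);
    List<String> columnNames();
    List<Object> deserialize(S stored);
    S serialize(List<Object> row);
}

// Hypothetical adaptor: handles setup and row-width checks so that
// subclasses for simple formats implement only decode() and encode().
abstract class SimpleSerDeAdaptor<S> implements RecordSerDe<S> {
    private List<String> columns = new ArrayList<>();

    @Override
    public final void initialize(Map<String, String> tableProperties) {
        String cols = tableProperties.getOrDefault("columns", "");
        columns = cols.isEmpty()
                ? new ArrayList<>()
                : new ArrayList<>(Arrays.asList(cols.split(",")));
    }

    @Override
    public final List<String> columnNames() {
        return columns;
    }

    @Override
    public final List<Object> deserialize(S stored) {
        List<Object> row = decode(stored);
        checkWidth(row);
        return row;
    }

    @Override
    public final S serialize(List<Object> row) {
        checkWidth(row);
        return encode(row);
    }

    // Reject rows whose field count does not match the declared columns.
    private void checkWidth(List<Object> row) {
        if (!columns.isEmpty() && row.size() != columns.size()) {
            throw new IllegalArgumentException(
                    "expected " + columns.size() + " fields, got " + row.size());
        }
    }

    protected abstract List<Object> decode(S stored);

    protected abstract S encode(List<Object> row);
}

// A "simple case": tab-delimited text, written in two short methods.
final class TabDelimitedSerDe extends SimpleSerDeAdaptor<String> {
    @Override
    protected List<Object> decode(String line) {
        // limit -1 keeps trailing empty fields
        return new ArrayList<>(Arrays.asList(line.split("\t", -1)));
    }

    @Override
    protected String encode(List<Object> row) {
        return row.stream().map(String::valueOf).collect(Collectors.joining("\t"));
    }
}
```

A real adaptor would have to bridge to Hive's own SerDe types rather than plain lists; this sketch only illustrates the division of labour between the adaptor and a format-specific subclass.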

        Issue Links

        1. Changes to HCatInputFormat to make it use SerDes instead of StorageDrivers (Sub-task, Closed, Vikram Dixit K)
        2. Changes to HCatOutputFormat to make it use SerDes instead of StorageDriver (Sub-task, Closed, Francis Liu)
        3. Changes to HCatRecord to support switch from StorageDriver to SerDe (Sub-task, Closed, Sushanth Sowmyan)
        4. CLI changes to remove checks and support for StorageDrivers (Sub-task, Closed, Sushanth Sowmyan)
        5. HCat e2e tests need to change to not use StorageDrivers (Sub-task, Closed, Alan Gates)
        6. Rework JSON StorageDriver into a JSON SerDe (Sub-task, Closed, Sushanth Sowmyan)
        7. Rework HBase storage driver into HBase storage handler (Sub-task, Closed, Rohini Palaniswamy)
        8. LazyHCatTuple introduction to prevent paying full cost of deserialization of LazyHCatRecord (Sub-task, Open, Sushanth Sowmyan)
        9. Make readFields() and write() in LazyHCatRecord work (Sub-task, Closed, Alan Gates)
        10. remove deprecated HCatStorageHandler (Sub-task, Closed, Francis Liu)
        11. Remove remnants of storage drivers. (Sub-task, Closed, Rohini Palaniswamy)
        12. HCatInputFormat shouldn't expect storageHandler to be serializable (Sub-task, Closed, Sushanth Sowmyan)
        13. only serialize OutputJobInfo into tableDesc.getJobProperties() when calling configureOutputJobProperties() (Sub-task, Reopened, Unassigned)
        14. Remove remaining code mentioning isd/osd (Sub-task, Closed, Daniel Dai)
        15. move setInputPath to FosterStorageHandler.configureInputProperties() (Sub-task, Open, Unassigned)
        16. InputJobInfo still uses serverUri and serverKerberosPrincipal (Sub-task, Closed, Sushanth Sowmyan)
        17. Rename storage-drivers directory to storage-handlers (fix packaging, etc) (Sub-task, Closed, Alan Gates)
        18. TableDesc and jobProperties related changes to configureInputJobProperties and configureOutputJobProperties (Sub-task, Resolved, Sushanth Sowmyan)
         

          Activity

          alangates Alan Gates added a comment -

          Here are some notes I took on what would be required to do this.

          toffer Francis Liu added a comment -

          Attached is a first stab at how StorageHandler (and friends) will replace the StorageDriver APIs.

          toffer Francis Liu added a comment -

          Will this be done off trunk? We will be starting a sprint to add more features to HBaseHCatStorageHandler, so if that is the case we might have to create a feature branch and merge things afterwards.

          ashutoshc Ashutosh Chauhan added a comment -

          I think it's better to do it on trunk, because no branch ever dies; they live forever. Further, since this is, as far as I know, the only major feature currently being worked on for HCatalog, the chances of conflicts are pretty minimal.

          toffer Francis Liu added a comment -

          One of the things we need patched if we want HBaseHCatStorageHandler to work in Hive.

          russell.jurney Russell Jurney added a comment -

          What does this mean, if anything, for Pig and HCatalog?

          alangates Alan Gates added a comment -

          Almost nothing. Pig will work with HCat as before, via HCatLoader and HCatStorer. The one piece we haven't started working on is a connector to use existing Pig load/store functions for a given format with HCatalog, as existed with storage drivers. We need these, but we haven't gotten started on them yet.

          alangates Alan Gates added a comment -

          Marking this as resolved. A few related issues aren't closed, but the basic functionality is there and working.

          alangates Alan Gates added a comment -

          Issue closed with 0.4 release.


            People

            • Assignee: Unassigned
            • Reporter: Alan Gates
            • Votes: 3
            • Watchers: 10
