Chukwa / CHUKWA-564

HBase output collector uses incorrect column family

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The HBase OutputCollector does this to obtain the column family from the data type:

      cf = key.getReduceType().getBytes();
      

      The column family should instead be taken from the @Table.columnFamily annotation on the processor.
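A minimal sketch of the annotation-based lookup the description asks for. The `Table` annotation below is a self-contained stand-in, not Chukwa's actual class: only its `columnFamily` element is named in this issue, and the `name` element, the `CpuProcessor` and `PlainProcessor` classes, and the fallback argument are assumptions made for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.nio.charset.StandardCharsets;

final class AnnotationColumnFamilyDemo {

  // Hypothetical stand-in for the processor annotation; it must be retained
  // at runtime so that reflection can see it.
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  @interface Table {
    String name();          // assumed element, not mentioned in the issue
    String columnFamily();  // the element the issue refers to
  }

  // Hypothetical processors used only for illustration.
  @Table(name = "SystemMetrics", columnFamily = "cpu")
  static class CpuProcessor {}

  static class PlainProcessor {}

  /**
   * Returns the column family declared on the processor class, or the
   * given fallback (for example the reduce type) when no annotation exists.
   */
  static byte[] columnFamilyFor(Class<?> processorClass, String fallback) {
    Table table = processorClass.getAnnotation(Table.class);
    String cf = (table != null) ? table.columnFamily() : fallback;
    return cf.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] annotated = columnFamilyFor(CpuProcessor.class, "reduceType");
    byte[] plain = columnFamilyFor(PlainProcessor.class, "reduceType");
    System.out.println(new String(annotated, StandardCharsets.UTF_8)); // cpu
    System.out.println(new String(plain, StandardCharsets.UTF_8));     // reduceType
  }
}
```

Whether a fallback to the reduce type is desirable is a design question the issue leaves open; the sketch includes one only so that unannotated processors do not fail.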

        Activity

        Eric Yang added a comment -

        I am biased toward separating config from code in this case because I am doing both coding and config management in my environment. The proposed plugin composition is a good idea. I just wish my brain weren't stuck on redesigning demux without map/reduce.

        Ari Rabkin added a comment -

        I like the strategy Bill outlines. The one appealing feature of annotations is that they keep information about table layout near the code, which reduces the cognitive burden on the developer.

        Bill Graham added a comment -

        1. Indirection mapping configuration is difficult to maintain on a distributed system.

        I don't see how maintaining configs is more difficult than maintaining the same information in annotations. Both are doing the same thing and both need to be deployed to all nodes. The annotation approach is only different in that it requires new classes and compilation to change table and column family names from the defaults.

        2. It adds extra overhead to the processor for data routing lookups.

        Again, I don't see how looking up the table/cf values for a processor from a Map for example (in the config case), is any more intensive than using reflection on a processor to see if it has certain annotations.

        The issue of whether the ideal solution for a given use case is annotations or configs aside, it seems like both needs could be met if we made the lookup approach pluggable via composition. The default implementation could fetch HBase schema info via annotations on processors as is done currently, but implementors would be free to implement the same interface differently using another approach. I can take a stab at the interface design if there are no objections.
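One way the pluggable lookup could look, as a rough sketch rather than a proposed final design. Every name here (`SchemaLookup`, `AnnotationLookup`, `MapLookup`, `Destination`, and the stand-in `Table` annotation) is hypothetical; the point is only that an annotation-backed default and a config-backed alternative can sit behind the same interface.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.HashMap;
import java.util.Map;

final class SchemaLookupSketch {

  // Hypothetical stand-in for the processor annotation discussed in this issue.
  @Retention(RetentionPolicy.RUNTIME)
  @interface Table {
    String name();
    String columnFamily();
  }

  /** Table and column family that a data type should be written to. */
  static final class Destination {
    final String table;
    final String columnFamily;

    Destination(String table, String columnFamily) {
      this.table = table;
      this.columnFamily = columnFamily;
    }

    @Override public String toString() { return table + ":" + columnFamily; }
  }

  /** Hypothetical plug-in point; returns null when nothing is known. */
  interface SchemaLookup {
    Destination lookup(String dataType, Class<?> processorClass);
  }

  /** Default behaviour: reflect on the processor's annotation. */
  static final class AnnotationLookup implements SchemaLookup {
    @Override public Destination lookup(String dataType, Class<?> processorClass) {
      Table t = processorClass.getAnnotation(Table.class);
      return (t == null) ? null : new Destination(t.name(), t.columnFamily());
    }
  }

  /** Alternative: a plain map keyed by data type, e.g. loaded from a config file. */
  static final class MapLookup implements SchemaLookup {
    private final Map<String, Destination> byDataType;

    MapLookup(Map<String, Destination> byDataType) { this.byDataType = byDataType; }

    @Override public Destination lookup(String dataType, Class<?> processorClass) {
      return byDataType.get(dataType);
    }
  }

  // Illustrative processor carrying the annotation.
  @Table(name = "bat", columnFamily = "biz")
  static class BarProcessor {}

  public static void main(String[] args) {
    SchemaLookup byAnnotation = new AnnotationLookup();
    System.out.println(byAnnotation.lookup("foo", BarProcessor.class)); // bat:biz

    Map<String, Destination> config = new HashMap<>();
    config.put("foo", new Destination("other", "cf1"));
    SchemaLookup byConfig = new MapLookup(config);
    System.out.println(byConfig.lookup("foo", BarProcessor.class)); // other:cf1
  }
}
```

Both implementations resolve a destination with a single call, so the writer would not need to know which strategy is in use.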

        Eric Yang added a comment -

        Chukwa use case:

        The SystemMetrics adaptor emits the SystemMetrics data type. The SystemMetrics data processor writes data to the SystemMetrics table, with column families cpu, disk, and memory.

        This is currently possible by using the reducer type as the grouping for cpu, disk, and memory.

        I disagree with decoupling data routing from the processor, for two reasons:

        1. Indirection mapping configuration is difficult to maintain on a distributed system.
        2. It adds extra overhead to the processor for data routing lookups.

        The use case is not writing the same data to different column families, but splitting subtype data into different column families.

        Bill Graham added a comment -

        I agree that there are limitations in using annotations on the processors. I think that where the data is written should be decoupled from the processors. A processor knows how to process data, but it shouldn't also state where the data should be written. Generic processors like TsProcessors could be used repeatedly for different data types, all of which should be written to different table/column-families. Coupling the two with annotations makes this difficult. You end up with empty subclasses used only to configure different data types to table/cfs via overridden annotations.

        I suggest we externalize the table/cf mappings from the processors. Instead we could have something like an HBaseRouterFactory (or something better named) that the OutputCollector and the HBaseWriter interact with. HBaseRouterFactory has a method that takes in a dataType and probably also a ChukwaRecord and knows how to return the Table and ColumnFamily that the data should be written to.

        We could then configure that dataType 'foo' should use BarProcessor and write to table 'bat', column family 'biz'.
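A rough sketch of what such a router might look like when driven by configuration. Everything here is assumed for illustration: the class names, the `route.<dataType>.table` and `route.<dataType>.cf` property keys, and the `Record` stand-in for ChukwaRecord. Falling back to the record's reduce type when no column family is configured is one possible use of the record argument, not something settled in this thread.

```java
import java.util.Properties;

final class HBaseRouterSketch {

  /** Where a record should be written. */
  static final class Route {
    final String table;
    final String columnFamily;

    Route(String table, String columnFamily) {
      this.table = table;
      this.columnFamily = columnFamily;
    }
  }

  /** Minimal stand-in for ChukwaRecord; only the reduce type is modelled. */
  static final class Record {
    final String reduceType;

    Record(String reduceType) { this.reduceType = reduceType; }
  }

  private final Properties config;

  HBaseRouterSketch(Properties config) { this.config = config; }

  /**
   * Resolves the destination from the hypothetical keys
   * "route.<dataType>.table" and "route.<dataType>.cf". An unconfigured table
   * defaults to the data type name, and an unconfigured column family
   * defaults to the record's reduce type.
   */
  Route route(String dataType, Record record) {
    String table = config.getProperty("route." + dataType + ".table", dataType);
    String cf = config.getProperty("route." + dataType + ".cf", record.reduceType);
    return new Route(table, cf);
  }

  public static void main(String[] args) {
    Properties config = new Properties();
    config.setProperty("route.foo.table", "bat");
    config.setProperty("route.foo.cf", "biz");
    HBaseRouterSketch router = new HBaseRouterSketch(config);

    // Explicitly configured data type.
    Route foo = router.route("foo", new Record("ignored"));
    System.out.println(foo.table + ":" + foo.columnFamily); // bat:biz

    // Unconfigured data type: the reduce type selects the column family.
    Route cpu = router.route("SystemMetrics", new Record("cpu"));
    System.out.println(cpu.table + ":" + cpu.columnFamily); // SystemMetrics:cpu
  }
}
```

With the fallback in place, a single data type can still fan out to several column families by reduce type, while data types that need a fixed destination are pinned in configuration.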

        I don't know how we'd configure 'foo's payload to be written to multiple cfs though. What's the use case for why we'd want to write the same data to two locations?

        There's still a separate unresolved problem of how to handle ORM-ish functionality as well, since reducing the many parameters in the record body back to a single 'body' field can be sub-optimal.

        Eric Yang added a comment -

        The current implementation of the demux annotation has trouble supporting multiple column families. A demux parser could be writing out data to different column families. I am having difficulty designing an elegant annotation that supports a multi-column-family demux parser while also working in the map/reduce framework. I am open to suggestions.


          People

          • Assignee:
            Unassigned
            Reporter:
            Bill Graham
          • Votes:
            0
            Watchers:
            0
