Chukwa
  1. Chukwa
  2. CHUKWA-345

Chunk.application and Chunk.streamName are redundant

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3.0
    • Component/s: Data Collection
    • Labels:
      None

      Description

      The chunk interface has both an "application" field and a "stream name" field. But they map to the same value. Should cut one of those names, for clarity.

      I think "application" is the less descriptive name, and should be cut.

        Activity

        Hide
        Hudson added a comment -
        Show
        Hudson added a comment - Integrated in Chukwa-trunk #213 (See http://hudson.zones.apache.org/hudson/job/Chukwa-trunk/213/ )
        Hide
        Ari Rabkin added a comment -

        I just committed this, after consulting privately with Jerome.

        @Eric, Jerome: I'm open to changing the serialization format or metadata structure if the need arises, but it isn't clear to me that we have any such need at present. I think the current tagging mechanism satisfies most needs for general-purpose custom metadata.

        Show
        Ari Rabkin added a comment - I just committed this, after consulting privately with Jerome. @Eric, Jerome: I'm open to changing the serialization format or metadata structure if the need arises, but it isn't clear to me that we have any such need at present. I think the current tagging mechanism satisfies most needs for general-purpose custom metadata.
        Hide
        Eric Yang added a comment -

        +1 to commit this patch.

        Serialization format and protocol format are somewhat coupled. It's not possible to use certain protocols if the serialization format uses the same keywords. For example, GET / HTTP/1.0\r\n\r\n maybe required to be escaped in serialization format, if Content-Length: is not honored, and HTTP protocol is chosen to wrap around the serialization protocol. The same escaping rules apply to any transport protocol and serialization format. Hence it's best to define the serialization format and protocol upfront. The implementation would be easier and less error prone. However, the protocol and serialization changes should not be in scope of this jira.

        Show
        Eric Yang added a comment - +1 to commit this patch. Serialization format and protocol format are somewhat coupled. It's not possible to use certain protocols if the serialization format uses the same keywords. For example, GET / HTTP/1.0\r\n\r\n maybe required to be escaped in serialization format, if Content-Length: is not honored, and HTTP protocol is chosen to wrap around the serialization protocol. The same escaping rules apply to any transport protocol and serialization format. Hence it's best to define the serialization format and protocol upfront. The implementation would be easier and less error prone. However, the protocol and serialization changes should not be in scope of this jira.
        Hide
        Ari Rabkin added a comment -

        I don't think this patch has anything to do with the transport protocol. It doesn't change the serialization format. My only goal is to remove a redundancy from the Chunk API. I don't intend to break pig – I'm happy to revise the patch to make sure that pig stays OK.

        Show
        Ari Rabkin added a comment - I don't think this patch has anything to do with the transport protocol. It doesn't change the serialization format. My only goal is to remove a redundancy from the Chunk API. I don't intend to break pig – I'm happy to revise the patch to make sure that pig stays OK.
        Hide
        Jerome Boulon added a comment -

        Eric, I'm not sure why you want to define the serialization format and the transport protocol at the same time? How you serialized the data and how you send them it's 2 things.

        If your idea is to lock everyone to one single transport protocol HTTP then I'll put a -1 on it since there's no reason to lock the transport protocol and HTTP is not the one I'm using for example.
        Regarding the serialization format, this should applied only to the datasink file format, the final output. For the same reason, if our serialization format is not supported in the target language/env then I want to be free to use what ever make more sense for me and of course I will have to convert it before writing the chunk to the dataSink file but hopefully the serialization format will work for everyone.

        Ari, if we're going to do the Map, why would you want to commit that patch? The pig contrib project need to be updated if you want to commit that patch

        Show
        Jerome Boulon added a comment - Eric, I'm not sure why you want to define the serialization format and the transport protocol at the same time? How you serialized the data and how you send them it's 2 things. If your idea is to lock everyone to one single transport protocol HTTP then I'll put a -1 on it since there's no reason to lock the transport protocol and HTTP is not the one I'm using for example. Regarding the serialization format, this should applied only to the datasink file format, the final output. For the same reason, if our serialization format is not supported in the target language/env then I want to be free to use what ever make more sense for me and of course I will have to convert it before writing the chunk to the dataSink file but hopefully the serialization format will work for everyone. Ari, if we're going to do the Map, why would you want to commit that patch? The pig contrib project need to be updated if you want to commit that patch
        Hide
        Ari Rabkin added a comment -

        Eric, is that a +1 to commit this patch?

        I'm open to a more general metadata framework; the intent of this patch was to clean up the status quo, and make sure that the interface correctly describes what's going on underneath. I don't think it breaks backward-compatibility, unless there's a substantial volume of code for processing Chukwa data that we don't know about.

        Show
        Ari Rabkin added a comment - Eric, is that a +1 to commit this patch? I'm open to a more general metadata framework; the intent of this patch was to clean up the status quo, and make sure that the interface correctly describes what's going on underneath. I don't think it breaks backward-compatibility, unless there's a substantial volume of code for processing Chukwa data that we don't know about.
        Hide
        Eric Yang added a comment -

        +1 on removing application and keep stream name.

        +1 on Map<String, String> approach, but we should define the serialization method on HTTP protocol upfront. Revising Chunk protocol should be in another JIRA with more discussions.

        Show
        Eric Yang added a comment - +1 on removing application and keep stream name. +1 on Map<String, String> approach, but we should define the serialization method on HTTP protocol upfront. Revising Chunk protocol should be in another JIRA with more discussions.
        Hide
        Jerome Boulon added a comment -

        Instead of adding/removing some fields, it will be easier to support a MAP at the tag level that way everyone can put what ever make sense for them.

        Chunk

        • DataType
        • Source
        • SeqId (uuid)
        • Map<String,String>
        • List<String> messages

        A Map could easily be serialized using whatever format make sence (SerDe, Json,Avro,....) and could be used for routing, priority....

        Also if we want to do this, since it's going to brake the compatibility why not moving to a well define serialization format (Avro,Thrift or Json),

        Show
        Jerome Boulon added a comment - Instead of adding/removing some fields, it will be easier to support a MAP at the tag level that way everyone can put what ever make sense for them. Chunk DataType Source SeqId (uuid) Map<String,String> List<String> messages A Map could easily be serialized using whatever format make sence (SerDe, Json,Avro,....) and could be used for routing, priority.... Also if we want to do this, since it's going to brake the compatibility why not moving to a well define serialization format (Avro,Thrift or Json),
        Hide
        Ari Rabkin added a comment -

        Comments on this?

        Show
        Ari Rabkin added a comment - Comments on this?
        Hide
        Ari Rabkin added a comment -

        StreamName seemed more descriptive to me. This patch removes Chunk.application

        Show
        Ari Rabkin added a comment - StreamName seemed more descriptive to me. This patch removes Chunk.application
        Hide
        Ari Rabkin added a comment -

        nod However, we never implemented it that way. There really isn't any such field.

        Is there a case for actually implementing it? (Which would also require touching a lot of code)

        Show
        Ari Rabkin added a comment - nod However, we never implemented it that way. There really isn't any such field. Is there a case for actually implementing it? (Which would also require touching a lot of code)
        Hide
        Jerome Boulon added a comment -

        the initial idea was to have one for the Application, like Hadoop and to set the streamname to the actual file path.
        If you remove it make sure that you update all dependent source code like pig queries

        Show
        Jerome Boulon added a comment - the initial idea was to have one for the Application, like Hadoop and to set the streamname to the actual file path. If you remove it make sure that you update all dependent source code like pig queries

          People

          • Assignee:
            Ari Rabkin
            Reporter:
            Ari Rabkin
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development