Chukwa
  1. Chukwa
  2. CHUKWA-330

controller should not create duplicates for adaptors with offsets

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0, 0.3.0
    • Component/s: Input Tools
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Hide
      For 0.2:

      Modified AgentController to add adaptor by name for uniquely identify stream.

      For 0.3:

      Make separation between fixed adaptor parameters and optional adaptor parameters for generating adaptor id.
      Show
      For 0.2: Modified AgentController to add adaptor by name for uniquely identify stream. For 0.3: Make separation between fixed adaptor parameters and optional adaptor parameters for generating adaptor id.

      Description

      The offset-of-first-byte param to FileTailingAdaptor is hashed into the default adaptor name. This changes the semantics of adaptor uniqueness, and broke metrics and the log4jappender.

      Proposed fix:
      Add a new "addByName()" method to ChukwaAgentController that lets callers specify names.
      Use this new method whenever finer control of adaptor uniqueness is needed.

      1. testsForTrunk.patch
        5 kB
        Ari Rabkin
      2. CHUKWA-330-big-1.patch
        31 kB
        Ari Rabkin
      3. CHUKWA-330-big.patch
        26 kB
        Ari Rabkin
      4. CHUKWA-330.patch
        8 kB
        Ari Rabkin

        Issue Links

          Activity

          Hide
          Jerome Boulon added a comment -

          It will be better only if the client has enough knowledge on how the underlying implementation is done and if the client want to take care of that ....
          It's good to have the ability to create your own naming convention for the adaptor but I assume that most developers will just use the implementation that do not specify the name as it simple and they will not have to worry about duplicate and so on ....

          Solution that will not change the semantics of adaptor uniqueness, and broke metrics and the log4jappender :

          • add a method to the adaptor interface (isDuplicate, that takes the same parameters as the start method)
          • each adaptor will register to the factory class so AdaptorManager will be able to call the isDuplicate's method on the right Adaptor
          • if there's no instance for specific adaptor then it's because there's no duplicate

          At the AdaptorManager level:

          • if name is present then do not check anything
          • if name not there then rely on the adaptor to found if it's a duplicate or not.
          Show
          Jerome Boulon added a comment - It will be better only if the client has enough knowledge on how the underlying implementation is done and if the client want to take care of that .... It's good to have the ability to create your own naming convention for the adaptor but I assume that most developers will just use the implementation that do not specify the name as it simple and they will not have to worry about duplicate and so on .... Solution that will not change the semantics of adaptor uniqueness, and broke metrics and the log4jappender : add a method to the adaptor interface (isDuplicate, that takes the same parameters as the start method) each adaptor will register to the factory class so AdaptorManager will be able to call the isDuplicate's method on the right Adaptor if there's no instance for specific adaptor then it's because there's no duplicate At the AdaptorManager level: if name is present then do not check anything if name not there then rely on the adaptor to found if it's a duplicate or not.
          Hide
          Ari Rabkin added a comment -

          My view is that adaptor names are a responsibility for users, not for Adaptor developers. In practice, most users will use the controller, which is able to handle this for them. More precisely, I think uniqueness is a problem for whoever is issuing the "add" command, which is a handful of places in the Chukwa code, plus advanced users who presumably know what they're doing.

          I'm uneasy about an isDuplicate method. The Adaptor interface is already getting uncomfortably complex and that makes me skittish. Worse, it's not clear what isDuplicate should do. What does it mean, really, to say that one adaptor duplicates another? It seems like this is pushing an awful lot of policy into Adaptors, without being able to give adaptor authors any guidance on what it should be.

          If I understand your proposal correctly, isDuplicate() need to be called for each existing adaptor, every time we create one. That might be a performance limitation at scale.

          Show
          Ari Rabkin added a comment - My view is that adaptor names are a responsibility for users , not for Adaptor developers. In practice, most users will use the controller, which is able to handle this for them. More precisely, I think uniqueness is a problem for whoever is issuing the "add" command, which is a handful of places in the Chukwa code, plus advanced users who presumably know what they're doing. I'm uneasy about an isDuplicate method. The Adaptor interface is already getting uncomfortably complex and that makes me skittish. Worse, it's not clear what isDuplicate should do. What does it mean, really, to say that one adaptor duplicates another? It seems like this is pushing an awful lot of policy into Adaptors, without being able to give adaptor authors any guidance on what it should be. If I understand your proposal correctly, isDuplicate() need to be called for each existing adaptor, every time we create one. That might be a performance limitation at scale.
          Hide
          Ari Rabkin added a comment -

          Here's some code to illustrate the fix I had in mind. Lacks tests, so not yet ready to commit.

          I think this approach is clean, and won't need to be revisited, and won't add to the maintainance burden in the future.

          Show
          Ari Rabkin added a comment - Here's some code to illustrate the fix I had in mind. Lacks tests, so not yet ready to commit. I think this approach is clean, and won't need to be revisited, and won't add to the maintainance burden in the future.
          Hide
          Jerome Boulon added a comment -

          Ari could you move the MD5's method to an helper class out of ChukwaAgent.
          There's no point in requiring the agent class on top of chukwa-hadoop-client.jar (so make sure that chukwa-hadoop-client.jar is self contain)

          Show
          Jerome Boulon added a comment - Ari could you move the MD5's method to an helper class out of ChukwaAgent. There's no point in requiring the agent class on top of chukwa-hadoop-client.jar (so make sure that chukwa-hadoop-client.jar is self contain)
          Hide
          Ari Rabkin added a comment -

          Good catch. How about org.apache.hadoop.chukwa.util.AdaptorNamingUtils?

          Show
          Ari Rabkin added a comment - Good catch. How about org.apache.hadoop.chukwa.util.AdaptorNamingUtils?
          Hide
          Ari Rabkin added a comment -

          Quick fix for troubles with metrics. Probably not final word on underlying issue, though.

          Show
          Ari Rabkin added a comment - Quick fix for troubles with metrics. Probably not final word on underlying issue, though.
          Hide
          Eric Yang added a comment -

          Base on today's discussion, the interface should be refined to include optional parameters, and the original naming is generated base on adaptor name + stream name (fixed parameters).

          UTF8 adaptor should have one offset tracker instead of 2 to identify that sending from offset 0 and offset 100 should generate the same adaptor id for the common use cases for UTF8 adaptor.

          Show
          Eric Yang added a comment - Base on today's discussion, the interface should be refined to include optional parameters, and the original naming is generated base on adaptor name + stream name (fixed parameters). UTF8 adaptor should have one offset tracker instead of 2 to identify that sending from offset 0 and offset 100 should generate the same adaptor id for the common use cases for UTF8 adaptor.
          Hide
          Jerome Boulon added a comment -

          For my proposal:

          1- if the client provide a name, then this name overwrite all the AdaptorManager logic
          2- If there's no name then the AdaptorManager rely on the underlying adaptor to provide one.

          3- >> uniqueness is a problem for whoever is issuing the "add" command
          I still think that if the one that is doing the add is automatically able to figure out a unique name in order to avoid duplicate in a distributed environment without talking to anyone else, then the that logic can be pushed down the adaptorManager, who is the only one to have the global vision.

          4- >> isDuplicate() need to be called for each existing adaptor, every time we create one
          No. the AdaptorManager, receive a Add command then call a isDuplicate's method to the factory,
          Here the workfow:

          4.1- At the factory level, if there's no adaptor already registered for this AdaptorType then there's no duplicate
          4.1.1- if it's not a duplicate The AdaptorManger call the factory to create the adaptor
          4.1.1.1- The factory create the new adaptor and register that instance as the instance that will provide all answers for the isDuplicate call for that specific AdaptorType (hash.put(AdaptorType, adaptor) )

          4.1.2- if there's already one adaptor registered for the AdaptorType then the factory return hash.get(AdaptorType).isDuplicate(...)
          4.1.2.1- At the AdaptorManager level if it's a duplicate then it returns the existing adaptor based on the name, otherwise call the factory

          So there's no iteration over any adaptors, just a get from a hash based on AdaptorType + 1 call to isDuplicate(..) and therefore there's no performance issue

          Show
          Jerome Boulon added a comment - For my proposal: 1- if the client provide a name, then this name overwrite all the AdaptorManager logic 2- If there's no name then the AdaptorManager rely on the underlying adaptor to provide one. 3- >> uniqueness is a problem for whoever is issuing the "add" command I still think that if the one that is doing the add is automatically able to figure out a unique name in order to avoid duplicate in a distributed environment without talking to anyone else, then the that logic can be pushed down the adaptorManager, who is the only one to have the global vision. 4- >> isDuplicate() need to be called for each existing adaptor, every time we create one No. the AdaptorManager, receive a Add command then call a isDuplicate's method to the factory, Here the workfow: 4.1- At the factory level, if there's no adaptor already registered for this AdaptorType then there's no duplicate 4.1.1- if it's not a duplicate The AdaptorManger call the factory to create the adaptor 4.1.1.1- The factory create the new adaptor and register that instance as the instance that will provide all answers for the isDuplicate call for that specific AdaptorType (hash.put(AdaptorType, adaptor) ) 4.1.2- if there's already one adaptor registered for the AdaptorType then the factory return hash.get(AdaptorType).isDuplicate(...) 4.1.2.1- At the AdaptorManager level if it's a duplicate then it returns the existing adaptor based on the name, otherwise call the factory So there's no iteration over any adaptors, just a get from a hash based on AdaptorType + 1 call to isDuplicate(..) and therefore there's no performance issue
          Hide
          Ari Rabkin added a comment -

          Eric, Jerome – both your suggestions involve significant changes to the adaptor interface, as well as to the semantics of adaptor creation. The biggest reason I wanted to fix this in the controller is that it's a very small change – the patch only changes a handful of lines.

          How would you feel about doing the controller fix in 0.2, and something more principled for trunk?

          Show
          Ari Rabkin added a comment - Eric, Jerome – both your suggestions involve significant changes to the adaptor interface, as well as to the semantics of adaptor creation. The biggest reason I wanted to fix this in the controller is that it's a very small change – the patch only changes a handful of lines. How would you feel about doing the controller fix in 0.2, and something more principled for trunk?
          Hide
          Jerome Boulon added a comment -

          or we can put back something similar to the simple check that was there before for the 0.2 release + a check on the name.
          But that one involved scanning all adaptors and may have some performances impact under a large number of adaptors.

          if (params.indexOf(a.getValue().getStreamName())!=-1)

          { log.warn(params + " already exist, skipping."); return null; }
          Show
          Jerome Boulon added a comment - or we can put back something similar to the simple check that was there before for the 0.2 release + a check on the name. But that one involved scanning all adaptors and may have some performances impact under a large number of adaptors. if (params.indexOf(a.getValue().getStreamName())!=-1) { log.warn(params + " already exist, skipping."); return null; }
          Hide
          Ari Rabkin added a comment -

          As best I understand, here's the best synethsis of what Eric and Jerome are proposing.

          • Adaptor.start() has been split into start() and parseArgs(). The parseArgs() method parses the arguments, stores them, and returns only the part of the argument that is required for duplicate detection – so for instance, the offset is not included. This is Eric's point about splitting mandatory from optional params.

          Only the mandatory part is used in synthesizing adaptor names by default. This is basically jerome's suggestion, except without the hassle of maintaining a registry.

          I'm proposing this for trunk, not sure about 0.2

          Show
          Ari Rabkin added a comment - As best I understand, here's the best synethsis of what Eric and Jerome are proposing. Adaptor.start() has been split into start() and parseArgs(). The parseArgs() method parses the arguments, stores them, and returns only the part of the argument that is required for duplicate detection – so for instance, the offset is not included. This is Eric's point about splitting mandatory from optional params. Only the mandatory part is used in synthesizing adaptor names by default. This is basically jerome's suggestion, except without the hassle of maintaining a registry. I'm proposing this for trunk, not sure about 0.2
          Hide
          Cheng added a comment -

          Do we have any number about how many files are processed every day? I think the performance impact introduced by scanning all adaptors could be neglected.

          Show
          Cheng added a comment - Do we have any number about how many files are processed every day? I think the performance impact introduced by scanning all adaptors could be neglected.
          Hide
          Jerome Boulon added a comment -
          • How do you know that you don't need the optional part to compute the adaptor's name?
            What if the optional part says use this extractRecord method unless there's already someone watching the file?
          • What the purpose of parseArgs()? returns only the part of the argument that is required for duplicate detection? so that's basically my proposal with the isDuplicate method. At least with a isDuplicate method's name you kind of know what is used for.
            And if you don't maintain the registry part (the AdaptorType lookup from the factory) then you are always going to create a new instance just to get the name, is that correct?
          Show
          Jerome Boulon added a comment - How do you know that you don't need the optional part to compute the adaptor's name? What if the optional part says use this extractRecord method unless there's already someone watching the file? What the purpose of parseArgs()? returns only the part of the argument that is required for duplicate detection? so that's basically my proposal with the isDuplicate method. At least with a isDuplicate method's name you kind of know what is used for. And if you don't maintain the registry part (the AdaptorType lookup from the factory) then you are always going to create a new instance just to get the name, is that correct?
          Hide
          Ari Rabkin added a comment -

          Yes. The parseArgs() method returns exactly the part of the argument used for duplicate detection; this was intended to carry out your suggestion. I noticed that isDuplicate would basically have to reproduce all the parsing logic from start(), and it occurred to me that that should be factored out-- duplicating the code would be a significant maintainance burden.

          The point about "optional parts" is a good one. I think this is a documentation-only fix – instead of talking about optional parts, the comment should be more explicit that the string returned is used for duplicate detection.

          As to not keeping a registry – object creation in Java is really efficient, as is garbage-collecting short-lived objects so I think the overhead of creating an object, and throwing it away if that adaptor already exists, is OK. (Particularly since it avoids a hash lookup) Adding more shared datastructures is unappealing – each new datastructure needs a synchronization scheme, and introduces the posibility of subtle bugs.

          Show
          Ari Rabkin added a comment - Yes. The parseArgs() method returns exactly the part of the argument used for duplicate detection; this was intended to carry out your suggestion. I noticed that isDuplicate would basically have to reproduce all the parsing logic from start(), and it occurred to me that that should be factored out-- duplicating the code would be a significant maintainance burden. The point about "optional parts" is a good one. I think this is a documentation-only fix – instead of talking about optional parts, the comment should be more explicit that the string returned is used for duplicate detection. As to not keeping a registry – object creation in Java is really efficient, as is garbage-collecting short-lived objects so I think the overhead of creating an object, and throwing it away if that adaptor already exists, is OK. (Particularly since it avoids a hash lookup) Adding more shared datastructures is unappealing – each new datastructure needs a synchronization scheme, and introduces the posibility of subtle bugs.
          Hide
          Ari Rabkin added a comment -

          big-1 includes fixes to a number of test cases; most of these fixes are unrelated to CHUKWA-330, but needed to make unit tests pass.

          Show
          Ari Rabkin added a comment - big-1 includes fixes to a number of test cases; most of these fixes are unrelated to CHUKWA-330 , but needed to make unit tests pass.
          Hide
          Ari Rabkin added a comment -

          To be clear – I'm proposing the big patch for trunk, small patch for 0.2 – changing the Adaptor interface is a biggish change for this stage of release cycle.

          Show
          Ari Rabkin added a comment - To be clear – I'm proposing the big patch for trunk, small patch for 0.2 – changing the Adaptor interface is a biggish change for this stage of release cycle.
          Hide
          Eric Yang added a comment -

          +1, thanks Ari for addressing these issues with lots of patients.

          CHUKWA-330 is good for chukwa 0.2 branch.
          CHUKWA-330-big-1 is good for trunk for additional bake time.

          Show
          Eric Yang added a comment - +1, thanks Ari for addressing these issues with lots of patients. CHUKWA-330 is good for chukwa 0.2 branch. CHUKWA-330 -big-1 is good for trunk for additional bake time.
          Hide
          Eric Yang added a comment -

          I just committed this, thanks Ari.

          Show
          Eric Yang added a comment - I just committed this, thanks Ari.
          Hide
          Ari Rabkin added a comment -

          Hey Eric? I think you may have committed the one without the revised tests. Was that deliberate?

          Show
          Ari Rabkin added a comment - Hey Eric? I think you may have committed the one without the revised tests. Was that deliberate?
          Hide
          Eric Yang added a comment -

          My bad, I committed the wrong patch.

          Show
          Eric Yang added a comment - My bad, I committed the wrong patch.
          Hide
          Eric Yang added a comment -

          I just committed the test cases, thanks Ari.

          Show
          Eric Yang added a comment - I just committed the test cases, thanks Ari.

            People

            • Assignee:
              Ari Rabkin
              Reporter:
              Ari Rabkin
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 1.5h
                1.5h
                Remaining:
                Remaining Estimate - 1.5h
                1.5h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development