Flume / FLUME-1941

Support defaults or inheritance in configs

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Configuration
    • Labels:
      None

      Description

      Proposal to support defaults or inheritance in configs.

      The idea is to create a "prototypal" component config, such as a source or sink, which is not necessarily instantiated but is used to avoid repetitive configurations when creating multiple components of the same type. A great example of this is users who define 5 HDFS sinks to increase write parallelism, but they each contain many of the same configuration parameters and differ only in their name and path.

      Basic idea:

      agent.sinks.sink-proto-1.type = my-sink
      agent.sinks.sink-proto-1.path = /var/log/foo/bar
      agent.sinks.sink-proto-1.serializer = MySerializer$Builder
      agent.sinks.sink-proto-1.credentials = mpercy
      
      agent.sinks.sink-1.__prototype__ = sink-proto-1
      agent.sinks.sink-1.path = /var/log/baz/blam
      
      agent.sinks.sink-2.__prototype__ = sink-proto-1
      agent.sinks.sink-2.path = /var/log/glerp/bazinga
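The merge described above could look roughly like the following. This is only a sketch of one possible reading of the proposal, not Flume code: the helper names (`parse_properties`, `group_sinks`, `apply_prototype`) are hypothetical, and it handles a single level of inheritance where a sink's own values override those of its prototype.

```python
PROTOTYPE_KEY = "__prototype__"  # key name taken from the proposal


def parse_properties(text):
    """Parse 'key = value' lines into a dict, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def group_sinks(props, agent="agent"):
    """Group 'agent.sinks.<name>.<param>' entries by sink name."""
    prefix = agent + ".sinks."
    sinks = {}
    for key, value in props.items():
        if key.startswith(prefix):
            # The first dot-separated token is the sink name; the rest is the parameter.
            name, _, param = key[len(prefix):].partition(".")
            if param:
                sinks.setdefault(name, {})[param] = value
    return sinks


def apply_prototype(sinks, name):
    """Return the effective settings for one sink (single level only)."""
    own = dict(sinks[name])
    proto = own.pop(PROTOTYPE_KEY, None)
    merged = dict(sinks[proto]) if proto else {}
    merged.pop(PROTOTYPE_KEY, None)  # assumption: nested prototypes are not followed here
    merged.update(own)               # the sink's own values win
    return merged
```

With the example configuration, `apply_prototype(sinks, "sink-1")` would keep the prototype's type, serializer and credentials and replace only the path.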
      

          Activity

          Gabriel Commeau added a comment (edited)

          That's a great idea! Long configuration files are not only hard to maintain, but error-prone as well.
          I'd add that a prototype should be able to inherit from another prototype, in a recursive manner:

          agent.sinks.sink-proto-2.__prototype__ = sink-proto-1
          agent.sinks.sink-proto-2.path = /var/log/foo/bar2
          

          Additionally (related to default values), I'd suggest adding meta-variables that can be reused, provided this does not break backward compatibility. For example:

          my.basic.variable = 100
          agent1.sinks.sink1.batchSize = ${my.basic.variable}
          
          Brock Noland added a comment -

          +1 to this plan

          Mike Percy added a comment -

          Gabriel, thanks a lot for the feedback!

          I like the recursive concept. Of course, the impl will have to watch out for self-loops.
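One way to follow a prototype chain while guarding against self-loops is sketched below. It assumes the components have already been parsed into a dict of dicts keyed by component name; `resolve` and `PrototypeCycleError` are hypothetical names, and nothing here reflects how Flume's configuration code is actually structured.

```python
PROTOTYPE_KEY = "__prototype__"  # key name from the proposal


class PrototypeCycleError(ValueError):
    """Hypothetical error type for a prototype chain that loops back on itself."""


def resolve(components, name, _chain=()):
    """Return the settings of `name` with its whole prototype chain merged in."""
    if name in _chain:
        loop = " -> ".join(_chain + (name,))
        raise PrototypeCycleError("prototype loop: " + loop)
    if name not in components:
        raise KeyError("unknown prototype: " + name)
    own = dict(components[name])
    parent = own.pop(PROTOTYPE_KEY, None)
    if parent is None:
        return own
    # Resolve the parent first, remembering every name already visited.
    merged = resolve(components, parent, _chain + (name,))
    merged.update(own)  # the child's values win over inherited ones
    return merged
```

Tracking the chain as a tuple keeps the error message readable (it names every component in the loop) at the cost of a linear membership check, which should not matter for configuration-sized inputs.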

          I also think defining variables is a great idea. However, because of recent changes to the exec source that support arbitrary shell syntax, I'm concerned about conflicts with shell variables, so I think we would need to use non-bash syntax for the substitution. Maybe something like JIRA double-brackets? (This is basically JIRA syntax, so watch out when posting it in here. I'm putting all the syntax into JIRA "noformat" sections from here on out):

          agent.sinks.sink-1.foo = /bar/{{variable-name}}/baz
          

          Maybe someone who is more familiar with Bash than I am would like to provide feedback on this issue? Note also that the BucketPath stuff in HDFS sink has some similar-looking and magical date- and variable-substitution syntax which looks like:

          agent.sinks.sink-1.path = /path/to/hdfs/%{header-name}/%Y-%m-%d/%H:%M/blah
          

          It's probably best to confine variable definitions to some kind of hierarchy as well, so we keep the flexibility to extend the syntax in the future. Maybe something like:

          agent1.__vars__.my-variable-name = foo
          

          as well as

          __global__.__vars__.my-variable-name = bar
          

          We could easily get into ambiguities during parsing by allowing dots in variable names, so I'd advocate against that.

          Thoughts? Alternative suggestions most welcome.
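To make the scoping idea concrete, here is a rough sketch of how the two variable sections and the double-curly syntax might fit together. Several things are assumptions rather than part of the proposal: that an agent-level variable shadows a global one of the same name, that an undefined variable is an error, and the helper names `collect_vars` and `substitute`.

```python
import re

VARS_SECTION = "__vars__"    # section name from the comment above
GLOBAL_SCOPE = "__global__"  # global scope name from the comment above
VAR_REF = re.compile(r"\{\{([A-Za-z0-9_-]+)\}\}")  # no dots allowed in names


def collect_vars(props, agent):
    """Build the variable table for one agent; agent values shadow global ones."""
    table = {}
    for scope in (GLOBAL_SCOPE, agent):  # the later scope overrides the earlier one
        prefix = scope + "." + VARS_SECTION + "."
        for key, value in props.items():
            if key.startswith(prefix):
                var_name = key[len(prefix):]
                if "." in var_name:
                    raise ValueError("dots are not allowed in variable names: " + key)
                table[var_name] = value
    return table


def substitute(value, table):
    """Replace each {{name}} in `value`; an undefined name is an error in this sketch."""
    def lookup(match):
        name = match.group(1)
        if name not in table:
            raise KeyError("undefined variable: " + name)
        return table[name]
    return VAR_REF.sub(lookup, value)
```

Because the pattern only matches two opening braces, BucketPath-style `%{header-name}` escapes and shell-style `${VAR}` references pass through untouched.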

          Brock Noland added a comment -

          Yeah, it's probably best to stay away from $var, even though we could require exec source users to escape vars (since that change has not been released yet).

          Gabriel Commeau added a comment -

          That makes sense. Another way to handle the potential conflicts is for the parser to ignore the variables that it can't substitute, and leave the string as-is. The user would be responsible for making sure the variable names chosen within Flume's configuration file do not conflict with the rest (the command line of the exec source for instance).
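As a small illustration of this pass-through behaviour, assuming the `${name}` syntax from the earlier comment and a hypothetical helper called `substitute_known`:

```python
import re

VAR_REF = re.compile(r"\$\{([^}]+)\}")


def substitute_known(value, variables):
    """Replace ${name} only when `name` is defined; leave everything else as-is."""
    # m.group(0) is the whole "${name}" text, returned unchanged for unknown names.
    return VAR_REF.sub(lambda m: variables.get(m.group(1), m.group(0)), value)
```

An exec source command that refers to a shell variable Flume does not know about would then reach the shell unmodified, which is also why a name collision between the two would go unnoticed.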

          Also, I envisioned the variables as global only. Declaring them within an agent's configuration section (i.e. "agent1.") would imply a variable scope, which may be a little too much for Flume's configuration IMHO. I love the idea of grouping them in a global section though: it's more readable and clearer for everybody, and it prevents future conflicts.

          Connor Woodson added a comment -

          I think using curly braces might make things overly complicated because of the exec source and the HDFS path names. What about using "__" (two underscores), either in front of or on either side of the variable name? I assume that is something that would normally never appear in a configuration.

          agent.vars = variable
          agent.vars.variable = cookies
          agent.sinks.hdfs.hdfs.path = http://www.__variable__.com
          

          The easiest route is to prevent variables from referencing themselves; I'm not sure of a good workaround for that. If something detected as a variable hasn't been defined, then log a warning but treat that text as not being a variable (and if a line contains an opening "__" but no closing one, probably again just log a warning and keep the text).
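A sketch of the underscore-delimited variant with the warning behaviour described above. The helper names `load_vars` and `substitute` are hypothetical, the variable list is assumed to be space-separated like other Flume component lists, and a value with an unbalanced delimiter is kept whole as a simplification.

```python
import warnings

DELIM = "__"


def load_vars(props, agent="agent"):
    """Read variables declared as '<agent>.vars = a b' and '<agent>.vars.<name> = value'."""
    names = props.get(agent + ".vars", "").split()
    return {name: props[agent + ".vars." + name] for name in names}


def substitute(value, variables):
    """Replace __name__ with its value; warn and keep the text when that is not possible."""
    parts = value.split(DELIM)
    if (len(parts) - 1) % 2 == 1:
        # An odd number of delimiters means one "__" has no partner.
        warnings.warn("unbalanced '__' in %r; text kept as-is" % value)
        return value
    out = [parts[0]]
    # Odd-indexed parts sit between a pair of delimiters, so they are candidate names.
    for i in range(1, len(parts), 2):
        name, following = parts[i], parts[i + 1]
        if name in variables:
            out.append(variables[name])
        else:
            warnings.warn("undefined variable %r; text kept as-is" % name)
            out.append(DELIM + name + DELIM)
        out.append(following)
    return "".join(out)
```

Substitution is applied to values only, so a key such as `__prototype__` from the original proposal would not be mistaken for a variable reference.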

          For the prototypes et al., instead of creating a new section for defining prototypes, what about giving sources/channels/sinks a "parent" attribute (I don't know how well it'd work for interceptors / serializers), so you'd have a configuration like so:

          agent.sinks.s1.type = HDFS
          ...
          agent.sinks.s2.parent = s1
          

          Rules would probably be that you can't have both a parent and a type; and to prevent an infinite reference case, the parent sink cannot have "parent" defined.

          To allow for recursive inheritance, it would depend on how the configuration code works; I haven't taken a look at it so I don't know. The parent of a sink either has no parent defined, as before, or it will need to already have been fully defined i.e. has taken all properties from its parent sink, thus making it no longer have a parent, and thus fulfilling the previous case.
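The "parent must already be fully defined" rule can be read as a repeated pass over the sinks. The sketch below assumes the sinks are already available as a dict of dicts; `resolve_parents` is a hypothetical name, and treating a missing or circular parent as an error is an assumption.

```python
def resolve_parents(sinks):
    """Flatten 'parent' references, only inheriting from sinks that are fully defined."""
    for name, cfg in sinks.items():
        if "parent" in cfg and "type" in cfg:
            raise ValueError(name + ": cannot set both 'parent' and 'type'")

    # Sinks without a parent are fully defined from the start.
    done = {n: dict(c) for n, c in sinks.items() if "parent" not in c}
    pending = {n: dict(c) for n, c in sinks.items() if "parent" in c}

    while pending:
        ready = [n for n, c in pending.items() if c["parent"] in done]
        if not ready:
            # No progress is possible: a parent is missing or the references form a loop.
            raise ValueError(
                "unresolvable parent (missing or circular): " + ", ".join(sorted(pending))
            )
        for name in ready:
            cfg = pending.pop(name)
            merged = dict(done[cfg.pop("parent")])
            merged.update(cfg)  # the child's own values win
            done[name] = merged  # the child no longer has a parent
    return done
```

Each pass removes the "parent" key from the sinks it flattens, so a sink that inherits from an already flattened sink falls back into the simple case, and a loop shows up as a pass that makes no progress.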

          Mike Percy added a comment -

          I don't think bash interprets double curly brackets:

          $ echo $0
          -bash
          $ ls {{k}}
          ls: {{k}}: No such file or directory
          

          This parent concept is quite similar to the __prototype__ concept in the above proposal. Restricting inheritance to components of the same type (for some definition of type) makes a lot of sense.


            People

            • Assignee: Unassigned
            • Reporter: Mike Percy
            • Votes: 0
            • Watchers: 3
