Apache Storm / STORM-107

Add better ways to construct topologies



    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: storm-core
    • Labels: None



      AFAIK the only way to construct a topology is to wire its components together by hand, e.g.

         (topology
          {"firehose" (spout-spec firehose-spout)}
          {"our-bolt-1" (bolt-spec {"firehose" :shuffle}
                                   some-bolt
                                   :p 5)
           "our-bolt-2" (bolt-spec {"our-bolt-1" ["word"]}
                                   some-other-bolt
                                   :p 6)})

      This sort of manual edge specification feels a bit too 1990s to me. I would like a modular way to express topologies, so that sub-topologies can be composed together. A further benefit of moving away from explicit graph wiring is that checking a topology for correctness would no longer mean tracing every edge by hand.

      I am thinking maybe some sort of LINQ-style query that simply desugars to the arguments we pass into topology.

      For example, the following could desugar into the two map arguments we're passing to topology:

      (def firehose (mk-spout "firehose" firehose-spout))
      (def bolt1 (mk-bolt "our-bolt-1" some-bolt :p 5))
      (def bolt2 (mk-bolt "our-bolt-2" some-other-bolt :p 6))
      (from-in thing (compose firehose bolt1 bolt2)
        (select thing))

      Here from-in pulls thing out of the result of composing the firehose and the bolts, forming the topology we saw before. mk-spout would register a named spout spec, and the from-in macro would return the two maps passed into topology.
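
      To make the desugaring concrete, here is a minimal sketch in plain Clojure. Everything in it is an assumption for illustration: mk-spout, mk-bolt and compose do not exist in Storm, the :grouping option is invented, and the specs are plain maps standing in for what spout-spec and bolt-spec would build. The implementations are quoted symbols so that the sketch runs without Storm on the classpath.

```clojure
;; Hypothetical helpers; none of these are part of Storm.
(defn mk-spout [id impl]
  {:kind :spout :id id :impl impl})

(defn mk-bolt [id impl & {:keys [p grouping] :or {p 1 grouping :shuffle}}]
  {:kind :bolt :id id :impl impl :p p :grouping grouping})

(defn compose
  "Chains components linearly: each bolt reads from the component before it.
  Returns [spouts bolts], two maps shaped like the arguments to topology."
  [& components]
  (let [edges (map vector components (rest components))]
    [(into {} (for [c components :when (= :spout (:kind c))]
                [(:id c) {:impl (:impl c)}]))
     (into {} (for [[upstream c] edges :when (= :bolt (:kind c))]
                [(:id c) {:inputs {(:id upstream) (:grouping c)}
                          :impl   (:impl c)
                          :p      (:p c)}]))]))

(def firehose (mk-spout "firehose" 'firehose-spout))
(def bolt1 (mk-bolt "our-bolt-1" 'some-bolt :p 5))
(def bolt2 (mk-bolt "our-bolt-2" 'some-other-bolt :p 6 :grouping ["word"]))

;; wired is a vector of the spout map and the bolt map.
(def wired (compose firehose bolt1 bolt2))
```

      The edges are derived from the order of the arguments to compose, so nobody has to repeat the upstream component's name in each bolt. A real version would also need to handle fan-out and joins, which a linear chain cannot express.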

      The specification needs a lot of work, but I'm willing to write the patch myself once it's nailed down. The question is: do you want me to write it and send it to you, or will I have to build a storm-tools repo to distribute it?

      mrflip: We have an internal tool for describing topologies at a high level. It hasn't reached production yet, but so far we have found:
      1. it definitely makes sense to have one set of objects that describe topologies, and a different set of objects that express them.
      2. it probably makes sense to have those classes generate a static manifest: a lifeless JSON representation of a topology.

      To the first point: initially we did it like Storm (*). The FooEacher class would know how to wire itself into a topology(), and would also know how to Foo each record that it received. We later refactored to separate topology construction from data handling. There is now an EacherStage that represents anything that obeys the Each contract, so you'd say flow do source(:kafka_trident_spout) > eacher(:foo_eacher) > so_on() > and_so_forth(). The code became simpler and more powerful.
      (*) Actually, in Storm stages are wired into the topology rather than wiring themselves in, but the issue is the same in both cases: the objects are still around at run time, which requires serialization and so forth.
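
      The split described above could look roughly like this in Clojure. This is only a sketch under assumptions: flow, source, eacher, run-flow and the eachers registry are invented names, not our tool's API or Storm's, and the source stage is not actually read from (records are passed in directly) to keep the example small.

```clojure
(require '[clojure.string :as str])

;; Description: pure data. Stages are referred to by keyword only,
;; so nothing executable has to be serialized with the topology.
(defn flow [& stages] (vec stages))
(defn source [id] {:stage :source :id id})
(defn eacher [id] {:stage :each :id id})

(def word-flow
  (flow (source :kafka_trident_spout)
        (eacher :foo_eacher)))

;; Action: implementations live in a separate registry and are
;; looked up by keyword only when the flow actually runs.
(def eachers {:foo_eacher str/upper-case})

(defn run-flow
  "Applies every :each stage of flow-desc, in order, to records."
  [flow-desc registry records]
  (reduce (fn [recs {:keys [stage id]}]
            (if (= stage :each)
              (map (registry id) recs)
              recs))
          records
          flow-desc))

(def result (run-flow word-flow eachers ["a" "b"]))
```

      Because word-flow is just a vector of maps, it can be inspected, printed, or rewritten without loading any of the code that processes records.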

      More importantly, it's worth considering a static manifest.

      The virtue of a manifest is that it is universal and static. If it's a JSON file, anything can generate it and anything can consume it. That would meet the needs of external programs that want to orchestrate Storm/Trident, as well as the repeated requests to visualize a topology in the UI. And since it's static, the worker logic can be simpler, because it knows the whole graph in advance. In my experience, apart from the transactional code, the topology instantiation logic is the most complicated in the joint. That feels justifiable for the transaction logic but not for the topology instantiation.

      The danger of a manifest is also that it is static: you could find yourself on the primrose path to Maven-style XML hell, where you wake up one day and find you've attached layers of ponderous machinery to make a static config file Turing-complete. I think the problem comes when you try to make the file human-editable. The manifest should expressly be the porcelain result of a DSL, with all decisions baked in. It must not itself be a DSL.
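
      As an illustration of a manifest that is generated rather than hand-edited, here is a small Clojure sketch. The manifest layout (:spouts, :bolts, :inputs, :from, :grouping, :fields) is made up for this example and is not a Storm format. to-json is a deliberately tiny stand-in so the sketch has no dependencies; real code would use a proper JSON library.

```clojure
(require '[clojure.string :as str])

;; Hypothetical manifest: every decision is already baked in as plain data.
(def manifest
  {:spouts [{:id "firehose" :impl "firehose-spout"}]
   :bolts  [{:id "our-bolt-1" :impl "some-bolt" :p 5
             :inputs [{:from "firehose" :grouping "shuffle"}]}
            {:id "our-bolt-2" :impl "some-other-bolt" :p 6
             :inputs [{:from "our-bolt-1" :grouping "fields"
                       :fields ["word"]}]}]})

(defn to-json
  "Minimal JSON writer for maps, sequences, keywords, strings, numbers,
  booleans and nil. Not a complete implementation."
  [x]
  (cond
    (nil? x)        "null"
    (map? x)        (str "{"
                         (str/join "," (for [[k v] x]
                                         (str (to-json (name k)) ":" (to-json v))))
                         "}")
    (sequential? x) (str "[" (str/join "," (map to-json x)) "]")
    (keyword? x)    (to-json (name x))
    (string? x)     (pr-str x)
    :else           (str x)))

(defn edges
  "A consumer that needs only the manifest: lists [from to] pairs,
  e.g. for drawing the graph in a UI."
  [m]
  (for [b (:bolts m), in (:inputs b)]
    [(:from in) (:id b)]))

(println (to-json manifest))
```

      The edges function never touches a spout or bolt implementation, which is the point: a visualizer or an external orchestrator only needs the file.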

      In general, we find that absolute separation of orchestration (what things should be wired together) and action (actually doing things) seems painful at design time but ends up making things simpler and more powerful.





              Assignee: Unassigned
              Reporter: James Xu (xumingming)