Cascading and HWS are different beasts.
Cascading is a different way of doing what Pig does. Programming in Cascading means programming against a higher-level abstraction that resolves into a series of Map/Reduce jobs. HWS is a (server) workflow system specialized in running Hadoop/Pig jobs wired together via a PDL descriptor.

A few quick highlights of how Cascading and HWS differ:

- Cascading uses a topological search model to resolve the execution path. HWS uses a 'DAG of processes workflow' model that allows explicitly expressing parallelism and alternate execution paths (decisions).
- Cascading runs as a client from the command line. HWS is a server system (like the Hadoop Job Tracker) to which you submit workflow jobs and later check their status. In HWS no resources are held once the client has submitted the workflow job; the workflow job runs in the server. This allows you to run several thousand workflow jobs concurrently from a single HWS that supports system failover. In HWS, monitoring and status tracking of jobs are done via CLIs and a web console that gathers data from HWS (like you do in Hadoop).
- Cascading's primary programming model is similar to Pig but with a Java API. In Cascading you can still use your Hadoop jobs as a flow, as a way to integrate with existing map/reduce apps, but the real benefit of Cascading comes from using its API programming model. HWS's primary programming model is Hadoop/Pig jobs connected via a workflow definition (a PDL-like XML file).
- In Cascading you need to write Java code to wire your Hadoop jobs. In HWS you don't wire your Hadoop/Pig jobs in Java but in a workflow XML file, in a more declarative way.

- Worried a bit about the logic of when ${something} refs are resolved. We already have that problem in Hadoop, in that Configuration resolves ${property} refs in the JVM where the conf is, so as the configuration gets moved around, the values change, and you get no way of fixing them.
- Similarly, there is some inconsistency in the language as far as URLs for callbacks are concerned. You need a consistent expression language.
- I was a bit worried about the Web Service APIs; good to see that they are JSON/RESTy. Given that we also need a long-haul API to talk to Hadoop itself, it would be good if what was done here could act as a foundation for any other long-haul web APIs.
- Why the choice of a (possibly HA) SQL DB for the jobstore? It is possible to build failover with them, but you need 2-3 machines dedicated to the role. If we could push state out to HDFS/HBase then perhaps we'd eliminate a point of complexity. Alternatively: share state between peers using ZooKeeper?
- I'd consider allowing users to plug in new actions without editing the codebase; it could be handled with a schema that went from <email> to <action type="email" />, with action-specific content. This would let me use jabber/twitter plugins for my job status, etc.
- Testing? hsqldb could be used for the database.

One more Q. Why XML and not JSON?
Cascading is good software, but it is GPL'd and does not currently accept outside submissions; therefore, it's incompatible with Hadoop from a licensing perspective.
It would be great for Hadoop to have a more generic workflow scheduler that is Apache2 licensed. As such, we should avoid a Cascading vs HWS discussion. It isn't relevant unless the Cascading folks are open to licensing their code under the Apache2 license.

Comments re Cascading aren't strictly true, but this isn't the forum to clarify.
> this isn't the forum to clarify
Why not? The question is whether this is redundant with Cascading, so comparisons are certainly relevant, no?

All I was trying to say is that with current licensing, Cascading cannot be hosted with / distributed by Apache/Hadoop. I can see how there is room to misinterpret other aspects of my comment.
I am sure there would be overlap between the feature sets for HWS and Cascading. That said, I think it's important to have a workflow scheduler that can be shipped with Hadoop under the Apache2 license. I definitely agree this is not the forum to discuss licensing issues and implications around other projects.

Sorry, I was referring to the technical and process comments.
We do accept contributions. Independent extensions are usually a better option. All levels of work representation are DAGs internally, and subsequently are processed/submitted in parallel, where possible, to Hadoop. Cascading was designed with all the necessary public abstractions to implement HWS, in a fashion. etc etc.. Regardless, Christophe's licensing points are valid.

Regarding the EL resolution:
Resolution is well defined, both how and when. ${something} expressions are resolved from workflow job properties and EL functions at the time that a workflow job starts a workflow action (enters the node). Values from the HWS configuration (hws-default.xml & hws-site.xml) are not used to resolve workflow job properties (this is different from Hadoop). The EL language HWS uses is JSP EL (we use the commons-el implementation). It supports variables, functions and complex expressions. Which inconsistency are you (Steve) referring to? It is probably a mistake in the spec.

Regarding the choice of SQL DB for job store:
If you need HA/failover for the DB you can get it; granted, you need hardware, but it is zero to minimal effort from the HWS side. We did not use HDFS because we need frequent workflow job context updates (i.e. every time an action starts or ends). HWS keeps zero state in memory when a workflow job is not doing a transition (this allows HWS to scale big); because of this we need indexed access (i.e. by job ID, action ID, user ID). A SQL DB gives very good read/write access times. Finally, the transaction support makes it very straightforward to keep things consistent in case of failure. HWS uses standard SQL with no extensions from any implementation, thus it can run on any SQL DB (we use HSQL for unit tests and MySQL when deployed).

Regarding allowing users to plug in new actions without editing the codebase and schema:
No need to modify the HWS codebase for a new action. Your suggestion about using something like <action type="email"> instead of <email> makes sense. This would also remove the need to tweak the XML schema. Something like:

  <action name="myPigjob">
    <pig xmlns="hws:pig:1">
      ...
    </pig>
  </action>

By doing this, schema validation of the action type can also be performed. Plus you could support multiple versions of an action type if needed.

Regarding testing, using HSQL:
Yes, we already do; testcases fly.

Regarding why XML and not JSON:
HWS uses JSON for all WS API responses. HWS uses XML for the workflow job conf (leveraging Hadoop Configuration). HWS uses XML for the workflow definition (PDL).

Regarding Cascading comments not being strictly true:
I'm not a Cascading expert so I may have missed something, but I think I got things mostly right. Corrections please...

It would be useful to clarify the goals a bit. For example, is the aim to be language independent, so one can launch workflows from any programming language? Does this need to be a server-side workflow scheduler, or would a client-side scheduler be sufficient (for the first release at least)?
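As an aside on the EL resolution rule stated in the reply above (references are resolved from the workflow job properties at the moment an action node is entered, never from server-side configuration), here is a minimal Python sketch. It is not HWS code: the function names `resolve` and `start_action` are hypothetical, the property names are only illustrative, and it handles plain ${name} variables only, not the functions or complex expressions that JSP EL supports.

```python
import re

# Matches ${name} references; a simplification of JSP EL (variables only).
_REF = re.compile(r"\$\{([^}]+)\}")


def resolve(value, job_properties):
    """Replace each ${name} in value with job_properties[name].

    Only the workflow job's own properties are consulted; an unknown
    name raises KeyError instead of falling back to server configuration.
    """
    return _REF.sub(lambda m: job_properties[m.group(1)], value)


def start_action(action_conf, job_properties):
    """Resolve an action's configuration at the time the node is entered."""
    return {key: resolve(val, job_properties) for key, val in action_conf.items()}


# Hypothetical job properties and action configuration.
job_properties = {"inputDir": "/data/2009", "queue": "default"}
action_conf = {
    "mapred.input.dir": "${inputDir}/logs",
    "mapred.job.queue.name": "${queue}",
}
resolved = start_action(action_conf, job_properties)
```

Because resolution happens once, at a defined point and against a single property source, the resolved values do not change as the configuration moves between JVMs, which is the concern raised about Hadoop's Configuration.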
One of the stated goals is simplicity, so I wonder if there are some simpler approaches that should be considered. For example:
Finally, a few comments on the spec itself:
Tom - there are good reasons for not using Ant as a workflow system, even though it is a great build tool with good support from IDEs and CI systems.
The Ant tasks for Hadoop aren't that complex, don't have much in the way of testing, and rely on DFSClient, which abuses System.out in ways Ant won't like (Ant puts its own one up that buffers different threads up to line endings). They shouldn't be a reason to stay with Ant. On the topic of Ant, GridAnt does what Tom is thinking of; you could play with that today:
http://www.globus.org/cog/projects/gridant/

Also, I'd expect to see Ant tasks to do the long-haul submissions, to push out workflows with late-binding information (username, S3 login details) sent as part of the process, but not hard-coded into the XML.

Regarding language independence for launching WF jobs:
Yes, that is one of the motivations of the WS API.

Regarding HWS being server-side:
Yes, to provide scalability, failover, monitoring and manageability. We want to move away from client-side solutions. With many 1000s of WF jobs per day, a client-side solution running a process per WF job is not a feasible option.

Regarding leveraging existing workflow engines:
Definitely, that has been our approach from the beginning; our current implementation uses jBPM (which uses Hibernate for persistence). jBPM & Hibernate are LGPL. Apache has provisions to work with LGPL, though this complicates things a bit (for testing, integration, bundling, and potentially some extra layer of indirection). Once there is agreement in the Hadoop community regarding the functional spec, we'd discuss the specifics of the implementation. We can easily replace jBPM with another implementation, and we are doing some work in this area.

Regarding naming corrections:
"hadoop" action to "map-reduce" action and "hdfs" to "fs" makes sense.

Regarding HDFS generalization:
As we use the FileSystem API, it's there already; it's just a naming thing, and per the previous comment we could rename 'hdfs' to 'fs'.

Regarding querying a WF job's progress:
Yes, you can get the current status of the WF job and stats about completed and running actions: start time, end time, resolved configuration values (with all EL expressions evaluated), current status, exit transition, link to the web console for the action if any (i.e. the Hadoop job web console link), etc.

I'd be against embedding stuff and playing xmlns nesting tricks the way you've suggested.

  <action name="myPigjob">
    <pig xmlns="hws:pig:1">
      ...
    </pig>
  </action>

  <action name="myPigjob" type="pig">
    <task name="pig" >
      <inputs>
        <dataset url="s3:/traffic/oystercard/london/2009" />
        <dataset url="s3:/traffic/anpr/london/2009/" />
        <file path="/users/steve/pig/car-vs-tube.pig" />
      </inputs>
      <option name="limit" value="40000" />
      <option name="pigfile" value="/users/steve/pig/car-vs-tube.pig" />
      <option name="startdate" value="2009-01-01" />
      <option name="enddate" value="2009-02-28" />
      <outputs>
        <dataset url="hdfs://traffic/anpr/london" />
      </outputs>
    </task>
  </action>

It's uglier, but having file inputs and outputs explicit could make high-level workload scheduling much easier, especially once you start deciding which rack to create VMs on.

Regarding the use of XSD:
We want to use XSD as it allows us to do XML schema validation at deployment time, making all the parsing code much slimmer. The only programmatic validation we then have to do at deployment time is that the DAG does not have loose ends and does not have cycles.

Regarding the use of multiple XSDs:
We can provide a single XSD, but that will complicate how new action types can be validated at deployment time. As it would require creating a new XSD, a new XSD should have a different URI. That is one of the reasons we went with the approach of different XSDs for actions. Another reason is that by using different XSDs you could eventually support a new hadoop action while still supporting the old one for all deployed applications.

Option 1: Current option, one XSD for control nodes and one XSD per action node type.
Option 2: Current option, one XSD for control nodes and one XSD for all the (out of the box) action node types.
Option 3: Integrate the control nodes and all the (out of the box) action nodes into a single XSD, leaving an extension point for custom action nodes.

Thoughts?
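For illustration, the programmatic deployment-time check mentioned above (the DAG has no loose ends and no cycles) could look roughly like the following Python sketch. This is not the HWS implementation: the function name `validate_workflow`, the dictionary representation of transitions, and the node names are all assumptions made for the example.

```python
def validate_workflow(start, end, transitions):
    """Return a list of problems found in a workflow graph.

    transitions maps each node name to the list of nodes it can
    transition to; `end` is the single terminal node.
    """
    errors = []
    nodes = set(transitions) | {end}

    # Loose ends: a non-terminal node with no way out, or a
    # transition that points at a node that was never defined.
    for node in sorted(transitions):
        if node == end:
            continue
        targets = transitions[node]
        if not targets:
            errors.append(f"loose end: {node} has no outgoing transition")
        for target in targets:
            if target not in nodes:
                errors.append(f"loose end: {node} -> undefined node {target}")

    # Cycles: depth-first search; reaching a node that is still on the
    # current path (IN_PROGRESS) means there is a back edge.
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in nodes}

    def visit(node):
        state[node] = IN_PROGRESS
        for target in transitions.get(node, []):
            if target not in state:
                continue  # undefined target, already reported above
            if state[target] == IN_PROGRESS:
                errors.append(f"cycle: {node} -> {target}")
            elif state[target] == UNSEEN:
                visit(target)
        state[node] = DONE

    if start in nodes:
        visit(start)
    else:
        errors.append(f"undefined start node {start}")

    # Anything never visited cannot be reached from the start node.
    for node in sorted(nodes):
        if state[node] == UNSEEN:
            errors.append(f"unreachable: {node}")
    return errors


# Hypothetical workflow: a fork into a Pig and a map-reduce action, then a join.
good = {
    "start": ["fork"],
    "fork": ["pig", "mr"],
    "pig": ["join"],
    "mr": ["join"],
    "join": ["end"],
}
cyclic = {"start": ["a"], "a": ["b"], "b": ["a"]}
dangling = {"start": ["a"], "a": ["missing"]}
```

Schema validation would catch malformed nodes before this step, so a check like this only has to reason about the shape of the graph.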
Regarding input/output datasets for a high-level workload scheduler:
[IMO, this is a different topic from the XSD issue] I understand the motivation for this, but I see it as belonging to the workload scheduling system. IMO, the workflow nodes should stick to a direct mapping of Hadoop/Pig configuration knobs (config props for Hadoop, params and config props for Pig). This makes the workflow model more intuitive to Hadoop/Pig developers. It should be the higher-level system (in your case the workload scheduler) that maps the input/output datasets to the Hadoop/Pig configuration knobs.

Based on the feedback and some refinements we've done, we updated the spec.
Changelog:
Hey Alejandro,
Any updates on when the code for HWS will be available to the open source community? Thanks,

Jeff,
By the end of this week we'll post an updated spec. We expect to post the code next week; we are still in the process of ironing out documentation, examples and distro. Thanks. Alejandro

Alejandro,
you giving a talk about Oozie during the hadoop summit? – amr

An updated spec. Changes summary:
The attached tarball is Oozie source code.
It includes Hadoop/Pig JARs to compile with 0.18.3 and 0.20.0 (0.19.1 JARs were left out as the tar was larger than 10MB). Bootstrap instructions are in the readme.txt file in the root directory.

Binary distro for Hadoop 0.18.3, with documentation and examples.
We've just posted an initial drop of Oozie. There is still work to be done: it is functionally complete except for the HTTP/EMAIL action nodes, which are not yet implemented, plus normalization and cleanup of error and log messages.
We'll continue to post regular drops.

Alejandro,
I am trying to run the tests for this on Hadoop 18.3 and several things are failing:
1) I get repeated warnings along the lines of
The package name was org.apache.hadoop.dfs in 18. I am using the -Dh18 flag.
2) I also get the following Failed and Errored out tests:
Failed tests:
Tests in error:
These are along the lines of
Thoughts?

Dmitriy,
The WARNs in the log are only warnings; they are not breaking things. Because of some package changes from H18 to H19 we register some exceptions by name instead of by class, so that we don't have to maintain 2 branches. What you see is a warning for the packages not available in 18. The failed tests seem to be related to HDFS authorization being disabled in your Hadoop. The tests in error seem to be related to passphrase-less SSH not being configured. Please check the binary distro; the documentation explains how to set up SSH. Please let me know

adding Oozie to the issue summary
Hadoop Summit Oozie preso
I'm building this now, the setup-maven.sh and setup-jars.sh have a shebang for /bin/sh, but they actually rely on bash functions. At least on my ubuntu 9.04, I had to modify these to /bin/bash to run them.
Oops, thanks for letting us know, we'll fix that in the next drop.
This tar contains the hadoop/streaming/pig/json JARs not available in public maven repositories.
The tar should be expanded within the contrib/oozie directory (after applying the patch).

Oozie source in patch format under contrib.
Error codes have been normalized to number ranges. Added build version information to client and server WS, both available via CLI.
Doesn't this replicate what cascading already does quite well?
See http://www.cascading.org