In the spirit of interoperability, can we work toward modularizing the Bigtop puppet recipes so that "hadoop_cluster_node" is not defined as an HDFS-specific class?
I'm not a puppet expert, but from testing on BIGTOP-1171 I'm starting to notice that the HDFS dependency can make deployment a little complex (e.g. the init-hdfs logic).
For those of us not necessarily dependent on HDFS, this is a cumbersome service to maintain.
Here are two reasons why decoupling "hadoop_cluster_node" from HDFS is beneficial:
- For HDFS users: In some use cases we might want to use Bigtop to provision many nodes, only some of which are datanodes. For example: let's say our cluster is crawling the web in mappers, doing some machine learning, and distilling large pages into small relational database tuples that summarize the "entities" in each page. In this case we don't necessarily benefit much from data locality, because we might be CPU-bound rather than network/IO-bound. So we might want to provision a cluster of 50 machines: 40 multicore, CPU-heavy ones and just 10 datanodes to support the DFS. I know this is an extreme case, but it's a good example.
- For NON-HDFS users: One important aspect of emerging Hadoop workflows is HCFS: https://wiki.apache.org/hadoop/HCFS/ – the idea that filesystems like S3, OrangeFS, GlusterFileSystem, etc. are all just as capable, although not necessarily optimal, of supporting YARN and Hadoop operations as HDFS is.
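To illustrate the HCFS point: swapping the default filesystem is, at the Hadoop level, just a core-site.xml change. The glusterfs scheme and implementation class below are taken as an example from the glusterfs-hadoop plugin and are meant as an illustrative sketch, not a tested configuration:

```xml
<!-- core-site.xml: point Hadoop at an alternative HCFS filesystem
     instead of HDFS. Other HCFS implementations (s3, orangefs, ...)
     follow the same pattern with their own scheme and impl class. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>glusterfs:///</value>
  </property>
  <property>
    <name>fs.glusterfs.impl</name>
    <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
  </property>
</configuration>
```

Nothing in YARN itself needs to change for this; the point of the JIRA is that the puppet recipes shouldn't hard-code the hdfs:// assumption either.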
This JIRA might have to be done in phases, and might need some refinement since I'm not a puppet expert. But here is what seems logical:
1) hadoop_cluster_node shouldn't necessarily know about jobtrackers, tasktrackers, or any other non-essential YARN components.
2) Since YARN does need a DFS of some sort to run on, hadoop_cluster_node will need definitions for that DFS. The relevant configuration properties (fs.defaultFS, or the deprecated fs.default.name) could be put into the puppet configuration and discovered that way.
3) While we're at it: should the hadoop_cluster_node class even know about specific ecosystem components (ZooKeeper, etc.)? Some tools, such as ZooKeeper, don't even need Hadoop to run, so there is a lot of modularization to be done there.
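A rough sketch of what points 1) and 2) could look like, where the default filesystem URI becomes a class parameter instead of an HDFS assumption. Class and parameter names here are illustrative, not the current recipe layout:

```puppet
# Sketch: hadoop_cluster_node takes the default filesystem as a
# parameter rather than assuming HDFS. Names are hypothetical.
class hadoop_cluster_node (
  $default_fs = 'hdfs://localhost:8020',
) {
  # Core Hadoop config only needs the fs.defaultFS URI; it does not
  # care whether the scheme is hdfs://, glusterfs://, s3://, ...
  class { 'hadoop::common':
    default_fs => $default_fs,
  }
}

# HDFS-specific pieces (namenode, datanode, the init-hdfs logic)
# move into an opt-in class included only on actual HDFS nodes.
class hadoop_hdfs_node {
  include hadoop::namenode
  include hadoop::datanode
}
```

With something like this, a non-HDFS deployment would just set $default_fs to its HCFS URI and never include hadoop_hdfs_node, and the extreme 40/10 cluster above would include it on only the 10 datanode machines.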
Maybe this can be done in phases, but again, a puppet expert will have to weigh in on what's feasible and practical, and maybe on how to phase these changes in an agile way. Any feedback is welcome - I realize this is a significant undertaking... But it's important to democratize the Hadoop stack, and Bigtop is the perfect place to do it!