Bigtop / BIGTOP-1177

Puppet Recipes: Can we modularize them to foster HCFS initiatives?

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.7.0
    • Fix Version/s: backlog
    • Component/s: Deployment
    • Labels: None

      Description

      In the spirit of interoperability, can we work toward modularizing the Bigtop puppet recipes so that "hadoop_cluster_node" is not defined as an HDFS-specific class?

      I'm not a puppet expert, but from testing on BIGTOP-1171 I'm starting to notice that the HDFS dependency can make deployment a little complex (i.e. the init-hdfs logic, etc.).

      For those of us not necessarily dependent on HDFS, this is a cumbersome service to maintain.

      Here are two reasons why decoupling "hadoop_cluster_node" from HDFS is beneficial:

      • For HDFS users: In some use cases we might want to use Bigtop to provision many nodes, only some of which are datanodes. For example: let's say our cluster is crawling the web in mappers, doing some machine learning, and distilling each large page into a small relational database tuple that summarizes the "entities" in the page. In this case we don't necessarily benefit much from locality, because we might be CPU- rather than network/IO-bound. So we might want to provision a cluster of 50 machines: 40 CPU-heavy multicore ones and just 10 datanodes to support the DFS. I know this is an extreme case, but it's a good example.
      • For non-HDFS users: One important aspect of emerging Hadoop workflows is HCFS: https://wiki.apache.org/hadoop/HCFS/ – the idea that filesystems like S3, OrangeFS, GlusterFileSystem, etc. are all just as capable, although not necessarily optimal, of supporting YARN and Hadoop operations as HDFS is.

      This JIRA might have to be done in phases, and might need some refinement since I'm not a puppet expert. But here is what seems logical:

      1) hadoop_cluster_node shouldn't necessarily know about jobtrackers, tasktrackers, or any other non-essential YARN/MapReduce components.
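      A minimal sketch of what that decoupling might look like (the class and role names here are hypothetical, not Bigtop's current recipe layout):

        # Hypothetical sketch: hadoop_cluster_node carries only cluster-wide
        # basics, and compute/storage daemons become opt-in role classes.
        class hadoop_cluster_node {
          include hadoop::common        # java, users, core-site.xml skeleton
        }

        class hadoop_worker_node inherits hadoop_cluster_node {
          include hadoop::nodemanager   # YARN compute only; no datanode implied
        }

        class hadoop_datanode inherits hadoop_cluster_node {
          include hadoop::datanode      # HDFS storage is opt-in, not the default
        }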

      2) Since YARN does need a DFS of some sort to run on, hadoop_cluster_node will need definitions for that DFS. The relevant configuration properties (fs.defaultFS, fs.default.name, and the others listed below) could be put into the puppet configuration and discovered that way; a sketch of that discovery follows the list:

      • fs.defaultFS
      • fs.default.name
      • fs.AbstractFileSystem.<scheme>.impl (e.g. org.apache.hadoop.fs.local...)
      • fs.<scheme>.impl
      • hbase.rootdir
      • mapreduce.jobtracker.staging.root.dir
      • yarn.app.mapreduce.am.staging-dir
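      As a sketch of the discovery idea (the hadoop::config_property defined type is hypothetical; extlookup or hiera could supply the site value):

        # Hypothetical sketch: read the target filesystem from site config so
        # that no recipe assumes HDFS. 'file:///' is just an illustrative default.
        class hadoop::common_config {
          $default_fs = extlookup('hadoop_default_fs', 'file:///')

          hadoop::config_property { 'fs.defaultFS':
            value => $default_fs,
          }
          # fs.AbstractFileSystem.<scheme>.impl, hbase.rootdir, and the staging
          # dirs above would be derived from the same lookup.
        }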

      3) While we're at it: should the hadoop_cluster_node class even know about specific ecosystem components (zookeeper, etc.)? Some tools, such as zookeeper, don't even need Hadoop to run, so there is a lot of modularization to be done there.
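      A sketch of how that opt-in could look (the components lookup and class names are illustrative):

        # Hypothetical sketch: ecosystem components are included only when a
        # node's role list asks for them, instead of being wired into
        # hadoop_cluster_node itself.
        class bigtop_node {
          $components = split(extlookup('components', ''), ',')

          if 'zookeeper' in $components {
            include zookeeper::server   # standalone; needs no Hadoop at all
          }
          if 'hbase' in $components {
            include hbase::master       # only pulled in where requested
          }
        }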

      Maybe this can be done in phases, but again, a puppet expert will have to weigh in on what's feasible and practical, and maybe on how to phase these changes in an agile way. Any feedback is welcome - I realize this is a significant undertaking... But it's important to democratize the Hadoop stack, and Bigtop is the perfect place to do it!

        Activity

        jay vyas added a comment -

        An update on this front: in a related JIRA we have refactored init-hdfs.sh into an init-hcfs.sh script, which would play a key role in the task of modularizing the puppet recipes to support the broader HCFS Hadoop community.

        Although the BIGTOP-1200 init-hcfs.sh patch is under debate, the purpose of the patch is, I think, undeniable: Bigtop must embrace the broadening ecosystem of Hadoop-compatible file systems, so that we can ultimately share the immensely complex task of maintaining rock-solid file system and deployment semantics in Hadoop.

        Some examples:

        • init-hdfs.sh shouldn't contain generic FS logic; rather, that should live in a file called "init-hcfs.sh" (BIGTOP-1200; see the sketch after this list).
        • Puppet recipes should be configurable regarding which filesystem they target (this JIRA).
        • FS smoke tests should also be coded to an interface, not to HDFS specifically; see BIGTOP-1032.
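        As a concrete illustration of the first point, the directory bootstrap can go through the generic "hadoop fs" client, which honors whatever fs.defaultFS is configured, rather than HDFS-specific tooling. A minimal sketch (the resource name and paths are illustrative, not part of the actual patch):

          # Hypothetical sketch: the same exec works against HDFS, GlusterFS,
          # S3, etc., because "hadoop fs" resolves fs.defaultFS itself.
          exec { 'init-hcfs-user-dir':
            command => '/usr/bin/hadoop fs -mkdir -p /user',
            unless  => '/usr/bin/hadoop fs -test -d /user',
            require => Package['hadoop'],
          }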

          People

          • Assignee: Unassigned
          • Reporter: jay vyas
          • Votes: 0
          • Watchers: 2
