Bigtop / BIGTOP-1414

Add Apache Spark implementation to BigPetStore

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: backlog
    • Fix Version/s: 1.0.0
    • Component/s: blueprints
    • Labels: None

      Description

      Currently we only process data with Hadoop. Now it's time to add Spark to the BigPetStore application. This will demonstrate the difference between a MapReduce-based Hadoop implementation of a big data app and a Spark one.

      We will need to:

      • update the Graphviz arch.dot to diagram Spark as a new path.
      • add a Spark job to the existing code, in a new package. It should use the existing Scala-based generator, but run it inside a Spark job rather than in a Hadoop InputSplit.
      • have the job output to an RDD, which can then be serialized to disk or fed into the next Spark job.

      So, the next Spark job should:

      • group the data and write product summaries to a local file
      • run a product recommender against the input data set.

      We want the jobs to be runnable either as modular steps or as a single job, to leverage the RDD paradigm.
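      As a rough illustration only (not actual BigPetStore code), the generate-then-summarize chaining might look something like this in Scala. Transaction, generateData, and the argument paths are hypothetical stand-ins for the existing Scala generator:

      {code:scala}
// Hypothetical sketch only: Transaction and generateData stand in for the
// existing Scala-based generator, which is not shown here.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions on Spark 1.x

case class Transaction(store: String, product: String, price: Double)

object BPSSparkDriver {
  // Placeholder for the existing generator.
  def generateData(seed: Int): Seq[Transaction] =
    Seq(Transaction("store-" + seed, "dog-food", 10.5))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BigPetStore-Spark"))

    // Step 1: run the generator inside the Spark job instead of a Hadoop InputSplit.
    val transactions = sc.parallelize(1 to 100)
      .flatMap(generateData)
      .cache() // keep the RDD in memory for the next step

    // Either serialize the RDD to disk ...
    transactions.saveAsObjectFile(args(0))

    // ... or feed it straight into the next step: product summaries.
    val summaries = transactions.map(t => (t.product, 1)).reduceByKey(_ + _)
    summaries.saveAsTextFile(args(1))

    sc.stop()
  }
}
      {code}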

      So it will be interesting to see how the code is architected. Let's start the planning in this JIRA. I have some stuff I've informally hacked together; maybe I can attach an initial patch just to start a dialog.

      Attachments:
      • chart.png (54 kB, Jörn Franke)


          Activity

          Jörn Franke added a comment -

          You get the most benefit from Spark if you run an incremental and/or iterative job which relies on cached data.
          Of course, we can do something similar to the Hadoop Map/Reduce job, but if it is only executed once then we do not get much benefit from it; it will probably have similar performance.
          I think trend analysis could be one example, e.g.

          • a batch job which is executed every week (or on some other schedule)
          • the batch job generates trend data from BigPetStore, e.g. dog food with turkey was bought 5 times more in December week 4 than in the previous week(s)
            • this means we need to keep the trend data for each week as an RDD in memory, because we compare the current week against the x previous week(s)

          The difference from the Hadoop Map/Reduce job will be that we leverage cached results of previous jobs.

          This is just a simple example. We need to think about whether it really makes sense or whether we should have a more sophisticated example. I would also like to include shared variables. Finally, I would like to extend it to Spark Streaming, e.g. complex event processing in combination with a Spark batch job.
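          A minimal sketch of the caching idea under these assumptions: per-product purchase counts saved per week, with loadWeek as a hypothetical loader (nothing here is actual BigPetStore code):

          {code:scala}
// Hedged sketch of the week-over-week comparison, not actual BigPetStore code.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD functions on Spark 1.x
import org.apache.spark.rdd.RDD

object TrendAnalysis {
  // Hypothetical loader for one week's per-product purchase counts.
  def loadWeek(sc: SparkContext, week: Int): RDD[(String, Long)] =
    sc.objectFile[(String, Long)](s"counts/week-$week")

  // Ratio of this week's purchases to last week's, per product.
  def trends(sc: SparkContext, current: Int): RDD[(String, Double)] = {
    // Cached so a weekly scheduled job can reuse previous results, which is
    // where this differs from a run-once MapReduce job.
    val previous = loadWeek(sc, current - 1).cache()
    val thisWeek = loadWeek(sc, current)
    thisWeek.join(previous).mapValues { case (now, before) =>
      now.toDouble / before // 5.0 means "bought 5 times more than last week"
    }
  }
}
          {code}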

          What do you think?

          jay vyas added a comment -

          I agree Spark Streaming would be great to demonstrate in some way.

          However, I think the first step, generating data, should be isolated from the other Spark stuff. The reason is that the Spark data generation isn't something people normally do in a Spark app, and we shouldn't couple it to the Spark application blueprint in any way.

          Do you see what I mean?

          Jörn Franke added a comment -

          I agree with you. Two questions:

          "update graphviz arch.dot to diagram spark as a new path."
          where is this package?

          "Adding a spark job to the existing code, in a new package., which uses existing scala based generator, however, we will use it inside a spark job, rather than in a hadoop inputsplit."
          where to put put the code? is it a new subproject, with its own build.gradle? where can I find the other jobs to have some example?

          We can create some subtasks according to the tasks you mentioned in the first post.

          I can contribute to this issue.

          jay vyas added a comment - edited

          Hi again Jörn Franke... The whole of the BigPetStore blueprint application is under bigtop-bigpetstore.

          • The arch.dot file is in bigtop-bigpetstore... once you open it in Graphviz or http://sandbox.kidstrythisathome.com/erdos/index.html, the next steps will most likely be obvious.
          • There is a build.gradle file, also in bigtop-bigpetstore, which already has Scala and Java support. In fact, some of the existing BigPetStore code relies on Scala, and you can easily find the Scala class therein.

          So you will want to modify that build.gradle to include a Spark Maven dependency, and then write Spark classes as you normally would in any other app.
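          For illustration only, such a build.gradle change might look roughly like this; the artifact coordinates and versions are assumptions, not what was actually committed:

          {code:groovy}
// Illustrative build.gradle fragment: versions are assumptions,
// not what was actually committed to bigtop-bigpetstore.
apply plugin: 'scala'

dependencies {
    compile 'org.scala-lang:scala-library:2.10.4'
    compile 'org.apache.spark:spark-core_2.10:1.0.0'
}
          {code}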

          jay vyas added a comment -

          So Jörn Franke, what will be the goal of this patch? Can you attach a diagram of what you have in mind?

          Jörn Franke added a comment - edited

          Hi,

          I attached a chart. Basically, for the first Spark job (to keep it simple) I would use the cleaned CSV as input, and output various groupings based on country and product, plus some simple statistics (count, avg).

          Optionally, it can store them to a file.
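          A minimal sketch of that first job, assuming a hypothetical column layout for the cleaned CSV (this is not the actual implementation):

          {code:scala}
// Sketch of the proposed first job: cleaned CSV in, per-(country, product)
// count and average out. The column positions are assumptions.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions on Spark 1.x

object ProductStatistics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BPS-ProductStats"))

    val rows = sc.textFile(args(0)).map(_.split(","))
    // Assumed layout: country in column 2, product in column 3, price in column 4.
    val keyed = rows.map(r => ((r(2), r(3)), (r(4).toDouble, 1L)))

    val stats = keyed
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, n) => (n, sum / n) } // (count, avg)

    stats.saveAsTextFile(args(1)) // optionally store to a file, as above
    sc.stop()
  }
}
          {code}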

          Of course, we should later define additional jobs demonstrating the features of Spark.

          I am not yet sure about the Spark version: I propose at least 1.0.0, because this one is rather stable and is included, among others, in the Cloudera QuickStart VM 5.1.

          Let me know what you think.

          Best regards,

          jay vyas added a comment - edited

          Looks good as a start.

          Then later, I think we should decouple the Spark code entirely, once we implement the new data model in BIGTOP-1366.

          Let me know if you need any other help getting started; I look forward to testing the patch for this!

          Jörn Franke added a comment -

          OK, you can assign it to me; I somehow cannot assign issues to myself.
          I will get in touch with you once I have the code ready.

          Jörn Franke added a comment -

          Oh, by the way: do we want a Scala or a Java job?

          RJ Nowling added a comment -

          jay vyas, I think BIGTOP-1366, BIGTOP-1535, and BIGTOP-1537 solve this JIRA but are broken down into separate steps. Should we close this JIRA or make the other JIRAs subtasks of this one?

          jay vyas added a comment -

          RJ Nowling, I agree, let's do subtasks.

          Then, when the Spark implementation is equivalent to the existing bigpetstore-mapreduce, we will declare this resolved.

          Sounds good?

          RJ Nowling added a comment -

          Yep, agreed!

          RJ Nowling added a comment -

          I added BIGTOP-1366 as a blocker, and I converted the other JIRAs into subtasks.

          jay vyas added a comment -

          This is now in bigpetstore-spark; see BIGTOP-1535 for details.

          The minute details of this JIRA are no longer relevant.

          jay vyas added a comment -

          Closing; this has been completed.


            People

            • Assignee: jay vyas
            • Reporter: jay vyas
            • Votes: 0
            • Watchers: 3
