Currently we only process data with Hadoop. Now it's time to add Spark to the BigPetStore application. This will demonstrate the difference between a MapReduce-based Hadoop implementation of a big data app and a Spark-based one.
We will need to:
- Update the graphviz arch.dot to diagram Spark as a new path.
- Add a Spark job to the existing code, in a new package. It should reuse the existing Scala-based generator, but run it inside a Spark job rather than in a Hadoop InputSplit.
- Have the job output to an RDD, which can then either be serialized to disk or fed into the next Spark job.
The next Spark job should:
- Group the data and write product summaries to a local file.
- Run a product recommender against the input data set.
We want the jobs to be runnable either as separate modules or chained together as a single job, to leverage the RDD paradigm.
So it will be interesting to see how the code is architected. Let's start the planning in this JIRA. I have some code I've informally hacked together; maybe I can attach an initial patch just to start a dialog.
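To make the modular-or-single-job idea concrete, here is a rough sketch of how the generate and summarize stages could be shaped so that each one takes or returns an RDD. It is written in PySpark only for brevity; the real job would presumably live in Scala or Java next to the existing generator. Every name in it (`generate_transactions`, `to_product_pair`, `merge_summaries`, `generate_stage`, `summarize_stage`, the app name, the sample products) is hypothetical, and the tiny generator is just a stand-in for the existing Scala-based one. The recommender stage is not covered.

```python
def generate_transactions(n):
    """Hypothetical stand-in for the existing Scala-based generator.

    Yields (customer_id, product, price) tuples deterministically.
    """
    products = [("dog-food", 10.0), ("cat-litter", 7.5), ("fish-flakes", 3.0)]
    for i in range(n):
        name, price = products[i % len(products)]
        yield (i % 4, name, price)


def to_product_pair(txn):
    """Key a transaction by product: (product, (count, revenue))."""
    _customer, product, price = txn
    return (product, (1, price))


def merge_summaries(a, b):
    """Merge two (count, revenue) pairs.

    Associative and commutative, so it is safe to pass to reduceByKey.
    """
    return (a[0] + b[0], a[1] + b[1])


def generate_stage(sc, n):
    """Stage 1: produce an RDD of transactions from the generator."""
    return sc.parallelize(list(generate_transactions(n)))


def summarize_stage(transactions):
    """Stage 2: RDD of transactions -> RDD of (product, (count, revenue))."""
    return transactions.map(to_product_pair).reduceByKey(merge_summaries)


def main():
    # pyspark is imported here (an assumption about packaging) so that the
    # pure functions above can be exercised without a Spark installation.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "bigpetstore-sketch")
    try:
        # Single-job mode: stage 1 hands its RDD directly to stage 2.
        transactions = generate_stage(sc, 1000).cache()
        for product, (count, revenue) in summarize_stage(transactions).collect():
            print(product, count, revenue)
    finally:
        sc.stop()
```

Calling `main()` runs both stages as one job. To run them as separate modules instead, stage 1 could persist its output with `saveAsPickleFile` and stage 2 could reload it with `sc.pickleFile`; because `summarize_stage` only depends on receiving an RDD, it would not need to change.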
|Add Spark ETL script to BigPetStore||Resolved|
|[BigPetStore] Add Spark Product Recommender example||Resolved|