jay vyas, what do you mean by "in-place processing of data"? To use any of the data in Spark, we have to read it into memory and structure it into more useful data structures. For example, customer and store details are repeated in every line. A little parsing logic is also needed to parse date/times back into the appropriate objects so we can do things like sort transactions by date and time.
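To make the point concrete, here is a minimal sketch of that kind of normalization step. The raw line format, field names, and timestamp layout below are all assumptions for illustration, not the actual BPS output:

```python
from datetime import datetime

# Hypothetical raw line format (assumed for illustration; the real
# generator output differs):
# customer_id,customer_name,store_id,store_name,product,price,timestamp
RAW_LINES = [
    "1,alice,10,store_a,dog-food,12.50,2015-03-02T14:05:00",
    "1,alice,10,store_a,cat-litter,7.25,2015-03-01T09:30:00",
]

def parse(line):
    """Split a raw line and convert the timestamp string into a
    datetime object so transactions can be sorted chronologically."""
    cust_id, cust_name, store_id, store_name, product, price, ts = line.split(",")
    return {
        "customer": (int(cust_id), cust_name),  # repeated on every raw line
        "store": (int(store_id), store_name),   # repeated on every raw line
        "product": product,
        "price": float(price),
        "time": datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S"),
    }

# Sort transactions by their parsed date/time.
transactions = sorted((parse(l) for l in RAW_LINES), key=lambda t: t["time"])
```

In a real Spark job the same parse function would run inside a map over the raw RDD, but the parsing and date handling are the same.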
I think it's a good thing that the Spark driver produces the "raw" data – it is a realistic problem that data scientists face.
Ok, so then the question is whether we should use MapReduce for ETL or write a Spark version. I see upsides and downsides here.
Pros of a Spark ETL script:
- Good example for users
- Not all Spark users will have access to a MapReduce installation. In fact, many users are either leaving MR for Spark or starting on Spark alone. Forcing users to set up MR in order to use BPS Spark would cause headaches for them.
- Provides a basis for comparisons between the Spark and MapReduce solutions (Pig, etc.)
Cons of a Spark ETL script:
- Greater risk of divergence of BPS MapReduce and BPS Spark.
If the primary concern is divergence, then I wonder if this can be addressed in other ways. For example, could you add a MapReduce driver for the new data generator, output data in the same format as the Spark driver, and modify the Pig script to convert it to the same or a similar normalized representation, so that the components are interchangeable?
If you're not comfortable with that solution, then let's find one that we're both happy with. I want to make sure we're on the same page.