Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: blueprints
    • Labels: None

      Description

      We should add a script that reads the results from the data generator, normalizes the data, and splits it into separate tables (ETL). It would be nice to use Spark SQL, but it is not required.
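
      A minimal sketch of what such a script could look like, assuming Scala and the plain RDD API rather than Spark SQL; the input layout, field positions, and output paths here are hypothetical, not the actual generator format:

          import org.apache.spark.{SparkConf, SparkContext}

          // Minimal ETL sketch: read the generator's denormalized CSV, pull the
          // repeated store and customer details into deduplicated tables, and
          // write each table out separately. Field positions are assumptions.
          object BPSEtlSketch {
            def main(args: Array[String]): Unit = {
              val sc = new SparkContext(new SparkConf().setAppName("BPS ETL Sketch"))
              val fields = sc.textFile(args(0)).map(_.split(","))

              val stores = fields.map(f => (f(0), f(1))).distinct()        // (storeId, zipcode)
              val customers = fields.map(f => (f(2), f(3))).distinct()     // (customerId, name)
              val transactions = fields.map(f => (f(2), f(0), f(4), f(5))) // (customerId, storeId, product, dateTime)

              stores.map { case (id, zip) => s"$id,$zip" }.saveAsTextFile(args(1) + "/stores")
              customers.map { case (id, name) => s"$id,$name" }.saveAsTextFile(args(1) + "/customers")
              transactions.map { case (c, s, p, d) => s"$c,$s,$p,$d" }.saveAsTextFile(args(1) + "/transactions")
              sc.stop()
            }
          }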

          Activity

          jayunit100 jay vyas added a comment -

          Let's also add the arch.dot for the Spark pipeline in this patch.

          I'm actually wondering whether you really need Spark ETL? I think MapReduce is great for ETL; the Spark components really shine at demonstrating in-place processing of data, and should focus more on that.

          But I'm open to a pure ETL step if you (or others) think that's a good path forward.

          rnowling RJ Nowling added a comment -

          jay vyas, what do you mean by "in-place processing of data?" To use any of the data in Spark, we have to read it into memory and structure it into more useful data structures. For example, customer and store details are repeated in every line. There is also a bit of parsing logic needed to parse date/times back into the appropriate objects to do things like sort transactions by date and time.
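
          For instance, a hedged sketch of that parsing step, assuming a java.time-friendly format (the pattern string below is an assumption, not the generator's real format):

              import java.time.{LocalDateTime, ZoneOffset}
              import java.time.format.DateTimeFormatter

              // Hypothetical: turn the generator's raw date/time string back into a
              // sortable value; the pattern is a guess at the actual format.
              val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

              def toEpochSeconds(raw: String): Long =
                LocalDateTime.parse(raw.trim, fmt).toEpochSecond(ZoneOffset.UTC)

              // e.g. transactions.sortBy(t => toEpochSeconds(t._4)) to order by time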

          I think it's a good thing that the Spark driver produces the "raw" data – it is a realistic problem that data scientists face.

          OK, so then there's the question of whether we should use MapReduce for ETL or write a Spark version. I see upsides and downsides here.

          Pros of a Spark ETL script:

          • Good example for users
          • Not all Spark users will have access to a MapReduce installation. In fact, many users are either leaving MR for Spark or starting on Spark alone. I think forcing users to set up MR to use BPS Spark would cause headaches.
          • Provide comparisons between Spark and MapReduce solutions (Pig, etc.)

          Cons of a Spark ETL script:

          • Greater risk of divergence of BPS MapReduce and BPS Spark.

          If the primary concern is divergence, then I wonder if this can be addressed in other ways? For example, can you add a MapReduce driver for the new data generator, output data in the same format as the Spark driver, and modify the Pig script to convert it to the same or a similar normalized representation, so that the components are interchangeable?

          If you're not comfortable with that solution, then let's find one that we're both happy with. I want to make sure we get on the same page.

          jayunit100 jay vyas added a comment -

          Yup, agreed; by in place I meant in memory. I'm okay w/ the Spark ETL script. You're right - it will also provide a good comparison between MapReduce and Spark.

          rnowling RJ Nowling added a comment -

          Ok, I'll work on an ETL script.

          rnowling RJ Nowling added a comment -

          This patch:

          • Adds case classes for a normalized, structured data model (a rough sketch follows this list)
          • Adds I/O utility methods for reading and writing the structured data model
          • Adds a Spark ETL component which parses the dirty CSV from the generator, normalizes the data, and writes it out in the form of the structured data model
          • Adds tests for all of the above
          • Updates the README to discuss the data model and new component
          • Adds a GraphViz workflow diagram for current and future components
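
          To give a feel for the shape of that data model, a rough sketch (the names and fields here are illustrative guesses, not the actual classes in the patch):

              // Illustrative case classes for a normalized data model; field names
              // and types are assumptions, not the patch's actual definitions.
              case class Store(storeId: Int, zipcode: String)
              case class Customer(customerId: Int, firstName: String, lastName: String, zipcode: String)
              case class Product(productId: Int, description: String)
              case class Transaction(customerId: Int, storeId: Int, productId: Int, dateTime: String)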

          jay vyas I decided to create a separate arch diagram for now. I suggest we create a separate JIRA to merge them since discussion may be in order. Also, I didn't fix trailing whitespace – can you handle that on commit?

          jayunit100 jay vyas added a comment - edited

          Yup! I can fix that in the commit.

          Also, I'd suggest (just a suggestion) that in general it's better to consolidate some of these commits where possible (i.e. maybe one BPS patch per day) - since from Bigtop's perspective, petstore updates don't need to be super granular.

          This is great work; I will also review it tonight.

          jayunit100 jay vyas added a comment -

          Adding as related to 1414 so we can track all these BPS improvements.

          jayunit100 jay vyas added a comment - edited

          Hi RJ Nowling,

          • The commit message is supposed to be a one-liner: "BIGTOP-XYZA. Add Spark ETL script to BigPetStore", so that has to be updated. I messed up last time, didn't catch it, and let it slide.
          • Also, can you remind me of the other two JIRAs you need reviewed, and update their commit messages if necessary?

          Otherwise, the code looks solid. This is a clean and maintainable example of a production-ready Spark app. I will commit once the commit message is updated.

          jayunit100 jay vyas added a comment -

          Committed, thanks RJ.


    People

    • Assignee: rnowling RJ Nowling
    • Reporter: rnowling RJ Nowling
    • Votes: 0
    • Watchers: 2
