Bigtop / BIGTOP-1366

Updated, Richer Model for Generating Data for BigPetStore

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: backlog
    • Fix Version/s: 1.0.0
    • Component/s: blueprints
    • Labels:

      Description

BigPetStore uses synthetic data as the basis for its workflow. BPS's current model for generating customer data is sufficient for basic testing of the Hadoop ecosystem, *but the model is very basic and lacks the complexity needed to embed interesting patterns in the data*.

As a result, *more complex, scalable testing, such as exercising Mahout's clustering algorithms on non-trivial, multidimensional data with multiple influencing factors,* is not currently possible.

      Efforts are currently underway to incrementally improve the current model (see BIGTOP-1271 and BIGTOP-1272).

Creating a model that can incorporate *realistic, non-hierarchical patterns* and input data to generate rich customer/transaction data with interesting correlations will require a re-imagining of the current model and its framework.

To support the improvements to the model in BigPetStore, I have been working on an *alternative ab initio model, developed from scratch*. Since developing a new model involves substantial R&D work with more specialized tools (mathematical and plotting libraries), I'm doing the current work outside of BPS in the iPython Notebook environment. Given the long time frame, the model will be developed on a separate timeline to avoid slowing the development of BPS.

Once the model has stabilized, I will begin incorporating it into BPS itself. One option is to implement the model in Scala for clean integration with *Spark*, which is likely to play an increasingly important role in the Hadoop ecosystem and will thus be an important part of BigPetStore as a test/blueprint app.

        Issue Links

          Activity

          rnowling RJ Nowling added a comment -

          Current work on the alternative model can be tracked here: https://github.com/rnowling/bigpetstore-data-generator

          To view the iPython Notebook directly, look here: http://nbviewer.ipython.org/github/rnowling/bigpetstore-data-generator/blob/master/notebooks/MonteCarloExample.ipynb
          jayunit100 jay vyas added a comment - - edited

          Thanks RJ. TL;DR:

          • RJ is working on making the dataset generation much more sophisticated, and plans to port it to Scala some day. This is mostly theoretical work at the moment.
          • A requirement is that this new model can be used in any paradigm, so we will want to decouple the model implementation, if possible, from Spark.
          • This new model will (or at least, can) take into account everything: product inventories, customer preferences, possibly even state temperatures, etc., when generating transactions. Thus it can be used to benchmark machine learning tools in very sparse environments.

          Thanks again for doing this. In the interim, it would be great if you could chime in on the primitive models we are currently using. Although they aren't as advanced as this, with your feedback we can at least keep placeholders in the code wherever possible to pave the way for things to come.

          jayunit100 jay vyas added a comment -

          Hi RJ Nowling: You're now added to the Bigtop contributors. Welcome, and thanks again. I hope to see a patch soon. This is an exciting framework, and I think it can be used by the emerging graph and ML communities as well; it will be a big benefit for adding better patterns to the data we use in the BigPetStore tests.

          jornfranke Jörn Franke added a comment -

          I think over the long run - maybe not for this patch - the following functionality could be interesting:

          • move fixed values out of the code into CSV files (e.g. a list of US states, fake street names, etc.). The benefit is that any user can customize this and add their own data (e.g. a list of states in Germany).
          • generate data according to some probability distribution (normal, exponential, binomial, uniform, lognormal, Bernoulli, geometric, copula, etc.). The benefit is that we can simulate arbitrary sensor data, e.g. machine data for a machine that fails.
          • generate data according to some queuing model (e.g. a Markov chain, or stochastic processes in general). The benefit is that we can, for instance, simulate a surge of transactions around noon for a restaurant. Queues have lots of use cases (e.g. airlines, stores, supply chain management, network transmission).
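          As a rough illustration of the queuing idea, here is a minimal Python sketch (the rate function and all the numbers are made up purely for illustration) that simulates a lunch-hour surge of restaurant arrivals as a nonhomogeneous Poisson process, using only the standard library:

```python
import random

def arrival_times(rate_fn, t_end, seed=42):
    """Simulate a nonhomogeneous Poisson process by thinning.

    rate_fn(t) gives the expected arrivals per hour at time t (hours);
    returns the accepted arrival times in [0, t_end].
    """
    rng = random.Random(seed)
    # Upper bound on the rate, estimated on a coarse grid.
    rate_max = max(rate_fn(t / 10.0) for t in range(int(t_end * 10) + 1))
    t, accepted = 0.0, []
    while True:
        t += rng.expovariate(rate_max)          # candidate inter-arrival gap
        if t > t_end:
            return accepted
        if rng.random() < rate_fn(t) / rate_max:  # thinning/acceptance step
            accepted.append(t)

# Hypothetical restaurant: baseline 5 customers/hour, peaking near 50 at noon.
def lunch_rate(t_hours):
    return 5.0 + 45.0 * max(0.0, 1.0 - abs(t_hours - 12.0))

arrivals = arrival_times(lunch_rate, t_end=24.0)
```

Most of the generated timestamps cluster around hour 12, which is the kind of pattern a queuing-aware generator could embed in transaction data.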
          jayunit100 jay vyas added a comment - - edited

          Hi Jörn. This has been done by RJ, but not implemented in MapReduce or Spark yet.

          Should we consider implementing this in Spark as a prerequisite to BIGTOP-1414, and let the Spark code generate, and process, its own data?

          The implementation already exists in Python. We have tested it quite thoroughly, and it does something almost identical to what you have mentioned here.

          You can check out the diagram of the data model here: https://github.com/rnowling/bigpetstore-data-generator/raw/master/bdcloud_paper/latex/paper.pdf .

          And the Python code is also in that repository (the code can be easily translated to Scala; I can help or at least get this started if need be).

          jornfranke Jörn Franke added a comment -

          Hi,

          Oh, ok, I was not aware of the paper. It is indeed related to the functionality that I proposed. A very nice paper.

          With respect to your question: at the moment I plan to use the cleaned CSV data as an output of the Pig process (see diagram).

          It would be fine if there were a map/reduce job that implements your generator and stores the data in HDFS. Then we can use this data in Spark and do whatever we want with it.

          jayunit100 jay vyas added a comment -

          Sounds good...

          • proceed with your work on BIGTOP-1414
          • we will morph it to read from the new data model once it is implemented.

          I'll look into this for now!

          jayunit100 jay vyas added a comment -

          I'll be pushing code to https://github.com/jayunit100/bpsgenerator for this in the interim while I hash out some ideas.
          I hope to put in a patch this week.

          jayunit100 jay vyas added a comment -

          Update: RJ has ported his initial Python-based generator to Java, which will serve as the seed for this: https://github.com/rnowling/bigpetstore-data-generator/ .

          rnowling RJ Nowling added a comment - - edited

          Right now, the Java port is about half done. It only supports generating stores and customers; we need to add support for generating the transactions. (The Python version supports generating transactions.)

          After many discussions, jay vyas and I have realized that it may be preferable to have two implementations of the data generator: a Python sandbox for me to prototype ideas, and a JVM-based, stable implementation for external users. As new ideas prove successful in the Python sandbox, they will be migrated to the JVM port. I'll help maintain the JVM port.

          I've been refactoring and adding unit tests and documentation to the Python implementation for a v0.2 release. Once complete, v0.2 will be the basis for finishing the JVM port.

          The JVM port is currently using Java. I am, however, also open to using Clojure. Scala is another option but less preferable for me.

          jay vyas, what is the current status of Clojure support / interest in BigTop? Will BigTop accept Clojure code?

          jayunit100 jay vyas added a comment -

          Java would be fine and simple to maintain; Clojure is a whole other conversation. Looking forward to a patch for this!

          jayunit100 jay vyas added a comment -

          We will want to do the BPS Cleanup before we update the data generator, or as part of it.

          jayunit100 jay vyas added a comment -

          Update on this: the model itself is being published in the IEEE BigData and Cloud proceedings.

          RJ Nowling — can you provide a technical explanation of your java implementation and details on progress once you get a chance?

          rnowling RJ Nowling added a comment -

          Here's a link to the conference:
          http://www.swinflow.org/confs/bdcloud2014/

          You can review the Java code in the javaport branch on GitHub:
          https://github.com/rnowling/bigpetstore-data-generator/tree/javaport

          The Java port currently has:

          • a build system with Gradle
          • ~75 classes, including unit tests. Every functional class has a corresponding unit test.
          • ~4k lines of Java code

          For the release, I need to:

          • Implement about 4-5 more classes and their corresponding unit tests
          • Implement the local command-line driver
          • Move the simulation parameters from a class containing constants into an external configuration file with a Configuration class
          • Run some analytics comparing the Java implementation to the Python implementation for correctness
          • Write a Hadoop MapReduce or Spark driver to test out the public API and make any necessary changes
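          For the configuration step, something like this minimal Python sketch could work (the parameter names here are hypothetical placeholders, not the generator's actual constants):

```python
import json

# Hypothetical simulation parameters; the real generator's names may differ.
DEFAULTS = {
    "average_customer_count": 1000,
    "store_count": 10,
    "simulation_days": 365.0,
}

class Configuration:
    """Load simulation parameters from a JSON file, falling back to defaults."""
    def __init__(self, path=None):
        self.params = dict(DEFAULTS)
        if path is not None:
            with open(path) as f:
                self.params.update(json.load(f))

    def __getattr__(self, name):
        # Expose parameters as attributes, e.g. cfg.store_count.
        try:
            return self.params[name]
        except KeyError:
            raise AttributeError(name)

cfg = Configuration()      # defaults only
print(cfg.store_count)     # -> 10
```

The same shape works in Java with a properties or JSON file backing a `Configuration` class, which is what the list item above describes.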

          I expect the Java implementation to be available in anywhere from a few weeks to a couple of months, depending largely on my travel schedule and the time spent finishing my Ph.D.

          The design centers around four types of data: Stores, Customers, PurchasingProfiles, and Transactions. They are generated in a pipeline of Stores -> Customers -> PurchasingProfiles -> Transactions. For each type of data, there is a simple data class and a corresponding generator that provides an API to the underlying logic. The transactions and purchasing profiles are the most complex and computationally intensive components, so their generators are designed to be instantiated multiple times for parallelization.
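          As a very rough Python sketch of that pipeline shape (the class names follow the description above, but the fields and APIs are illustrative, not the actual implementation):

```python
import random

class Store:
    def __init__(self, store_id, zipcode):
        self.id, self.zipcode = store_id, zipcode

class Customer:
    def __init__(self, customer_id, home_store):
        self.id, self.home_store = customer_id, home_store

class StoreGenerator:
    """Simple data class + generator pairing: the generator owns the logic."""
    def __init__(self, zipcodes, seed=0):
        self.zipcodes, self.rng = zipcodes, random.Random(seed)
    def generate(self, n):
        return [Store(i, self.rng.choice(self.zipcodes)) for i in range(n)]

class CustomerGenerator:
    """Each customer is attached to a store drawn from the store population."""
    def __init__(self, stores, seed=1):
        self.stores, self.rng = stores, random.Random(seed)
    def generate(self, n):
        return [Customer(i, self.rng.choice(self.stores)) for i in range(n)]

# Pipeline: Stores -> Customers (-> PurchasingProfiles -> Transactions).
stores = StoreGenerator(["55101", "02134", "94110"]).generate(5)
customers = CustomerGenerator(stores).generate(20)
```

Each stage consumes the output of the previous one, which is what makes the later, expensive stages natural candidates for parallel instantiation.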

          I do not specify an on-disk file format; the driver (local CLI, Hadoop, Spark, etc.) will be responsible for writing out the data in a format of its choice.

          I have a list of several improvements to the math model planned for the next 6 months or so, and I expect the model to stabilize once those are done. In the meantime, nothing will be removed from the data model, but some optional data may be added.

          jayunit100 jay vyas added a comment -

          Okay, sounds great. Looking forward to the minimum viable implementation!

          rvs Roman Shaposhnik added a comment -

          This does sound super interesting!

          rnowling RJ Nowling added a comment -

          Hi all,

          Just an update. I have an initial Spark driver for the data generator:

          https://github.com/rnowling/bigpetstore-data-generator/blob/javaport/src/java/bps-data-generator/spark_driver/src/main/scala/com/github/rnowling/bps/datagenerator/spark/Driver.scala

          I'm using the Spark driver to test out the API. My goal is to have a handful of high-level generator classes that need to be called in each parallel step. These will be supported by high-level data readers and data models. This way, the data generator can easily be used in MapReduce, Spark, or CLI drivers without knowing the details of the methods. It seems I'm almost there; I just need a few more cosmetic changes.
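          The per-parallel-step instantiation described above can be mimicked outside of Spark. In this hypothetical Python stand-in (the generator class is invented for illustration), the point is deriving an independent RNG seed per partition so each parallel step builds its own generator and results stay reproducible:

```python
import random

class TransactionGenerator:
    """Stand-in for the expensive, stateful generator: one instance per partition."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
    def generate(self, customer_id):
        # A fake transaction: (customer, dollar amount).
        return (customer_id, round(self.rng.uniform(5.0, 50.0), 2))

def generate_partition(partition_id, customer_ids, base_seed=1234):
    # Derive an independent seed per partition so the output is
    # deterministic regardless of how many workers run the job.
    gen = TransactionGenerator(base_seed + partition_id)
    return [gen.generate(c) for c in customer_ids]

# Simulate two parallel partitions; in Spark this per-partition setup would
# live in something like mapPartitionsWithIndex.
partitions = [list(range(0, 10)), list(range(10, 20))]
results = [generate_partition(i, p) for i, p in enumerate(partitions)]
```

Whether the real API splits seeds this way is a guess; the sketch only shows why generators designed for multiple instantiation parallelize cleanly.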

          Note that I'm using the javaport branch for my current work; eventually I'll merge this into the master branch and mark it as a v0.2 release. I should be able to release it and make it available in Bigtop once I clean up the Spark driver and API a bit.

          jayunit100 jay vyas added a comment -

          Great, looking forward to testing the patch!

          How should we "upgrade" the existing code to your current data model so that it's compliant with your library?

          rnowling RJ Nowling added a comment -

          jay vyas I can modify the Spark driver to output data in a format as close as possible to the current data format. Let's see what's different and decide from there. The changes will most likely be minor.

          I was able to clean up the public API bits. I think we just need to expand the documentation and then we should be good to go for release.

          jayunit100 jay vyas added a comment -

          This just popped up; it might be a way we could collaborate on these generators.

          jayunit100 jay vyas added a comment -

          Reviewing https://github.com/rnowling/bigpetstore-data-generator/blob/javaport/src/java/bps-data-generator/spark_driver/src/main/scala/com/github/rnowling/bps/datagenerator/spark/Driver.scala
          now. Setting up a Spark dev env for this on a new machine; will let you know. Thanks for all your hard work on this, RJ Nowling!

          jayunit100 jay vyas added a comment - - edited

          Hi RJ. I had some trouble building the Spark driver last night; both sbt compile and sbt console show a bunch of class-not-found errors. I'll poke around and see if I can get it working today.

          UPDATE: got this sorted; moving on with testing this and looking at the API.

          jayunit100 jay vyas added a comment -

          One other note while I'm hacking around on this, for anyone interested. Unless folks in Bigtop actually want to curate the BigPetStore data generator code, the plan here is to:

          • remove the data generator code from Bigtop, so Bigtop can focus on running the pipeline, and keep the data generator source in a separate GitHub repo.
          • We can always publish jars of the generator in Bigtop so that we have a stable dependency that is curated in the ASF.
          jayunit100 jay vyas added a comment - - edited

          Hi guys.
          TL;DR: close! But minor changes are needed. I reviewed it on Spark 0.9 and created a driver bash script for submitting Spark jobs to 0.9, since spark-submit isn't available there. We might need a couple of minor mods to the Spark driver to match the 0.9 SparkContext API. I did this testing in pure Bigtop 0.9 VMs. Details below.

          Okay, I've cobbled together a "spark-submit"-type script based on some templates I found online for Bigtop. This will be the way we submit jobs for Spark 0.9.x. When we upgrade to Spark 1.x we can use RJ's exact README directions above.

          source /etc/spark/conf/spark-env.sh
          
          export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/
          
          # system jars:
          CLASSPATH=$CLASSPATH:$SPARK_HOME/assembly/lib/*
          
          # app jar:
          CLASSPATH=$CLASSPATH:/usr/lib/spark/examples/lib/spark-examples_2.10-0.9.1.jar:/bigtop-home/*jar:/usr/lib/spark/*:/usr/lib/spark/lib/*:/usr/lib/spark/assembly/lib/*
          
          CONFIG_OPTS="-Dspark.master=local -Dspark.jars=target/sparkwordcount-0.0.1-SNAPSHOT.jar"
          
          $JAVA_HOME/bin/java -cp $CLASSPATH $CONFIG_OPTS org.apache.spark.examples.SparkPi local 2 2
          

          result:

          [vagrant@bigtop1 ~]$ ./submit.sh                                                                                                             
          Reading zipcode data
          Read 30891 zipcode entries
          Reading name data
          Read 86987 first names and 47819 last names
          Reading product data
          Read 4 product categories
          Generating stores...
          Done.
          Generating customers...
          Done.
          Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext.<init>(Lorg/apache/spark/SparkConf;)V
                  at com.github.rnowling.bps.datagenerator.spark.SparkDriver$.main(Driver.scala:45)
                  at com.github.rnowling.bps.datagenerator.spark.SparkDriver.main(Driver.scala)
          

          So we will possibly need to refactor the way SparkContext is instantiated for the 0.9 API. Otherwise it looks to work quite well: the Spark driver launches and gives great error messages for a missing resources/ dir and so on, which I really like. I did this in Bigtop VMs and just copied resources/* into bigtop-home.

          jayunit100 jay vyas added a comment -

          RJ Nowling do you want to push updates to your GitHub repo that make your code work with Spark 0.9, and then I'll build and retest in the real cluster? Or would you rather keep your repo on Spark 1.x and wait for Bigtop to catch up?

          rnowling RJ Nowling added a comment -

          jay vyas Spark 0.9 is two releases behind, and Spark is about to release 1.2.0. Maybe this is good motivation for talking to the Bigtop Spark maintainer and figuring out if we should bump the version?

          jayunit100 jay vyas added a comment -

          Sure... In the meantime, what do you suggest? I'm okay with adding this patch, with the understanding that it will work once we update Bigtop to 1.x.

          Is that what you are suggesting?

          Anyone else have opinions?

          rnowling RJ Nowling added a comment - - edited

          jay vyas Yes, let's get the current driver into BigTop. I'll work on finishing the 0.2 release of the data generator and publishing a JAR we can pull via Maven.

          We can then work on updating the BPS MapReduce driver and data model.

          PS – Here's an updated URL for the Spark Driver:

          https://github.com/rnowling/bigpetstore-data-generator/tree/master/spark_driver

          jayunit100 jay vyas added a comment -

Okay, that's fine, I guess. RJ Nowling, ping me when it's stable for the next review, and I'll give this another shot.

          rnowling RJ Nowling added a comment -

          jay vyas I think I misspoke. On the phone, we said you were going to work on the MapReduce driver, and I'll continue working on the Spark driver and BinTray JAR release. And we'll converge on a data model. Is that right?

          jayunit100 jay vyas added a comment -

Yup, sounds good. Let me look into the MR implementation.

          rnowling RJ Nowling added a comment -

          jay vyas I updated the Spark driver to print out city & state for stores and customers and real dates/times. I'm missing the product Ids and product prices (although price can be inferred from the product desc as size * per_unit_cost). I'm willing to call the spark driver done for now.

          Next step is publishing a jar to bintray.

          rnowling RJ Nowling added a comment -

          I published a JAR on GitHub ( https://github.com/rnowling/bigpetstore-data-generator/releases/tag/v0.2 ) and BinTray ( https://bintray.com/rnowling/bigpetstore/bigpetstore-data-generator/0.2/view/general ). I modified the pom.xml file and created a bigpetstore-data-generator-0.2.pom.

          I updated the Spark driver sbt build script to use the BinTray repo ( https://github.com/rnowling/bigpetstore-data-generator/blob/master/spark_driver/build.sbt ). However, it will need further testing with Maven, gradle, etc. to see if I did it correctly.
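For reference, pointing sbt at a BinTray-hosted Maven repo generally takes a resolver entry along these lines. This is only a sketch: the repository URL and artifact coordinates below are assumptions for illustration, not the exact contents of the linked build.sbt.

```scala
// Sketch of a build.sbt fragment resolving the data generator from BinTray.
// Repo URL and group/artifact coordinates are assumptions, not verified values.
resolvers += "rnowling-bintray" at "https://dl.bintray.com/rnowling/bigpetstore"

libraryDependencies += "com.github.rnowling.bigpetstore" % "bigpetstore-data-generator" % "0.2"
```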

          rnowling RJ Nowling added a comment - - edited

          Add Spark driver for generating data using new data generator. Update build.gradle to add Spark and data generator dependencies. Update README.

          jay vyas Can you take a look? At this point, we need to converge the data models, update the Pig scripts, and add a MapReduce driver.

          Note that we may want to clean up the dependencies in gradle so that we don't include the spark assembly jar in the shadowJar (that is provided by Spark).
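One common way to keep a cluster-provided dependency like the Spark assembly out of the shadowJar is a compile-only scope: the jar is on the compile classpath but excluded from the shaded artifact. A hedged build.gradle sketch (configuration names and the Spark version here are illustrative; Gradle versions of this era typically used a custom "provided" configuration):

```groovy
// Illustrative build.gradle fragment: keep spark-core out of the shaded jar.
configurations {
    provided  // custom scope for dependencies the Spark runtime provides
}
sourceSets.main.compileClasspath += configurations.provided

dependencies {
    provided 'org.apache.spark:spark-core_2.10:1.1.0'  // version is an assumption
}

shadowJar {
    // shade only runtime dependencies; 'provided' ones never enter the fat jar
    configurations = [project.configurations.runtime]
}
```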

          rnowling RJ Nowling added a comment -

Updated patch with a reorganization of BPS into MapReduce and Spark versions. Added Apache headers to SparkDriver.

          rnowling RJ Nowling added a comment -

          jay vyas and I have discussed how best to organize the new Spark code with the existing MapReduce code. We've decided to organize the code into separate applications to support deployment to pure Spark or MapReduce environments. By separating the applications, we prevent programmers from adding dependencies between the two applications that would prevent pure deployments.

          jayunit100 jay vyas added a comment -

Looks like the patch accurately separates the Spark code from the MapReduce code.

I will test this when I get home. Is this the final patch for review?

          rnowling RJ Nowling added a comment -

          Yes, final patch for review.

          jayunit100 jay vyas added a comment - - edited

RJ Nowling okay, thanks for this.

• It looks like you need to update build.gradle in the restructuring to reference "../../pom.xml" instead of "../pom.xml". BigPetStore builds pull some data in from the top-level Bigtop pom by default. Easy fix.
• Also, you have a lot of trailing whitespace. I can fix this on commit via --fix-whitespace, so it's not a huge problem.
• Can you add a gradle test that launches a local Spark job? Right now there are none. I'll try to paste a snippet of how to do this tonight if I can.
• Otherwise, it looks like the existing MapReduce code still works, and the Spark code looks good as well!

          FYI

          org.apache.bigtop.bigpetstore.docs.TestDocs > testGraphViz PASSED
          
          org.apache.bigtop.bigpetstore.generator.TestNumericalIdUtils > testName PASSED
          
          org.apache.bigtop.bigpetstore.generator.TestPetStoreTransactionGeneratorJob > test PASSED
          
          BUILD SUCCESSFUL
          
          Total time: 2 mins 3.84 secs
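As a starting point for the local Spark test mentioned above, a minimal local-mode smoke test might look roughly like the following. This is a hypothetical sketch: the object name and the assertion's data are invented for illustration, and it assumes a Spark 1.x dependency on the test classpath.

```scala
// Hypothetical sketch: run a tiny Spark job against local[2] so `gradle test`
// can exercise the driver without a cluster. Names and data are illustrative.
import org.apache.spark.{SparkConf, SparkContext}

object LocalSparkSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("bps-smoke-test")
    val sc = new SparkContext(conf)
    try {
      // Trivial sanity check over some synthetic transaction strings.
      val txns = sc.parallelize(Seq("dog food", "cat litter", "dog treats"))
      val dogCount = txns.filter(_.startsWith("dog")).count()
      assert(dogCount == 2, s"expected 2 dog transactions, got $dogCount")
    } finally {
      sc.stop()
    }
  }
}
```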
          
          
          rnowling RJ Nowling added a comment -

jay vyas I updated the patch to:

• Fix the path to the pom.xml in bigpetstore-mapreduce/build.gradle
• Refactor the Spark driver into more easily tested functions
• Add a unit test for the Spark driver, and update the associated build.gradle to change the Scala test dependency from Scala 2.11 to Scala 2.10 to match Spark
• Update the BPS Spark README to document the tests

I didn't fix the whitespace since you said you can handle that on commit. Thanks!

          jayunit100 jay vyas added a comment -

Great! Testing it...

          jayunit100 jay vyas added a comment - - edited

Great work, RJ! I ran gradle test and indeed:

• it builds and runs the data gen unit tests
• the original MapReduce code works as well, so the pom.xml issue is fixed

Next step: I'll wait for others to chime in. As far as I can tell, this is +1, and we now have a powerful, Spark-based data generator for BPS.

• There is one last step: we need to (1) move arch.dot one level up and (2) update the arch.dot file with a description of the new architecture. You can easily do that with graphviz (paste the contents of arch.dot into Erdos and edit). Please create a JIRA for that and assign it to yourself.

I'll commit this tomorrow unless others have any issues.

          jayunit100 jay vyas added a comment - - edited

Committed! Thanks, RJ! Have fun in Australia presenting this work.
And when you come back,

let's add kangaroos to the pet store items!!!!

For those curious about this commit (it's pretty big), it:

• refactors the MapReduce and the new Spark implementations into separate directories
• adds Spark support for BigPetStore data generation
• for a demo of how to use this to test your Spark clusters, see the README in bigtop-bigpetstore/bigpetstore-spark/.
          rnowling RJ Nowling added a comment -

          That's great news!

People

• Assignee: RJ Nowling
• Reporter: RJ Nowling
• Votes: 0
• Watchers: 7

Time Tracking

• Original Estimate: 8,736h
• Remaining Estimate: 8,736h
• Time Spent: Not Specified