Bigtop / BIGTOP-1272

BigPetStore: Productionize the Mahout recommender

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: backlog
    • Fix Version/s: 0.8.0
    • Component/s: blueprints
    • Labels:
      None

      Description

      BIGTOP-1271 adds patterns to the data that guarantee that a meaningful product recommendation can be given for at least some customers: we know that there will be many customers who bought only 1 product, and also customers who bought 2 or more products (even in a dataset of size 10), due to the Gaussian distribution of purchases that is also built into the dataset generator.

      The current Mahout recommender code is statically valid: it runs to completion in local unit tests if a Hadoop 1.x tarball is present, but it hasn't been tested at scale. So, let's get it working. This JIRA will also cover:

      • deciding whether to use Mahout built for Hadoop 2.x for unit tests (the default on the Mahout Maven repo is the 1.x build) and whether or not Bigtop should host a Mahout 2.x jar. After all, Bigtop builds a Mahout 2.x jar as part of its packaging process, and BigPetStore might thus need a Mahout 2.x jar in order to test against the right set of Bigtop releases.
      Attachments

      1. build.gradle
        9 kB
        bhashit parikh
      2. BIGTOP-1272.patch
        124 kB
        bhashit parikh
      3. BIGTOP-1272.patch
        124 kB
        bhashit parikh
      4. BIGTOP-1272.patch
        135 kB
        bhashit parikh
      5. BIGTOP-1272.patch
        148 kB
        bhashit parikh
      6. BIGTOP-1272.patch
        137 kB
        bhashit parikh
      7. BIGTOP-1272.patch
        140 kB
        bhashit parikh
      8. arch.jpeg
        161 kB
        jay vyas

        Issue Links

          Activity

          jay vyas created issue -
          jay vyas made changes -
          Field Original Value New Value
          Affects Version/s backlog [ 12324373 ]
          jay vyas made changes -
          Link This issue is blocked by BIGTOP-1269 [ BIGTOP-1269 ]
          jay vyas added a comment -

          Gating this on BIGTOP-1269, which cleans up the build. Afterwards we will have to surgically put some careful "hacks" into BigPetStore to support Mahout on Hadoop 2.x.

          jay vyas made changes -
          Link This issue relates to BIGTOP-1270 [ BIGTOP-1270 ]
          jay vyas added a comment -

          bhashit parikh, are you interested in picking this up? Here's an outline of what I think we should do. Others are welcome to chime in, of course.

          1) Add an external Mahout 2.x repo so building isn't required. For example, the open HDP Maven repos serve artifacts compiled for Hadoop 2.x.

          2) Add back the Mahout recommender and write an integration test (like the one we have for Pig). To do this, we will need to:

          • create a mock input file of integer,integer,1|0 tuples. The mock file should have similar "users" (column 1), for example:
            1,100,1
            2,100,1
            2,200,1
            

            In the above, user "2" is similar to user "1" (they both like product "100"), so we would like to see a recommendation for "1" to buy "200" in the output.

          • write the Java code to call the Mahout parallel ALS job directly via the API, taking the above mock input file as input (a minimal sketch follows this list).
          • tune the parameters so that, with a small number of records, some recommendations are still made (i.e. so that integration tests run fast, and locally).
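
          A minimal, hypothetical sketch of that step (the job classes are Mahout's mrlegacy ALS jobs, but the paths, parameter values, and the RecommenderJob flags are illustrative and follow the Mahout 0.9 ALS example; adjust to the Mahout version actually used):

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.util.ToolRunner;
            import org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob;
            import org.apache.mahout.cf.taste.hadoop.als.RecommenderJob;

            public class MockRecommenderRun {
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();

                // Factorize the mock (user,item,rating) CSV into user (U/) and item (M/) feature matrices.
                ToolRunner.run(conf, new ParallelALSFactorizationJob(), new String[] {
                    "--input", "mock-input/ratings.csv",
                    "--output", "als/out",
                    "--lambda", "0.1", "--numFeatures", "2", "--numIterations", "5",
                    "--tempDir", "als/tmp"
                });

                // Produce top-N recommendations from the factorization; the U/, M/ and
                // userRatings/ layout follows the Mahout ALS example output.
                ToolRunner.run(conf, new RecommenderJob(), new String[] {
                    "--input", "als/out/userRatings",
                    "--userFeatures", "als/out/U",
                    "--itemFeatures", "als/out/M",
                    "--numRecommendations", "1",
                    "--maxRating", "1",
                    "--output", "recommendations"
                });
              }
            }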

          After that point, the "prototyping" will be done - and we can move forward with

          3) Embedding user types in the data (i.e. BIGTOP-1271). That will mean that the data produced by the data set generator has meaningfull user trends which can be used as input to the recommender. I like the idea of using scala to redo it, as you showed me offline in http://pastebin.com/wHXCEuk4

          4) Create a new pig script "BPS_transactions.pig" (like BPS_Analytics.pig) to output a 3 column hashcode file which we will use for the "real" input to mahout in the actual integration tests / cluster . Maybe w/ a python udf for the hashing of products and users. I will provide that as a patch in this JIRA and we can add it in the overall JIRA when you finish 1-3. This will allow us to keep bigpetstore moving forward inspite of (see BIGTOP-1270 / HIVE-7115 on why hive is difficult to run in bigpetstore at the moment ).

          5) Match up this with BIGTOP-1327 (updated arch.dot diagram) to ensure that the architecture is matched correctly.

          6) update the arch.dot command with the exact commands (i.e. "hadoop jar bps.jar BPSRecommender -in ... -out...")

          At that point, we will do some testing of the bigpetstore jar file, based on 6, in the cluster, and then commit the next iteration of bigpetstore !

          jay vyas made changes -
          Link This issue is blocked by BIGTOP-1327 [ BIGTOP-1327 ]
          jay vyas added a comment -

          The final steps of this JIRA are blocked by an update required to the architecture diagram (arch.dot) file.

          bhashit parikh added a comment -

          jay vyas I have started working on this. Going step by step. Taking care of the first three steps in the first run.

          jay vyas made changes -
          Link This issue blocks BIGTOP-1275 [ BIGTOP-1275 ]
          jay vyas added a comment -

          Attached the new arch diagram here for convenience.

          jay vyas made changes -
          Attachment arch.jpeg [ 12648250 ]
          bhashit parikh added a comment - - edited

          So far, I have finished up to step 2, and I have been working on steps 3 and 4. Here are some thoughts on how we can go about that:

          1. Instead of performing hashing for users and products, we assign unique IDs to both. This is more likely to resemble a real-life scenario, where both types of data are generally stored in relational databases. This also saves us from having to depend on the hashes to decode the user and product information after Mahout is done processing.
          2. Since we want the output from a Pig script to be the input for Mahout, we would need to change the current data-generation code to include user IDs and product IDs. If we change the data-generation part, we'd also need to make changes to the current Pig-related code to deal with the changed format.
          3. As discussed with jay vyas, I'm also working on making the association between states and products more modular.
          jay vyas added a comment -

          Yup, I definitely agree, bhashit. I will attach a patch that does that. In the meantime, for the integration test you can mock the files as necessary.

          bhashit parikh added a comment -

          jay vyas, I am still trying to determine where I should store the user data. We'd need to persist the user data between the Pig task and the Mahout task. In a real-world scenario, the user data would probably come from some kind of relational DB. We could go with an in-memory database, but I am not sure whether it fits our requirements well. Persistence of the user data during the execution of the pipeline is especially important since we want the recommendations generated by Mahout to be obvious (easily understandable). To do so, we would need to process the output of Mahout (which would contain the user and product IDs) and generate visibly meaningful results (probably using Pig). Can we discuss this?

          jay vyas added a comment -

          (1) Sure, I can ping you on Skype. But for now, can you mock the input data sets:

          10011 1 1
          10011 2 1
          ....
          

          This can easily be used as input to the Mahout recommender, in the sense that it only takes a pure CSV file as input.

          How that CSV is generated should be irrelevant to the recommender, right?

          Hope that will help for now; I am in transit, but I'll catch up with you within the next day to discuss this on Skype.
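
          A small, hypothetical sketch of materializing such mock (user, item, rating) triples for a local test, written in the comma-separated form the ALS jobs read (the IDs and the extra user are illustrative):

            import java.io.IOException;
            import java.nio.charset.StandardCharsets;
            import java.nio.file.Files;
            import java.nio.file.Path;
            import java.util.Arrays;

            public class MockRatings {
              // Writes a tiny ratings.csv into a temp directory and returns its path.
              public static Path write() throws IOException {
                Path dir = Files.createTempDirectory("bps-mock-input");
                Path ratings = dir.resolve("ratings.csv");
                // User 10011 likes products 1 and 2; user 10012 likes product 1 only,
                // so a recommender should end up suggesting product 2 to user 10012.
                Files.write(ratings, Arrays.asList(
                    "10011,1,1",
                    "10011,2,1",
                    "10012,1,1"), StandardCharsets.UTF_8);
                return ratings;
              }
            }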

          bhashit parikh added a comment -

          jay vyas That part is already done.

          jay vyas added a comment -

          Cool! Can you gist/pastebin or attach a patch I can play with? FYI, I am back on the ground again, so I can help however necessary this week!

          bhashit parikh added a comment - - edited

          After giving the whole process a lot of thought, I have settled on the following approach for getting the whole thing done:

          1. Use the Hadoop Java API to write customer records (TSV). Each record will contain (id, firstName, lastName, state). (A minimal sketch of this step follows the list.)
          2. Use the same approach for writing a TSV containing product details (id, name, price). We could have skipped this step but for the fact that Mahout requires product IDs to process the data.
          3. Once these two sets of base records are generated, generate transaction records containing the customer IDs and products.
          4. The generated transaction records will simulate real-world buying patterns: customers who generally buy dog products will buy dog products most of the time, but once in a while they may buy some other product as well. However, the frequency of that happening will be very low.
          5. The weight currently given to each state will be used for generating customer records, so that we have a larger number of customers from states with higher weights.
          6. Create a new Pig script to translate the transaction records into the format required by the Mahout recommender.
          7. Perform the parallel ALS recommendation.
          8. At this stage, the recommendations are done. However, they will be in a format like 1 100 102. To make them more readable, we can run some Pig code that reads both the transaction records and the output of the Mahout recommender and generates output like id:1, bought: dog_collar,dog_food, recommended: dog_leash, or something to that effect. jay vyas, what do you think?
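
          A minimal sketch of step 1, assuming hypothetical class, field, and path names: writing the customer TSV through the Hadoop FileSystem API, so the same code works against the local filesystem in tests and against HDFS on a cluster.

            import java.io.BufferedWriter;
            import java.io.OutputStreamWriter;
            import java.nio.charset.StandardCharsets;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class CustomerTsvWriter {
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Path out = new Path(args.length > 0 ? args[0] : "generated/customers.tsv");
                FileSystem fs = out.getFileSystem(conf);
                try (BufferedWriter w = new BufferedWriter(
                    new OutputStreamWriter(fs.create(out, true), StandardCharsets.UTF_8))) {
                  // One (id, firstName, lastName, state) record per line, tab separated.
                  w.write("1\tJoe\tSmith\tAZ");
                  w.newLine();
                  w.write("2\tJane\tDoe\tCT");
                  w.newLine();
                }
              }
            }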
          jay vyas added a comment -

          Hi bhashit: can you look at the arch.dot file, update it with your thoughts, and attach an image in this thread?

          An easy way to do this is to just paste it into Erdos (an online Graphviz editor).

          Steps 6 and 7 are the critical ones to get right. IMO this doesn't need any new steps, other than a 2nd output from the generate-transactions phase (which can be done in the same MapReduce job using MultipleOutputs).

          bhashit parikh added a comment -

          jay vyas, I thought about using MultipleOutputs. We can do it that way. But I was thinking that since that MR job is a data-generation phase, we could probably keep it limited to that: once our existing Pig script does its cleanup work, we execute another Pig script as another step in the pipeline to generate the input for the next step, the Mahout recommender. Step 7 is already done and step 6 is just a few lines of Pig code (as a script, or executed through Java). I am not sure if that's the right way to go or not. So, should I go with MultipleOutputs or the other way?

          jay vyas added a comment -

          I'd like to keep the architecture in place in terms of total overall steps:

          • if we are generating a petabyte of data, it could take hours to run a single job. Adding more jobs means more time, more manual steps by people running the app, and more integration tests to write for us.
          • we don't want to clutter the pipeline with extra steps that don't add more breadth to the amount of the ecosystem we cover.

          Do you agree, in that sense, that adding more Pig scripts is going to make things harder to maintain? If so, let's do MultipleOutputs. But please do feel free to debate the point further if I'm missing something and there is some extra value to the additional step. I think MultipleOutputs will be an easy 4 or 5 lines of extension to the existing o.a.b.bps.generator.MyMapper class, and just another couple of lines to extend arch.dot and the TestPetStoreTransactionGeneratorJob. (A rough sketch of the MultipleOutputs idea follows.)
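
          The sketch below is hypothetical (the mapper, record fields, and output names are illustrative, not the actual o.a.b.bps.generator.MyMapper): one map() call writes the full transaction record to the main output and the "customerId,productId,1" triple to a named "mahout" output, so a single MR job produces both files. The driver would declare the named output with MultipleOutputs.addNamedOutput(job, "mahout", TextOutputFormat.class, Text.class, NullWritable.class).

            import java.io.IOException;
            import org.apache.hadoop.io.LongWritable;
            import org.apache.hadoop.io.NullWritable;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapreduce.Mapper;
            import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

            public class TransactionGeneratingMapper
                extends Mapper<LongWritable, Text, Text, Text> {

              private MultipleOutputs<Text, Text> outputs;

              @Override
              protected void setup(Context context) {
                outputs = new MultipleOutputs<Text, Text>(context);
              }

              @Override
              protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
                // ... generate a transaction for some (customerId, productId) pair ...
                String customerId = "1";
                String productId = "100";

                // Main output: the full transaction record.
                context.write(new Text(customerId), new Text(productId + "\tdog-food\t10.50"));

                // Second, named output for the recommender, written in the same job.
                outputs.write("mahout", new Text(customerId + "," + productId + ",1"),
                    NullWritable.get());
              }

              @Override
              protected void cleanup(Context context) throws IOException, InterruptedException {
                outputs.close();
              }
            }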

          bhashit parikh added a comment -

          jay vyas, what you are saying makes sense. The input for the Mahout recommender is also data to be generated, so we can include that in the data-generation MR job. I was just going for separation of concerns; I mean, the code that generates transaction records should only generate those records and nothing else. But that's probably not the right way to go when we are dealing with large datasets. I'll add the MultipleOutputs code.

          bhashit parikh added a comment - - edited

          Submitted the patch with the code. I haven't updated arch.dot yet since I want to test out the whole flow using the hadoop jar commands with the Mahout jobs once before updating it.

          To run the recommender, we first need to run the Pig cleaning job using

          gradle clean integrationTest -PITProfile=pig
          

          This processes the transaction records and generates the input files required by Mahout.

          and then the Mahout jobs:

           gradle integrationTest -PITProfile=mahout
          
          bhashit parikh made changes -
          Attachment BIGTOP-1272.patch [ 12653749 ]
          bhashit parikh added a comment -

          Removed the mavenLocal() I had added for testing.

          bhashit parikh made changes -
          Attachment BIGTOP-1272.patch [ 12653753 ]
          bhashit parikh added a comment - - edited

          Hey, running the BigPetStore jar with the hadoop jar command requires that all the dependencies are bundled into the jar itself, or at least specified some other way. Gradle doesn't have anything that supports this out of the box. I am trying to find a reliable way to do that. There are some options available, but all of them have reported problems. My Eclipse is able to build a jar with all the dependencies included, so I guess I should be able to do so with Gradle. jay vyas, how did you build the jar with Maven earlier? Did the hadoop jar command work with the BigPetStore jar built by Maven?

          jay vyas added a comment -

          We use the HADOOP_CLASSPATH environment variable and add jars at runtime.

          Bundling jars is a possibility, but dangerous if there is a conflict with the Hadoop jars.
          A better idea, IMO, is to explicitly specify the libraries at runtime in some way.

          That is how we do the Pig portions: in the case that Pig is not on the classpath, we just use the HADOOP_CLASSPATH env variable.

          bhashit parikh added a comment -

          Okay, that is a better option. Since arch.dot didn't show any other command-line args, I thought maybe the Maven feature for bundling the dependencies was being used. Cool, I'll do the testing using that then. Is there anything special about HADOOP_CLASSPATH?

          jay vyas added a comment - - edited

          You know, a simpler way that I think will work is:

          hadoop fs -copyFromLocal pig-without-hadoop*.jar hdfs://localhost:1234/tmp/pig.jar
          /usr/lib/hadoop/bin/hadoop jar hive-pig/bigpetstore-1.3.10.jar org.bigtop.bigpetstore.etl.PigCSVCleaner -libjars hdfs://localhost:1234/tmp/pig.jar bigpetstore bigpetstore_cleaned

          We can try that out. That's how, IIRC, I run it in some test scripts. The alternative:

            export HADOOP_CLASSPATH=/usr/lib/pig/pig-0.12.0.2.0.6.1-101-withouthadoop.jar
            hadoop jar ……

          Either way is (I think) equivalent, but -libjars might be easier since you don't have to copy the file to every node on the cluster; you just copy the jar once into whatever DFS you are using.
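
          One detail worth noting: -libjars is only honored when the main class goes through the GenericOptionsParser, e.g. by implementing Tool and being launched via ToolRunner. A minimal, hypothetical skeleton (the class name is illustrative, not the actual PigCSVCleaner):

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.conf.Configured;
            import org.apache.hadoop.util.Tool;
            import org.apache.hadoop.util.ToolRunner;

            public class LibJarsAwareDriver extends Configured implements Tool {

              @Override
              public int run(String[] args) throws Exception {
                // By the time we get here, ToolRunner has stripped the generic options
                // (-libjars, -D, -files, ...) and folded -libjars into getConf(), so
                // args holds only the job's own arguments.
                return 0; // build and submit the actual job here, using getConf()
              }

              public static void main(String[] args) throws Exception {
                System.exit(ToolRunner.run(new Configuration(), new LibJarsAwareDriver(), args));
              }
            }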

          bhashit parikh added a comment - - edited

          After much head-banging against classpath errors and local Hadoop config, I was finally able to run the code using the hadoop jar command. In case someone else needs it, here are the steps that I used for running the Mahout job.

          1. To get the appropriate Mahout jar compiled against the Hadoop 2.2.0 that I was using (clone Mahout from its git repo):
            1. mvn -Dhadoop2.version=2.2.0 -DskipTests clean package
            2. find the mahout-mrlegacy-1.0-SNAPSHOT-job.jar file in the mrlegacy/target dir of your git clone.
          2. Set the HADOOP_CLASSPATH env variable to include the jar file built in the previous step, plus your Scala library. For example:
            export HADOOP_CLASSPATH=/home/bp/jars/pig-withouthadoop.jar:/home/bp/jars/mahout-mrlegacy-1.0-SNAPSHOT-job.jar:/home/bp/opts/candidates/scala/scala-2.11.1/lib/scala-library.jar
            

            The Mahout jar file name ends with -job.jar. This is a jar built without the Hadoop dependency, which is what we need.
            Change the paths according to where you have stored the jars. I included the Pig jar as well since I wanted to run the Pig cleaning job too.

          3. After copying the generated transaction records to HDFS, run the hadoop jar command:
            hadoop jar /home/bp/code/bigtop/bigtop-bigpetstore/BigPetStore.jar org.apache.bigtop.bigpetstore.recommend.ItemRecommender   /bps_integration_/cleaned/Mahout /bps_integration_/Mahout/factorization /bps_integration/Mahout/recommendations
            

          My local Hadoop installation version is 2.2.0, which is the same version configured in Gradle.

          And voila! It works. Mahout is generating recommendations.

          The next step is, of course, setting this up using the -libjars version, but now that should be easier.

          jay vyas added a comment -

          Thanks bhashit... can we add more unit tests?
          Then once those are passing, I'll test it on a Hadoop cluster also.

          jay vyas added a comment -

          Regarding Mahout 2.x jars: why not just use the ones that are produced from the Hortonworks or Cloudera (CDH5) Maven repos? I assume those are both compiled for Hadoop 2.x.

          bhashit parikh added a comment - - edited

          The jars used by Gradle are built without any dependencies, since in the Gradle environment all the dependencies are available on the build classpath. When running with the hadoop jar command, we'd need all the dependencies used by Mahout itself as well, while excluding the Hadoop dependency. I found out after going through a Mahout book and some documentation that this is the standard way of using Mahout from the command line. Since Mahout is frequently used as a Hadoop map-reduce job, they provide, as part of their mvn package process, a jar that we can use with the hadoop jar command. Even with Pig, I used the pig-withouthadoop.jar from the standard Pig distribution.

          bhashit parikh added a comment -

          Added code for verifying the output of the Mahout recommender, as well as the generation of the Mahout input by Pig.

          Modified the arch.dot file.

          Modified the README to account for the new development.

          bhashit parikh made changes -
          Attachment BIGTOP-1272.patch [ 12655517 ]
          jay vyas added a comment -

          Thanks bhashit, I will review this tonight! Sorry, I've been very busy lately. I'll run all the local tests first, make sure the code builds, and then try to deploy it in a Hadoop cluster as an app. If it works, I'll update the BigPetStore demo videos on YouTube and commit the patch.

          jay vyas added a comment - - edited

          Overall the code looks good and makes sense; here are some first (minor) comments.

          • Regarding patch format, there is some trailing whitespace. It's a minor issue, but we like to remove it if you can do so in your IDE. I have a recipe for this with IntelliJ (see the comments in BIGTOP-1240).

          In arch.dot:

          • how does MahoutRecommenderJob get launched? Isn't it done all in one? If so, it should be expressed in the second arrow (the same way you do above, in the Pig part).

          In build.gradle:

          should "test ..." be excluding your Mahout test as well as "TestPig", "TestHive", "TestCrunch", etc.? I remember that we exclude those tests for a reason, probably because they are integration tests. And since the Mahout test is an integration test, shouldn't we exclude that as well?

          In the Scala source:

          • There is a "TODO Jay / Bhashit ..." line in the DataForger Scala class. Is that still relevant?
          • Just curious: does this run on a Hadoop cluster? I have not tested it yet; I am just wondering if anything special needs to be done for the Scala-generated libraries (i.e. do we have to add Scala to the classpath on each node)? I can work on that if you aren't sure how it should be done.

          This is a huge patch! So I will have to keep reviewing it tomorrow. So far it looks like a big effort, and it looks like it should all work.

          bhashit parikh added a comment - - edited
          1. I'll take care of the trailing whitespace. I keep forgetting about that.
          2. In our code, the Mahout RecommenderJob is run automatically as a part of the ItemRecommender.scala code. However, the recommendations are executed internally in two different phases. I was trying to communicate that through arch.dot. Now that I think about it, it could be a bit misleading. Should I just keep a single step for it?
          3. The integration tests are excluded from the test task by default since they are all in the "src/integrationTest" directory. The test task only executes the unit tests from the "src/test" dir. So we don't need to add the Mahout integration test to the pattern. Also, out of all the tests named in the "exclude" pattern, only the last one currently exists.
          4. I'll remove the TODO in DataForger.scala. I think it'll be taken care of when we work on integrating even more useful patterns into the data generation as a part of BIGTOP-1366.
          5. Yes, the scala-library will need to be present on the classpath for the Scala code to execute. We'll need to use scala-library.jar version 2.11. I don't know if we'd need to copy the jar to all nodes of the cluster; I haven't run a Hadoop cluster before. I think the scala-library jar will be needed everywhere the pig-withouthadoop and other jars are needed. You mentioned in one of the previous comments that using -libjars would avoid having to copy all the jars to all the nodes.
          jay vyas added a comment -

          Finished reading the patch, and it looks like it's all there.

          But after running the Mahout integration test, I get an InvalidInputException. I suspect this is related to the formatting of the Pig stage's output, but I'm not sure:

              Command line arguments: {--alpha=[0.8], --endPhase=[2147483647], --implicitFeedback=[false], --input=[bps_integration_/cleaned/Mahout], --lambda=[0.1], --numFeatures=[2], --numIterations=[5], --numThreadsPerSolver=[1], --output=[bps_integration_/Mahout/AlsFactorization], --startPhase=[0], --tempDir=[/tmp/mahout_1405475399824]}
              Command line arguments: {--alpha=[0.8], --endPhase=[2147483647], --implicitFeedback=[false], --input=[bps_integration_/cleaned/Mahout], --lambda=[0.1], --numFeatures=[2], --numIterations=[5], --numThreadsPerSolver=[1], --output=[bps_integration_/Mahout/AlsFactorization], --startPhase=[0], --tempDir=[/tmp/mahout_1405475399824]}
              mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
              mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
              mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
              mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
              mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
              mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
              Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
              Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
              Cleaning up the staging area file:/tmp/hadoop-bigpetstore/mapred/staging/bigpetstore1132639609/.staging/job_local1132639609_0002
              Cleaning up the staging area file:/tmp/hadoop-bigpetstore/mapred/staging/bigpetstore1132639609/.staging/job_local1132639609_0002
          
          org.apache.bigtop.bigpetstore.BigPetStoreMahoutIT > testPetStorePipeline FAILED
              org.apache.hadoop.mapreduce.lib.input.InvalidInputException at BigPetStoreMahoutIT.java:69
          
          1 test completed, 1 failed
          :integrationTest FAILED
          
          

          Will dive some more.

          jay vyas added a comment - - edited

          Interesting... the second time around, it passed! I ran the Pig integration test, followed by the Mahout integration test, explicitly.

          So, bhashit parikh, is the Mahout test now dependent on the Pig integration test? I think so, which is okay (actually, it makes sense), but just confirming.

          After I play with the code some more tomorrow, I think this patch will be ready for putting into BigPetStore. This makes BigPetStore a really first-class demonstration of the power of the Hadoop ecosystem.

          Looking forward to adding an alternate Spark path next.

          bhashit parikh added a comment - - edited

          jay vyas, yes, you are right. The Mahout test is dependent on the output of the Pig phase of the pipeline. There is a way to make it behave more like a pipeline (using Gradle), like we talked about on Skype. I am thinking about facilitating that; however, off the top of my head, I can think of at least one problem with an automated pipeline. I'll think it through and then maybe I can discuss the approach with you, possibly creating a different JIRA issue for that.
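
          One small, hypothetical way to make that dependency explicit in the meantime (the path comes from the integration-test logs above; the helper itself is illustrative, not existing code) would be a fail-fast check before the Mahout phase runs:

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class MahoutInputPrecondition {
              // Throws with a clear message if the Pig phase has not produced its output yet.
              public static void check(Configuration conf) throws Exception {
                Path cleaned = new Path("bps_integration_/cleaned/Mahout");
                FileSystem fs = cleaned.getFileSystem(conf);
                if (!fs.exists(cleaned)) {
                  throw new IllegalStateException("Mahout input " + cleaned + " not found; "
                      + "run the Pig integration test (gradle integrationTest -PITProfile=pig) first.");
                }
              }
            }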

          jay vyas added a comment - - edited

          (edit) This doesn't seem to build if *eclipse* is missing, I've just now realized, after trying to build it in a fresh Bigtop cluster from head...

          I see very odd errors if the eclipse plugin is enabled on a machine where eclipse is not installed:

            "No such property allDependencies for class: java.io.File"

          The build does indeed succeed on any Java 1.7+ as long as the eclipse plugin is commented out of the Gradle config.

          Whew. On to the last bit of testing of this...

          bhashit parikh added a comment - - edited

          That's really odd. The eclipse plugin doesn't depend on the presence of Eclipse; it just generates the project files so that the project can be imported into Eclipse. It seems like this might be an issue with the way the eclipse plugin interacts with the rest of the plugins. I'll try to do this on a VM and find out what's going on. jay vyas, would it be possible for you to give a screenshot/paste of the stacktrace (with the --stacktrace flag) while building with gradle? And the output of gradle --version?

          EDIT: I now have a suspicion that this could be an issue with the version of Gradle. Gradle 2.0 was released a couple of weeks ago, and the eclipse plugin might have some catching up to do. Not sure though.

          bhashit parikh made changes -
          Attachment build.gradle [ 12656448 ]
          bhashit parikh made changes -
          Comment [ build.gradle for gradle 2.0 ]
          bhashit parikh added a comment - - edited

          I was able to reproduce the build error with Gradle 2.0. They have changed the way the classpath for eclipse is configured (and they didn't warn us!). I was able to make the integrationTests (and the build) work by changing the last part of build.gradle where the eclipse classpath is configured. If this is indeed the reason behind the reported build error, I think I should create a Gradle wrapper (gradlew files) and put it in with the project (maybe I should do that anyway; the Gradle folks generally advise doing that). That way, everyone can use the same version of Gradle. jay vyas, would it be possible to check the build using Gradle 1.12? Or using the build.gradle that I attached here to test with Gradle 2.0?

          If this indeed turns out to be a version issue, I think I should do two things:

          1. Create a Gradle wrapper for the project.
          2. Possibly upgrade to Gradle 2.0. It has several improvements, including a few for Scala/Java cross-compiling.
          jay vyas added a comment - - edited

          Hi bhashit parikh.

          There are some dependency errors to be fixed to run this in a cluster.

          It looks like, for example, org.apache.commons.lang3 is being used, and maybe some others also.

          Can you update the patch with README instructions for which libraries we need to bundle in, or maybe build an uber jar in Gradle?

          (Let's keep the eclipse-related stuff focused on BIGTOP-1379.)

          jay vyas added a comment - - edited

          After running this in a real hadoop cluster, the additional dependencies added in this patch became clear. Unfortunately, we can't really run this on a hadoop cluster because of those dependencies, unless you can come up with a way to build them into an uber-jar (I tried that; it failed because of a META-INF issue, not sure what it was).

          1) I resolved at least some of the dependencies (JFairy, commons-lang3, and the Scala 2.10 library) and added them manually to hadoop/lib... but other dependencies were still missing (org.yaml...). So

          2) I also tried to create a fat jar with gradle, but that failed because of a META-INF issue in the jar file. Maybe we can get a fat-jar solution to work?

          bhashit parikh so even though the code should work, I cannot deploy it in a cluster in any easy way. We will have to come up with a reliable way to deploy this jar file.

          Let me know what ideas you have here, or just attach an updated patch and I'll test it. Thanks!

          bhashit parikh added a comment -

          I am looking into the options that we have. Unlike maven, gradle doesn't have first-class support for building a jar with its dependencies. But unlike maven, we do have the full power of Groovy.

          I have been looking into the whole thing today. There are a few approaches that we can use, all of which are described here. jay vyas let me know if I have missed a candidate.

          There is one thing that I think could be a bit problematic. The hadoop distribution itself provides hadoop-core and the other transitive dependencies when running with the hadoop jar command. So, we'll need to exclude the hadoop dependencies when we build the jar. Something along the lines of the maven provided scope, but it should be more versatile than that.
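          The rough shape I have in mind is something like this (only a sketch, assuming the java plugin; configuration names and versions are illustrative, not the final build.gradle):

            // build.gradle -- sketch of a maven-"provided"-like setup
            configurations {
              provided
            }
            // provided deps are visible at compile time but are not packaged
            sourceSets.main.compileClasspath += configurations.provided

            dependencies {
              provided "org.apache.hadoop:hadoop-client:2.2.0"   // supplied by 'hadoop jar' at runtime
              compile  "org.apache.commons:commons-lang3:3.3.2"  // must ship with the job
            }

            jar {
              // bundle only what the cluster does not already provide
              from {
                (configurations.runtime - configurations.provided).collect {
                  it.isDirectory() ? it : zipTree(it)
                }
              }
            }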

          I'll include the gradle wrapper with the new patch as well, since the eclipse issue is not the only thing that could be problematic; the gradle folks seem to have made some backward-incompatible changes to the syntax. The wrapper will also help with the CI in BIGTOP-1379.

          jay vyas added a comment - - edited

          bhashit, I have an idea. Skype? (For those interested: it involves hadoop libjars plus modifying the generator to properly use ToolRunner.)

          jay vyas added a comment - - edited

          bhashit parikh Good news! I was able to run the generator (will test the pig/mahout updates shortly as well) by simply using gradle to write out all the libs to a folder, and then using hadoop jar <classname> -libjars <comma_sep_list_of_jars> 100 bps/out.
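          In case it helps, the gradle snippet I used to dump the libs was roughly this (a sketch; the task name and output dir are just what I picked):

            // build.gradle -- sketch of dumping the runtime jars for -libjars
            task copyRuntimeLibs(type: Copy) {
              from configurations.runtime
              into "$buildDir/libjars"
            }

          The comma-separated list for -libjars can then be built from the jars in build/libjars.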

          So at least that is one nice, platform-neutral option. If you have a better fat-jar implementation, that might work as well.

          FYI, I noticed, however, that on the CLIENT side you require commons-lang3. That is required BEFORE the mappers even START, which means -libjars isn't sufficient, because you have some local code in the job driver that uses an external library. (-libjars only reaches the tasks running in the cluster on hadoop 2.0; maybe 2.2+ fixes this.)

          jay vyas added a comment -

          Okay. I got it to run. Now here is a review:

          1) There is a bug in the BPS_ANALYTICS.pig output. It appears we now have a lot of empty entry records.

          bash-4.1$ hadoop fs -cat pig_ad_hoc_script0/part*                                                                                                                       [172/1845]
                          28
          ...
                          72
                          68
          AK      filter  3
          AK      air_pump        2
          AK      cat_food        3
          AK      dog_food        3
          

          2) There needs to be documentation in the README for running the Mahout Recommender job. At this point I see no main() method in ItemRecommender, so running hadoop jar /home/bp/code/bigtop/bigtop-bigpetstore/BigPetStore.jar org.apache.bigtop.bigpetstore.recommend.ItemRecommender .... is definitely not an option.

          3) Also, gradle tooling for a fat jar.

          Then we will be able to try to run it again.

          bhashit parikh added a comment -
          1. The number of entries is probably higher because I set the default number of records to be generated to 100. Let me know if I should reduce that number and/or if my guess is incorrect.
          2. ItemRecommender.scala does have a main method. It is right at the end of the file, inside object ItemRecommender. Scala doesn't have statics, so the main method has to reside within the companion object.
          3. I have spent a few hours on getting a satisfactory fat jar. I can create the fat jars with the necessary configurations. However, it seems like pig-withouthadoop.jar and the similar jar for mahout would need to be specified manually. I looked into some of the pig code that deals with this. It seems that pig, when run as a hadoop job, creates a few jars and provides them to hadoop. I am still looking into that.

          I'll post more updates soon.

          jay vyas added a comment -

          1) The bug I'm seeing is that it literally prints out counts with no associated product name. The numbers

           28,62,... 

          above have no associated product. That means we must have some whitespace products or something.

          2) Re: ItemRecommender. Okay, good. Just add to the README.md the exact way you want me to test it, confirm that it's in the attached patch, and I'll test it as you prescribe in the next iteration.

          3) Regarding the fat jar? That's fine. If we have to add pig-withouthadoop.jar and mahout.jar to the classpath, that's not a problem at all. As long as it's in the README.md and anyone can easily follow along, we are in good shape.

          Thanks bhashit, let me know when this is ready to test again.

          bhashit parikh added a comment - - edited

          I have added code for creating a fat jar, and ran it successfully on a single-node cluster using the generated fat jar. I have also modified a bit of the code that was causing the empty (numbers-only) records to be generated by pig. It turns out that since the mahout-input was being stored in the same dir as the cleaned output (TSV file), and the pig script was using the entire cleaned directory as input, the mahout-input records were being picked up by the ad-hoc script as well. That's taken care of now.

          The instructions for running are:

          1. Use gradle clean shadowJar -Pfor-cluster to generate a fat jar that excludes the pig, hadoop, and mahout dependencies (including the transitive ones); see the sketch at the end of this comment. The name of the generated file will be BigPetStore-0.8.0-SNAPSHOT-all.jar, inside the build/lib dir. I'll refer to it as bps.jar for convenience.
          2. Find or generate the pig-withouthadoop.jar from the pig distribution. To build the correct jar, you can use the command ant mvn-jar from inside your pig distribution/checkout. After running this command, you can find pig-0.12.1-SNAPSHOT-withouthadoop-h2.jar inside the build dir. This is the exact jar that is used by our gradle build.
          3. To get the appropriate mahout.jar compiled against hadoop 2.2.0 (clone mahout from its git repo):
            1. mvn -Dhadoop2.version=2.2.0 -DskipTests clean package.
            2. Find the mahout-mrlegacy-1.0-SNAPSHOT-job.jar file in the mrlegacy/target dir of your clone.
          4. Specify both of these (pig and mahout) jars using -libjars and HADOOP_CLASSPATH when running the hadoop jar command.
          5. To run the pig part:
            hadoop jar bps.jar  /bps_integration_/generated  /bps_integration_/cleaned /your/path/BPS_analytics.pig -libjars=...
          6. To run the mahout code:
            hadoop jar bps.jar org.apache.bigtop.bigpetstore.recommend.ItemRecommender  /bps_integration_/cleaned/Mahout  /bps_integration_/Mahout/Factorization /bps_integration_/Mahout/Recommendations -libjars=...
            

          The mahout output will be in the /bps_integration_/Mahout/Recommendations dir.

          I found out that for pig code that uses a GROUP BY clause, the hadoop job-history-server needs to be running. I got it running using mr-jobhistory-daemon.sh start historyserver. I came across that solution on this stack-overflow question.
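          For reference, the shadowJar wiring behind step 1 is roughly along these lines (a sketch only, assuming the shadow plugin is already on the buildscript classpath; the exclusion patterns in the real build.gradle may differ):

            // build.gradle -- sketch of the shadowJar configuration described above
            apply plugin: 'com.github.johnrengelman.shadow'

            shadowJar {
              if (project.hasProperty('for-cluster')) {
                // leave out what 'hadoop jar' and -libjars already provide
                dependencies {
                  exclude(dependency('org.apache.hadoop:.*:.*'))
                  exclude(dependency('org.apache.pig:.*:.*'))
                  exclude(dependency('org.apache.mahout:.*:.*'))
                }
              }
            }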

          bhashit parikh made changes -
          Attachment BIGTOP-1272.patch [ 12658106 ]
          bhashit parikh added a comment -

          jay vyas I have uploaded the patch so that you can start testing. I'll still need to clean it up a little since it contains some redundant code in build.gradle.

          bhashit parikh made changes -
          Attachment BIGTOP-1272.patch [ 12658120 ]
          jay vyas added a comment - - edited
          • Is this patch 100% ready for testing, including README updates? If so, I can have a look asap. Otherwise I'm busy with some other stuff at the moment, so I'd rather just wait for a full, clean patch to do the review. It's a complex deploy, so the devil will be in the details of following the README and ensuring that it works as specified.
          • On another note: if the patch still needs a lot of work, I'm starting to wonder how important mahout mapreduce implementations are to the broader community, given the ongoing move to mahout spark implementations.
          • Open to ideas. Is anyone in need of mahout mapreduce tests, or are we all moving to do all our machine learning on spark? And bhashit parikh, do you think a clean implementation is around the corner?
          bhashit parikh added a comment - - edited

          jay vyas The patch is ready. The only problem we were facing was the classpath one, which gets exacerbated in a hadoop cluster. I think that is fixed now. I have already tested on a single-node cluster. I did update the README in the past, multiple times, but I have made a lot of changes since then. I will finalize the README once you finish testing it. I have left the instructions for running this patch in my previous comment.

          The classpath issue is going to bug us with or without mahout, even if we just use spark. And the solution for that is not going to be any cleaner or easier than this, unless we resort to running some sort of shell script, or full-blown groovy code, through gradle. The cleanliness of that approach is rather debatable as well.

          jay vyas added a comment - - edited

          Sounds good bhashit.

          • I'm okay with testing without a README. Can you clarify the directions you'd like me to follow for supplying your -libjars?
          hadoop jar bps.jar org.apache.bigtop.bigpetstore.recommend.ItemRecommender  /bps_integration_/cleaned/Mahout  /bps_integration_/Mahout/Factorization /bps_integration_/Mahout/Recommendations -libjars=...
          
          • Regarding your other idea: I think we can put a driver for this in the BIGTOP-1222 patch after that is completed.
          bhashit parikh added a comment -

          The jars required through -libjars right now are pig-withouthadoop.jar and the mahout job jar, both of which I built from their respective distributions (I described how I did that in the previous comment as well).

          jay vyas added a comment - - edited

          Okay bhashit. I built it successfully with JDK 7 (not 6). Let's make sure we add this stuff to the README:

          • Java 1.7 is required.
          • gradle 2.0 is (I assume) required

          ... Now I'll test it on a cluster and let you know...

          • First we need to export HADOOP_CLASSPATH=pig.jar:mahout.jar (so the local client has access to the libs)
          • THEN you also need to append -libjars ${JARS} (so the mappers have access to the same libs)

          • THEN ...... IT WORKS

          Now please do the following so we can commit!

          • Remove the added trailing whitespace in the files under bigtop-bigpetstore/
          • Update the README with the following directions for running:
          ### Note that both pig and mahout can be yum installed
          ### via bigtop.  Mahout 2.0 can also be yum installed from 
          ### any vendor distro.  You don't need to build those jars.
          
          ### As usual, generate the data.
          hadoop jar bigpetstore.jar org.apache.bigtop.bigpetstore.generator.BPSGenerator 100 bigpetstore/gen
          
          ### For yarn node managers that run the actual tasks, we need mahout/pig on cp.
          export JARS="/usr/lib/pig/pig-0.12.0-.....1.0-withouthadoop.jar,/usr/lib/mahout/mahout-core-job.jar"
          ### For the client, we also need these jars on the cp to kick off the jobs.
          export HADOOP_CLASSPATH=`echo $JARS | sed s/,/:/g`
          
          ### Now,  clean it with pig.
          hadoop jar bps.jar org.apache.bigtop.bigpetstore.etl.PigCSVCleaner -libjars $JARS bigpetstore/gen/ bigpetstore/pig/ BPS_analytics.pig
          
          ### Finally, process with mahout.
          hadoop jar bps.jar org.apache.bigtop.bigpetstore.recommend.ItemRecommender -libjars $JARS,/usr/lib/mahout/mahout-core-job.jar bigpetstore/pig/Mahout bigpetstore/Mahout/AlsFactorization bigpetstore/Mahout/AlsRecommendations
          
          

          After you make those 2 very minor modifications, I can commit this. THANKS for sticking with me through all this testing.

          bhashit parikh added a comment -

          Attaching the latest patch with the README updated and trailing whitespace removed.

          bhashit parikh made changes -
          Attachment BIGTOP-1272.patch [ 12662720 ]
          jay vyas added a comment -

          +1. The README looks clear, and is exactly what I did to run it.
          After all this testing I definitely think it's ready to push in, and I can commit this.

          bhashit parikh added a comment -

          jay vyas. Eagerly waiting for the commit.

          jay vyas added a comment -

          Committed. Thanks bhashit!!!!

          I couldn't assign it to you, but maybe I'll modify that later (or you can file an Infra ticket to see why your name doesn't come up as a possible assignee).

          jay vyas made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Assignee jay vyas [ jayunit100 ]
          Fix Version/s 0.8.0 [ 12324841 ]
          Resolution Fixed [ 1 ]
          Transition: Open to Resolved
          Time In Source Status: 126d 31m
          Execution Times: 1
          Last Executer: jay vyas
          Last Execution Date: 20/Aug/14 11:55

            People

            • Assignee:
              jay vyas
              Reporter:
              jay vyas
            • Votes:
              0
              Watchers:
              3
