Mahout / MAHOUT-1541

Create CLI Driver for Spark Cooccurrence Analysis

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: CLI
    • Labels:

      Description

      Create a CLI driver to import data in a flexible manner, create an IndexedDataset with BiMap ID translation dictionaries, call the Spark CooccurrenceAnalysis with the appropriate params, then write output with external IDs optionally reattached.

      Ultimately it should be able to read input as the legacy MapReduce code does, but it will support reading externally defined IDs and flexible formats. Output will be in the legacy format or in text files of the user's specification with external item IDs reattached.

      Support for legacy formats is a question; users can always use the legacy code if they want this. Internal to the IndexedDataset is a Spark DRM, so pipelining can be accomplished without writing to an actual file, and the legacy sequence-file output may not be needed.
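
      For illustration, a minimal sketch of the flow described above; the reader/writer helpers and the cooccurrence wrapper are placeholders for what this ticket would add, not an existing API:

      // A minimal sketch of the intended driver flow. The reader/writer helpers and the
      // cooccurrence wrapper are placeholders for what this ticket adds, not an existing API.
      import com.google.common.collect.BiMap
      import org.apache.mahout.math.drm.DrmLike

      object DriverFlowSketch {
        // Assumed helpers, to be implemented by the driver and importer/exporter:
        def readDrmWithDictionaries(path: String): (DrmLike[Int], BiMap[String, Int], BiMap[String, Int]) = ???
        def writeDrmWithDictionary(drm: DrmLike[Int], itemDictionary: BiMap[String, Int], path: String): Unit = ???
        def cooccurrence(drm: DrmLike[Int]): DrmLike[Int] = ???  // stands in for CooccurrenceAnalysis.cooccurrence

        def run(inputPath: String, outputPath: String): Unit = {
          // 1. Import: text input becomes a DRM plus BiMaps of external ID <-> ordinal Int ID.
          val (drm, userDictionary, itemDictionary) = readDrmWithDictionaries(inputPath)
          // 2. Compute: run the Spark cooccurrence analysis on the wrapped DRM.
          val indicatorMatrix = cooccurrence(drm)
          // 3. Export: write the indicator matrix, translating ordinal IDs back to external item IDs.
          writeDrmWithDictionary(indicatorMatrix, itemDictionary, outputPath)
        }
      }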

      Opinions?

        Activity

        pferrel Pat Ferrel added a comment -

        Progress can be followed here: https://github.com/pferrel/harness

        Nothing much is checked in yet except an option parser dependency in the POM (scopt, MIT license) and a skeleton for the driver.

        ssc Sebastian Schelter added a comment -

        I'm not sure whether it is a good idea to again introduce custom preprocessing code for each algorithm. I think we should rather wait for the generic dataframe that we are aiming to build and integrate the cooccurrence code with that. I have suggested merging the IndexedDataset functionality with a dataframe.

        pferrel Pat Ferrel added a comment - - edited

        Agreed (mostly); only the CLI is custom for each algo.

        The preprocessor was a remnant of your old example patch and isn't meant to be repeated. I'm not planning to have separate code for every algo at all; in fact it should be quite the opposite. There will be a custom CLI for each algo and one of a couple of customizable but general-purpose importer/exporters (text delimited, sequence file?) with some method of specifying input and output schema.

        The IndexedDataset would be identical in structure in all cases. It should have some of the IndexedDataset improvements (mostly the BiMaps) today, and I'm willing to merge them with some other dataframe in the future.

        What I am doing is exactly what we agreed to in MAHOUT-1518. There is another Jira about dataframes, but I wasn't aware of any progress made on it. I don't want to "wait"; I only have limited windows of time, and if I wait I may get nothing done. And I could use this right now to rebuild the solr recommender and the other Mahout recommenders. This work seems at worst independent of some other R-like dataframe, or at best can be integrated as that solidifies.

        In the meantime, any suggestion about using another effort, like some usable dataframe-ish object, is fine. I had thought we'd convinced ourselves that the needs of an R-like dataframe and an import/export IndexedDataset were too different. Dmitriy certainly made strong arguments to that effect.

        I'm just using the cooccurrence analysis to have an end-to-end example.

        BTW do we really need to support sequencefiles where the legacy code does?

        tdunning Ted Dunning added a comment -

        BTW do we really need to support sequencefiles where the legacy code does?

        I sincerely hope not.

        pferrel Pat Ferrel added a comment -

        Good. The only reason to do this that I can think of is so it will fit into existing workflows, and the legacy code fits there already.

        The first version of this will be for text-delimited import/export; I assume other formats, like JSON, may be nice. Any guidance here would be appreciated.

        kanjilal Saikat Kanjilal added a comment -

        Pat,
        Just one comment on the "no progress around the dataframes JIRA": I assume you are referring to MAHOUT-1490. There is indeed quite a bit of progress presenting APIs around a set of generic operations on a dataframe. Based on Dmitriy's recommendation I took the path of creating a proposal rather than blasting off and writing code that would be heavily criticized and not meet committable expectations; this way the design will be in place and have general consensus before any coding efforts begin. I'd love to get feedback from you and others to move 1490 along, so please see the blog post and comment on the JIRA if you'd like.

        Regards

        pferrel Pat Ferrel added a comment - - edited

        For something as complicated as an r-like dataframe that's a good approach and I did read it.

        The sole reason for IndexedDataset in my use is import/export. You'll see the code in my github in a few days. If the needs match I'll be happy to merge IndexedDataset and/or this driver and import code with whatever comes out of 1490.

        For now I have an actual need for this code in the solr-recommender running on the demo site and the import/export code will have minimal impact on the internals of IndexedDataset so I'm going with it for now only for expediency. There is no need for slices by label or the like so there should be little duplicated work.

        The IndexedDataset is defined as:

        /**
         * Comments: Wraps a Mahout DrmLike object and includes two BiMaps to store translation
         *   dictionaries. This may be replaced with a Mahout DSL dataframe-like object in the future.
         *   The primary use of this is for import and export, keeping track of external IDs and
         *   preserving them all the way to output.
         *
         * Example: For a transpose job the 'matrix: DrmLike[Int]' is passed into the DSL code
         *   that transposes the values, then the dictionaries are swapped and a new
         *   IndexedDataset is returned from the job, which will be exported to files using
         *   the labels.reverse(ID: Int) thereby preserving the external ID.
         *
         * @param matrix  DrmLike[Int], representing the distributed matrix storing the actual data.
         * @param rowLabels BiMap[String, Int] storing a bidirectional mapping of external String ID to
         *                  and from the ordinal Mahout Int ID. This one holds row labels
         * @param columnLabels BiMap[String, Int] storing a bidirectional mapping of external String
         *                  ID to and from the ordinal Mahout Int ID. This one holds column labels
         *
         * @return
         */
        
        case class IndexedDataset(matrix: DrmLike[Int], rowLabels: BiMap[String,Int], columnLabels: BiMap[String,Int])
        

        Note the BiMaps are actually Guava's Java HashBiMaps. That will be sufficient for my current needs.
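
        For illustration, a short sketch of the transpose example from the comment block above, using Guava's actual BiMap API (inverse() is Guava's name for the reverse view; the transposed-IndexedDataset line is only indicative):

        // Sketch of the transpose example above, using Guava's HashBiMap directly.
        import com.google.common.collect.{BiMap, HashBiMap}

        val rowLabels: BiMap[String, Int] = HashBiMap.create()
        val columnLabels: BiMap[String, Int] = HashBiMap.create()
        rowLabels.put("user1", 0)
        columnLabels.put("itemA", 0)

        // After transposing the wrapped DrmLike[Int], the dictionaries simply swap roles:
        // val transposed = IndexedDataset(matrix.t, rowLabels = columnLabels, columnLabels = rowLabels)

        // On export, the reverse view of the BiMap recovers the external ID for a Mahout Int ID:
        val externalRowId = columnLabels.inverse().get(0)   // "itemA" becomes the row label after transpose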

        Note that the cooccurrence driver is a proposed template for CLI drivers in general. The code is being designed to work for any CLI access to Mahout-Spark. I'll have it running on the demo site's solr recommender as soon as it's tested out and before any official Mahout commit, so there is plenty of time to give opinions.

        pferrel Pat Ferrel added a comment -

        I have a partial implementation of the CLI driver, Importer, IndexedDataset and some tests.

        The design objective is to support all Mahout V2 CLIs for black-box-type jobs, similar to the legacy CLI, but allow more flexible text-file import/export that maintains user-specified IDs.

        Anyone interested please take a look at the github repo and its wiki here: https://github.com/pferrel/harness/wiki

        I would greatly appreciate comments. This is a very early version and was shamelessly stolen from some examples Sebastian provided. It does actually run the cross-cooccurrence Spark code and display example output. It reads from a text-delimited file but there is only console output at present. Most options are not implemented yet because I'd like to get feedback now.

        pferrel Pat Ferrel added a comment -

        The basic import/cooccurrence/export pipeline is working on a local Spark version. It works for self- and cross-cooccurrence, and user-specified IDs are preserved. There is a proposed format for DRM-like things encoded as text-delimited files and a way to change the schema for it.

        By no means have all the option combos been tested yet.

        It would be great if someone could take a look at it.
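
        For illustration only, a rough sketch of how the import side might assign ordinal Mahout IDs while reading delimited lines; the tab delimiter and column order are assumptions here, since the real schema is meant to be configurable, and the real reader would do this with Spark rather than a single-process loop:

        // Rough sketch of ID translation during import; delimiter and column order are
        // assumptions, and this is a single-process loop rather than the real Spark reader.
        import com.google.common.collect.{BiMap, HashBiMap}

        val rowIds: BiMap[String, Int] = HashBiMap.create()
        val columnIds: BiMap[String, Int] = HashBiMap.create()

        // Assign the next ordinal Int ID to an unseen external ID, or look up an existing one.
        def ordinal(dictionary: BiMap[String, Int], externalId: String): Int = {
          if (!dictionary.containsKey(externalId)) dictionary.put(externalId, dictionary.size())
          dictionary.get(externalId)
        }

        // Each line is assumed to look like "user1\titemA\t1.0"
        def parseLine(line: String): (Int, Int, Double) = {
          val fields = line.split("\t")
          (ordinal(rowIds, fields(0)), ordinal(columnIds, fields(1)), fields(2).toDouble)
        }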

        pferrel Pat Ferrel added a comment -

        More progress: it reads and writes. A CLI option parser is taking just about all the params needed, and some minimal validation and consistency checking has been started.

        The structure of IndexedDataset and the 'Store' types are ready for some early review. The idea is that Stores create IndexedDataset(s) by reading text files and write them out too; a rough sketch of the pattern is below. The design allows for non-text read/write but nothing is implemented for this yet. There is also an example driver that may be split into generic and job-specific parts, but I haven't tackled this yet.
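
        A simplified, compilable sketch of that Store pattern (the real traits in the PR further down also carry a SparkContext and a proper Schema type, and target IndexedDataset rather than plain lines; names here are illustrative, not the committed API):

        // Simplified sketch of the Store idea: readers produce a value from text files and
        // writers put one back, each governed by a schema. A Map stands in for the Schema type.
        object StoreSketch {
          type Schema = Map[String, String]

          trait Reader[T] {
            def readSchema: Schema
            protected def reader(schema: Schema, source: String): T
            def readFrom(source: String): T = reader(readSchema, source)
          }

          trait Writer[T] {
            def writeSchema: Schema
            protected def writer(schema: Schema, dest: String, collection: T): Unit
            def writeTo(collection: T, dest: String): Unit = writer(writeSchema, dest, collection)
          }

          // A concrete text-delimited Store would mix both in and target IndexedDataset;
          // here it just targets Seq[String] to keep the sketch self-contained.
          class TextLineStore(val readSchema: Schema, val writeSchema: Schema)
              extends Reader[Seq[String]] with Writer[Seq[String]] {
            protected def reader(schema: Schema, source: String): Seq[String] =
              scala.io.Source.fromFile(source).getLines().toSeq
            protected def writer(schema: Schema, dest: String, collection: Seq[String]): Unit = {
              val out = new java.io.PrintWriter(dest)
              try collection.foreach(out.println) finally out.close()
            }
          }
        }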

        The read/write have not been tested on a cluster yet.

        Feedback is appreciated, especially where noted on the github description. https://github.com/pferrel/harness/wiki

        pferrel Pat Ferrel added a comment -

        The current design ends up with some sorta-ugly code, but I think the file schema can be factored out to simplify things. Hold off on comments until I have a chance to do this.

        githubbot ASF GitHub Bot added a comment -

        GitHub user pferrel opened a pull request:

        https://github.com/apache/mahout/pull/11

        Mahout 1541

        MAHOUT-1541 WIP, need access to an RDD, which used to be in a DrmLike[Int]. Still need the DrmLike to pass in to cooccurrence but need the RDD to write a text file. CheckpointedDrmSpark doesn't have the DrmLike?

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/pferrel/mahout mahout-1541

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/mahout/pull/11.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #11


        commit 107a0ba9605241653a85b113661a8fa5c055529f
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T19:54:22Z

        added Sebastian's CooccurrenceAnalysis patch updated it to use current Mahout-DSL

        commit 74b9921c4c9bd8903585bbd74d9e66298ea8b7a0
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T20:09:07Z

        Adding stuff for itemsimilarity driver for Spark

        commit a59265931ed3a51ba81e1a0a7171ebb102be4fa4
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T20:13:13Z

        added scopt to pom deps

        commit 16c03f7fa73c156859d1dba3a333ef9e8bf922b0
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T21:32:18Z

        added Sebastian's MurmurHash changes

        Signed-off-by: pferrel <pat@occamsmachete.com>

        commit 8a4b4347ddb7b9ac97590aa20189d89d8a07a80a
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T21:33:11Z

        Merge branch 'mahout-1464' into mahout-1541

        commit 2f87f5433f90fa2c49ef386ca245943e1fc73beb
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-05T01:44:16Z

        MAHOUT-1541 still working on this, some refactoring in the DSL for abstracting away Spark has moved access to rdds; no Jira is closed yet

        commit c6adaa44c80bba99d41600e260bbb1ad5c972e69
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-05T16:52:23Z

        MAHOUT-1464 import cleanup, minor changes to examples for running on Spark Cluster

        commit 2caceab31703ed214c1e66d5fc63b8bdb05d37a3
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-05T16:55:09Z

        Merge branch 'mahout-1464' into mahout-1541


        pferrel Pat Ferrel added a comment -

        The mahout-1541 branch in the above PR currently does not build because of core DSL changes since it was written.

        The root issue is that I need DrmLike[Int]s to pass to CooccurrenceAnalysis.cooccurrence. The CLI reads these from text files. To read text files I need a SparkContext, which is not a problem. But to write the output I need an RDD. The DrmLikes that I was keeping used to contain the RDD but no longer do. A question that is blocking at present is:

        1) What is the preferred way to pass around an object with both a DrmLike interface and an Rdd reference? Should I create this or does it already exist? I've looked in the PDF doc and the scaladocs but couldn't find the answer. Therefore I'll create something unless someone has better advice.
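
        For reference, the route the later commits take is drmWrap (see the commit note further down). A rough sketch of the wrap/unwrap round trip; package paths and signatures here are from memory and should be checked against the current sparkbindings:

        // Rough sketch only: drmWrap and CheckpointedDrmSpark paths/signatures are from memory
        // and should be verified against the current sparkbindings code.
        import org.apache.mahout.math.Vector
        import org.apache.mahout.math.drm._
        import org.apache.mahout.sparkbindings._
        import org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark
        import org.apache.spark.rdd.RDD

        // Wrap an RDD of (Int key, Mahout Vector) rows as a DrmLike[Int] for the cooccurrence call.
        def toDrm(rows: RDD[(Int, Vector)]): DrmLike[Int] = drmWrap(rows)

        // Recover an RDD for text output by checkpointing and reaching into the Spark-side impl.
        def toRdd(drm: DrmLike[Int]): RDD[(Int, Vector)] =
          drm.checkpoint().asInstanceOf[CheckpointedDrmSpark[Int]].rdd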

        pferrel Pat Ferrel added a comment -

        The existing itemsimilarity outputs indicators, one per row. This is more compact but doesn't seem like the ideal way to get them. The indicators will likely be associated in rows with a specific item ID. Therefore I've implemented the Spark version to output what amounts to a sparse matrix with external IDs intact. A row will (if no formatting params are specified) look like this:

        user1<tab>user2:strength1,user3:strength2
        user2<tab>user5:strength3,user6:strength4,...

        To make this directly ingestible by Solr or ElasticSearch we probably need to drop the strength values via a CLI option and maybe support named output formats like CSV, JSON. CSV can be done with the current options if the strengths are dropped.
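
        A small sketch of that row formatting with the strength-dropping idea as a boolean option; the delimiters match the example above but are meant to be configurable, and the option name is hypothetical:

        // Sketch of formatting one output row; the omitStrengths flag stands in for the
        // proposed CLI option for Solr/ElasticSearch ingestion.
        def formatRow(rowId: String,
                      indicators: Seq[(String, Double)],
                      omitStrengths: Boolean): String = {
          val values =
            if (omitStrengths) indicators.map { case (id, _) => id }
            else indicators.map { case (id, strength) => s"$id:$strength" }
          rowId + "\t" + values.mkString(",")
        }

        // formatRow("user1", Seq("user2" -> 1.7, "user3" -> 0.3), omitStrengths = false)
        //   == "user1\tuser2:1.7,user3:0.3"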

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel closed the pull request at:

        https://github.com/apache/mahout/pull/11

        githubbot ASF GitHub Bot added a comment -

        GitHub user pferrel opened a pull request:

        https://github.com/apache/mahout/pull/22

        MAHOUT-1541

        Not ready for merge, just looking for a checkpoint review. The code works and I'm reasonably happy with it. It needs review for:

        • Design: will it be flexible, are there better ways to organize the code?
        • Scala usage: I'm pretty new to Scala.
        • Package structure: no clue here, so it's all in "drivers"

        Please don't take the tests too seriously, and some things are not implemented. I'd like a once-over before I do a bunch of test cases and finish polishing up.

        This will close three tickets when complete. MAHOUT-1541, MAHOUT-1568, and MAHOUT-1569

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/pferrel/mahout mahout-1541

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/mahout/pull/22.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #22


        commit 107a0ba9605241653a85b113661a8fa5c055529f
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T19:54:22Z

        added Sebastian's CooccurrenceAnalysis patch updated it to use current Mahout-DSL

        commit 74b9921c4c9bd8903585bbd74d9e66298ea8b7a0
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T20:09:07Z

        Adding stuff for itemsimilarity driver for Spark

        commit a59265931ed3a51ba81e1a0a7171ebb102be4fa4
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T20:13:13Z

        added scopt to pom deps

        commit 16c03f7fa73c156859d1dba3a333ef9e8bf922b0
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T21:32:18Z

        added Sebastian's MurmurHash changes

        Signed-off-by: pferrel <pat@occamsmachete.com>

        commit 8a4b4347ddb7b9ac97590aa20189d89d8a07a80a
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T21:33:11Z

        Merge branch 'mahout-1464' into mahout-1541

        commit 2f87f5433f90fa2c49ef386ca245943e1fc73beb
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-05T01:44:16Z

        MAHOUT-1541 still working on this, some refactoring in the DSL for abstracting away Spark has moved access to rdds; no Jira is closed yet

        commit c6adaa44c80bba99d41600e260bbb1ad5c972e69
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-05T16:52:23Z

        MAHOUT-1464 import cleanup, minor changes to examples for running on Spark Cluster

        commit 2caceab31703ed214c1e66d5fc63b8bdb05d37a3
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-05T16:55:09Z

        Merge branch 'mahout-1464' into mahout-1541

        commit 6df6a54e3ff174d39bd817caf7d16c2d362be3f8
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-07T20:39:25Z

        Merge branch 'master' into mahout-1541

        commit a2f84dea3f32d3df3e98c61f085bc1fabd453551
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-07T21:27:06Z

        drmWrap seems to be the answer to the changed DrmLike interface. Code works again but more to do.

        commit d3a2ba5027436d0abef67a1a5e82557064f4ba49
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-17T16:00:38Z

        merged master, got new cooccurrence code

        commit 4b2fb07b21a8ac2d532ee51b65b27d1482293cb0
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-19T17:08:02Z

        for high level review, not ready for merge

        commit 996ccfb82a8ed3ff90f51968e661b2449f3c4759
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-19T17:46:23Z

        for high level review, not ready for merge. changed to dot notation

        commit f62ab071869ee205ad398a3e094d871138e11a9e
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-19T18:13:44Z

        for high level review, not ready for merge. fixed a couple scaladoc refs


        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on the pull request:

        https://github.com/apache/mahout/pull/22#issuecomment-46598676

        Docs written describing this here: https://github.com/pferrel/harness/wiki

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on the pull request:

        https://github.com/apache/mahout/pull/22#issuecomment-46598913

        BTW untested on a cluster. Still trying to get mine back working.

        githubbot ASF GitHub Bot added a comment -

        Github user dlyubimov commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r13987402

        — Diff: spark/src/main/scala/org/apache/mahout/cf/examples/Recommendations.scala —
        @@ -0,0 +1,172 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.cf.examples
        +
        +import scala.io.Source
        +import org.apache.mahout.math._
        +import scalabindings._
        +import RLikeOps._
        +import drm._
        +import RLikeDrmOps._
        +import org.apache.mahout.sparkbindings._
        +
        +import org.apache.mahout.cf.CooccurrenceAnalysis._
        +import scala.collection.JavaConversions._
        +
        +/**
        + * The Epinions dataset contains ratings from users to items and a trust-network between the users.
        + * We use co-occurrence analysis to compute "users who like these items, also like that items" and
        + * "users who trust these users, like that items"
        + *
        + * Download and unpack the dataset files from:
        + *
        + * http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2
        + * http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2
        + **/
        +object RunCrossCooccurrenceAnalysisOnEpinions {
        +
        + def main(args: Array[String]): Unit = {
        +
        + if (args.length == 0) {
        +   println("Usage: RunCooccurrenceAnalysisOnMovielens1M <path-to-dataset-folder>")
        +   println("Download the dataset from http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2 and")
        +   println("http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2")
        +   sys.exit(-1)
        + }
        +
        + val datasetDir = args(0)
        +
        + val epinionsRatings = new SparseMatrix(49290, 139738)
        +
        + var firstLineSkipped = false
        + for (line <- Source.fromFile(datasetDir + "/ratings_data.txt").getLines()) {
        + if (line.contains(' ') && firstLineSkipped) {
        +   val tokens = line.split(' ')
        +   val userID = tokens(0).toInt - 1
        +   val itemID = tokens(1).toInt - 1
        +   val rating = tokens(2).toDouble
        +   epinionsRatings(userID, itemID) = rating
        + }
        + firstLineSkipped = true
        + }
        +
        + val epinionsTrustNetwork = new SparseMatrix(49290, 49290)
        + firstLineSkipped = false
        + for (line <- Source.fromFile(datasetDir + "/trust_data.txt").getLines()) {
        + if (line.contains(' ') && firstLineSkipped) {
        +   val tokens = line.trim.split(' ')
        +   val userID = tokens(0).toInt - 1
        +   val trustedUserId = tokens(1).toInt - 1
        +   epinionsTrustNetwork(userID, trustedUserId) = 1
        + }
        + firstLineSkipped = true
        + }
        +
        + System.setProperty("spark.kryo.referenceTracking", "false")
        + System.setProperty("spark.kryoserializer.buffer.mb", "100")
        +/* to run on local, can provide number of core by changing to local[4] */
        + implicit val distributedContext = mahoutSparkContext(masterUrl = "local", appName = "MahoutLocalContext",
        + customJars = Traversable.empty[String])
        +
        + /* to run on a Spark cluster provide the Spark Master URL
        + implicit val distributedContext = mahoutSparkContext(masterUrl = "spark://occam4:7077", appName = "MahoutClusteredContext",
        + customJars = Traversable.empty[String])
        — End diff –

        If no custom jars are used, the parameter does not have to be there. Scala default parameters are a pretty useful alternative to tons of overloaded functions.
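
        For illustration, the default-parameter pattern being referred to, with generic names rather than the actual mahoutSparkContext signature:

        // Generic illustration of default parameters standing in for overloads; this is not
        // the actual mahoutSparkContext signature.
        def makeContext(masterUrl: String,
                        appName: String,
                        customJars: Traversable[String] = Traversable.empty): String =
          s"$masterUrl/$appName with ${customJars.size} extra jar(s)"

        // Callers with no custom jars can simply omit the argument:
        makeContext("local", "MahoutLocalContext")
        makeContext("spark://host:7077", "MahoutClusteredContext", Seq("extra.jar"))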

        githubbot ASF GitHub Bot added a comment -

        Github user dlyubimov commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r13987451

        — Diff: spark/src/main/scala/org/apache/mahout/cf/examples/Recommendations.scala —
        @@ -0,0 +1,172 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.cf.examples
        +
        +import scala.io.Source
        +import org.apache.mahout.math._
        +import scalabindings._
        +import RLikeOps._
        +import drm._
        +import RLikeDrmOps._
        +import org.apache.mahout.sparkbindings._
        +
        +import org.apache.mahout.cf.CooccurrenceAnalysis._
        +import scala.collection.JavaConversions._
        +
        +/**
        + * The Epinions dataset contains ratings from users to items and a trust-network between the users.
        + * We use co-occurrence analysis to compute "users who like these items, also like that items" and
        + * "users who trust these users, like that items"
        + *
        + * Download and unpack the dataset files from:
        + *
        + * http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2
        + * http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2
        + **/
        +object RunCrossCooccurrenceAnalysisOnEpinions {
        +
        + def main(args: Array[String]): Unit = {
        +
        + if (args.length == 0) {
        +   println("Usage: RunCooccurrenceAnalysisOnMovielens1M <path-to-dataset-folder>")
        +   println("Download the dataset from http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2 and")
        +   println("http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2")
        +   sys.exit(-1)
        + }
        +
        + val datasetDir = args(0)
        +
        + val epinionsRatings = new SparseMatrix(49290, 139738)
        +
        + var firstLineSkipped = false
        + for (line <- Source.fromFile(datasetDir + "/ratings_data.txt").getLines()) {
        + if (line.contains(' ') && firstLineSkipped) {
        +   val tokens = line.split(' ')
        +   val userID = tokens(0).toInt - 1
        +   val itemID = tokens(1).toInt - 1
        +   val rating = tokens(2).toDouble
        +   epinionsRatings(userID, itemID) = rating
        + }
        + firstLineSkipped = true
        + }
        +
        + val epinionsTrustNetwork = new SparseMatrix(49290, 49290)
        + firstLineSkipped = false
        + for (line <- Source.fromFile(datasetDir + "/trust_data.txt").getLines()) {
        + if (line.contains(' ') && firstLineSkipped) {
        +   val tokens = line.trim.split(' ')
        +   val userID = tokens(0).toInt - 1
        +   val trustedUserId = tokens(1).toInt - 1
        +   epinionsTrustNetwork(userID, trustedUserId) = 1
        + }
        + firstLineSkipped = true
        + }
        +
        + System.setProperty("spark.kryo.referenceTracking", "false")
        + System.setProperty("spark.kryoserializer.buffer.mb", "100")
        +/* to run on local, can provide number of core by changing to local[4] */
        + implicit val distributedContext = mahoutSparkContext(masterUrl = "local", appName = "MahoutLocalContext",
        + customJars = Traversable.empty[String])
        — End diff –

        again, no need to specify customJars if no jars are added.

        githubbot ASF GitHub Bot added a comment -

        Github user dlyubimov commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r13987539

        — Diff: spark/src/main/scala/org/apache/mahout/cf/examples/Recommendations.scala —
        @@ -0,0 +1,172 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.cf.examples
        +
        +import scala.io.Source
        +import org.apache.mahout.math._
        +import scalabindings._
        +import RLikeOps._
        +import drm._
        +import RLikeDrmOps._
        +import org.apache.mahout.sparkbindings._
        +
        +import org.apache.mahout.cf.CooccurrenceAnalysis._
        +import scala.collection.JavaConversions._
        +
        +/**
        + * The Epinions dataset contains ratings from users to items and a trust-network between the users.
        + * We use co-occurrence analysis to compute "users who like these items, also like that items" and
        + * "users who trust these users, like that items"
        + *
        + * Download and unpack the dataset files from:
        + *
        + * http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2
        + * http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2
        + **/
        +object RunCrossCooccurrenceAnalysisOnEpinions {
        +
        + def main(args: Array[String]): Unit = {
        +
        + if (args.length == 0) {
        +   println("Usage: RunCooccurrenceAnalysisOnMovielens1M <path-to-dataset-folder>")
        +   println("Download the dataset from http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2 and")
        +   println("http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2")
        +   sys.exit(-1)
        + }
        +
        + val datasetDir = args(0)
        +
        + val epinionsRatings = new SparseMatrix(49290, 139738)
        +
        + var firstLineSkipped = false
        + for (line <- Source.fromFile(datasetDir + "/ratings_data.txt").getLines()) {
        + if (line.contains(' ') && firstLineSkipped) {
        +   val tokens = line.split(' ')
        +   val userID = tokens(0).toInt - 1
        +   val itemID = tokens(1).toInt - 1
        +   val rating = tokens(2).toDouble
        +   epinionsRatings(userID, itemID) = rating
        + }
        + firstLineSkipped = true
        + }
        +
        + val epinionsTrustNetwork = new SparseMatrix(49290, 49290)
        + firstLineSkipped = false
        + for (line <- Source.fromFile(datasetDir + "/trust_data.txt").getLines()) {
        + if (line.contains(' ') && firstLineSkipped) {
        +   val tokens = line.trim.split(' ')
        +   val userID = tokens(0).toInt - 1
        +   val trustedUserId = tokens(1).toInt - 1
        +   epinionsTrustNetwork(userID, trustedUserId) = 1
        + }
        + firstLineSkipped = true
        + }
        +
        + System.setProperty("spark.kryo.referenceTracking", "false")
        + System.setProperty("spark.kryoserializer.buffer.mb", "100")
        +/* to run on local, can provide number of core by changing to local[4] */
        + implicit val distributedContext = mahoutSparkContext(masterUrl = "local", appName = "MahoutLocalContext",
        + customJars = Traversable.empty[String])
        +
        + /* to run on a Spark cluster provide the Spark Master URL
        + implicit val distributedContext = mahoutSparkContext(masterUrl = "spark://occam4:7077", appName = "MahoutClusteredContext",
        + customJars = Traversable.empty[String])
        +*/
        + val drmEpinionsRatings = drmParallelize(epinionsRatings, numPartitions = 2)
        — End diff –

        num partitions here – is it a "magic" number, or do you know it is enough for this particular dataset and the dataset cannot change?

        githubbot ASF GitHub Bot added a comment -

        Github user dlyubimov commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r13987730

        — Diff: spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala —
        @@ -0,0 +1,222 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.drivers
        +
        +import org.apache.spark.SparkContext._
        +import org.apache.mahout.math.RandomAccessSparseVector
        +import org.apache.spark.SparkContext
+import com.google.common.collect.{BiMap, HashBiMap}
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.drm.{CheckpointedDrm, DrmLike}
        +import org.apache.mahout.sparkbindings._
        +
        +
        +/** Reader trait is abstract in the sense that the reader function must be defined by an extending trait, which also defines the type to be read.
        + * @tparam T type of object read, usually supplied by an extending trait.
        + * @todo the reader need not create both dictionaries but does at present. There are cases where one or the other dictionary is never used so saving the memory for a very large dictionary may be worth the optimization to specify which dictionaries are created.
        + */
+trait Reader[T] {
+  val mc: SparkContext
+  val readSchema: Schema
+  protected def reader(mc: SparkContext, readSchema: Schema, source: String): T
+  def readFrom(source: String): T = reader(mc, readSchema, source)
+}

        +
        +/** Writer trait is abstract in the sense that the writer method must be supplied by an extending trait, which also defines the type to be written.
        + * @tparam T
        + */
+trait Writer[T] {
+  val mc: SparkContext
+  val writeSchema: Schema
+  protected def writer(mc: SparkContext, writeSchema: Schema, dest: String, collection: T): Unit
+  def writeTo(collection: T, dest: String) = writer(mc, writeSchema, dest, collection)
+}

        +
        +/** Extends Reader trait to supply the [[org.apache.mahout.drivers.IndexedDataset]] as the type read and a reader function for reading text delimited files as described in the [[org.apache.mahout.drivers.Schema]]
        + */
        +trait TDIndexedDatasetReader extends Reader[IndexedDataset]{
        + /** Read in text delimited tuples from all URIs in this comma delimited source String.
        + *
        + * @param mc context for the Spark job
        + * @param readSchema describes the delimiters and positions of values in the text delimited file.
        + * @param source comma delimited URIs of text files to be read into the [[org.apache.mahout.drivers.IndexedDataset]]
        + * @return
        + */
        + protected def reader(mc: SparkContext, readSchema: Schema, source: String): IndexedDataset = {
        + try {
        + val delimiter = readSchema("delim").asInstanceOf[String]
        + val rowIDPosition = readSchema("rowIDPosition").asInstanceOf[Int]
        + val columnIDPosition = readSchema("columnIDPosition").asInstanceOf[Int]
        + val filterPosition = readSchema("filterPosition").asInstanceOf[Int]
        + val filterBy = readSchema("filter").asInstanceOf[String]
        + //instance vars must be put into locally scoped vals when used in closures that are
        + //executed but Spark
        +
+ assert(!source.isEmpty, {
+   println(this.getClass.toString + ": has no files to read")
+   throw new IllegalArgumentException
+ })
+
+ var columns = mc.textFile(source).map({ line => line.split(delimiter) })
+
+ columns = columns.filter({ tokens => tokens(filterPosition) == filterBy })
+
+ val interactions = columns.map({ tokens => tokens(rowIDPosition) -> tokens(columnIDPosition) })
+
+ interactions.cache()
+
+ val rowIDs = interactions.map({ case (rowID, _) => rowID }).distinct().collect()
+ val columnIDs = interactions.map({ case (_, columnID) => columnID }).distinct().collect()
+
+ val numRows = rowIDs.size
+ val numColumns = columnIDs.size
+
+ val rowIDDictionary = asOrderedDictionary(rowIDs)
+ val rowIDDictionary_bcast = mc.broadcast(rowIDDictionary)
+
+ val columnIDDictionary = asOrderedDictionary(columnIDs)
+ val columnIDDictionary_bcast = mc.broadcast(columnIDDictionary)
+
+ val indexedInteractions =
+   interactions.map({ case (rowID, columnID) =>
+     val rowIndex = rowIDDictionary_bcast.value.get(rowID).get
+     val columnIndex = columnIDDictionary_bcast.value.get(columnID).get
+
+     rowIndex -> columnIndex
+   }).groupByKey().map({ case (rowIndex, columnIndexes) =>
        — End diff –

Here and elsewhere: if it is a compound closure, the parentheses can be omitted:

interactions.map {
  case (rowID, columnID) =>
    val rowIndex = rowIDDictionary_bcast.value.get(rowID).get
    ...
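
For reference, a self-contained sketch of that brace-only style (the IDs and plain Maps below are made up for illustration; in the PR the lookups go through the broadcast BiMap dictionaries shown in the diff above):

    object BraceClosureSketch {
      // Hypothetical external IDs and the integer indexes assigned to them.
      val interactions = Seq("u1" -> "i1", "u2" -> "i2")
      val rowIDDictionary = Map("u1" -> 0, "u2" -> 1)
      val columnIDDictionary = Map("i1" -> 0, "i2" -> 1)

      // The compound closure is passed with braces only, no wrapping parentheses.
      val indexedInteractions = interactions.map {
        case (rowID, columnID) =>
          val rowIndex = rowIDDictionary(rowID)
          val columnIndex = columnIDDictionary(columnID)
          rowIndex -> columnIndex
      }
    }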

        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user dlyubimov commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r13987777

        — Diff: spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala —
        @@ -0,0 +1,222 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.drivers
        +
        +import org.apache.spark.SparkContext._
        +import org.apache.mahout.math.RandomAccessSparseVector
        +import org.apache.spark.SparkContext
+import com.google.common.collect.{BiMap, HashBiMap}
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.drm.{CheckpointedDrm, DrmLike}
        +import org.apache.mahout.sparkbindings._
        +
        +
        +/** Reader trait is abstract in the sense that the reader function must be defined by an extending trait, which also defines the type to be read.
        + * @tparam T type of object read, usually supplied by an extending trait.
        + * @todo the reader need not create both dictionaries but does at present. There are cases where one or the other dictionary is never used so saving the memory for a very large dictionary may be worth the optimization to specify which dictionaries are created.
        + */
        +trait Reader[T]{
        — End diff –

        here and elsewhere: style spacing
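
For illustration only, my reading of the spacing comment (the stand-in Schema class is hypothetical, just so the snippet compiles on its own):

    // Hypothetical stand-in for the PR's Schema class.
    class Schema

    // As written in the diff: no space before the opening brace.
    trait Reader[T]{
      val readSchema: Schema
    }

    // With the conventional spacing.
    trait SpacedReader[T] {
      val readSchema: Schema
    }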

        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user dlyubimov commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r13987854

        — Diff: spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala —
        @@ -0,0 +1,222 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.drivers
        +
        +import org.apache.spark.SparkContext._
        +import org.apache.mahout.math.RandomAccessSparseVector
        +import org.apache.spark.SparkContext
+import com.google.common.collect.{BiMap, HashBiMap}
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.drm.{CheckpointedDrm, DrmLike}
        +import org.apache.mahout.sparkbindings._
        +
        +
        +/** Reader trait is abstract in the sense that the reader function must be defined by an extending trait, which also defines the type to be read.
        + * @tparam T type of object read, usually supplied by an extending trait.
        + * @todo the reader need not create both dictionaries but does at present. There are cases where one or the other dictionary is never used so saving the memory for a very large dictionary may be worth the optimization to specify which dictionaries are created.
        + */
        +trait Reader[T]{
        — End diff –

Also, it is a matter of convention, but I usually prefer to put public traits and classes each in their own file. One may argue "old habits die hard", but I still think it is the right thing to do for publicly visible things.
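
As an illustration of that convention (the file paths below are hypothetical and not what the PR ultimately uses), each public trait would live in its own source file:

    // File: spark/src/main/scala/org/apache/mahout/drivers/Reader.scala
    package org.apache.mahout.drivers
    trait Reader[T] { /* readSchema, reader(), readFrom() */ }

    // File: spark/src/main/scala/org/apache/mahout/drivers/Writer.scala
    package org.apache.mahout.drivers
    trait Writer[T] { /* writeSchema, writer(), writeTo() */ }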

        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user dlyubimov commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r13988052

        — Diff: spark/src/main/scala/org/apache/mahout/drivers/Schema.scala —
        @@ -0,0 +1,29 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.drivers
        +
        +import scala.collection.mutable.HashMap
        +
        +/** Syntactic sugar for HashMap[String, Any]
        + *
+ * @param params list of mappings for instantiation {{{val mySchema = new Schema("one" -> 1, "two" -> "2", ...)}}}
        + */
        +class Schema(params: Tuple2[String, Any]*) extends HashMap[String, Any] {
        — End diff –

Hm, I'd still rather delegate than extend.

        Then you can either expose parameters as a public value; or, assuming you just want to be able to write schema.any-MapLike-method, you can provide an explicit conversion from Schema to MapLike.
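
A minimal sketch of the delegation alternative, assuming the goal is just to keep map-style access at the call sites (the field name asMap and the conversion below are illustrative, not the PR's final design):

    import scala.collection.mutable.HashMap
    import scala.language.implicitConversions

    // Schema holds the map as a field instead of extending HashMap.
    class Schema(params: (String, Any)*) {
      val asMap: HashMap[String, Any] = HashMap(params: _*)
    }

    object Schema {
      // Conversion in the companion object keeps map-like calls working at the call site.
      implicit def schemaToMap(schema: Schema): HashMap[String, Any] = schema.asMap
    }

    object SchemaSketch {
      // Reads like a map, but Schema is no longer a HashMap subtype.
      val mySchema = new Schema("delim" -> "\t", "rowIDPosition" -> 0)
      val delimiter = mySchema("delim").asInstanceOf[String]
    }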

        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user dlyubimov commented on the pull request:

        https://github.com/apache/mahout/pull/22#issuecomment-46606053

        General note: a lot of style problems.
Code lines should not exceed 120 characters (I think I saw some suspiciously long ones).

There is definitely a lack of comments.

FYI, the Spark comment style is the following: every comment starts with a capital letter and is formatted to cut off at the 100th character. They are very draconian about it, and that's what I followed here as well.

Granted, the 100th-character cutoff is questionable since it cannot be auto-formatted in IDEA, but I do believe comments need some justification applied on the right.

For closure stacks, I'd suggest the following comment/formatting style:

val b = A
  // I want to map
  .map { tuple => .... }
  // I want to filter
  .filter { tuple => .... }

So it would be useful to state in plain words what each closure is meant to accomplish, since closures are functional units, i.e. first-class citizens, and as such, IMO, deserve some explanation.
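
For example, applied to the kind of tuple pipeline in the PR, that style might look like the sketch below (the input lines and field positions are made up to keep it self-contained):

    object ClosureStackStyleSketch {
      // Hypothetical input lines and schema positions.
      val lines = Seq("u1\tpurchase\ti1", "u2\tview\ti2")
      val delimiter = "\t"
      val rowIDPosition = 0
      val filterPosition = 1
      val columnIDPosition = 2
      val filterBy = "purchase"

      val interactions = lines
        // Split each text line into fields using the delimiter.
        .map { line => line.split(delimiter) }
        // Keep only the tuples whose action matches the filter.
        .filter { tokens => tokens(filterPosition) == filterBy }
        // Reduce each line to a (rowID, columnID) pair of external IDs.
        .map { tokens => tokens(rowIDPosition) -> tokens(columnIDPosition) }
    }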

        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r14001263

        — Diff: spark/src/main/scala/org/apache/mahout/cf/examples/Recommendations.scala —
        @@ -0,0 +1,172 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.cf.examples
        +
        +import scala.io.Source
        +import org.apache.mahout.math._
        +import scalabindings._
        +import RLikeOps._
        +import drm._
        +import RLikeDrmOps._
        +import org.apache.mahout.sparkbindings._
        +
        +import org.apache.mahout.cf.CooccurrenceAnalysis._
        +import scala.collection.JavaConversions._
        +
        +/**
        + * The Epinions dataset contains ratings from users to items and a trust-network between the users.
        + * We use co-occurrence analysis to compute "users who like these items, also like that items" and
        + * "users who trust these users, like that items"
        + *
        + * Download and unpack the dataset files from:
        + *
        + * http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2
        + * http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2
        + **/
        +object RunCrossCooccurrenceAnalysisOnEpinions {
        +
        + def main(args: Array[String]): Unit = {
        +
+ if (args.length == 0) {
+   println("Usage: RunCooccurrenceAnalysisOnMovielens1M <path-to-dataset-folder>")
+   println("Download the dataset from http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2 and")
+   println("http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2")
+   sys.exit(-1)
+ }

        +
        + val datasetDir = args(0)
        +
        + val epinionsRatings = new SparseMatrix(49290, 139738)
        +
        + var firstLineSkipped = false
        + for (line <- Source.fromFile(datasetDir + "/ratings_data.txt").getLines()) {
+ if (line.contains(' ') && firstLineSkipped) {
+   val tokens = line.split(' ')
+   val userID = tokens(0).toInt - 1
+   val itemID = tokens(1).toInt - 1
+   val rating = tokens(2).toDouble
+   epinionsRatings(userID, itemID) = rating
+ }

        + firstLineSkipped = true
        + }
        +
        + val epinionsTrustNetwork = new SparseMatrix(49290, 49290)
        + firstLineSkipped = false
        + for (line <- Source.fromFile(datasetDir + "/trust_data.txt").getLines()) {
+ if (line.contains(' ') && firstLineSkipped) {
+   val tokens = line.trim.split(' ')
+   val userID = tokens(0).toInt - 1
+   val trustedUserId = tokens(1).toInt - 1
+   epinionsTrustNetwork(userID, trustedUserId) = 1
+ }

        + firstLineSkipped = true
        + }
        +
        + System.setProperty("spark.kryo.referenceTracking", "false")
        + System.setProperty("spark.kryoserializer.buffer.mb", "100")
        +/* to run on local, can provide number of core by changing to local[4] */
        + implicit val distributedContext = mahoutSparkContext(masterUrl = "local", appName = "MahoutLocalContext",
        + customJars = Traversable.empty[String])
        +
        + /* to run on a Spark cluster provide the Spark Master URL
        + implicit val distributedContext = mahoutSparkContext(masterUrl = "spark://occam4:7077", appName = "MahoutClusteredContext",
        + customJars = Traversable.empty[String])
        +*/
        + val drmEpinionsRatings = drmParallelize(epinionsRatings, numPartitions = 2)
        — End diff –

        Needs to be taken out before merge. Was forced for testing, not production.
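
For context, one way the hard-coded count could be avoided (the extra argument is hypothetical, just to show the idea of making it configurable rather than baked in for one dataset; names are those of the example above):

    // Sketch only: take the partition count from an optional second argument, default to 2.
    val numPartitions = if (args.length > 1) args(1).toInt else 2
    val drmEpinionsRatings = drmParallelize(epinionsRatings, numPartitions)
    val drmEpinionsTrustNetwork = drmParallelize(epinionsTrustNetwork, numPartitions)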

        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r14001316

        — Diff: spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala —
        @@ -0,0 +1,222 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.drivers
        +
        +import org.apache.spark.SparkContext._
        +import org.apache.mahout.math.RandomAccessSparseVector
        +import org.apache.spark.SparkContext
+import com.google.common.collect.{BiMap, HashBiMap}
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.drm.{CheckpointedDrm, DrmLike}
        +import org.apache.mahout.sparkbindings._
        +
        +
        +/** Reader trait is abstract in the sense that the reader function must be defined by an extending trait, which also defines the type to be read.
        + * @tparam T type of object read, usually supplied by an extending trait.
        + * @todo the reader need not create both dictionaries but does at present. There are cases where one or the other dictionary is never used so saving the memory for a very large dictionary may be worth the optimization to specify which dictionaries are created.
        + */
        +trait Reader[T]{
        — End diff –

Yeah, I agree, the files need to be rearranged, but I wanted to get eyes on the design in case there was some deeper need to refactor.

        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r14001554

        — Diff: spark/src/main/scala/org/apache/mahout/cf/examples/Recommendations.scala —
        @@ -0,0 +1,172 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.cf.examples
        +
        +import scala.io.Source
        +import org.apache.mahout.math._
        +import scalabindings._
        +import RLikeOps._
        +import drm._
        +import RLikeDrmOps._
        +import org.apache.mahout.sparkbindings._
        +
        +import org.apache.mahout.cf.CooccurrenceAnalysis._
        +import scala.collection.JavaConversions._
        +
        +/**
        + * The Epinions dataset contains ratings from users to items and a trust-network between the users.
        + * We use co-occurrence analysis to compute "users who like these items, also like that items" and
        + * "users who trust these users, like that items"
        + *
        + * Download and unpack the dataset files from:
        + *
        + * http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2
        + * http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2
        + **/
        +object RunCrossCooccurrenceAnalysisOnEpinions {
        +
        + def main(args: Array[String]): Unit = {
        +
+ if (args.length == 0) {
+   println("Usage: RunCooccurrenceAnalysisOnMovielens1M <path-to-dataset-folder>")
+   println("Download the dataset from http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2 and")
+   println("http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2")
+   sys.exit(-1)
+ }

        +
        + val datasetDir = args(0)
        +
        + val epinionsRatings = new SparseMatrix(49290, 139738)
        +
        + var firstLineSkipped = false
        + for (line <- Source.fromFile(datasetDir + "/ratings_data.txt").getLines()) {
+ if (line.contains(' ') && firstLineSkipped) {
+   val tokens = line.split(' ')
+   val userID = tokens(0).toInt - 1
+   val itemID = tokens(1).toInt - 1
+   val rating = tokens(2).toDouble
+   epinionsRatings(userID, itemID) = rating
+ }

        + firstLineSkipped = true
        + }
        +
        + val epinionsTrustNetwork = new SparseMatrix(49290, 49290)
        + firstLineSkipped = false
        + for (line <- Source.fromFile(datasetDir + "/trust_data.txt").getLines()) {
+ if (line.contains(' ') && firstLineSkipped) {
+   val tokens = line.trim.split(' ')
+   val userID = tokens(0).toInt - 1
+   val trustedUserId = tokens(1).toInt - 1
+   epinionsTrustNetwork(userID, trustedUserId) = 1
+ }

        + firstLineSkipped = true
        + }
        +
        + System.setProperty("spark.kryo.referenceTracking", "false")
        + System.setProperty("spark.kryoserializer.buffer.mb", "100")
        +/* to run on local, can provide number of core by changing to local[4] */
        + implicit val distributedContext = mahoutSparkContext(masterUrl = "local", appName = "MahoutLocalContext",
        + customJars = Traversable.empty[String])
        +
        + /* to run on a Spark cluster provide the Spark Master URL
        + implicit val distributedContext = mahoutSparkContext(masterUrl = "spark://occam4:7077", appName = "MahoutClusteredContext",
        + customJars = Traversable.empty[String])
        — End diff –

OK, as I remember these had to be passed in when I wrote that; my mistake.
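
A sketch of the direction this suggests: read the master URL from the environment instead of editing the source (the SPARK_MASTER_URL variable name is my assumption, not something defined by the example):

    // Sketch only, using the same mahoutSparkContext call as the example above.
    val masterUrl = sys.env.getOrElse("SPARK_MASTER_URL", "local")
    implicit val distributedContext = mahoutSparkContext(masterUrl = masterUrl,
      appName = "MahoutLocalContext", customJars = Traversable.empty[String])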

        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r14001642

        — Diff: spark/src/main/scala/org/apache/mahout/cf/examples/Recommendations.scala —
        @@ -0,0 +1,172 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.cf.examples
        +
        +import scala.io.Source
        +import org.apache.mahout.math._
        +import scalabindings._
        +import RLikeOps._
        +import drm._
        +import RLikeDrmOps._
        +import org.apache.mahout.sparkbindings._
        +
        +import org.apache.mahout.cf.CooccurrenceAnalysis._
        +import scala.collection.JavaConversions._
        +
        +/**
        + * The Epinions dataset contains ratings from users to items and a trust-network between the users.
        + * We use co-occurrence analysis to compute "users who like these items, also like that items" and
        + * "users who trust these users, like that items"
        + *
        + * Download and unpack the dataset files from:
        + *
        + * http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2
        + * http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2
        + **/
        +object RunCrossCooccurrenceAnalysisOnEpinions {
        +
        + def main(args: Array[String]): Unit = {
        +
+ if (args.length == 0) {
+   println("Usage: RunCooccurrenceAnalysisOnMovielens1M <path-to-dataset-folder>")
+   println("Download the dataset from http://www.trustlet.org/datasets/downloaded_epinions/ratings_data.txt.bz2 and")
+   println("http://www.trustlet.org/datasets/downloaded_epinions/trust_data.txt.bz2")
+   sys.exit(-1)
+ }

        +
        + val datasetDir = args(0)
        +
        + val epinionsRatings = new SparseMatrix(49290, 139738)
        +
        + var firstLineSkipped = false
        + for (line <- Source.fromFile(datasetDir + "/ratings_data.txt").getLines()) {
+ if (line.contains(' ') && firstLineSkipped) {
+   val tokens = line.split(' ')
+   val userID = tokens(0).toInt - 1
+   val itemID = tokens(1).toInt - 1
+   val rating = tokens(2).toDouble
+   epinionsRatings(userID, itemID) = rating
+ }

        + firstLineSkipped = true
        + }
        +
        + val epinionsTrustNetwork = new SparseMatrix(49290, 49290)
        + firstLineSkipped = false
        + for (line <- Source.fromFile(datasetDir + "/trust_data.txt").getLines()) {
+ if (line.contains(' ') && firstLineSkipped) {
+   val tokens = line.trim.split(' ')
+   val userID = tokens(0).toInt - 1
+   val trustedUserId = tokens(1).toInt - 1
+   epinionsTrustNetwork(userID, trustedUserId) = 1
+ }

        + firstLineSkipped = true
        + }
        +
        + System.setProperty("spark.kryo.referenceTracking", "false")
        + System.setProperty("spark.kryoserializer.buffer.mb", "100")
        +/* to run on local, can provide number of core by changing to local[4] */
        + implicit val distributedContext = mahoutSparkContext(masterUrl = "local", appName = "MahoutLocalContext",
        + customJars = Traversable.empty[String])
        +
        + /* to run on a Spark cluster provide the Spark Master URL
        + implicit val distributedContext = mahoutSparkContext(masterUrl = "spark://occam4:7077", appName = "MahoutClusteredContext",
        + customJars = Traversable.empty[String])
        +*/
        + val drmEpinionsRatings = drmParallelize(epinionsRatings, numPartitions = 2)
        + val drmEpinionsTrustNetwork = drmParallelize(epinionsTrustNetwork, numPartitions = 2)
        +
        + val indicatorMatrices = cooccurrences(drmEpinionsRatings, randomSeed = 0xdeadbeef,
        + maxInterestingItemsPerThing = 100, maxNumInteractions = 500, Array(drmEpinionsTrustNetwork))
        +
        +/* local storage */
        + RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(0),
        + "/tmp/co-occurrence-on-epinions/indicators-item-item/")
        + RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(1),
        + "/tmp/co-occurrence-on-epinions/indicators-trust-item/")
        +
        +/* To run on HDFS put your path to the data here, example of fully qualified path on my cluster provided
        + RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(0),
        + "hdfs://occam4:54310/user/pat/xrsj/indicators-item-item/")
        + RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(1),
        + "hdfs://occam4:54310/user/pat/xrsj/indicators-trust-item/")
        +*/
        + distributedContext.close()
        +
        + println("Saved indicators to /tmp/co-occurrence-on-epinions/")
        + }
        +}
        +
        +/**
        + * The movielens1M dataset contains movie ratings, we use co-occurrence analysis to compute
        + * "users who like these movies, also like that movies"
        + *
        + * Download and unpack the dataset files from:
        + * http://files.grouplens.org/datasets/movielens/ml-1m.zip
        + */
        +object RunCooccurrenceAnalysisOnMovielens1M {
        +
        + def main(args: Array[String]): Unit = {
        +
+ if (args.length == 0) {
+   println("Usage: RunCooccurrenceAnalysisOnMovielens1M <path-to-dataset-folder>")
+   println("Download the dataset from http://files.grouplens.org/datasets/movielens/ml-1m.zip")
+   sys.exit(-1)
+ }

        +
        + val datasetDir = args(0)
        +
        + System.setProperty("spark.kryo.referenceTracking", "false")
        + System.setProperty("spark.kryoserializer.buffer.mb", "100")
        +
        + implicit val sc = mahoutSparkContext(masterUrl = "local", appName = "MahoutLocalContext",
        + customJars = Traversable.empty[String])
        +
        + System.setProperty("mahout.math.AtA.maxInMemNCol", 4000.toString)
        +
        + val movielens = new SparseMatrix(6040, 3952)
        +
+ for (line <- Source.fromFile(datasetDir + "/ratings.dat").getLines()) {
+   val tokens = line.split("::")
+   val userID = tokens(0).toInt - 1
+   val itemID = tokens(1).toInt - 1
+   val rating = tokens(2).toDouble
+   movielens(userID, itemID) = rating
+ }

        +
        + val drmMovielens = drmParallelize(movielens, numPartitions = 2)
        +
        + val indicatorMatrix = cooccurrences(drmMovielens).head
        +
        + RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrix,
        + "/tmp/co-occurrence-on-movielens/indicators-item-item/")
        +
        + sc.stop()
        +
        + println("Saved indicators to /tmp/co-occurrence-on-movielens/")
        + }
        +}
        +
        +object RecommendationExamplesHelper {
        +
        + def saveIndicatorMatrix(indicatorMatrix: DrmLike[Int], path: String) = {
        + indicatorMatrix.rdd.flatMap({ case (thingID, itemVector) =>
+ for (elem <- itemVector.nonZeroes()) yield { thingID + '\t' + elem.index }

        + })
        + .saveAsTextFile(path)
        + }
        +}
        — End diff –

This file is from Sebastian; it will be updated to use the driver and kept in examples. It was originally from a patch for MAHOUT-1464.

        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on the pull request:

        https://github.com/apache/mahout/pull/22#issuecomment-46756833

        OK. Well, I'll assume the naming conventions are OK and the Scala is reasonable, and start the polish and tests. There are a lot of combinations of options.

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on the pull request:

        https://github.com/apache/mahout/pull/22#issuecomment-47129279

        This is getting close to merge time. There are several test cases, which exercise most of the code in the PR. It is building with tests on the current master.

        Missing but needs to be done before merge:

        • a modified mahout script that runs this.

        Significantly missing, but planned for another PR:

        • Two or more input streams for cross-cooccurrence. This version does a cross-cooccurrence calc by filtering one input tuple stream into two matrices. That allows testing and may be a common use case, but it should not be the only option.
        • Uses HashBiMaps from Guava for all ID management, even when the IDs are Mahout ordinals. All four ID indexes are also created even though, in this case, the external row/user IDs are never used. An optimization would calculate only the dictionaries needed.
        • HashBiMaps are created once and broadcast to the rest of the jobs (a rough sketch of the idea follows below). These are not based on RDDs, so we may want to do something about that in the future. I haven't thought much about this, so suggestions are welcome.
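
        A rough sketch of the broadcast idea mentioned above (the function name and driver-side loop are illustrative, not the PR code): build the BiMap once on the driver, broadcast it, and look IDs up inside closures on the executors.

        ```
        import com.google.common.collect.{BiMap, HashBiMap}
        import org.apache.spark.SparkContext

        def broadcastDictionary(sc: SparkContext, externalIDs: Seq[String]) = {
          // external ID <-> Mahout ordinal, built once on the driver
          val dictionary: BiMap[String, Int] = HashBiMap.create[String, Int]()
          externalIDs.zipWithIndex.foreach { case (id, ordinal) => dictionary.put(id, ordinal) }
          sc.broadcast(dictionary)   // needs HashBiMap registered for serialization (Kryo/Java)
        }

        // inside an RDD transformation, given bcastDict = broadcastDictionary(sc, ids):
        // rdd.map { case (externalID, value) => (bcastDict.value.get(externalID), value) }
        ```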
        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on the pull request:

        https://github.com/apache/mahout/pull/22#issuecomment-47375744

        • pom edited
        • assembly/job.xml added
        • mahout script edited; no mahout.cmd yet, but I will ask someone with a Windows machine to update it.
        • updated MahoutKryoRegistrator to add HashBiMap

        Runs on clustered HDFS + Spark, still using Hadoop 1.2.1. After some more testing, and hopefully some review of the items in this comment, it's about ready to merge.

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r14325027

        — Diff: spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala —
        @@ -0,0 +1,222 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.drivers
        +
        +import org.apache.spark.SparkContext._
        +import org.apache.mahout.math.RandomAccessSparseVector
        +import org.apache.spark.SparkContext
        +import com.google.common.collect.{BiMap, HashBiMap}
        +import scala.collection.JavaConversions._
        +import org.apache.mahout.math.drm.{CheckpointedDrm, DrmLike}
        +import org.apache.mahout.sparkbindings._
        +
        +
        +/** Reader trait is abstract in the sense that the reader function must be defined by an extending trait, which also defines the type to be read.
        + * @tparam T type of object read, usually supplied by an extending trait.
        + * @todo the reader need not create both dictionaries but does at present. There are cases where one or the other dictionary is never used so saving the memory for a very large dictionary may be worth the optimization to specify which dictionaries are created.
        + */
        +trait Reader[T]{
        — End diff –

        I hear you, but maybe a compromise is best in this case. The Reader and Writer traits are core but trivial, so I put them in their own file. The TextDelimited stuff is much more complex but has some one-liner syntactic sugar to make it easier to use, so I put those in another file--all related to reading and writing text-delimited files. That gives some better logical grouping, which can be missing from the one-class-per-file convention in Java.

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/22#discussion_r14325090

        — Diff: spark/src/main/scala/org/apache/mahout/drivers/Schema.scala —
        @@ -0,0 +1,29 @@
        +/*
        + * Licensed to the Apache Software Foundation (ASF) under one or more
        + * contributor license agreements. See the NOTICE file distributed with
        + * this work for additional information regarding copyright ownership.
        + * The ASF licenses this file to You under the Apache License, Version 2.0
        + * (the "License"); you may not use this file except in compliance with
        + * the License. You may obtain a copy of the License at
        + *
        + * http://www.apache.org/licenses/LICENSE-2.0
        + *
        + * Unless required by applicable law or agreed to in writing, software
        + * distributed under the License is distributed on an "AS IS" BASIS,
        + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        + * See the License for the specific language governing permissions and
        + * limitations under the License.
        + */
        +
        +package org.apache.mahout.drivers
        +
        +import scala.collection.mutable.HashMap
        +
        +/** Syntactic sugar for HashMap[String, Any]
        + *
        + * @param params list of mappings for instantiation {{{val mySchema = new Schema("one" -> 1, "two" -> "2", ...)}}}
        + */
        +class Schema(params: Tuple2[String, Any]*) extends HashMap[String, Any] {
        — End diff –

        OK, so create a class that contains an instance of MapLike[String, Any], then use the above constructors and supply a new implicit conversion in the companion object? I've seen this pattern but wasn't sure why to use it. I'll give it a try here, but if you could say why this is good practice I'd appreciate it.
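
        Roughly the shape I understand this to mean (a minimal sketch only; the names and methods here are illustrative, not the final Schema implementation): own the map instead of extending it, and put the convenience constructor in a companion object.

        ```
        import scala.collection.mutable

        class Schema(params: (String, Any)*) {
          // composition instead of extending HashMap[String, Any]
          private val settings = mutable.HashMap[String, Any](params: _*)

          def apply(key: String): Any = settings(key)
          def update(key: String, value: Any): Unit = settings(key) = value
          def get(key: String): Option[Any] = settings.get(key)
        }

        object Schema {
          // lets callers write Schema("one" -> 1, "two" -> "2") without 'new'
          def apply(params: (String, Any)*): Schema = new Schema(params: _*)
        }
        ```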

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on the pull request:

        https://github.com/apache/mahout/pull/22#issuecomment-47697509

        Doing final tests before push, will close MAHOUT-1541, MAHOUT-1568, and MAHOUT-1569

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on the pull request:

        https://github.com/apache/mahout/pull/22#issuecomment-47711165

        And the tests fail. Legacy itemsimilarity is not producing the same results on the same files. Looks like a problem in CooccurrenceAnalysis.cooccurrence. The tests for that class have the wrong values and so pass incorrectly.

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on the pull request:

        https://github.com/apache/mahout/pull/22#issuecomment-47960471

        The discrepancy between Spark and mrlegacy is in the final output value from the legacy code:

        ```
        return 1.0 - 1.0 / (1.0 + logLikelihood);
        ```

        Applying the same transform makes the Spark version produce the same results as mrlegacy, so I'll go with it, but I would love an explanation.
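
        For context, a minimal sketch of how that legacy transform behaves (the function name here is made up for illustration): it maps a non-negative LLR into [0, 1), so a larger LLR yields a score closer to 1.

        ```
        // maps a log-likelihood ratio in [0, +inf) to a similarity score in [0, 1)
        def llrToSimilarity(logLikelihood: Double): Double =
          1.0 - 1.0 / (1.0 + logLikelihood)

        llrToSimilarity(0.0)    // 0.0  -- no evidence of cooccurrence
        llrToSimilarity(9.0)    // 0.9
        llrToSimilarity(9999.0) // ~0.9999 -- approaches but never reaches 1.0
        ```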

        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/mahout/pull/22

        pferrel Pat Ferrel added a comment -

        There is still work to do on this so I'll leave the Jira open for a bit.

        Two issues:

        • Do we care about output as tuples, like the Hadoop version? The Spark version outputs a DRM with app-specific item and user IDs.
        • We need an option to sort the indicators by strength and omit strength values from the output. This will allow the output to be indexed directly by a search engine. One step from logfiles to indexable indicators, can't wait.

        Something on the horizon: the algo for the Epinions data seems to need 5g of Spark executor memory. That seems like a lot and may have to do with the HashBiMaps of IDs that are broadcast to every node. This can be optimized first by not calculating ID dictionaries for users, since they are not in the output. When using legacy Mahout int IDs we don't need any dictionaries at all. I'm also looking at the lifetime of the broadcast vals; the A'A dictionaries may be kept alive while doing the B'A calc--still investigating this.

        Quite a bit more difficult will be a truly scalable treatment of the dictionaries.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2684 (See https://builds.apache.org/job/Mahout-Quality/2684/)
        MAHOUT-1541, MAHOUT-1568, MAHOUT-1569 fixed a build test problem, drivers have an option new to not search for MAHOUT_HOME and SPARK_HOME (pat: rev 32badb1d360ddf514e6b253f2dea9ae7e5df078a)

        • spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
        • spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
        • spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
        tdunning Ted Dunning added a comment -

        The use of the 1/(1+LLR) as a relatedness score by the map-reduce version is an attempt to shoehorn the LLR based cooccurrence into the framework of the other recommenders. That score should normally be ignored in search based recommendation systems.

        pferrel Pat Ferrel added a comment -

        Ah, that makes sense. I'll take it out since the other measures won't be supported (AFAIK). That formulation does seem to invert the meaning of a high score, since LLR is in the denominator.

        BTW I was thinking of implementing something like the old Hadoop all-recs-for-all-users job, since a lot of people seem to use it and it's now almost trivial to support. You can do it in the shell with a few lines of code, and it runs in a few minutes instead of hours.
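
        Something like the following in the Mahout shell, as a rough sketch only (the names and shapes are assumed, and only the A'A case is shown; this is not a tested recipe):

        ```
        import org.apache.mahout.math.drm._
        import RLikeDrmOps._

        // drmA: users x items interaction DRM
        // drmIndicators: items x items indicator DRM from cooccurrence analysis (A'A)
        // One distributed multiply scores every item for every user: users x items.
        val drmAllRecs = drmA %*% drmIndicators

        // A per-user top-N pass (e.g. via mapBlock) over the rows would finish the job.
        ```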

        hudson Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2686 (See https://builds.apache.org/job/Mahout-Quality/2686/)
        MAHOUT-1541 backed out compatability with legacy Item Similarity, now outputs raw LLR scores (pat: rev 24cb5576f720737b73906ebb15be486d540ac629)

        • spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
        • spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
        • spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
        hudson Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2688 (See https://builds.apache.org/job/Mahout-Quality/2688/)
        MAHOUT-1541, MAHOUT-1568, added option to ItemSimilarityDriver to allow output that is directly search engine indexable, also some default schema's for input and output of TDF tuples and DRMs (pat: rev 9bfb767323833586873272af4db446f68f357f1f)

        • spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
        • spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
        • spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
        • spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
        githubbot ASF GitHub Bot added a comment -

        GitHub user pferrel opened a pull request:

        https://github.com/apache/mahout/pull/31

        MAHOUT-1541

        Covers MAHOUT-1464, and MAHOUT-1541

        Fixes cross-cooccurrence, which was originally computed in the wrong order. If A is the primary interaction matrix, then the cross-cooccurrence should be A'B, but it was B'A. Also added tests for A and B of different column cardinality, which is allowed.
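
        As a dimensional sanity check (a sketch with assumed shapes, not code from the PR): if A is users x itemsA and B is users x itemsB, then A'B comes out itemsA x itemsB, i.e. keyed by the primary items.

        ```
        // drmA: users x itemsA (primary interactions)
        // drmB: users x itemsB (secondary interactions)
        val crossIndicators = drmA.t %*% drmB   // itemsA x itemsB, rows keyed by primary items
        // drmB.t %*% drmA would instead be itemsB x itemsA -- the wrong orientation
        ```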

        I looked a bit deeper regarding the @dev thread on changing the cardinality of a sparse matrix. This is needed in ItemSimilarity because only once all cross-interaction matrices have been read in can the true cardinality of each be known. In the much more typical case the cardinalities can simply be computed from the input. So there needs to be a method to update the cardinality after the matrices have been read in.

        This implementation creates a new abstract method in CheckpointedDrm and implementation in CheckpointedDrmSpark. Here is the reasoning.

        1) For sparse DRMs there is no need for any representation of an empty row (or column); not even the keys need to be known, only the cardinality. You only have to think about a transpose of sparse vectors to see that this must be so. Further, it works, and I've relied on it since the Hadoop mr version. Barring any revelation from the math gods--it is so.

        2) rbind semantics apply to dense matrices. This IMO should be avoided in this case because even if we rejigger rbind to only change the cardinality without inserting real rows, it would seem to violate its semantics. Sparse matrices don't fit the default R semantics in a few areas (in my non-expert opinion) and this is one. Unless someone feels strongly, it will be abstract in CheckpointedDrm and implemented as CheckpointedDrmSpark#addToRowCardinality(n: Int): Unit. Creating an op that returns a new CheckpointedDrm is also possible if there is something unsafe about my implementation, but rbind?

        3) I have implemented this so that there is no call to drm.nrow, either to read or to modify it. So it will remain lazily evaluated until needed by other math.

        4) ItemSimilarity for the A’B case now passes several asymmetric input cases and outputs the correct external IDs.

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/pferrel/mahout mahout-1541

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/mahout/pull/31.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #31


        commit 107a0ba9605241653a85b113661a8fa5c055529f
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T19:54:22Z

        added Sebastian's CooccurrenceAnalysis patch updated it to use current Mahout-DSL

        commit 74b9921c4c9bd8903585bbd74d9e66298ea8b7a0
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T20:09:07Z

        Adding stuff for itemsimilarity driver for Spark

        commit a59265931ed3a51ba81e1a0a7171ebb102be4fa4
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T20:13:13Z

        added scopt to pom deps

        commit 16c03f7fa73c156859d1dba3a333ef9e8bf922b0
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T21:32:18Z

        added Sebastian's MurmurHash changes

        Signed-off-by: pferrel <pat@occamsmachete.com>

        commit 8a4b4347ddb7b9ac97590aa20189d89d8a07a80a
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T21:33:11Z

        Merge branch 'mahout-1464' into mahout-1541

        commit 2f87f5433f90fa2c49ef386ca245943e1fc73beb
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-05T01:44:16Z

        MAHOUT-1541 still working on this, some refactoring in the DSL for abstracting away Spark has moved access to rddsno Jira is closed yet

        commit c6adaa44c80bba99d41600e260bbb1ad5c972e69
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-05T16:52:23Z

        MAHOUT-1464 import cleanup, minor changes to examples for running on Spark Cluster

        commit 2caceab31703ed214c1e66d5fc63b8bdb05d37a3
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-05T16:55:09Z

        Merge branch 'mahout-1464' into mahout-1541

        commit 6df6a54e3ff174d39bd817caf7d16c2d362be3f8
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-07T20:39:25Z

        Merge branch 'master' into mahout-1541

        commit a2f84dea3f32d3df3e98c61f085bc1fabd453551
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-07T21:27:06Z

        drmWrap seems to be the answer to the changed DrmLike interface. Code works again but more to do.

        commit d3a2ba5027436d0abef67a1a5e82557064f4ba49
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-17T16:00:38Z

        merged master, got new cooccurrence code

        commit 4b2fb07b21a8ac2d532ee51b65b27d1482293cb0
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-19T17:08:02Z

        for high level review, not ready for merge

        commit 996ccfb82a8ed3ff90f51968e661b2449f3c4759
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-19T17:46:23Z

        for high level review, not ready for merge. changed to dot notation

        commit f62ab071869ee205ad398a3e094d871138e11a9e
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-19T18:13:44Z

        for high level review, not ready for merge. fixed a couple scaladoc refs

        commit cbef0ee6264c28d0597cb2507427a647771c9bcd
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-23T21:49:20Z

        adding tests, had to modify some test framework Scala to make the masterUrl visible to tests

        commit ab8009f6176f0c21a07e15cc5cc8a9717dd7cc4c
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-25T15:41:54Z

        adding more tests for ItemSimilarityDriver

        commit 47258f59df7f215b1bb25830d13d9b85fa8d19e9
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-25T15:44:47Z

        merged master changes and fixed a pom conflict

        commit 9a02e2a5ea8540723c1bfc6ea01b045bb4175922
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-25T16:57:55Z

        remove tmp after all tests, fixed dangling comma in input file list

        commit 3c343ff18600f0a0e59f5bfd63bd86db0db0e8c5
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-26T22:19:48Z

        changes to pom, mahout driver script, and cleaned up help text

        commit 213b18dee259925de82c703451bdea640e1f068e
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-26T22:26:17Z

        added a job.xml assembly for creation of an all-dependencies jar

        commit 627d39f30860e4ab43783c72cc2cf8926060b73c
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-27T16:44:37Z

        registered HashBiMap with JavaSerializer in Kryo

        commit c273dc7de3c740189ce8157b334c2eef3a4c23ea
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-27T21:30:13Z

        increased the default max heep for mahout/JVM to 4g, using max of 4g for Spark executor

        commit 9dd2f2eabf1bf64660de6b5b5e49aafe18229a7a
        Author: Pat Ferrel <pat@farfetchers.com>
        Date: 2014-06-30T17:06:49Z

        tweaking memory requirements to process epinions with the ItemSimilarityDriver

        commit 6ec98f32775c791ee001fc996f475215e427f368
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-30T17:08:49Z

        refactored to use a DistributedContext instead of raw SparkContext

        commit 48774e154a6e55e04037c787f8d64bc9e545f1bd
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-30T17:08:59Z

        merging changes made on to run a large dataset through itemsimilarity on the cluster

        commit 8e70091a564c8464ea70bf90006d8124c3a7f208
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-30T20:11:42Z

        fixed a bug, SparkConf in driver was ignored and blank one passed in to create a DistributedContext

        commit 01a0341f56071d2244aabd6de8c6f528ad35b164
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-30T20:33:39Z

        added option for configuring Spark executor memory

        commit 2d9efd73def8207dded5cd1dd8699035a8cc1b34
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-30T22:37:19Z

        removed some outdated examples

        commit 9fb281022cba7666dd26701b3d97d200b13c35f8
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-07-01T18:17:42Z

        test naming and pom changed to up the jvm heap max to 512m for scalatests

        commit 674c9b7862f0bd0723de026eb4527546b52e8a0b
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-07-01T18:18:59Z

        Merge branch 'mahout-1541' of https://github.com/pferrel/mahout into mahout-1541


        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on the pull request:

        https://github.com/apache/mahout/pull/31#issuecomment-49362805

        BTW I also moved CF into math-scala, leaving the tests in spark. There may be a way to move those into math-scala with the new engine-neutral test stuff; I'll look into that before I merge.

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/31#discussion_r15150741

        — Diff: spark/src/main/scala/org/apache/mahout/drivers/IndexedDataset.scala —
        @@ -39,14 +40,29 @@ import org.apache.mahout.math.drm.CheckpointedDrm

        • to be not created when not needed.
          */

        -case class IndexedDataset(matrix: CheckpointedDrm[Int], rowIDs: BiMap[String,Int], columnIDs: BiMap[String,Int]) {
        +case class IndexedDataset(var matrix: CheckpointedDrm[Int], var rowIDs: BiMap[String,Int], var columnIDs: BiMap[String,Int]) {
        +
        + // we must allow the row dimension to be adjusted in the case where the data read in is incomplete and we
        + // learn this afterwards
        +
        + /**
        + * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the row cardinality value.
        + * No physical changes are made to the underlying drm.
        + * @param n number to increase row carnindality by
        + * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
        + * results.
        + */
        + def addToRowCardinality(n: Int): Unit = {
        +   assert(n > -1)
        +   matrix.asInstanceOf[CheckpointedDrmSpark[Int]].addToRowCardinality(n)
        + }
        }
        — End diff –

        This supports an immutable CheckpointedDrm. It will create a new CheckpointedDrm with row cardinality that leaves some rows with no representation in the underlying rdd. The tests that use this seem to work but it may be an accident of the test data and this may not be a good implementation.

        The question is whether a CheckpointedDrm needs to be backed by an rdd that has an explicit n => {} entry for every all-zero row, or whether the absence of anything in the rdd, in other words a gap in the row key sequence, is equivalent to inserting an n => {}.

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/31#discussion_r15150759

        — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala —
        @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
        private var cached: Boolean = false
        override val context: DistributedContext = rdd.context

        + /**
        + * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the
        + * [[org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark#nrow]] value.
        + * No physical changes are made to the underlying rdd, now blank rows are added as would be done with rbind(blankRows)
        + * @param n number to increase row cardinality by
        + * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
        + * results.
        + */
        + override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
        +   assert(n > -1)
        +   new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
        + }

        — End diff –

        This supports an immutable CheckpointedDrm. It will create a new CheckpointedDrm with row cardinality that leaves some rows with no representation in the underlying rdd. The tests that use this seem to work but it may be an accident of the test data and this may not be a good implementation.

        The question is whether a CheckpointedDrm needs to be backed by an rdd that has an explicit n => {} entry for every all-zero row, or whether the absence of anything in the rdd, in other words a gap in the row key sequence, is equivalent to inserting an n => {}.

        githubbot ASF GitHub Bot added a comment -

        Github user avati commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/31#discussion_r15150836

        — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala —
        @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
        private var cached: Boolean = false
        override val context: DistributedContext = rdd.context

        + /**
        + * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the
        + * [[org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark#nrow]] value.
        + * No physical changes are made to the underlying rdd, now blank rows are added as would be done with rbind(blankRows)
        + * @param n number to increase row cardinality by
        + * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
        + * results.
        + */
        + override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
        +   assert(n > -1)
        +   new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
        + }

        — End diff –

        This fixes the immutability problem, but the missing rows still create the following issues:

        • AewScalar: math errors
        • AewB: java exception
        • CbindAB: java exception

        All three are non-trivial to fix (i.e. no one-liner fixes).

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/31#discussion_r15184775

        — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala —
        @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
        private var cached: Boolean = false
        override val context: DistributedContext = rdd.context

        + /**
        + * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the
        + * [[org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark#nrow]] value.
        + * No physical changes are made to the underlying rdd, now blank rows are added as would be done with rbind(blankRows)
        + * @param n number to increase row cardinality by
        + * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
        + * results.
        + */
        + override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
        +   assert(n > -1)
        +   new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
        + }

        — End diff –

        I see no fundamental reason for these not to work but it may not be part of the DRM contract. So maybe I'll make a feature request Jira to support this.

        In the meantime, rbind will not solve this because A will have missing rows at the end, but B may have them throughout--let alone some future C. So I think reading all data into one drm with one row and column ID space, then chopping it into two or more drms based on column ranges, should give us empty rows where they are needed (I certainly hope so or I'm in trouble). We will have to keep track of which column IDs go in which slice, but that's doable.
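
        A sketch of the "read everything into one DRM, then slice by column ranges" idea (the names, shapes, and column split are assumptions for illustration):

        ```
        import org.apache.mahout.math.scalabindings._
        import RLikeOps._
        import org.apache.mahout.math.drm._
        import RLikeDrmOps._

        // drmAll: users x (itemsA + itemsB), built from one shared row ID space,
        // so both slices keep the same row cardinality and row keys.
        val nA = 500                 // hypothetical number of columns belonging to A
        val nTotal = drmAll.ncol

        val drmA = drmAll.mapBlock(ncol = nA) { case (keys, block) =>
          keys -> block(::, 0 until nA).cloned       // rows with no A interactions stay empty
        }

        val drmB = drmAll.mapBlock(ncol = nTotal - nA) { case (keys, block) =>
          keys -> block(::, nA until nTotal).cloned
        }
        ```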

        githubbot ASF GitHub Bot added a comment -

        Github user dlyubimov commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/31#discussion_r15198464

        — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala —
        @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
        private var cached: Boolean = false
        override val context: DistributedContext = rdd.context

        + /**
        +  * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the
        +  * [[org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark#nrow]] value.
        +  * No physical changes are made to the underlying rdd; no blank rows are added as would be done with rbind(blankRows).
        +  * @param n number to increase row cardinality by
        +  * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
        +  *       results.
        +  */
        + override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
        +   assert(n > -1)
        +   new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
        + }

        — End diff –

        -1 on this PR.

        I am not sure what problem it is solving, but there has to be a different way to solve it. Matlab/R semantics have historically been deemed sufficient for solving algebraic problems, and they did not need this, so neither should we.

        If nothing else, one can always drop down to the RDD level and reformat the RDD content however one likes.

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on a diff in the pull request:

        https://github.com/apache/mahout/pull/31#discussion_r15198911

        — Diff: spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala —
        @@ -46,6 +46,19 @@ class CheckpointedDrmSpark[K: ClassTag](
        private var cached: Boolean = false
        override val context: DistributedContext = rdd.context

        + /**
        +  * Adds the equivalent of blank rows to the sparse CheckpointedDrm, which only changes the
        +  * [[org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark#nrow]] value.
        +  * No physical changes are made to the underlying rdd; no blank rows are added as would be done with rbind(blankRows).
        +  * @param n number to increase row cardinality by
        +  * @note should be done before any BLAS optimizer actions are performed on the matrix or you'll get unpredictable
        +  *       results.
        +  */
        + override def addToRowCardinality(n: Int): CheckpointedDrm[K] = {
        +   assert(n > -1)
        +   new CheckpointedDrmSpark[K](rdd, nrow + n, ncol, _cacheStorageLevel)
        + }

        — End diff –

        You are looking at old code; this is not meant to be merged.

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel closed the pull request at:

        https://github.com/apache/mahout/pull/31

        githubbot ASF GitHub Bot added a comment -

        GitHub user pferrel opened a pull request:

        https://github.com/apache/mahout/pull/36

        MAHOUT-1541

        Parts of this address MAHOUT-1541, MAHOUT-1568, and MAHOUT-1569

        The previous merge of MAHOUT-1541 primarily supported A'A; this merge supports A'B as well, with all features. There is a lot of refactoring and there are new tests for A and B with different cardinalities and different item ID spaces. The forced cardinality matching was taken out of the math side and put into the data-prep part. This means passing an nrow to drmWrap that may be larger than the actual number of rows embodied in the DRM/RDD. I've added tests for B.t %*% A as well as for the actual driver in these cases (missing-row cases).
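
        As a reference point only, a minimal sketch of the "pass an nrow to drmWrap" idea, assuming drmWrap takes (rdd, nrow, ncol) as described above; bRdd, rowsInA, colsInB and drmA are hypothetical names from the data-prep step, not the committed code:

        import org.apache.mahout.math.drm.{DrmLike, RLikeDrmOps}
        import RLikeDrmOps._
        import org.apache.mahout.sparkbindings._

        // Sketch: wrap B claiming A's row count so the cross-cooccurrence product conforms.
        def crossProductSketch(bRdd: DrmRdd[Int], rowsInA: Int, colsInB: Int, drmA: DrmLike[Int]) = {
          val drmB = drmWrap(bRdd, rowsInA, colsInB)  // nrow may exceed the rows physically present in bRdd
          drmB.t %*% drmA                             // B' A with matching row cardinality
        }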

        The full epinions cross-cooccurrence can't be completed on a single machine; it fails with a Java out-of-heap exception, so I'm now testing it on a cluster. The A'A cooccurrence does complete on a single machine.

        One known improvement is to skip the dictionaries when they are not needed and to look at replacing the Guava HashBiMap with a minimal Scala version. This version uses dictionaries for IDs even when the input already uses Mahout sequential int IDs.
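
        For anyone unfamiliar with the dictionary role described here, it is just a bidirectional map between application IDs and Mahout's internal int indices. A tiny illustration using the Guava HashBiMap named above (the IDs are made up):

        import com.google.common.collect.HashBiMap

        val itemIDs = HashBiMap.create[String, Integer]()
        itemIDs.put("iphone", 0)                 // external application ID -> internal Mahout index
        itemIDs.put("ipad", 1)
        val internal = itemIDs.get("ipad")       // 1, used for the DRM row/column index
        val external = itemIDs.inverse().get(1)  // "ipad", reattached when writing output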

        MAHOUT-1568: Proposed standards for text versions of DRM-ish output. These preserve the application's IDs while using Mahout IDs internally; in other words, the output carries application IDs. There are several configurable readers and writers of TD (text-delimited) files. Reading tuples into a DRM is implemented, and writing a DRM-ish TD file is also implemented.
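
        Purely to illustrate "the output carries application IDs", a row of such a DRM-ish text-delimited file could look something like the line below, with the key and the (item, strength) pairs all using application IDs. The actual delimiters and columns are whatever the configured schema specifies, so treat this as a shape rather than the spec:

        iphone<tab>ipad:4.7 galaxy:3.1 nexus:2.2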

        MAHOUT-1569: There is a refactored MahoutOptionParser and MahoutDriver with some default behavior that should make creating drivers a bit easier and DRYer than the last merge. The options are being proposed as standards across all drivers so we have one way to specify the formats of input/output files and other common options.
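
        To show the flavor of a scopt-based parser with shared options, here is a small sketch; the flag names and case-class fields below are illustrative only, not the actual MahoutOptionParser standards being proposed:

        import scopt.OptionParser

        case class DriverOptions(input: String = "", output: String = "", master: String = "local")

        object OptionParserSketch {
          def main(args: Array[String]): Unit = {
            val parser = new OptionParser[DriverOptions]("spark-itemsimilarity") {
              head("spark-itemsimilarity")
              opt[String]('i', "input") required() action { (x, c) => c.copy(input = x) } text ("comma-delimited input paths")
              opt[String]('o', "output") required() action { (x, c) => c.copy(output = x) } text ("output directory")
              opt[String]("master") action { (x, c) => c.copy(master = x) } text ("Spark master URL, defaults to local")
            }
            parser.parse(args, DriverOptions()) match {
              case Some(options) => println(s"would run the driver with $options")
              case None => sys.exit(1)  // scopt has already printed usage and the error
            }
          }
        }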

        This is for comment; I won't merge until several larger datasets are working on a cluster.

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/pferrel/mahout mahout-1541

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/mahout/pull/36.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #36


        commit 107a0ba9605241653a85b113661a8fa5c055529f
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T19:54:22Z

        added Sebastian's CooccurrenceAnalysis patch updated it to use current Mahout-DSL

        commit 74b9921c4c9bd8903585bbd74d9e66298ea8b7a0
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T20:09:07Z

        Adding stuff for itemsimilarity driver for Spark

        commit a59265931ed3a51ba81e1a0a7171ebb102be4fa4
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T20:13:13Z

        added scopt to pom deps

        commit 16c03f7fa73c156859d1dba3a333ef9e8bf922b0
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T21:32:18Z

        added Sebastian's MurmurHash changes

        Signed-off-by: pferrel <pat@occamsmachete.com>

        commit 8a4b4347ddb7b9ac97590aa20189d89d8a07a80a
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-04T21:33:11Z

        Merge branch 'mahout-1464' into mahout-1541

        commit 2f87f5433f90fa2c49ef386ca245943e1fc73beb
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-05T01:44:16Z

        MAHOUT-1541 still working on this, some refactoring in the DSL for abstracting away Spark has moved access to rdds; no Jira is closed yet

        commit c6adaa44c80bba99d41600e260bbb1ad5c972e69
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-05T16:52:23Z

        MAHOUT-1464 import cleanup, minor changes to examples for running on Spark Cluster

        commit 2caceab31703ed214c1e66d5fc63b8bdb05d37a3
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-05T16:55:09Z

        Merge branch 'mahout-1464' into mahout-1541

        commit 6df6a54e3ff174d39bd817caf7d16c2d362be3f8
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-07T20:39:25Z

        Merge branch 'master' into mahout-1541

        commit a2f84dea3f32d3df3e98c61f085bc1fabd453551
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-07T21:27:06Z

        drmWrap seems to be the answer to the changed DrmLike interface. Code works again but more to do.

        commit d3a2ba5027436d0abef67a1a5e82557064f4ba49
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-17T16:00:38Z

        merged master, got new cooccurrence code

        commit 4b2fb07b21a8ac2d532ee51b65b27d1482293cb0
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-19T17:08:02Z

        for high level review, not ready for merge

        commit 996ccfb82a8ed3ff90f51968e661b2449f3c4759
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-19T17:46:23Z

        for high level review, not ready for merge. changed to dot notation

        commit f62ab071869ee205ad398a3e094d871138e11a9e
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-19T18:13:44Z

        for high level review, not ready for merge. fixed a couple scaladoc refs

        commit cbef0ee6264c28d0597cb2507427a647771c9bcd
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-23T21:49:20Z

        adding tests, had to modify some test framework Scala to make the masterUrl visible to tests

        commit ab8009f6176f0c21a07e15cc5cc8a9717dd7cc4c
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-25T15:41:54Z

        adding more tests for ItemSimilarityDriver

        commit 47258f59df7f215b1bb25830d13d9b85fa8d19e9
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-25T15:44:47Z

        merged master changes and fixed a pom conflict

        commit 9a02e2a5ea8540723c1bfc6ea01b045bb4175922
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-25T16:57:55Z

        remove tmp after all tests, fixed dangling comma in input file list

        commit 3c343ff18600f0a0e59f5bfd63bd86db0db0e8c5
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-26T22:19:48Z

        changes to pom, mahout driver script, and cleaned up help text

        commit 213b18dee259925de82c703451bdea640e1f068e
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-26T22:26:17Z

        added a job.xml assembly for creation of an all-dependencies jar

        commit 627d39f30860e4ab43783c72cc2cf8926060b73c
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-27T16:44:37Z

        registered HashBiMap with JavaSerializer in Kryo

        commit c273dc7de3c740189ce8157b334c2eef3a4c23ea
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-27T21:30:13Z

        increased the default max heep for mahout/JVM to 4g, using max of 4g for Spark executor

        commit 9dd2f2eabf1bf64660de6b5b5e49aafe18229a7a
        Author: Pat Ferrel <pat@farfetchers.com>
        Date: 2014-06-30T17:06:49Z

        tweaking memory requirements to process epinions with the ItemSimilarityDriver

        commit 6ec98f32775c791ee001fc996f475215e427f368
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-30T17:08:49Z

        refactored to use a DistributedContext instead of raw SparkContext

        commit 48774e154a6e55e04037c787f8d64bc9e545f1bd
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-30T17:08:59Z

        merging changes made on to run a large dataset through itemsimilarity on the cluster

        commit 8e70091a564c8464ea70bf90006d8124c3a7f208
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-30T20:11:42Z

        fixed a bug, SparkConf in driver was ignored and blank one passed in to create a DistributedContext

        commit 01a0341f56071d2244aabd6de8c6f528ad35b164
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-30T20:33:39Z

        added option for configuring Spark executor memory

        commit 2d9efd73def8207dded5cd1dd8699035a8cc1b34
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-06-30T22:37:19Z

        removed some outdated examples

        commit 9fb281022cba7666dd26701b3d97d200b13c35f8
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-07-01T18:17:42Z

        test naming and pom changed to up the jvm heap max to 512m for scalatests

        commit 674c9b7862f0bd0723de026eb4527546b52e8a0b
        Author: pferrel <pat@occamsmachete.com>
        Date: 2014-07-01T18:18:59Z

        Merge branch 'mahout-1541' of https://github.com/pferrel/mahout into mahout-1541


        githubbot ASF GitHub Bot added a comment -

        Github user pferrel commented on the pull request:

        https://github.com/apache/mahout/pull/36#issuecomment-51114808

        Success on a cluster with cross-cooccurrence using epinions ratings and trust data.

        githubbot ASF GitHub Bot added a comment -

        Github user pferrel closed the pull request at:

        https://github.com/apache/mahout/pull/36

        hudson Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2733 (See https://builds.apache.org/job/Mahout-Quality/2733/)
        MAHOUT-1541, MAHOUT-1568, MAHOUT-1569 refactoring the options parser and option defaults to DRY up individual driver code putting more in base classes, tightened up the test suite with a better way of comparing actual with correct (pat: rev a80974037853c5227f9e5ef1c384a1fca134746e)

        • math-scala/src/main/scala/org/apache/mahout/math/cf/CooccurrenceAnalysis.scala
        • spark/src/main/scala/org/apache/mahout/drivers/ReaderWriter.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/io/MahoutKryoRegistrator.scala
        • spark/src/main/scala/org/apache/mahout/drivers/MahoutOptionParser.scala
        • spark/src/main/scala/org/apache/mahout/drivers/IndexedDataset.scala
        • spark/src/main/scala/org/apache/mahout/drivers/MahoutDriver.scala
        • spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
        • spark/src/main/scala/org/apache/mahout/sparkbindings/drm/CheckpointedDrmSpark.scala
        • spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
        • spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
        • spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
        • spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
        • spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
        hudson Hudson added a comment -

        SUCCESS: Integrated in Mahout-Quality #2779 (See https://builds.apache.org/job/Mahout-Quality/2779/)
        MAHOUT-1604, MAHOUT-1541 changes all reference to positon in the CLI to columns (pat: rev e24c4afb699c2930d372c701fe2de874a2a2f6c0)

        • spark/src/main/scala/org/apache/mahout/drivers/Schema.scala
        • spark/src/test/scala/org/apache/mahout/drivers/ItemSimilarityDriverSuite.scala
        • spark/src/main/scala/org/apache/mahout/drivers/ItemSimilarityDriver.scala
        • spark/src/main/scala/org/apache/mahout/drivers/MahoutOptionParser.scala
        • spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala
        pferrel Pat Ferrel added a comment -

        Seems to work.

        sslavic Stevo Slavic added a comment -

        Bulk closing all 0.10.0 resolved issues


          People

          • Assignee:
            pferrel Pat Ferrel
            Reporter:
            pferrel Pat Ferrel
          • Votes:
            0
            Watchers:
            7
