Mahout / MAHOUT-371

Proposal to implement Distributed SVD++ Recommender using Hadoop


      Description

      Proposal Title: MAHOUT-371 Proposal to implement Distributed SVD++ Recommender using Hadoop

      Student Name: Richard Simon Just

      Student E-mail: info@richardsimonjust.co.uk

      Organization/Project: Apache Mahout

      Assigned Mentor:

      Proposal Abstract:
      During the Netflix Prize challenge one of the most popular forms of recommender algorithm was matrix factorisation, in particular Singular Value Decomposition (SVD). This proposal therefore looks to implement a distributed version of one of the most successful SVD-based recommender algorithms from the Netflix competition: the SVD++ algorithm.
      SVD++ improves upon basic SVD algorithms by incorporating implicit feedback[1]. That is, it is able to take into account not just explicit user preferences, but also feedback such as, in the case of a company like Netflix, whether a movie has been rented. Implicit feedback assumes that the fact that there is some correlation between the user and the item is more important than whether the correlation is positive or negative. Implicit feedback would account for an item having been rated, but not for what the rating was.
      The implementation will include testing, in-depth documentation and a demo/tutorial. If there is time, I will also look at developing the algorithm into the timeSVD++ algorithm[2,3]. timeSVD++ further improves the results of the SVD algorithm by taking temporal dynamics into account, i.e. the way users' preferences for items, and the way they rate items, can change over time. According to [2], the gains in accuracy from implementing timeSVD++ are significantly bigger than the gains from going from SVD to SVD++.
      The overall project will provide three deliverables:
      1. The basic framework for a distributed SVD-based recommender
      2. A distributed SVD++ implementation and demo
      3. A distributed timeSVD++ implementation

      Detailed Description:

      The SVD++ algorithm uses the principle of categorising users and items into factors, combined with regularisation and implicit feedback, to predict how well a user is likely to match an item. Factors are abstract categories created by comparing the data presented; factor values are grades saying how strongly each user or item relates to that category. For example, with the Netflix data a factor could loosely correspond to a movie genre, a director or a story line. The more factors used, the more detailed the categories are likely to become, and thus the more accurate the predictions are likely to become.

      Implicit feedback is based on the theory that a user having any sort of relationship to an item is more important than whether they rated it, or what rating they gave. The assumption is that even if a user does not like an item, or has not rated it, the very fact that they chose to have some interaction with it indicates something about their preferences. In the Netflix case this would be represented by whether a user has rated a movie or not; it could also take into account the user's rental history.
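
      To make the prediction rule concrete: SVD++ [1] estimates a rating as r(u,i) = mu + b_u + b_i + q_i . (p_u + |N(u)|^-1/2 * sum_{j in N(u)} y_j), where p_u and q_i are the user and item factor vectors, the y_j are implicit-feedback factor vectors and N(u) is the set of items the user has interacted with. The following non-distributed sketch illustrates this scoring function; all names are illustrative placeholders rather than existing Mahout classes.

      // Sketch of the SVD++ prediction rule from [1]; all names are placeholders, not Mahout API.
      public final class SvdPlusPlusPredictionSketch {

        static double predict(double mu, double bu, double bi,
                              double[] pu, double[] qi,
                              double[][] y, int[] itemsSeenByUser) {
          int k = pu.length;
          double[] implicitSum = new double[k];
          for (int j : itemsSeenByUser) {                   // N(u): items the user interacted with
            for (int f = 0; f < k; f++) {
              implicitSum[f] += y[j][f];
            }
          }
          double norm = itemsSeenByUser.length > 0
              ? 1.0 / Math.sqrt(itemsSeenByUser.length) : 0.0;
          double dot = 0.0;
          for (int f = 0; f < k; f++) {
            dot += qi[f] * (pu[f] + norm * implicitSum[f]); // q_i . (p_u + implicit part)
          }
          return mu + bu + bi + dot;                        // baseline plus factor interaction
        }
      }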

      As well as the actual implementation of the code, the project has two other deliverable focuses: the readability, documentation and testing of the code; and a full tutorial and demo of the code. Without these the implementation would not really be complete or fully usable.

      The recommender consists of five main parts: the job class that runs the code, an input conversion section, a training section, a re-trainer section and a prediction section. A brief overview of these sections follows.

      The SVD++ Classes:

      The Recommender Job class:
      This class is the foundation of the recommender and allows it to run on Hadoop by implementing the Tool interface through AbstractJob. It will parse any user arguments and set up the jobs that run the algorithm on Map/Reduce, much in the same way as Mahout's other distributed recommenders do, such as RecommenderJob.
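
      A minimal skeleton of such a job class, following the AbstractJob pattern used by RecommenderJob, might look as follows; the phase comments are schematic and the real option handling would be fuller.

      import org.apache.hadoop.util.ToolRunner;
      import org.apache.mahout.common.AbstractJob;

      // Hypothetical driver skeleton; the three phases are shown schematically only.
      public class SVDppRecommenderJob extends AbstractJob {

        @Override
        public int run(String[] args) throws Exception {
          addInputOption();                  // standard --input handling from AbstractJob
          addOutputOption();                 // standard --output handling
          if (parseArguments(args) == null) {
            return -1;
          }
          // Phase 1: convert text preferences into per-user vectors
          // Phase 2: train the factor matrices
          // Phase 3: compute predictions and write recommendations
          return 0;
        }

        public static void main(String[] args) throws Exception {
          ToolRunner.run(new SVDppRecommenderJob(), args);
        }
      }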

      Input Mapper/Reducer Classes:
      The Mapper will take the input data and convert it to key/value pairs in the form of Hadoop Writables. The Reducer will take the Mapper's Writables and create sparse vectors. This is in keeping with Mahout's ToItemPrefsMapper and ToUserVectorReducer.
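
      A rough sketch of this conversion pass, in the spirit of ToItemPrefsMapper and ToUserVectorReducer, is shown below; the class names are placeholders and the exact writable and vector API calls are approximate.

      import java.io.IOException;
      import java.util.Iterator;
      import java.util.regex.Pattern;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.mahout.math.RandomAccessSparseVector;
      import org.apache.mahout.math.VarLongWritable;
      import org.apache.mahout.math.Vector;
      import org.apache.mahout.math.VectorWritable;

      // Mapper: one "userID,itemID,pref" text line -> (userID, single-entry sparse vector).
      class PrefLineMapper extends Mapper<LongWritable, Text, VarLongWritable, VectorWritable> {
        private static final Pattern COMMA = Pattern.compile(",");
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] tokens = COMMA.split(line.toString());
          Vector partial = new RandomAccessSparseVector(Integer.MAX_VALUE, 1);
          partial.set(Integer.parseInt(tokens[1]), Float.parseFloat(tokens[2]));
          ctx.write(new VarLongWritable(Long.parseLong(tokens[0])), new VectorWritable(partial));
        }
      }

      // Reducer: merge the single-entry vectors into one sparse preference vector per user.
      class UserPrefVectorReducer
          extends Reducer<VarLongWritable, VectorWritable, VarLongWritable, VectorWritable> {
        @Override
        protected void reduce(VarLongWritable userID, Iterable<VectorWritable> values, Context ctx)
            throws IOException, InterruptedException {
          Vector userVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
          for (VectorWritable vw : values) {
            Iterator<Vector.Element> it = vw.get().iterateNonZero();
            while (it.hasNext()) {
              Vector.Element e = it.next();
              userVector.set(e.index(), e.get());
            }
          }
          ctx.write(userID, new VectorWritable(userVector));
        }
      }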

      Training Mapper/Reducer Classes:
      This phase creates the factor matrices that will later be used to make predictions. It does this by predicting a known rating using the SVD, comparing the prediction against the known rating, calculating the error and updating the factors accordingly. The Mapper will loop through the training of the factor matrices; the Reducer will collect the Mapper's outputs to create the dense factor matrices.
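
      As an illustration, a single regularised training step against one known rating could look like the sketch below; learningRate and lambda are hypothetical tuning parameters, and the distributed version would perform these updates in the Mapper and merge them in the Reducer.

      // Minimal sketch of one training step: predict, measure the error, update both factor vectors.
      final class TrainingStepSketch {
        static void trainStep(double rating, double[] pu, double[] qi,
                              double learningRate, double lambda) {
          double predicted = 0.0;
          for (int f = 0; f < pu.length; f++) {
            predicted += pu[f] * qi[f];
          }
          double err = rating - predicted;                        // error against the known rating
          for (int f = 0; f < pu.length; f++) {
            double puf = pu[f];
            double qif = qi[f];
            pu[f] += learningRate * (err * qif - lambda * puf);   // regularised user update
            qi[f] += learningRate * (err * puf - lambda * qif);   // regularised item update
          }
        }
      }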

      Re-trainer Mapper/Reducer Classes:
      This is a separate job that would be used to update the factor matrices taking into account any new user ratings. The Mapper/Reducer follow a similar pattern to the training section.

      Prediction Mapper/Reducer Classes:
      This job would take the factor matrices from the training phase and use them in the SVD to create the required predictions. Most likely this would be done for all user/item preferences, to create a dense matrix of all user/item combinations; however, it could potentially also be done for individual users. Just as in training, the Mapper would handle the predictions and the Reducer would collect them into a dense matrix.
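
      At its simplest, the per-user prediction step is a set of dot products against the item factor matrix, roughly as in this non-distributed sketch (placeholder names again).

      // Sketch: score every item for one user from the trained factor matrices.
      final class PredictionSketch {
        static double[] predictRow(double[] userFactors, double[][] itemFactors) {
          double[] scores = new double[itemFactors.length];
          for (int i = 0; i < itemFactors.length; i++) {
            double dot = 0.0;
            for (int f = 0; f < userFactors.length; f++) {
              dot += userFactors[f] * itemFactors[i][f];
            }
            scores[i] = dot;                 // predicted preference of this user for item i
          }
          return scores;
        }
      }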

      Timeline:

      The Warm Up/Bonding Period (<=May 23rd):

      • familiarise myself further with Mahout and Hadoop's code base and documentation
      • discuss with community the proposal, design and implementation as well as related code tests, optimisations and documentation they would like to see incorporated into the project
      • build a more detailed design of algorithm implementation and tweak timeline based on feedback
      • familiarise myself more with unit testing
      • finish building 3-4 node Hadoop cluster and play with all the examples

      Week 1 (May 24th-30th):

      • start writing the backbone of the code in the form of comments and skeleton code
      • implement SVDppRecommenderJob
      • start to integrate DistributedLanczosSolver

      Week 2 (May 31st - June 6th):

      • complete DistributedLanczosSolver integration
      • start implementing distributed training, prediction and regularisation

      Week 3 - 5 (June 7th - 27th):

      • complete implementation of distributed training, prediction and regularisation
      • work on testing, cleaning up code, and tying up any loose documentation ends
      • work on any documentation, tests and optimisation requested by community
      • Deliverable : basic framework for distributed SVD-based recommender

      Week 6 - 7 (June 28th-July 11th):

      • start implementation of SVD++ (keeping documentation and tests up-to-date)
      • prepare demo

      Week 8 (July 12th - 18th): Mid-Term Report by the 16th

      • complete SVD++ and iron out bugs
      • implement and document demo
      • write wiki articles and tutorial for what has been implemented including the demo

      Week 9 (July 19th - 25th):

      • work on any documentation, tests and optimisation requested by community during project
      • work on testing, cleaning up code, and tying up any loose documentation ends
      • Deliverable : Distributed SVD++ Recommender (including Demo)

      Week 10 - 11 (July 26th - Aug 8th):

      • incorporate temporal dynamics
      • write temporal dynamics documentation, including wiki article

      Week 12 (Aug 9th - 15th): Suggested Pencils Down

      • last optimisation and tidy up of code, documentation, tests and demo
      • discuss with community what comes next, consider what JIRA issues to contribute to
      • Deliverable: Distributed SVD++ Recommender with temporal dynamics

      Final Evaluations Hand-in: Aug 16th-20th.

      Additional Information:

      I am currently a final-year undergraduate at the University of Reading, UK, studying BSc Applied Computer Science. As part of this degree I did a 9-month industrial placement programming in Java for an international non-profit organisation. In this final year of study I have been fortunate enough to take modules in Distributed Computing, Evolutionary Computation and Concurrent Systems, which have proven to be the best of my university career. Each module required me to design and implement algorithms of either a distributed or machine-learning nature [A], two in Java, one in C. For my final-year project I have implemented a Gaming Persistence Platform in ActionScript, PHP and MySQL. It allows the user to play computer games integrated with the platform from different devices and/or different software platforms while retaining game progress across all instances. I am looking forward to the day when mobile computing goes distributed.

      I am a passionate user of open source software and now I am looking to become involved as a developer. Having enjoyed my Distributed Computing and Evolutionary Computation modules so much, and after reading all the introductory pages about the ASF I realised Mahout would be a great place to start. After graduation (and GSoC) I hope to continue contributing to Mahout while working in a related field.

      Outside of uni, when I'm not coding, I like to push my limits through meditation, mountaineering and running. I also love to cook!

      After my last exam on May 21st my only commitment over the summer will be GSoC. As such GSoC will be treated like a full-time job.

      [1] - Y. Koren, "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model", ACM Press, 2008, http://public.research.att.com/~volinsky/netflix/kdd08koren.pdf
      [2] - Y. Koren, "Collaborative Filtering with Temporal Dynamics", ACM Press, 2009, http://research.yahoo.com/files/kdd-fp074-koren.pdf
      [3] - R. M. Bell, Y. Koren, and C. Volinsky, "The BellKor 2008 Solution to the Netflix Prize", http://www.netflixprize.com/assets/ProgressPrize2008_BellKor.pdf

      [A] - Example of code I've implemented as part of University degree, http://www.richardsimonjust.com/code/

      Attachments:

      1. svd.zip (11 kB, Richard Simon Just)
      2. svd.zip (21 kB, Laszlo Dosa)


          Activity

          Richard Simon Just added a comment -

          Basic proposal replaced with proposal as submitted to GSoC.

          Any feedback would be greatly appreciated so that I can continue to tweak to proposal.

          Many Thanks
          Richard

          Sean Owen added a comment -

          Looks like this was accepted to GSoC, nice. Let the warmup period begin.
          Incidentally I am located in the UK too – London.

          Richard Simon Just added a comment -

          Awesome! I won't lie, I'm super excited! Thank you!

          Oh you're practically down the road. I'd love to meet up at some point after my exams.

          In the meantime, where do we go from here?
          Cheers
          RSJ

          Sean Owen added a comment -

          Your schedule maps it out well. In the next month, get to where you can start writing code.

          I will send you Chapter 6 of Mahout in Action which explains pretty well the structure of one current distributed recommender implementation. It should walk you through most of your setup steps including setting up Hadoop.

          I'd come up with a mental model of how you'll set up the computation on Hadoop. This is the tricky part and worth talking about on mahout-dev.

          Richard Simon Just added a comment -

          Excellent! I haven't downloaded the latest MEAP version of MiA yet, so that would be great. Not sure if it has changed much, but I will re-read the version I have and start looking at a more detailed design before consulting mahout-dev.

          Richard Simon Just added a comment -

          Two initial skeleton files to give an idea of where I'm heading with the initial implementation.

          Sean Owen added a comment -

          Yep, all seems OK so far. The next key classes to review would be the Mappers and Reducers; the rest is detail which we can easily collaborate on to get right.

          It's OK to jump straight to the implementation if you feel confident about the structure of the computation, but it could well also be useful to review your "detailed design" – just, what are the MRs, what are inputs and outputs. It could save you time, should we all find at that stage that a different structure is called for.

          Richard Simon Just added a comment -

          Basics of the SVDRecommendationMapper. Unless I'm mistaken this won't really need a reducer.

          Sean Owen added a comment -

          This is the final stage, indeed. It might be a Mapper with IdentityReducer, or may slot in as the Reducer phase of some final map-reduce as it is in other implementations.

          To make this work I believe you need GenericRecommendedItem to be Comparable and sort by value, descending. Then the PriorityQueue needs to be passed Collections.reverseOrder() since it needs to reverse this, to be a min-heap. Then at the end the List of results should be sorted and output.
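
          Roughly, that top-N selection could look like the sketch below; here the ordering lives in a local Comparator rather than making GenericRecommendedItem itself Comparable, but the min-heap idea is the same.

          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.Comparator;
          import java.util.List;
          import java.util.PriorityQueue;

          import org.apache.mahout.cf.taste.recommender.RecommendedItem;

          final class TopNSketch {

            static final Comparator<RecommendedItem> BY_VALUE = new Comparator<RecommendedItem>() {
              @Override
              public int compare(RecommendedItem a, RecommendedItem b) {
                return Float.compare(a.getValue(), b.getValue());           // ascending -> min-heap
              }
            };

            static List<RecommendedItem> topN(Iterable<RecommendedItem> candidates, int n) {
              PriorityQueue<RecommendedItem> heap = new PriorityQueue<RecommendedItem>(n + 1, BY_VALUE);
              for (RecommendedItem item : candidates) {
                heap.add(item);
                if (heap.size() > n) {
                  heap.poll();                                              // evict the weakest of the top n
                }
              }
              List<RecommendedItem> result = new ArrayList<RecommendedItem>(heap);
              Collections.sort(result, Collections.reverseOrder(BY_VALUE)); // best first for output
              return result;
            }
          }

          The heap only ever holds the n best candidates seen so far, so memory stays bounded however many items are scored.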

          Richard Simon Just added a comment -

          Updates of DistributedSVDRecommenderJob and SVDPredictionJob

          Richard Simon Just added a comment -

          M/R passes for finding average User pref and average Item prefs

          Sean Owen added a comment -

          Cool, getting somewhere here. My comments:

          • Put the Apache copyright header in all files
          • Change indents to 2-space instead of a tab. You can try running "mvn site" in your copy to run a lot of reports including checkstyle, which will alert you to tiny issues like this
          • I'd not use @SuppressWarnings, and that is probably solved by...
          • Use the newer Hadoop APIs – basically, .mapreduce. instead of .mapred. classes. You can see the other recommender code doing that. I can easily help you migrate.
          • Tiny general point from AvgItemPrefReducer: you can hold that running sum in a double, instead of float. float is used to save storage at the individual preference level but might as well use the usual thing in computations like this. (Also check for count == 0 there!)
          • Might be missing something, but in SVDRecommenderMapper I think you need to reverse the recommendations at the end? They're stored in reverse order in the priority queue, yes, but need to be turned around before outputting.

          The actual Jobs look like the right sort of shell, and start, but they're still needing key steps implemented, if I am reading it correctly?
          One thing that will be very helpful, either in code comments, or comments here, is an overview of the shape of the computation. What are the mapreduces, what are inputs and outputs, how do they link up? A few lines of description each works.

          That would let us all evaluate whether it seems right, whether it can be streamlined, and whether the code does what you're suggesting. You're jumping straight to implementation, which isn't a show-stopper per se, but I want to make sure we're all clear on what you're going for here.

          Richard Simon Just added a comment -

          Cheers for the info. Almost all done.

          Here's a breakdown of the mapreduces.

          The first is the basic input conversion
          IN - text file containing lines in the format "userID,itemID,pref"
          OUT - LongWritable, VectorWritable (ie initial Matrix A containing explicit feedback)

          The second job takes the output of the first, passing it to the DistributedLanczosSolver, which in turn runs M/R to calculate the SVD.
          IN - path to output of input job (Matrix A)
          OUT - sequence file containing IntWritable, VectorWritable pairs representing the SVD vector matrices V and S

          The third job takes the output of the DistributedLanczosSolver and verifies it using the EigenVerificationJob
          IN - path to output of DLS
          OUT - sequence file containing verified IntWritable, VectorWritable pairs representing SVD vector matrix V

          The fourth job takes that output and calculates the predicted pref of each user/item pair using the M/R within the DistributedRowMatrix.timesSquared function.
          IN - path to sequence file of Matrix A and sequence file of verified Matrix V
          OUT - sequence file containing LongWritable, VectorWritable pairs representing the Prediction Matrix

          The final job uses the prediction matrix to determine what each user should be recommended. This consists of the SVDRecommendationMapper and the default reducer, which acts as an identity reducer.
          IN - path to sequence file of the Prediction Matrix
          OUT - sequence file of IntWritable, RecommendedItemsWritable pairs representing user recommendations

          The average user and average item pref mapper/reducers I've attached could technically be placed anywhere in the sequence before the prediction job. However I'm wondering if I could instead use counters to remove the extra M/R passes?
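
          For the overall average, something along these lines might work; it's only a sketch, and since Hadoop counters are global longs the per-user and per-item averages would still need a pass (or could be folded into the existing reducers).

          // Hypothetical counters accumulated in the input-conversion mapper; prefs are scaled
          // to a fixed precision because Hadoop counters only hold longs.
          enum PrefCounters { PREF_SUM_X1000, PREF_COUNT }

          // inside map():
          //   ctx.getCounter(PrefCounters.PREF_SUM_X1000).increment((long) (pref * 1000));
          //   ctx.getCounter(PrefCounters.PREF_COUNT).increment(1);

          // in the driver, after the job completes:
          //   long sum = job.getCounters().findCounter(PrefCounters.PREF_SUM_X1000).getValue();
          //   long count = job.getCounters().findCounter(PrefCounters.PREF_COUNT).getValue();
          //   double overallAvgPref = (sum / 1000.0) / count;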

          Sean Owen added a comment -

          Step 1: The key is the user ID, mapping to a vector of user prefs right? (versus item ID mapping to item prefs) Should be, just checking. I don't know if this is an issue but does DLS want a Long key or Integer key? You'll find out.

          Step 2: To be clear there are two outputs, two output paths, for V and S?

          Step 3: The input is both V and S? Let's pretend I am not entirely familiar with EVJ (because I'm not really) – for the sake of writing out the thinking here for review, what's the purpose of this step if it also outputs V? One more level of pedantic is good here.

          Step 4: This is the heart of it. Let's make sure we have the same understanding of the process – at least, that I understand yours. In my understanding you just reassemble the approximation of A by multiplying together the pruned matrices Uk, Sk, Vk (m x k, k x k, k x n respectively, where k is small and m and n are big). Then you just look through the approximation Ak to discover new values that aren't zero anymore and that make the best recommendations. (Of course, if you want to recommend to just one user, you can run the above but only compute one row of Ak.)

          How's that match your algorithm? I ask because I'm not sure how A, V, or multiplying by a vector squared come into play.

          Up to you about average item pref. I agree the job can be run anywhere, but yes where should it be injected? Probably it's "Step 4.5".

          Top priority is to get something running in a day or two since there's danger of having nothing concrete to evaluate for the mid-term eval, which must be done by the end of the week.

          Richard Simon Just added a comment -

          Step 1: is indeed user ID mapping to user prefs. The long key isn't a problem for DLS at this point.

          Step 2: only one output path, containing a sequence file with IntWritable, VectorWritable pairs representing V. S is encoded within the vectors under the 'name' variable. Below I've quoted Jake's description.

          The V output of EigenVerificationJob and DistributedLanczosSolver is yes, just a SequenceFile<IntWritable,VectorWritable>, where the ints (the keys) are row numbers (which run from 0 up to reducedRank [well, roughly]).

          S, on the other hand... is hackily encoded in the serialized "name" variable of the vector output of EigenVerificationJob.

          Step 3: EVJ is required as the output of DLS can sometimes contain undesirables, i.e. duplicates, vectors that aren't orthogonal enough, etc. The purpose of EVJ is to make sure the output of DLS is as expected. It's really just a cleaning job.

          Step 4: Sounds about right. A and V come into play because...
          1) A = USV'
          2) AV = USV'V = US
          therefore:
          1a) A = (AV)V'

          Looping through V.timesSquared(rowOfA) provides us with function 1a. This allows us to calculate the new A without needing S or U. Which saves us having to calculate U at all.

          Sean Owen added a comment -

          EVJ is picking out the top k eigenvalues – the biggest entries in S? Or am I misunderstanding?

          I truly may have a misunderstanding about how this works – others jump in please to educate me if I'm way off.

          It's my understanding that we wish to calculate what I'll call Ak, which is Uk Sk Vk', where the Uk is just like U but keeping only the first k columns, and likewise for the others.

          1a) seems to be computing A, but we already had A from step 1. It doesn't have any new non-zero entries that constitute recommendations. If you tweak 1a to use the 'k' matrices, you still need Ak, which is the desired end product.

          I had thought the point was that you really do just compute Uk Sk Vk', which isn't slow since each entry requires computing k (where k is quite small) products.

          Richard Simon Just added a comment -

          I must admit I'm now a little uncertain. However my understanding was that the k restriction is handled by 'rank' within the DLS.

          Meaning that 1a would look like Ak = A*Vk*Vk'. Though looking at it like that I'm no longer sure on the math. Jake?

          Ted Dunning added a comment - - edited

          Meaning that 1a would look like Ak = A*Vk*Vk'. Though looking at it like that I'm no longer sure on the math. Jake?

          That is mathematically true, but you would never actually compute Ak directly because it is a huge dense matrix. U_k, S_k and V_k provide a more concise representation.

          What we want for recommendations is to take a user history h that is an item vector, project it into the space spanned by the singular vectors, and then back to items. In user terms, we want to score all users with A h, then find all related items with A' (A h). Using A_k instead of A for the smoothing effect, this is (U_k S_k V'_k)' (U_k S_k V'_k) h = (V_k S_k^2 V'_k) h. We can compute V_k S_k^2 V'_k off-line and keep only large results or we can do it at recommendation time V_k ( S_k^2 (V'_k h)).
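
          In plain arrays, the at-recommendation-time variant would look roughly like this sketch (illustrative names only: Vk is items x factors, s holds the k singular values, h is the user's item-history vector).

          // Sketch of scores = V_k (S_k^2 (V'_k h)); everything here is a placeholder, not Mahout API.
          final class ScoreSketch {
            static double[] score(double[][] Vk, double[] s, double[] h) {
              int n = Vk.length;
              int k = s.length;
              double[] proj = new double[k];           // V'_k h : project the history into factor space
              for (int i = 0; i < n; i++) {
                for (int f = 0; f < k; f++) {
                  proj[f] += Vk[i][f] * h[i];
                }
              }
              for (int f = 0; f < k; f++) {
                proj[f] *= s[f] * s[f];                // scale by S_k^2
              }
              double[] scores = new double[n];         // V_k (S_k^2 V'_k h) : back to item space
              for (int i = 0; i < n; i++) {
                for (int f = 0; f < k; f++) {
                  scores[i] += Vk[i][f] * proj[f];
                }
              }
              return scores;
            }
          }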

          Ted Dunning added a comment -

          Meaning that 1a would look like Ak = A*Vk*Vk'. Though looking at it like that I'm no longer sure on the math. Jake?

          If we expand this,

          A V_k V_k' = U S V' V_k V_k'

          V' V_k isn't quite the same as V'_k V_k because V' has extra rows. If we write V = [ V_k  V_{k+1,n} ], then we see

                   [ V'_k V_k        ]   [ I_k ]
          V' V_k = [                 ] = [     ]
                   [ V'_{k+1,n} V_k  ]   [ 0_k ]

          But

              [ S_k  0          ]
          S = [                 ]
              [ 0    S_{k+1,n}  ]

          So S V' V_k is S with the lower right hand part zeroed out. This means that

          U S V' V_k = U_k S_k V'_k V_k = U_k S_k

          and therefore

          A V_k V'_k = (U S V' V_k) V'_k = U_k S_k V'_k

          as you originally intended.

          (put succinctly, you are correct mathematically)

          Richard Simon Just added a comment -

          That is close to true, but you would never actually compute Ak directly because it is a huge dense matrix. U_k, S_k and V_k provide a more concise representation.

          Good point. I hadn't really thought about just how big that's going to get. Really it would possibly have made more sense to do the prediction and recommendation for one user at a time. That way the predictions wouldn't need to be kept.

          What we want for recommendations is to take a user history h that is an item vector, project it into the space spanned by the singular vectors, and then back to items.

          When we talk of history are we talking implicit feedback? I.e. in a situation where we just have prefs, it would be a set of which items have prefs, but not the prefs themselves.

          Richard Simon Just added a comment -

          Cheers for the breakdown! Great to see it so clearly.

          Sean Owen added a comment -

          Agreed, Ak never is fully computed and stored per se. Individual rows are computed one by one (per user), and bits are extracted as recommendations.

          Explicit/implicit feedback isn't relevant here. He's just saying take the input, wherever it came from, as user/item/pref tuples, and make them into per-user vectors. That's what you do already in step 1.

          Richard Simon Just added a comment -

          Ah of course, sorry.

          jason lopatec added a comment -

          Hi everyone. New to this thread but I've been watching for a while.

          I've implemented the Netflix matrix factorization algorithm in both Matlab and C, so I know it quite well, and I would love to use this distributed version when available.

          One question I've had while watching this thread that I might have missed in previous discussions:

          The gradient descent algorithm used by Simon Funk and others is SVD-like but not uniquely an SVD. The user-item matrix has a number of "?" entries representing locations you do not know but would like to predict. Setting these to any value (such as 0) and solving for an SVD would calculate fine, but it would not be valuable for predictions, as the 0's would then be all your predictions for the places you don't know. It happens that Simon Funk's gradient descent algorithm uses only the sparse entries, and the output it produces happens to be an SVD representation that makes great guesses at the "?" locations (when the products are calculated). Any idea whether the Lanczos Solver will produce similar results? Is there something unique about sparse solutions to SVD that will make these two solutions similar?

          Apologies if this was already covered.


          Richard Simon Just added a comment -

          Hey Jason,

          Great question, and in all honesty I don't have a real answer for you. Though maybe one of the other guys will?

          It has been something that's crossed my mind recently. As far as I'm aware, the actual SVD++ algorithm follows the same SVD-like approach as Funk, whereas, given that Mahout has the DLS, that's the way I'm going for now. That said, I'm planning to implement a distributed version of the SVD-like solution as well, though if actual SVD works, the SVD-like solution will come after GSoC.

          Richard Simon Just added a comment -

          Updates of DistributedSVDRecommender, SVDPredictionJob and SVDRecommenderMapper.

          Richard Simon Just added a comment -

          Update of all files. Includes fixes to all 'mvn site' errors.

          Sebastian Schelter added a comment -

          Hi Richard,

          I just read through the code and must admit I'm not familiar with SVD recommendations, but I think I might have found two things to look at:

          • could it be that you are not initializing the SVDPredictionJob with the correct input paths? As you supply userVectorPath twice, shouldn't eigenVerificationPath be used as --matrixVInput?
          • it would be better to use TasteHadoopUtils.splitPrefTokens(...) in AvgItemPrefMapper to split up the line from the preference file, so the delimiter character does not have to be specified explicitly
          Richard Simon Just added a comment -

          Hey Sebastian,

          Nice spot. I'd changed the --matrixVInput for debugging and forgot to change it back. FIXED.

          Cheers for the note on TasteHadoopUtils.splitPrefTokens(...); I'll check it out tomorrow.

          Sean Owen added a comment -

          What's the latest on how it works? Does it run?

          Richard Simon Just added a comment -

          No I'm still getting cardinality exception errors from DLS.

          I don't get the error if I create input from versions of ToItemPrefsMapper and ToUserVectorReducer using the older Hadoop API. DLS and DistributedRowMatrix etc are all on the older API. Not sure why this should make a difference though.

          Have those files been updated yet? I'll check today for updates. Otherwise I'll try updating DLS related classes to the newer API and see what happens.

          I'm also going to look into using Counters to find AvgUserPrefs, AvgItemPrefs and OverallPrefs to save on using M/R.

          I'm also going to have a look at what could be done to distribute the more incremental 'SVD-like' computation used by Funk.

          Richard Simon Just added a comment -

          The cardinality exception is fixed by the inclusion of an extra M/R pass. CardinalityCorrectionMapper takes the input matrix and creates a new sequence file of VarLongWritable, VectorWritable pairs where the vectors have the correct cardinality.

          Now working on EVJ input path errors.

          Sean Owen added a comment -

          Calling this 'later' as it's not clear this issue will come to completion.

          Jake Mannix added a comment -

          Hey Sean, Richard

          Do you have any insight into what is holding this up? What makes you say it's not clear this issue will be completed (ever?) ?

          Richard Simon Just added a comment -

          Hey Jake, sorry for the slow reply. The initial hold-up was that I didn't pass the GSoC midterm review. Having lost the funding, I had to turn my attention to finding paid work.

          That said, all is not lost. Thanks in large part to the coding of Laszlo Dosa of Fredhopper(.com), this update runs the code without exceptions.

          Laszlo Dosa added a comment -

          Hi,

          I worked further on Richard's code and improved scalability in SVDPredictionJob.java and SVDRecommenderMapper.java.

          I also added an item-item recommendation job based on the SVD. The approach is as follows:

          1. Calculate the eigenvectors like before
          2. Calculate item-item similarity with formula:
            ( Sum((q_ik - q_jk)^2) / Sum(q_ik^2) * Sum(q_jk^2)  )^-2
            

            where q_i is the vector of item i and q_ik is the element k of the vector i.

          3. Create top-N recommendations by selecting the N most similar items for each item.

          I'm happy to help further to get this ticket solved. Who could advise on how to move forward towards code that could be committed to Mahout?

          Laszlo Dosa (Fredhopper)

          Sean Owen added a comment -

          I can be your point of contact Laszlo, excellent.

          steve added a comment -

          I'm trying to execute this on an Amazon EC2 cluster, but I receive this error at the second job. Can someone help me?

          Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from a null string
          at org.apache.hadoop.fs.Path.checkPathArg(Path.java:78)
          at org.apache.hadoop.fs.Path.<init>(Path.java:90)
          at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:85)
          at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:252)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
          at org.apache.mahout.cf.taste.hadoop.svd.DistributedSVDItemRecommenderJob.run(DistributedSVDItemRecommenderJob.java:101)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
          at org.apache.mahout.cf.taste.hadoop.svd.DistributedSVDItemRecommenderJob.main(DistributedSVDItemRecommenderJob.java:134)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

          Sean Owen added a comment -

          I don't believe the code here was in a finished state, so, it may not quite work. I think you'd have to debug it yourself if interested in pursuing it.


            People

            • Assignee: Sean Owen
            • Reporter: Richard Simon Just
            • Votes: 0
            • Watchers: 1
