  1. Mahout
  2. MAHOUT-1634

ALS doesn't work when other files are added to the Distributed Cache

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.10.1
    • Fix Version/s: 0.12.0
    • Labels:
    • Environment: Cloudera 5.1 VM, Eclipse, ZooKeeper

      Description

      The ALS algorithm uses the distributed cache for its temporary files, but the distributed cache has other uses too, notably shipping dependencies
      (http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/). When a Hadoop job adds a dependency library (or any other file) to the cache, ALS fails because it reads ALL files in the distributed cache without distinction.

      This happens in my company's project because we need to add the Mahout dependencies (mahout, lucene, ...) to a Hadoop Configuration in order to run Mahout jobs; otherwise the jobs fail because they cannot find their dependencies.

      I propose two options (I think both are valid):
      1) Remove all .jar files from the result of HadoopUtil.getCacheFiles
      2) Remove every Path object that does not match /part-*

      I prefer the first because it is less aggressive, and I think it will resolve all the problems.
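
      The two options could be sketched roughly as follows. This is only an illustration, not the content of the attached patch: the class name CacheFileFilter and both method names are hypothetical, and cache entries are modelled as plain strings instead of Hadoop Path objects so that the example runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; not part of Mahout. Paths are plain strings here,
// standing in for the Path objects returned from the distributed cache.
final class CacheFileFilter {

    // Option 1: drop every entry whose name ends in ".jar" (case-insensitive).
    static List<String> dropJars(List<String> cachePaths) {
        List<String> kept = new ArrayList<>();
        for (String p : cachePaths) {
            if (!p.toLowerCase(Locale.ROOT).endsWith(".jar")) {
                kept.add(p);
            }
        }
        return kept;
    }

    // Option 2: keep only entries whose file name starts with "part-".
    static List<String> keepPartFiles(List<String> cachePaths) {
        List<String> kept = new ArrayList<>();
        for (String p : cachePaths) {
            // File name is whatever follows the last '/' (or the whole string).
            String name = p.substring(p.lastIndexOf('/') + 1);
            if (name.startsWith("part-")) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up cache contents: one ALS temp file plus unrelated entries.
        List<String> cache = Arrays.asList(
                "/tmp/als/U/part-m-00000",
                "/libs/mahout-core.jar",
                "/conf/extra.txt");
        System.out.println("option 1: " + dropJars(cache));
        System.out.println("option 2: " + keepPartFiles(cache));
    }
}
```

      Note that option 1 still lets non-jar extras (such as a text file) through, while option 2 assumes that every file ALS needs is named part-*.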

      P.S.: Sorry if my English is wrong.

        Attachments

        1. mahout.patch
          2 kB
          Cristian Galán

              People

              • Assignee: Suneel Marthi (smarthi)
              • Reporter: Cristian Galán (cgalan)
              • Votes: 0
              • Watchers: 6

                  Time Tracking

                  Estimated: 24h
                  Remaining: 24h
                  Logged: Not Specified