Mahout / MAHOUT-940

Clusterdumper - Get rid of map based implementation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.6
    • Fix Version/s: 0.8
    • Component/s: Clustering
    • Labels:

      Description

      The current implementation of ClusterDumper puts clusters and their associated vectors into a map. This generally results in an OOM (out-of-memory) error.

      Since ClusterOutputProcessor is now available, ClusterDumper can first process the clusteredPoints and then write the clusters to a local file.

      The inability to properly read clustering output because ClusterDumper hits an OOM comes up too often on the mailing list. This improvement will fix that problem.
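
      The memory problem described above can be pictured with a small, hedged sketch in plain Java. This is illustrative only, not Mahout's actual ClusterDumper code; the class and method names here are made up for the example. The map-based pattern keeps every clustered point on the heap at once, while a streaming pattern hands each point to a writer as it is read:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch, not Mahout's actual code.
public class DumperSketch {

    // One clustered point: a cluster id plus its vector.
    public static final class ClusteredPoint {
        public final int clusterId;
        public final double[] vector;
        public ClusteredPoint(int clusterId, double[] vector) {
            this.clusterId = clusterId;
            this.vector = vector;
        }
    }

    // Map-based pattern (the problem): the entire clustering output is
    // accumulated on the heap before anything is written, so memory use
    // grows with the whole dataset and eventually triggers OutOfMemoryError.
    public static Map<Integer, List<double[]>> collectAll(Iterable<ClusteredPoint> points) {
        Map<Integer, List<double[]>> byCluster = new HashMap<>();
        for (ClusteredPoint p : points) {
            byCluster.computeIfAbsent(p.clusterId, k -> new ArrayList<>()).add(p.vector);
        }
        return byCluster;
    }

    // Streaming pattern (the direction this issue proposes): each point is
    // passed straight to a per-cluster writer, so only one point is resident
    // at a time regardless of dataset size.
    public interface PointWriter {
        void write(int clusterId, double[] vector);
    }

    public static long streamAll(Iterable<ClusteredPoint> points, PointWriter writer) {
        long written = 0;
        for (ClusteredPoint p : points) {
            writer.write(p.clusterId, p.vector);
            written++;
        }
        return written;
    }
}
```

      The streaming variant's memory footprint is constant in the number of points, which is the essential property the post-processor approach is meant to provide.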

        Activity

        Saikat Kanjilal added a comment -

        Paritosh,
        I'm assuming OOM means out of memory; is that correct? So, to be clear, is the solution to use the ClusteredOutputProcessor instead of the ClusterDumper? I will research the code and move forward with an implementation, and will ask questions if I get stuck.

        Saikat Kanjilal added a comment -

        So after researching this some more:
        1) I don't see any class called ClusteredOutputProcessor, so I assume you mean ClusteredOutputPostProcessor; is that correct?
        2) Looking at ClusterDumper more closely, I see two maps, one for the postProcessingClusteredDirectories and the other called writersForClusters. I will replace both of those with the APIs nested inside ClusteredOutputPostProcessor.

        Let me know if you see any concerns with the above approach.

        Paritosh Ranjan added a comment - - edited

        1) Yes.
        2) It might be a good idea to do some testing before and after your code change, i.e. run all JUnit tests, plus some manual testing using clusterdumper (dump a cluster with the new implementation that was getting an OOM with the older one). That will make sure the code is working.

        Also, you can compare quality before and after using the post processor: the results should be the same whether you use the map-based or the post-processor-based implementation.

        So, to test it, do not get rid of the older code; rather, provide an option to use either the map-based or the post-processor-based implementation. This will help in testing. Later it can be decided which version to keep, i.e. the new one or both.

        Saikat Kanjilal added a comment -

        Still reading the code to get a deeper understanding of what's happening; some more questions:

        1) The createClusterWriter method inside ClusterDumper creates three types of writers depending on the outputFormat, and one of the arguments to these writers is the map in question, shown below:

        private Map<Integer, List<WeightedVectorWritable>> clusterIdToPoints;

        It's not clear to me whether we need a deeper refactoring to rewrite/replace these different types of writers with the ClusterOutputPostProcessor. Any thoughts on this? Should we offer a choice between the writers and the ClusterOutputPostProcessor?

        2) For the following line of code:
        long numWritten = clusterWriter.write(new SequenceFileDirValueIterable<ClusterWritable>(new Path(seqFileDir, "part-*"), PathType.GLOB, conf));

        Does the above just use an iterator to dump the points into different directories corresponding to the different clusters? The code is really hard to read, and SequenceFileDirValueIterable is not well commented.

        Thanks for your help in getting a better understanding of this.
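
        As a hedged conceptual sketch (not Mahout's actual implementation), an iterable like SequenceFileDirValueIterable can be pictured as lazily chaining the values of several part files into one stream, so the caller iterates over everything without loading it all at once. Here each "part file" is stood in for by a List<String>:

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Conceptual stand-in: chains the contents of several "files" (lists here)
// into a single lazy iteration, opening each one only when it is reached.
public class ChainedValues implements Iterable<String> {
    private final List<List<String>> parts;  // stand-in for part-* files

    public ChainedValues(List<List<String>> parts) {
        this.parts = parts;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int file = 0;
            private Iterator<String> current = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                // Advance to the next non-empty "file" only when needed.
                while (!current.hasNext() && file < parts.size()) {
                    current = parts.get(file++).iterator();
                }
                return current.hasNext();
            }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }
}
```

        The real class additionally deserializes sequence-file records and resolves the "part-*" glob against HDFS, but the streaming shape is the same.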

        Paritosh Ranjan added a comment -

        I am not familiar with this code either. It will be great if you can find a way through.
        The point is to fix the OOM.

        Saikat Kanjilal added a comment -

        Ha, that's funny. OK, time to keep digging and see what I can figure out.

        Saikat Kanjilal added a comment -

        Paritosh,
        I have some time to work on this today. Are there particular unit tests that cause the OOM to happen? Maybe I can start where the problem happens and work backwards from there; the unit tests on trunk seem to compile and run fine on my Mac. Some more examples would help.

        Thanks

        Paritosh Ranjan added a comment -

        Clustering a large dataset and then running clusterdumper will give an OOM, since all the data will be in the map.

        A new test case is needed, one that fails due to the OOM.
        Then the new implementation can be tried out, which should eventually let the JUnit test pass.

        Saikat Kanjilal added a comment -

        OK, sorry about the delayed response; I have been swamped with work. I ran a few examples with the Reuters dataset and couldn't reproduce the OOM issue. I believe this used the ClusterDumper, from what I could see in the output. Are there other datasets I can try?

        Paritosh Ranjan added a comment -

        I think the best way would be to write some code that generates sequence files with a large number of random vectors.
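
        A minimal sketch of such a generator, hedged: this shows only the random-vector part in plain Java. In a real test the vectors would be wrapped in Mahout's VectorWritable and appended to a Hadoop SequenceFile via SequenceFile.Writer so clusterdump can read them back; that wiring is omitted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch: produces 'count' dense random vectors of the given
// dimension. These would then be written out as sequence-file records to
// build a dataset large enough to reproduce the OOM.
public class RandomVectorGenerator {
    public static List<double[]> generate(int count, int dimension, long seed) {
        Random rng = new Random(seed);  // fixed seed keeps the test reproducible
        List<double[]> vectors = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            double[] v = new double[dimension];
            for (int d = 0; d < dimension; d++) {
                v[d] = rng.nextDouble();  // uniform in [0, 1)
            }
            vectors.add(v);
        }
        return vectors;
    }
}
```

        For the OOM reproduction itself the generator would stream each vector straight to the file rather than holding the whole list in memory, otherwise the generator would hit the same limit the test is trying to expose.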

        Paritosh Ranjan added a comment -

        I will not be able to resolve this issue in 0.7, and it is not that important either. But I would like to take a shot at it later.
        Can we move this to 0.8?

        Saikat Kanjilal added a comment -

        Paritosh,
        I have not had the bandwidth to do anything else on this. I will try to script this in the next week or so, but I am swamped at work.
        Thanks

        Saikat Kanjilal added a comment -

        Paritosh, I was going to start writing some code to generate these large files. Should we check this into the Mahout tree, or will this be a one-time thing?

        Paritosh Ranjan added a comment -

        Considering that nobody has encountered this issue in the last few months, and that we now have ClusterOutputPostProcessor (clusterpp), which can handle any number of clusters/vectors, I would like to drop this issue.

        I propose marking it Won't Fix unless someone feels otherwise. I will wait a week or so, and then mark it Won't Fix.

        Sean Owen added a comment -

        (Fine by me.)
        (For what it's worth, I have no problem just marking things Won't Fix if you have any reasonable feeling that it won't be controversial. These things can always be undone if anyone objects. It just saves time and another 'hop' in the process.)

        Paritosh Ranjan added a comment -

        Marking it Won't Fix for the reasons stated in the last few comments.


          People

          • Assignee: Paritosh Ranjan
          • Reporter: Paritosh Ranjan
          • Votes: 0
          • Watchers: 3
