Mahout
  1. Mahout
  2. MAHOUT-212

Need random sampler for use in reducers

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.2
    • Fix Version/s: 0.3
    • Component/s: Integration
    • Labels:
      None

      Description

      For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.

      As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.

      1. MAHOUT-212-C.patch
        28 kB
        Sean Owen
      2. MAHOUT-212-b.patch
        18 kB
        Ted Dunning
      3. MAHOUT-212.patch
        9 kB
        Ted Dunning

        Activity

        Hide
        Ted Dunning added a comment -

        Awesome. Thanks.

        I have no dependencies. I simply wrote this to help you and anybody else with the cooccurrence stuff (you mentioned delaying the sampling until later).

        Show
        Ted Dunning added a comment - Awesome. Thanks. I have no dependencies. I simply wrote this to help you and anybody else with the cooccurrence stuff (you mentioned delaying the sampling until later).
        Hide
        Sean Owen added a comment -

        Ted I've gone big on this and collapsed a lot more iterator related things into this package, and tried to unify a little. It'll probably break any dependencies you have on it, but not badly. What say?

        Show
        Sean Owen added a comment - Ted I've gone big on this and collapsed a lot more iterator related things into this package, and tried to unify a little. It'll probably break any dependencies you have on it, but not badly. What say?
        Hide
        Ted Dunning added a comment -

        I am snowed under and won't get to this for at least a week. Feel free to commit.

        Show
        Ted Dunning added a comment - I am snowed under and won't get to this for at least a week. Feel free to commit.
        Hide
        Sean Owen added a comment -

        In case you're waiting on comment from me, as far as I am concerned you can submit.

        Show
        Sean Owen added a comment - In case you're waiting on comment from me, as far as I am concerned you can submit.
        Hide
        Ted Dunning added a comment -

        What's the idea behind the Entry class? I am not seeing how that originalIndex is used.

        It allows sorting of the results according the original ordering of the data elements. The originalIndex is used in the comparison function. Since we have to keep the original index to do this sort I needed a handy way to glue the index to the value. My preference would have been a side array and a sort function that returns a permutation. That would have allowed my two classes to merge completely with very little overhead. Sorts that return permutations are commonly found in R or Matlab, but (to my knowledge) not in Java.

        Random generator = RandomUtils.getRandom();

        This can be a private static final member.

        It should actually be a local after making the change you suggested. Sorry I missed that.

        The existing SamplingIterator could use the DelegatingIterator class too if you like.

        After you, Gaston!

        (seriously, my tiny window of time for coding just closed. Feel free to take this and do anything you like)

        I take the general point about injection but at some level a component isn't meaningfully injectable. What kind of fault would you inject in a Random?

        I have had a few cases where very rare, but legal, values from a generator would cause a fault. It was easiest to provide a mock generator to stimulate these corners.

        As a concrete example, an exponential distribution can be generated using -log(u) where u is a random variable from (0, 1]. But most random number generators generator doubles from [0, 1) so the sampling should really be done using -log(1-u). It is reasonable to inject a generator that returns a 0 to make sure that the edge condition is handled well.

        What I would suggest is that if this is or becmes important, we can add a method to truly inject a generator. It isn't important here so I left it out.

        Show
        Ted Dunning added a comment - What's the idea behind the Entry class? I am not seeing how that originalIndex is used. It allows sorting of the results according the original ordering of the data elements. The originalIndex is used in the comparison function. Since we have to keep the original index to do this sort I needed a handy way to glue the index to the value. My preference would have been a side array and a sort function that returns a permutation. That would have allowed my two classes to merge completely with very little overhead. Sorts that return permutations are commonly found in R or Matlab, but (to my knowledge) not in Java. Random generator = RandomUtils.getRandom(); This can be a private static final member. It should actually be a local after making the change you suggested. Sorry I missed that. The existing SamplingIterator could use the DelegatingIterator class too if you like. After you, Gaston! (seriously, my tiny window of time for coding just closed. Feel free to take this and do anything you like) I take the general point about injection but at some level a component isn't meaningfully injectable. What kind of fault would you inject in a Random? I have had a few cases where very rare, but legal, values from a generator would cause a fault. It was easiest to provide a mock generator to stimulate these corners. As a concrete example, an exponential distribution can be generated using -log(u) where u is a random variable from (0, 1]. But most random number generators generator doubles from [0, 1) so the sampling should really be done using -log(1-u). It is reasonable to inject a generator that returns a 0 to make sure that the edge condition is handled well. What I would suggest is that if this is or becmes important, we can add a method to truly inject a generator. It isn't important here so I left it out.
        Hide
        Sean Owen added a comment -

        What's the idea behind the Entry class? I am not seeing how that originalIndex is used.

        Random generator = RandomUtils.getRandom();
        This can be a private static final member.

        The existing SamplingIterator could use the DelegatingIterator class too if you like.

        I take the general point about injection but at some level a component isn't meaningfully injectable. What kind of fault would you inject in a Random? testing what happens when it returns a value outside the interval or an Error maybe but is that reasonable. What flexibility would it meaningfully provide, at least in comparison to the extra method and new possible error scenarios.

        statics have their place and this instance seems like a fine example to me. It's one of the few cases where I really do need to make sure every instance in the whole program works a certain way.

        Show
        Sean Owen added a comment - What's the idea behind the Entry class? I am not seeing how that originalIndex is used. Random generator = RandomUtils.getRandom(); This can be a private static final member. The existing SamplingIterator could use the DelegatingIterator class too if you like. I take the general point about injection but at some level a component isn't meaningfully injectable. What kind of fault would you inject in a Random? testing what happens when it returns a value outside the interval or an Error maybe but is that reasonable. What flexibility would it meaningfully provide, at least in comparison to the extra method and new possible error scenarios. statics have their place and this instance seems like a fine example to me. It's one of the few cases where I really do need to make sure every instance in the whole program works a certain way.
        Hide
        Ted Dunning added a comment -

        Here is the actual patch.

        Show
        Ted Dunning added a comment - Here is the actual patch.
        Hide
        Ted Dunning added a comment -

        Here is another patch. I moved SamplingIterator and Iterable. I moved to eager sampling and simplified code and test cases as a result. In the end, there is no additional functionality from an Iterable (it just wraps the Iterator in a way that it returns that same iterator) so I left it out.

        Sean> The Random shouldn't be instance variable right, and should be obtained from RandomUtils?

        ted> I like having it be injectable for testing purposes. As long as it exhibits the same interface as j.u.Random, we should be fine. There may be a better interface from RandomUtils. Feel free to suggest one, but I really do want to keep the injectability of the generator.

        I switched to use RandomUtils, but I still think it is a poor form of injection. Most notably, we can't inject faults using this approach, only a standard test sequence. Besides, I hate global variables ( which is what a static class like this is ). For these tests, those issues don't matter so I switched.

        Show
        Ted Dunning added a comment - Here is another patch. I moved SamplingIterator and Iterable. I moved to eager sampling and simplified code and test cases as a result. In the end, there is no additional functionality from an Iterable (it just wraps the Iterator in a way that it returns that same iterator) so I left it out. Sean> The Random shouldn't be instance variable right, and should be obtained from RandomUtils? ted> I like having it be injectable for testing purposes. As long as it exhibits the same interface as j.u.Random, we should be fine. There may be a better interface from RandomUtils. Feel free to suggest one, but I really do want to keep the injectability of the generator. I switched to use RandomUtils, but I still think it is a poor form of injection. Most notably, we can't inject faults using this approach, only a standard test sequence. Besides, I hate global variables ( which is what a static class like this is ). For these tests, those issues don't matter so I switched.
        Hide
        Sean Owen added a comment -

        Yeah test injection was the idea behind using RandomUtils, since it will return a generator that uses the same seed every time when set in test mode. The unit tests do (should) set it globally as such, to make sure the results are deterministic. Yes the returned generator is a MersenneTwisterRNG which just extends Random.

        Yes anything for testing should probably be package-private.

        (I'd also suggest making the instance fields private here? not sure there's a big case for extension, at least, one that isn't perhaps better answered with explicit getters)

        I dont' care about the test naming convention.

        Once this is in place I'll put my similar Iterator next to it.

        Show
        Sean Owen added a comment - Yeah test injection was the idea behind using RandomUtils, since it will return a generator that uses the same seed every time when set in test mode. The unit tests do (should) set it globally as such, to make sure the results are deterministic. Yes the returned generator is a MersenneTwisterRNG which just extends Random. Yes anything for testing should probably be package-private. (I'd also suggest making the instance fields private here? not sure there's a big case for extension, at least, one that isn't perhaps better answered with explicit getters) I dont' care about the test naming convention. Once this is in place I'll put my similar Iterator next to it.
        Hide
        Ted Dunning added a comment -

        I had suggested we not use both org.apache.mahout.common and org.apache.mahout.utils as the "common stuff" package, since that's redundant. We sort of standardized on common, but, retained utils for various reasons. I think this belongs in core/ and under .common

        Sounds good.

        The Random shouldn't be instance variable right, and should be obtained from RandomUtils?

        I like having it be injectable for testing purposes. As long as it exhibits the same interface as j.u.Random, we should be fine. There may be a better interface from RandomUtils. Feel free to suggest one, but I really do want to keep the injectability of the generator.

        It's not necessary to keep the original Iterator since as you show, you really must sample it all upfront as you do. In this sense it's almost not properly a class that should produce an Iterator, but a List, but, I like the tidiness of an Iterator wrapper.

        This is a point I waffled on. The real question here is whether we care about the corner case where we don't read anything from the iterator. I went slightly nuts and decided I did care to optimize that point, but you make a strong counter argument that the class could be simpler if copyInput were called from the constructor. That would simplify testing as well.

        Consider providing an Iterable counterpart for easy use with foreach loops, like I did with SamplingIterable

        Quite doable.

        Name it something ending with Iterator since it's an Iterator? FixedSizeSampleIterator?

        Also a fine idea.

        Are methods like copyInput() necessarily public, and is there a need to set the generator?

        They could be package level. I merely exposed it to be able to do more detailed testing. This adds weight to your argument about keeping the original iterator.

        Very picky, usually see test cases end in TestCase

        I usually see test cases that start or end with Test. It is an old convention from many ant builds that required regexes. I don't much care except that I would have a small preference for making abstract tests end in TestCase in order to distinguish them from concrete tests.

        If you agree with these but don't care t implement, I can do so.

        Let me take one more crack.

        Show
        Ted Dunning added a comment - I had suggested we not use both org.apache.mahout.common and org.apache.mahout.utils as the "common stuff" package, since that's redundant. We sort of standardized on common, but, retained utils for various reasons. I think this belongs in core/ and under .common Sounds good. The Random shouldn't be instance variable right, and should be obtained from RandomUtils? I like having it be injectable for testing purposes. As long as it exhibits the same interface as j.u.Random, we should be fine. There may be a better interface from RandomUtils. Feel free to suggest one, but I really do want to keep the injectability of the generator. It's not necessary to keep the original Iterator since as you show, you really must sample it all upfront as you do. In this sense it's almost not properly a class that should produce an Iterator, but a List, but, I like the tidiness of an Iterator wrapper. This is a point I waffled on. The real question here is whether we care about the corner case where we don't read anything from the iterator. I went slightly nuts and decided I did care to optimize that point, but you make a strong counter argument that the class could be simpler if copyInput were called from the constructor. That would simplify testing as well. Consider providing an Iterable counterpart for easy use with foreach loops, like I did with SamplingIterable Quite doable. Name it something ending with Iterator since it's an Iterator? FixedSizeSampleIterator? Also a fine idea. Are methods like copyInput() necessarily public, and is there a need to set the generator? They could be package level. I merely exposed it to be able to do more detailed testing. This adds weight to your argument about keeping the original iterator. Very picky, usually see test cases end in TestCase I usually see test cases that start or end with Test. It is an old convention from many ant builds that required regexes. I don't much care except that I would have a small preference for making abstract tests end in TestCase in order to distinguish them from concrete tests. If you agree with these but don't care t implement, I can do so. Let me take one more crack.
        Hide
        Sean Owen added a comment -

        Sure, understood. How about putting them in the same place, and making them be named / look / act similarly? That brings me to a number of comments on the patch:

        • I had suggested we not use both org.apache.mahout.common and org.apache.mahout.utils as the "common stuff" package, since that's redundant. We sort of standardized on common, but, retained utils for various reasons. I think this belongs in core/ and under .common
        • The Random shouldn't be instance variable right, and should be obtained from RandomUtils?
        • It's not necessary to keep the original Iterator since as you show, you really must sample it all upfront as you do. In this sense it's almost not properly a class that should produce an Iterator, but a List, but, I like the tidiness of an Iterator wrapper.
        • Consider providing an Iterable counterpart for easy use with foreach loops, like I did with SamplingIterable
        • Name it something ending with Iterator since it's an Iterator? FixedSizeSampleIterator?
        • Are methods like copyInput() necessarily public, and is there a need to set the generator?
        • Very picky, usually see test cases end in TestCase

        If you agree with these but don't care t implement, I can do so.

        Show
        Sean Owen added a comment - Sure, understood. How about putting them in the same place, and making them be named / look / act similarly? That brings me to a number of comments on the patch: I had suggested we not use both org.apache.mahout.common and org.apache.mahout.utils as the "common stuff" package, since that's redundant. We sort of standardized on common, but, retained utils for various reasons. I think this belongs in core/ and under .common The Random shouldn't be instance variable right, and should be obtained from RandomUtils? It's not necessary to keep the original Iterator since as you show, you really must sample it all upfront as you do. In this sense it's almost not properly a class that should produce an Iterator, but a List, but, I like the tidiness of an Iterator wrapper. Consider providing an Iterable counterpart for easy use with foreach loops, like I did with SamplingIterable Name it something ending with Iterator since it's an Iterator? FixedSizeSampleIterator? Are methods like copyInput() necessarily public, and is there a need to set the generator? Very picky, usually see test cases end in TestCase If you agree with these but don't care t implement, I can do so.
        Hide
        Ted Dunning added a comment -

        Kinda existed, but SamplingIterator takes a sample rate which can't be known if you don't know the total size. The FixedSizeSampler just takes a desired size and gives you exactly that many.

        Merging them makes sense from the point of view of a user, but there is little in common between the implementations.

        Show
        Ted Dunning added a comment - Kinda existed, but SamplingIterator takes a sample rate which can't be known if you don't know the total size. The FixedSizeSampler just takes a desired size and gives you exactly that many. Merging them makes sense from the point of view of a user, but there is little in common between the implementations.
        Hide
        Sean Owen added a comment -

        This kinda already existed as SamplingIterator – does that do the same thing? could these be merged then, pulling the class into a common location and combining aspects of both?

        Show
        Sean Owen added a comment - This kinda already existed as SamplingIterator – does that do the same thing? could these be merged then, pulling the class into a common location and combining aspects of both?
        Hide
        Ted Dunning added a comment -

        Hmm... didn't get asked for where the patch file was when marking the bug as patch available.

        Show
        Ted Dunning added a comment - Hmm... didn't get asked for where the patch file was when marking the bug as patch available.
        Hide
        Ted Dunning added a comment -

        Code plus test cases.

        Ready for use. I think.

        Show
        Ted Dunning added a comment - Code plus test cases. Ready for use. I think.

          People

          • Assignee:
            Sean Owen
            Reporter:
            Ted Dunning
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development