MAHOUT-747: Entropy implementation in Map/Reduce

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: Math
    • Labels: None

      Description

      Hi again,

      because I work a lot with entropy and information gain ratio, I want to implement distributed versions of these algorithms.

      At first, this issue is only for entropy.

      Some questions:

    • Which package do the classes belong in? I put them in 'org.apache.mahout.math.stats' at first; I don't know if this is right, because they are components of information retrieval.
    • Entropy only reads a set of elements. As input I took a sequence file with keys of type Text and arbitrary values, because I only work with the keys. Is this best practice?
    • Is there a generic solution, so that the key type can be anything that implements Writable?

      Hadoop has a TokenCounterMapper, which emits each token with an IntWritable(1). I added a KeyCounterMapper to 'org.apache.mahout.common.mapreduce' which does the same with the keys.
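      A minimal sketch of the idea, assuming Text keys and Hadoop's newer mapreduce API (the actual patch may differ):

      {code:java}
      import java.io.IOException;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.Writable;
      import org.apache.hadoop.mapreduce.Mapper;

      /**
       * Emits each input key with a count of one, analogous to Hadoop's
       * TokenCounterMapper but counting keys instead of tokens; the input
       * value type is ignored.
       */
      public class KeyCounterMapper<V extends Writable>
          extends Mapper<Text, V, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Text key, V value, Context context)
            throws IOException, InterruptedException {
          context.write(key, ONE);
        }
      }
      {code}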

      Will append my patch soon.

      Regards, Christoph.

      Attachments

      1. MAHOUT-747.patch (43 kB, Christoph Nagel)

        Activity

        Christoph Nagel added a comment -

        First implementation

        Sean Owen added a comment -

        I somehow feel this is not a great fit for Hadoop. This is a really simple job, to be sure, and that's no bad thing per se. However, it does raise the question of whether the big overheads of distributing buy you much.

        For example, using a job to count the number of records in the input is a load of overhead for that task, even if you have done it well with a combiner.

        And the second job uses 1 reducer anyway; the "meat" of the computation is not distributed.

        I can surely imagine this computation existing as part of another computation, but I wonder if you wouldn't already have info at hand, like the number of items. Does this actually form a reusable component?

        Christoph Nagel added a comment -

        Hi Sean,

        the name and javadoc for my first map/reduce step weren't so good. The task is not to count all elements.
        In this step, the set gets grouped by element and the occurrences are counted.
        Example: [a, b, c, a, b, a] => { a: 3, b: 2, c: 1 }

        The second claim is right; I will shift the log computations into the mapper. That will be faster.

        Regards, Christoph.

        Christoph Nagel added a comment -

        Second implementation.
        The log calculation in Entropy is shifted to the mapper, as discussed before.
        Furthermore, ConditionalEntropy, InformationGain and InformationGainRatio are included.
        The last two are used in decision trees to create the splits.
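        For reference, the identity that makes this shift possible, with N the total number of elements and c_x the count of element x:

        {noformat}
        H = -\sum_x \frac{c_x}{N} \log \frac{c_x}{N}
          = \log N - \frac{1}{N} \sum_x c_x \log c_x
        {noformat}

        Each mapper can emit c_x \log c_x per element; the final reducer only sums these terms and combines the sum with \log N.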

        Sean Owen added a comment -

        I have a number of comments from large to small.

        • I am still not sure how the entropy calculation is distributed. All map keys are NullWritable so they all go to one reducer.
        • CalculateSpecificConditionalEntropyMapper seems to store too much in memory – a mapping for all keys. Does this not blow up at scale?
        • Aren't CalculateSimilarityCombiner and DoubleSumReducer virtually the same?
        • Use VarIntWritable instead of IntWritable for much better I/O efficiency
        • There's already a StringTuple class that would let you write a pair of Strings
        • I prefer to avoid inner classes myself here but it's more a question of preference
        • I think it's more standard to use one capital letter for generic types ("K") rather than words ("Key")
        Christoph Nagel added a comment -

        Added a new implementation.
        @Sean Owen:

        • Why don't you think the entropy calculation is distributed? There are 2 m/r jobs. The first groups and counts, which can use up to |keys| reducers. The second does the calculation and sums all values in the reducer. OK, only one reducer can do this, but I don't see another way. Sorry, I added more complexity to it for the sake of a better calculation. Thanks to @Ted Dunning for his hint.
        • Changed the ConditionalEntropy calculation so that no join is needed.
        • Sorry, I don't understand the point "Aren't CalculateSimilarityCombiner and DoubleSumReducer virtually the same?". I can't find CalculateSimilarityCombiner.
        • Changed all IntWritable to VarIntWritable.
        • Thanks for StringTuple, it saves many lines.

        Regards, Christoph.

        Sean Owen added a comment -

        The issue is that you're using NullWritable as the key. All keys go to the same place (one reducer), whether or not you run more than one.

        I'm referring to CalculateConditionalEntropyReducer and DoubleSumReducer. They only differ in the key/value ordering, and it seems they don't need to behave differently in this regard.

        Christoph Nagel added a comment -

        Hi Sean,

        sorry, but if I need to sum a set of doubles to one value, how is that possible with more than one reducer?
        Is there a best practice I don't see?

        For the second point: if we have one value and no key, is it better to store the value as the key or as the value in the sequence file?

        Regards, Christoph.

        Sean Owen added a comment -

        Yes that's what I was getting at originally... you can't really distribute this entirely. The counting is distributable though, yes.

        I don't think there's a strong convention for writing null,value or value,null when outputting a single value. I think that since it saves you writing another class, you can just go with null,value.

        There are some other items here I'd like to tweak in the code but they are small. For example I may want to move or merge some of these classes. But yes I can take a look soon and put it in. It's a simple job that does something useful, so seems OK to add.
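        As a usage sketch of the null,value convention with a sequence file (the class and path names here are illustrative, not from the patch):

        {code:java}
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.SequenceFile;

        public class WriteSingleValue {
          public static void main(String[] args) throws Exception {
            double entropy = 0.97; // illustrative value
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("entropy/value.seq"); // illustrative path
            // null,value convention: the single payload goes into the value slot.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, NullWritable.class, DoubleWritable.class);
            try {
              writer.append(NullWritable.get(), new DoubleWritable(entropy));
            } finally {
              writer.close();
            }
          }
        }
        {code}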

        Sean Owen added a comment -

        OK, I submitted this with moderate changes. Most of it is streamlining and using some common code that simplifies this. Much of it was putting this all into its own package. Some are small style changes. I had to fix it to ignore the _SUCCESS files found in Hadoop 0.22+ distros.

        Sebastian Schelter added a comment -

        Congratulations on your first committed Apache code, Christoph!

        Christoph Nagel added a comment -

        Cool & Thanks. Looked at it and only one question.
        Isn't *.common.mapreduce the best package for DoubleSumReducer, KeyCounterMapper, ValueCounterMapper, VarInSumReducer? These are generic mapper and reducer and while coding, I was surprised, that nobody had implemented them yet.

        Regards, Christoph.
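        A minimal sketch of what such a generic summing reducer could look like (the committed class may differ in its exact signature):

        {code:java}
        import java.io.IOException;

        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.mapreduce.Reducer;

        /**
         * Sums all DoubleWritable values per key; usable as a reducer
         * or, because summing is associative, as a combiner.
         */
        public class DoubleSumReducer<K extends Writable>
            extends Reducer<K, DoubleWritable, K, DoubleWritable> {

          @Override
          protected void reduce(K key, Iterable<DoubleWritable> values, Context context)
              throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable value : values) {
              sum += value.get();
            }
            context.write(key, new DoubleWritable(sum));
          }
        }
        {code}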

        Sean Owen added a comment -

        Yes they could go there too. My personal preference would be to keep it next to the code where it is used, for now, until there is a need to refactor it so it can be used elsewhere.


          People

          • Assignee: Sean Owen
          • Reporter: Christoph Nagel
          • Votes: 0
          • Watchers: 0
