because I got much to work with entropy and information gain ratio, I want to implement the following distributed algorithms:
- Entropy (https://secure.wikimedia.org/wikipedia/en/wiki/Entropy_%28information_theory%29)
- Conditional Entropy (https://secure.wikimedia.org/wikipedia/en/wiki/Conditional_entropy)
- Information Gain
- Information Gain Ratio (https://secure.wikimedia.org/wikipedia/en/wiki/Information_gain_ratio)
This issue is at first only for entropy.
- In which package do the classes belong. I put them first at 'org.apache.mahout.math.stats', don't know if this is right, because they are components of information retrieval.
- Entropy only reads a set of elements. As input i took a sequence file with keys of type Text and values anyone, because I only work with the keys. Is this the best practise?
- Is there a generic solution, so that the type of keys can be anything inherited from Writable?
In Hadoop is a TokenCounterMapper, which emits each value with an IntWritable(1). I added a KeyCounterMapper into 'org.apache.mahout.common.mapreduce' which does the same with the keys.
Will append my patch soon.
|Status||Resolved [ 5 ]||Closed [ 6 ]|
|Status||Open [ 1 ]||Resolved [ 5 ]|
|Assignee||Sean Owen [ srowen ]|
|Fix Version/s||0.6 [ 12316364 ]|
|Resolution||Fixed [ 1 ]|
|Attachment||MAHOUT-747.patch [ 12484876 ]|
|Attachment||MAHOUT-747.patch [ 12484615 ]|
|Status||Patch Available [ 10002 ]||Open [ 1 ]|
|Field||Original Value||New Value|
|Status||Open [ 1 ]||Patch Available [ 10002 ]|