  1. DataFu
  2. DATAFU-2

UDFs for entropy and weighted sampling algorithms

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Labels:
      None

      Description

      Jian Wang has suggested that we add UDFs for entropy and weighted random sampling and has implementations for each of these ready.

      In Jian's words:

      "In the real world, there are occasions when we need to calculate the entropy of discrete random variables, for instance to calculate the mutual information between variables X and Y using its entropy-based formula (the mutual information calculation can be found at http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities). I would suggest implementing a UDF to calculate the entropy of given input samples, following the definition at http://en.wikipedia.org/wiki/Entropy_%28information_theory%29

      This is the reference paper I used to learn about the weighted sampling algorithm: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf

      The present WeightedSample.java implements Algorithm D.

      We may try Algorithms A, A-Res and A-ExpJ, since they can be used on data streams and in distributed environments. These algorithms could be implemented on top of ReservoirSample.java (perhaps by inheriting from this class?), since they also need a reservoir to store the selected items."
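
      To make the two proposals above concrete, here is a minimal standalone Java sketch. It is not the DataFu implementation and does not use the Pig UDF API; the class name EntropyAndSamplingSketch and the method names entropy and weightedSample are hypothetical, chosen only for illustration. The entropy method computes the empirical Shannon entropy in bits from sample counts. The sampling method follows A-Res as I understand it from the referenced paper: each item with weight w gets a random key u^(1/w), with u uniform in [0, 1), and the k items with the largest keys are kept in a reservoir.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Random;

// Illustrative sketch only: hypothetical names, not part of DataFu.
final class EntropyAndSamplingSketch {

    /** Empirical Shannon entropy, in bits, of a list of discrete samples. */
    static <T> double entropy(List<T> samples) {
        if (samples.isEmpty()) {
            return 0.0; // assumption: define the entropy of no data as 0
        }
        // Count how often each distinct value occurs.
        Map<T, Integer> counts = new HashMap<>();
        for (T s : samples) {
            counts.merge(s, 1, Integer::sum);
        }
        double n = samples.size();
        double h = 0.0;
        for (int c : counts.values()) {
            double p = c / n;                        // empirical probability
            h -= p * (Math.log(p) / Math.log(2.0));  // -sum p * log2(p)
        }
        return h;
    }

    /** A reservoir entry: an item together with its random key. */
    static final class Keyed<T> {
        final T item;
        final double key;

        Keyed(T item, double key) {
            this.item = item;
            this.key = key;
        }
    }

    /**
     * A-Res style weighted sampling without replacement in a single pass.
     * Items with non-positive weight are skipped (an assumption of this sketch).
     */
    static <T> List<T> weightedSample(List<T> items, List<Double> weights, int k, Random rng) {
        // Min-heap on the key, so the smallest key in the reservoir is at the head.
        PriorityQueue<Keyed<T>> reservoir =
                new PriorityQueue<>(Comparator.comparingDouble((Keyed<T> e) -> e.key));
        for (int i = 0; i < items.size(); i++) {
            double w = weights.get(i);
            if (w <= 0.0) {
                continue;
            }
            double key = Math.pow(rng.nextDouble(), 1.0 / w);
            if (reservoir.size() < k) {
                reservoir.add(new Keyed<>(items.get(i), key));
            } else if (k > 0 && key > reservoir.peek().key) {
                reservoir.poll(); // evict the entry with the smallest key
                reservoir.add(new Keyed<>(items.get(i), key));
            }
        }
        List<T> out = new ArrayList<>();
        for (Keyed<T> e : reservoir) {
            out.add(e.item);
        }
        return out;
    }
}
```

      With an entropy function like this, the mutual information mentioned above could be obtained as H(X) + H(Y) - H(X,Y), where H(X,Y) is the entropy of the paired samples. Because the sampler only keeps the k largest keys, reservoirs built on separate partitions could in principle be merged by keeping the k largest keys overall, provided the keys are retained alongside the items.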

            People

            • Assignee:
              mhayes Matthew Hayes
              Reporter:
              mhayes Matthew Hayes
            • Votes:
              0
              Watchers:
              2
