Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Implement a stratified sampling option ( http://en.wikipedia.org/wiki/Stratified_sampling ) in Pig's SAMPLE operator.
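
      For readers unfamiliar with the technique, the following is a minimal sketch of proportional stratified sampling in plain Python. It is not Pig code and not a proposed implementation. The function name `stratified_sample`, its parameters, and the rule of keeping at least one record per stratum are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=None):
    """Draw roughly `fraction` of the records from every stratum.

    `key` maps a record to its stratum label. All names here are
    hypothetical; this only illustrates the idea behind the request.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    rng = random.Random(seed)

    # Partition the records into strata by their key.
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    sample = []
    for members in strata.values():
        # Assumption: keep at least one record from each non-empty stratum
        # so that small strata are not lost entirely.
        k = max(1, round(len(members) * fraction))
        # Sample without replacement inside the stratum.
        sample.extend(rng.sample(members, k))
    return sample


if __name__ == "__main__":
    data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
    picked = stratified_sample(data, key=lambda r: r[0], fraction=0.1, seed=1)
    print(len(picked), "records sampled")
```

      Unlike a uniform sample over the whole input, which can miss a small stratum entirely, this keeps every stratum represented in proportion to its size.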

          Activity

          ddwijewardana Dishara Wijewardana added a comment -

          Hi Gianmarco,
          I am Dishara. I took part in GSoC 2012 with the Apache Velocity project, where I completed the JSR 223 implementation. I would like to contribute to Pig, since the project looks very interesting. As far as I understand, the idea here is to implement a reasonable stratified sampling algorithm on top of Pig; correct me if I am wrong. Could you give a bit more detail on which aspects I should look into and what exactly is expected in the end? With that, I could perhaps submit a candidate algorithm as a patch, ideally before writing the proposal.

          ddwijewardana Dishara Wijewardana added a comment -

          By the way, does this project fit the scope of a GSoC project? I ask because I do not yet have the big picture and cannot judge its complexity.

          azaroth Gianmarco De Francisci Morales added a comment -

          Hi Dishara,
          Happy to see your interest.
          While we haven't discussed this in detail with the rest of the committers, my personal view is that this project should be combined with the one on bootstrap sampling (PIG-3221) to be worthy of GSoC.

          Regarding the sampling, this part of the project requires designing new syntax for the SAMPLE operator (to specify the strata), changing the parser to recognize it, and implementing the corresponding logical and physical operators.

          saiph Saiph Kappa added a comment -

          Hi. This seems very interesting. I am currently a PhD student working on dataflows, and I would also like to work on this issue as part of GSoC 2013. Maybe we can discuss this further by email: firstName(dot) lastName (at) gmail (dot) com

          azaroth Gianmarco De Francisci Morales added a comment -

          Hi Saiph,
          I am happy to see interest in this project idea.

          To prepare a GSoC project proposal, this idea should be combined with the other sampling projects in Pig, as listed at https://cwiki.apache.org/confluence/display/PIG/GSoc2013.

          In my view, reservoir and bootstrap sampling are the easiest, while stratified sampling might be more complicated.


            People

            • Assignee:
              Unassigned
              Reporter:
              azaroth Gianmarco De Francisci Morales
            • Votes:
              0
              Watchers:
              6
