Crunch
  1. Crunch
  2. CRUNCH-75

Have a method to create Bloom filters in Crunch

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: None
    • Labels:
      None

      Description

      As per discussion on mailing list, Bloomfilters provide interesting addon to crunch. But the crunch-core is not the correct place to add such a thing. So create a module i.e. something along the lines on piggybank of Pig.

      1. 0002-Created-crunch-bytes-module-for-users-to-share-funct.patch
        138 kB
        Rahul Sharma
      2. 0002-CRUNCH-75-Added-BloomFilters-to-crunch-contrib.patch
        136 kB
        Rahul Sharma
      3. CRUNCH-75.patch
        136 kB
        Rahul Sharma
      4. CRUNCH-75.patch
        141 kB
        Rahul Sharma
      5. 0001-CRUNCH-75.patch
        139 kB
        Rahul Sharma

        Issue Links

          Activity

          Hide
          Rahul Sharma added a comment -

          Created a module crunch-bytes which can have user functions.
          Added methods to create BloomFilters for files at a location and for a PCollection. Filter creation will return back PObject types.

          Show
          Rahul Sharma added a comment - Created a module crunch-bytes which can have user functions. Added methods to create BloomFilters for files at a location and for a PCollection. Filter creation will return back PObject types.
          Hide
          Rahul Sharma added a comment -

          The filter creation is being done in aggregator which uses the config that is passed in reset method.

          Show
          Rahul Sharma added a comment - The filter creation is being done in aggregator which uses the config that is passed in reset method.
          Hide
          Josh Wills added a comment -

          Very cool. Q: was the intent for crunch-bytes to be a general purpose repo, ala piggybank? If so, how did we come up with the name? I don't remember the context.

          Show
          Josh Wills added a comment - Very cool. Q: was the intent for crunch-bytes to be a general purpose repo, ala piggybank? If so, how did we come up with the name? I don't remember the context.
          Hide
          Rahul Sharma added a comment -

          There was a discussion for BloomFilters, to have them at a place that is something on lines of PiggyBank. And my apologies that I didn't initiate a thread to discus the module but I created one that I had in mind. I will initiate a thread for the same .

          Show
          Rahul Sharma added a comment - There was a discussion for BloomFilters, to have them at a place that is something on lines of PiggyBank. And my apologies that I didn't initiate a thread to discus the module but I created one that I had in mind. I will initiate a thread for the same .
          Hide
          Rahul Sharma added a comment -

          Added BloomFilters to crunch-contrib

          Show
          Rahul Sharma added a comment - Added BloomFilters to crunch-contrib
          Hide
          Matthias Friedrich added a comment -

          Two things I noticed at first glance: You'll have to add crunch-contrib as a dependency of crunch-dest, otherwise it won't be part of the binary release (see CRUNCH-76). Second thing: Could you create the patch via --find-copies-harder, that'll make it much shorter and easier to read.

          Show
          Matthias Friedrich added a comment - Two things I noticed at first glance: You'll have to add crunch-contrib as a dependency of crunch-dest, otherwise it won't be part of the binary release (see CRUNCH-76 ). Second thing: Could you create the patch via --find-copies-harder, that'll make it much shorter and easier to read.
          Hide
          Rahul Sharma added a comment - - edited

          Made crunch-contrib as a dependency of crunch-dist. I tried doing git diff --find-hard-copies, but I do not think that it made much difference. Please let me know if something else was required, I will update the patch.

          Show
          Rahul Sharma added a comment - - edited Made crunch-contrib as a dependency of crunch-dist. I tried doing git diff --find-hard-copies, but I do not think that it made much difference. Please let me know if something else was required, I will update the patch.
          Hide
          Matthias Friedrich added a comment -

          Hi Rahul! I meant "git format-patch --find-copies-harder master". I tried that locally and found the problem: Our shakes.txt has CRLF line endings while the one you add has LF line endings (which is correct). Without --find-copies-harder, the patch just adds the file, with --find-copies-harder, the patch makes a copy and then changes line endings on every single line. So that's bad luck, sorry about the confusion!

          As for the patch: Right now it doesn't compile for me, and I'd also put everything in crunch-contrib into the Java package "org.apache.crunch.contrib" so it's grouped nicely when we create aggregated Javadoc. Speaking of Javadoc, could you add some, along with a package-info.java file for o.a.c.contrib and o.a.c.contrib.bloomfilter?

          Show
          Matthias Friedrich added a comment - Hi Rahul! I meant "git format-patch --find-copies-harder master". I tried that locally and found the problem: Our shakes.txt has CRLF line endings while the one you add has LF line endings (which is correct). Without --find-copies-harder, the patch just adds the file, with --find-copies-harder, the patch makes a copy and then changes line endings on every single line. So that's bad luck, sorry about the confusion! As for the patch: Right now it doesn't compile for me, and I'd also put everything in crunch-contrib into the Java package "org.apache.crunch.contrib" so it's grouped nicely when we create aggregated Javadoc. Speaking of Javadoc, could you add some, along with a package-info.java file for o.a.c.contrib and o.a.c.contrib.bloomfilter?
          Hide
          Rahul Sharma added a comment -

          Matthias +1 for the suggested changes, I will update the javadocs and other required ones. As for the compilation issue, the patch depends on CRUNCH-74, so you need to apply that patch and then it will start building.

          Show
          Rahul Sharma added a comment - Matthias +1 for the suggested changes, I will update the javadocs and other required ones. As for the compilation issue, the patch depends on CRUNCH-74 , so you need to apply that patch and then it will start building.
          Hide
          Rahul Sharma added a comment -

          Updated the patch for changes.

          Show
          Rahul Sharma added a comment - Updated the patch for changes.
          Hide
          Josh Wills added a comment -

          @Rahul you should use CrunchTestSupport and not include TemporaryPaths.java in crunch-test. We don't want to introduce a recursive dependency between crunch and crunch-test by referencing RuntimeParameters in crunch-test. CrunchTestSupport does what you want TemporaryPaths to do, just have your IT.java classes extend it. Otherwise, +1.

          Show
          Josh Wills added a comment - @Rahul you should use CrunchTestSupport and not include TemporaryPaths.java in crunch-test. We don't want to introduce a recursive dependency between crunch and crunch-test by referencing RuntimeParameters in crunch-test. CrunchTestSupport does what you want TemporaryPaths to do, just have your IT.java classes extend it. Otherwise, +1.
          Hide
          Matthias Friedrich added a comment -

          Other than the issue Josh pointed out this looks good now, thank you!

          Show
          Matthias Friedrich added a comment - Other than the issue Josh pointed out this looks good now, thank you!
          Hide
          Rahul Sharma added a comment -

          Patch updated, using CrunchTestSupport. My bad I should have used that earlier.

          Show
          Rahul Sharma added a comment - Patch updated, using CrunchTestSupport. My bad I should have used that earlier.
          Hide
          Josh Wills added a comment -

          Not a problem-- +1, this is a nice addition.

          Show
          Josh Wills added a comment - Not a problem-- +1, this is a nice addition.
          Hide
          Rahul Sharma added a comment -

          Committed.

          Show
          Rahul Sharma added a comment - Committed.
          Hide
          Matthias Friedrich added a comment -

          Closing all resolved issues for 0.4.0.

          Show
          Matthias Friedrich added a comment - Closing all resolved issues for 0.4.0.

            People

            • Assignee:
              Rahul Sharma
              Reporter:
              Rahul Sharma
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development