Crunch
  1. Crunch
  2. CRUNCH-97

Add helpers for parsing PCollection<String> instances

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: Core
    • Labels:
      None

      Description

      We should make it a bit easier to parse delimited text files into specific data types (e.g., ints, floats, etc.) or combinations of types-- e.g., pairs of strings and ints, a Tuple3 of booleans, etc.

      1. CRUNCH-97.patch
        21 kB
        Josh Wills
      2. CRUNCH-97-contrib.patch
        50 kB
        Josh Wills
      3. CRUNCH-97-take2.patch
        27 kB
        Josh Wills
      4. CRUNCH-97-Tokenizer-v1.patch
        123 kB
        Matthias Friedrich
      5. CRUNCH-97v3.patch
        69 kB
        Josh Wills
      6. CRUNCH-97v4.patch
        75 kB
        Josh Wills

        Activity

        Hide
        Josh Wills added a comment -

        Closing out the issus for the 0.5.0 release.

        Show
        Josh Wills added a comment - Closing out the issus for the 0.5.0 release.
        Hide
        Josh Wills added a comment -

        Committed to contrib-- thanks all.

        Show
        Josh Wills added a comment - Committed to contrib-- thanks all.
        Hide
        Gabriel Reid added a comment -

        +1 from me

        Show
        Gabriel Reid added a comment - +1 from me
        Hide
        Josh Wills added a comment -

        Same as the v4 patch, just in contrib instead of core.

        Show
        Josh Wills added a comment - Same as the v4 patch, just in contrib instead of core.
        Hide
        Josh Wills added a comment -

        I agree we need to get some use for it and figure out what works and what doesn't. I'll create a patch that puts it in contrib for now.

        Show
        Josh Wills added a comment - I agree we need to get some use for it and figure out what works and what doesn't. I'll create a patch that puts it in contrib for now.
        Hide
        Matthias Friedrich added a comment -

        Like Gabriel, I think a text parsing library would be a useful addition and I'm sorry for being unable to offer a design that covers all use cases. We need more feedback, perhaps someone else has a good idea, but I guess that'll only happen when it's part of the release or at least part of the source tree. How about adding it to contrib or add a beta marker to the javadoc and ask for feedback? This way we can still change it in case we come up with the perfect design (whatever that is) and users get a fair warning that it's not stable yet.

        Show
        Matthias Friedrich added a comment - Like Gabriel, I think a text parsing library would be a useful addition and I'm sorry for being unable to offer a design that covers all use cases. We need more feedback, perhaps someone else has a good idea, but I guess that'll only happen when it's part of the release or at least part of the source tree. How about adding it to contrib or add a beta marker to the javadoc and ask for feedback? This way we can still change it in case we come up with the perfect design (whatever that is) and users get a fair warning that it's not stable yet.
        Hide
        Gabriel Reid added a comment -

        I don't have any direct use or need for it right now, but I do have the feeling that something like this is a really useful addition to Crunch, so I think it would be a shame to close this out now. I'm pretty indifferent about the Scanner vs Tokenizer discussion, but I don't have much context to base an opinion on for now.

        In any case, even if this would be primarily a help for prototyping, I think that that is more than enough reason to include it. It's a similar situation to reflection-based Avro – it might not be what you want to use in production, but it's incredibly useful for quick iterations in development. In any case, I'm very much in favour of adding this (in one of its incarnations) to Crunch.

        Show
        Gabriel Reid added a comment - I don't have any direct use or need for it right now, but I do have the feeling that something like this is a really useful addition to Crunch, so I think it would be a shame to close this out now. I'm pretty indifferent about the Scanner vs Tokenizer discussion, but I don't have much context to base an opinion on for now. In any case, even if this would be primarily a help for prototyping, I think that that is more than enough reason to include it. It's a similar situation to reflection-based Avro – it might not be what you want to use in production, but it's incredibly useful for quick iterations in development. In any case, I'm very much in favour of adding this (in one of its incarnations) to Crunch.
        Hide
        Josh Wills added a comment -

        Gabriel Reid does this stuff interest you at all? If you're also 'meh' on it, I'll just close out the issue as a won't fix.

        Show
        Josh Wills added a comment - Gabriel Reid does this stuff interest you at all? If you're also 'meh' on it, I'll just close out the issue as a won't fix.
        Hide
        Josh Wills added a comment -

        Re: the order in which fields are extracted, it seems like we should have a swap() method in PTables for that, which would be useful beyond just text parsing.

        I respect the -0, although I like Scanner as a balance between regex and split-style parsing. The Extractor interface and the Parse code are independent of any particular tokenizing implementation, so if there's a better way to do it, we do have the freedom to experiment.

        Show
        Josh Wills added a comment - Re: the order in which fields are extracted, it seems like we should have a swap() method in PTables for that, which would be useful beyond just text parsing. I respect the -0, although I like Scanner as a balance between regex and split-style parsing. The Extractor interface and the Parse code are independent of any particular tokenizing implementation, so if there's a better way to do it, we do have the freedom to experiment.
        Hide
        Matthias Friedrich added a comment -

        Hmm, there's no way to influence the order in which fields are extracted, which means I couldn't turn "1,2" into a (2, 1) tuple, for example. This means I can't always get a tuple where my business key is the first element, which seems to be a common convention. As a user I would probably use Parse for prototyping and roll my own for production. I think this whole Scanner stuff isn't very well suited for this use case, so -0 from me.

        Show
        Matthias Friedrich added a comment - Hmm, there's no way to influence the order in which fields are extracted, which means I couldn't turn "1,2" into a (2, 1) tuple, for example. This means I can't always get a tuple where my business key is the first element, which seems to be a common convention. As a user I would probably use Parse for prototyping and roll my own for production. I think this whole Scanner stuff isn't very well suited for this use case, so -0 from me.
        Hide
        Matthias Friedrich added a comment -

        Nice! I'll give it a try and write a few test cases to make sure it covers my use cases.

        Show
        Matthias Friedrich added a comment - Nice! I'll give it a try and write a few test cases to make sure it covers my use cases.
        Hide
        Josh Wills added a comment -

        Matthias Friedrich my interpretation of the Tokenizer idea: I made the ScannerFactory into a TokenizerFactory, where my Tokenizer is just a wrapper for a Scanner that knows whether it should bypass certain fields when it is called. Ends up being less typing for the default case (no need to specify indices for the extractors) while still supporting your use case.

        Show
        Josh Wills added a comment - Matthias Friedrich my interpretation of the Tokenizer idea: I made the ScannerFactory into a TokenizerFactory, where my Tokenizer is just a wrapper for a Scanner that knows whether it should bypass certain fields when it is called. Ends up being less typing for the default case (no need to specify indices for the extractors) while still supporting your use case.
        Hide
        Matthias Friedrich added a comment -

        Here's how it would look with Tokenizers. This is just a demo, I would clean it up if we decided to go this way. I'd also have to design a nice call syntax for Parse with tokenizers.

        What I don't like about it: It makes nested collections harder to implement (see the ugly hack for Extractors.xcollect()). We also have to pass column positions to ExtractorS - that's the price we pay for flexibility.

        Show
        Matthias Friedrich added a comment - Here's how it would look with Tokenizers. This is just a demo, I would clean it up if we decided to go this way. I'd also have to design a nice call syntax for Parse with tokenizers. What I don't like about it: It makes nested collections harder to implement (see the ugly hack for Extractors.xcollect()). We also have to pass column positions to ExtractorS - that's the price we pay for flexibility.
        Hide
        Matthias Friedrich added a comment -

        It seems RB doesn't like two commits on top of each other. I think the easiest solution would be to squash the two commits into one (git rebase -i master) which worked for me. IIRC RB can show diffs between two patches, making it easy to see how the patch got changed.

        Show
        Matthias Friedrich added a comment - It seems RB doesn't like two commits on top of each other. I think the easiest solution would be to squash the two commits into one (git rebase -i master) which worked for me. IIRC RB can show diffs between two patches, making it easy to see how the patch got changed.
        Hide
        Josh Wills added a comment -

        This patch addresses Gabriel's comments here and Matthias' comments on RB. I couldn't upload it to RB b/c RB objected to the changes to o.a.c.lib.PTables for some reason.

        Show
        Josh Wills added a comment - This patch addresses Gabriel's comments here and Matthias' comments on RB. I couldn't upload it to RB b/c RB objected to the changes to o.a.c.lib.PTables for some reason.
        Hide
        Gabriel Reid added a comment -

        I just (finally) took a look at the new ReviewBoard patch of this (https://reviews.apache.org/r/8151/), and I'm even more enthusiastic about this now – it almost seems amazing that we didn't have something like this already, as it's going to be incredibly useful.

        Just a few small comments on it (I'm putting my comments here instead of on ReviewBoard, not sure which is best)

        • The group name used for the error counters in Parse.java is "test". It would be good to have a better default ("parse"?), and/or add it as a parameter to Parse#parse and Parse#parseTable
        • It might be helpful to add debug logging to the parsing error condition in AbstractSimpleExtractor and AbstractCompositeExtractor. As it is right now, if there's a field that isn't parsing properly somewhere in an input using the built-in numerical parsing, there's no way of finding where it's going wrong.
        • It might be good to add a bit of javadoc to the fieldErrors field and getter in ExtractorStats, as it wasn't immediately clear to me exactly how this was to be filled in (on the other hand, in most cases this won't be used directly anyhow, so it's probably not that important).
        Show
        Gabriel Reid added a comment - I just (finally) took a look at the new ReviewBoard patch of this ( https://reviews.apache.org/r/8151/ ), and I'm even more enthusiastic about this now – it almost seems amazing that we didn't have something like this already, as it's going to be incredibly useful. Just a few small comments on it (I'm putting my comments here instead of on ReviewBoard, not sure which is best) The group name used for the error counters in Parse.java is "test". It would be good to have a better default ("parse"?), and/or add it as a parameter to Parse#parse and Parse#parseTable It might be helpful to add debug logging to the parsing error condition in AbstractSimpleExtractor and AbstractCompositeExtractor. As it is right now, if there's a field that isn't parsing properly somewhere in an input using the built-in numerical parsing, there's no way of finding where it's going wrong. It might be good to add a bit of javadoc to the fieldErrors field and getter in ExtractorStats, as it wasn't immediately clear to me exactly how this was to be filled in (on the other hand, in most cases this won't be used directly anyhow, so it's probably not that important).
        Hide
        Matthias Friedrich added a comment -

        Looks cool! Having parsed way too much text myself, there's a few things I'm missing. Right now there doesn't seem to be much in the way of error and missing value handling (noticed none in the test case at least). To make this universally applicable (which would be the goal for o.a.c.lib, as opposed to contrib) we'd need a bit more support for dealing with crappy data.

        At work we increment separate counters for each field that has an invalid value and a different counter for records that are completely broken. This helps a lot with monitoring data streams over time. Also, my experiences with Java 5 (I never re-measured this) was that throwing multiple exceptions per record when dealing with crapping data significantly slows down processing, even in situations when you think I/O bound should totally dominate. I've seen 600% increases in runtime in pathological situations (throwing exceptions was fast in Java 5, but creating the stack traces wasn't).

        A few things from the nitpicking category: I'd move the inner classes to their own files to make things easier to read, maybe move implementations to an Extractors class (Guava style); the private stuff could be made package private. We could also use a package-info.java file for the javadocs and the CRUNCH-97 marker is missing from the commit messages (you can squash all three commits together using "rebase -i", this lets you edit the messages, too).

        Show
        Matthias Friedrich added a comment - Looks cool! Having parsed way too much text myself, there's a few things I'm missing. Right now there doesn't seem to be much in the way of error and missing value handling (noticed none in the test case at least). To make this universally applicable (which would be the goal for o.a.c.lib, as opposed to contrib) we'd need a bit more support for dealing with crappy data. At work we increment separate counters for each field that has an invalid value and a different counter for records that are completely broken. This helps a lot with monitoring data streams over time. Also, my experiences with Java 5 (I never re-measured this) was that throwing multiple exceptions per record when dealing with crapping data significantly slows down processing, even in situations when you think I/O bound should totally dominate. I've seen 600% increases in runtime in pathological situations (throwing exceptions was fast in Java 5, but creating the stack traces wasn't). A few things from the nitpicking category: I'd move the inner classes to their own files to make things easier to read, maybe move implementations to an Extractors class (Guava style); the private stuff could be made package private. We could also use a package-info.java file for the javadocs and the CRUNCH-97 marker is missing from the commit messages (you can squash all three commits together using "rebase -i", this lets you edit the messages, too).
        Hide
        Josh Wills added a comment -

        Gabriel Reid agree with your comments-- here's a new rev.

        Show
        Josh Wills added a comment - Gabriel Reid agree with your comments-- here's a new rev.
        Hide
        Gabriel Reid added a comment -

        Looks very cool. It would be good to add some javadoc to the Parse.Builder class methods, just to make it clear that they basically all map to java.util.Scanner parameters.

        Also, what do you think of moving putting this in a new package: org.apache.crunch.lib.text? I'm not totally sure about that either way, but it just seems that everything else in o.a.c.lib is pretty generic, and this is specific to string data.

        Show
        Gabriel Reid added a comment - Looks very cool. It would be good to add some javadoc to the Parse.Builder class methods, just to make it clear that they basically all map to java.util.Scanner parameters. Also, what do you think of moving putting this in a new package: org.apache.crunch.lib.text? I'm not totally sure about that either way, but it just seems that everything else in o.a.c.lib is pretty generic, and this is specific to string data.
        Hide
        Josh Wills added a comment -

        A first take on this-- adds a Parse module to o.a.c.lib that makes it easy to construct parsers for converting lines of text into Crunch types.

        Show
        Josh Wills added a comment - A first take on this-- adds a Parse module to o.a.c.lib that makes it easy to construct parsers for converting lines of text into Crunch types.

          People

          • Assignee:
            Josh Wills
            Reporter:
            Josh Wills
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development