Pig
PIG-2641

Create toJSON function for all complex types: tuples, bags and maps

    Details

      Description

      It is a travesty that there are no UDFs in Piggybank that, given an arbitrary Pig datatype, return a JSON string of it. I intend to fix this problem.

      1. PIG-2641.patch
        11 kB
        Russell Jurney
      2. PIG-2641-2.patch
        10 kB
        Russell Jurney
      3. PIG-2641-3.patch
        14 kB
        Russell Jurney
      4. PIG-2641-4.patch
        14 kB
        Russell Jurney
      5. PIG-2641-5.patch
        14 kB
        Russell Jurney
      6. PIG-2641-6.patch
        14 kB
        Russell Jurney
      7. PIG-2641-7.patch
        15 kB
        Daniel Dai

        Issue Links

          Activity

          Russell Jurney added a comment -

          EvalFuncs do not have access to Tuple field names. Without access to the schema, this feature seems impossible.

          Daniel Dai added a comment -

          Check EvalFunc.getInputSchema()
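
          For readers landing here later, a minimal stand-alone sketch of Daniel's hint. Pig's EvalFunc and Schema are stubbed out here so the sketch compiles without the Pig jar; in a real UDF you would extend org.apache.pig.EvalFunc, and the framework populates the input schema for you:

```java
// Hypothetical sketch: pairing positional tuple values with field aliases
// obtained from getInputSchema(). The Pig classes are stand-ins.
import java.util.Arrays;
import java.util.List;

public class SchemaSketch {
    // Minimal stand-in for Pig's Schema: an ordered list of field aliases.
    static class Schema {
        final List<String> aliases;
        Schema(String... aliases) { this.aliases = Arrays.asList(aliases); }
    }

    // Stand-in for EvalFunc: the frontend sets the input schema,
    // then exec() can map positions to names.
    static abstract class EvalFunc<T> {
        private Schema inputSchema;
        public void setInputSchema(Schema s) { inputSchema = s; }
        public Schema getInputSchema() { return inputSchema; }
        public abstract T exec(Object[] tuple);
    }

    static class ToJsonSketch extends EvalFunc<String> {
        @Override
        public String exec(Object[] tuple) {
            Schema s = getInputSchema();
            StringBuilder sb = new StringBuilder("{");
            for (int i = 0; i < tuple.length; i++) {
                if (i > 0) sb.append(",");
                sb.append('"').append(s.aliases.get(i)).append("\":\"")
                  .append(tuple[i]).append('"');
            }
            return sb.append("}").toString();
        }
    }

    public static void main(String[] args) {
        ToJsonSketch f = new ToJsonSketch();
        f.setInputSchema(new Schema("name", "city"));
        // prints {"name":"alice","city":"austin"}
        System.out.println(f.exec(new Object[]{"alice", "austin"}));
    }
}
```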

          Russell Jurney added a comment -

          The essential work is done here: https://github.com/rjurney/pig-to-json

          I will clean this up, add unit tests, and try to get this into Piggybank for 0.11.

          Olga Natkovich added a comment -

          Hi Russell,

          Is this going to be done in the next couple of weeks? If not, should we move it to 0.12?

          Russell Jurney added a comment -

          Move to 0.12. If I get the time I'll finish it and change it back.

          Russell Jurney added a comment -

          Have this working. Initial patch coming.

          Russell Jurney added a comment -

          The tests are still dummy tests, but the actual builtin works.

          Russell Jurney added a comment -

          Working test, ready for review.

          Russell Jurney added a comment -

          Implemented a ToJson builtin and added a unit test. Based on JsonStorage.

          Russell Jurney added a comment -

          Working patch with test file too.

          Russell Jurney added a comment -

          The earlier attachments duplicated other patches. This is the correct patch.

          Gunther Hagleitner added a comment -
          • The test file only seems to test the json loader but not the actual ToJson function
          • The json test input file doesn't seem to have all possible types
          • There's already a TOMAP function - just wondering if the naming should be TOJSON
          • There's copied code from JsonStorage. For easier maintenance, can you refactor to have both use the same code?
          • I'm not sure about the bytearray datatype. It uses "toString", which will interpret the data as UTF-8 characters and return null if it's not legal UTF-8. Is that the right thing to do for JSON? Base64? Refuse bytearray, since JSON doesn't support it?
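
          The bytearray concern can be seen with plain Java: decoding arbitrary bytes as UTF-8 is lossy (invalid sequences become the U+FFFD replacement character), while Base64 round-trips exactly. A sketch of the trade-off, not Pig code:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ByteArrayJson {
    // Interpreting raw bytes as UTF-8 can silently mangle non-text data:
    // invalid sequences are replaced rather than preserved.
    static String asUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Base64 keeps arbitrary bytes round-trippable inside a JSON string.
    static String asBase64(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xC3, (byte) 0x28}; // invalid UTF-8 sequence
        System.out.println(asUtf8(raw));   // lossy: U+FFFD followed by '('
        System.out.println(asBase64(raw)); // lossless: wyg=
        byte[] back = Base64.getDecoder().decode(asBase64(raw));
        System.out.println(back.length == raw.length); // true
    }
}
```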
          Jonathan Coveney added a comment -

          Gunther's points are good ones, Russell. Should definitely be taken into account.

          I was thinking about easier ways to deal with vararg issues, and I think I thought of one. Instead of using getArgToFunc, on the front end in outputSchema you can just verify that the schema is homogeneous and concat-able (i.e. all strings, all longs, whatever). If you want to be really fancy, you can find the code that defines whether or not a type is coercible and base it on the first argument (i.e. string, long, long could be concat-able as all strings).

          I've realized now that the existence of getInputSchema largely makes the getArgToFunc construct irrelevant except in the case where you need frontend specific information passed to the backend via the constructor based on the type of the input (which is not the case here).

          Then on the back end, you lazily initialize based on the getInputSchema schema. You can construct a concatenator based on the Schema which actually does the concatenating, and then just run every argument against it (i.e. if you detect String, String, String you initialize a StringConcatenator). There are a couple of ways you could design this, and this is where you could take into account coercion rules.

          Bam, no deep changes necessary to support varargs.
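
          A stand-alone sketch of the lazy-initialization shape described above. The schema lookup is replaced here by a runtime type check, and the names (Concatenator, StringConcatenator, LongConcatenator) are illustrative only:

```java
public class VarargConcat {
    interface Concatenator { Object concat(Object[] args); }

    static class StringConcatenator implements Concatenator {
        public Object concat(Object[] args) {
            StringBuilder sb = new StringBuilder();
            for (Object a : args) sb.append((String) a);
            return sb.toString();
        }
    }

    static class LongConcatenator implements Concatenator {
        // "Concatenation" for longs is summation here, just to show the shape.
        public Object concat(Object[] args) {
            long sum = 0;
            for (Object a : args) sum += (Long) a;
            return sum;
        }
    }

    private Concatenator delegate; // chosen lazily, standing in for getInputSchema()

    Object exec(Object[] args) {
        if (delegate == null) { // lazy init on first call
            delegate = (args[0] instanceof String)
                    ? new StringConcatenator() : new LongConcatenator();
        }
        return delegate.concat(args);
    }

    public static void main(String[] args) {
        VarargConcat f = new VarargConcat();
        System.out.println(f.exec(new Object[]{"foo", "bar", "baz"})); // foobarbaz
    }
}
```

          The point of the delegate is that the type dispatch happens once, not per tuple, and coercion rules would slot into the constructor of the chosen Concatenator.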

          Russell Jurney added a comment -

          Gunther: the patch was bad. Re-attaching, and it does use ToJson. TOJSON and ToJson can both work, but I'm not sure how to do that mapping. Jco/dmitriy?

          Will factor out copied code. My treatment of bytearray is consistent with JsonStorage.

          Russell Jurney added a comment -

          Working patch, albeit with duped code.

          Russell Jurney added a comment -

          Working patch, fixes unit test issue. Still dupe code.

          Russell Jurney added a comment -

          Re: feedback, made a JsonSerializer static class within JsonStorage, which is now used by both JsonStorage and ToJson.
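
          The shape of that refactor, sketched with stand-in methods (the real JsonSerializer lives inside JsonStorage; the class and method names here are illustrative only):

```java
public class SharedSerializer {
    // One serializer, two entry points, so the storage function and the
    // UDF can never drift apart in how they render values.
    static class JsonSerializer {
        static String serialize(Object value) {
            if (value instanceof String) return '"' + (String) value + '"';
            return String.valueOf(value);
        }
    }

    // Stand-in for JsonStorage: renders a record for writing.
    static String storeRecord(Object value) {
        return JsonSerializer.serialize(value);
    }

    // Stand-in for the ToJson UDF: returns the rendered value to the script.
    static String toJson(Object value) {
        return JsonSerializer.serialize(value);
    }

    public static void main(String[] args) {
        System.out.println(storeRecord("hello")); // "hello" (quoted)
        System.out.println(toJson(42));           // 42
    }
}
```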

          Russell Jurney added a comment -

          I've addressed most of the review comments; still working, but getting close.

          Russell Jurney added a comment -

          Review Board review is up: https://reviews.apache.org/r/9481/

          Jonathan Coveney added a comment -

          Jeremy Karn said:

          "I'm not sure if this is the right place to respond, but just a couple of comments:

          1) Our latest patch for Jira 1914 (https://issues.apache.org/jira/browse/PIG-1914) supports storing Maps with complex schemas. I think a lot of the tests will also be useful for this patch.

          2) I don't think we should serialize tuples as json lists. While it preserves order (which is nice) I suspect most users would rather have the direct mapping of pig alias to json key and to have their tuples as json objects with key/value pairs instead of as a list which requires referencing fields by position."

          Jonathan Coveney added a comment -

          In response to #2, I agree that users will want to preserve the alias, though IMHO the fact that tuples are ordered and JSON objects are not is a serious one. We can certainly maintain both, but it is going to lead to a more complex structure. Still, in my opinion, a proposal which does not maintain tuple order should be considered a bug (esp. if it currently does something different). Right now, Tuples are pretty much exclusively indexed by numeric position, not by alias name. While it is certainly convenient to refer by alias name, this goes against the pattern that has been promoted all along.

          I'm very open to suggestions on how to deal with this (any other Pig committers want to comment on whether or not they agree that this is an issue?). I mean, I would even be OK with having the tuple be an object but with an accompanying list that gives the proper order of all of the entries, as long as it is possible to reconstruct it as it existed in Pig.
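
          The compromise under discussion — a tuple serialized as a JSON object plus an accompanying list giving field order — might look like this (the `$fieldOrder` key name and the values are purely hypothetical):

```json
{
  "name": "alice",
  "age": 21,
  "$fieldOrder": ["name", "age"]
}
```

          A reader can rebuild the original tuple ("alice", 21) by walking `$fieldOrder`, while ordinary consumers still get alias-keyed access.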

          Jeremy Karn added a comment -

          Assuming there's not too much of an overhead (or there's an optional parameter to disable it) I like the idea of the tuple being an object with an added field specifying the ordering of the fields. Seems like a nice solution to both concerns.

          Jonathan Coveney added a comment -

          Sounds fine to me. Russell: have at it.

          Russell Jurney added a comment -

          I'm ok with fixing it up, but having talked to Doug Daniels, he has redone JSON generation/parsing to be more elephant-birdish (eat any valid json) in their branch of Pig. Therefore I am planning on working with them to get their changes in for 0.12. That being the case, I am hesitant to make any changes to JSON serialization in this ticket.

          I will file another JIRA for that change, and work with Doug on getting the code in/redoing JsonSerializer. I think we should discuss the fact that this could be a breaking change.

          Does anyone know how to make this thing respond to both caps and lowercase?

          Jeremy Karn added a comment -

          Russell, the current patch for PIG-1914 contains the changes we're using at Mortar Data.

          I agree that we'd want to update the JsonLoader to look for this new field to preserve the order of tuples. So maybe it does make sense to just do it together in a new jira once 1914 and this are on trunk.

          Russell Jurney added a comment -

          Thanks, Jeremy. You're growing over there.

          The current patch for PIG-1914 is against Piggybank, and I think it should instead alter the JsonStorage builtin. Is there a way to do this so that backwards compatibility is not broken?

          I also like the idea of including ordering in a field. We can return the json in a sorted fashion as well, but anything parsing it is likely to mess up order. Not our problem?

          Question: has there been the expectation that the json output of the current JsonStorage builtin will or can be read by external tools? If not, do we need to do anything other than ensure that new pig can read the old format to maintain backwards compatibility? If so... we're breaking compatibility. I suspect no guarantee was given, but is it relied on in practice?

          Russell Jurney added a comment -

          Conflicting patches

          Russell Jurney added a comment -

          Latest patch incorporating reviews.

          Russell Jurney added a comment -

          I'm thinking about this, and I want to get this done and get it committed. Then I want to create a new ticket for the changes in PIG-1914, and then finish them, porting the work in Piggybank to JsonStorage myself in as backward-compatible a way as possible.

          Prashant Kommireddi added a comment -

          Hey Russell Jurney, I should have noticed this earlier - it would make sense to have "json.close()" in finally block as well, just in case we hit an exception before the code path reaches this statement. Apologies for not having commented earlier regarding this.
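
          Prashant's point, illustrated with a stand-in generator (not the actual JSON generator class used by the patch): close() in a finally block runs whether or not write() throws.

```java
import java.io.Closeable;
import java.io.IOException;

public class FinallyClose {
    // Stand-in for the JSON generator whose close() should be guarded.
    static class Generator implements Closeable {
        boolean closed = false;
        void write(String s) throws IOException {
            if (s == null) throw new IOException("bad record");
        }
        @Override public void close() { closed = true; }
    }

    // Without the finally block, an exception in write() would leak the
    // generator; with it, close() runs on every code path.
    static Generator serialize(String record) {
        Generator json = new Generator();
        try {
            json.write(record);
        } catch (IOException e) {
            // swallowed for the sketch; a real UDF would rethrow
        } finally {
            json.close();
        }
        return json;
    }

    public static void main(String[] args) {
        System.out.println(serialize("row").closed); // true
        System.out.println(serialize(null).closed);  // true, even after the error
    }
}
```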

          Russell Jurney added a comment -

          Prashant: I don't think that is going to work. The output on two rows is corrupted by re-using the json object.

          Russell Jurney added a comment -

          Oh, wait - you're right. But I was commenting re: re-using the json object.

          Russell Jurney added a comment -

          Updated the patch with a finally block to close json.

          Russell Jurney added a comment -

          I would like to alter the way Pig handles JSON serialization in another ticket and get this one committed. Making JSON serialization happen in a static class as I've done should facilitate additional changes. However, owing to backwards compatibility issues, we need to think carefully about changes to the JSON serialization in its own ticket.

          Can plz to commit?

          Daniel Dai added a comment -

          Looks good. Several minor comments:

          • It seems better to take JsonSerializer out as an independent class
          • ToJson:30: r is never used; shall we remove this line?
          • Can we add javadoc to ToJson? And xdoc, since it is a builtin
          Alan Gates added a comment -

          Canceling patch pending changes per Daniel's feedback.

          Daniel Dai added a comment -

          Russell Jurney, are you still working on it?

          Russell Jurney added a comment -

          How long do I have to get this into 0.12? Is that still possible?

          Daniel Dai added a comment -

          Since 0.12 is already branched, we're not supposed to commit new features. Can we defer to 0.13.0?

          Russell Jurney added a comment -

          OK, I have a need to JSONize data to get bags of tuples into an HBase field. I am going to commit this to the DataFu project, which is entering the Apache Incubator. I will update this ticket with a GitHub pull request when I make one.

          Russell Jurney added a comment -

          Is anyone else able to get the 6th patch to apply to trunk? None of this code has changed, so far as I know, but I can't get the patch to apply.

          Daniel Dai added a comment -

          That patch is out of sync. Attaching PIG-2641-7.patch, rebased against trunk.


            People

            • Assignee:
              Russell Jurney
              Reporter:
              Russell Jurney
            • Votes:
              1
              Watchers:
              10

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated:
                Original Estimate - 96h
                96h
                Remaining:
                Remaining Estimate - 96h
                96h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development