Pig / PIG-1914

Support load/store JSON data in Pig

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Release Note:
      Adds Piggybank functions for loading/storing JSON without relying on storing metadata alongside it.
    • Tags:
      JSON LoadFunc StoreFunc

      Description

      JSON is a commonly used data storage format. It is popular for storing structured data, especially for JavaScript data exchange.
      Pig should have the ability to load/store data in JSON format. I plan to write one for Piggybank.

      1. json.patch
        67 kB
        Jonathan Packer
      2. PIG-1914.patch
        13 kB
        Michael May

        Issue Links

          Activity

          Dmitriy V. Ryaboy added a comment -

          There is already a JSON loader in Elephant-Bird.

          Olga Natkovich added a comment -

          Unlinking from the release.

          Please check the one Dmitriy suggested. Also, to get this into the release, you need to find somebody who would commit to doing the work on this soon. We are going to start stabilizing 0.9 in a week or so.

          Chao Tian added a comment -

          Hi Dmitriy, could you share a link to the JSON loader you mentioned?

          Dmitriy V. Ryaboy added a comment -

          For Pig 0.6: https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/load
          For Pig 0.8: https://github.com/dvryaboy/elephant-bird/tree/pig-08/src/java/com/twitter/elephantbird/pig8/load

          A Pig 0.9 version might be interesting because in this version, Pig understands typed keys, so it's finally possible to return complex structures as values, actually delivering the whole Json object.

          If you want to add directly to Pig, you'll probably want to use Jackson for parsing instead of SimpleJson, as that library is already included in Pig dependencies (and maybe even Hadoop ones?).

          Chao Tian added a comment -

          Hi Dmitry,

          Thanks for your comment. It is good to see that there is already a JSON loader. I read that code and found that the current solution parses the input JSON data into a map object.

          However, in my design, I plan to support JSON-to-Tuple conversion. The key of each JSON element would be loaded as the alias of a Tuple field, and the value would be loaded as that field's data. Simple data types can be converted easily. For complex types, a JSON object can be mapped to a Pig Tuple, and a JSON array can be mapped to a Pig DataBag.

          I also plan to write a storer to store data in JSON format.

          Any thoughts?

          Thanks,
          Chao

          Dmitriy V. Ryaboy added a comment -

          That is a good idea, it would be quite useful for a number of scenarios.

          One problem with this design is that JSON objects often do not have a consistent set of keys, and each of the json objects you read may in fact have a totally new set of keys. How do you suggest dealing with something like that?

          Chao Tian added a comment -

          Yeah, I agree with you that we have this problem. However, I think we should assume that JSON records in the same data file have a similar schema. Small differences can be allowed, but they should be similar, right?

          To deal with these small differences, we could define the schema of the loaded tuple using the complete set of keys. I plan to have two methods of determining the schema of the data: 1) the user can pass a schema string that describes the loaded data; 2) if the user passes nothing, the loader parses the first line of input data to derive the schema. Either way, the loaded data will have a schema, which should be the complete set of keys. If some JSON records do not contain some fields, those fields are left as null in Pig.
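The projection described above (schema = the complete key set, missing fields left as null) can be sketched with plain collections. This is a minimal illustration, not code from any patch on this issue; the names `inferSchema` and `project` are hypothetical.

```java
import java.util.*;

// Illustrative sketch of the proposed schema handling: the first record's
// keys define the schema, and later records are projected onto it, with
// nulls standing in for missing fields.
public class SchemaInference {
    // Derive the schema from the first record's key set.
    static List<String> inferSchema(Map<String, Object> firstRecord) {
        return new ArrayList<>(firstRecord.keySet());
    }

    // Project a record onto the schema; absent keys become null.
    static Object[] project(List<String> schema, Map<String, Object> record) {
        Object[] tuple = new Object[schema.size()];
        for (int i = 0; i < schema.size(); i++) {
            tuple[i] = record.get(schema.get(i)); // null if the field is absent
        }
        return tuple;
    }
}
```

A real loader would of course build Pig Tuples via TupleFactory rather than Object arrays, but the null-filling logic would be the same.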

          I think this method solves our problem. It would also let us support columnar filtering in the future, i.e. loading only the desired columns of the JSON data.

          Dmitriy V. Ryaboy added a comment -

          That design makes sense, but the assumption that the first few records you read are going to have the full set of keys often does not hold true in my experience. It's probably very useful for a large subset of json-loading needs out there, though. Sounds like a good approach.

          Ed Summers added a comment -

          +1 for a JSON Loader/Storer that is part of PiggyBank. elephant-bird is nice, but elephant-bird needs to a) be discovered, and b) be built, which is non-trivial given the various dependencies. elephant-bird also seems to be compatible only with Pig 0.6.

          Chao Tian added a comment -

          Thanks Ed. I am working on the loader right now. I have finished a json.org version, and I am now trying to rewrite it using the Jackson streaming API to parse JSON from a byte stream.

          Bill Graham added a comment -

          +1 for a Map solution that allows unknown JSON keys/values to be handled. We often run jobs that create summaries of counts of all JSON keys, many of which are either unknown or not reliably implied by reading a random row.

          If instead a JSON loader is contributed that returns Tuples from either a pre-defined schema or via introspection, I suggest it be named in a way that implies this. Multiple implementations can be supported.

          Michael May added a comment -

          This issue is from several months ago; any word on progress? I haven't seen any JSON support pop up in PiggyBank.

          I've been using the JSON loader as seen here: https://gist.github.com/601331
          Note this is only for loading, not storing!

          I realize this is only half of the requested JSON features (load, not store), but I think having a JSON loader is better than the nothing that is currently in PiggyBank. I was very sad when I noticed that PiggyBank contains a CSV loader and an XML loader, but no JSON loader.

          I'd be more than happy to get this loader rolled into the PiggyBank SVN with approval.

          Dmitriy V. Ryaboy added a comment -

          All you've got to do is post a patch, though that particular gist is a little encumbered, since it has had so many authors and all of them would have to sign off that they are fine with the Apache license.

          Michael May added a comment -

          I'm getting close to being ready to post a patch for a loader, but have a question (pardon me if this is not the right place to ask it, but this thread seems like a reasonable place).

          The JSON parser I'm currently using is an external dependency (namely, json-simple). I'm /assuming/ it's ok to add this dependency to the project. I'm familiar with Maven's way of handling dependencies, but not so much with Ant's. After doing a little digging around, I found /ivy/pig.pom, which looks similar to the dependency section of a Maven pom.xml file. Can I add the dependency there, or is there some other location where I should specify it?

          Also (somewhat unrelated, and a noob question): I'm currently working on this feature off of trunk. Is that where I should be working? The specified 'affected version' is 0.8.0, and I see there are 0.8 and 0.9 branches. I just want to make sure I'm working in the right place.

          Thanks

          Dmitriy V. Ryaboy added a comment -

          Michael,
          I would strongly encourage you to use Jackson instead. It's already a dependency, and a lot of folks are starting to complain about the weight of the pig jar.

          Trunk's the right place to add new features. Initially this should go into contrib/piggybank until it proves stable.

          Michael May added a comment -

          I didn't realize there was already a dependency for doing json parsing. That is good news! I'll work with it.

          Currently I have this in contrib/piggybank/storage. If I need to move it up one directory level, then that is no problem.

          Dmitriy V. Ryaboy added a comment -

          no, storage is the right place, I just meant don't put it into Pig builtins.

          Dmitriy V. Ryaboy added a comment -

          Very cool.

          Some quick code review notes:

          Tiny typo here:
          "e = foreach d generate flatten(men#'value') as val;" – that should read menu#'value'

          boolean notDone = in.nextKeyValue();
          if (!notDone) {
              return null;
          }
          

          Better:

          if (!in.nextKeyValue()) {
              return null;
          }
          

          Parse exceptions: it's better to increment a counter and move on than to break on a bad input string. Throwing an exception kills the whole job. So maybe something like

          t = null;
          while (t == null && in.nextKeyValue()) {
           ...
          }
          return t;
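The skip-and-count pattern sketched above can be fleshed out generically, with a plain Iterator standing in for Hadoop's RecordReader. This is only an illustration of the suggestion; `SkippingReader`, `getNext`, and `badRecords` are hypothetical names, and in a real LoadFunc the counter would be a Hadoop/Pig counter rather than a field.

```java
import java.util.Iterator;
import java.util.function.Function;

// Illustrative sketch of "increment a counter and move on" for parse
// failures: a null from parse() marks a bad record, which is counted
// and skipped instead of killing the job with an exception.
public class SkippingReader {
    long badRecords = 0; // would be a Hadoop counter in a real loader

    Object getNext(Iterator<String> in, Function<String, Object> parse) {
        Object t = null;
        while (t == null && in.hasNext()) {
            t = parse.apply(in.next());
            if (t == null) {
                badRecords++; // bad input: count it and try the next record
            }
        }
        return t; // null only once the input is exhausted
    }
}
```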
          

          In flatten_array, if the value is an array, you allocate a new bag, populate it recursively, and add the contents of the new bag to the old bag. Why not skip the object allocation and copy, and simply pass the original bag into the recursive call?

          Also: are null values for keys just plain unsupported? You skip them.

          setLocation: not that it really matters, but for consistency, you should use PigTextInputFormat instead of PigFileInputFormat here.

          schema: probably makes sense to implement getSchema?

          Dmitriy V. Ryaboy added a comment -

          canceling patch status, pending review response.

          please note that in the meantime, JsonStorage/JsonLoader were added to Pig, but they are bound to a strict schema, and the loader essentially only works on JSON stored by JsonStorage, not arbitrary JSON. So we probably still need an alternative loader.

          Also note that EB is now much more modular (so, fewer dependencies required if you do not need them), and the json storage module there allows deep parsing (tuples, maps, the works). It does not sample any records to auto-determine schema, and still returns a map.

          -D

          Jonathan Packer added a comment -

          Adds Piggybank functions for loading/storing JSON without relying on storing metadata alongside it.

          Jonathan Packer added a comment -

          Hi, I submitted a patch with an implementation of JSON load and store functions which do not rely on metadata being stored alongside the data. There is javadoc documentation for each function, but here is a summary of the features.

          The JsonLoader can either be passed a schema as a string argument, or it can infer a schema if none is provided. If passed a schema, it will load fields in the JSON which match the field names in the schema, ignoring extra fields, writing nulls for missing fields, and handling out-of-order fields properly.

          If not passed a schema, it will load the entire document as a map. The values of the map will either be bytearrays (for scalar values) or further maps/bags (for nested objects and arrays).

          Example usage:

          json = LOAD '$INPUT_PATH' USING org.apache.pig.piggybank.storage.JsonLoader('a: int, t: (i: int, j: int)');

          STORE json INTO '$OUTPUT_PATH' USING org.apache.pig.piggybank.storage.JsonStorage();

          Jonathan Packer (Mortar Data)

          Jonathan Packer added a comment -

          A note about handling arrays: the proposed JsonLoader will wrap the values of a flat JSON array, ex. "arr": [1, 2, 3, 4], in single-element tuples by default. However, if a tuple schema, for example coords: (lat: double, long: double), is specified for a field which is a flat JSON array, the JsonLoader will cast the array to a tuple. Nested arrays are loaded properly if a valid schema is specified.
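The two array behaviors described above can be sketched with plain collections. This is only an illustration of the stated rules, not code from the patch; `toBag` and `toTuple` are hypothetical names, and a real implementation would build Pig DataBags and Tuples via the Pig factories.

```java
import java.util.*;

// Illustrative sketch of the array-handling rules: without a schema,
// each element of a flat JSON array is wrapped in a single-element
// tuple (Object[] here) to form a bag; with a tuple schema of matching
// arity, the whole array is cast to one tuple.
public class ArrayHandling {
    // Default: [1, 2, 3, 4] becomes a bag of single-element tuples.
    static List<Object[]> toBag(List<Object> jsonArray) {
        List<Object[]> bag = new ArrayList<>();
        for (Object v : jsonArray) {
            bag.add(new Object[] { v });
        }
        return bag;
    }

    // With a tuple schema: the array is cast to a single tuple.
    static Object[] toTuple(List<Object> jsonArray, int arity) {
        if (jsonArray.size() != arity) {
            throw new IllegalArgumentException("array does not match tuple schema");
        }
        return jsonArray.toArray();
    }
}
```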

          Russell Jurney added a comment -

          I don't think this should go in Piggybank; I think it should add more robust handling to the JsonStorage builtin.

          Russell Jurney added a comment -

          See PIG-2641 for more discussion.

          Russell Jurney added a comment -

          Conflicting patches

          Cheolsoo Park added a comment -

          As per discussion, it would be nicer if we could improve the built-in JsonLoader instead of adding a new one to piggybank. Canceling the patch.


            People

            • Assignee:
              Unassigned
              Reporter:
              Chao Tian
            • Votes:
              6
              Watchers:
              20

              Dates

              • Created:
                Updated:

                Development