Pig / PIG-3121

Optionally convert long to chararray in JsonStorage

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.10.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      I work with a data set that uses random longs (64-bit integers) as identifiers. Recently I've been accessing the data from Pig and using JsonStorage to save records, which I then run through another script to get JSON that I can feed into other tools. One of those tools is broken in the sense that it treats all numbers as 64-bit floating point, so it can't faithfully reproduce most of the identifiers I pass it. My workaround is to convert the identifiers to strings before they get to that tool.

      If I provide a patch, is there interest in adding an option to JsonStorage that tells it to serialize all longs as if they are strings?
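To make the precision problem concrete, here is a minimal sketch of what happens when a 64-bit identifier passes through a 64-bit float. The class and method names are made up for this illustration and are not part of Pig or JsonStorage. A double has a 53-bit significand, so not every long above 2^53 in magnitude can be represented exactly.

```java
public class LongPrecisionDemo {
    // True when the long survives a round trip through a 64-bit double unchanged.
    public static boolean survivesDouble(long id) {
        double asDouble = (double) id;
        // The cast back to long saturates at Long.MAX_VALUE, so reject 2^63 explicitly.
        return asDouble < 0x1p63 && (long) asDouble == id;
    }

    public static void main(String[] args) {
        long exact = 1L << 53;        // 2^53, exactly representable as a double
        long inexact = (1L << 53) + 1; // 2^53 + 1, rounds to a neighbouring double

        System.out.println(exact + " survives: " + survivesDouble(exact));
        System.out.println(inexact + " survives: " + survivesDouble(inexact));
        System.out.println(inexact + " reads back as " + (long) (double) inexact);

        // Serializing the identifier as a string keeps every digit.
        System.out.println("as string: \"" + Long.toString(inexact) + "\"");
    }
}
```

A random 64-bit identifier will usually fall in the range where this rounding occurs, which is why quoting the value as a string is the safer representation for a consumer that parses all numbers as doubles.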

        Activity

        Alan Gates added a comment -

        Do you need to patch JsonStorage? Can't you change your script to cast these values to chararray before storing?

        Josh Levy added a comment -

        I think (but would love to be convinced otherwise) that casting would be unpleasant and fragile. The original data has a very complex schema. The Pig DESCRIBE command creates almost 10kb of output. I'm not excited about generating cast statements from the description, and I'd hate to have to redo the casts if the schema gets tweaked.

        For a bit more background on that, the data comes from Protobufs files. I use ElephantBird to load the data into Pig, and ElephantBird automatically creates the Pig schema from Protobufs.

        I do have other options besides changing JsonStorage.

        • Changing the values in Protobufs is politically difficult for me, but it is probably the most elegant / least hacky solution
        • I could modify ElephantBird to optionally do the cast at load time
        • I could write a UDF to walk through an arbitrary schema and do all of the casts
        • I could modify JsonStorage as proposed
        • I could continue what I'm currently doing and postprocess the output of JsonStorage

        The real problem is in the other tool and not in JsonStorage. Patching JsonStorage is attractive because the option would be so easy to use when writing new Pig scripts, and hopefully it can give others a quick path out of this problem.
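The "walk through an arbitrary schema" option from the list above can be sketched as a recursive conversion. This is only an illustration under stated assumptions: plain `java.util` lists and maps stand in for Pig tuples, bags, and maps, and `LongToStringWalker` is a hypothetical name, not an existing UDF. A real UDF would work with Pig's own data types and would also have to report the changed schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LongToStringWalker {
    // Recursively copies a nested value, replacing every Long with its decimal string.
    // Lists stand in for Pig tuples and bags, Maps for Pig maps (assumption of this sketch).
    public static Object convert(Object value) {
        if (value instanceof Long) {
            return value.toString();
        }
        if (value instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) value) {
                out.add(convert(item));
            }
            return out;
        }
        if (value instanceof Map) {
            Map<Object, Object> out = new LinkedHashMap<>();
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                out.put(entry.getKey(), convert(entry.getValue()));
            }
            return out;
        }
        // Everything else (ints, doubles, strings, nulls) passes through unchanged.
        return value;
    }

    public static void main(String[] args) {
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("id", 9007199254740993L);
        user.put("score", 1.5);
        List<Object> friends = new ArrayList<>();
        friends.add(1234567890123456789L);
        user.put("friends", friends);

        System.out.println(convert(user));
    }
}
```

Because the walk is driven by the runtime values rather than by a list of hand-written casts, it would not need to change when the schema is tweaked.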

        Alan Gates added a comment - edited

        My concern is what you brought up: the problem here isn't JsonStorage.

        One other option I'd like to point out is that you could extend JsonStorage with a new class, CasterJsonStorage. The only method it would implement would be putNext. In that method it could do the casts and then call super.putNext(). This is hopefully lightweight enough from your viewpoint and avoids pushing one-off features into JsonStorage.
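The shape of that suggestion (override one method, convert, then delegate to the parent) can be sketched without any Pig dependency. Everything below is a stand-in: `PlainJsonWriter` is a hypothetical toy replacement for JsonStorage, its `putNext` returns a string instead of writing to a record writer, and records are flat lists rather than Pig tuples. A real subclass would probably also need to keep the stored schema consistent with the converted values; that part is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CasterSketch {
    // Hypothetical stand-in for JsonStorage: renders a flat record as a JSON array.
    public static class PlainJsonWriter {
        public String putNext(List<Object> record) {
            StringBuilder sb = new StringBuilder("[");
            for (int i = 0; i < record.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                Object value = record.get(i);
                if (value instanceof String) {
                    sb.append('"').append(value).append('"'); // no escaping; sketch only
                } else {
                    sb.append(value);
                }
            }
            return sb.append(']').toString();
        }
    }

    // Hypothetical analogue of CasterJsonStorage: the only overridden method is putNext.
    public static class CastingJsonWriter extends PlainJsonWriter {
        @Override
        public String putNext(List<Object> record) {
            List<Object> cast = new ArrayList<>(record.size());
            for (Object value : record) {
                // Longs become strings; every other value is passed through untouched.
                cast.add(value instanceof Long ? value.toString() : value);
            }
            return super.putNext(cast);
        }
    }

    public static void main(String[] args) {
        List<Object> record = Arrays.asList(9007199254740993L, 7, "a");
        System.out.println(new PlainJsonWriter().putNext(record));   // [9007199254740993,7,"a"]
        System.out.println(new CastingJsonWriter().putNext(record)); // ["9007199254740993",7,"a"]
    }
}
```

The appeal of this design is that the conversion lives in one small class outside the parent, so the parent needs no new option.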


          People

          • Assignee: Unassigned
          • Reporter: Josh Levy
          • Votes: 0
          • Watchers: 2
