Pig
  1. Pig
  2. PIG-2209

JsonMetadata fails to find schema for glob paths

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      JsonMetadata, used in PigStorage to work with serialized schemas, does not correctly interpret paths like '/foo/bar/

      {1,2,3}

      ' and throws an exception:

      Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1131: Could not find schema file for file:///foo/bar/{1,2}
      	at org.apache.pig.builtin.JsonMetadata.nullOrException(JsonMetadata.java:217)
      	at org.apache.pig.builtin.JsonMetadata.getSchema(JsonMetadata.java:186)
      	at org.apache.pig.builtin.PigStorage.getSchema(PigStorage.java:438)
      	at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:150)
      	... 17 more
      Caused by: java.io.IOException: Unable to read file:///foo/bar/z/{1,2}
      	at org.apache.pig.builtin.JsonMetadata.findMetaFile(JsonMetadata.java:106)
      	at org.apache.pig.builtin.JsonMetadata.getSchema(JsonMetadata.java:183)
      	... 19 more
      Caused by: java.net.URISyntaxException: Illegal character in path at index 36: file:///foo/bar/{1,2}
      	at java.net.URI$Parser.fail(URI.java:2809)
      	at java.net.URI$Parser.checkChars(URI.java:2982)
      

        Issue Links

          Activity

          Hide
          Daniel Dai added a comment -

          Unit tests pass. test-patch:
          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 4 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] -1 release audit. The applied patch generated 473 release audit warnings (more than the trunk's current 466 warnings).

          Patch committed to trunk and 0.10 branch.

          Show
          Daniel Dai added a comment - Unit tests pass. test-patch: [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 4 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] -1 release audit. The applied patch generated 473 release audit warnings (more than the trunk's current 466 warnings). Patch committed to trunk and 0.10 branch.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Looks good.

          I think we need to firm up schema resolution when dealing with globs, but that's a different ticket (loading foo/

          {bar,baz}

          where the schema is under foo/bar/.pig_schema doesn't give you the schema)

          Show
          Dmitriy V. Ryaboy added a comment - Looks good. I think we need to firm up schema resolution when dealing with globs, but that's a different ticket (loading foo/ {bar,baz} where the schema is under foo/bar/.pig_schema doesn't give you the schema)
          Hide
          Daniel Dai added a comment -

          Thanks!

          Show
          Daniel Dai added a comment - Thanks!
          Hide
          Dmitriy V. Ryaboy added a comment -

          yep looking at it today.

          Show
          Dmitriy V. Ryaboy added a comment - yep looking at it today.
          Hide
          Daniel Dai added a comment -

          The patch is ready for review. Dmitriy, are you able to take a look?

          Show
          Daniel Dai added a comment - The patch is ready for review. Dmitriy, are you able to take a look?
          Hide
          Dmitriy V. Ryaboy added a comment -

          FWIW, the reason I filed this bug when I did was that I tried switching it out in production at Twitter. I will do that again as soon as we have a fix, before the 0.10 release; that should surface issues pretty quickly, if they exist.

          Show
          Dmitriy V. Ryaboy added a comment - FWIW, the reason I filed this bug when I did was that I tried switching it out in production at Twitter. I will do that again as soon as we have a fix, before the 0.10 release; that should surface issues pretty quickly, if they exist.
          Hide
          Olga Natkovich added a comment -

          Our users especially production ones are not very keen on changing their processes.

          I think this probably an ok change since our users do not have schema files right now. I am worried that we have done such a major re-work of PigStorage which is one of the main ways to get the data and how much problem we will see as the result like the one we are seeing in the bug

          Show
          Olga Natkovich added a comment - Our users especially production ones are not very keen on changing their processes. I think this probably an ok change since our users do not have schema files right now. I am worried that we have done such a major re-work of PigStorage which is one of the main ways to get the data and how much problem we will see as the result like the one we are seeing in the bug
          Hide
          Dmitriy V. Ryaboy added a comment -

          Olga, we have a "-noschema" parameter that lets people ignore whatever's on disk w.r.t. schemas, so if a bug is tickled, it will be easy to work around it.

          Show
          Dmitriy V. Ryaboy added a comment - Olga, we have a "-noschema" parameter that lets people ignore whatever's on disk w.r.t. schemas, so if a bug is tickled, it will be easy to work around it.
          Hide
          Daniel Dai added a comment -

          I checked the code again, we don't do extensive globStatus. Only one globStatus will be called in the case of globbing. So if we agree we want to pay this price in trade of the benefit of schema file, the only thing left is just bug fix. TestPigStorage do have test cases cover basic schema/noschema, we do need to add some globbing tests.

          I will submit a patch shortly.

          Show
          Daniel Dai added a comment - I checked the code again, we don't do extensive globStatus. Only one globStatus will be called in the case of globbing. So if we agree we want to pay this price in trade of the benefit of schema file, the only thing left is just bug fix. TestPigStorage do have test cases cover basic schema/noschema, we do need to add some globbing tests. I will submit a patch shortly.
          Hide
          Olga Natkovich added a comment -

          My concern is that the data is generated by one set of users but used by another and I am not sure how well we have tested the code that messages the data since that was not a very common use case.

          I could howevere see that if the schema file is not there nothing changes which could be good enough. I do recommend that we at least add more test case for schema provided both in file and in load statement

          Show
          Olga Natkovich added a comment - My concern is that the data is generated by one set of users but used by another and I am not sure how well we have tested the code that messages the data since that was not a very common use case. I could howevere see that if the schema file is not there nothing changes which could be good enough. I do recommend that we at least add more test case for schema provided both in file and in load statement
          Hide
          Alan Gates added a comment -

          I agree with Dmitriy that this will be very useful to most users with pretty minimal cost. My concern is in the glob case, where we're potentially doing thousands of stats on the NameNode. I would suggest adding a cap on the number of directories it could read, and providing a variable users could set to up this if they need to. For example, if a glob tried to access more than 100 directories, it would fail with a message like:

          Error: PigStorage exceeded max number of input directories. To avoid this, you can turn of auto schema detection by setting what.ever.the.variable.is to false or you can increase the maximum allowed directories by setting what.ever.that.variable.is (warning, this will increase the load on your NameNode).

          Olga, I don't understand your concern for backward compatibility. If the user has both a schema and an as clause we try to massage the schema into the as clause. The only issue will be if they store it with a schema and then give an as clause that is not compatible by our casting rules (e.g. the schema says a field is a long and they declare it as a string in the as clause). Do you think that case is common?

          Show
          Alan Gates added a comment - I agree with Dmitriy that this will be very useful to most users with pretty minimal cost. My concern is in the glob case, where we're potentially doing thousands of stats on the NameNode. I would suggest adding a cap on the number of directories it could read, and providing a variable users could set to up this if they need to. For example, if a glob tried to access more than 100 directories, it would fail with a message like: Error: PigStorage exceeded max number of input directories. To avoid this, you can turn of auto schema detection by setting what.ever.the.variable.is to false or you can increase the maximum allowed directories by setting what.ever.that.variable.is (warning, this will increase the load on your NameNode). Olga, I don't understand your concern for backward compatibility. If the user has both a schema and an as clause we try to massage the schema into the as clause. The only issue will be if they store it with a schema and then give an as clause that is not compatible by our casting rules (e.g. the schema says a field is a long and they declare it as a string in the as clause). Do you think that case is common?
          Hide
          Olga Natkovich added a comment -

          My main concern is backward compatibility. In use cases at Yahoo, we trained users to provide name and type of the data during load time. Are we sure (and do we have enough use cases) that they will still see the names and types they requested?

          I think for 10, we should turn it off by default and see if it make sense to switch that down the road.

          Show
          Olga Natkovich added a comment - My main concern is backward compatibility. In use cases at Yahoo, we trained users to provide name and type of the data during load time. Are we sure (and do we have enough use cases) that they will still see the names and types they requested? I think for 10, we should turn it off by default and see if it make sense to switch that down the road.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Not quite, I am saying there is huge benefit when it is generated by Pig, and there is fairly little cost when it isn't. If it's an actual problem (and I doubt it would be), we should give a property to turn this behavior off. Prematurely optimizing by kneecapping the feature just seems like a good way to ensure most people (who, by and large, barely hit the NN) will never discover it.

          Show
          Dmitriy V. Ryaboy added a comment - Not quite, I am saying there is huge benefit when it is generated by Pig, and there is fairly little cost when it isn't. If it's an actual problem (and I doubt it would be), we should give a property to turn this behavior off. Prematurely optimizing by kneecapping the feature just seems like a good way to ensure most people (who, by and large, barely hit the NN) will never discover it.
          Hide
          Daniel Dai added a comment -

          You assume the input file is generated by PigStorage. But in many cases, it is not true.

          Show
          Daniel Dai added a comment - You assume the input file is generated by PigStorage. But in many cases, it is not true.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Actually, having given it 30 seconds more thought, I disagree with myself. It should be enabled by default. Otherwise most users won't get the benefit (because they won't know they can), and having this on by default is a huge usability benefit. The few users who are running clusters with heavy enough load that the extra NN call makes a difference can turn this off – and we can call this out in release notes. I think this is a case where the benefits are worth the cost. Every single file generated with PigStorage will have a schema associated with it, that'll be very useful.

          Show
          Dmitriy V. Ryaboy added a comment - Actually, having given it 30 seconds more thought, I disagree with myself. It should be enabled by default. Otherwise most users won't get the benefit (because they won't know they can), and having this on by default is a huge usability benefit. The few users who are running clusters with heavy enough load that the extra NN call makes a difference can turn this off – and we can call this out in release notes. I think this is a case where the benefits are worth the cost. Every single file generated with PigStorage will have a schema associated with it, that'll be very useful.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Global flag works for me (in my case, most uses of pig storage would use the schema).

          Show
          Dmitriy V. Ryaboy added a comment - Global flag works for me (in my case, most uses of pig storage would use the schema).
          Hide
          Daniel Dai added a comment -

          That way we avoid globStatus, but still several existence check is needed. Provide most PigStorage use cases don't use schema I believe, it should disable by default. How about a global flag?

          Show
          Daniel Dai added a comment - That way we avoid globStatus, but still several existence check is needed. Provide most PigStorage use cases don't use schema I believe, it should disable by default. How about a global flag?
          Hide
          Dmitriy V. Ryaboy added a comment -

          I think the behavior we want is
          1) if the schema exists, use it
          2) if the schema does not exist, behave as it behaves currently.

          If we get rid of the per-file schema thing, and instead always only look for a .pig_schema file in the directory that contains the files, the cost should be relatively small.

          Show
          Dmitriy V. Ryaboy added a comment - I think the behavior we want is 1) if the schema exists, use it 2) if the schema does not exist, behave as it behaves currently. If we get rid of the per-file schema thing, and instead always only look for a .pig_schema file in the directory that contains the files, the cost should be relatively small.
          Hide
          Olga Natkovich added a comment -

          I don't think that's what we want. This could be costly and non-backward compatible

          Show
          Olga Natkovich added a comment - I don't think that's what we want. This could be costly and non-backward compatible
          Hide
          Daniel Dai added a comment -

          I also see by default, PigStorage will search for schema files in input directory, is that intentional? That will add burden to namenode. Shall we flip it?

          Show
          Daniel Dai added a comment - I also see by default, PigStorage will search for schema files in input directory, is that intentional? That will add burden to namenode. Shall we flip it?
          Hide
          Dmitriy V. Ryaboy added a comment -

          that would be fantastic.

          Show
          Dmitriy V. Ryaboy added a comment - that would be fantastic.
          Hide
          Daniel Dai added a comment -

          Hi, Dmitriy, I can make time, can I help on this?

          Show
          Daniel Dai added a comment - Hi, Dmitriy, I can make time, can I help on this?
          Hide
          Dmitriy V. Ryaboy added a comment -

          I can try but I am not sure I have the time at the moment.. it's my bad, though, so if no one can make time, I'll figure something out to unblock the release.

          Show
          Dmitriy V. Ryaboy added a comment - I can try but I am not sure I have the time at the moment.. it's my bad, though, so if no one can make time, I'll figure something out to unblock the release.
          Hide
          Olga Natkovich added a comment -

          Dmitry are you planning to resolve this quickly?

          Show
          Olga Natkovich added a comment - Dmitry are you planning to resolve this quickly?
          Hide
          Dmitriy V. Ryaboy added a comment -

          Upgraded to blocker, as this can break existing scripts. I'll work on this.

          Show
          Dmitriy V. Ryaboy added a comment - Upgraded to blocker, as this can break existing scripts. I'll work on this.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Slight correction: this only happens when the items referred to in the glob are directories. It works fine when they are files or parts of file names.

          Show
          Dmitriy V. Ryaboy added a comment - Slight correction: this only happens when the items referred to in the glob are directories. It works fine when they are files or parts of file names.

            People

            • Assignee:
              Daniel Dai
              Reporter:
              Dmitriy V. Ryaboy
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development