Pig
  1. Pig
  2. PIG-3223

AvroStorage does not handle comma separated input paths

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0, 0.11
    • Fix Version/s: 0.12.0, 0.11.2
    • Component/s: piggybank
    • Labels:
      None

      Description

      In pig 0.11, a patch was issued to AvroStorage to support globs and comma separated input paths (PIG-2492). While this function works fine for glob-formatted input paths, it fails when issued a standard comma separated list of paths. fs.globStatus does not seem to be able to parse out such a list, and a java.net.URISyntaxException is thrown when toURI is called on the path.

      I have a working fix for this, but it's extremely ugly (basically checking if the string of input paths is globbed, otherwise splitting on ","). I'm sure there's a more elegant solution. I'd be happy to post the relevant methods and "fixes" if necessary.

      1. PIG-3223.branch-0.11.patch.txt
        8 kB
        Johnny Zhang
      2. PIG-3223.patch.txt
        8 kB
        Johnny Zhang
      3. PIG-3223.patch.txt
        8 kB
        Johnny Zhang
      4. PIG-3223.patch.txt
        8 kB
        Johnny Zhang
      5. PIG-3223.patch.txt
        8 kB
        Johnny Zhang
      6. PIG-3223.patch.txt
        2 kB
        Johnny Zhang
      7. AvroStorageUtils.patch-2
        26 kB
        Michael Kramer
      8. AvroStorage.patch-2
        28 kB
        Michael Kramer
      9. AvroStorageUtils.patch
        26 kB
        Michael Kramer
      10. AvroStorage.patch
        28 kB
        Michael Kramer

        Issue Links

          Activity

          Hide
          Rohini Palaniswamy added a comment -

          +1. Committed to 0.11.2 and trunk. Thanks Johnny

          Show
          Rohini Palaniswamy added a comment - +1. Committed to 0.11.2 and trunk. Thanks Johnny
          Hide
          Johnny Zhang added a comment -

          Rohini, thanks for your +1 on RB. Here is the patch for branch 0.11. Could please help commit the patch (to trunk and 0.11)? Appreciate.

          Show
          Johnny Zhang added a comment - Rohini, thanks for your +1 on RB. Here is the patch for branch 0.11. Could please help commit the patch (to trunk and 0.11)? Appreciate.
          Hide
          Johnny Zhang added a comment -

          latest patch addressed Rohini's comments in RB

          Show
          Johnny Zhang added a comment - latest patch addressed Rohini's comments in RB
          Hide
          Viraj Bhat added a comment -

          Johnny, Thanks let me check as to why it is failing in trunk. I never tested on trunk, as I thought this was only going to be committed in Pig 0.11.

          Show
          Viraj Bhat added a comment - Johnny, Thanks let me check as to why it is failing in trunk. I never tested on trunk, as I thought this was only going to be committed in Pig 0.11.
          Hide
          Johnny Zhang added a comment -

          Rohini Palaniswamy, could you please review the latest patch https://issues.apache.org/jira/secure/attachment/12581645/PIG-3223.patch.txt ?
          new added test cases in TestAvroStorage is also clean. Please let me know any concern regarding to the implementation, I will revised it as soon as possible!
          Let me know if you want me post another patch for 0.11 branch too!

          I also tried Viraj's patch 'PIG-3223.viraj.txt', it not clean on trunk

          patching file contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
          patching file contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
          Hunk #2 succeeded at 45 (offset 1 line).
          Hunk #3 succeeded at 72 (offset 1 line).
          Hunk #4 succeeded at 91 with fuzz 1 (offset 1 line).
          Hunk #5 succeeded at 1005 (offset 19 lines).
          

          also the TestAvroStorage test failed

              <error message="Error during parsing. java.net.URISyntaxException: Illegal character in scheme name at index 4: test_glob1.avro,file:" type="org.apache.pig.impl.logicalLayer.FrontendException">org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. java.net.URISyntaxException: Illegal character in scheme name at index 4: test_glob1.avro,file:
                  at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1670)
                  at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1608)
                  at org.apache.pig.PigServer.registerQuery(PigServer.java:565)
                  at org.apache.pig.PigServer.registerQuery(PigServer.java:578)
                  at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.testAvroStorage(TestAvroStorage.java:1058)
                  at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.testAvroStorage(TestAvroStorage.java:1051)
                  at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.testComma1(TestAvroStorage.java:1020)
          Caused by: Failed to parse: java.net.URISyntaxException: Illegal character in scheme name at index 4: test_glob1.avro,file:
                  at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
                  at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1661)
          Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 4: test_glob1.avro,file:
                  at org.apache.hadoop.fs.Path.initialize(Path.java:148)
                  at org.apache.hadoop.fs.Path.&lt;init&gt;(Path.java:126)
                  at org.apache.hadoop.fs.Path.&lt;init&gt;(Path.java:50)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1084)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087)
                  at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1023)
                  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987)
                  at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.getAllSubDirs(AvroStorageUtils.java:120)
                  at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:387)
                  at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:174)
                  at org.apache.pig.newplan.logical.relational.LOLoad.&lt;init&gt;(LOLoad.java:88)
                  at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:856)
                  at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3256)
                  at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1335)
                  at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:819)
                  at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:537)
                  at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:412)
                  at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:181)
          Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 4: test_glob1.avro,file:
                  at java.net.URI$Parser.fail(URI.java:2809)
                  at java.net.URI$Parser.checkChars(URI.java:2982)
                  at java.net.URI$Parser.parse(URI.java:3009)
                  at java.net.URI.&lt;init&gt;(URI.java:736)
                  at org.apache.hadoop.fs.Path.initialize(Path.java:145)
          
          Show
          Johnny Zhang added a comment - Rohini Palaniswamy , could you please review the latest patch https://issues.apache.org/jira/secure/attachment/12581645/PIG-3223.patch.txt ? new added test cases in TestAvroStorage is also clean. Please let me know any concern regarding to the implementation, I will revised it as soon as possible! Let me know if you want me post another patch for 0.11 branch too! I also tried Viraj's patch ' PIG-3223 .viraj.txt', it not clean on trunk patching file contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java patching file contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java Hunk #2 succeeded at 45 (offset 1 line). Hunk #3 succeeded at 72 (offset 1 line). Hunk #4 succeeded at 91 with fuzz 1 (offset 1 line). Hunk #5 succeeded at 1005 (offset 19 lines). also the TestAvroStorage test failed <error message="Error during parsing. java.net.URISyntaxException: Illegal character in scheme name at index 4: test_glob1.avro,file:" type="org.apache.pig.impl.logicalLayer.FrontendException">org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. java.net.URISyntaxException: Illegal character in scheme name at index 4: test_glob1.avro,file: at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1670) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1608) at org.apache.pig.PigServer.registerQuery(PigServer.java:565) at org.apache.pig.PigServer.registerQuery(PigServer.java:578) at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.testAvroStorage(TestAvroStorage.java:1058) at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.testAvroStorage(TestAvroStorage.java:1051) at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.testComma1(TestAvroStorage.java:1020) Caused by: Failed to parse: java.net.URISyntaxException: Illegal character in scheme name at index 4: test_glob1.avro,file: at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1661) Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 4: test_glob1.avro,file: at org.apache.hadoop.fs.Path.initialize(Path.java:148) at org.apache.hadoop.fs.Path.&lt;init&gt;(Path.java:126) at org.apache.hadoop.fs.Path.&lt;init&gt;(Path.java:50) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1084) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:1087) at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1023) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987) at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.getAllSubDirs(AvroStorageUtils.java:120) at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:387) at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:174) at org.apache.pig.newplan.logical.relational.LOLoad.&lt;init&gt;(LOLoad.java:88) at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:856) at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3256) at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1335) at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:819) at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:537) at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:412) at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:181) Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 4: test_glob1.avro,file: at java.net.URI$Parser.fail(URI.java:2809) at java.net.URI$Parser.checkChars(URI.java:2982) at java.net.URI$Parser.parse(URI.java:3009) at java.net.URI.&lt;init&gt;(URI.java:736) at org.apache.hadoop.fs.Path.initialize(Path.java:145)
          Hide
          Johnny Zhang added a comment -

          post the revised patch based on Rohini's comments, I also run TestAvroStorage and it pass for me. Thanks.

          Show
          Johnny Zhang added a comment - post the revised patch based on Rohini's comments, I also run TestAvroStorage and it pass for me. Thanks.
          Hide
          Viraj Bhat added a comment -

          Updated patch from Viraj

          Show
          Viraj Bhat added a comment - Updated patch from Viraj
          Hide
          Viraj Bhat added a comment -

          Sorry please assign this Jira to yourself and continue. Since I already had a patch and I thought of uploading it.. Somehow could not get an option to attach the patch hence I changed the owner.
          Viraj

          Show
          Viraj Bhat added a comment - Sorry please assign this Jira to yourself and continue. Since I already had a patch and I thought of uploading it.. Somehow could not get an option to attach the patch hence I changed the owner. Viraj
          Hide
          Johnny Zhang added a comment -

          @Rohini and Viraj Bhat, thanks for your comments. I appreciate. Is that OK to still keep this jira in my queue (I notice the sssignee changed) so that I can deliver my contribution completely? I am actually working on revising the patch at this moment Thanks.

          Show
          Johnny Zhang added a comment - @Rohini and Viraj Bhat , thanks for your comments. I appreciate. Is that OK to still keep this jira in my queue (I notice the sssignee changed) so that I can deliver my contribution completely? I am actually working on revising the patch at this moment Thanks.
          Hide
          Rohini Palaniswamy added a comment -

          I think we should get this patch into 0.11 also to benefit current users as trunk has a rewrite of AvroStorage. Without comma separated paths, it is not possible to use it with oozie if one is consuming data from more than one directory.

          Show
          Rohini Palaniswamy added a comment - I think we should get this patch into 0.11 also to benefit current users as trunk has a rewrite of AvroStorage. Without comma separated paths, it is not possible to use it with oozie if one is consuming data from more than one directory.
          Hide
          Johnny Zhang added a comment -

          Viraj Bhat, thanks for your comments & patch, my previous patch got some comments from Rohini on the review board, I am going to update a revised on later today. I will review your patch as well. Thanks!

          Show
          Johnny Zhang added a comment - Viraj Bhat , thanks for your comments & patch, my previous patch got some comments from Rohini on the review board, I am going to update a revised on later today. I will review your patch as well. Thanks!
          Hide
          Viraj Bhat added a comment -

          Hi Johnny are you planning to work on this patch. Meanwhile I will post an updated patch, which I got working on Pig 0.11 and trunk.
          Viraj

          Show
          Viraj Bhat added a comment - Hi Johnny are you planning to work on this patch. Meanwhile I will post an updated patch, which I got working on Pig 0.11 and trunk. Viraj
          Hide
          Johnny Zhang added a comment -

          this is the review board link

          Show
          Johnny Zhang added a comment - this is the review board link
          Hide
          Johnny Zhang added a comment -

          Prashant Kommireddi, sorry I don't know why globbing is implemented in AvroStorage this way. Maybe a good idea to follow PIG-3015 to polish AvroStorage implementation.

          Show
          Johnny Zhang added a comment - Prashant Kommireddi , sorry I don't know why globbing is implemented in AvroStorage this way. Maybe a good idea to follow PIG-3015 to polish AvroStorage implementation.
          Hide
          Prashant Kommireddi added a comment -

          Thanks Johnny Zhang.

          I have a question regarding the current approach - why isn't the globbing implemented in PigAvroInputFormat? Overriding listStatus(JobContext job) should be cleaner, unless I am missing something very specific to Avro?

          Show
          Prashant Kommireddi added a comment - Thanks Johnny Zhang . I have a question regarding the current approach - why isn't the globbing implemented in PigAvroInputFormat? Overriding listStatus(JobContext job) should be cleaner, unless I am missing something very specific to Avro?
          Hide
          Johnny Zhang added a comment -

          the latest patch makes AvroStorage working for comma separated input. The patch also adds two test cases for below inputs

          final private String testCommaSeparated1 = getInputFile("test_dir1/test_glob1.avro,test_dir1/test_glob2.avro,test_dir1/test_glob3.avro");
          final private String testCommaSeparated2 = getInputFile("test_dir1/*, test_dir2/test_glob4.avro, test_dir2/test_glob5.avro");
          
          Show
          Johnny Zhang added a comment - the latest patch makes AvroStorage working for comma separated input. The patch also adds two test cases for below inputs final private String testCommaSeparated1 = getInputFile( "test_dir1/test_glob1.avro,test_dir1/test_glob2.avro,test_dir1/test_glob3.avro" ); final private String testCommaSeparated2 = getInputFile( "test_dir1/*, test_dir2/test_glob4.avro, test_dir2/test_glob5.avro" );
          Hide
          Johnny Zhang added a comment -

          Thanks for clarification, I will post patch soon.

          Show
          Johnny Zhang added a comment - Thanks for clarification, I will post patch soon.
          Hide
          Cheolsoo Park added a comment -

          Michael Kramer, thanks for the explanation.

          I actually asked an Oozie developer about this, and you're right that Oozie recommends to include the "hdfs://namenode:8020" in the path since the same Oozie server might be used for multiple Hadoop clusters. I agree that comma-separated input paths should be supported by AvroStorage.

          Show
          Cheolsoo Park added a comment - Michael Kramer , thanks for the explanation. I actually asked an Oozie developer about this, and you're right that Oozie recommends to include the "hdfs://namenode:8020" in the path since the same Oozie server might be used for multiple Hadoop clusters. I agree that comma-separated input paths should be supported by AvroStorage.
          Hide
          Michael Kramer added a comment -

          Cheolsoo Park, thanks for getting back to me so quickly!

          We're using variable substitution and input path generation via Oozie Coordinator. We include the hdfs://namenode:8020 at the beginning of our path templates, which I think is pretty standard (e.g. something like <uri-template>${nameNode}/data/</uri-template> ) When Oozie constructs input paths to be passed to the pig script or map reduce job, it enumerates the paths via a comma separated list, something like hdfs://namenode:8020/data/1,hdfs://namenode:8020/data/2. This is how we figured out AvroStorage was breaking in the first place.

          A good coordinator/workflow example that is indicative of the types of workflows we're running can be found in the Oozie source examples: https://github.com/apache/oozie/blob/trunk/examples/src/main/apps/aggregator/coordinator.xml

          Show
          Michael Kramer added a comment - Cheolsoo Park , thanks for getting back to me so quickly! We're using variable substitution and input path generation via Oozie Coordinator. We include the hdfs://namenode:8020 at the beginning of our path templates, which I think is pretty standard (e.g. something like <uri-template>${nameNode}/data/</uri-template> ) When Oozie constructs input paths to be passed to the pig script or map reduce job, it enumerates the paths via a comma separated list, something like hdfs://namenode:8020/data/1,hdfs://namenode:8020/data/2. This is how we figured out AvroStorage was breaking in the first place. A good coordinator/workflow example that is indicative of the types of workflows we're running can be found in the Oozie source examples: https://github.com/apache/oozie/blob/trunk/examples/src/main/apps/aggregator/coordinator.xml
          Hide
          Cheolsoo Park added a comment -

          Michael Kramer, thank you for the clarification.

          You're right:

          1. PigStorage supports commma-separated input paths.
          2. The fully qualified paths such as {hdfs://namenode:8020/testdir1/,hdfs://namenode:8020/testdir2} don't work in AvroStorage.

          I am not against adding the support for comma-separated list like you suggest.

          That said, I don't understand your use case:

          The driving force for us comes from how Oozie constructs input paths. If comma separated paths aren't supported in this way, AvroStorage as is can't be used with Oozie.

          Can you please provide your Oozie workflow and Pig script? Why do input paths need to be fully qualified paths?

          Show
          Cheolsoo Park added a comment - Michael Kramer , thank you for the clarification. You're right: PigStorage supports commma-separated input paths. The fully qualified paths such as {hdfs://namenode:8020/testdir1/,hdfs://namenode:8020/testdir2} don't work in AvroStorage. I am not against adding the support for comma-separated list like you suggest. That said, I don't understand your use case: The driving force for us comes from how Oozie constructs input paths. If comma separated paths aren't supported in this way, AvroStorage as is can't be used with Oozie. Can you please provide your Oozie workflow and Pig script? Why do input paths need to be fully qualified paths?
          Hide
          Michael Kramer added a comment -

          Furthermore, the driving force for us comes from how Oozie constructs input paths. If comma separated paths aren't supported in this way, AvroStorage as is can't be used with Oozie.

          Show
          Michael Kramer added a comment - Furthermore, the driving force for us comes from how Oozie constructs input paths. If comma separated paths aren't supported in this way, AvroStorage as is can't be used with Oozie.
          Hide
          Michael Kramer added a comment -

          If the goal here is to mirror the functionality of PigStorage, then I think we'd want to support the same variations in input path formatting.

          Show
          Michael Kramer added a comment - If the goal here is to mirror the functionality of PigStorage, then I think we'd want to support the same variations in input path formatting.
          Hide
          Michael Kramer added a comment -

          The problem is when we're dealing with comma separated paths that aren't enclosed by {}. Globs put restrictions on fully qualified hdfs paths. If we're dealing with non-globbed input paths, which PigStorage does handle, this function breaks, i.e if I passed testdir1/,testdir2/ as input, it would fail. It also fails with something like

          {hdfs://namenode:8020/testdir1/,hdfs://namenode:8020/testdir2}

          /*

          Show
          Michael Kramer added a comment - The problem is when we're dealing with comma separated paths that aren't enclosed by {}. Globs put restrictions on fully qualified hdfs paths. If we're dealing with non-globbed input paths, which PigStorage does handle, this function breaks, i.e if I passed testdir1/,testdir2/ as input, it would fail. It also fails with something like {hdfs://namenode:8020/testdir1/,hdfs://namenode:8020/testdir2} /*
          Hide
          Johnny Zhang added a comment -

          thanks for the comments, Cheolsoo Park, I think if this is the case, we can close the jira. Michael Kramer, please let us know if you have any other concern.

          Show
          Johnny Zhang added a comment - thanks for the comments, Cheolsoo Park , I think if this is the case, we can close the jira. Michael Kramer , please let us know if you have any other concern.
          Hide
          Cheolsoo Park added a comment -

          Johnny is correct. Comma-separated lists must be surrounded by "{ }".

          Johnny Zhang, in fact, this is covered by the following:

          test_dir1/test_glob{1,2,3}.avro
          test_dir1/test_glob{3,2,1}.avro
          {test_dir1,test_dir2}/test_glob*.avro
          {test_dir2,test_dir1}/test_glob*.avro
          test_dir{1,2}/file_that_does_not_exist*.avro
          

          These test cases go through the same code path as your test case does (test_dir1/{test_glob1.avro,test_glob2.avro,test_glob3.avro}).

          Show
          Cheolsoo Park added a comment - Johnny is correct. Comma-separated lists must be surrounded by "{ }". Johnny Zhang , in fact, this is covered by the following: test_dir1/test_glob{1,2,3}.avro test_dir1/test_glob{3,2,1}.avro {test_dir1,test_dir2}/test_glob*.avro {test_dir2,test_dir1}/test_glob*.avro test_dir{1,2}/file_that_does_not_exist*.avro These test cases go through the same code path as your test case does (test_dir1/{test_glob1.avro,test_glob2.avro,test_glob3.avro}).
          Hide
          Johnny Zhang added a comment -

          Michael Kramer, my apologize for late reply. Actually I found you properly can load comma separated avro file when using globe wrap it. like

          in = LOAD 'test_dir1/

          {test_glob1.avro,test_glob2.avro,test_glob3.avro}

          ' USING org.apache.pig.piggybank.storage.avro.AvroStorage ();

          Does this satisfy your requirement? If this is the case, there is no code change required.
          There is no test cover it right now, I just added one more test case to TestAvroStorage to make sure it won't break in the future.

          Show
          Johnny Zhang added a comment - Michael Kramer , my apologize for late reply. Actually I found you properly can load comma separated avro file when using globe wrap it. like in = LOAD 'test_dir1/ {test_glob1.avro,test_glob2.avro,test_glob3.avro} ' USING org.apache.pig.piggybank.storage.avro.AvroStorage (); Does this satisfy your requirement? If this is the case, there is no code change required. There is no test cover it right now, I just added one more test case to TestAvroStorage to make sure it won't break in the future.
          Hide
          Michael Kramer added a comment -

          It appears by returning the boolean status of the recursed "getSubDirs" method always results in "false". I'm attaching a patch that reverts back to the original logic of discarding the recursive boolean statuses and just returning true if the first entrance into the method succeeds.

          Show
          Michael Kramer added a comment - It appears by returning the boolean status of the recursed "getSubDirs" method always results in "false". I'm attaching a patch that reverts back to the original logic of discarding the recursive boolean statuses and just returning true if the first entrance into the method succeeds.
          Hide
          Michael Kramer added a comment -

          Johnny ZhangThat's great Johnny. I'm attaching the patch now. I've modified the getSubDirs method call in AvroStorageUtils and added a rather ugly isGlob helper function. I imagine there's a better filesystem api call that already takes care of these issues.

          Show
          Michael Kramer added a comment - Johnny Zhang That's great Johnny. I'm attaching the patch now. I've modified the getSubDirs method call in AvroStorageUtils and added a rather ugly isGlob helper function. I imagine there's a better filesystem api call that already takes care of these issues.
          Hide
          Johnny Zhang added a comment -

          Michael Kramer, I am willing to continue working on this item. Seems very interesting. Do you mind post your current patch?

          Show
          Johnny Zhang added a comment - Michael Kramer , I am willing to continue working on this item. Seems very interesting. Do you mind post your current patch?

            People

            • Assignee:
              Johnny Zhang
              Reporter:
              Michael Kramer
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development