Pig
  1. Pig
  2. PIG-2492

AvroStorage should recognize globs and commas

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.1, 0.10.0
    • Fix Version/s: 0.11
    • Component/s: piggybank
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      I've patched AvroStorage and AvroStorageUtils to support the same file input syntax as currently supported
      by hadoop's FileInputFormat. Specifically, globs and commas are supported.

      Somebody should write some unit tests for theses changes; I am currently pressed for time.

      1. AvroStorage.patch
        3 kB
        Stan Rosenberg
      2. AvroStorageUtils.patch
        2 kB
        Stan Rosenberg
      3. PIG-2492.patch
        10 kB
        Cheolsoo Park
      4. avro_test_files.tar.gz
        11 kB
        Cheolsoo Park
      5. avro_test_files-2.tar.gz
        11 kB
        Cheolsoo Park
      6. PIG-2492-2.patch
        15 kB
        Cheolsoo Park
      7. PIG-2492-3.patch
        16 kB
        Cheolsoo Park
      8. PIG-2492-4.patch
        15 kB
        Cheolsoo Park

        Activity

        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Patch Available Patch Available
        171d 7h 52m 1 Cheolsoo Park 15/Jul/12 03:44
        Patch Available Patch Available Resolved Resolved
        6d 21h 48m 1 Santhosh Srinivasan 22/Jul/12 01:32
        Resolved Resolved Closed Closed
        215d 4h 20m 1 Bill Graham 22/Feb/13 04:53
        Bill Graham made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hide
        Bart Verwilst added a comment -

        Ok, I was using the piggybank.jar from 0.10, i compiled piggybank.jar from 0.11, and now it works! So disregard my previous comment!

        Show
        Bart Verwilst added a comment - Ok, I was using the piggybank.jar from 0.10, i compiled piggybank.jar from 0.11, and now it works! So disregard my previous comment!
        Hide
        Bart Verwilst added a comment -

        I've compiled 0.11 from the svn branch just now, and get this:

        ~/branch-0.11/build $ pig
        2012-11-26 15:59:39,439 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.0-SNAPSHOT (r1413659) compiled Nov 26 2012, 15:55:29
        grunt> REGISTER 'hdfs:///lib/avro-1.7.2.jar';
        grunt> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar';
        grunt> REGISTER 'hdfs:///lib/piggybank.jar';
        grunt>
        grunt> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
        grunt> avro = load '/test/*' USING AvroStorage();
        grunt> describe avro;
        Schema for avro unknown.
        grunt> avro = load '/test/*.avro' USING AvroStorage();
        grunt> describe avro;
        Schema for avro unknown.
        grunt>
        grunt> avro = load '/test/2012-11-25.avro' USING AvroStorage();
        grunt> describe avro;
        avro: {id: long,timestamp: long,latitude: int,longitude: int,speed: int,heading: int,terminalid: int,customerid: chararray,mileage: int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM: (id: long,value: chararray,pkey: chararray)}}
        grunt> avro = load '/test/2012-11-2

        {5}

        .avro' USING AvroStorage();
        2012-11-26 16:01:07,284 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: null
        Details at logfile: /home/verwilst/branch-0.11/build/pig_1353941979436.log
        grunt> avro = load '/test/2012-11-2?.avro' USING AvroStorage();
        grunt> describe avro;
        Schema for avro unknown.
        grunt> avro = load '/test/2012-11-2*.avro' USING AvroStorage();
        grunt> describe avro;
        Schema for avro unknown.
        grunt>

        ~/branch-0.11/build $ hadoop fs -ls /test/
        Found 1 items
        rw-rr- 3 hdfs supergroup 63140500 2012-11-26 14:13 /test/2012-11-25.avro

        Shouldn't this be working here?

        Show
        Bart Verwilst added a comment - I've compiled 0.11 from the svn branch just now, and get this: ~/branch-0.11/build $ pig 2012-11-26 15:59:39,439 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.0-SNAPSHOT (r1413659) compiled Nov 26 2012, 15:55:29 grunt> REGISTER 'hdfs:///lib/avro-1.7.2.jar'; grunt> REGISTER 'hdfs:///lib/json-simple-1.1.1.jar'; grunt> REGISTER 'hdfs:///lib/piggybank.jar'; grunt> grunt> DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage(); grunt> avro = load '/test/*' USING AvroStorage(); grunt> describe avro; Schema for avro unknown. grunt> avro = load '/test/*.avro' USING AvroStorage(); grunt> describe avro; Schema for avro unknown. grunt> grunt> avro = load '/test/2012-11-25.avro' USING AvroStorage(); grunt> describe avro; avro: {id: long,timestamp: long,latitude: int,longitude: int,speed: int,heading: int,terminalid: int,customerid: chararray,mileage: int,creationtime: long,tracetype: int,traceproperties: {ARRAY_ELEM: (id: long,value: chararray,pkey: chararray)}} grunt> avro = load '/test/2012-11-2 {5} .avro' USING AvroStorage(); 2012-11-26 16:01:07,284 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: null Details at logfile: /home/verwilst/branch-0.11/build/pig_1353941979436.log grunt> avro = load '/test/2012-11-2?.avro' USING AvroStorage(); grunt> describe avro; Schema for avro unknown. grunt> avro = load '/test/2012-11-2*.avro' USING AvroStorage(); grunt> describe avro; Schema for avro unknown. grunt> ~/branch-0.11/build $ hadoop fs -ls /test/ Found 1 items rw-r r - 3 hdfs supergroup 63140500 2012-11-26 14:13 /test/2012-11-25.avro Shouldn't this be working here?
        Santhosh Srinivasan made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.11 [ 12318878 ]
        Resolution Fixed [ 1 ]
        Hide
        Santhosh Srinivasan added a comment -

        I have committed the patch. Thanks Cheolsoo!

        Please note that for Hadoop 20, TestDBStorage was failing and for Hadoop 23, TestDBStorage and TestMultiStorage were failing prior to this commit.

        Show
        Santhosh Srinivasan added a comment - I have committed the patch. Thanks Cheolsoo! Please note that for Hadoop 20, TestDBStorage was failing and for Hadoop 23, TestDBStorage and TestMultiStorage were failing prior to this commit.
        Hide
        Cheolsoo Park added a comment -

        Attached PIG-2492-4.patch is the newest patch.

        There is one thing that I'd like to mention although I already discussed it in the review board.

        I changed the type of 1st parameter of AvroStorageUtils.getAllSubDirs() from URI to hadoop.fs.Path. This is needed because '

        {' and '}

        ' are not allowed in URI, so URI.create() throws a URISyntaxException on a glob pattern that contains those characters.

        But these characters are automatically escaped when constructing a Path, so what I did is constructing a Path with the given glob pattern string and getting a URI from that Path by Path.toUri().

        In fact, this reverts some changes made by PIG-2540 (https://issues.apache.org/jira/browse/PIG-2540). However, this does not break S3 support because inside AvroStorageUtils.getAllSubDirs(), file system is still constructed with the given URI, and globStatus() is called on that file system.

        FileSystem fs = FileSystem.get(path.toUri(), job.getConfiguration());
        FileStatus[] matchedFiles = fs.globStatus(path);
        

        So if path is a s3 URI, S3 file system will be used.

        Please let me know if I am wrong. Thanks!

        Show
        Cheolsoo Park added a comment - Attached PIG-2492-4.patch is the newest patch. There is one thing that I'd like to mention although I already discussed it in the review board. I changed the type of 1st parameter of AvroStorageUtils.getAllSubDirs() from URI to hadoop.fs.Path. This is needed because ' {' and '} ' are not allowed in URI, so URI.create() throws a URISyntaxException on a glob pattern that contains those characters. But these characters are automatically escaped when constructing a Path, so what I did is constructing a Path with the given glob pattern string and getting a URI from that Path by Path.toUri(). In fact, this reverts some changes made by PIG-2540 ( https://issues.apache.org/jira/browse/PIG-2540 ). However, this does not break S3 support because inside AvroStorageUtils.getAllSubDirs(), file system is still constructed with the given URI, and globStatus() is called on that file system. FileSystem fs = FileSystem.get(path.toUri(), job.getConfiguration()); FileStatus[] matchedFiles = fs.globStatus(path); So if path is a s3 URI, S3 file system will be used. Please let me know if I am wrong. Thanks!
        Cheolsoo Park made changes -
        Attachment PIG-2492-4.patch [ 12537469 ]
        Cheolsoo Park made changes -
        Attachment PIG-2492-3.patch [ 12537294 ]
        Cheolsoo Park made changes -
        Attachment avro_test_files-2.tar.gz [ 12537110 ]
        Attachment PIG-2492-2.patch [ 12537111 ]
        Hide
        Cheolsoo Park added a comment -
        Show
        Cheolsoo Park added a comment - Review board: https://reviews.apache.org/r/5936/
        Cheolsoo Park made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 0.10.0 [ 12316246 ]
        Cheolsoo Park made changes -
        Assignee Cheolsoo Park [ cheolsoo ]
        Hide
        Cheolsoo Park added a comment -

        Hi,

        I am interested in getting this jira resolved, so I posted a new patch PIG-2492.patch that hopefully addresses concerns expressed here. To summarize, I did the following:

        1) I used functions that hadoop provides instead of implementing my own glob pattern matching. In fact, it was slightly more complicated than what Scott described for two reasons:

        • FileInputFormat.setInputFiles() doesn't find files in sub-directories. But currently, if the path is a directory, AvroStorage recursively loads files in a directory and its sub-directories.
        • AvroStorage needs to know the schema of the files to load, so t is necessary to expand the glob pattern in AvroStorage.

        Nevertheless, I was able to implement glob/comma support using FileSystem.globStatus() and FileInputFormat.setInputFiles() while not changing the current recursive load semantics.

        2) URIs are handled properly because glob patterns are expanded by hadoop that knows how to handle URIs properly.

        3) The glob syntax is the same as what's supported in PigStorage since PigStorage also uses FileInputFormat.setInputFiles() to expand glob patterns. Some examples are as follows:

        test_dir1/*
        test_dir1/test_glob{1,2,3}.avro
        {test_dir1,test_dir2}/test_glob*.avro
        

        4) I assumed that all the files that match the glob pattern have the same schema. In fact, this is the same limitation that we have for loading a directory:

        If the input directory is a leaf directory, then we assume Avro data files in it have the same schema;
        If the input directory contains sub-directoies, then we assume Avro data files in all sub-directories have the same schema.

        https://cwiki.apache.org/PIG/avrostorage.html

        4) I added 4 unit tests to verify the functionality as follow:

        • testDir verifies that AvroStorage recursively loads files in a directory and its sub-directories.
        • testGlob1 to 3 verify that glob patterns are expanded properly.

        In addition to the patch, I uploaded some .avro files avro_test_files.tar.gz that are needed for my tests. To run the tests, please do the following:

        tar -xf avro_test_files.tar.gz
        ant clean compile-test piggybank -Dhadoopversion=20
        cd contrib/piggybank/java
        ant test -Dtestcase=TestAvroStorage
        

        Please let me know what you think.

        Thanks!

        Show
        Cheolsoo Park added a comment - Hi, I am interested in getting this jira resolved, so I posted a new patch PIG-2492.patch that hopefully addresses concerns expressed here. To summarize, I did the following: 1) I used functions that hadoop provides instead of implementing my own glob pattern matching. In fact, it was slightly more complicated than what Scott described for two reasons: FileInputFormat.setInputFiles() doesn't find files in sub-directories. But currently, if the path is a directory, AvroStorage recursively loads files in a directory and its sub-directories. AvroStorage needs to know the schema of the files to load, so t is necessary to expand the glob pattern in AvroStorage. Nevertheless, I was able to implement glob/comma support using FileSystem.globStatus() and FileInputFormat.setInputFiles() while not changing the current recursive load semantics. 2) URIs are handled properly because glob patterns are expanded by hadoop that knows how to handle URIs properly. 3) The glob syntax is the same as what's supported in PigStorage since PigStorage also uses FileInputFormat.setInputFiles() to expand glob patterns. Some examples are as follows: test_dir1/* test_dir1/test_glob{1,2,3}.avro {test_dir1,test_dir2}/test_glob*.avro 4) I assumed that all the files that match the glob pattern have the same schema. In fact, this is the same limitation that we have for loading a directory: If the input directory is a leaf directory, then we assume Avro data files in it have the same schema; If the input directory contains sub-directoies, then we assume Avro data files in all sub-directories have the same schema. https://cwiki.apache.org/PIG/avrostorage.html 4) I added 4 unit tests to verify the functionality as follow: testDir verifies that AvroStorage recursively loads files in a directory and its sub-directories. testGlob1 to 3 verify that glob patterns are expanded properly. In addition to the patch, I uploaded some .avro files avro_test_files.tar.gz that are needed for my tests. To run the tests, please do the following: tar -xf avro_test_files.tar.gz ant clean compile-test piggybank -Dhadoopversion=20 cd contrib/piggybank/java ant test -Dtestcase=TestAvroStorage Please let me know what you think. Thanks!
        Cheolsoo Park made changes -
        Attachment avro_test_files.tar.gz [ 12536534 ]
        Cheolsoo Park made changes -
        Attachment PIG-2492.patch [ 12536533 ]
        Hide
        Fabian Alenius added a comment -

        I just confirmed that the patch doesn't apply on trunk. We are already reliant on glob support so we would like to see this issue resolved soon. If someone with insight can clarify how this should be resolved, I can help out.

        Show
        Fabian Alenius added a comment - I just confirmed that the patch doesn't apply on trunk. We are already reliant on glob support so we would like to see this issue resolved soon. If someone with insight can clarify how this should be resolved, I can help out.
        Hide
        Scott Carey added a comment -

        Something seems way off here.

        I have a custom LoadFunc for Avro (very different feature set, built before AvroStorage).

        It has worked with globs since the beginning, with only this:

          @Override
          public void setLocation(String location, Job job) throws IOException {
            FileInputFormat.setInputPaths(job, location);
          }
        

        This is much, much simpler.

        This also solves the "only works with *.avro" file issue. But it changes the syntax you would need in the LOAD statement.

        In my scripts, If I might do something like

        A = LOAD '/events/2012/03/23/

        {views,clicks}

        /*.avro' using MyCustomStorageFunc();

        In other words, use the glob and the well tested FileInputFormat to find files, don't write it in your LoadFunc.

        Show
        Scott Carey added a comment - Something seems way off here. I have a custom LoadFunc for Avro (very different feature set, built before AvroStorage). It has worked with globs since the beginning, with only this: @Override public void setLocation( String location, Job job) throws IOException { FileInputFormat.setInputPaths(job, location); } This is much, much simpler. This also solves the "only works with *.avro" file issue. But it changes the syntax you would need in the LOAD statement. In my scripts, If I might do something like A = LOAD '/events/2012/03/23/ {views,clicks} /*.avro' using MyCustomStorageFunc(); In other words, use the glob and the well tested FileInputFormat to find files, don't write it in your LoadFunc.
        Hide
        Russell Jurney added a comment -

        2540 is tested, but have to patch against the latest today.

        This patch caused me a world of hurt on EMR/S3. I had applied it.

        Show
        Russell Jurney added a comment - 2540 is tested, but have to patch against the latest today. This patch caused me a world of hurt on EMR/S3. I had applied it.
        Hide
        Alex Rovner added a comment -

        Russel – How can this patch cause PIG-2540 if PIG-2540 was reported in prior versions?

        Would we need to create a patch after applying the 2540 patch? What is the status with the 2540? Seems like the patch needs to be regenerated based on the latest code base.

        Show
        Alex Rovner added a comment - Russel – How can this patch cause PIG-2540 if PIG-2540 was reported in prior versions? Would we need to create a patch after applying the 2540 patch? What is the status with the 2540? Seems like the patch needs to be regenerated based on the latest code base.
        Hide
        Russell Jurney added a comment -

        My main objection is that this patch causes PIG-2540.

        Regarding your work on AvroStorage, please note lots of other avro patches going in soon. Don't wanna duplicate efforts.

        https://issues.apache.org/jira/browse/PIG-2540 need to upload a patch and tests tonight.
        https://issues.apache.org/jira/browse/PIG-2505 needs tests
        https://issues.apache.org/jira/browse/PIG-2411 needs tests

        https://issues.apache.org/jira/browse/PIG-2527 would be wonderful, but can wait.

        Show
        Russell Jurney added a comment - My main objection is that this patch causes PIG-2540 . Regarding your work on AvroStorage, please note lots of other avro patches going in soon. Don't wanna duplicate efforts. https://issues.apache.org/jira/browse/PIG-2540 need to upload a patch and tests tonight. https://issues.apache.org/jira/browse/PIG-2505 needs tests https://issues.apache.org/jira/browse/PIG-2411 needs tests https://issues.apache.org/jira/browse/PIG-2527 would be wonderful, but can wait.
        Hide
        Stan Rosenberg added a comment -

        I agree that there needs to be a uniform way of parsing locations of load/store UDFs. I am not entirely convinced that this functionality should be in a super class; some UDFs may have an entirely different notion of locations, e.g., a DB table.

        I am currently in the process of patching a good chunk of AvroStorage; e.g., it now supports reading avro files with different schemas so long that a meaningful union schema can be derived. Would anyone be willing to help write/run some unit tests?

        Show
        Stan Rosenberg added a comment - I agree that there needs to be a uniform way of parsing locations of load/store UDFs. I am not entirely convinced that this functionality should be in a super class; some UDFs may have an entirely different notion of locations, e.g., a DB table. I am currently in the process of patching a good chunk of AvroStorage; e.g., it now supports reading avro files with different schemas so long that a meaningful union schema can be derived. Would anyone be willing to help write/run some unit tests?
        Hide
        Russell Jurney added a comment -

        This patch should be done in a superclass, as it is not specific to AvroStorage. Also, as it is a lengthy patch, it needs thorough unit tests. There are bugs in this patch, for instance it is not possible to use URIs for filesystems using the AvroStorageUtils.glob() method.

        Needs more work, not gonna fix it up for 0.10.

        See PIG-2538

        Show
        Russell Jurney added a comment - This patch should be done in a superclass, as it is not specific to AvroStorage. Also, as it is a lengthy patch, it needs thorough unit tests. There are bugs in this patch, for instance it is not possible to use URIs for filesystems using the AvroStorageUtils.glob() method. Needs more work, not gonna fix it up for 0.10. See PIG-2538
        Hide
        Stan Rosenberg added a comment -

        I didn't do anything about the suffix; I merely patched the code to support globs. What's the rationale for not using
        the .avro suffix? The best approach imho is to admit any file suffix and read the magic header instead to detect if it's
        an actual avro file.

        Show
        Stan Rosenberg added a comment - I didn't do anything about the suffix; I merely patched the code to support globs. What's the rationale for not using the .avro suffix? The best approach imho is to admit any file suffix and read the magic header instead to detect if it's an actual avro file.
        Hide
        Russell Jurney added a comment -

        This does not work for me - at least, I am still unable to load avro files that do not explicitly end in .avro.

        Show
        Russell Jurney added a comment - This does not work for me - at least, I am still unable to load avro files that do not explicitly end in .avro.
        Stan Rosenberg made changes -
        Description I've patched AvroStorage and AvroStorageUtils to support the same file input syntax as currently supported
        by hadoop's FileInputFormat. Specifically, globs and commas are supported.
        I've patched AvroStorage and AvroStorageUtils to support the same file input syntax as currently supported
        by hadoop's FileInputFormat. Specifically, globs and commas are supported.

        Somebody should write some unit tests for theses changes; I am currently pressed for time.
        Stan Rosenberg made changes -
        Field Original Value New Value
        Attachment AvroStorage.patch [ 12511865 ]
        Attachment AvroStorageUtils.patch [ 12511866 ]
        Stan Rosenberg created issue -

          People

          • Assignee:
            Cheolsoo Park
            Reporter:
            Stan Rosenberg
          • Votes:
            3 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development