I am interested in getting this jira resolved, so I posted a new patch, PIG-2492.patch, that I hope addresses the concerns raised here. To summarize, I did the following:
1) I used functions that Hadoop provides instead of implementing my own glob pattern matching. This turned out to be slightly more complicated than what Scott described, for two reasons:
- FileInputFormat.setInputPaths() doesn't find files in sub-directories, whereas AvroStorage currently loads files recursively from a directory and its sub-directories when the path is a directory.
- AvroStorage needs to know the schema of the files to load, so it is necessary to expand the glob pattern in AvroStorage itself.
Nevertheless, I was able to implement glob/comma support using FileSystem.globStatus() and FileInputFormat.setInputPaths() without changing the current recursive load semantics.
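To make the "expand the glob first, then recurse into whatever directories matched" behavior in 1) concrete, here is a small stand-alone sketch. It is not the code in the patch: it uses java.nio.file on the local filesystem as a stand-in for Hadoop's FileSystem.globStatus(), Java's glob syntax differs from Hadoop's in some details, and the class name GlobExpandSketch and the method expand() are made up for illustration. Comma-separated lists are not handled here.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not part of AvroStorage or Hadoop.
class GlobExpandSketch {

    /**
     * Returns every regular file selected by the glob, relative to base.
     * A matched file is returned as-is; a matched directory contributes
     * all regular files beneath it, at any depth.
     */
    static List<Path> expand(Path base, String glob) throws IOException {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);

        // Step 1: collect the paths under base whose relative path matches the glob.
        List<Path> matched;
        try (Stream<Path> walk = Files.walk(base)) {
            matched = walk
                .filter(p -> !p.equals(base))
                .filter(p -> matcher.matches(base.relativize(p)))
                .collect(Collectors.toList());
        }

        // Step 2: keep matched files, and descend into matched directories.
        Set<Path> result = new TreeSet<>(); // sorted, no duplicates
        for (Path p : matched) {
            if (Files.isDirectory(p)) {
                try (Stream<Path> sub = Files.walk(p)) {
                    sub.filter(Files::isRegularFile).forEach(result::add);
                }
            } else if (Files.isRegularFile(p)) {
                result.add(p);
            }
        }
        return new ArrayList<>(result);
    }
}
```

In this sketch, a pattern such as `2012-*` that matches two directories yields every file under both of them, including files in nested sub-directories, which mirrors the recursive load semantics that a plain directory path already has.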
2) URIs are handled properly because the glob patterns are expanded by Hadoop, which already knows how to handle URIs.
3) The glob syntax is the same as what PigStorage supports, since PigStorage also relies on FileInputFormat.setInputPaths() to expand glob patterns. Some examples are as follows:
4) I assumed that all the files that match the glob pattern have the same schema. In fact, this is the same limitation that we have for loading a directory:
If the input directory is a leaf directory, then we assume Avro data files in it have the same schema;
If the input directory contains sub-directories, then we assume Avro data files in all sub-directories have the same schema.
5) I added 4 unit tests to verify the functionality, as follows:
- testDir verifies that AvroStorage recursively loads files in a directory and its sub-directories.
- testGlob1 through testGlob3 verify that glob patterns are expanded properly.
In addition to the patch, I uploaded some .avro files (avro_test_files.tar.gz) that are needed for my tests. To run the tests, please do the following:
tar -xf avro_test_files.tar.gz
ant clean compile-test piggybank -Dhadoopversion=20
ant test -Dtestcase=TestAvroStorage
Please let me know what you think.