Pig
  1. Pig
  2. PIG-1890

Fix piggybank unit test TestAvroStorage

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.0, 0.10.0
    • Component/s: impl
    • Labels:
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Fixed AvroStorage unit tests.

      Description

      TestAvroStorage fail on trunk. There are two reasons:
      1. After PIG-1680, we call LoadFunc.setLocation one more time.
      2. The schema for AvroStorage seems to be wrong. For example, in first test case testArrayDefault, the schema for "in" is set to "PIG_WRAPPER: (FIELD:

      {PIG_WRAPPER: (ARRAY_ELEM: float)}

      )". It seems PIG_WRAPPER is redundant. This issue is hidden until PIG-1188 checked in.

      1. PIG-1890-4.patch
        19 kB
        Ken Goodhope
      2. pig_setloc_avro.txt
        16 kB
        Patrick Hunt
      3. PIG-1890-3.patch
        17 kB
        Ken Goodhope
      4. PIG-1890-2.patch
        15 kB
        Ken Goodhope
      5. PIG-1890-1.patch
        2 kB
        Daniel Dai

        Activity

        Hide
        Daniel Dai added a comment -

        PIG-1890-1.patch fix the first issue. I temporary comment out all test cases in TestAvroStorage.

        Show
        Daniel Dai added a comment - PIG-1890 -1.patch fix the first issue. I temporary comment out all test cases in TestAvroStorage.
        Hide
        Olga Natkovich added a comment -

        Hi Jacob,

        Are you planning to address the additional issue for 0.9 or should we delay this?

        Show
        Olga Natkovich added a comment - Hi Jacob, Are you planning to address the additional issue for 0.9 or should we delay this?
        Hide
        Ken Goodhope added a comment -

        I have been working on some fixes to AvroStorage already. I should be able to make sure this issue gets addressed in those fixes as will. Will have it done sometime this week.

        Show
        Ken Goodhope added a comment - I have been working on some fixes to AvroStorage already. I should be able to make sure this issue gets addressed in those fixes as will. Will have it done sometime this week.
        Hide
        Jakob Homan added a comment -

        @Ken - any update now that we're in a new week?

        Show
        Jakob Homan added a comment - @Ken - any update now that we're in a new week?
        Hide
        Ken Goodhope added a comment -

        For testArrayDefault, we are attempting to return an entire avro array, which is consistent with the schema. The result is tuple with one column, a bag of floats". In POProject.getNext(Tuple), tuples with one column have their single column extracted, cast to a tuple, and then returned. Obviously in this case, this results in trying to cast the bag of floats into a tuple and an exception being thrown.

        Does anyone know why this is being done in POProject?

        Show
        Ken Goodhope added a comment - For testArrayDefault, we are attempting to return an entire avro array, which is consistent with the schema. The result is tuple with one column, a bag of floats". In POProject.getNext(Tuple), tuples with one column have their single column extracted, cast to a tuple, and then returned. Obviously in this case, this results in trying to cast the bag of floats into a tuple and an exception being thrown. Does anyone know why this is being done in POProject?
        Hide
        Daniel Dai added a comment -

        Seems it should call POProject.getNext(DataBag) instead. Project one item assumes this item already has the correct type and need not convert. The issue should be caused by plan generation, which results a wrong result type for POProject.

        Show
        Daniel Dai added a comment - Seems it should call POProject.getNext(DataBag) instead. Project one item assumes this item already has the correct type and need not convert. The issue should be caused by plan generation, which results a wrong result type for POProject.
        Hide
        Ken Goodhope added a comment -

        Right now, in this test, AvroStorage is attempting to pass back a single array of floats with one call to next. To be consistent with intent of how the data is stored we want this array returned as a single unit(databag) with each foreach call. In other words we don't want foreach to return each element of that array one at a time. If I am understanding the code right, it appears that is what it is trying to do. Am I missing something? Is there a way to control this behavior?

        Show
        Ken Goodhope added a comment - Right now, in this test, AvroStorage is attempting to pass back a single array of floats with one call to next. To be consistent with intent of how the data is stored we want this array returned as a single unit(databag) with each foreach call. In other words we don't want foreach to return each element of that array one at a time. If I am understanding the code right, it appears that is what it is trying to do. Am I missing something? Is there a way to control this behavior?
        Hide
        Ken Goodhope added a comment -

        I need some clarification on the contract for POProject.getNext(Tuple). Right now, if it receives a tuple with a single element, it extracts that element and attempts to cast it as a tuple and return it. This breaks with any single element tuple that where the single element is not a tuple. The code could be modified to not extract non-tuple elements.

        Show
        Ken Goodhope added a comment - I need some clarification on the contract for POProject.getNext(Tuple). Right now, if it receives a tuple with a single element, it extracts that element and attempts to cast it as a tuple and return it. This breaks with any single element tuple that where the single element is not a tuple. The code could be modified to not extract non-tuple elements.
        Hide
        Olga Natkovich added a comment -

        About to cut the release

        Show
        Olga Natkovich added a comment - About to cut the release
        Hide
        Ken Goodhope added a comment -

        The fix for this jira involves two parts, making setLocation idempotent, and a fix in POProject. I have added a jira for POProject issue PIG-2153. I will try and get a patch for the setLocation issue added this weekend. I have made some other changes to the version of AvroStorage we are using at LinkedIn and want to seperate those changes from any patch I submit for this.

        Show
        Ken Goodhope added a comment - The fix for this jira involves two parts, making setLocation idempotent, and a fix in POProject. I have added a jira for POProject issue PIG-2153 . I will try and get a patch for the setLocation issue added this weekend. I have made some other changes to the version of AvroStorage we are using at LinkedIn and want to seperate those changes from any patch I submit for this.
        Hide
        Ken Goodhope added a comment -

        Attached patch. Only works if PIG-2153 is fixed. Until then the unit tests still break. This patch fixes setLocation.

        Show
        Ken Goodhope added a comment - Attached patch. Only works if PIG-2153 is fixed. Until then the unit tests still break. This patch fixes setLocation.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Marked PIG-2153 as a blocker to this.

        I have a feeling that ticket is also blocking EB issue 60 https://github.com/kevinweil/elephant-bird/issues/60

        Show
        Dmitriy V. Ryaboy added a comment - Marked PIG-2153 as a blocker to this. I have a feeling that ticket is also blocking EB issue 60 https://github.com/kevinweil/elephant-bird/issues/60
        Hide
        Patrick Hunt added a comment -

        Hi, I'm seeing an issue with both versions of the attached patches when I run the following:

        REGISTER avro-1.4.1.jar; 
        REGISTER json-simple-1.1.jar; 
        REGISTER piggybank.jar;
        
        A = LOAD 'input_123.avro' USING 
        org.apache.pig.piggybank.storage.avro.AvroStorage();
        
        B = LOAD 'input_789.avro' USING 
        org.apache.pig.piggybank.storage.avro.AvroStorage();
        
        C = UNION A, B; 
        DUMP C;
        

        where each file contains a single tuple; input_123.avro contains "1,2,3" (ints) and input_789.avro contains "7,8,9"
        Dump C should be returning 2 tuples; 1 tuple 1,2,3 and 1 tuple 7,8,9.

        Without the patch I see 6 tuples output (3 1,2,3 and 3 7,8,9)
        With either of the proposed patches applied I see 4 tuples output (2 1,2,3 and 2 7,8,9)

        From looking at other pig loader functions it seems like the following would address the setLocation issue:

             public void setLocation(String location, Job job) throws IOException {
        -        if(AvroStorageUtils.addInputPaths(location, job) && inputAvroSchema == null) {
        -            inputAvroSchema = getAvroSchema(location, job);
        -        }
        +        FileInputFormat.setInputPaths(job, location);
        +        inputAvroSchema = getAvroSchema(location, job);
             }
        

        This does resolve the issue for the script I described. However the "addInputPaths" functionality of AvroStorageUtils is lost - but I'm wondering why this was added rather than just rely on the std capabilities of LOAD? (such as globbing).

        I'd be happy to package up my suggestion as a patch if there's interest.

        Show
        Patrick Hunt added a comment - Hi, I'm seeing an issue with both versions of the attached patches when I run the following: REGISTER avro-1.4.1.jar; REGISTER json-simple-1.1.jar; REGISTER piggybank.jar; A = LOAD 'input_123.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage(); B = LOAD 'input_789.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage(); C = UNION A, B; DUMP C; where each file contains a single tuple; input_123.avro contains "1,2,3" (ints) and input_789.avro contains "7,8,9" Dump C should be returning 2 tuples; 1 tuple 1,2,3 and 1 tuple 7,8,9. Without the patch I see 6 tuples output (3 1,2,3 and 3 7,8,9) With either of the proposed patches applied I see 4 tuples output (2 1,2,3 and 2 7,8,9) From looking at other pig loader functions it seems like the following would address the setLocation issue: public void setLocation(String location, Job job) throws IOException { - if(AvroStorageUtils.addInputPaths(location, job) && inputAvroSchema == null) { - inputAvroSchema = getAvroSchema(location, job); - } + FileInputFormat.setInputPaths(job, location); + inputAvroSchema = getAvroSchema(location, job); } This does resolve the issue for the script I described. However the "addInputPaths" functionality of AvroStorageUtils is lost - but I'm wondering why this was added rather than just rely on the std capabilities of LOAD? (such as globbing). I'd be happy to package up my suggestion as a patch if there's interest.
        Hide
        Ken Goodhope added a comment -

        Hi Patrick, for our purposes we need setLocation to add all sub-directories, including directories more than 2 levels deep. A common use case for us to to have directories organized by time, yyyy/MM/dd/hh/mm. In that case if you want to load all the data from a particular month, then you need to add all the subdirs. Your right that a UNION can accomplish this, but it can be painful to add the directories that way. I will take a look at why this is still breaking in your case.

        Show
        Ken Goodhope added a comment - Hi Patrick, for our purposes we need setLocation to add all sub-directories, including directories more than 2 levels deep. A common use case for us to to have directories organized by time, yyyy/MM/dd/hh/mm. In that case if you want to load all the data from a particular month, then you need to add all the subdirs. Your right that a UNION can accomplish this, but it can be painful to add the directories that way. I will take a look at why this is still breaking in your case.
        Hide
        Mads Moeller added a comment -

        Hi Ken,

        I am have the same use case as you and encountering the same behavior as Patrick. I made a few modifications to the methods "addInputPaths" and "addAllSubDirs" from your patch, which seems to solve the UNION issue.

            public static boolean addInputPaths(String pathString, Job job)
                throws IOException {
        
                Set<Path> pathSet = new HashSet<Path>();
                
                if (addAllSubDirs(new Path(pathString), job, pathSet)) {
                    Path[] paths = pathSet.toArray(new Path[pathSet.size()]);
         
                    return true;
                }
                return false;
            }
        
            /**
             * Adds all non-hidden directories and subdirectories to the paths set
             * 
             * @throws IOException
             */
          	private static boolean addAllSubDirs(Path path, Job job, Set<Path> paths) throws IOException {
          		FileSystem fs = FileSystem.get(job.getConfiguration());
        
          		if (PATH_FILTER.accept(path)) {
          			try {
          				FileStatus file = fs.getFileStatus(path);
          				if (file.isDir()) {
          					for (FileStatus sub : fs.listStatus(path)) {
          						addAllSubDirs(sub.getPath(), job, paths);
          					}
          				} else {
          					AvroStorageLog.details("Add input file:" + file);
          					paths.add(file.getPath());
          				}
          			} catch (FileNotFoundException e) {
          				AvroStorageLog.details("Input path does not exist: " + path);
          				return false;
          			}
          			return true;
          		}
          		return false;
          	}
        
        Show
        Mads Moeller added a comment - Hi Ken, I am have the same use case as you and encountering the same behavior as Patrick. I made a few modifications to the methods "addInputPaths" and "addAllSubDirs" from your patch, which seems to solve the UNION issue. public static boolean addInputPaths( String pathString, Job job) throws IOException { Set<Path> pathSet = new HashSet<Path>(); if (addAllSubDirs( new Path(pathString), job, pathSet)) { Path[] paths = pathSet.toArray( new Path[pathSet.size()]); return true ; } return false ; } /** * Adds all non-hidden directories and subdirectories to the paths set * * @ throws IOException */ private static boolean addAllSubDirs(Path path, Job job, Set<Path> paths) throws IOException { FileSystem fs = FileSystem.get(job.getConfiguration()); if (PATH_FILTER.accept(path)) { try { FileStatus file = fs.getFileStatus(path); if (file.isDir()) { for (FileStatus sub : fs.listStatus(path)) { addAllSubDirs(sub.getPath(), job, paths); } } else { AvroStorageLog.details( "Add input file:" + file); paths.add(file.getPath()); } } catch (FileNotFoundException e) { AvroStorageLog.details( "Input path does not exist: " + path); return false ; } return true ; } return false ; }
        Hide
        Mads Moeller added a comment -

        Re-pasting addInputPaths.

            /**
             * get input paths to job config
             */
            public static boolean addInputPaths(String pathString, Job job)
                throws IOException {
        
                Set<Path> pathSet = new HashSet<Path>();
                
                if (addAllSubDirs(new Path(pathString), job, pathSet)) {
                    Path[] paths = pathSet.toArray(new Path[pathSet.size()]);
        
                    FileInputFormat.setInputPaths(job, paths);  
                    return true;
                }
                return false;
            }
        
        Show
        Mads Moeller added a comment - Re-pasting addInputPaths. /** * get input paths to job config */ public static boolean addInputPaths( String pathString, Job job) throws IOException { Set<Path> pathSet = new HashSet<Path>(); if (addAllSubDirs( new Path(pathString), job, pathSet)) { Path[] paths = pathSet.toArray( new Path[pathSet.size()]); FileInputFormat.setInputPaths(job, paths); return true ; } return false ; }
        Hide
        Patrick Hunt added a comment -

        @ken (and @mads) thanks, I figured something like that. Could this possibly be an issue in pig itself? I do see this

        LoadFunc.setLocation:
             * This method will be called in the backend multiple times. Implementations
             * should bear in mind that this method is called multiple times and should
             * ensure there are no inconsistent side effects due to the multiple calls.
        

        But what I'm seeing in this UNION case is that setLocation is being called multiple times on the same AvroStorage instance, for the same job, with different files. This results (current avrostorage code with pig-1890-2.patch applied) in the duplication - 2 files are added rather than one (my patch fixes this by only taking the most recent argument to setLocation, which is consistent with existing loader funcs, whereas avrostorage keeps adding). If you check the debugging output you'll see this (I might have added a bit more debugging to setLocation to capture this event...)

        Regards.

        Show
        Patrick Hunt added a comment - @ken (and @mads) thanks, I figured something like that. Could this possibly be an issue in pig itself? I do see this LoadFunc.setLocation: * This method will be called in the backend multiple times. Implementations * should bear in mind that this method is called multiple times and should * ensure there are no inconsistent side effects due to the multiple calls. But what I'm seeing in this UNION case is that setLocation is being called multiple times on the same AvroStorage instance, for the same job, with different files. This results (current avrostorage code with pig-1890-2.patch applied) in the duplication - 2 files are added rather than one (my patch fixes this by only taking the most recent argument to setLocation, which is consistent with existing loader funcs, whereas avrostorage keeps adding). If you check the debugging output you'll see this (I might have added a bit more debugging to setLocation to capture this event...) Regards.
        Hide
        Dmitriy V. Ryaboy added a comment -

        I've been a bit out of the loop on this – you are doing your own directory traversal? You shouldn't need to do that in the Pig layer, this should be done in your InputFormat. I had to write a wrapper to emulate what MAPREDUCE-1501 does in Elephant-Bird, and I believe Pig does the same thing (but without caring about the mapred.input.dir.recursive config).

        As for setLocation, yes. Making it idempotent is "fun".

        I am curious about this business with calling it with different files for the same instance for the same job. Patrick, can you show some debug output that has the sequence of calls?

        Show
        Dmitriy V. Ryaboy added a comment - I've been a bit out of the loop on this – you are doing your own directory traversal? You shouldn't need to do that in the Pig layer, this should be done in your InputFormat. I had to write a wrapper to emulate what MAPREDUCE-1501 does in Elephant-Bird, and I believe Pig does the same thing (but without caring about the mapred.input.dir.recursive config). As for setLocation, yes. Making it idempotent is "fun". I am curious about this business with calling it with different files for the same instance for the same job. Patrick, can you show some debug output that has the sequence of calls?
        Hide
        Ken Goodhope added a comment -

        There are places where we use addInputDir as a true add, not set. Otherwise your solution would work. I did incorporate the use in a set for addAllSubDirs. Since the method name was no longer descriptive, I changed it to getAllSubDirs. This new patch passed unit tests, but currently there isn't a test for UNION. Let me know if this works.

        Show
        Ken Goodhope added a comment - There are places where we use addInputDir as a true add, not set. Otherwise your solution would work. I did incorporate the use in a set for addAllSubDirs. Since the method name was no longer descriptive, I changed it to getAllSubDirs. This new patch passed unit tests, but currently there isn't a test for UNION. Let me know if this works.
        Hide
        Ken Goodhope added a comment -

        Dmitry, when I inherited the code it was already doing the traversal in setLocation, and I didn't consider doing in the InputFormat. To be honest, I am not crazy about adding all the subdirs by default, since this is inconsistent with the way a standard map-reduce job works. But, our users expect this behavior, and have pig jobs that depend on it.

        If the current patch works, I am inclined to leave it, until I get time to do a better re-factoring.

        Show
        Ken Goodhope added a comment - Dmitry, when I inherited the code it was already doing the traversal in setLocation, and I didn't consider doing in the InputFormat. To be honest, I am not crazy about adding all the subdirs by default, since this is inconsistent with the way a standard map-reduce job works. But, our users expect this behavior, and have pig jobs that depend on it. If the current patch works, I am inclined to leave it, until I get time to do a better re-factoring.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Ken, adding all subdirs is how Hadoop + whatever patchset works, given the right value for mapred.input.dir.recursive

        Now, what version of Hadoop, I have no idea, but it's in there somewhere . And since that's what people decided on it probably behooves us to respect it. But fixing that issue is a separate concern from what this ticket tries to address. We should open a ticket, though.

        Show
        Dmitriy V. Ryaboy added a comment - Ken, adding all subdirs is how Hadoop + whatever patchset works, given the right value for mapred.input.dir.recursive Now, what version of Hadoop, I have no idea, but it's in there somewhere . And since that's what people decided on it probably behooves us to respect it. But fixing that issue is a separate concern from what this ticket tries to address. We should open a ticket, though.
        Hide
        Patrick Hunt added a comment -

        @Dmitriy thanks.

        Patrick, can you show some debug output that has the sequence of calls?

        Sure, I didn't save the original so I re-ran it, see attached (pig_setloc_avro.txt) for full details using the UNION example (this is with current trunk - notice that there are 6 tuples output rather than 2). I mis-remembered one detail - it's calling setLoc for the same job, with different files, but different AvroStorage objects. (see first two lines of setLocation debug message).

        Why are there 8 AvroStorage objects being created, shouldn't there just be 2, one for loading each of the two input files?

        Show
        Patrick Hunt added a comment - @Dmitriy thanks. Patrick, can you show some debug output that has the sequence of calls? Sure, I didn't save the original so I re-ran it, see attached (pig_setloc_avro.txt) for full details using the UNION example (this is with current trunk - notice that there are 6 tuples output rather than 2). I mis-remembered one detail - it's calling setLoc for the same job, with different files, but different AvroStorage objects. (see first two lines of setLocation debug message). Why are there 8 AvroStorage objects being created, shouldn't there just be 2, one for loading each of the two input files?
        Hide
        Patrick Hunt added a comment -

        demonstrate setLocation calls on AvroStorage.

        Show
        Patrick Hunt added a comment - demonstrate setLocation calls on AvroStorage.
        Hide
        Ken Goodhope added a comment -

        A recent change in Pig causes setLocation to be called twice, and if setLocation isn't idempotent, then you get twice the output. My suspicion is UNION is further exasperating the problem leading to the input being added 4X. Did you still see the problem with the last patch I added?

        Show
        Ken Goodhope added a comment - A recent change in Pig causes setLocation to be called twice, and if setLocation isn't idempotent, then you get twice the output. My suspicion is UNION is further exasperating the problem leading to the input being added 4X. Did you still see the problem with the last patch I added?
        Hide
        Mads Moeller added a comment -

        Hi Ken,

        With the latest patch the UNION behaves as expected for me.

        Thanks,
        Mads

        Show
        Mads Moeller added a comment - Hi Ken, With the latest patch the UNION behaves as expected for me. Thanks, Mads
        Hide
        Ken Goodhope added a comment -

        Uploading new patch that contains the same fixes to setLocation contained in the previous patch. New patch adds fixes to the schema that resolve the issues around the unit tests.

        Show
        Ken Goodhope added a comment - Uploading new patch that contains the same fixes to setLocation contained in the previous patch. New patch adds fixes to the schema that resolve the issues around the unit tests.
        Hide
        Ken Goodhope added a comment -

        Removing the blocker for PIG-2153. Turns out the problem, as first asserted, was in AvroStorage. The new logical plan must handle implicit wrapping tuples differently than used to be the case. In order to make this work, I removed the wrapping tuple from the schema produced by getSchema. getNext still returns its result in the wrapping tuple. I also had to modify putNext, to expect a piq schema without the implicit wrapping tuple.

        Show
        Ken Goodhope added a comment - Removing the blocker for PIG-2153 . Turns out the problem, as first asserted, was in AvroStorage. The new logical plan must handle implicit wrapping tuples differently than used to be the case. In order to make this work, I removed the wrapping tuple from the schema produced by getSchema. getNext still returns its result in the wrapping tuple. I also had to modify putNext, to expect a piq schema without the implicit wrapping tuple.
        Hide
        Ken Goodhope added a comment -

        PIG-1890-4.patch ready for review. All unit test now working against trunk.

        Show
        Ken Goodhope added a comment - PIG-1890 -4.patch ready for review. All unit test now working against trunk.
        Hide
        Patrick Hunt added a comment -

        I tested PIG-1890-4.patch against trunk using the UNION example and it generated expected (i.e. correct) results.

        Show
        Patrick Hunt added a comment - I tested PIG-1890 -4.patch against trunk using the UNION example and it generated expected (i.e. correct) results.
        Hide
        Daniel Dai added a comment -

        All Avro unit tests pass now, and test-patch returns all +1. Now we don't get a double PIG_WRAPPER, the schema generated by ArvoStorage looks good to me. Thanks guys for your hard working!

        Show
        Daniel Dai added a comment - All Avro unit tests pass now, and test-patch returns all +1. Now we don't get a double PIG_WRAPPER, the schema generated by ArvoStorage looks good to me. Thanks guys for your hard working!
        Hide
        Daniel Dai added a comment -

        Patch committed to trunk.

        Show
        Daniel Dai added a comment - Patch committed to trunk.
        Hide
        Daniel Dai added a comment -

        Also committed to 0.9 branch.

        Show
        Daniel Dai added a comment - Also committed to 0.9 branch.

          People

          • Assignee:
            Ken Goodhope
            Reporter:
            Daniel Dai
          • Votes:
            1 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development