PIG-1748: Add load/store function AvroStorage for Avro data

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: impl
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Tags:
      Pig Avro

      Description

      We want to use Pig to process arbitrary Avro data and store results as Avro files. AvroStorage() extends two Pig functions: LoadFunc and StoreFunc.

      Due to discrepancies between the Avro and Pig data models, AvroStorage has:
      1. Limited support for "record": we do not support recursively defined records, because the number of fields in such records is data dependent.
      2. Limited support for "union": we only accept nullable unions like ["null", "some-type"].
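
      For reference, a nullable union of the kind described above wraps a single type together with "null". A minimal Avro schema sketch (the record and field names here are illustrative, not taken from the patch):

      ```json
      {
        "type": "record",
        "name": "Example",
        "fields": [
          {"name": "value", "type": ["null", "string"], "default": null}
        ]
      }
      ```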

      For simplicity, we also make the following assumptions:
      If the input directory is a leaf directory, then we assume Avro data files in it have the same schema;
      If the input directory contains sub-directories, then we assume Avro data files in all sub-directories have the same schema.

      AvroStorage takes no input parameters when used as a LoadFunc (except for "debug [debug-level]").
      Users can provide parameters to AvroStorage when used as a StoreFunc. If they don't, the Avro schema of the output data is derived from its Pig schema.
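
      As a usage sketch (paths and the alias are illustrative; the fully qualified class name follows the piggybank package used elsewhere in this issue):

      ```pig
      -- Load: no parameters needed; the Avro schema is read from the input files
      records = LOAD '/data/input' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

      -- Store: with no parameters, the output Avro schema is derived from the Pig schema of 'records'
      STORE records INTO '/data/output' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
      ```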

      Detailed documentation can be found at http://linkedin.jira.com/wiki/display/HTOOLS/AvroStorage+-+Pig+support+for+Avro+data

      1. PIG-1748-3.patch
        127 kB
        Jakob Homan
      2. PIG-1748-2.patch
        126 kB
        Jakob Homan
      3. AvroStorageUtils-bagfix.patch
        1 kB
        Russell Jurney
      4. avro_test_files.tar.gz
        5 kB
        Jakob Homan
      5. avro_storage.patch
        128 kB
        lin guo

          Activity

          lin guo added a comment -

          This patch contains the code and unit tests for AvroStorage; it also updates dependent jars.

          Doug Cutting added a comment -

          I skimmed this, and overall it looks great! The only thing I noticed is that it should probably depend on avro-1.4.1 rather than 1.4.0.

          lin guo added a comment -

          Thanks... and I will update it to use the latest version.

          Best,
          Lin

          Jakob Homan added a comment -

          Attaching the binary Avro test files used by the unit tests. They need to be un-tarred and placed in contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files by the reviewer/committer.

          Jakob Homan added a comment -

          Attaching updated patch. I'll be finishing this JIRA for Lin. Delta from her patch: Avro 1.4, test refactoring, and code brought in line with Pig coding conventions. Ready for review.

          Daniel Dai added a comment -

          It seems you forgot contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_array.avro

          Daniel Dai added a comment -

          OK, got it as a separate file.

          Scott Carey added a comment -

          About plans on the Avro side:

          I plan on merging my work with this (great!) work into the Avro project. In the long run the Avro project is a better place for this for several reasons, but in the short term it does not matter. It will be some time before it is available from Avro.

          • Avro is fully mavenized in 1.5.0 (due out in a few weeks), meaning it is easy to add sub-module jars such as 'avro-pig.jar'. Furthermore, it's easy to have multiple versions for each version of Pig if needed. For example, we could simultaneously release avro-pig0.7.jar, avro-pig0.8.jar, etc. as part of Avro 1.6.0 if it was necessary due to API breakage or extra features enabled in newer versions of Pig.
          • A lot of the work here is applicable to multiple systems; I plan to share code with the Avro Hive SerDes when those are implemented. This may lead to a general module that helps projects translate their schemas to Avro and back.

          None of this impacts the work here in the short term, but I'm sure people will be interested in these plans and may have other ideas/suggestions on how to work on this in a way that is not too fragmented.

          Daniel Dai added a comment -

          To Jakob:
          I get 2 failures in TestAvroStorage: testArrayWithSame and testRecordWithFieldSchema. The error message is similar:

          could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments '[same, src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_array.avro]'
          java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments '[same, src/test/java/org/apache/pig/piggybank/test/storage/avro/avro_test_files/test_array.avro]'
          at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:494)
          at org.apache.pig.impl.logicalLayer.parser.QueryParser.NonEvalFuncSpec(QueryParser.java:5660)
          at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:4034)
          at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1501)
          at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:1013)
          at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:825)
          at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
          at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1708)
          at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1658)
          at org.apache.pig.PigServer.registerQuery(PigServer.java:546)
          at org.apache.pig.PigServer.registerQuery(PigServer.java:570)
          at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.testAvroStorage(Unknown Source)
          at org.apache.pig.piggybank.test.storage.avro.TestAvroStorage.testArrayWithSame(Unknown Source)
          Caused by: java.lang.reflect.InvocationTargetException
          at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
          at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:484)
          Caused by: java.net.ConnectException: Call to localhost.localdomain/127.0.0.1:57284 failed on connection exception: java.net.ConnectException: Connection refused
          at org.apache.hadoop.ipc.Client.wrapException(Client.java:767)
          at org.apache.hadoop.ipc.Client.call(Client.java:743)
          at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
          at $Proxy9.getProtocolVersion(Unknown Source)
          at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
          at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
          at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
          at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
          at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
          at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
          at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
          at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
          at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
          at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
          at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:180)
          at org.apache.pig.piggybank.storage.avro.AvroStorage.init(AvroStorage.java:372)
          at org.apache.pig.piggybank.storage.avro.AvroStorage.<init>(AvroStorage.java:110)
          Caused by: java.net.ConnectException: Connection refused
          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
          at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
          at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
          at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)
          at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
          at org.apache.hadoop.ipc.Client.getConnection(Client.java:860)
          at org.apache.hadoop.ipc.Client.call(Client.java:720)

          Do you know what's happening?

          Jakob Homan added a comment -

          @Daniel- Let me take a look.

          @Scott - It's worth noting that projects can include Avro support as they wish, just as Avro can incorporate that work as it wishes. But I'm not sure I understand. You're saying that you'd rather have any higher-level application's Avro support hosted in Avro, rather than treating Avro as a library to be included? This seems like an odd approach to me, essentially inverting the domain knowledge of each application into Avro, rather than the application itself, where its developers frolic and work. Is there something I'm missing here?

          Scott Carey added a comment -

          @Jakob
          Of course projects can do what they wish. I'm simply hoping many can collaborate together on this general problem category.

          This seems like an odd approach to me, essentially inverting the domain knowledge of each application to Avro, rather than the application itself where its developers frolic and work. Is there something I'm missing here?

          Writing a Pig storage adapter requires Avro domain knowledge and Pig domain knowledge. I found that it required more knowledge of Avro than Pig to do well. If all you ever want to achieve is:

          Pig -> Avro file -> Pig, then maybe it doesn't matter who hosts it.

          But what if you want to do:
          Pig -> Avro file -> Cascading -> Avro file -> Hive -> Avro file -> Pig?

          Now which project should host what defines how all those data models can interact through a common schema system? Pig contrib? Hive contrib? Howl? Cascading (GPL...)?

          In the longer term, the common elements needed by all of the above can crystallize out into an Avro module general to all, and individual modules hosted by each project can use that. What that might look like won't be apparent until there are enough example use cases, however.

          Jakob Homan added a comment -

          @Scott
          I can't say I'm convinced, and am in fact more concerned from your example, given that this approach essentially builds dependencies on all of those projects into Avro. However, this JIRA isn't the best place to discuss this. Is there a discussion about this type of integration going on in Avro that the community can contribute to? Is there a JIRA? Thanks.

          Scott Carey added a comment -

          @Jakob

          I can't say I'm convinced, and am in fact more concerned from your example, given that this approach essentially builds dependencies on all of those projects into Avro.

          Avro is completely modularized now, so there would not be any dependency mess like that. It is now easy to add separate modules such as 'avro-pig.jar' or 'avro-hive.jar'. It already has 'avro-mapred.jar'.
          https://cwiki.apache.org/confluence/display/AVRO/Build+Documentation#BuildDocumentation-Java

          As this gets off topic, we can use the Avro developer mailing list. Related issues are https://issues.apache.org/jira/browse/AVRO-647 and the issues linked to it, as well as https://issues.apache.org/jira/browse/AVRO-592. There is no ticket yet on the broader-scope work.

          Felix Gao added a comment -

          I noticed the avro loader does not support file globbing.

          log_load = LOAD '/user/felix/avro/access_log.test.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage(); <--- works fine
          but
          log_load = LOAD '/user/felix/avro/*.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

          ERROR 1018: Problem determining schema during load

          org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Problem determining schema during load
          at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1342)
          at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1286)
          at org.apache.pig.PigServer.registerQuery(PigServer.java:460)
          at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:738)

          at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
          at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:163)
          at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:139)
          at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
          at org.apache.pig.Main.main(Main.java:414)
          Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem determining schema during load
          at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:752)

          at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
          at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1336)
          ... 8 more
          Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018: Problem determining schema during load
          at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:156)
          at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:750)
          ... 10 more
          Caused by: java.io.FileNotFoundException: File does not exist: /user/felix/avro/*.avro
          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1586)
          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1577)
          at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:428)
          at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:185)
          at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:431)
          at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:181)

          at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:133)
          at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:108)
          at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:233)
          at org.apache.pig.impl.logicalLayer.LOLoad.determineSchema(LOLoad.java:169)
          at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:150)
          ... 11 more
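
          Given the description's stated assumption that all Avro files in a leaf directory share the same schema, one possible workaround sketch is to load the containing directory instead of a glob:

          ```pig
          -- Loads every Avro file in the directory; assumes they all share one schema
          log_load = LOAD '/user/felix/avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
          ```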

          Jakob Homan added a comment -

          Figured out the test failures. It turns out that when one does a full run of the unit tests (which I cannot get to succeed on my machine), the ~/pigtest directory is left over during the contrib tests, and within the contrib build.xml file a junit.hadoop.conf variable points those tests at the HDFS the Pig tests had running, which is no longer up. This conf trickles down to the test, which ends up using it as the default filesystem and tries to connect to it, but can't, since that HDFS is gone. This doesn't occur when run through an IDE like IntelliJ, since the IDE doesn't use contrib's build.xml settings.

          I've fixed this by explicitly referencing the local file system in the tests, though this seems like a bug in the contrib build system to me. I'll open a JIRA to address this.

          @Felix - good catch. To provide a cleaner separation between my work and Lin's, I would like to go ahead and fix this bug in a separate JIRA after 1748 is committed. How does this sound to you?

          Contrib tests pass, except org.apache.pig.piggybank.test.TestPigStorageSchema, which fails for me with or without the patch. Version 3 of the patch is updated to include better behavior for directories containing files that should be filtered out.

          Dmitriy V. Ryaboy added a comment -

          The TestPigStorageSchema thing is mine, someone else just opened a ticket. Will fix.

          Daniel Dai added a comment -

          Patch committed to trunk. Thanks Lin, Jakob!

          Russell Jurney added a comment -

          This patch fixes a bug in persisting bags of tuples via AvroStorage.

          The script that alerted me to a bug is:

          messages = LOAD '/tmp/messages.avro' USING AvroStorage();
          user_groups = GROUP messages BY user_id;
          per_user = FOREACH user_groups {
              sorted = ORDER messages BY message_id DESC;
              GENERATE group AS user_id, sorted AS messages;
          }
          DESCRIBE per_user
          > per_user: {user_id: int,messages: {(message_id: int,topic: chararray,user_id: int)}}
          STORE per_user INTO '/tmp/per_user.avro' USING AvroStorage();

          The error is:

          Pig Stack Trace
          ---------------
          ERROR 1002: Unable to store alias per_user

          org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias per_user
          at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1596)
          at org.apache.pig.PigServer.registerQuery(PigServer.java:584)
          at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:942)
          at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
          at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
          at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
          at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:67)
          at org.apache.pig.Main.run(Main.java:487)
          at org.apache.pig.Main.main(Main.java:108)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
          Caused by: java.lang.NullPointerException
          at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.isTupleWrapper(AvroStorageUtils.java:327)
          at org.apache.pig.piggybank.storage.avro.PigSchema2Avro.convert(PigSchema2Avro.java:82)
          at org.apache.pig.piggybank.storage.avro.PigSchema2Avro.convert(PigSchema2Avro.java:105)
          at org.apache.pig.piggybank.storage.avro.PigSchema2Avro.convertRecord(PigSchema2Avro.java:151)
          at org.apache.pig.piggybank.storage.avro.PigSchema2Avro.convert(PigSchema2Avro.java:62)
          at org.apache.pig.piggybank.storage.avro.AvroStorage.checkSchema(AvroStorage.java:502)
          at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:65)
          at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:77)
          at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
          at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
          at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
          at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
          at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
          at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
          at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
          at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
          at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:292)
          at org.apache.pig.PigServer.compilePp(PigServer.java:1360)
          at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1297)
          at org.apache.pig.PigServer.execute(PigServer.java:1286)
          at org.apache.pig.PigServer.access$400(PigServer.java:125)
          at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1591)
          ... 13 more
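
          The NullPointerException at AvroStorageUtils.isTupleWrapper points at a missing null check when converting a bag's inner tuple schema to Avro. The attached AvroStorageUtils-bagfix.patch presumably guards that dereference; the sketch below only mirrors the shape of such a guard (the class name, method body, and "TUPLE_" prefix are assumptions for illustration, not the actual piggybank code):

```java
public class TupleWrapperGuard {
    // Loosely mirrors AvroStorageUtils.isTupleWrapper, which decides whether
    // an Avro record is Pig's synthetic tuple wrapper by inspecting its name.
    // The stack trace above points at a dereference of a name that can be
    // null for a bag's inner tuple schema; a guard of this shape avoids the
    // NullPointerException (the "TUPLE_" prefix is an assumed placeholder).
    static boolean isTupleWrapper(String recordName) {
        if (recordName == null) {
            return false; // previously: recordName.startsWith(...) threw NPE
        }
        return recordName.startsWith("TUPLE_");
    }

    public static void main(String[] args) {
        System.out.println(isTupleWrapper(null));       // false, no NPE
        System.out.println(isTupleWrapper("TUPLE_0"));  // true
    }
}
```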

          Russell Jurney added a comment -

          I've attached a patch that fixes a bug I ran into in serializing bags of tuples.

          Doug Cutting added a comment -

          Russell, this patch looks good to me, except for the print statement. But you should probably open a new issue and add it there, as this issue has already been committed, closed and released.

          Russell Jurney added a comment -

          Thanks, Doug. Created https://issues.apache.org/jira/browse/PIG-2411 and submitted patch.

          deb ashish added a comment -

          REGISTER /path/avro-1.4.1.jar
          REGISTER /path/json-simple-1.1.jar
          REGISTER /path/piggybank.jar
          REGISTER /path/jackson-core-asl-1.5.5.jar
          REGISTER /path/jackson-mapper-asl-1.5.5.jar
          avro = LOAD '/hdfs path/part-r-00000.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

          I'm trying this code, but it's unable to read the Avro file; it shows the following exception:

          Pig Stack Trace
          ---------------
          ERROR 2997: Unable to recreate exception from backed error: Error: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

          org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias sc. Backend error : Unable to recreate exception from backed error: Error: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
          at org.apache.pig.PigServer.openIterator(PigServer.java:742)
          at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:612)
          at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
          at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
          at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
          at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
          at org.apache.pig.Main.run(Main.java:406)
          at org.apache.pig.Main.main(Main.java:107)
          Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: Error: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:337)
          at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:378)
          at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1198)
          at org.apache.pig.PigServer.storeEx(PigServer.java:874)
          at org.apache.pig.PigServer.store(PigServer.java:816)
          at org.apache.pig.PigServer.openIterator(PigServer.java:728)
          ... 7 more

          Please help me ASAP.
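
          A likely cause of this error (not confirmed in this thread): org.apache.hadoop.mapreduce.TaskAttemptContext is a class in the Hadoop 1.x line but an interface from Hadoop 0.23/2.x on, so jars compiled against one line fail to link on a cluster running the other. A small diagnostic sketch (approach assumed, not from this issue) that reports which shape is on the classpath:

```java
public class ContextShapeCheck {
    // Reports whether the Hadoop on the classpath ships
    // org.apache.hadoop.mapreduce.TaskAttemptContext as a class (1.x line)
    // or an interface (0.23/2.x line). Mixing the two produces the
    // "Found class ... but interface was expected" linkage error above.
    static String shape() {
        try {
            Class<?> c = Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext");
            return c.isInterface() ? "interface (Hadoop 0.23/2.x line)"
                                   : "class (Hadoop 1.x line)";
        } catch (ClassNotFoundException e) {
            return "no Hadoop on classpath";
        }
    }

    public static void main(String[] args) {
        System.out.println(shape());
    }
}
```

          Run it with the same classpath the failing job uses; if the result disagrees with the Hadoop line the registered piggybank/avro jars were built against, rebuilding those jars against the cluster's Hadoop is the usual remedy.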

          Jakob Homan added a comment -

          @deb - questions like these should be directed to the pig user list, not JIRA. You'll receive assistance there.


            People

            • Assignee:
              lin guo
              Reporter:
              lin guo
            • Votes: 0
              Watchers: 15