Pig / PIG-2495

Using merge JOIN from a HBaseStorage produces an error

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.1, 0.9.2, 0.13.1
    • Fix Version/s: 0.14.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      HBase 0.90.3, Hadoop 0.20-append

    • Hadoop Flags:
      Reviewed

      Description

      To increase the performance of my computation, I would like to use a merge join between two tables, but it produces an error.

      Here is the script:

      start_sessions = LOAD 'hbase://startSession.bea000000.dev.ubithere.com' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('meta:infoid meta:imei meta:timestamp', '-loadKey') AS (sid:chararray, infoid:chararray, imei:chararray, start:long);
      end_sessions = LOAD 'hbase://endSession.bea000000.dev.ubithere.com' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('meta:timestamp meta:locid', '-loadKey') AS (sid:chararray, end:long, locid:chararray);
      sessions = JOIN start_sessions BY sid, end_sessions BY sid USING 'merge';
      STORE sessions INTO 'sessionsTest' USING PigStorage ('*');
      

      Here is the result of this script:

      2012-01-30 16:12:43,920 [main] INFO  org.apache.pig.Main - Logging error messages to: /root/pig_1327939963919.log
      2012-01-30 16:12:44,025 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://lxc233:9000
      2012-01-30 16:12:44,102 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: lxc233:9001
      2012-01-30 16:12:44,760 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: MERGE_JION
      2012-01-30 16:12:44,923 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
      2012-01-30 16:12:44,982 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2
      2012-01-30 16:12:44,982 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
      2012-01-30 16:12:45,001 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
      2012-01-30 16:12:45,006 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:host.name=lxc233.machine.com
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.6.0_22
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Sun Microsystems Inc.
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:java.home=/usr/lib/jvm/java-6-sun-1.6.0.22/jre
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:java.class.path=/opt/hadoop/conf:/usr/lib/jvm/java-6-sun/jre/lib/tools.jar:/opt/hadoop:/opt/hadoop/hadoop-0.20-append-core.jar:/opt/hadoop/lib/commons-cli-1.2.jar:/opt/hadoop/lib/commons-codec-1.3.jar:/opt/hadoop/lib/commons-el-1.0.jar:/opt/hadoop/lib/commons-httpclient-3.0.1.jar:/opt/hadoop/lib/commons-logging-1.0.4.jar:/opt/hadoop/lib/commons-logging-api-1.0.4.jar:/opt/hadoop/lib/commons-net-1.4.1.jar:/opt/hadoop/lib/core-3.1.1.jar:/opt/hadoop/lib/hadoop-fairscheduler-0.20-append.jar:/opt/hadoop/lib/hadoop-gpl-compression-0.2.0-dev.jar:/opt/hadoop/lib/hadoop-lzo-0.4.14.jar:/opt/hadoop/lib/hsqldb-1.8.0.10.jar:/opt/hadoop/lib/jasper-compiler-5.5.12.jar:/opt/hadoop/lib/jasper-runtime-5.5.12.jar:/opt/hadoop/lib/jets3t-0.6.1.jar:/opt/hadoop/lib/jetty-6.1.14.jar:/opt/hadoop/lib/jetty-util-6.1.14.jar:/opt/hadoop/lib/junit-4.5.jar:/opt/hadoop/lib/kfs-0.2.2.jar:/opt/hadoop/lib/log4j-1.2.15.jar:/opt/hadoop/lib/mockito-all-1.8.2.jar:/opt/hadoop/lib/oro-2.0.8.jar:/opt/hadoop/lib/servlet-api-2.5-6.1.14.jar:/opt/hadoop/lib/slf4j-api-1.4.3.jar:/opt/hadoop/lib/slf4j-log4j12-1.4.3.jar:/opt/hadoop/lib/xmlenc-0.52.jar:/opt/hadoop/lib/jsp-2.1/jsp-2.1.jar:/opt/hadoop/lib/jsp-2.1/jsp-api-2.1.jar:/opt/pig/bin/../conf:/usr/lib/jvm/java-6-sun/jre/lib/tools.jar:/opt/hadoop/lib/commons-codec-1.3.jar:/opt/hbase/lib/guava-r06.jar:/opt/hbase/hbase-0.90.3.jar:/opt/hadoop/lib/log4j-1.2.15.jar:/opt/hadoop/lib/commons-cli-1.2.jar:/opt/hadoop/lib/commons-logging-1.0.4.jar:/opt/pig/pig-withouthadoop.jar:/opt/hadoop/conf_computation:/opt/hbase/conf:/opt/pig/bin/../lib/hadoop-0.20-append-core.jar:/opt/pig/bin/../lib/hadoop-gpl-compression-0.2.0-dev.jar:/opt/pig/bin/../lib/hbase-0.90.3.jar:/opt/pig/bin/../lib/pigudfs.jar:/opt/pig/bin/../lib/zookeeper-3.3.2.jar:/opt/pig/bin/../pig-withouthadoop.jar:
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/opt/hadoop/lib/native/Linux-amd64-64
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=<NA>
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:os.version=2.6.32-5-amd64
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:user.name=root
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:user.home=/root
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/root
      2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=lxc233.machine.com:2222,lxc231.machine.com:2222,lxc234.machine.com:2222 sessionTimeout=180000 watcher=hconnection
      2012-01-30 16:12:45,048 [main-SendThread()] INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server lxc231.machine.com/192.168.1.231:2222
      2012-01-30 16:12:45,049 [main-SendThread(lxc231.machine.com:2222)] INFO  org.apache.zookeeper.ClientCnxn - Socket connection established to lxc231.machine.com/192.168.1.231:2222, initiating session
      2012-01-30 16:12:45,081 [main-SendThread(lxc231.machine.com:2222)] INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server lxc231.machine.com/192.168.1.231:2222, sessionid = 0x134c294771a073f, negotiated timeout = 180000
      2012-01-30 16:12:46,569 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
      2012-01-30 16:12:46,590 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
      2012-01-30 16:12:46,870 [Thread-13] INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=lxc233.machine.com:2222,lxc231.machine.com:2222,lxc234.machine.com:2222 sessionTimeout=180000 watcher=hconnection
      2012-01-30 16:12:46,871 [Thread-13-SendThread()] INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server lxc233.machine.com/192.168.1.233:2222
      2012-01-30 16:12:46,871 [Thread-13-SendThread(lxc233.machine.com:2222)] INFO  org.apache.zookeeper.ClientCnxn - Socket connection established to lxc233.machine.com/192.168.1.233:2222, initiating session
      2012-01-30 16:12:46,872 [Thread-13-SendThread(lxc233.machine.com:2222)] INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server lxc233.machine.com/192.168.1.233:2222, sessionid = 0x2343822449935e1, negotiated timeout = 180000
      2012-01-30 16:12:46,880 [Thread-13] INFO  org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=lxc233.machine.com:2222,lxc231.machine.com:2222,lxc234.machine.com:2222 sessionTimeout=180000 watcher=hconnection
      2012-01-30 16:12:46,880 [Thread-13-SendThread()] INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server lxc233.machine.com/192.168.1.233:2222
      2012-01-30 16:12:46,880 [Thread-13-SendThread(lxc233.machine.com:2222)] INFO  org.apache.zookeeper.ClientCnxn - Socket connection established to lxc233.machine.com/192.168.1.233:2222, initiating session
      2012-01-30 16:12:46,882 [Thread-13-SendThread(lxc233.machine.com:2222)] INFO  org.apache.zookeeper.ClientCnxn - Session establishment complete on server lxc233.machine.com/192.168.1.233:2222, sessionid = 0x2343822449935e2, negotiated timeout = 180000
      2012-01-30 16:12:47,091 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
      2012-01-30 16:12:47,703 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201201546_0890
      2012-01-30 16:12:47,703 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://lxc233:50030/jobdetails.jsp?jobid=job_201201201546_0890
      2012-01-30 16:12:55,723 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 25% complete
      2012-01-30 16:13:49,312 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete
      2012-01-30 16:13:55,322 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
      2012-01-30 16:13:57,327 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201201201546_0890 has failed! Stop running all dependent jobs
      2012-01-30 16:13:57,327 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
      2012-01-30 16:13:57,337 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: Could create instance of class org.apache.pig.backend.hadoop.hbase.HBaseStorage$1, while attempting to de-serialize it. (no default constructor ?)
      2012-01-30 16:13:57,337 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
      2012-01-30 16:13:57,338 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 
      
      HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
      0.20-append	0.9.2-SNAPSHOT	root	2012-01-30 16:12:44	2012-01-30 16:13:57	MERGE_JION
      
      Failed!
      
      Failed Jobs:
      JobId	Alias	Feature	Message	Outputs
      job_201201201546_0890	end_sessions	INDEXER	Message: Job failed!	
      
      Input(s):
      Failed to read data from "hbase://endSession.bea000000.dev.ubithere.com"
      
      Output(s):
      
      Counters:
      Total records written : 0
      Total bytes written : 0
      Spillable Memory Manager spill count : 0
      Total bags proactively spilled: 0
      Total records proactively spilled: 0
      
      Job DAG:
      job_201201201546_0890	->	null,
      null
      
      
      2012-01-30 16:13:57,338 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
      2012-01-30 16:13:57,339 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Encountered IOException. Could create instance of class org.apache.pig.backend.hadoop.hbase.HBaseStorage$1, while attempting to de-serialize it. (no default constructor ?)
      Details at logfile: /root/pig_1327939963919.log
      2012-01-30 16:13:57,339 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job failed, hadoop does not return any error message
      Details at logfile: /root/pig_1327939963919.log
      

      And here is the result in the log file:

      Backend error message
      ---------------------
      java.io.IOException: Could create instance of class org.apache.pig.backend.hadoop.hbase.HBaseStorage$1, while attempting to de-serialize it. (no default constructor ?)
      	at org.apache.pig.data.BinInterSedes.readWritable(BinInterSedes.java:235)
      	at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:336)
      	at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
      	at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:556)
      	at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64)
      	at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
      	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
      	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
      	at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:113)
      	at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
      	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
      	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
      	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
      	at org.apache.hadoop.mapred.Child.main(Child.java:170)
      Caused by: java.lang.InstantiationException: org.apache.pig.backend.hadoop.hbase.HBaseStorage$1
      	at java.lang.Class.newInstance0(Class.java:340)
      	at java.lang.Class.newInstance(Class.java:308)
      	at org.apache.pig.data.BinInterSedes.readWritable(BinInterSedes.java:231)
      	... 13 more
      
      Pig Stack Trace
      ---------------
      ERROR 2997: Encountered IOException. Could create instance of class org.apache.pig.backend.hadoop.hbase.HBaseStorage$1, while attempting to de-serialize it. (no default constructor ?)
      
      java.io.IOException: Could create instance of class org.apache.pig.backend.hadoop.hbase.HBaseStorage$1, while attempting to de-serialize it. (no default constructor ?)
      	at org.apache.pig.data.BinInterSedes.readWritable(BinInterSedes.java:235)
      	at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:336)
      	at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
      	at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:556)
      	at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64)
      	at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
      	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
      	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
      	at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:113)
      	at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
      	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
      	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
      	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
      	at org.apache.hadoop.mapred.Child.main(Child.java:170)
      Caused by: java.lang.InstantiationException: org.apache.pig.backend.hadoop.hbase.HBaseStorage$1
      	at java.lang.Class.newInstance0(Class.java:340)
      	at java.lang.Class.newInstance(Class.java:308)
      	at org.apache.pig.data.BinInterSedes.readWritable(BinInterSedes.java:231)
      ================================================================================
      Pig Stack Trace
      ---------------
      ERROR 2244: Job failed, hadoop does not return any error message
      
      org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message
      	at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:139)
      	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:192)
      	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
      	at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
      	at org.apache.pig.Main.run(Main.java:561)
      	at org.apache.pig.Main.main(Main.java:111)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      	at java.lang.reflect.Method.invoke(Method.java:597)
      	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
      ================================================================================
      

      The same script without using merge works without any problem.

      Attachments

      1. PIG-2495-Collectable.patch
        8 kB
        Daniel Dai
      2. PIG-2495-2.patch
        25 kB
        Daniel Dai
      3. PIG-2495.patch
        6 kB
        Kevin Lion
      4. patch
        27 kB
        Brian Johnson
      5. patch
        12 kB
        Brian Johnson

        Activity

        Bill Graham added a comment -

        Thanks for the patch, Kevin! A few notes about Pig code style:

        • Indentation should be 4 spaces, you have 2 in some spots.
        • Curly brackets should go at the end of the class name or constructor/method signature, not below it.
        • Please include the standard apache header above the package name for TableSplitComparable
        • I think we favor brackets in if/else clauses but I'll let someone else confirm.

        And a few more comments:

        • I would think TableSplitComparable should implement WritableComparable<TableSplit> instead of WritableComparable<TableSplitComparable>, right?
        • Your hashcode method seems like it could just be
          return ((tsplit == null) ? 0 : tsplit.hashCode());
          

        since it's just delegating to tsplit.

        • Also, the condition in equals could just be:
        else {
          return tsplit.equals(other.tsplit);
        }
        
        • I don't think WritableComparable needs to implement Serializable and serialVersionUID.
        • Should the wrapped TableSplit be initialized to an empty split? It seems like it should have to be explicitly set, right?
        • In getSplitComparable you can just return new TableSplitComparable((TableSplit)split); after a ! instanceof check that throws an exception.
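A minimal, self-contained sketch of the wrapper class these review points describe. All names here (FakeSplit, SplitComparable, WrapperDemo) are hypothetical stand-ins: the real patch wraps HBase's TableSplit in a TableSplitComparable implementing Hadoop's WritableComparable&lt;TableSplit&gt;, while this sketch uses plain Comparable and java.io streams so it runs without Hadoop on the classpath.

```java
import java.io.*;

// Hypothetical stand-in for HBase's TableSplit: just a start-row key.
class FakeSplit implements Comparable<FakeSplit> {
    String startRow = "";
    FakeSplit() {}
    FakeSplit(String startRow) { this.startRow = startRow; }
    void write(DataOutput out) throws IOException { out.writeUTF(startRow); }
    void readFields(DataInput in) throws IOException { startRow = in.readUTF(); }
    public int compareTo(FakeSplit o) { return startRow.compareTo(o.startRow); }
    public boolean equals(Object o) {
        return o instanceof FakeSplit && startRow.equals(((FakeSplit) o).startRow);
    }
    public int hashCode() { return startRow.hashCode(); }
}

// Named wrapper in the spirit of TableSplitComparable: a public no-arg
// constructor so reflective deserialization works, everything else delegated.
class SplitComparable implements Comparable<FakeSplit> {
    FakeSplit tsplit;
    public SplitComparable() {}                        // required for newInstance()
    public SplitComparable(FakeSplit tsplit) { this.tsplit = tsplit; }
    void write(DataOutput out) throws IOException { tsplit.write(out); }
    void readFields(DataInput in) throws IOException {
        tsplit = new FakeSplit();
        tsplit.readFields(in);
    }
    public int compareTo(FakeSplit other) { return tsplit.compareTo(other); }
    public boolean equals(Object o) {
        if (!(o instanceof SplitComparable)) return false;
        SplitComparable other = (SplitComparable) o;
        if (tsplit == null) return other.tsplit == null;
        return tsplit.equals(other.tsplit);            // delegate, per the review
    }
    public int hashCode() { return (tsplit == null) ? 0 : tsplit.hashCode(); }
}

public class WrapperDemo {
    // Serialize, then deserialize through the no-arg-constructor path that
    // broke for the anonymous class in the original code.
    static SplitComparable roundTrip(SplitComparable in) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        in.write(new DataOutputStream(bos));
        SplitComparable out = SplitComparable.class.getDeclaredConstructor().newInstance();
        out.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        return out;
    }

    public static void main(String[] args) throws Exception {
        SplitComparable original = new SplitComparable(new FakeSplit("row-0001"));
        System.out.println(original.equals(roundTrip(original)));
    }
}
```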
        Kevin Lion added a comment -

        Thanks for those remarks. I've modified my patch with all your comments.
        Does it seem okay to you?

        Bill Graham added a comment -

        Looks good to me besides one last nit, which is to include TableSplit as the generic type in the HBaseStorage.getSplitComparable signature like so:

        public WritableComparable<TableSplit> getSplitComparable(InputSplit split) throws IOException
        
        Kevin Lion added a comment -

        Okay, it's done, thanks for your help! What must I do next? "Resolve" the issue?

        Alan Gates added a comment -

        Resolve is when a committer has checked it in. The next step will be for one of the committers to review it and run tests, and possibly commit it.

        Dmitriy V. Ryaboy added a comment -

        Assigning to Kevin (you don't have to do anything, that's just to give you credit). I'll review the patch.

        Dmitriy V. Ryaboy added a comment -

        A few minor comments:

        The @since annotation is wrong – even if we decide to backport this all the way to the 0.9 branch, we have to make it 0.9.3 since 0.9.2 is released.

        toString – should probably return something more useful than just the class name? Maybe concatenate the actual split's toString()?

        Overall, I'm not sure what caused the old code to not work and how this is supposed to fix the issue. Just for my edification, can you explain? The difference is only that before, we implemented WritableComparable<InputSplit> and now you implement WritableComparable<TableSplit>? The test you added fails when I apply it to trunk:

        Testcase: testMergeJoin took 21.401 sec
        FAILED
        expected:<0> but was:<48>
        junit.framework.AssertionFailedError: expected:<0> but was:<48>
        at org.apache.pig.test.TestHBaseStorage.testMergeJoin(TestHBaseStorage.java:910)

        Kevin Lion added a comment -

        Okay, I've modified the @since to 0.9.3. toString() now returns the class name and the split's toString().

        There was an issue because getSplitComparable must return something serializable with a default constructor; the previous anonymous class didn't have one.

        I've also modified the unit test, but I can't test it on my computer because I get a timeout when I run it: does someone know why?

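The root cause Kevin describes can be demonstrated in isolation. The stack trace shows BinInterSedes.readWritable calling Class.newInstance(), which requires a no-arg constructor; an anonymous class that captures surrounding state receives that state as hidden constructor parameters, so reflective instantiation fails. A minimal sketch, independent of Pig and HBase (NamedHolder, ReflectionDemo, and makeAnon are illustrative names):

```java
// Named class with a public no-arg constructor: reflection can instantiate it.
class NamedHolder {
    public NamedHolder() {}
}

public class ReflectionDemo {
    // The anonymous class captures 'tag', so the compiler gives it a
    // constructor taking a hidden String argument -- no no-arg constructor
    // exists, mirroring the InstantiationException on HBaseStorage$1 above.
    static Comparable<Integer> makeAnon(final String tag) {
        return new Comparable<Integer>() {
            public int compareTo(Integer o) { return Integer.compare(tag.length(), o); }
        };
    }

    static boolean canInstantiate(Class<?> c) {
        try {
            c.newInstance();   // same call used by BinInterSedes.readWritable
            return true;
        } catch (Exception e) {
            return false;      // InstantiationException for the anonymous class
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(NamedHolder.class));        // true
        System.out.println(canInstantiate(makeAnon("x").getClass())); // false
    }
}
```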
        Cheolsoo Park added a comment -

        Hi Kevin,

        Sorry for the late reply. I was looking into your patch to commit it, but I ran into the same test failure as what Dmitriy mentioned.

        I've also modified the unit test but I can't test it on my computer because I've a timeout when I run it: does someone knows why?

        Does your test log contain anything? Can you please upload your test log to the jira? It can be found at build/test/logs/TEST-org.apache.pig.test.TestHBaseStorage.txt.

        I am canceling patch available for now until the test is fixed.

        Thanks!

        Pradeep Gollakota added a comment -

        Hi Kevin,

        I have a very minor request for your patch. When throwing the RuntimeException, could you also include the class information for the given type? This could potentially be useful for debugging purposes.

        Yi Zou added a comment -

        Hi, I wonder if this patch is still being worked on. Is it still planned to be pulled into the main tip of Pig? I am hitting exactly the same bug doing a merge JOIN on 0.12.0 stable, only that it is actually a self join on the same input HBase table, though I would not think that matters as far as this bug is concerned. If you need help moving the patch forward, I will be glad to assist; it's a good fix in my opinion.

        thanks

        Brian Johnson added a comment -

        I am working on it. I fixed the test; it was a mismatch between binary and string types in HBase. I'm also implementing IndexableLoadFunc and CollectableLoadFunc to give HBaseStorage the full range of optimized joins and groups.

        Brian Johnson added a comment -

        There was also an issue with the join key in the test

        Brian Johnson added a comment -

        Here is my patch for 0.12.1 that includes the existing patch, but fixes the test and adds support for IndexableLoadFunc and CollectableLoadFunc

        Daniel Dai added a comment -

        I am fine with the SplitComparable changes, since I see all the review comments from Bill and Dmitriy are addressed and TestHBaseStorage passes for me.

        In seekNear, we'd better use the existing objToBytes method so that we can deal with alternative casters, additional key types (DateTime, BigInteger), and complex keys. I modified the patch with this change and also rebased it against trunk.

        Brian Johnson added a comment -

        Although the test passes, there appeared to be some problems when I ran this on a larger data set, so I think there might be an issue with the implementation. I'll try it again with your changes and verify whether or not there was an issue. Perhaps it's best to split off the changes for IndexableLoadFunc from the CollectableLoadFunc ones?

        Daniel Dai added a comment -

        Thanks for verifying. Do you see any exception?

        Also what do you mean "split off the changes for IndexableLoadFunc from the CollectableLoadFunc"?

        Brian Johnson added a comment -

        Two different interfaces were implemented in this patch: IndexableLoadFunc, which possibly still has an issue, and CollectableLoadFunc, which is a no-op on HBase because keys are unique. I'm suggesting maybe I should make two separate patches, because CollectableLoadFunc works great (although there is a related bug in Pig: https://issues.apache.org/jira/browse/PIG-4166), but I'm not 100% sure on IndexableLoadFunc. I didn't get any exceptions, but the data returned by the script on our actual data didn't look right.

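Brian's observation about CollectableLoadFunc can be sketched with a stand-in interface. Assuming the real org.apache.pig.CollectableLoadFunc contract (a single ensureAllKeyInstancesInSameSplit() method; the interface names below other than that method are hypothetical), the HBase-backed implementation can satisfy it with no work:

```java
import java.io.IOException;

// Stand-in for org.apache.pig.CollectableLoadFunc's single method; the real
// interface is implemented by a LoadFunc, omitted to keep this self-contained.
interface CollectableLike {
    void ensureAllKeyInstancesInSameSplit() throws IOException;
}

// HBase row keys are unique within a table, so every "instance" of a key
// already lives in a single region/split: the contract holds trivially.
class HBaseCollectableSketch implements CollectableLike {
    @Override
    public void ensureAllKeyInstancesInSameSplit() {
        // no-op: uniqueness of HBase row keys satisfies the guarantee
    }
}

public class CollectableDemo {
    public static void main(String[] args) {
        new HBaseCollectableSketch().ensureAllKeyInstancesInSameSplit();
        System.out.println("ok");
    }
}
```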
        Daniel Dai added a comment -

        Sounds good. We can get CollectableLoadFunc part in first.

        Daniel Dai added a comment -

        Does it work if you put "--caster HBaseBinaryConverter" in HBaseStorage option? By default, HBaseStorage uses Utf8StorageConverter, which is Pig specific.

        Brian Johnson added a comment -

        here is the patch against branch-0.13 for merge join and collected group

        Daniel Dai added a comment -

        LGTM; attaching a patch resynced with trunk.

        How about IndexableLoadFunc? Do you still want to do it?

        Brian Johnson added a comment -

        Yes, but I think there are some problems with the current implementation. Probably best to open another ticket for outer merge join so this one can be closed out with the patch.

        Brian Johnson added a comment -

        https://issues.apache.org/jira/browse/PIG-4254
        Daniel Dai added a comment -

        PIG-2495-Collectable.patch committed to both trunk and 0.14 branch. Created PIG-4255 for IndexableLoadFunc change. Thanks Brian!

        Brian Johnson added a comment -

        I think we doubled up on the new ticket. I'll let you decide which one to keep so we don't end up deleting both of them.

        Daniel Dai added a comment -

        I removed PIG-4255.


          People

          • Assignee: Brian Johnson
          • Reporter: Kevin Lion
          • Votes: 0
          • Watchers: 7
