Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None

      Attachments
    1. CRUNCH-619_v3.patch
      64 kB
      Gergő Pásztor
    2. CRUNCH-619_v4_hbase1.patch
      54 kB
      Gergő Pásztor
    3. CRUNCH-619_v4_hbase2.patch
      64 kB
      Gergő Pásztor
    4. CRUNCH-619_v5_hbase2.patch
      67 kB
      Gergő Pásztor
    5. CRUNCH-619_v6_hbase2.patch
      69 kB
      Gergő Pásztor
    6. CRUNCH-619_v7_hbase2.patch
      69 kB
      Gergő Pásztor
    7. CRUNCH-619.patch
      61 kB
      Tom White
    8. CRUNCH-619-2.patch
      63 kB
      Attila Sasvari
    9. CRUNCH-619-v8_hbase2.patch
      71 kB
      Attila Sasvari

      Issue Links

        Activity

        tomwhite Tom White added a comment -

        HBase 2 is not out yet, but there's a SNAPSHOT version to try out. This patch includes the changes to run against it. The APIs have changed quite substantially, particularly around HFile (which Crunch accesses directly), so unfortunately it doesn't seem to be very feasible to support both HBase 1 and 2.

        jmhsieh Jonathan Hsieh added a comment -

        Hey Tom White, I took a quick look at the patch and though I haven't tried it, most of the changes should be able to run against HBase 1.0+ and HBase 2.x when it comes out. The preferred HBase 1.x API changed from the 0.98/0.96 APIs previously used in Crunch, but those older APIs are still present in HBase 1.x. The HBase 2.x line will remove them, which forces all components to move to the 1.x API.

        The caveat is the HFile readers and writers, which, as you mention, aren't part of the public HBase API [1]. So for the HFile writers, I wonder if it would be possible to wrap or extend the existing public HBase HFileOutputFormat2 [2] so that you don't have to get into the internals.

        A few notes: KeyValue is no longer public and may go away in the future (there are equivalent methods in CellUtil).
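
        (For illustration, a rough sketch of the kind of CellUtil-based code this points to — the class, method, and variable names below are invented for the example and are not taken from the patch:)

        import org.apache.hadoop.hbase.Cell;
        import org.apache.hadoop.hbase.CellUtil;
        import org.apache.hadoop.hbase.KeyValue;
        import org.apache.hadoop.hbase.util.Bytes;

        public class CellMigrationSketch {
          // Old style: constructing and inspecting KeyValue directly.
          static byte[] rowOfKeyValue(KeyValue kv) {
            return kv.getRow();   // deprecated accessor on an internal class
          }

          // New style: code only against the Cell interface and use CellUtil helpers.
          static byte[] rowOfCell(Cell cell) {
            return CellUtil.cloneRow(cell);
          }

          static boolean sameRow(Cell a, Cell b) {
            return CellUtil.matchingRow(a, b);
          }

          public static void main(String[] args) {
            Cell cell = CellUtil.createCell(
                Bytes.toBytes("row1"), Bytes.toBytes("cf"), Bytes.toBytes("q"));
            System.out.println(Bytes.toString(rowOfCell(cell)));
          }
        }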

        Do you all use review board? I could comment/code review more easily there.

        [1] http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/HFile.html
        [2] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat2.html

        tomwhite Tom White added a comment -

        Thanks for taking a look, Jonathan Hsieh.

        There seem to be some APIs that don't exist in both HBase 1 and 2, e.g. CellUtil#createFirstOnRow, and CellComparator#COMPARATOR. Are these going to be backported to HBase 1 to make the transition smoother?

        There's a comment in HFileOutputFormatForCrunch that explains why the HBase equivalent is not used. I guess that still applies.

        HBase's official HFileOutputFormat is not used, because it shuffles on row-key only and
        does in-memory sort at reducer side (so the size of output HFile is limited to reducer's memory).
        As crunch supports more complex and flexible MapReduce pipeline, we would prefer thin and pure
        OutputFormat here.

        No reviewboard for Crunch, I'm afraid.

        asasvari Attila Sasvari added a comment -

        I applied the patch and some Spark integration tests failed.

        Tests in error: 
          SparkHFileTargetIT.setUpClass:129 ? RetriesExhausted Failed after attempts=36,...
          SparkWordCountHBaseIT.setUp:110 ? RetriesExhausted Failed after attempts=36, e...
          SparkWordCountHBaseIT.setUp:110 ? RetriesExhausted Failed after attempts=36, e...
        

        I checked that org.apache.hadoop.hbase.ipc.CallTimeoutException was thrown during the execution of SparkHFileTargetIT:

        org.apache.crunch.SparkHFileTargetIT  Time elapsed: 67.833 sec  <<< ERROR!
        org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
        Thu Feb 02 16:55:00 CET 2017, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=60136: Call to /192.168.1.102:64404 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=0, waitTime=60002, rpcTimetout=59999 row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=192.168.1.102,64404,1486050837780, seqNum=0
        
        	at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:255)
        	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:229)
        	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
        	at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:177)
        	at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:314)
        	at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:290)
        	at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:169)
        	at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:162)
        	at org.apache.hadoop.hbase.client.ClientSimpleScanner.<init>(ClientSimpleScanner.java:39)
        	at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:378)
        	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:1105)
        	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1057)
        	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:929)
        	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:911)
        	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:898)
        	at org.apache.crunch.SparkHFileTargetIT.setUpClass(SparkHFileTargetIT.java:129)
        Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=60136: Call to /192.168.1.102:64404 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=0, waitTime=60002, rpcTimetout=59999 row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=192.168.1.102,64404,1486050837780, seqNum=0
        	at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:144)
        	at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
        	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        	at java.lang.Thread.run(Thread.java:745)
        Caused by: java.io.IOException: Call to /192.168.1.102:64404 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=0, waitTime=60002, rpcTimetout=59999
        	at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:172)
        	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
        	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
        	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:407)
        	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:403)
        	at org.apache.hadoop.hbase.ipc.Call.setTimeout(Call.java:96)
        	at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:195)
        	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
        	at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
        	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
        	at java.lang.Thread.run(Thread.java:745)
        Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=0, waitTime=60002, rpcTimetout=59999
        	at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:196)
        	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
        	at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
        	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
        	at java.lang.Thread.run(Thread.java:745)
        

        The HBase mini cluster cannot be contacted for some reason.

        I also noticed the following:

        44274 [VolumeScannerThread(/root/crunch/crunch-hbase/target/test-data/a3979225-61d0-46fb-9b7a-227cf12cb8c5/dfscluster_5bb4ef9f-6747-48d2-9f0a-389634b8446d/dfs/data/data2)] ERROR org.apache.hadoop.hdfs.server.datanode.VolumeScanner  - VolumeScanner(/root/crunch/crunch-hbase/target/test-data/a3979225-61d0-46fb-9b7a-227cf12cb8c5/dfscluster_5bb4ef9f-6747-48d2-9f0a-389634b8446d/dfs/data/data2, DS-fd97dce5-3b9a-43e8-b02f-73d0789ccb54) exiting because of exception 
        java.lang.NoSuchMethodError: org.codehaus.jackson.map.ObjectMapper.writerWithDefaultPrettyPrinter()Lorg/codehaus/jackson/map/ObjectWriter;
        	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.save(FsVolumeImpl.java:676)
        	at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.saveBlockIterator(VolumeScanner.java:314)
        	at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:535)
        	at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:619)
        

        It is related to the hadoop update in root pom.xml (bumped to 2.7.1).

        To load the proper classes, I added the following dependencies to the crunch-spark pom.xml:

        <dependency>
          <groupId>com.fasterxml.jackson.core</groupId>
          <artifactId>jackson-annotations</artifactId>
          <version>2.4.4</version>
          <type>jar</type>
        </dependency>
        <dependency>
          <groupId>org.codehaus.jackson</groupId>
          <artifactId>jackson-mapper-asl</artifactId>
          <version>1.9.13</version>
        </dependency>
        <dependency>
          <groupId>org.codehaus.jackson</groupId>
          <artifactId>jackson-core-lgpl</artifactId>
          <version>1.9.13</version>
        </dependency>
        
        asasvari Attila Sasvari added a comment -

        HBase 2 uses Netty 4.1.1.Final, and this was not accounted for in crunch-spark.

        In fact, the dependency org.apache.spark:spark-core_2.11:jar:2.0.0 pulled in io.netty:netty-all:jar:4.0.29.Final:provided. That caused the runtime org.apache.hadoop.hbase.ipc.CallTimeoutException.

        So I had to add the following dependency to the crunch-spark pom.xml:

        <dependency>
              <groupId>io.netty</groupId>
              <artifactId>netty-all</artifactId>
              <version>4.1.1.Final</version>
              <type>jar</type>
        </dependency>
        

        After this, all integration tests passed.

        I will upload a new patch soon.

        pairg Gergő Pásztor added a comment -

        Updated for the current master.

        pairg Gergő Pásztor added a comment - edited

        Also tested on CRUNCH-618 (Spark 2) and it's working.

        tomwhite Tom White added a comment -

        I ran this too, and all tests pass. The HBase dependency is still a snapshot though, so I'm not sure we want to commit this yet.

        jmhsieh Jonathan Hsieh added a comment -

        A suggestion: This patch could be broken up into two pieces – the pre-1.0 API to 1.x API related changes, and the transitive dependency related pom fixes.

        The API-related changes should be committable today, and we could hold off on the dependencies until an HBase 2 alpha comes out.

        pairg Gergő Pásztor added a comment -

        Jonathan Hsieh But in this case the code will be broken, because this will not work with HBase 1.x. Or am I missing something?

        My suggestion is that I can create a separate ticket and patch that updates the HBase version from the snapshot to the normal version (alpha or whatever). We can mark this new ticket as a blocker for version 1.0.0, so we will not forget to update it.

        jmhsieh Jonathan Hsieh added a comment -

        What I'm suggesting is that most (if not all) of the code changes should actually work against HBase 1.x, and that the current code is using an API that is deprecated in HBase 1.0. The master branch of crunch is compiling against HBase 1.0 today. [1]

        The HBase 2.0 portions force the issue because the deprecated APIs are removed, which forces you to use the HBase 1.0 APIs.

        I'm confused by your last sentence, so let me reword it, explicitly calling out projects and versions to make sure we are talking about the same thing. Here's what I think we are suggesting:

        1) Create a new crunch ticket to move to the HBase 1.0.0 API. This should be testable and committable into crunch. This new ticket could be a blocker for the next crunch release (crunch 1.0.0?).
        2) Keep this ticket open and have it deal with the pom changes/transitive dependency fixes required for working against hbase 2.0.0-alphaX. It will remain open until an HBase 2.0.0-alpha is available for crunch to build against and the changes to crunch are made. Whether it is a blocker is up to you guys.

        [1] https://github.com/apache/crunch/blob/master/pom.xml#L104
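
        (As a rough illustration of the pre-1.0 vs. 1.x client API difference being described here — these are standard HBase client calls, with the table and row names invented for the example:)

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.Connection;
        import org.apache.hadoop.hbase.client.ConnectionFactory;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.Table;
        import org.apache.hadoop.hbase.util.Bytes;

        public class ClientApiMigrationSketch {
          public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // Pre-1.0 style (deprecated in HBase 1.0, removed in 2.0):
            //   HTable table = new HTable(conf, "my_table");

            // 1.0+ style: obtain a Table from a shared Connection.
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("my_table"))) {
              Result result = table.get(new Get(Bytes.toBytes("row1")));
              System.out.println(result.isEmpty());
            }
          }
        }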

        pairg Gergő Pásztor added a comment -

        Jonathan Hsieh Thanks for the clarification! If nobody has another suggestion, I will do what Jonathan suggested.

        pairg Gergő Pásztor added a comment -

        I checked this and the current code is not compatible with HBase 1.0, so the separation is not just about poms. I'm trying to create two different versions of this: one for HBase 1.0 and one for HBase 2.0. I want to try out this separation to see what the differences will be. Based on the current status, it looks like we use some things that are not in HBase 1.0 at all.

        pairg Gergő Pásztor added a comment -

        I attached two patches. One is for HBase 1.0.0 without pom modifications; the other is the previous HBase 2 patch updated for the current master. I don't recommend the first one, because it differs in many places from the HBase 2 version. I would rather wait for the HBase 2 release and just push the second one to master.

        pairg Gergő Pásztor added a comment - edited

        Also, Gabriel Reid, please check "CRUNCH-619_v4_hbase2.patch"! Your last modification (CRUNCH-644) is slightly affected here, so please check that I didn't break anything around it!

        pairg Gergő Pásztor added a comment -

        Updated for the current hbase2 version.

        gabriel.reid Gabriel Reid added a comment -

        Gergő Pásztor I'm really sorry for taking forever to get back to you on this.

        I looked at the changes that the hbase2_v5 patch introduces around the node affinity stuff added in CRUNCH-644. It looks like it will still work just fine, but I think the API for using it becomes pretty awkward because you have to supply both a Table and a RegionLocator to the HFileUtils.writeXXXToHFilesForIncrementalLoad methods.

        Seeing as Table and RegionLocator are both bound to a single table, and they are both accessible via Connection (and this change seems to break source compatibility anyhow), I would suggest changing those methods to take a Connection and TableName, and then internally retrieve the Table and RegionLocator from the Connection.

        If we want to keep source compatibility, we can just keep the existing version of those methods (without the RegionLocator) and change the underlying implementations to be able to work with a null RegionLocator (i.e. just disable the node affinity stuff in that case).
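
        (A minimal sketch of what that suggestion could look like — the class name, method name, and signature below are illustrative, not the actual HFileUtils API:)

        import java.io.IOException;

        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.Connection;
        import org.apache.hadoop.hbase.client.RegionLocator;
        import org.apache.hadoop.hbase.client.Table;

        public final class HFileUtilsSketch {
          // Caller passes the Connection plus the table name; Table and RegionLocator
          // are looked up internally instead of being extra parameters.
          public static void writeToHFilesForIncrementalLoad(
              Connection connection, TableName tableName) throws IOException {
            try (Table table = connection.getTable(tableName);
                 RegionLocator regionLocator = connection.getRegionLocator(tableName)) {
              // ... configure the HFile output using the table's descriptor and
              // regionLocator.getStartKeys() for region/node affinity ...
            }
          }
        }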

        asasvari Attila Sasvari added a comment -

        Gergő Pásztor thanks for the updated patch.

        I started to look at the hbase2_v5 patch; some comments:

        • Unit tests passed. Also ran SparkHFileTargetIT and SparkWordCountHBaseIT, and they passed.
        • In HFileTargetIT.java and SparkHFileTargetIT.java we should not use wildcard imports such as import org.apache.hadoop.hbase.*;
        pairg Gergő Pásztor added a comment -

        Patch updated with Gabriel Reid's suggestion to use Connection and TableName as parameters in HFileUtils, and the ".*" imports are fixed.

        pairg Gergő Pásztor added a comment -

        Patch v7 uploaded; it is the same as v6, but created with "git format-patch".

        asasvari Attila Sasvari added a comment -

        Gergő Pásztor Do you mind if I take this task over? HBase is working on 2.0.0-alpha3-SNAPSHOT (see pom.xml on branch-2). HBASE-18640 splits out a lot of things from hbase-server into a new module, hbase-mapreduce. So CRUNCH-619_v7_hbase2.patch will not compile. I will attach a patch soon.

        pairg Gergő Pásztor added a comment -

        Attila Sasvari Sure, go with it!


          People

          • Assignee: asasvari Attila Sasvari
          • Reporter: tomwhite Tom White
          • Votes: 0
          • Watchers: 5

            Dates

            • Created:
              Updated:

              Development