Cassandra / CASSANDRA-1927

Hadoop Integration doesn't work when one node is down

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 0.7.1
    • Component/s: None
    • Labels: None

      Description

      Using the same directives as in the sample code:

      When I start the CFInputFormat to read a CF in a keyspace with RF=3 on a 4-node cluster:

      • If all the nodes are up, everything works fine and I have no problems walking through all the data in the CF.
      • If one node is down, the Hadoop job does not even start; it just dies without any errors or exceptions.

      I'm sorry I can't post any errors or exceptions, but it's really easy to reproduce: start up a cluster, take one node down, and you're there.

        Activity

        michaelsembwever mck added a comment -

        Client side (hadoop job):

        java.io.IOException: Could not get input splits
        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSplits(ColumnFamilyInputFormat.java:127)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
        at no.finntech.countstats.reduce.FakeAdCounterTableReduce.run(FakeAdCounterTableReduce.java:421)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at no.finntech.countstats.reduce.FakeAdCounterTableReduce.main(FakeAdCounterTableReduce.java:75)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
        Caused by: java.util.concurrent.ExecutionException: java.io.IOException: unable to connect to server
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSplits(ColumnFamilyInputFormat.java:123)
        ... 13 more
        Caused by: java.io.IOException: unable to connect to server
        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.createConnection(ColumnFamilyInputFormat.java:212)
        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSubSplits(ColumnFamilyInputFormat.java:187)
        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.access$200(ColumnFamilyInputFormat.java:74)
        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:160)
        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:145)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
        Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
        at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
        at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
        at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.createConnection(ColumnFamilyInputFormat.java:208)
        ... 9 more
        Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:525)
        at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
        ... 11 more

        michaelsembwever mck added a comment -

        There's a TODO comment in ColumnFamilyInputFormat:
        // TODO handle failure of range replicas & retry

        Line 198 only tries the first endpoint. A loop that catches the TException and tries the next endpoint is needed.
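
        The retry loop described above could look roughly like the sketch below. This is not the committed patch: the `Connector` interface, the `connectToAny` helper, and the addresses are hypothetical stand-ins for the Thrift connection code in ColumnFamilyInputFormat, and the final error message is simply the one quoted later in this thread.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class EndpointFailover {
    /** Hypothetical stand-in for opening a Thrift connection to one replica. */
    interface Connector<T> {
        T connect(String endpoint) throws IOException;
    }

    /**
     * Tries each replica endpoint in order and returns the first successful
     * connection. Fails only when every endpoint has been tried and refused.
     */
    static <T> T connectToAny(List<String> endpoints, Connector<T> connector) throws IOException {
        IOException last = null;
        for (String endpoint : endpoints) {
            try {
                return connector.connect(endpoint);
            } catch (IOException e) {
                // Remember the failure and fall through to the next replica.
                last = e;
            }
        }
        throw new IOException("failed connecting to all endpoints " + endpoints, last);
    }

    public static void main(String[] args) throws IOException {
        // Assumed example addresses; the first replica simulates the node that is down.
        List<String> replicas = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3");
        String result = connectToAny(replicas, endpoint -> {
            if (endpoint.equals("10.0.0.1")) {
                throw new IOException("Connection refused");
            }
            return "connected to " + endpoint;
        });
        System.out.println(result);
    }
}
```

        With RF=3 each range has three replicas, so a single node being down leaves two endpoints for the loop to fall back on.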

        michaelsembwever mck added a comment - - edited

        Utku: are you able to test this patch?

        (It didn't work for me because RF was never really set to 3. Using cassandra-cli, "describe keyspace xxx" reported "Replication Factor: 1".) :-$

        uctopcu Utku Can Topcu added a comment -

        Mck: Right now I can't access our compilation server. However, I can replace the running binaries and test them if I have the patched rc4. Can you somehow provide me with the compiled package?

        michaelsembwever mck added a comment -

        Sent DM. If it doesn't work you should at minimum see the job's IOException stacktrace change from "unable to connect to server" to "failed connecting to all endpoints".

        uctopcu Utku Can Topcu added a comment -

        I'll be testing it in a few hours and will write down the results. Something urgent came up.

        michaelsembwever mck added a comment -

        After fixing my local RF problem this patch works for me.

        jbellis Jonathan Ellis added a comment -

        It looks like this patch includes the code from CASSANDRA-1921, which is causing conflicts b/c it's already applied on 0.7 and trunk. Can you create a patch for 1927 only?

        michaelsembwever mck added a comment -

        Putting Stu as reviewer since he was for CASSANDRA-342 (which the TODO comment in question was added under).

        michaelsembwever mck added a comment -

        Yeah, the patch had a lot of crap in it. Sorry. Will re-apply.

        michaelsembwever mck added a comment -

        correct patch & license grant

        michaelsembwever mck added a comment -

        third time lucky. removed unnecessary import.

        uctopcu Utku Can Topcu added a comment -

        I've tested against the rc4+patch and it works.

        jbellis Jonathan Ellis added a comment -

        committed, thanks!

        hudson Hudson added a comment -

        Integrated in Cassandra-0.7 #142 (See https://hudson.apache.org/hudson/job/Cassandra-0.7/142/)
        retry hadoop split requests on connection failure
        patch by mck; reviewed by jbellis for CASSANDRA-1927


          People

          • Assignee: michaelsembwever mck
          • Reporter: uctopcu Utku Can Topcu
          • Reviewer: Jonathan Ellis
          • Votes: 1
          • Watchers: 3
