Hadoop Common: HADOOP-1013

ArithmeticException when number of racks is zero

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.0
    • Fix Version/s: 0.11.2
    • Component/s: None
    • Labels: None
    • Environment:

      EC2

      Description

      It seems that if something is wrong in the configuration such that the number of racks is zero, the symptom is a divide-by-zero exception.

      [root@domU-12-31-34-00-03-2F ~]# cd /usr/local/hadoop-0.11.0/
      [root@domU-12-31-34-00-03-2F hadoop-0.11.0]# bin/hadoop jar hadoop-0.11.0-examples.jar pi 10 10000000
      Number of Maps = 10 Samples per Map = 10000000
      org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.ArithmeticException: / by zero
      at org.apache.hadoop.dfs.FSNamesystem$Replicator.chooseTarget(FSNamesystem.java:2593)
      at org.apache.hadoop.dfs.FSNamesystem$Replicator.chooseTarget(FSNamesystem.java:2555)
      at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:684)
      at org.apache.hadoop.dfs.NameNode.create(NameNode.java:248)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:585)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:337)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:538)

      at org.apache.hadoop.ipc.Client.call(Client.java:467)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:164)
      at org.apache.hadoop.dfs.$Proxy0.create(Unknown Source)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateNewBlock(DFSClient.java:1091)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:1031)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1255)
      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1345)
      at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
      at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
      at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
      at org.apache.hadoop.fs.FSDataOutputStream$Summer.close(FSDataOutputStream.java:98)
      at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
      at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
      at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
      at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:724)
      at org.apache.hadoop.examples.PiEstimator.launch(PiEstimator.java:185)
      at org.apache.hadoop.examples.PiEstimator.main(PiEstimator.java:226)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:585)
      at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
      at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:143)
      at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:40)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:585)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
      [root@domU-12-31-34-00-03-2F hadoop-0.11.0]#

          Activity

          James P. White added a comment -

          The first thing in the log indicates trouble ahead:

          [root@domU-12-31-34-00-00-D9 hadoop-0.11.0]# more /mnt/hadoop/logs/hadoop-root-namenode-domU-12-31-34-00-00-D9.log
          2007-02-12 23:22:53,292 INFO org.apache.hadoop.dfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes
          2007-02-12 23:22:53,550 INFO org.mortbay.util.Credential: Checking Resource aliases
          2007-02-12 23:22:53,659 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
          2007-02-12 23:22:54,475 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@17ee8b8
          2007-02-12 23:22:54,573 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
          2007-02-12 23:22:54,573 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
          2007-02-12 23:22:54,573 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
          2007-02-12 23:22:54,576 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50070
          2007-02-12 23:22:54,576 INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@10385c1
          2007-02-12 23:22:54,617 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50001: starting
          2007-02-12 23:22:54,618 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50001: starting
          2007-02-12 23:22:54,618 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50001: starting
          2007-02-12 23:22:54,618 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50001: starting
          2007-02-12 23:22:54,618 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 50001: starting
          2007-02-12 23:22:54,619 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 50001: starting
          2007-02-12 23:22:54,619 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 50001: starting
          2007-02-12 23:22:54,619 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 50001: starting
          2007-02-12 23:22:54,619 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 50001: starting
          2007-02-12 23:22:54,619 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 50001: starting
          2007-02-12 23:22:54,619 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 50001: starting
          2007-02-12 23:22:57,441 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /mnt/hadoop/mapred/.system.crc because it does not exist
          2007-02-12 23:22:57,453 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /mnt/hadoop/mapred/system because it does not exist
          2007-02-12 23:25:02,841 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /user/root/.test-mini-mr.crc because it does not exist
          2007-02-12 23:25:02,842 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /user/root/test-mini-mr because it does not exist
          2007-02-12 23:25:03,056 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50001 call error: java.io.IOException: java.lang.ArithmeticException: / by zero
          java.io.IOException: java.lang.ArithmeticException: / by zero
          ...
          java.io.IOException: java.lang.ArithmeticException: / by zero
          ...

          dhruba borthakur added a comment -

          In the chooseTarget method, we have the statement:

          int maxNodesPerRack =
              (totalNumOfReplicas-1)/clusterMap.getNumOfRacks()+2;

          If the network topology has 0 racks, then clusterMap.getNumOfRacks() will return 0, which generates the divide-by-zero exception.
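          A minimal guard for this computation might look like the following. This is an illustrative sketch, not the actual HADOOP-972 patch; the class and method names here are invented, and only the arithmetic mirrors the statement quoted above.

          ```java
          // Illustrative sketch (not Hadoop source): guard the maxNodesPerRack
          // computation against a zero-rack topology.
          class ReplicationMath {
              // Hypothetical helper; the parameter names follow the snippet
              // quoted in the comment above.
              static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
                  if (numOfRacks == 0) {
                      // No racks registered (e.g. no datanodes have joined yet):
                      // fail with a descriptive message instead of "/ by zero".
                      throw new IllegalStateException(
                          "Network topology has 0 racks; no datanodes registered?");
                  }
                  return (totalNumOfReplicas - 1) / numOfRacks + 2;
              }
          }
          ```

          With such a check, a misconfigured cluster would surface a diagnostic about the empty topology rather than an opaque ArithmeticException from deep inside chooseTarget.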

          Hairong Kuang added a comment -

          Hi James, did your file system have no datanodes when the exception was thrown? I just want to figure out why the number of racks ended up being zero.

          James P. White added a comment -

          I found that same code in trying to find out what was wrong (I originally noted this failure in HADOOP-692).

          I was trying out the EC2 AMI that Tom is working on (HADOOP-952), but I don't know enough about Hadoop to debug very effectively (which is why it would be good to report this error early, with a helpful diagnostic).

          If there are specific commands you would like me to run or files to inspect, then I can do that. Otherwise I'm putting aside trying to run Hadoop on EC2 until I've got more experience with it running locally.

          Tom White added a comment -

          I've just run the EC2 AMI again and I get different output and no ArithmeticException (it's actually an updated one using Hadoop 0.11.1, but I don't believe this accounts for the difference since I didn't get the exception when using 0.11.0):

          2007-02-13 15:11:08,026 INFO org.apache.hadoop.dfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes
          2007-02-13 15:11:08,301 INFO org.mortbay.util.Credential: Checking Resource aliases
          2007-02-13 15:11:08,420 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
          2007-02-13 15:11:09,305 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@17ee8b8
          2007-02-13 15:11:09,407 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
          2007-02-13 15:11:09,418 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
          2007-02-13 15:11:09,418 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
          2007-02-13 15:11:09,422 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50070
          2007-02-13 15:11:09,422 INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@10385c1
          2007-02-13 15:11:09,446 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50001: starting
          2007-02-13 15:11:09,448 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50001: starting
          2007-02-13 15:11:09,448 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50001: starting
          2007-02-13 15:11:09,461 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 50001: starting
          2007-02-13 15:11:09,461 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 50001: starting
          2007-02-13 15:11:09,461 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 50001: starting
          2007-02-13 15:11:09,462 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 50001: starting
          2007-02-13 15:11:09,462 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50001: starting
          2007-02-13 15:11:09,462 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 50001: starting
          2007-02-13 15:11:09,462 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 50001: starting
          2007-02-13 15:11:09,462 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 50001: starting
          2007-02-13 15:11:15,251 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from domU-12-31-34-00-02-20.usma2.compute.amazonaws.com:50010 storage
          2007-02-13 15:11:15,253 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/domU-12-31-34-00-02-20.usma2.compute.amazonaws.com:50010
          2007-02-13 15:11:18,491 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /mnt/hadoop/mapred/.system.crc because it does not exist
          2007-02-13 15:11:18,492 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /mnt/hadoop/mapred/system because it does not exist
          2007-02-13 15:16:16,841 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log
          2007-02-13 15:16:17,501 INFO org.apache.hadoop.fs.FSNamesystem: Roll FSImage

          Notice that a datanode registers (at 15:11:15,251), unlike on Jim's setup. I think there is a bug here (the namenode shouldn't throw an exception when there are no registered datanodes), and it is being exposed by a problem in the EC2 scripts (whatever is preventing the datanode from registering for Jim). We should take the latter bug back to HADOOP-952. Jim, are you able to dig out the logs on the datanode to see why it didn't connect?
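          Tom's point, that the namenode should report an empty cluster rather than let the ArithmeticException escape, could be sketched as an up-front check before replica placement. The interface and class below are invented stand-ins for illustration; the real topology object and the eventual HADOOP-972 fix may be structured quite differently.

          ```java
          import java.io.IOException;

          // Invented stand-in for the topology object; not the real
          // org.apache.hadoop.net.NetworkTopology API.
          interface ClusterView {
              int getNumOfRacks();
              int getNumOfDatanodes();
          }

          // Hypothetical early check: reject replica placement on an empty
          // cluster with an informative IOException rather than dividing
          // by the rack count.
          class ReplicaTargetChooser {
              static void checkCluster(ClusterView cluster) throws IOException {
                  if (cluster.getNumOfDatanodes() == 0 || cluster.getNumOfRacks() == 0) {
                      throw new IOException(
                          "Cannot choose replica targets: "
                          + cluster.getNumOfDatanodes() + " datanodes in "
                          + cluster.getNumOfRacks() + " racks are registered");
                  }
              }
          }
          ```

          A check like this would turn the client-side failure into a clear "no datanodes registered" message, which is the early diagnostic James asked for.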

          Hairong Kuang added a comment -

          Tom, thanks for the clarification. I agree that replica placement should not throw an exception when there are no registered datanodes. If the zero rack count is caused by zero registered datanodes, it will be a simple fix.

          Hairong Kuang added a comment -

          Fixed as part of HADOOP-972.


            People

            • Assignee: Hairong Kuang
            • Reporter: James P. White
            • Votes: 0
            • Watchers: 0
