Hadoop Common · HADOOP-1138

Datanodes that are dead for a long long time should not show up in the UI

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Labels: None

      Description

      Proposal 1:
      If an include file is used, show all nodes (dead and alive) that are listed in the include file. If there is no include file, display only nodes that have pinged this instance of the namenode.

      Proposal 2:
      A config variable specifies a time duration. The namenode, on a restart, purges all datanodes that have not pinged within that duration. The default value of this config variable can be one week.
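Proposal 2 could be sketched roughly as follows. This is only an illustration of the purge-on-restart idea; `DeadNodePurger`, `lastPingMillis`, and the one-week constant are invented names for the example, not actual Hadoop code.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of Proposal 2: on a namenode restart, drop datanodes
// whose last heartbeat is older than a configured window (default one week).
public class DeadNodePurger {
    static final long DEFAULT_WINDOW_MS = 7L * 24 * 60 * 60 * 1000; // one week

    // storage id -> time of last heartbeat, in epoch millis
    final Map<String, Long> lastPingMillis = new ConcurrentHashMap<>();

    /** Remove entries that have not pinged within the window; return count purged. */
    int purge(long nowMillis, long windowMillis) {
        int purged = 0;
        for (Iterator<Map.Entry<String, Long>> it =
                 lastPingMillis.entrySet().iterator(); it.hasNext(); ) {
            if (nowMillis - it.next().getValue() > windowMillis) {
                it.remove();
                purged++;
            }
        }
        return purged;
    }
}
```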

      1. HADOOP-1138.patch
        5 kB
        Raghu Angadi
      2. HADOOP-1138.patch
        5 kB
        Raghu Angadi
      3. HADOOP-1138.patch
        5 kB
        Raghu Angadi

        Activity

        dhruba borthakur created issue -
        dhruba borthakur made changes -
        Field Original Value New Value
        Fix Version/s 0.15.0 [ 12312565 ]
        Assignee dhruba borthakur [ dhruba ] Raghu Angadi [ rangadi ]
        dhruba borthakur added a comment -

        It is better to remove DatanodeDescriptors from the fsimage and store only that much information that is required to generate a new storageid for new datanodes.

        Raghu Angadi added a comment -

        Currently the NameNode does not need storage ids across restarts (we think; let me know otherwise). Based on this, the proposal is not to store any datanode information in the fsimage. When the namenode restarts, it just generates a new storage id for each datanode that registers. The rest of the storageid/datanode behavior remains the same as now.

        For webUI, proposals in the description are fine.
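The idea in the comment above could be sketched like this, assuming storage ids really do not need to survive a restart. `StorageIdMinter` and the `DS-` id format are invented for the example; they are not Hadoop's actual naming scheme.

```java
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: the namenode mints a fresh storage id for each
// datanode that registers, instead of persisting ids in the fsimage.
public class StorageIdMinter {
    private final SecureRandom rand = new SecureRandom();
    // datanode address -> freshly generated storage id
    final Map<String, String> registered = new ConcurrentHashMap<>();

    /** Generate and record a new storage id for a registering datanode. */
    String register(String datanodeAddr) {
        // mask to a non-negative long so the id is readable
        String id = "DS-" + (rand.nextLong() & Long.MAX_VALUE);
        registered.put(datanodeAddr, id);
        return id;
    }
}
```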

        Raghu Angadi added a comment -

        There are two related issues here:

        1. Datanodes displayed in the webUI and "dfs -report"
        2. Policy regarding storing all the known storage ids and datanodes persistently in the Namenode.

        I will file a different jira for the second one.

        Raghu Angadi added a comment -

        A new config, dfs.report.datanode.timeout.hours, is added.

        A dead datanode is listed on the webUI or in 'dfsadmin -report' if

        • the node is present in the "dfs.hosts" file, or
        • it is not listed in "dfs.hosts.exclude" and it has been inactive for less than "dfs.report.datanode.timeout.hours".
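The listing rule above can be sketched as a simple predicate; `shouldReport` and the in-memory sets are illustrative stand-ins, not the actual FSNamesystem code.

```java
import java.util.Set;

// Sketch of the dead-node reporting rule described in the comment above.
// The real logic lives in FSNamesystem.getDatanodeListForReport.
public class DeadNodeReport {
    /** A dead node is reported if it appears in the include file, or if it
     *  is not excluded and has been inactive for less than the timeout. */
    static boolean shouldReport(String host,
                                Set<String> includeFile,  // dfs.hosts
                                Set<String> excludeFile,  // dfs.hosts.exclude
                                long inactiveHours,
                                long timeoutHours) {
        if (includeFile.contains(host)) {
            return true;
        }
        return !excludeFile.contains(host) && inactiveHours < timeoutHours;
    }
}
```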
        Raghu Angadi made changes -
        Attachment HADOOP-1138.patch [ 12364814 ]
        dhruba borthakur added a comment -

        The code looks good. A few comments:

        1. FSNamesystem.getDatanodeListForReport excludes nodes that are listed in dfs.hosts.exclude. Maybe a better option would be to show them with a status of "Excluded". Currently, it shows "Decommissioned" or "In Service".

        2. The comment in FSNamesystem.getDatanodeListForReport talks about "dfs.report.datanode.timeout.day" but it should be "dfs.report.datanode.timeout.hours".

        3. Maybe a unit test case that tests this functionality would be really nice.

        Raghu Angadi added a comment -

        New patch attached.

        1. FSNamesystem.getDatanodeListForReport excludes nodes that are listed in dfs.hosts.exclude. Maybe a better option would be to show them with a status of "Excluded". Currently, it shows "Decommissioned" or "In Service".

        Currently there is no state shown for dead nodes. Note that this method looks at dfs.hosts.exclude only for datanodes that are considered dead.

        2. The comment in FSNamesystem.getDatanodeListForReport talks about "dfs.report.datanode.timeout.day" but it should be "dfs.report.datanode.timeout.hours".

        Done. Good catch. You actually read the comments!

        3. Maybe a unit test case that tests this functionality would be really nice.

        This is fairly inconsequential functionality; it only affects the Namenode front page and 'dfsadmin -report'. Let me know if we really need to add a unit test. I did test it.

        Raghu Angadi made changes -
        Attachment HADOOP-1138.patch [ 12365101 ]
        dhruba borthakur added a comment -

        +1.

        Raghu Angadi added a comment -

        Thanks Dhruba.

        Raghu Angadi made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1, build or testing failed

        2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12365101/HADOOP-1138.patch against trunk revision r572826.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/682/testReport/
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/682/console

        Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

        Raghu Angadi made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Raghu Angadi added a comment -

        The earlier patch did not show the nodes that were listed in the exclude file (dfs.hosts.exclude). But this broke TestDecommission. The current patch treats these nodes just like normal dead nodes.

        Raghu Angadi made changes -
        Attachment HADOOP-1138.patch [ 12365214 ]
        Raghu Angadi made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Raghu Angadi added a comment -

        Also, this will conflict with HADOOP-1838 if it is committed earlier.

        Hadoop QA added a comment -

        +1 http://issues.apache.org/jira/secure/attachment/12365214/HADOOP-1138.patch applied and successfully tested against trunk revision r573081.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/693/testReport/
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/693/console
        Raghu Angadi added a comment - edited

        This is not considered necessary after HADOOP-1762.

        Raghu Angadi made changes -
        Resolution Won't Fix [ 2 ]
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        dhruba borthakur added a comment -

        I think that if the includes list or the excludes list is present, these nodes should always show up in the UI. This allows the administrator to accurately know the state of the cluster. (However, we do not need to automatically remove datanodes from the UI if they are dead for a certain period.)

        dhruba borthakur made changes -
        Resolution Won't Fix [ 2 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Raghu Angadi added a comment -

        There is too much stuff in this jira. I filed a new one, HADOOP-1933. Should nodes from both lists be listed, or only those from the include list? Please comment in HADOOP-1933.

        dhruba borthakur added a comment -

        HADOOP-1933 has follow-up discussions.

        dhruba borthakur made changes -
        Resolution Won't Fix [ 2 ]
        Status Reopened [ 4 ] Resolved [ 5 ]
        Doug Cutting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Owen O'Malley made changes -
        Component/s dfs [ 12310710 ]

          People

          • Assignee: Raghu Angadi
          • Reporter: dhruba borthakur
          • Votes: 0
          • Watchers: 0