Hadoop HDFS / HDFS-5058

QJM should validate startLogSegment() more strictly


    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0, 2.1.0-beta
    • Fix Version/s: None
    • Component/s: qjm
    • Labels:


      A handful of times, we've seen one of the NNs in an HA cluster end up with an fsimage checkpoint that falls in the middle of an edit segment. We're not sure yet how this happens, but it can lead to the following problem:

      • Node has fsimage_500. Cluster has edits_1-1000, edits_1001_inprogress
      • Node restarts, loads fsimage_500
      • Node wants to become active. It calls selectInputStreams(500). Currently, this API logs a WARN that 500 falls in the middle of the 1-1000 segment, but continues and returns no results.
      • Node calls startLogSegment(501).

      Currently, the QJM incorrectly accepts this call. The node then crashes when it first tries to journal a real transaction, but it leaves edits_501_inprogress lying around, which can cause further problems later.
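
      To illustrate the kind of stricter check being asked for, here is a minimal sketch. It is not the code in the attached patch, and the names StartSegmentCheck, Segment, and checkStartLogSegment are hypothetical. It assumes the journal knows the transaction ID range of each finalized segment and rejects any new segment whose starting txid is already covered by one of them.

```java
import java.util.List;

/** Hypothetical sketch of a stricter startLogSegment() check; not the actual QJM code. */
final class StartSegmentCheck {

    /** A finalized edit-log segment covering [firstTxId, lastTxId], inclusive. */
    static final class Segment {
        final long firstTxId;
        final long lastTxId;

        Segment(long firstTxId, long lastTxId) {
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
        }

        boolean contains(long txid) {
            return firstTxId <= txid && txid <= lastTxId;
        }
    }

    /**
     * Rejects a request to start a segment at txid if that txid already
     * lies inside a finalized segment (e.g. 501 inside edits_1-1000).
     */
    static void checkStartLogSegment(long txid, List<Segment> finalized) {
        for (Segment s : finalized) {
            if (s.contains(txid)) {
                throw new IllegalStateException(
                    "Cannot start segment at txid " + txid
                    + ": already covered by finalized segment "
                    + s.firstTxId + "-" + s.lastTxId);
            }
        }
    }

    private StartSegmentCheck() {
        // Static helper only.
    }
}
```

      With a single finalized segment 1-1000, checkStartLogSegment(501, ...) throws in this sketch, while checkStartLogSegment(1001, ...) is accepted, so no stray edits_501_inprogress would be created.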

      1. hdfs-5058.txt
        6 kB
        Todd Lipcon
      2. HDFS-5098.patch
        6 kB
        Allen Wittenauer


        Todd Lipcon created issue -
        Todd Lipcon made changes -
        Field Original Value New Value
        Attachment hdfs-5058.txt [ 12595650 ]
        Todd Lipcon made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Allen Wittenauer made changes -
        Attachment HDFS-5098.patch [ 12729982 ]
        Allen Wittenauer made changes -
        Labels BB2015-05-TBR


          • Assignee:
            Todd Lipcon
          • Votes:
            0
          • Watchers:
            9


            • Created: