Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.4-alpha
    • Fix Version/s: None
    • Component/s: ha, namenode
    • Labels: None

      Description

      Due to the absence of an explicit timeout in FileJournalManager, error conditions that incur a long delay (usually until the driver times out) can make the namenode unresponsive for a long time. This directly affects the NN's failure detection latency, which is critical in HA.

        Activity

        tlipcon Todd Lipcon added a comment -

        Hey Kihwal. Are you planning on using NFS based HA? I'd highly recommend using QJM instead – it has timeout features and has been much more reliable for us in production clusters.

        cmccabe Colin P. McCabe added a comment -

        If the disk driver hangs on a synchronous write(2) or read(2), it doesn't matter what the Java software did-- the operating system thread will be blocked. This is why we recommended that people soft-mount the NFS directory when using NFS HA.

        Todd's suggestion is the best, though-- just use QJM.

        kihwal Kihwal Lee added a comment -

        We will certainly use one of the HA-enabled journal managers in the future, but many users I've talked to want NFS-based HA as a first step. Even if QJM is used for the shared edits directory, local disk or NFS may still be used for storing an extra copy of the edits (as a non-required resource). In this case, the lack of a timeout in FJM can affect HA with manual failover. Can the health checks used with ZKFC detect an I/O hang?

        cmccabe Colin P. McCabe added a comment -

        If the NameNode hangs, ZKFC will detect it.

        kihwal Kihwal Lee added a comment -

        If the NameNode hangs, ZKFC will detect it.

        I understand that ZKFC will detect the failure if the NN does not respond to RPC calls or the internal resource check fails. If all RPC handlers are waiting for a very long logSync() to finish, this may be detected as well. But if a couple of handlers are in trouble due to an I/O hang and all the others are happily serving reads, the error condition may not be detected in time. The situation will be different, of course, if the underlying journal flush can time out.

        I think adding a timeout will still be useful, since users can run a combination of an HA-JM and FJM. Ideally, the NN should be able to detect and exclude failed storages with a predictable/configurable latency, regardless of the underlying implementation.

        cmccabe Colin P. McCabe added a comment -

        The Linux kernel doesn't allow you to set a timeout on I/O operations, unless you use O_DIRECT and async I/O. If operations with the local filesystem take longer than you would like, what are you going to have the NameNode do? Kill itself? It can't even kill itself if it is hung on a write, because the process will be in "D state," otherwise known as uninterruptible sleep.

        In this scenario, the NameNode worker thread will be blocked forever, probably while holding the FSImage lock. There is nothing you can do. You can't kill the thread, and even if you could, how would you get the mutex back? There is nothing Java can do when the OS decides your thread cannot run.

        The solution to your problem is that you can easily set a timeout on NFS operations by using a soft mount plus timeo=60 (or whatever timeout you want).
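
        For reference, a soft mount along those lines might look like the following fstab entry. The filer host, export path, and mount point are placeholders; timeo is given in tenths of a second, and retrans is the number of retries before a soft mount gives up and returns an error to the application:

            # illustrative only -- adjust timeo/retrans to your failure-detection budget
            filer:/export/nn-edits  /mnt/nn-edits  nfs  soft,timeo=60,retrans=3  0  0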

        kihwal Kihwal Lee added a comment -

        The Linux kernel doesn't allow you to set a timeout on I/O operations, unless you use O_DIRECT and async I/O. If operations with the local filesystem take longer than you would like, what are you going to have the NameNode do? Kill itself? It can't even kill itself if it is hung on a write, because the process will be in "D state," otherwise known as uninterruptible sleep.

        The thread that's calling fsync() will be in D state. That does not mean all other threads are also stuck. We can have another thread do the sync, and the one doing logSync() can observe it and time out. The stuck thread can later determine whether it should terminate if it eventually returns. The failed EditLogOutputStream will be aborted, and EditLogFileOutputStream must do this in a non-blocking way. Any subsequent logSync() will avoid the troublesome edits directory.

        If the failed one is "required", the NN will try to exit, which will hang, but the failure will likely be noticed right away. This still provides shorter failure detection latency and the possibility of keeping the service up, in case the failed edits directory was not required and the NN has no other dependency on it.

        This is sort of like simplified user-space-only AIO. The main difference will be that there can only be one outstanding I/O per file descriptor, which doesn't limit our use case.
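
        A minimal sketch of that scheme, assuming the blocking flush/fsync is handed off to a dedicated worker thread while the logSync() caller waits with a deadline (class and method names below are hypothetical, not the actual HDFS code):

            import java.util.concurrent.*;

            // Sketch: run the blocking flush/sync on a dedicated thread so the
            // logSync() caller can time out instead of hanging in the kernel.
            class TimedSyncer {
                private final ExecutorService syncPool =
                    Executors.newSingleThreadExecutor(r -> {
                        Thread t = new Thread(r, "edits-sync");
                        t.setDaemon(true); // a thread stuck in D state won't block JVM exit
                        return t;
                    });

                /** Returns true if the flush finished in time; false if it is hung or failed. */
                boolean flushWithTimeout(Runnable flushAndSync, long timeoutMs) {
                    Future<?> f = syncPool.submit(flushAndSync);
                    try {
                        f.get(timeoutMs, TimeUnit.MILLISECONDS);
                        return true;
                    } catch (TimeoutException e) {
                        // The worker may be stuck in an uninterruptible write; it cannot be killed.
                        // The caller marks the stream failed so later logSync() calls skip it.
                        return false;
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // preserve interrupt status
                        return false;
                    } catch (ExecutionException e) {
                        return false; // the underlying I/O threw; treat the stream as failed
                    }
                }
            }

        As noted above, only one sync can be outstanding per stream, so once the timeout fires, subsequent logSync() calls should simply treat that edits directory as failed rather than queue more work behind the stuck thread.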

        In this scenario, the NameNode worker thread will be blocked forever, probably while holding the FSImage lock.

        In almost all the cases I have seen, it was logSync() that first hit an I/O error condition, probably due to its call frequency. You may see what you mentioned if the average fsync() interval is not much higher than the checkpoint interval. This hasn't been the case in our production environment.

        There is nothing you can do. You can't kill the thread, and even if you could, how would you get the mutex back? There is nothing Java can do when the OS decides your thread cannot run.

        You can't kill it, but you can certainly do other things. Otherwise I would call that MT implementation broken. You can certainly design things to work around this condition.

        The solution to your problem is that you can easily set a timeout on NFS operations by using a soft mount plus timeo=60 (or whatever timeout you want).

        That only works for NFS.

        I fully acknowledge that there are reasonable configurations that do not require this feature to do fast failure detection. But please understand that they may eventually be applicable to all use cases, but are not immediately so.

        cmccabe Colin P. McCabe added a comment -

        Can you be more clear about why just using QuorumJournalManager plus ZKFC doesn't solve this problem?

        You don't actually even need local storage directories any more; we only ever recommended them because QJM was new and untested.

        It's not just fsync that can block forever, but any write, any read, any fstat, really any blocking operation that touches the filesystem. I have seen ls go out to lunch forever on a corrupted filesystem. Are you going to add "check if I timed out and kill myself if so" recovery logic after every operation that touches the filesystem? Every FileInputStream or FileOutputStream or FileChannel method? Are you going to carefully monitor each new patch so that nobody adds back in a use of filechannel.size or whatever?

        kihwal Kihwal Lee added a comment -

        Can you be more clear about why just using QuorumJournalManager plus ZKFC doesn't solve this problem?

        As you said, "we only ever recommended them because QJM was new and untested": it's all the additional moving parts that come with a full-fledged HA setup, which is not specific to QJM. I originally thought about doing what you suggested, plus an extra name.dir for making more copies of the images. But there are people who don't feel comfortable about getting it all in one shot. Some are even opposed to having automatic failover in the initial setup. I can't persuade them otherwise until everything is thoroughly checked out under our typical load and scale, which we will eventually do.

        Last time I checked, which was not very long ago, FJM was not prohibited from being used for shared.edits.dir, and manual failover was not deprecated. IIRC, the CDH4 docs also state that this combination is supported. I am sure you are not implying that improving it is a bad idea. Please explain further; I seem to keep failing to understand your point.

        It's not just fsync that can block forever, but any write, any read, any fstat, really any blocking operation that touches the filesystem. I have seen ls go out to lunch forever on a corrupted filesystem. Are you going to add "check if I timed out and kill myself if so" recovery logic after every operation that touches the filesystem? Every FileInputStream or FileOutputStream or FileChannel method? Are you going to carefully monitor each new patch so that nobody adds back in a use of filechannel.size or whatever?

        You are right that it's not just fsync() that can hang. But logSync() is invoked frequently enough to be considered a reasonable place to check for an I/O hang condition. As long as the IPC handler threads are not blocked by other threads hanging through a different call path, the check will work regardless of which system call the hung threads made. So, no, you don't have to add "check if I timed out and kill myself if so" recovery logic after every operation that touches the filesystem.
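
        To illustrate the shape of that check (again, a sketch with hypothetical names, not a patch): each journal records when its current sync started, and the frequently called logSync() path marks the journal failed if a sync has been in flight longer than a configured limit, regardless of which system call the stuck thread is sitting in.

            import java.util.concurrent.atomic.AtomicLong;

            // Sketch: per-journal hang detection driven from a frequent call site
            // such as logSync(), rather than wrapping every filesystem operation.
            class JournalHealth {
                private static final long NO_SYNC = -1L;
                private final AtomicLong syncStartNanos = new AtomicLong(NO_SYNC);
                private volatile boolean failed = false;

                void syncStarted()  { syncStartNanos.set(System.nanoTime()); }
                void syncFinished() { syncStartNanos.set(NO_SYNC); }

                /** Marks the journal failed if a sync has been pending longer than timeoutNanos. */
                boolean checkHang(long timeoutNanos) {
                    long started = syncStartNanos.get();
                    if (started != NO_SYNC && System.nanoTime() - started > timeoutNanos) {
                        failed = true; // later logSync() calls skip this edits directory
                    }
                    return failed;
                }
            }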

        cmccabe Colin P. McCabe added a comment -

        I just don't think it's worth adding a lot of complexity to FJM, to handle something that's better handled by HA. Surely that's a reasonable position?

        It's also kind of weird to hear the argument "HA is too new to deploy, therefore we need to develop a bunch of new code to work around not having it." Wha?

        But I've talked enough, I'll let other people chime in with their opinions.

        cmccabe Colin P. McCabe added a comment -

        Last time I checked, which was not very long ago, FJM was not prohibited from being used for shared.edits.dir, and manual failover was not deprecated. IIRC, the CDH4 docs also state that this combination is supported. I am sure you are not implying that improving it is a bad idea. Please explain further; I seem to keep failing to understand your point.

        I guess I should clarify. There's nothing wrong with using NFS HA if you already have a substantial investment in NFS filers. Keep in mind, though, that if your NFS filer itself is not HA, you are just moving the single point of failure around, not eliminating it. Some Hadoop users have invested quite a lot in highly available NFS filers and are comfortable using them, which is one reason NFS HA is supported in Hadoop and will probably continue to be so for a long time. In some cases these filers were installed prior to Hadoop. But supported != encouraged for all new installs. In any case, NFS HA isn't the issue here; the issue is that using either kind of HA should solve your problem without requiring FJM changes. Another simple thing that would solve the problem is using RAID on the local edit directories. You would probably have to use hardware RAID, though, since in my experience software RAID on Linux had the same issues with threads hitting really slow timeouts when reading from an array where one disk was failing.

        sureshms Suresh Srinivas added a comment -

        Colin P. McCabe Please stop preaching to the choir.

        Kihwal has been contributing to Hadoop and supporting many large clusters for a long time, and understands the issues more than you give him credit for in these comments. If he wants to explore making FJM more robust, it is up to him. I also believe that not everyone may want to use QJM, and that is one reason why these interfaces are pluggable.

        kihwal Kihwal Lee added a comment -

        Thank you for the clarification. Fortunately I have a little bit of experience in the Linux kernel, storage, and fault tolerance, so I was able to digest what you have described so far. Although you do not recommend NFS + manual failover, I now understand you are not against an FJM improvement that could benefit existing installations as part of continuing support. I don't believe the addition of this feature will be interpreted as promotion or encouragement, since QJM+ZKFC is clearly newer and more capable, and users like your customers are happy with it.

        cmccabe Colin P. McCabe added a comment -

        I have no objection to putting FJM operations in separate threads. It would be a nice improvement. I am +0 on the whole thing and was just trying to point out alternative approaches.

        I describe all this stuff in detail so that everyone can read it, even the people who don't have years of experience like Kihwal and yourself. I also find that writing it down helps me think.


          People

          • Assignee: robsparker Robert Parker
          • Reporter: kihwal Kihwal Lee
          • Votes: 0
          • Watchers: 6
