Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-5043

hostname lookup in SystemInfoHandler should be refactored so it's possible to not block core (re)load for long periouds on misconfigured systems

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4, 7.0
    • Component/s: None
    • Labels:
      None

      Description

      SystemInfoHandler currently lookups the hostname of the machine on it's init, and caches for it's lifecycle – there is a comment to the effect that the reason for this is because on some machines (notably ones with wacky DNS settings) looking up the hostname can take a long ass time in some JVMs...

        // on some platforms, resolving canonical hostname can cause the thread
        // to block for several seconds if nameservices aren't available
        // so resolve this once per handler instance 
        //(ie: not static, so core reload will refresh)
      

      But as we move forward with a lot more multi-core, solr-cloud, dynamically updated instances, even paying this cost per core-reload is expensive.

      we should refactoring this so that SystemInfoHandler instances init immediately, with some kind of lazy loading of the hostname info in a background thread, (especially since hte only real point of having that info here is for UI use so you cna keep track of what machine you are looking at)

      1. SOLR-5043.patch
        3 kB
        Hoss Man
      2. SOLR-5043.patch
        3 kB
        Hoss Man
      3. SOLR-5043-lazy.patch
        2 kB
        Ramkumar Aiyengar

        Issue Links

          Activity

          Hide
          hossman Hoss Man added a comment -

          Here's a quick and drity patch that spins up background thread to resolve the hostname.

          in practice, this works fine – startup and core reload times are short, and an error eventually gets logged by the background thread if/when the JVM gives up on resolving the hostname.

          the hitch is: if we assume people might have bad DNS when running solr, we should probably also assume they might have bad DNS when running solr tests. As things stand on trunk today, that just means your tests run dog ass slow – with this patch, the test will finish fast, but then the runner will fail because it treats this background DNS thread as a leak.

          probably need to figure out a better way of dealing with this.

          Show
          hossman Hoss Man added a comment - Here's a quick and drity patch that spins up background thread to resolve the hostname. in practice, this works fine – startup and core reload times are short, and an error eventually gets logged by the background thread if/when the JVM gives up on resolving the hostname. the hitch is: if we assume people might have bad DNS when running solr, we should probably also assume they might have bad DNS when running solr tests. As things stand on trunk today, that just means your tests run dog ass slow – with this patch, the test will finish fast, but then the runner will fail because it treats this background DNS thread as a leak. probably need to figure out a better way of dealing with this.
          Hide
          romseygeek Alan Woodward added a comment -

          Can we use a CompletionService for this? Maybe have one running on the CoreContainer which can then be stopped when the container is shutdown, which should stop any thread leaks.

          Show
          romseygeek Alan Woodward added a comment - Can we use a CompletionService for this? Maybe have one running on the CoreContainer which can then be stopped when the container is shutdown, which should stop any thread leaks.
          Hide
          andyetitmoves Ramkumar Aiyengar added a comment -

          Having this on a per-instance basis like it does currently would also mean that you have one more thread running per core, even if temporarily, that might cause issues if you have lots of cores starting up at the same time in a JVM.

          Is it a big deal to just do this lazily and synchronously the first time the handler is requested? I know it just moves the delay elsewhere, but on the flip side, with this patch, you might sometimes not get the hostname when you expect it (so technically it's a functional difference)

          Show
          andyetitmoves Ramkumar Aiyengar added a comment - Having this on a per-instance basis like it does currently would also mean that you have one more thread running per core, even if temporarily, that might cause issues if you have lots of cores starting up at the same time in a JVM. Is it a big deal to just do this lazily and synchronously the first time the handler is requested? I know it just moves the delay elsewhere, but on the flip side, with this patch, you might sometimes not get the hostname when you expect it (so technically it's a functional difference)
          Hide
          andyetitmoves Ramkumar Aiyengar added a comment -

          Attached is a patch describing what I meant (untested)

          Show
          andyetitmoves Ramkumar Aiyengar added a comment - Attached is a patch describing what I meant (untested)
          Hide
          hossman Hoss Man added a comment -

          Ram: the approach you're suggesting was explicitly discussed in the past and deemed a bad idea because it means that if DNS changes there is no way for a solr admin to trigger a "refresh" of the "cached" hostname – which is why the existing code explicitly has the comment "not static, so core reload will refresh" – that way users could at least trigger a core reload when doing things like server swaps (or disaster recovery fail over of whole colos using subnet aliases, etc...)

          Having this on a per-instance basis like it does currently would also mean that you have one more thread running per core, even if temporarily, that might cause issues if you have lots of cores starting up at the same time in a JVM.

          which is why the suggestion of using a single threaded CompletionService, overwritting a single static variable each time there is a core reload, is a better one (that unfortunately noone has ever got arround to implementing)

          ...on the flip side, with this patch, you might sometimes not get the hostname when you expect it (so technically it's a functional difference)

          It's always been the case that if there is a DNS problem you'll get null instead of the canonical hostname, and at a later time you might start getting a new/different/correct hostname (currently if/when DNS is fixed and hte core reloads) so improvements that return null if/when a hostname is unresolvable shouldn't be ruled out just for that reason.


          The more we talk about the various trade offs involved with this type of problem, the more and more I ultimately feel like we really shouldn't invest too much complexity in the code just to account for people with bad/broken DNS configurations.

          my current feeling is:

          • we should "fix" the current SystemInfoHandler init logic to log errors when there are DNS problems so they should up in the logs, but otherwise leave things along.
          • For situations like SOLR-7884 i think an advanced, seriously expert, system property that says "hey solr, under no circumstances should you try to do any DNS lookups because i know my DNS is busted" would be a good idea, and should be implemented by a generic helper method for use both here and in the various parts of the disributed update/search logic in the cloud code. (with forbidden API checks to prevent any future code other then this helper method from doing DNS related methods)
          Show
          hossman Hoss Man added a comment - Ram: the approach you're suggesting was explicitly discussed in the past and deemed a bad idea because it means that if DNS changes there is no way for a solr admin to trigger a "refresh" of the "cached" hostname – which is why the existing code explicitly has the comment "not static, so core reload will refresh" – that way users could at least trigger a core reload when doing things like server swaps (or disaster recovery fail over of whole colos using subnet aliases, etc...) Having this on a per-instance basis like it does currently would also mean that you have one more thread running per core, even if temporarily, that might cause issues if you have lots of cores starting up at the same time in a JVM. which is why the suggestion of using a single threaded CompletionService, overwritting a single static variable each time there is a core reload, is a better one (that unfortunately noone has ever got arround to implementing) ...on the flip side, with this patch, you might sometimes not get the hostname when you expect it (so technically it's a functional difference) It's always been the case that if there is a DNS problem you'll get null instead of the canonical hostname, and at a later time you might start getting a new/different/correct hostname (currently if/when DNS is fixed and hte core reloads) so improvements that return null if/when a hostname is unresolvable shouldn't be ruled out just for that reason. The more we talk about the various trade offs involved with this type of problem, the more and more I ultimately feel like we really shouldn't invest too much complexity in the code just to account for people with bad/broken DNS configurations. my current feeling is: we should "fix" the current SystemInfoHandler init logic to log errors when there are DNS problems so they should up in the logs, but otherwise leave things along. For situations like SOLR-7884 i think an advanced, seriously expert, system property that says "hey solr, under no circumstances should you try to do any DNS lookups because i know my DNS is busted" would be a good idea, and should be implemented by a generic helper method for use both here and in the various parts of the disributed update/search logic in the cloud code. (with forbidden API checks to prevent any future code other then this helper method from doing DNS related methods)
          Hide
          andyetitmoves Ramkumar Aiyengar added a comment -

          Ah, my bad, I interpreted the comment to mean "not static (for other reasons), so core reload will (unfortunately) refresh (, but that's not a big deal)"

          Makes sense overall..

          Show
          andyetitmoves Ramkumar Aiyengar added a comment - Ah, my bad, I interpreted the comment to mean "not static (for other reasons), so core reload will (unfortunately) refresh (, but that's not a big deal)" Makes sense overall..
          Hide
          rmk_apache Robert Krüger added a comment -

          I agree that the solution to this should be simple because cases like SOLR-7884 are certainly not the norm (I am the reporter btw). The proposed system property would be a perfectly sufficient and very welcome fix for this.

          Show
          rmk_apache Robert Krüger added a comment - I agree that the solution to this should be simple because cases like SOLR-7884 are certainly not the norm (I am the reporter btw). The proposed system property would be a perfectly sufficient and very welcome fix for this.
          Hide
          rmk_apache Robert Krüger added a comment -

          Is anyone planning on scheduling this for an upcoming release? The proposed solution unfortunately is beyond my solr-knowledge, so I probably won't send a patch.

          Show
          rmk_apache Robert Krüger added a comment - Is anyone planning on scheduling this for an upcoming release? The proposed solution unfortunately is beyond my solr-knowledge, so I probably won't send a patch.
          Hide
          rmk_apache Robert Krüger added a comment -

          I would offer a bounty to have this fixed. Is there a systematic/recommended way of doing this in this project?

          Show
          rmk_apache Robert Krüger added a comment - I would offer a bounty to have this fixed. Is there a systematic/recommended way of doing this in this project?
          Hide
          hossman Hoss Man added a comment -

          For situations like SOLR-7884 i think an advanced, seriously expert, system property that says "hey solr, under no circumstances should you try to do any DNS lookups because i know my DNS is busted" would be a good idea, and should be implemented by a generic helper method for use both here and in the various parts of the disributed update/search logic in the cloud code. (with forbidden API checks to prevent any future code other then this helper method from doing DNS related methods)

          I spent a little time looking into this, and it appears to be somewhat intractable. There are too many InetAddress methods that may do reverse DNS lookups under the covers to try and write our own wrappers and suppress direct calls with forbidden-API – particularly since many of those methods are the same ones that just return the local hostname configuration if available (when dealing with InetAddress.getLocalHost()) or the previously specified hostname (when dealing with InetAddress.getByName(String))

          So instead I punted on trying to solve the general problem of a sysprop to say "Don't ever do any DNS lookups" and just kept the problem constrained to sysprop that says "Don't ever do this DNS lookup" .. leaving open the possibility of refactoring it later if there are other places where we decide we want it to apply as well.

          Robert: can you try out this patch on a machine with a configuration you know currently causes you long delays in starting up solr, and confirm you get a helpful WARN message instructing you to set the solr.dns.prevent.reverse.lookup .. and then test out adding that to your startup params and confirm it eliminates the problem?

          Show
          hossman Hoss Man added a comment - For situations like SOLR-7884 i think an advanced, seriously expert, system property that says "hey solr, under no circumstances should you try to do any DNS lookups because i know my DNS is busted" would be a good idea, and should be implemented by a generic helper method for use both here and in the various parts of the disributed update/search logic in the cloud code. (with forbidden API checks to prevent any future code other then this helper method from doing DNS related methods) I spent a little time looking into this, and it appears to be somewhat intractable. There are too many InetAddress methods that may do reverse DNS lookups under the covers to try and write our own wrappers and suppress direct calls with forbidden-API – particularly since many of those methods are the same ones that just return the local hostname configuration if available (when dealing with InetAddress.getLocalHost() ) or the previously specified hostname (when dealing with InetAddress.getByName(String) ) So instead I punted on trying to solve the general problem of a sysprop to say "Don't ever do any DNS lookups" and just kept the problem constrained to sysprop that says "Don't ever do this DNS lookup" .. leaving open the possibility of refactoring it later if there are other places where we decide we want it to apply as well. Robert: can you try out this patch on a machine with a configuration you know currently causes you long delays in starting up solr, and confirm you get a helpful WARN message instructing you to set the solr.dns.prevent.reverse.lookup .. and then test out adding that to your startup params and confirm it eliminates the problem?
          Hide
          rmk_apache Robert Krüger added a comment -

          Thanks a lot! I will try and get back to you.

          Show
          rmk_apache Robert Krüger added a comment - Thanks a lot! I will try and get back to you.
          Hide
          rmk_apache Robert Krüger added a comment -

          Works, thanks a lot!

          The only question I have is why you don't set the host name to the inet address (or maybe even the result of getHostName?) in the case when the DNS lookup is suppressed. For admins looking into the logs this is still better than not having a host name in there and that should not block at all. The issue was with getCanonicalHostName making the reverse DNS lookup that caused the system to hang. Your call. Since I don't really care about host name in the logs, it fixes my problem as is but I thought I'd at least point out the possibility.

          Show
          rmk_apache Robert Krüger added a comment - Works, thanks a lot! The only question I have is why you don't set the host name to the inet address (or maybe even the result of getHostName?) in the case when the DNS lookup is suppressed. For admins looking into the logs this is still better than not having a host name in there and that should not block at all. The issue was with getCanonicalHostName making the reverse DNS lookup that caused the system to hang. Your call. Since I don't really care about host name in the logs, it fixes my problem as is but I thought I'd at least point out the possibility.
          Hide
          rmk_apache Robert Krüger added a comment -

          Is anything keeping you from pushing this into one of the next updates? It is still a big issue for us with no known workaround.

          Show
          rmk_apache Robert Krüger added a comment - Is anything keeping you from pushing this into one of the next updates? It is still a big issue for us with no known workaround.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 135604f6327032d0258227aaa524369203d40822 in lucene-solr's branch refs/heads/branch_6x from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=135604f ]

          SOLR-5043: New solr.dns.prevent.reverse.lookup system property that can be used to prevent long core (re)load delays on systems with missconfigured hostname/DNS

          (cherry picked from commit 8b98b158ff9cc2a71216e12c894ca14352d31f0e)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 135604f6327032d0258227aaa524369203d40822 in lucene-solr's branch refs/heads/branch_6x from Chris Hostetter [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=135604f ] SOLR-5043 : New solr.dns.prevent.reverse.lookup system property that can be used to prevent long core (re)load delays on systems with missconfigured hostname/DNS (cherry picked from commit 8b98b158ff9cc2a71216e12c894ca14352d31f0e)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 8b98b158ff9cc2a71216e12c894ca14352d31f0e in lucene-solr's branch refs/heads/master from Chris Hostetter
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8b98b15 ]

          SOLR-5043: New solr.dns.prevent.reverse.lookup system property that can be used to prevent long core (re)load delays on systems with missconfigured hostname/DNS

          Show
          jira-bot ASF subversion and git services added a comment - Commit 8b98b158ff9cc2a71216e12c894ca14352d31f0e in lucene-solr's branch refs/heads/master from Chris Hostetter [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8b98b15 ] SOLR-5043 : New solr.dns.prevent.reverse.lookup system property that can be used to prevent long core (re)load delays on systems with missconfigured hostname/DNS
          Hide
          hossman Hoss Man added a comment -

          Is anything keeping you from pushing this into one of the next updates?

          Sorry – i lost track of it and didn't see your previous comment verifying hat the patch orked out for you.

          The only question I have is why you don't set the host name to the inet address (or maybe even the result of getHostName?) in the case when the DNS lookup is suppressed. ...

          well, 2 reasons...

          #1) conceptually, i don't like the idea of redefining what SystemInfoHandler reports as the host value ... this has always ment "Either The Canonical Hostname of null if it can't be determined" and I don't really like the idea that sometimes it's something else – partciularly when the primary usecase that might lead to "sometimes" is missconfigured DNS – i don't want to give users the impression "The (canonical) hostname of this solr node is foobar" when foobar is just some locally configured hostname and that name can't actaully be used to connect to the solr node.

          Adding distinct SystemInfoHandler variables for the IP Addr, or (locally configured) hostname might be conceptually ok – but personally i don't see much value, and leads me to...

          #2) the way the InetAddress API is designed, just calling InetAddress.getLocalHost() causes a DNS lookup to happen – leading to the whole potential long pause delay this issue was opened about (perhaps not necessarily in the same misconfiguration situation you face, but it could in other misconfiguration situations). Likewise, getHostName() will do a reverse lookup in some situations if there isn't any locally configured hostname.

          The bottom line being: since the entire predicate for this issue is "Sometimes people have badly configure DNS and/or hostname settings, and we should give them a way to make life less painful" I didn't want to make too many assumptions about the specific nature of how their DNS and/or hostname settings might be badly configured and/or introduce similar problems or more complexity in just trying to get the IP addr.

          Show
          hossman Hoss Man added a comment - Is anything keeping you from pushing this into one of the next updates? Sorry – i lost track of it and didn't see your previous comment verifying hat the patch orked out for you. The only question I have is why you don't set the host name to the inet address (or maybe even the result of getHostName?) in the case when the DNS lookup is suppressed. ... well, 2 reasons... #1) conceptually, i don't like the idea of redefining what SystemInfoHandler reports as the host value ... this has always ment "Either The Canonical Hostname of null if it can't be determined" and I don't really like the idea that sometimes it's something else – partciularly when the primary usecase that might lead to "sometimes" is missconfigured DNS – i don't want to give users the impression "The (canonical) hostname of this solr node is foobar " when foobar is just some locally configured hostname and that name can't actaully be used to connect to the solr node. Adding distinct SystemInfoHandler variables for the IP Addr, or (locally configured) hostname might be conceptually ok – but personally i don't see much value, and leads me to... #2) the way the InetAddress API is designed, just calling InetAddress.getLocalHost() causes a DNS lookup to happen – leading to the whole potential long pause delay this issue was opened about (perhaps not necessarily in the same misconfiguration situation you face, but it could in other misconfiguration situations). Likewise, getHostName() will do a reverse lookup in some situations if there isn't any locally configured hostname. The bottom line being: since the entire predicate for this issue is "Sometimes people have badly configure DNS and/or hostname settings, and we should give them a way to make life less painful" I didn't want to make too many assumptions about the specific nature of how their DNS and/or hostname settings might be badly configured and/or introduce similar problems or more complexity in just trying to get the IP addr.

            People

            • Assignee:
              hossman Hoss Man
              Reporter:
              hossman Hoss Man
            • Votes:
              2 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development