SOLR-9389: HDFS Transaction logs stay open for writes which leaks Xceivers

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.1, 7.0
    • Fix Version/s: 6.2.1, 6.3, 7.0
    • Component/s: Hadoop Integration, hdfs
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None
    • Flags:
      Patch

      Description

      The HdfsTransactionLog implementation keeps a Hadoop FSDataOutputStream open for its whole lifetime, which consumes two threads on the HDFS data node server (dataXceiver and packetresponder) even once the Solr tlog has finished being written to.

      This means for a cluster with many indexes on HDFS, the number of Xceivers can keep growing and eventually hit the limit of 4096 on the data nodes. It's especially likely for indexes that have low write rates, because Solr keeps enough tlogs around to contain 100 documents (up to a limit of 10 tlogs). There's also the issue that attempting to write to a finished tlog would be a major bug, so closing it for writes helps catch that.

      During testing, our cluster had 100+ collections with 100 shards each, spread across 8 boxes (each running 4 Solr nodes and 1 HDFS data node). With 3x replication for the tlog files, this meant we hit the Xceiver limit fairly easily, and we had to use the attached patch to ensure tlogs were closed for writes once finished.

      The patch introduces an extra lifecycle state for the tlog, so it can be closed for writes and free up the HDFS resources while still being available for reading. I've tried to make it as unobtrusive as I could, but there's probably a better way. I have not changed the behaviour of the local disk tlog implementation, because it only consumes a file descriptor whether it's being read or written.
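      As an illustration only, here is a minimal sketch of the idea behind the patch (not the actual HdfsTransactionLog code; the class and method names are made up): once a tlog has finished being written, the HDFS output stream is flushed and closed so the data node can release its dataXceiver/packetresponder threads, while reads continue through a separate input stream.

      import java.io.IOException;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      /** Hypothetical sketch of a tlog with an explicit "finished for writes" state. */
      class ClosableHdfsTlogSketch {
        private final FileSystem fs;
        private final Path tlogFile;
        private FSDataOutputStream out;         // while open, pins dataXceiver/packetresponder threads
        private boolean finishedWrites = false;

        ClosableHdfsTlogSketch(FileSystem fs, Path tlogFile) throws IOException {
          this.fs = fs;
          this.tlogFile = tlogFile;
          this.out = fs.create(tlogFile, true); // open for writes
        }

        synchronized void write(byte[] record) throws IOException {
          if (finishedWrites) {
            // Writing to a finished tlog would be a major bug; fail loudly rather than reopening.
            throw new IllegalStateException("tlog already closed for writes: " + tlogFile);
          }
          out.write(record);
        }

        /** Extra lifecycle step: close the write path so the data node frees its threads. */
        synchronized void finishWrites() throws IOException {
          if (!finishedWrites) {
            out.hflush();                       // make buffered data visible to readers
            out.close();                        // releases the data-node-side Xceiver resources
            out = null;
            finishedWrites = true;
          }
        }

        /** Reads are still possible after the tlog is closed for writes. */
        synchronized FSDataInputStream openReader() throws IOException {
          return fs.open(tlogFile);
        }
      }

      The key point is that the transition is one-way: after finishWrites() the log can only be read, which both frees the data-node threads and turns any accidental late write into an obvious error.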

      NB: we have decided not to use Solr-on-HDFS now; we're using local disk (for various reasons). So I don't have an HDFS cluster to do further testing on this; I'm just contributing the patch which worked for us.

        Activity

        markrmiller@gmail.com Mark Miller added a comment -

        I've spent some time reviewing and testing out this patch - it all looks reasonable to me and tests out well. I'll commit this change soon.

        jira-bot ASF subversion and git services added a comment -

        Commit 6bff06ce4fad8edbe2a45e9b3639dfc8d3d2bb87 in lucene-solr's branch refs/heads/master from markrmiller
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6bff06c ]

        SOLR-9389: HDFS Transaction logs stay open for writes which leaks Xceivers.

        jira-bot ASF subversion and git services added a comment -

        Commit aaee4c820556ca0f62d51f939e907864e4262a32 in lucene-solr's branch refs/heads/branch_6x from markrmiller
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=aaee4c8 ]

        SOLR-9389: HDFS Transaction logs stay open for writes which leaks Xceivers.

        markrmiller@gmail.com Mark Miller added a comment -

        Thanks Tim!

        TimOwen Tim Owen added a comment -

        Great, thanks for reviewing and testing this Mark

        dsmiley David Smiley added a comment -

        100+ collections with 100 shards each, spread across 8 boxes

        Tim Owen that would mean ~1250 shards/box; no? ~312 per Solr process. Wow!

        TimOwen Tim Owen added a comment -

        David Smiley We're now running 6 Solr JVMs per box, as the machines in production have 6 SSDs installed, so yeah, it works out at around 200 Solr cores being served by each Solr JVM. That seems to run fine, and our staging environment for another Solr installation has had hundreds of cores per JVM for several years. The reason for many shards is that we do frequent updates and deletes, and want to keep each Lucene index at a manageable size (e.g. 5GB) to avoid a potentially slow merge that would block writes for too long. With composite routing, our queries never touch all shards in a collection - just a few.

        The remaining problem is with SolrCloud and the Overseer/ZooKeeper, which become overloaded with traffic once there's any kind of problem, e.g. a machine failure or, worse, an entire rack losing power - this causes a flood of overseer queue events and all the nodes feverishly downloading state.json repeatedly. Happy to talk to anyone who's working on that problem!
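        For anyone unfamiliar with the composite routing mentioned above, it works roughly like this in SolrJ (a hypothetical sketch; the 'tenantA' prefix, the field names and the 'orders' collection are made up): documents share a routing prefix before the '!' separator, and queries pass the same prefix via _route_ so only the matching shard(s) are searched instead of all of them.

        import java.io.IOException;
        import org.apache.solr.client.solrj.SolrClient;
        import org.apache.solr.client.solrj.SolrQuery;
        import org.apache.solr.client.solrj.SolrServerException;
        import org.apache.solr.client.solrj.response.QueryResponse;
        import org.apache.solr.common.SolrInputDocument;

        class CompositeRoutingSketch {
          /** Index a document whose id carries a routing prefix, so the compositeId router co-locates it. */
          static void indexForTenant(SolrClient client) throws SolrServerException, IOException {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "tenantA!order-123");   // everything before '!' drives shard selection
            doc.addField("status", "open");
            client.add("orders", doc);
            client.commit("orders");
          }

          /** Query only the shard(s) holding tenantA's documents instead of fanning out to every shard. */
          static QueryResponse queryTenant(SolrClient client) throws SolrServerException, IOException {
            SolrQuery q = new SolrQuery("status:open");
            q.set("_route_", "tenantA!");              // same prefix restricts the distributed query
            return client.query("orders", q);
          }
        }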

        markrmiller@gmail.com Mark Miller added a comment -

        What version are you seeing that with?

        TimOwen Tim Owen added a comment -

        We're using Solr 6.1 (on local disk now, as mentioned). The first production cluster we had hoped to get stable was 40 boxes, each running 5 or 6 Solr JVMs, with a dedicated ZooKeeper cluster on 3 other boxes, and 100 shards per collection. That was problematic: we had a lot of ZooKeeper traffic during normal writes, but especially whenever one or more boxes were deliberately killed and many Solr instances restarted all at once, leading to a large overseer queue and shards in recovery for a long time.

        Right now we're testing two scaled-down clusters, 24 boxes and 12 boxes, with a correspondingly reduced number of shards, to see at what point the cluster stays stable when we do destructive testing by killing machines and whole racks. The 12-box cluster is looking a lot more stable so far.

        We'll have to consider running multiple of these smaller clusters instead of 1 large one - is that best practice? There was some discussion on SOLR-5872 and SOLR-5475 about scaling the overseer with large numbers of collections and shards, although it's clearly a tricky problem.

        dsmiley David Smiley added a comment -

        The reason for many shards is that we do frequent updates and deletes, and want to keep the Lucene index size below a manageable level e.g. 5GB, to avoid a potentially slow merge that would block writes for too long.

        That's addressable easily enough by increasing the merge concurrency threads and using a higher mergeFactor (now split into two other options). It's cool that your queries only hit a couple of shards on average, but I think going up by a factor of 10 in shard size (and using 1/10th the number of shards) is a better direction than truly massive shard counts like this.
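        For reference, the Lucene-level knobs behind that advice look roughly like this (a sketch only; in Solr they would normally be set in solrconfig.xml rather than in code, and the numbers here are illustrative, not recommendations):

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.ConcurrentMergeScheduler;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.index.TieredMergePolicy;

        class MergeTuningSketch {
          static IndexWriterConfig tunedConfig() {
            // mergeFactor's two successors on TieredMergePolicy:
            TieredMergePolicy mergePolicy = new TieredMergePolicy();
            mergePolicy.setMaxMergeAtOnce(30);      // how many segments may be merged at once
            mergePolicy.setSegmentsPerTier(30);     // how many segments are allowed per tier

            // More merge concurrency so a big merge is less likely to stall indexing:
            ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
            scheduler.setMaxMergesAndThreads(6, 3); // maxMergeCount, maxThreadCount

            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setMergePolicy(mergePolicy);
            config.setMergeScheduler(scheduler);
            return config;
          }
        }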

        TimOwen Tim Owen added a comment -

        Thanks for the advice, David. I'll take a look at the concurrency setting, and we'll test out using fewer shards to see how that compares for our use-case. Since we create new collections weekly, we always have the option to increase the shard count later if we do hit situations where large merges happen.

        Although I'm a bit surprised that this model is considered 'truly massive'... I'd have expected many large Solr installations to have thousands of shards across all their collections.

        dsmiley David Smiley added a comment -

        100 shards for any one collection isn't massive (although I do think it's high)... I mean the total number of shards you run per box. I can see now that you keep your shard size low, which makes it more feasible; and no doubt you have tons of RAM. Most people go for bigger shards and fewer of them, rather than small, numerous shards. A factor enabling you to do this is that your application allows for very effective use of composite-key doc routing. Nonetheless I'm sure there's a high Java heap overhead per shard at these numbers, and it'd be nice to bring it down from the stratosphere.

        shalinmangar Shalin Shekhar Mangar added a comment -

        Re-opened to back-port to 6.2.1

        jira-bot ASF subversion and git services added a comment -

        Commit 83cd8c12ecd53198eea51ff418c7e1901b4a9dde in lucene-solr's branch refs/heads/branch_6_2 from markrmiller
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=83cd8c1 ]

        SOLR-9389: HDFS Transaction logs stay open for writes which leaks Xceivers.

        (cherry picked from commit aaee4c8)

        shalinmangar Shalin Shekhar Mangar added a comment -

        Closing after 6.2.1 release


          People

          • Assignee:
            markrmiller@gmail.com Mark Miller
          • Reporter:
            TimOwen Tim Owen
          • Votes:
            0
          • Watchers:
            6
