Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 0.6.5, 0.7 beta 2
    • Component/s: Core
    • Labels:
      None

      Description

      The way mmap()'d I/O is handled in Cassandra is dangerous. It allocates potentially massive buffers with no bound on their total size. As the node's dataset grows, this leads to swapping and instability.

      This is a dangerous and wrong default for a couple of reasons.

      1) People are likely to test Cassandra with the default settings. This issue is insidious because it only appears once a node holds enough data, there is absolutely no way to control it, and it does not respect the memory limits that you give to the JVM.

      That can all be ascertained by reading the code, and people should certainly do their homework, but nevertheless, Cassandra should ship with sane defaults that don't break down when you cross some magic unknown threshold.

      2) It's deceptive. Unless you are extremely careful with capacity planning, you will get bit by this. Most people won't really be able to use this in production, so why get them excited about performance that they can't actually have?
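
For readers unfamiliar with the mechanism being criticized, here is a minimal sketch of memory-mapped file reading in Java. It is not Cassandra's code; the class and method names are hypothetical. The point it illustrates is that the mapped region lives outside the Java heap, so the JVM heap limit does not cap it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class, not part of Cassandra.
final class MmapReadSketch {
    /** Maps the whole file read-only and returns its contents as UTF-8 text. */
    static String readMapped(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping is backed by the OS page cache, not the Java heap,
            // so -Xmx does not bound how much data ends up mapped.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes text to a temporary file, reads it back through a mapping, then cleans up. */
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("mmap-sketch", ".dat");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return readMapped(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Copying the bytes into a heap array is only for demonstration; real mapped I/O would read from the buffer in place.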

      1. 1214-v4.txt (10 kB, Jonathan Ellis)
      2. 1214-v3.txt (8 kB, Jonathan Ellis)
      3. mlockall-jna.patch.txt (2 kB, Folke Behrens)
      4. trunk-1214.txt (13 kB, Peter Schuller)
      5. Read Throughput with mmap.jpg (36 kB, Schubert Zhang)

        Activity

        Jonathan Ellis added a comment -

        Because we don't want mmap'd data to be locked into memory – typical data sizes far exceed available RAM. The OS deals well with keeping hot mmap'd data paged in, so we want to let it do its job there. We just don't want it to be confused by the JVM's GC behavior into paging part of the JVM itself out.

        Yang Yang added a comment -

        Jonathan:

        Why was MCL_CURRENT chosen? I thought you would want to use MCL_FUTURE (ignoring the discussion above that these two seem to have the same value).

        With MCL_CURRENT, SSTables that you mmap() later can supposedly still be paged out. Or maybe I am not understanding it correctly?

        Thanks
        Yang

        Chris Goffinet added a comment -

        mmap + memlock gives us about a 13% improvement; on our test bed we were maxing out our 4 cores.

        Jonathan Ellis added a comment -

        committed

        Jon Hermes added a comment -

        +1.

        It's a best-effort patch dependent on the OS (which is all we can do, short of defaulting to mmap_index_only and taking a performance hit by default). Assuming the average use case, this is a much better default than before.

        Jonathan Ellis added a comment -

        v4 includes the ivy changes to download JNA at build time.

        Again, the relevant text from http://www.apache.org/legal/3party.html is, "LGPL v2.1-licensed works must not be included in Apache products, although they may be listed as system requirements or distributed elsewhere as optional works." We are not including JNA, nor are we even requiring it [although the policy explicitly states it would be fine to do so]. The only restriction is on distributing the LGPL work itself. So while Hadoop is welcome to pile additional restrictions on themselves, this is fine for us, since (and perhaps this wasn't clear) dependencies we pull in with ivy are build-time only and are not distributed with our source or binary artifacts.

        (FWIW it is also fine for an Apache-licensed Debian package to declare a dependency on an LGPL one.)

        Jonathan Ellis added a comment -

        Works fine here.

        Folke Behrens added a comment -

        I meant that com.sun.jna.Native would be a hard runtime dependency of FBUtilities. (NoClassDefFoundError != ClassNotFoundException) Cassandra wouldn't start without JNA, or did I miss something?

        Jonathan Ellis added a comment -

        That way we have to have a catch for CNFE in each method we're exposing, instead of just once in CLibrary.

        Folke Behrens added a comment -

        Wouldn't it then be better if tryMlockAll() loads another class with Class.forName() and catches the ClassNotFoundException if JNA jar is not on the classpath?
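
The probing approach suggested here can be sketched roughly as follows. This is an illustration only, not the code from any attached patch, and the class and method names are hypothetical. The relevant distinction is that a direct reference to com.sun.jna.Native fails with NoClassDefFoundError (an Error) when the class is first resolved, whereas Class.forName() reports the same absence as a checked ClassNotFoundException that can be handled in one place.

```java
// Hypothetical illustration class; Cassandra's own classes are named differently.
final class JnaProbeSketch {
    /** Returns true if the named class can be located on the classpath. */
    static boolean isClassPresent(String className) {
        try {
            // initialize=false: locate the class without running its static initializers.
            Class.forName(className, false, JnaProbeSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing jar, or a class that exists but cannot be linked.
            return false;
        }
    }

    /** Decides whether native memory locking can even be attempted. */
    static String describeMlockallSupport() {
        return isClassPresent("com.sun.jna.Native")
                ? "JNA found; mlockall can be attempted"
                : "JNA not found; continuing without mlockall";
    }
}
```

The cost of this design is that every native call has to go through a class that is only loaded after the probe succeeds.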

        Todd Lipcon added a comment -

        I don't think it's kosher to pull in LGPL as a build dependency with ivy either - in Hadoop we dynamically linked some JNI against LZO (LGPL licensed) but it was decided even that was not allowed, so we had to move the entire LZO support out to github.

        Regarding the FD issue, although reflecting out the FD field isn't that portable, I've seen it done in an awful lot of places, so I don't think it's going to change any time soon. There's a patch in the works for Hadoop that adds some JNI calls for IO-related things, and we grab the fd field there. There's also an interface sun.misc.JavaIOFileDescriptorAccess which you can sneak out of sun.misc.SharedSecrets, if that makes you feel better than using reflection.

        Jonathan Ellis added a comment -

        (correct patch attached)

        Jonathan Ellis added a comment -

        Patch that uses JNA, with catches for various error conditions and more informative logging where possible.

        As discussed above, we can't ship JNA with Cassandra but we can pull it in with ivy at build time. So one of the conditions handled is simply "JNA doesn't exist at runtime." (But we don't need to resort to reflection to allow it to compile without JNA.) [A sufficiently recent version of JNA is not available in the main public maven repo, and that won't change in the near future, so we will host one on Riptano's repo. I will update this patch when that is ready.]

        Peter Schuller added a comment -

        Well, posix_fadvise() is potentially a bit more problematic than mlockall(). It again takes flags, whose values I suppose may be as standardized in practice as those for mlockall() supposedly are (though I have not yet checked). In addition it takes an off_t which, being an abstract type, raises potential portability concerns. A quick Googling suggests (http://markmail.org/message/qvf7hhq2mgmwwmw3) that JNA has some particular support for the off_t data type, though I did not find it right now in the API docs (will have to check more carefully).

        The other thing is that posix_fadvise() will need a file descriptor in integer form. java.io.FileDescriptor is decidedly abstract and does not expose this information (which is understandable). I am not aware, offhand, of a good way for us to obtain the relevant underlying file descriptor; anyone? Molesting FileDescriptor with reflection should technically do the trick with OpenJDK/Sun-derived VMs (at least based on the current openjdk7 FileDescriptor.java), but... yuck.

        If it weren't for the build problems implied by JNI I would strongly prefer it. Under the circumstances I'm not sure. One observation is that given the kind of ifs and buts one seems to have to resort to anyway, writing some simple semi-portable build rules in Ant, specifically targeting certain platforms and compilers, does not feel so bad. Even if one hard-codes each common platform to avoid solving the native build problem generally, that does not feel worse to me in practice than making the assumptions necessary with JNA and stuff like using reflection to access private fields...

        As long as the native building remains optional and does not hinder anyone getting Cassandra to work with just Java, and as long as it is relatively easy for someone on an unsupported/problematic platform to simply build the JNI libraries themselves (doable by e.g. a simple Makefile with clear instructions for pointing to JDK headers etc.), JNI feels pretty reasonable to me.

        Thoughts? Am I painting a bleaker picture than reality with respect to using JNA?
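
The reflection trick mentioned above for getting an integer descriptor out of java.io.FileDescriptor might look roughly like this. It is a sketch under the assumption that the VM keeps the descriptor in a private int field named fd, which is an implementation detail and not a public API. Newer JDKs may refuse the reflective access unless the java.io package is explicitly opened, so the sketch treats failure as a normal outcome. The class and method names are hypothetical.

```java
import java.io.FileDescriptor;
import java.lang.reflect.Field;
import java.util.OptionalInt;

// Hypothetical illustration class; relies on a private JDK field, so it is best-effort only.
final class FdReflectionSketch {
    /** Returns the underlying integer descriptor, or empty if it cannot be obtained. */
    static OptionalInt nativeFd(FileDescriptor descriptor) {
        try {
            // "fd" is a private, implementation-specific field (assumption, not a documented API).
            Field field = FileDescriptor.class.getDeclaredField("fd");
            // May throw a RuntimeException if the module system or a security manager refuses access.
            field.setAccessible(true);
            return OptionalInt.of(field.getInt(descriptor));
        } catch (ReflectiveOperationException | RuntimeException e) {
            return OptionalInt.empty();
        }
    }
}
```

A caller would skip the posix_fadvise() optimization whenever the result is empty rather than fail.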

        Folke Behrens added a comment -

        Note that this is a political matter, not a legal one. It's against the ASF policy to distribute packages containing LGPL code. The licenses are compatible.

        Jonathan Ellis added a comment -

        Ugh, that's a pain. (JFFI is also LGPL.)

        It's not a deal breaker for us since we'd like to use it for basically optimizations... ASF says "LGPL v2.1-licensed works must not be included in Apache products, although they may be listed as system requirements or distributed elsewhere as optional works" so that would be workable if sub-optimal.

        Curious if Peter thinks we're going to have to go raw JNI for fadvise on compactions. If we're going to have to bite that bullet anyway then JNA gets less interesting.

        Todd Lipcon added a comment -

        AFAIK JNA is LGPL and thus incompatible with the Apache 2 license. I've wanted to use it in other ASF projects, too, and it's a pain there isn't an Apache-licensed alternative. If some of the Cassandra people are interested in a cleanroom implementation, I'd be interested in helping, though!

        Jonathan Ellis added a comment -

        Sounds good, with the caveat that it needs to catch the kind of error conditions I mentioned.

        Peter Schuller added a comment -

        It all sounds reasonable.

        So I take it the way forward would be to take your JNA version and combine with the configuration/policy parts of my patch (assuming people agree that those parts are a good idea) and go for that version for now and maybe move to JNI in the future if JNI becomes a dependency anyway for some other reason.

        Any objections?

        Folke Behrens added a comment -

        > How does the JNA approach behave if there is no C library (Windows?) or mlockall doesn't exist (OS X?)

        In the case of Mac OS X, an UnsatisfiedLinkError will be thrown. Windows? I don't know. Maybe a JNA-specific exception, maybe a ULE, too. OSes can be easily detected with Platform.isXXX() and dealt with accordingly.

        > something as simple as "grab errno" became a holy mess of portability concerns.

        Yes, but errno is a particularly hard case. The "inventors" messed up big time with this. That's why the JNA developers provide two ways to check errno: you either mark your methods with "throws LastErrorException" or you ask Native.getLastError(). This works under Windows, too.

        > The proposed JNA patch seems to suffer from exactly this problem as far as I can see, making assumptions about what the concrete values are of MCL_CURRENT and MCL_FUTURE.

        Theoretically you're right; in practice, however, I can't find a single POSIX system that assigns different values to MCL_CURRENT or MCL_FUTURE, and I think it's highly unlikely that these will change in the future. If so, Cassandra's code can be adjusted.

        > As far as I can tell, once one has gotten over the initial one-time hurdle of using JNI and the associated building issues, you have a much more correct/standards-compliant access to the native platform than through JNA since you're in compile time with access to appropriate headers etc.
        > Please do correct me if I'm wrong, since the idea of avoiding compile time/build issues is certainly very attractive and the reason why I tried to find an acceptable solution with JNA in the past.

        You're absolutely right, and your JNI code is really superb. If Cassandra needs to bind a couple more native functions I'd say JNI is the way to go. But not just yet.

        Peter Schuller added a comment -

        I'll admit I did not investigate JNA (or POSIX-JNA) for this particular case. Last time I did, however, I found it lacking. Very trivial cases were okay, but even something as simple as "grab errno" became a holy mess of portability concerns.

        I looked briefly at what posix-jna does, and I was unable to find any magic bullets in there and instead saw things like hard-coded constants that are non-portable and difficult to detect when they break due to changes to some particular platform.

        The proposed JNA patch seems to suffer from exactly this problem as far as I can see, making assumptions about what the concrete values are of MCL_CURRENT and MCL_FUTURE.

        As far as I can tell, once one has gotten over the initial one-time hurdle of using JNI and the associated building issues, you have a much more correct/standards-compliant access to the native platform than through JNA since you're in compile time with access to appropriate headers etc.

        Please do correct me if I'm wrong, since the idea of avoiding compile time/build issues is certainly very attractive and the reason why I tried to find an acceptable solution with JNA in the past.

        Jonathan Ellis added a comment -

        How does the JNA approach behave if there is no C library (Windows?) or mlockall doesn't exist (OS X?)

        Folke Behrens added a comment -

        Whoa ... have you looked at JNA?

        1. Apply attached patch.
        2. Put jna.jar (LGPL 2.1 / ~900k) in /lib/
        3. Start Cassandra with CAP_IPC_LOCK (or as "root").
        4. Linux: grep Unevictable /proc/meminfo

        https://jna.dev.java.net/servlets/ProjectDocumentList?folderID=12329&expandFolder=12329&folderID=0
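
The verification in step 4 can also be done programmatically. The following is a minimal sketch that extracts a field such as Unevictable from /proc/meminfo-style lines; the class and method names are hypothetical, and it assumes the usual "Name:   value kB" line layout. On systems without /proc/meminfo it simply reports no value.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical illustration class for checking whether memory was actually locked.
final class MeminfoSketch {
    /** Finds "name: <number> kB" in the given lines and returns the number. */
    static OptionalLong fieldKb(List<String> lines, String name) {
        String prefix = name + ":";
        for (String line : lines) {
            if (line.startsWith(prefix)) {
                String[] parts = line.substring(prefix.length()).trim().split("\\s+");
                try {
                    return OptionalLong.of(Long.parseLong(parts[0]));
                } catch (NumberFormatException e) {
                    return OptionalLong.empty();
                }
            }
        }
        return OptionalLong.empty();
    }

    /** Reads the Unevictable value from /proc/meminfo, or empty when unavailable (e.g. non-Linux). */
    static OptionalLong unevictableKb() {
        Path meminfo = Paths.get("/proc/meminfo");
        if (!Files.isReadable(meminfo)) {
            return OptionalLong.empty();
        }
        try {
            return fieldKb(Files.readAllLines(meminfo), "Unevictable");
        } catch (IOException e) {
            return OptionalLong.empty();
        }
    }
}
```

A value that grows to roughly the JVM's resident size after startup would suggest the lock took effect; a value near zero would suggest it did not.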

        Peter Schuller added a comment -

        This is the patch referred to by the previous comment. The 'submit patch' workflow never asked me to upload a file (or I missed it somehow).

        Peter Schuller added a comment -

        This is a draft (as in, submitted now as a work-in-progress for review rather than for commit) patch to add mlockall() support. It allows 'off', 'auto' and 'required' to be specified in the configuration, with the default being 'auto'.

        mlockall() can fail either because the native JNI library is missing or because mlockall() itself fails; neither is a terminal condition unless 'required' is specified in the configuration file.
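
To make the three settings concrete, here is a minimal sketch of the policy as described in this comment. It is not taken from the attached patch; the enum, class and method names are hypothetical, and the actual locking attempt is passed in as a callback so the sketch has no native dependency.

```java
import java.util.Locale;
import java.util.function.BooleanSupplier;

// Hypothetical illustration of the 'off' / 'auto' / 'required' behavior described above.
final class MlockallPolicySketch {
    enum Policy { OFF, AUTO, REQUIRED }

    /** Parses the configuration value; throws IllegalArgumentException for anything else. */
    static Policy parse(String value) {
        return Policy.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /**
     * Applies the policy. lockAttempt returns true when memory was locked, and false when
     * the native library is missing or mlockall() itself failed.
     */
    static boolean apply(Policy policy, BooleanSupplier lockAttempt) {
        if (policy == Policy.OFF) {
            return false; // never attempt the native call
        }
        boolean locked = lockAttempt.getAsBoolean();
        if (!locked && policy == Policy.REQUIRED) {
            // Only 'required' turns a failed attempt into a fatal condition.
            throw new IllegalStateException("mlockall is required by configuration but could not be applied");
        }
        return locked;
    }
}
```

With 'auto', a failed attempt is reported to the caller as false so that startup can continue with a log message.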

        The patch currently does not address building and packaging, except for a toy change to build.xml that is more of an example for a human to use for testing.

        I'd be interested to hear any opinions about how building and deployment should be handled given JNI libraries. I think it is important that no one is prevented from using Cassandra without mlockall() functionality due to native build issues, so it should presumably be optional. Even then, any suggestions for a favorite/preferred method of building JNI libraries portably in a way that hooks nicely into ant and the Cassandra build infrastructure? In particular, taking deployment into consideration (e.g. how it fits into Debian packaging or similar infrastructure).

        Jonathan Ellis added a comment -

        According to http://andrigoss.blogspot.com/2008/02/jvm-performance-tuning.html, using huge pages automatically gives us the lock-jvm-heap-in-memory behavior we want, and may provide a substantial performance benefit as well.

        See also: http://java.sun.com/javase/technologies/hotspot/largememory.jsp

        Schubert Zhang added a comment -

        Yes, I also found it is not good with mmap.

        Nate McCall added a comment -

        I have not hit this issue yet, but has anyone tried using the -XX:MaxDirectMemorySize option?

        Tupshin Harper added a comment -

        I am strongly in favor of defaults that are as flexible and stable as possible. If it is hard for even a relatively small percentage of users to get stable performance with mmap, then I would agree that the default should be standard I/O. There should then be a Cassandra Tuning wiki page that includes an mmap discussion.

        That said, I also agree that it is worth doing the native code work to get mmap more stable with larger datasets and/or smaller machines.

        James Golick added a comment -

        I have tried many levels of swappiness (including 0) without any change in behaviour. Additionally, I haven't seen much if any change in performance with standard IO.

        Continuing to iterate on the mmap code might be a good idea. But, it's the wrong default. Especially now that we've agreed that it is currently broken. It's possible that it may be a sensible default in the future, but right now, it's not a good choice for production (in most cases).

        Todd Lipcon added a comment -

        Configuring /proc/sys/vm/swappiness down to 0-10 may also help.
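
Checking the current value before changing it could be sketched like this. The class and method names are hypothetical, and the threshold of 10 is simply the upper end of the range suggested in this comment; on systems without /proc/sys/vm/swappiness the sketch reports no value.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.OptionalInt;

// Hypothetical illustration class for inspecting the kernel's swappiness setting.
final class SwappinessSketch {
    /** Parses the file contents, e.g. "60\n"; empty if not a number. */
    static OptionalInt parse(String raw) {
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    /** Reads /proc/sys/vm/swappiness, or empty when unavailable (e.g. non-Linux). */
    static OptionalInt current() {
        Path path = Paths.get("/proc/sys/vm/swappiness");
        if (!Files.isReadable(path)) {
            return OptionalInt.empty();
        }
        try {
            return parse(new String(Files.readAllBytes(path), StandardCharsets.UTF_8));
        } catch (IOException e) {
            return OptionalInt.empty();
        }
    }

    /** True when the value is above the 0-10 range suggested in this comment. */
    static boolean aboveSuggestedRange(int swappiness) {
        return swappiness > 10;
    }
}
```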

        Jonathan Ellis added a comment -

        It seems that what is happening is,

        • the JVM hasn't needed to run a major collection in a while,
        • so Linux says "I'll swap part of the JVM's heap so I can pull more of this hot sstable into ram,"
        • then the JVM goes to GC and thrashes pulling its heap in from swap

        The "right" solution is probably to use mlockall(MCL_CURRENT) on JVM start (with min heap = max heap so that gets pre-allocated). Then perform the mmapping.

        mmap'd io is enough faster that this is probably worth biting the native code bullet for.
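
The "min heap = max heap" precondition can be checked at startup before attempting the lock. The following is a rough sketch that inspects the JVM's -Xms and -Xmx arguments; the class and method names are hypothetical, only the k/m/g suffixes are handled, and equivalent flags such as -XX:InitialHeapSize are deliberately ignored.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical illustration class: checks that the heap is fixed-size before locking memory.
final class HeapFlagCheck {
    /** Parses a JVM size such as "512m" or "4G" into bytes; empty if malformed. */
    static OptionalLong parseSize(String s) {
        if (s.isEmpty()) {
            return OptionalLong.empty();
        }
        long multiplier = 1;
        String digits = s;
        char last = Character.toLowerCase(s.charAt(s.length() - 1));
        if (last == 'k' || last == 'm' || last == 'g') {
            digits = s.substring(0, s.length() - 1);
            multiplier = last == 'k' ? 1L << 10 : last == 'm' ? 1L << 20 : 1L << 30;
        }
        try {
            return OptionalLong.of(Long.parseLong(digits) * multiplier);
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** Returns the value of the last argument starting with the given prefix, e.g. "-Xms". */
    static OptionalLong flagValue(List<String> jvmArgs, String prefix) {
        OptionalLong result = OptionalLong.empty();
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                result = parseSize(arg.substring(prefix.length()));
            }
        }
        return result;
    }

    /** True when both -Xms and -Xmx are given and denote the same number of bytes. */
    static boolean heapIsFixed(List<String> jvmArgs) {
        OptionalLong min = flagValue(jvmArgs, "-Xms");
        OptionalLong max = flagValue(jvmArgs, "-Xmx");
        return min.isPresent() && max.isPresent() && min.getAsLong() == max.getAsLong();
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("fixed heap: " + heapIsFixed(jvmArgs));
    }
}
```

If the check fails, a startup script could warn that a heap allowed to grow after the lock call is not covered by mlockall(MCL_CURRENT).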

        Jeff Hodges added a comment -

        This is one of the very first things we've had to do with every cluster we've built. The mmap implementation just does not work for anything I've seen in production beyond trivial datasets. This would be a wonderful, reality-driven change.


          People

          • Assignee:
            Jonathan Ellis
            Reporter:
            James Golick
            Reviewer:
            Jon Hermes
          • Votes:
            0
            Watchers:
            18