Cassandra
  1. Cassandra
  2. CASSANDRA-2749

fine-grained control over data directories

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Fix Version/s: 1.1.0
    • Component/s: Core
    • Labels:
      None

      Description

      Currently Cassandra supports multiple data directories but no way to control what sstables are placed where. Particularly for systems with mixed SSDs and rotational disks, it would be nice to pin frequently accessed columnfamilies to the SSDs.

      Postgresql does this with tablespaces (http://www.postgresql.org/docs/9.0/static/manage-ag-tablespaces.html) but we should probably avoid using that name because of confusing similarity to "keyspaces."

      1. 0003-Fixes.patch
        8 kB
        Sylvain Lebresne
      2. 0002-fix-unit-tests.patch
        252 kB
        Sylvain Lebresne
      3. 0001-2749.patch
        82 kB
        Sylvain Lebresne
      4. 2749_proper.tar.gz
        54 kB
        Marcus Eriksson
      5. 2749_not_backwards.tar.gz
        54 kB
        Marcus Eriksson
      6. 0001-non-backwards-compatible-patch-for-2749-putting-cfs-.patch.gz
        52 kB
        Marcus Eriksson
      7. 2749.tar.gz
        3 kB
        Marcus Eriksson
      8. 2749_backwards_compatible_v4_rebase1.patch
        43 kB
        Marcus Eriksson
      9. 2749_backwards_compatible_v4.patch
        43 kB
        Marcus Eriksson
      10. 2749_backwards_compatible_v3.patch
        44 kB
        Marcus Eriksson
      11. 2749_backwards_compatible_v2.patch
        31 kB
        Marcus Eriksson
      12. 2749_backwards_compatible_v1.patch
        25 kB
        Marcus Eriksson
      13. 0001-Make-it-possible-to-put-column-families-in-subdirect.patch
        83 kB
        Marcus Eriksson

        Activity

        Hide
        Chris Chiappone added a comment -

        This actually causes a bunch of problems for customers that have used longer the 48 character keyspace and cf names. How will this affect the upgrade path for those customers?

        Show
        Chris Chiappone added a comment - This actually causes a bunch of problems for customers that have used longer the 48 character keyspace and cf names. How will this affect the upgrade path for those customers?
        Hide
        Jonathan Ellis added a comment -

        Did some more research on path limitations:

        NTFS is technically okay with paths up to 32K long[1], but the windows api is limited to 256[2]. Common Linux filesystems have a limit of 255 bytes per path component (i.e. directory or filename) but no total path limit. However, Linux defines PATH_MAX and FILENAME_MAX, both 4096. [3]

        [1] http://en.wikipedia.org/wiki/Comparison_of_file_systems
        [2] http://msdn.microsoft.com/en-us/library/aa365247.aspx
        [3] http://serverfault.com/questions/9546/filename-length-limits-on-linux

        In short: restricting KS and CF names to 32 characters is a good idea for the benefit of Windows portability. However, we may want to exempt Linux systems from the startup length check to allow easier upgrades.

        Show
        Jonathan Ellis added a comment - Did some more research on path limitations: NTFS is technically okay with paths up to 32K long [1] , but the windows api is limited to 256 [2] . Common Linux filesystems have a limit of 255 bytes per path component (i.e. directory or filename) but no total path limit. However, Linux defines PATH_MAX and FILENAME_MAX, both 4096. [3] [1] http://en.wikipedia.org/wiki/Comparison_of_file_systems [2] http://msdn.microsoft.com/en-us/library/aa365247.aspx [3] http://serverfault.com/questions/9546/filename-length-limits-on-linux In short: restricting KS and CF names to 32 characters is a good idea for the benefit of Windows portability. However, we may want to exempt Linux systems from the startup length check to allow easier upgrades.
        Hide
        Sylvain Lebresne added a comment -

        Committed with nits above fixed, thanks Pavel.

        Show
        Sylvain Lebresne added a comment - Committed with nits above fixed, thanks Pavel.
        Hide
        Pavel Yaskevich added a comment -

        +1 with last nit - Directories.SSTableLister methods skip

        {Compacted,Temporary}

        (boolean) and includeBackups(boolean) ignore argument and always set corresponding option to "true".

        Show
        Pavel Yaskevich added a comment - +1 with last nit - Directories.SSTableLister methods skip {Compacted,Temporary} (boolean) and includeBackups(boolean) ignore argument and always set corresponding option to "true".
        Hide
        Sylvain Lebresne added a comment -

        Rebased patches attached.

        Show
        Sylvain Lebresne added a comment - Rebased patches attached.
        Hide
        Pavel Yaskevich added a comment -

        That my be the case I will re-test as part of the review anyway.

        Show
        Pavel Yaskevich added a comment - That my be the case I will re-test as part of the review anyway.
        Hide
        Sylvain Lebresne added a comment -

        Weird. I just tried the same scenario and everything worked correctly. I should mention that when moving the snapshots/backups, the migration process rename them to the new filename convention, so they will be called Keyspace1-Standard1.Idx-*. Or maybe I fixed it with the last version of the patch without realizing it.

        Show
        Sylvain Lebresne added a comment - Weird. I just tried the same scenario and everything worked correctly. I should mention that when moving the snapshots/backups, the migration process rename them to the new filename convention, so they will be called Keyspace1-Standard1.Idx-*. Or maybe I fixed it with the last version of the patch without realizing it.
        Hide
        Pavel Yaskevich added a comment -

        I'm not sure I see what this one is. Are we talking of the migration process?

        I was testing it like this :

        1. run 1.1 without modifications
        2. ./tools/stress/bin/stress -n 50000 -S 512 -x KEYS
        3. ./bin/nodetool -h localhost flush Keyspace1 Standard1
        4. ./bin/nodetool -h localhost snapshot Keyspace1
        5. made sure that Standard1.Idx-* SSTables are in the snapshots/<timestamp> directory
        6. run 1.1 with you patch applied
        7. checked if snapshots directory was moved and what files did it include - it was lucking Standard1.Idx-* files
        8. cleaned data directory
        9. repeated steps 1 - 5 but this time with your patch applied and it didn't include Standard1.Idx-* into snapshot

        Maybe it is more "natural" to have secondary indexes sstables be in the same directory than the base cfs?

        +1

        Show
        Pavel Yaskevich added a comment - I'm not sure I see what this one is. Are we talking of the migration process? I was testing it like this : run 1.1 without modifications ./tools/stress/bin/stress -n 50000 -S 512 -x KEYS ./bin/nodetool -h localhost flush Keyspace1 Standard1 ./bin/nodetool -h localhost snapshot Keyspace1 made sure that Standard1.Idx-* SSTables are in the snapshots/<timestamp> directory run 1.1 with you patch applied checked if snapshots directory was moved and what files did it include - it was lucking Standard1.Idx-* files cleaned data directory repeated steps 1 - 5 but this time with your patch applied and it didn't include Standard1.Idx-* into snapshot Maybe it is more "natural" to have secondary indexes sstables be in the same directory than the base cfs? +1
        Hide
        Sylvain Lebresne added a comment -

        Attaching rebased patches, with a 3rd patches (0003-Fixes.patch) addressing the Pavel's remarks. More specifically:

        o.a.c.db.Directories comment should be updated because it still uses SSTable file name without keyspace.

        Fixed, thanks

        o.a.c.io.sstable.SSTableReaderTest won't compile

        Sorry, I forgot to check the test after a last rebase, fixed too (this involved renaming a number of sstables from test/data/legacy-sstables/hb to include the keyspace name, so that specific change is in the 2nd 'fix unit tests' patch to avoid polluting the 3rd one).

        if you start with empty data directory you get following exception and process exits

        Fixed. I've actually made two modifications: the migration checks the existence of the directory to avoid the NPE during listFiles(), but I've also modified the 'should we migrate' check to detect new nodes (checking if the system keyspace directory exists) and thus not print the migration message at all.

        on snapshot doesn't create or move (from older schema) index SSTables related to CF

        I'm not sure I see what this one is. Are we talking of the migration process?

        In any case, you made me think about secondary indexes. Maybe it is more "natural" to have secondary indexes sstables be in the same directory than the base cfs? Since the indexes name is not really something exposed (granted you don't have to be a genius to figure it out), it feels like it would slightly simplify administration to not put them in a separate directory.

        I've updated the patch to implement this last idea (so indexes are in the same directory than their base cf), but it would be nice to have multiple opinions on that move since we don't want to have to do a new migration in 6 month because "we've changed our mind".

        shouldn't old "snapshots" directory be removed after move?

        Your right, fixed (for backups too).

        Show
        Sylvain Lebresne added a comment - Attaching rebased patches, with a 3rd patches (0003-Fixes.patch) addressing the Pavel's remarks. More specifically: o.a.c.db.Directories comment should be updated because it still uses SSTable file name without keyspace. Fixed, thanks o.a.c.io.sstable.SSTableReaderTest won't compile Sorry, I forgot to check the test after a last rebase, fixed too (this involved renaming a number of sstables from test/data/legacy-sstables/hb to include the keyspace name, so that specific change is in the 2nd 'fix unit tests' patch to avoid polluting the 3rd one). if you start with empty data directory you get following exception and process exits Fixed. I've actually made two modifications: the migration checks the existence of the directory to avoid the NPE during listFiles(), but I've also modified the 'should we migrate' check to detect new nodes (checking if the system keyspace directory exists) and thus not print the migration message at all. on snapshot doesn't create or move (from older schema) index SSTables related to CF I'm not sure I see what this one is. Are we talking of the migration process? In any case, you made me think about secondary indexes. Maybe it is more "natural" to have secondary indexes sstables be in the same directory than the base cfs? Since the indexes name is not really something exposed (granted you don't have to be a genius to figure it out), it feels like it would slightly simplify administration to not put them in a separate directory. I've updated the patch to implement this last idea (so indexes are in the same directory than their base cf), but it would be nice to have multiple opinions on that move since we don't want to have to do a new migration in 6 month because "we've changed our mind". shouldn't old "snapshots" directory be removed after move? Your right, fixed (for backups too).
        Hide
        Pavel Yaskevich added a comment -

        Thanks, Sylvain! We are getting really close

        Here are problems I found:

        • o.a.c.db.Directories comment should be updated because it still uses SSTable file name without keyspace.
        • o.a.c.io.sstable.SSTableReaderTest won't compile
        • if you start with empty data directory you get following exception and process exits
           INFO 23:51:11,987 Upgrade from pre-1.1 version detected: migrating sstables to new directory layout
          ERROR 23:51:11,988 Exception encountered during startup
          java.lang.NullPointerException
          	at org.apache.cassandra.db.Directories.migrateSSTables(Directories.java:416)
          	at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:164)
          	at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:360)
          	at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
          java.lang.NullPointerException
          	at org.apache.cassandra.db.Directories.migrateSSTables(Directories.java:416)
          	at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:164)
          	at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:360)
          	at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
          Exception encountered during startup: null
          
        • on snapshot doesn't create or move (from older schema) index SSTables related to CF
        • shouldn't old "snapshots" directory be removed after move?
        Show
        Pavel Yaskevich added a comment - Thanks, Sylvain! We are getting really close Here are problems I found: o.a.c.db.Directories comment should be updated because it still uses SSTable file name without keyspace. o.a.c.io.sstable.SSTableReaderTest won't compile if you start with empty data directory you get following exception and process exits INFO 23:51:11,987 Upgrade from pre-1.1 version detected: migrating sstables to new directory layout ERROR 23:51:11,988 Exception encountered during startup java.lang.NullPointerException at org.apache.cassandra.db.Directories.migrateSSTables(Directories.java:416) at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:164) at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:360) at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107) java.lang.NullPointerException at org.apache.cassandra.db.Directories.migrateSSTables(Directories.java:416) at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:164) at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:360) at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107) Exception encountered during startup: null on snapshot doesn't create or move (from older schema) index SSTables related to CF shouldn't old "snapshots" directory be removed after move?
        Hide
        Sylvain Lebresne added a comment -

        Patch attached (0001-2749-v2.patch and 0002-fix-unit-tests-v2.patch) that:

        1. only support the new layout with sstables in cf directories
        2. migrate automagically files the first time
        3. adds the keyspace into the file name (to not rely on the directory sstables are in)
        4. limit keyspace and cf names to 32 characters

        During the initial upgrade, we also check if the user would be fine with the current keyspace and cf name. If not, we just refuse to start. I hope this won't happen to anyone because renaming a CF/Keyspace in a rolling fashion is not something fun (or even possible for that matter). Note that this check don't fully enforce the 32 chars limitation however but rather tries to be as permissive as possible checking that any file path resulting for the upgrade should be less than 256 chars (the windows limit).

        PS: the first patch is the bulk of the change, the second one is the unit tests. The later is huge because it renames the sstables in test/data to include the keyspace name in them.

        Show
        Sylvain Lebresne added a comment - Patch attached (0001-2749-v2.patch and 0002-fix-unit-tests-v2.patch) that: only support the new layout with sstables in cf directories migrate automagically files the first time adds the keyspace into the file name (to not rely on the directory sstables are in) limit keyspace and cf names to 32 characters During the initial upgrade, we also check if the user would be fine with the current keyspace and cf name. If not, we just refuse to start. I hope this won't happen to anyone because renaming a CF/Keyspace in a rolling fashion is not something fun (or even possible for that matter). Note that this check don't fully enforce the 32 chars limitation however but rather tries to be as permissive as possible checking that any file path resulting for the upgrade should be less than 256 chars (the windows limit). PS: the first patch is the bulk of the change, the second one is the unit tests. The later is huge because it renames the sstables in test/data to include the keyspace name in them.
        Hide
        Jonathan Ellis added a comment -

        It might be worth adding a "are my filenames going to be too large" check against all KS + CF combinations before starting to migrate data files around, though. It would suck to end up with a partially converted database if some short CF names complete early on, before erroring out on a long one.

        Show
        Jonathan Ellis added a comment - It might be worth adding a "are my filenames going to be too large" check against all KS + CF combinations before starting to migrate data files around, though. It would suck to end up with a partially converted database if some short CF names complete early on, before erroring out on a long one.
        Hide
        Sylvain Lebresne added a comment -

        guess we need to supply a tool to rename sstables files if anyone is on longer names?

        We probably don't need to do anything. I don't think anyone is really using names long enough for them to it the file system limit, the goal of limiting the names is just so to prevent this from happening but there will be no other assumption that the names are short from the code.

        I also don't think anything will prevent rolling upgrades, do you had something in mind?

        Note: I have a long flight ahead of me so I plan to update my last patch with both those changes, as I still like the moving of all the directories handling in a dedicated class, even if we don't support both layout.

        Show
        Sylvain Lebresne added a comment - guess we need to supply a tool to rename sstables files if anyone is on longer names? We probably don't need to do anything. I don't think anyone is really using names long enough for them to it the file system limit, the goal of limiting the names is just so to prevent this from happening but there will be no other assumption that the names are short from the code. I also don't think anything will prevent rolling upgrades, do you had something in mind? Note: I have a long flight ahead of me so I plan to update my last patch with both those changes, as I still like the moving of all the directories handling in a dedicated class, even if we don't support both layout.
        Hide
        Pavel Yaskevich added a comment -

        I think if we would be just supporting new style layout we can convert on startup. +1 on both ideas tho.

        Show
        Pavel Yaskevich added a comment - I think if we would be just supporting new style layout we can convert on startup. +1 on both ideas tho.
        Hide
        Marcus Eriksson added a comment -

        sounds great (both just supporting new-style-layout and limiting names to 32chars)

        guess we need to supply a tool to rename sstables files if anyone is on longer names? and rolling upgrades are out of the question then right? (maybe the already are?)

        Show
        Marcus Eriksson added a comment - sounds great (both just supporting new-style-layout and limiting names to 32chars) guess we need to supply a tool to rename sstables files if anyone is on longer names? and rolling upgrades are out of the question then right? (maybe the already are?)
        Hide
        Jonathan Ellis added a comment -

        +1 limiting to 32

        Show
        Jonathan Ellis added a comment - +1 limiting to 32
        Hide
        Sylvain Lebresne added a comment -

        On the other hand, we allow arbitrary-length KS + CF names (up to 64K iirc) so UUID aside we're already in trouble on ext3/ext4, xfs, and ntfs, which all support max filename length of ~256. I'm starting to think we should move these into the metadata component instead of the filename.

        The thing with the metadata component is that from a code perspective, there is lots of places where we want to create a Descriptor, which involves extracting the keyspace/cf names only based on the filename. Adding the necessity to locate and read the metadata in those places will likely don't be very fun.

        So I'd be in favor of just limiting the keyspace and column family names. It's one for which there is no real point to have very long names. Limiting each one to 32 characters shouldn't be a strong limitation.

        Show
        Sylvain Lebresne added a comment - On the other hand, we allow arbitrary-length KS + CF names (up to 64K iirc) so UUID aside we're already in trouble on ext3/ext4, xfs, and ntfs, which all support max filename length of ~256. I'm starting to think we should move these into the metadata component instead of the filename. The thing with the metadata component is that from a code perspective, there is lots of places where we want to create a Descriptor, which involves extracting the keyspace/cf names only based on the filename. Adding the necessity to locate and read the metadata in those places will likely don't be very fun. So I'd be in favor of just limiting the keyspace and column family names. It's one for which there is no real point to have very long names. Limiting each one to 32 characters shouldn't be a strong limitation.
        Hide
        Jonathan Ellis added a comment -

        we cannot stream between two nodes, one using separate cf directory

        I don't see any reason to continue to support the old-style directory layout. That adds complexity (operationally as well as in the code) for no benefit that I can think of. I think we should migrate from old layout to new on the first startup under 1.1.

        regarding keyspaces in file names, sure, why not, guess having a header with this info in the file is out of the question, then the only meta data we have is the file name, right? A problem could be if we want to do CASSANDRA-1983 later, that would increase the file name length even more

        I'm on the fence here – on the one hand having ks + cf in the filename simplifies some things. On the other hand, we allow arbitrary-length KS + CF names (up to 64K iirc) so UUID aside we're already in trouble on ext3/ext4, xfs, and ntfs, which all support max filename length of ~256. I'm starting to think we should move these into the metadata component instead of the filename.

        Show
        Jonathan Ellis added a comment - we cannot stream between two nodes, one using separate cf directory I don't see any reason to continue to support the old-style directory layout. That adds complexity (operationally as well as in the code) for no benefit that I can think of. I think we should migrate from old layout to new on the first startup under 1.1. regarding keyspaces in file names, sure, why not, guess having a header with this info in the file is out of the question, then the only meta data we have is the file name, right? A problem could be if we want to do CASSANDRA-1983 later, that would increase the file name length even more I'm on the fence here – on the one hand having ks + cf in the filename simplifies some things. On the other hand, we allow arbitrary-length KS + CF names (up to 64K iirc) so UUID aside we're already in trouble on ext3/ext4, xfs, and ntfs, which all support max filename length of ~256. I'm starting to think we should move these into the metadata component instead of the filename.
        Hide
        Sylvain Lebresne added a comment -

        i had on algorithm for detecting what kind of file it is in the old backwards compatible patch, it iterates over all data directories and figures out which data directory the file is in, then it knows the keyspace is the next part of the filename, and can check if there are files or directories in that directory.

        I tried that too, but this doesn't work when say you're opening a sstable by the sstableloader, because the files are not in a data directories then.

        Show
        Sylvain Lebresne added a comment - i had on algorithm for detecting what kind of file it is in the old backwards compatible patch, it iterates over all data directories and figures out which data directory the file is in, then it knows the keyspace is the next part of the filename, and can check if there are files or directories in that directory. I tried that too, but this doesn't work when say you're opening a sstable by the sstableloader, because the files are not in a data directories then.
        Hide
        Marcus Eriksson added a comment -

        agree that the file structure is all over the place, that's why this patch was so incredibly painful to do

        seems i only did the disk space checking in sub directories in the backwards compatible patch

        i had on algorithm for detecting what kind of file it is in the old backwards compatible patch, it iterates over all data directories and figures out which data directory the file is in, then it knows the keyspace is the next part of the filename, and can check if there are files or directories in that directory.

        but i agree, it is incredibly ugly, your modifications make it alot better, and yes, this would be a good time to do this properly, not like me just trying to do a minimal patch that "works".

        regarding keyspaces in file names, sure, why not, guess having a header with this info in the file is out of the question, then the only meta data we have is the file name, right? A problem could be if we want to do CASSANDRA-1983 later, that would increase the file name length even more.

        Show
        Marcus Eriksson added a comment - agree that the file structure is all over the place, that's why this patch was so incredibly painful to do seems i only did the disk space checking in sub directories in the backwards compatible patch i had on algorithm for detecting what kind of file it is in the old backwards compatible patch, it iterates over all data directories and figures out which data directory the file is in, then it knows the keyspace is the next part of the filename, and can check if there are files or directories in that directory. but i agree, it is incredibly ugly, your modifications make it alot better, and yes, this would be a good time to do this properly, not like me just trying to do a minimal patch that "works". regarding keyspaces in file names, sure, why not, guess having a header with this info in the file is out of the question, then the only meta data we have is the file name, right? A problem could be if we want to do CASSANDRA-1983 later, that would increase the file name length even more.
        Hide
        Sylvain Lebresne added a comment -

        I've only look at the last posted patchset (2749_proper.tar.gz) but I've found the following few problems:

        • The disk space free space checks doesn't take the new layout into account. In other words, it always checks the keyspace directory for free space. But in case we use CF directories, the keyspace directory could likely be on a disk with almost no freespace (and in any case, this won't always be the right disk to check).
        • The check that a snapshot is existing and the clearing of snapshot is broken for the CF directories layout (in Table.java)
        • The leveled manifest is not put into it's CF directory (it follows that its snapshot is not in the right place either).

        I'm also not a fan of having the directory argument given to the Descriptor constructors not be the actual directory of the sstable it describes. I think it's error prone, making Descriptor harder to use (the fact that Descriptor.getDirectory() doesn't necessarily returns the directory you've fed to the constructor is confusing).

        More generally, the current code dealing with the paths to the data directories is currently very badly encapsulated (not this patch fault at all). Which means that adding a new "parallel" data directory layout leaks everywhere and make the code hard to read/maintain. I strongly believe we should use the opportunity to improve that code encapsulation.

        So attaching a patch that refactor a bit more the code to improve encapsulation. It essentially consists in adding a new Directories class that abstracts the actual layout of the data directory (including the fact that we can use multiple locations, which imho greatly clean the code). Aside from that, the patch reuse the sstablemover of the preceding patch (with an added verbose option). The patch also includes fairly extensive unit tests for the new class.

        Now there is one thing that really bother me with this patch (the preceding patches have the same problem). It's the fact that the Descriptor.fromFilename method needs to query the global useSeparateCFDirectories option to work correctly. This has a number of practical consequences:

        1. it is the reason for a number of super annoying gymnastic in the unit tests, like the need to duplicate the sstables used by LegacySSTableTest for instance. I've put all of this gymnastic in a separate patch (0002-fix-unit-tests.patch).
        2. this means we cannot stream between two nodes, one using separate cf directory, one that don't (that limitation could be solve independently for streaming only, but it's worth considering a more general fix imo, in particular because a specific fix would be a tad ugly).
        3. tools like the sstableloader will require that you put the sstable to load into a ksname/cfname/ directory when you use the separate directory option. It's not a big deal, but it feels arbitrary and annoying.

        At first, I had hoped that we could make Descriptor.fromFilename 'detect' whether or not we were using the separate directory layout by using the fact that the file name contains the column family name already, but that doesn't work when the ksname == cfname (which is allowed afaik).

        One option would be to stop relying on the directory structure the sstable is in to detect the keyspace name. We could do that if we included the keyspace name in the filename for instance. The pros would be to fix all the problem listed above, and in particular we would be able to allow stuffs like sstableloader <filename>, instead of requiring that the files are put inside a specifically name directory, which would be nice. I don't see much cons except making the file names longer. Any opinion on that idea?

        Show
        Sylvain Lebresne added a comment - I've only look at the last posted patchset (2749_proper.tar.gz) but I've found the following few problems: The disk space free space checks doesn't take the new layout into account. In other words, it always checks the keyspace directory for free space. But in case we use CF directories, the keyspace directory could likely be on a disk with almost no freespace (and in any case, this won't always be the right disk to check). The check that a snapshot is existing and the clearing of snapshot is broken for the CF directories layout (in Table.java) The leveled manifest is not put into it's CF directory (it follows that its snapshot is not in the right place either). I'm also not a fan of having the directory argument given to the Descriptor constructors not be the actual directory of the sstable it describes. I think it's error prone, making Descriptor harder to use (the fact that Descriptor.getDirectory() doesn't necessarily returns the directory you've fed to the constructor is confusing). More generally, the current code dealing with the paths to the data directories is currently very badly encapsulated (not this patch fault at all). Which means that adding a new "parallel" data directory layout leaks everywhere and make the code hard to read/maintain. I strongly believe we should use the opportunity to improve that code encapsulation. So attaching a patch that refactor a bit more the code to improve encapsulation. It essentially consists in adding a new Directories class that abstracts the actual layout of the data directory (including the fact that we can use multiple locations, which imho greatly clean the code). Aside from that, the patch reuse the sstablemover of the preceding patch (with an added verbose option). The patch also includes fairly extensive unit tests for the new class. Now there is one thing that really bother me with this patch (the preceding patches have the same problem). It's the fact that the Descriptor.fromFilename method needs to query the global useSeparateCFDirectories option to work correctly. This has a number of practical consequences: it is the reason for a number of super annoying gymnastic in the unit tests, like the need to duplicate the sstables used by LegacySSTableTest for instance. I've put all of this gymnastic in a separate patch (0002-fix-unit-tests.patch). this means we cannot stream between two nodes, one using separate cf directory, one that don't (that limitation could be solve independently for streaming only, but it's worth considering a more general fix imo, in particular because a specific fix would be a tad ugly). tools like the sstableloader will require that you put the sstable to load into a ksname/cfname/ directory when you use the separate directory option. It's not a big deal, but it feels arbitrary and annoying. At first, I had hoped that we could make Descriptor.fromFilename 'detect' whether or not we were using the separate directory layout by using the fact that the file name contains the column family name already, but that doesn't work when the ksname == cfname (which is allowed afaik). One option would be to stop relying on the directory structure the sstable is in to detect the keyspace name. We could do that if we included the keyspace name in the filename for instance. The pros would be to fix all the problem listed above, and in particular we would be able to allow stuffs like sstableloader <filename> , instead of requiring that the files are put inside a specifically name directory, which would be nice. I don't see much cons except making the file names longer. Any opinion on that idea?
        Hide
        Marcus Eriksson added a comment -

        updated, of course mv should be used

        rebased as well

        Show
        Marcus Eriksson added a comment - updated, of course mv should be used rebased as well
        Hide
        Pavel Yaskevich added a comment -

        +1 first few additions:

        • You forgot to set `chmod +x ./bin/sstablemover`.
        • Good idea would be to use `mv` instead of `cp` when -d option is specified.
        Show
        Pavel Yaskevich added a comment - +1 first few additions: You forgot to set `chmod +x ./bin/sstablemover`. Good idea would be to use `mv` instead of `cp` when -d option is specified.
        Hide
        Marcus Eriksson added a comment -

        updated;

        • sstablemover can now reverse changes (-r) and delete old files (-d).
        • LegacySSTableTest fixed, hopefully
        • added comments to cassandra.yaml
        Show
        Marcus Eriksson added a comment - updated; sstablemover can now reverse changes (-r) and delete old files (-d). LegacySSTableTest fixed, hopefully added comments to cassandra.yaml
        Hide
        Pavel Yaskevich added a comment -

        Thanks a lot for taking care about this, Marcus!

        I have tested your latest patch using stress and sstablemover and then stress (+ flush/scrub/snapshot) again for Standard (+ secondary indexes) and Super ColumnFamilies and everything worked just fine for me. Here are my last comments and I think after this gets done we are ready to push your changes:

        1). a). Tool should be made more user-friendly by adding at least --help and add an option to move sstables back to the keyspace directory (as we have an two states in the config);
        b). I think it should delete old sstables after copy automatically or by asking user using --delete-old option?
        c). Move sstablemover tool the ./bin instead of ./tools/sstablemover.

        2). LegacySSTableTest fails on my machine can you, please, check that?

        3). I think we should extend comment in the config describing what should be done before it is safe to change to change the option e.g. "Note: before setting this option to 'true' you should move existing SSTable files to the separate directories manually by using ./bin/sstablemover tool, the same applies for 'false' state - ./bin/sstablemover could be used to restore original directory structure (Please use ./bin/sstablemover --help for help and information)."...

        Also please remove/implement remaining // TODO: comments.

        Show
        Pavel Yaskevich added a comment - Thanks a lot for taking care about this, Marcus! I have tested your latest patch using stress and sstablemover and then stress (+ flush/scrub/snapshot) again for Standard (+ secondary indexes) and Super ColumnFamilies and everything worked just fine for me. Here are my last comments and I think after this gets done we are ready to push your changes: 1). a). Tool should be made more user-friendly by adding at least --help and add an option to move sstables back to the keyspace directory (as we have an two states in the config); b). I think it should delete old sstables after copy automatically or by asking user using --delete-old option? c). Move sstablemover tool the ./bin instead of ./tools/sstablemover. 2). LegacySSTableTest fails on my machine can you, please, check that? 3). I think we should extend comment in the config describing what should be done before it is safe to change to change the option e.g. "Note: before setting this option to 'true' you should move existing SSTable files to the separate directories manually by using ./bin/sstablemover tool, the same applies for 'false' state - ./bin/sstablemover could be used to restore original directory structure (Please use ./bin/sstablemover --help for help and information)."... Also please remove/implement remaining // TODO: comments.
        Hide
        Marcus Eriksson added a comment -

        ok, actually touched it again, a non-backwards compatible patch with a tool to move(copy) the files into subdirs

        script is in tools/sstablemover

        Show
        Marcus Eriksson added a comment - ok, actually touched it again, a non-backwards compatible patch with a tool to move(copy) the files into subdirs script is in tools/sstablemover
        Hide
        Marcus Eriksson added a comment -

        ok, im not touching this anymore, code is there if anyone wants it

        Show
        Marcus Eriksson added a comment - ok, im not touching this anymore, code is there if anyone wants it
        Hide
        Pavel Yaskevich added a comment -

        I don't really like the idea of keeping a map of all generates for each of ks/cf and updating it all the time (this also implies that we need to resize it after cf is created/dropped). After re-thinking this once again it feels like we really should just make a tool to transfer SSTable to separate directories and back, make a flag in config that will indicate old/new style of file structure, because I don't see a error-prone way to handle this when old/new style SSTables are mixed...

        Show
        Pavel Yaskevich added a comment - I don't really like the idea of keeping a map of all generates for each of ks/cf and updating it all the time (this also implies that we need to resize it after cf is created/dropped). After re-thinking this once again it feels like we really should just make a tool to transfer SSTable to separate directories and back, make a flag in config that will indicate old/new style of file structure, because I don't see a error-prone way to handle this when old/new style SSTables are mixed...
        Hide
        Marcus Eriksson added a comment -

        4 patches, one is the same as the v4, the other three patches are

        • change configuration option name
        • make LegacySSTable test work by telling Descriptor that the file is old-style
        • make incremental backups work
        • problem was when migrating from old-style to new-style, the generation-counter in ColumnFamilyStore generated clashing ids, which was not a problem until it tried to hard link the files to the same directory with the same name. This should be refactored in CASSANDRA-1983 (i'll give that a try when this is merged)
        Show
        Marcus Eriksson added a comment - 4 patches, one is the same as the v4, the other three patches are change configuration option name make LegacySSTable test work by telling Descriptor that the file is old-style make incremental backups work problem was when migrating from old-style to new-style, the generation-counter in ColumnFamilyStore generated clashing ids, which was not a problem until it tried to hard link the files to the same directory with the same name. This should be refactored in CASSANDRA-1983 (i'll give that a try when this is merged)
        Hide
        Pavel Yaskevich added a comment -

        Tested v4 on trunk and I see few test failures - LegacySSTableTest and ColumnFamilyStoreTest, it is related to how you determine what placement style is used for a given SSTable, feels like !DatabaseDescriptor.useSeparateCFDirectories() condition is insufficient. Also I think it would be correct to rename "one_directory_per_column_family" option to "use_separate_column_family_directories" with comment like "Useful when you have disks with different speeds (HDD/SSD) and you want explicitly distribute Column Families between them".

        Show
        Pavel Yaskevich added a comment - Tested v4 on trunk and I see few test failures - LegacySSTableTest and ColumnFamilyStoreTest, it is related to how you determine what placement style is used for a given SSTable, feels like !DatabaseDescriptor.useSeparateCFDirectories() condition is insufficient. Also I think it would be correct to rename "one_directory_per_column_family" option to "use_separate_column_family_directories" with comment like "Useful when you have disks with different speeds (HDD/SSD) and you want explicitly distribute Column Families between them".
        Hide
        Marcus Eriksson added a comment - - edited

        rebased after CASSANDRA-3464, renameSSTable code removed

        Show
        Marcus Eriksson added a comment - - edited rebased after CASSANDRA-3464 , renameSSTable code removed
        Hide
        Marcus Eriksson added a comment -

        v4 attached

        • renamed DatabaseDescriptor method
        • made file name checking case sensitive again...
        • removed unused rename code (following the renameSSTables call path, it looks very strange, even in trunk, ill fix this after this patch is merged)
        • cant figure out a way to make snapshotExists any better, im looking for a directory with the snapshot tag in all places, not sure if it could be wrong here (even if it is quite naive)
        • removed dead todos
        Show
        Marcus Eriksson added a comment - v4 attached renamed DatabaseDescriptor method made file name checking case sensitive again... removed unused rename code (following the renameSSTables call path, it looks very strange, even in trunk, ill fix this after this patch is merged) cant figure out a way to make snapshotExists any better, im looking for a directory with the snapshot tag in all places, not sure if it could be wrong here (even if it is quite naive) removed dead todos
        Hide
        Pavel Yaskevich added a comment -

        v3 looks good (needs rebase), here is my comments:

        • DatabaseDescriptor.getPerCFDirectory() should be renamed to something like "useSeparateCFDirectories"
        • in the ColumnFamilyStory.accept method file name should be checked depending on case because we can CF names are case-sensitive.
        • we don't have ColumnFamilyStore.rename method anymore which is a good thing for this patch, you can just rebase and remove related code.
        • in the Table.snapshotExists I think we should be more careful determining if snapshot actually exists.
        • TODO should be cleaned up.

        Wish: if it's possible I think we can remove ColumnFamily name from the SSTable files if those are in the CF directory already.

        I think we are almost done in here.

        Show
        Pavel Yaskevich added a comment - v3 looks good (needs rebase), here is my comments: DatabaseDescriptor.getPerCFDirectory() should be renamed to something like "useSeparateCFDirectories" in the ColumnFamilyStory.accept method file name should be checked depending on case because we can CF names are case-sensitive. we don't have ColumnFamilyStore.rename method anymore which is a good thing for this patch, you can just rebase and remove related code. in the Table.snapshotExists I think we should be more careful determining if snapshot actually exists. TODO should be cleaned up. Wish: if it's possible I think we can remove ColumnFamily name from the SSTable files if those are in the CF directory already. I think we are almost done in here.
        Hide
        Marcus Eriksson added a comment -

        third version, this is could be considered feature-complete, changes compared to v2 of the patch:

        • make sure disk space is calculated correctly
        • if there is a subdirectory in the datadirectory with the same name as the CF, check in that directory, otherwise, check in the datadirectory (the subdirectory will be created)
        • add config param
        • add test for RecursiveGlob
        • make RecursiveGlob ignore case when comparing file names
        • add test for scrubbing old style CFs to new style CFs (or, modify an existing test to check this too)
        Show
        Marcus Eriksson added a comment - third version, this is could be considered feature-complete, changes compared to v2 of the patch: make sure disk space is calculated correctly if there is a subdirectory in the datadirectory with the same name as the CF, check in that directory, otherwise, check in the datadirectory (the subdirectory will be created) add config param add test for RecursiveGlob make RecursiveGlob ignore case when comparing file names add test for scrubbing old style CFs to new style CFs (or, modify an existing test to check this too)
        Hide
        Marcus Eriksson added a comment -

        Biggest part left is figuring out how to estimate disk space to know where to write an sstable, ill work on that this weekend

        Other than that small cleanups (codestyle), unit tests and adding the config parameter

        Show
        Marcus Eriksson added a comment - Biggest part left is figuring out how to estimate disk space to know where to write an sstable, ill work on that this weekend Other than that small cleanups (codestyle), unit tests and adding the config parameter
        Hide
        Pavel Yaskevich added a comment -

        I forgot to mention - would be great to see a test for RecursiveGlob and please put a license banner on top of that file like we do everywhere else.

        Show
        Pavel Yaskevich added a comment - I forgot to mention - would be great to see a test for RecursiveGlob and please put a license banner on top of that file like we do everywhere else.
        Hide
        Pavel Yaskevich added a comment -

        Good, can you briefly tell what is left? I see that ColumnFamilyStoreTest has an commented assertion and code still have some TODOs. Also please make sure you follow your CodeStyle and put space after if/for/while/... statements...

        Show
        Pavel Yaskevich added a comment - Good, can you briefly tell what is left? I see that ColumnFamilyStoreTest has an commented assertion and code still have some TODOs. Also please make sure you follow your CodeStyle and put space after if/for/while/... statements...
        Hide
        Marcus Eriksson added a comment -

        ah that is a leftover from my first non-backwards-compatible patch

        it works even without the cf in the new File(...)

        Show
        Marcus Eriksson added a comment - ah that is a leftover from my first non-backwards-compatible patch it works even without the cf in the new File(...)
        Hide
        Pavel Yaskevich added a comment -

        Looks better now but I don't like how you changed DescriptorTest especially testLegacy() "userActionUtilsKey" there is meant to be a CF name, so it seems like current version does not work correctly?

        Show
        Pavel Yaskevich added a comment - Looks better now but I don't like how you changed DescriptorTest especially testLegacy() "userActionUtilsKey" there is meant to be a CF name, so it seems like current version does not work correctly?
        Hide
        Pavel Yaskevich added a comment -

        Sorry Marcus I had no time to look at this but I will by the end of this week.

        Show
        Pavel Yaskevich added a comment - Sorry Marcus I had no time to look at this but I will by the end of this week.
        Hide
        Marcus Eriksson added a comment -

        Anyone got a suggestion on how to do the available disk space checking? Reason in that now we cannot simply check which dataFileLocation that has the most available disk space, we need to check the subdirs where files are actually written.

        My naive first approach would be to simply change DatabaseDescriptor.getDataFileLocationForTable(..) to take a column family name, which we ought to always know when calling that method (and a quick check seems to say that only compactions might prove a problem for this approach).

        Comments?

        Show
        Marcus Eriksson added a comment - Anyone got a suggestion on how to do the available disk space checking? Reason in that now we cannot simply check which dataFileLocation that has the most available disk space, we need to check the subdirs where files are actually written. My naive first approach would be to simply change DatabaseDescriptor.getDataFileLocationForTable(..) to take a column family name, which we ought to always know when calling that method (and a quick check seems to say that only compactions might prove a problem for this approach). Comments?
        Hide
        Jonathan Ellis added a comment -

        Yes, trunk is the right place for this.

        Show
        Jonathan Ellis added a comment - Yes, trunk is the right place for this.
        Hide
        Marcus Eriksson added a comment -

        Updated patch, working snapshots and incremental backups

        also a bit of cleanups

        btw, the patches are always against trunk, not against the older patches, is this the correct way to do it?

        Show
        Marcus Eriksson added a comment - Updated patch, working snapshots and incremental backups also a bit of cleanups btw, the patches are always against trunk, not against the older patches, is this the correct way to do it?
        Hide
        Marcus Eriksson added a comment -

        First version of a backwards compatible patch.

        Approach is:

        • On startup, recursively glob all data directories to find existing sstables
        • whenever generating new files, look at the setting in the config, and put the file there
        • probably more, cant remember, my brain is messed up after fighting with this for a few days, look at patch!

        this gives that a user can enable/disable the setting without mocking around manually with the sstables.

        Comments? Yes, the setting is still hardcoded as a marker that this is not ready to be merged

        Show
        Marcus Eriksson added a comment - First version of a backwards compatible patch. Approach is: On startup, recursively glob all data directories to find existing sstables whenever generating new files, look at the setting in the config, and put the file there probably more, cant remember, my brain is messed up after fighting with this for a few days, look at patch! this gives that a user can enable/disable the setting without mocking around manually with the sstables. Comments? Yes, the setting is still hardcoded as a marker that this is not ready to be merged
        Hide
        Pavel Yaskevich added a comment -

        On my opinion it's more user-friendly if we support old directory structure for existing files and let compaction do replace for us.

        Show
        Pavel Yaskevich added a comment - On my opinion it's more user-friendly if we support old directory structure for existing files and let compaction do replace for us.
        Hide
        Marcus Eriksson added a comment -

        agree, gonna look at how much pain it is to actually build though

        wont need alot of downtime if it is done offline, can rsync files into the subdir right before stopping the node

        Show
        Marcus Eriksson added a comment - agree, gonna look at how much pain it is to actually build though wont need alot of downtime if it is done offline, can rsync files into the subdir right before stopping the node
        Hide
        Jonathan Ellis added a comment -

        I don't think taking compression as a model makes sense. With compression, it's fine to set the option globally and each node can start using compression when it gets the schema update. But this affects the storage engine "below" schema.

        I think it's reasonable to make the change offline, especially if an upgrade tool is provided to make it simple.

        Show
        Jonathan Ellis added a comment - I don't think taking compression as a model makes sense. With compression, it's fine to set the option globally and each node can start using compression when it gets the schema update. But this affects the storage engine "below" schema. I think it's reasonable to make the change offline, especially if an upgrade tool is provided to make it simple.
        Hide
        Marcus Eriksson added a comment -

        Ok, i'll update the patch

        Show
        Marcus Eriksson added a comment - Ok, i'll update the patch
        Hide
        Pavel Yaskevich added a comment -

        This should be done like compression which compresses only newly created SSTables, we don't want to make users to stop their node for indefinite time to change config and move files.

        Show
        Pavel Yaskevich added a comment - This should be done like compression which compresses only newly created SSTables, we don't want to make users to stop their node for indefinite time to change config and move files.
        Hide
        Marcus Eriksson added a comment -

        It is not backwards compatible, i figured it was not worth the extra complexity, i imagine the "upgrade"-path to be:
        1. shut down node
        2. edit config file
        3. symlink in SSDs and move the files into the subdirectories
        4. start node

        Of course, this could be done on startup by cassandra itself, but providing a shell-script that does it keeps it simple, what do you think? I really dont want the complexity with sstable files in two places at the same time.

        My solution simply changed the Descriptor class to return paths to sstable files in subdirs, and then fix everything that was affected. Many of the classes modified were due to assumptions in the unit tests. I'm currently refactoring the Descriptor class to make it cleaner and a central place to glob for files etc. Expect a patch tonight or tomorrow.

        I will of course change DatabaseDescriptor.getPerCFDirectory() to read the config file, kept it hard coded to make testing simpler.

        Show
        Marcus Eriksson added a comment - It is not backwards compatible, i figured it was not worth the extra complexity, i imagine the "upgrade"-path to be: 1. shut down node 2. edit config file 3. symlink in SSDs and move the files into the subdirectories 4. start node Of course, this could be done on startup by cassandra itself, but providing a shell-script that does it keeps it simple, what do you think? I really dont want the complexity with sstable files in two places at the same time. My solution simply changed the Descriptor class to return paths to sstable files in subdirs, and then fix everything that was affected. Many of the classes modified were due to assumptions in the unit tests. I'm currently refactoring the Descriptor class to make it cleaner and a central place to glob for files etc. Expect a patch tonight or tomorrow. I will of course change DatabaseDescriptor.getPerCFDirectory() to read the config file, kept it hard coded to make testing simpler.
        Hide
        Pavel Yaskevich added a comment -

        Thanks for your work, Marcus! First of all, can you please write your algorithm down so people can participate without reading actual code?

        I don't think that this is the right way to go because I can't find how does it manage to keep backward compatibility with current directory structure (and that test SSTables should be moved is yet another confirmation) that would imply major change in the Descriptor/Component classes, all the current changes relay on DatabaseDescriptor.getPerCFDirectory() which is static "true" which makes it redundant. In my vision of things major changes should be made only in Descriptor, Component, ColumnFamilyStore and SSTable classes to change the way they create/lookup file locations and do backups and snapshoting.

        Show
        Pavel Yaskevich added a comment - Thanks for your work, Marcus! First of all, can you please write your algorithm down so people can participate without reading actual code? I don't think that this is the right way to go because I can't find how does it manage to keep backward compatibility with current directory structure (and that test SSTables should be moved is yet another confirmation) that would imply major change in the Descriptor/Component classes, all the current changes relay on DatabaseDescriptor.getPerCFDirectory() which is static "true" which makes it redundant. In my vision of things major changes should be made only in Descriptor, Component, ColumnFamilyStore and SSTable classes to change the way they create/lookup file locations and do backups and snapshoting.
        Hide
        Marcus Eriksson added a comment -

        btw, yes, i did include the data files in the patch on purpose, they need to be in subdirs for some of the unit tests to work

        Show
        Marcus Eriksson added a comment - btw, yes, i did include the data files in the patch on purpose, they need to be in subdirs for some of the unit tests to work
        Hide
        Marcus Eriksson added a comment -

        submitting a "working" patch for separating column families into subdirectories, <ks>/<cf>/<cf>-xyz.db

        There are alot of things to clean up/refactor, but submitting patch for comments.

        Unit tests work both for new-style dirs and old (except for a backup test that i will fix when backups actually end up where they should).

        TODO: (probably more than this)

        • snapshots need to go into the <ks>/<cf>/snapshots dir since the purpose of this patch is to make it possible to have the <cf>/ directory on a separate drive
        • incremental backups - same issue as for snapshots
        • refactoring of Descriptor class, quite hairy now
        • Disk space checking in column family subdirs
        Show
        Marcus Eriksson added a comment - submitting a "working" patch for separating column families into subdirectories, <ks>/<cf>/<cf>-xyz.db There are alot of things to clean up/refactor, but submitting patch for comments. Unit tests work both for new-style dirs and old (except for a backup test that i will fix when backups actually end up where they should). TODO: (probably more than this) snapshots need to go into the <ks>/<cf>/snapshots dir since the purpose of this patch is to make it possible to have the <cf>/ directory on a separate drive incremental backups - same issue as for snapshots refactoring of Descriptor class, quite hairy now Disk space checking in column family subdirs
        Hide
        Pavel Yaskevich added a comment -

        After crushing my head again this for a few days I can say that this is more complicated that it sounds for a few reasons:

        • We will need to support both old/new directory structures which requires major changes in the way how Descriptor class works and how CFS and SSTable classes do file lookup and path generation.
        • Adds additional complexity to the way how we do backups, snapshots and recover which could potentially lead to some nasty bugs.
        • As Peter already mentioned "Cassandra won't be able to distinguish between an actual empty CF and a directory that wasn't mounted (or a symlink pointing to a non-mounted directory)".

        There are more but those mentioned are major ones. Let's skip this for now.

        Show
        Pavel Yaskevich added a comment - After crushing my head again this for a few days I can say that this is more complicated that it sounds for a few reasons: We will need to support both old/new directory structures which requires major changes in the way how Descriptor class works and how CFS and SSTable classes do file lookup and path generation. Adds additional complexity to the way how we do backups, snapshots and recover which could potentially lead to some nasty bugs. As Peter already mentioned "Cassandra won't be able to distinguish between an actual empty CF and a directory that wasn't mounted (or a symlink pointing to a non-mounted directory)". There are more but those mentioned are major ones. Let's skip this for now.
        Hide
        Chris Burroughs added a comment -

        It would also be cool (but this is obviously speculative) to have the ability to keep Index files on an SSD, and the larger data files on rotating disks.

        Show
        Chris Burroughs added a comment - It would also be cool (but this is obviously speculative) to have the ability to keep Index files on an SSD, and the larger data files on rotating disks.
        Hide
        Peter Schuller added a comment -

        +1 on that. We have been discussing the same thing, for the same purpose. The only kink is that you don't want to do something like having a per-cf setting that is tied to local node details like paths. But simply placing CF:s in a named subdirectory (similar to the pg tablespace) which can, on a per-node basis, by a symlink or a mountpoint, avoids that.

        This means there's no problem doing a rolling re-configuration of a cluster, and there is no need to realize before hand that you might want to move some particular CF and do something like assign it to a tablespace (to get the level of indirection). It all just works by default, and you can move CF:s at any time on any node without co-ordination other than the node being down for a bit.

        I can foresee it being easier to accidentally start a node which seems to work but has some CF:s be completely empty, because Cassandra won't be able to distinguish between an actual empty CF and a directory that wasn't mounted (or a symlink pointing to a non-mounted directory). Something simple like creating a marker of some kind on CF creation might help with that; on start-up CF:s that are missing the marker could be rejected. But - I suppose this is overkill at least initially.

        Show
        Peter Schuller added a comment - +1 on that. We have been discussing the same thing, for the same purpose. The only kink is that you don't want to do something like having a per-cf setting that is tied to local node details like paths. But simply placing CF:s in a named subdirectory (similar to the pg tablespace) which can, on a per-node basis, by a symlink or a mountpoint, avoids that. This means there's no problem doing a rolling re-configuration of a cluster, and there is no need to realize before hand that you might want to move some particular CF and do something like assign it to a tablespace (to get the level of indirection). It all just works by default, and you can move CF:s at any time on any node without co-ordination other than the node being down for a bit. I can foresee it being easier to accidentally start a node which seems to work but has some CF:s be completely empty, because Cassandra won't be able to distinguish between an actual empty CF and a directory that wasn't mounted (or a symlink pointing to a non-mounted directory). Something simple like creating a marker of some kind on CF creation might help with that; on start-up CF:s that are missing the marker could be rejected. But - I suppose this is overkill at least initially.
        Hide
        Jonathan Ellis added a comment -

        Ryan's idea sounds like the simplest way to get something "good enough" to me.

        Show
        Jonathan Ellis added a comment - Ryan's idea sounds like the simplest way to get something "good enough" to me.
        Hide
        Ryan King added a comment -

        Since each keyspace is stored in a different sub-directory of the DataDiretories, you can already split the storage of different keyspaces with some clever mount options. Maybe we could give column families the same treatment?

        Show
        Ryan King added a comment - Since each keyspace is stored in a different sub-directory of the DataDiretories, you can already split the storage of different keyspaces with some clever mount options. Maybe we could give column families the same treatment?
        Hide
        Héctor Izquierdo added a comment -

        What about being configurable in a separate file like the network topology? Could that work as a first approximation?

        Show
        Héctor Izquierdo added a comment - What about being configurable in a separate file like the network topology? Could that work as a first approximation?
        Hide
        Jonathan Ellis added a comment -

        There's some tension between managing this cluster-wide and the actual data directory definitions being per-machine. Not sure what the best solution there is.

        Show
        Jonathan Ellis added a comment - There's some tension between managing this cluster-wide and the actual data directory definitions being per-machine. Not sure what the best solution there is.
        Hide
        Jonathan Ellis added a comment -

        We could also have a "memory" location that would be useful for temporary data.

        Show
        Jonathan Ellis added a comment - We could also have a "memory" location that would be useful for temporary data.

          People

          • Assignee:
            Sylvain Lebresne
            Reporter:
            Jonathan Ellis
            Reviewer:
            Pavel Yaskevich
          • Votes:
            1 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development