CouchDB / COUCHDB-1243

Compact and copy feature that resets changes

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.1, 1.1
    • Fix Version/s: None
    • Component/s: Database Core
    • Environment:

      Ubuntu, but not important

    • Skill Level:
      Committers Level (Medium to Hard)

      Description

      After running database and view compaction on a 70K-document db with 6+ million changes, it takes up 0.8 GB. If the same documents are copied to a new db (get and bulk insert), the same data with only 70K changes (the inserts alone) takes up 40 MB. That is a huge difference. It has been verified on two dbs that the difference is more than 65 times the size of the data.

      A "Compact and copy" feature that copies only documents, and resets the changes for at db would be very nice to try and limit the disk usage a little bit. (Our current test environment takes up nearly 100 GB... )

      I've attached the dump load php script for your convenience.
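
      For reference, a rough sketch of the same get-and-bulk-insert copy with curl and jq (db names are placeholders; it assumes jq is available and that the whole db fits in a single response):

        curl -s 'http://localhost:5984/sourcedb/_all_docs?include_docs=true' \
          | jq '{docs: [.rows[].doc | del(._rev)]}' > docs.json
        curl -X POST -H 'Content-Type: application/json' -d @docs.json \
          http://localhost:5984/targetdb/_bulk_docs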

      1. dump_load.php
        3 kB
        Henrik Hofmeister

        Activity

        Henrik Hofmeister added a comment -

        Dump load script - requires PHP and curl installed.

        It takes two arguments: the host (including http://, port, and trailing /) and the source db name.
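
        For example (assuming the arguments are passed in the order listed; the URL and db name are placeholders):

          php dump_load.php http://localhost:5984/ source_db_name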

        Robert Newson added a comment -

        The difference is the removal of all the _deleted stubs which are critical to the correct operation of CouchDB replication. As such, I'm -1 on the idea.

        That said, I'd be +1 on adding /db/_export and /db/_import entry points that make backup/restore trivial.

        Henrik Hofmeister added a comment - edited

        export/import would do the trick as well - or at least make it easier... However, we are using CouchDB intensively for both moderate and huge-sized dbs, and this forever-growing changes size will cause us to switch away from Couch eventually, as we are rapidly growing into SAN-size requirements, which makes CouchDB a very expensive db. Also, view updates and compaction are getting to a point where they have to be done on weekends to allow time for them to finish. Our main db has 2 changes for every document... with 7 million documents, we are facing a staggering 15 million changes.

        I'd at least point out that CouchDB is - to my understanding - built for web scale, yet we are nowhere near our expected size and are already growing out of it?

        (Note: not bashing Couch - we grew out of MySQL 6 months ago.)

        Damien Katz added a comment -

        I mostly agree with Robert Newson that what you are asking for is a dangerous thing for CouchDB replication. However, there is the purge option, which "forgets" documents, deleted or otherwise, completely removing them from the internal indexes. Once documents are purged, compaction will completely remove them from the file forever. Unfortunately, I couldn't find actual documentation on the purge functionality, so the best place to figure out how to use purge is to look at the purge test in the browser test suite, which can be found here:

        http://svn.apache.org/viewvc/couchdb/trunk/share/www/script/test/purge.js?view=co&revision=1086241&content-type=text%2Fplain

        I've often thought it would be useful to purge docs during compaction, by providing a user-defined function that signals which unwanted docs/stubs to remove. But no such thing exists; in the meantime you can accomplish it with a purge + compaction.
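
        A rough sketch of that purge + compaction combination with curl (the doc id and revision are placeholders; the purge body maps each doc id to the list of revisions to forget):

          curl -X POST -H 'Content-Type: application/json' \
            -d '{"some_doc_id": ["1-abc123"]}' \
            http://localhost:5984/dbname/_purge
          curl -X POST -H 'Content-Type: application/json' \
            http://localhost:5984/dbname/_compact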

        Paul Joseph Davis added a comment -

        There's a caveat and a note on purge, though. First, if you purge twice in a row without updating a view, you have to rebuild the view from scratch. For heavy users of views this becomes a problem. It's just an implementation detail at the moment and could eventually be fixed.

        And a note: there was another bug report this morning that looks as though it's triggered in the purge code and specifically affects compaction. There's been some speculation that it's the purge code, but I don't think anyone has sat down to comb through it yet and tried to reproduce it.

        Robert Newson added a comment -

        _purge is really for the "oops, I just put my admin password in a document" scenario. It's not well tested, has known and unresolved bugs, and obviously ruins eventual consistency. I'd rather see it removed than encouraged, but I think it's important for the narrow use case I just mentioned.

        We only remember the _revs for the last 1000 updates to a document, so there is a cap (albeit a generous one) on how much is retained. When you say '6+ million changes', are these updates to existing documents, or are you deleting documents and creating new ones?

        If the latter, then you could consider the temporal database idea, which is often suggested when using CouchDB as a message queue: use a database per time interval (say, weekly). When a database is empty (i.e., only has deleted documents), you can delete that db entirely.
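
        For illustration, with a weekly naming scheme (db names are placeholders):

          curl -X PUT http://localhost:5984/queue_2011_week_32     # start writing to the current week's db
          curl -X DELETE http://localhost:5984/queue_2011_week_25  # drop an old, fully drained db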

        I'll finish by saying that CouchDB's retention of information about deleted documents and old revisions is central to CouchDB; if it's working so strongly against you, then I don't think it's the right database for your problem.

        Paul Joseph Davis added a comment -

        The oops scenario is important, but the motivating use case as I always heard it was if you wanted to rebalance doc information across shards in a cluster.

        Robert Newson added a comment -

        Seriously? Bleh. We can surely do shard splitting without the horrors of _purge.

        Henrik Hofmeister added a comment -

        We update the documents - not delete and create - or at least not all the time. It's not temp data, it's just forever-growing data. But anyway... good points - I'm starting to get that CouchDB's main point is master/master replication, which we are also using it for on the more moderately sized dbs. Could it be an option, though, to allow Couch to disable replication on certain dbs, in favor of stuff like this?

        Robert Newson added a comment -

        You could reduce _revs_limit on those databases, which will remove much of the overhead, with the caveat that replication could be impaired if no common ancestor can be found (not a problem if you never replicate).

        curl -X PUT -d "number goes here" http://localhost:5984/dbname/_revs_limit
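
        For example, to cap each document's revision history at 10 (the value is only an illustration; the current setting, 1000 by default, can be read back with a GET):

          curl -X PUT -d '10' http://localhost:5984/dbname/_revs_limit
          curl http://localhost:5984/dbname/_revs_limit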

        Henrik Hofmeister added a comment -

        Already done that though... but thanks

        Randall Leeds added a comment -

        If a smaller _revs_limit doesn't fix your problem, then it sounds like you have some documents that are in conflict. The best way I can think of to automate purging the conflicts would be to consume the /_changes feed with ?style=all_docs. Each entry in the feed will include an array of revisions in the 'changes' property, the first of which is the winning revision. Then use /_purge to remove all but this winning revision, and you'll be left with only the history of the winning version. If you only consume the _changes feed up to a sequence number before the stable replication checkpoints, you won't be destroying revisions that haven't replicated yet, and replication should continue to function. Additionally, documents that haven't been in conflict much but have received many updates will still have history back to _revs_limit and should replicate safely, without introducing new conflicts, so long as they haven't received a number of divergent updates. A rough sketch follows below.

        Paul's caveats about _purge and view indexes apply.
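
        A sketch of that flow with curl and jq (db name, the sequence cutoff, and the revisions are placeholders; it assumes a normal, non-continuous feed that fits in one response):

          # list docs that still have more than one leaf revision, up to a known checkpoint
          curl -s 'http://localhost:5984/dbname/_changes?style=all_docs' \
            | jq -c '.results[] | select(.seq <= 12345 and (.changes | length) > 1)'
          # each line looks like {"seq":..,"id":"doc1","changes":[{"rev":"3-aaa"},{"rev":"2-bbb"}]};
          # the first rev listed is the winner, so purge the rest
          curl -X POST -H 'Content-Type: application/json' \
            -d '{"doc1": ["2-bbb"]}' \
            http://localhost:5984/dbname/_purge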

        Randall Leeds added a comment -

        Also, if it wasn't already clear, this is bat country. Proceed with caution.


          People

          • Assignee: Unassigned
          • Reporter: Henrik Hofmeister
          • Votes: 1
          • Watchers: 2
