CouchDB
  1. CouchDB
  2. COUCHDB-1141

Docs deleted via PUT or POST do not have contents removed.

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Database Core
    • Labels:
      None
    • Environment:

      All

    • Skill Level:
      Regular Contributors Level (Easy to Medium)

      Description

      If a doc is deleted via -X DELETE, the resulting doc contains only id/rev_deleted:true. However, if a doc is deleted via PUT or POST, through adding _deleted:true, the entire contents of the doc remains stored. Even after compaction, the original document contents remain.

      This issue is causing databases with large docs to become bloated over time, as the original doc contents remain in the database even after being deleted.

        Activity

        Hide
        Randall Leeds added a comment -

        From IRC:
        <tilgovi> huh. is 1141 actually an issue?
        <tilgovi> it's certainly a lil weird
        <tilgovi> but like, delete is via DELETE
        <davisp> tilgovi: _bulk_docs?
        <tilgovi> ohhhhhhhhhh snappppp

        I think a proper fix should include detecting this during compaction and removing the contents at that point.

        Show
        Randall Leeds added a comment - From IRC: <tilgovi> huh. is 1141 actually an issue? <tilgovi> it's certainly a lil weird <tilgovi> but like, delete is via DELETE <davisp> tilgovi: _bulk_docs? <tilgovi> ohhhhhhhhhh snappppp I think a proper fix should include detecting this during compaction and removing the contents at that point.
        Hide
        Paul Joseph Davis added a comment -

        Pretty sure we have the compactor fix like such:

        http://friendpaste.com/2J5xwZlMzDV1ICcZDnkxpv

        Completely untested but it sure looks like a fix.

        Show
        Paul Joseph Davis added a comment - Pretty sure we have the compactor fix like such: http://friendpaste.com/2J5xwZlMzDV1ICcZDnkxpv Completely untested but it sure looks like a fix.
        Hide
        Randall Leeds added a comment -

        That looks like a fix to me.
        I wonder if we should make some macros for element/2 in places where we have tuples like this. Doing it as a macro would mean that it'd still expand to something suitable for a guard but it scares me how easy it is to typo an index and screw everything up.

        Show
        Randall Leeds added a comment - That looks like a fix to me. I wonder if we should make some macros for element/2 in places where we have tuples like this. Doing it as a macro would mean that it'd still expand to something suitable for a guard but it scares me how easy it is to typo an index and screw everything up.
        Hide
        Damien Katz added a comment -

        This is by design. Deleted documents are supposed be able to contain meta information about who deleted them, etc, because they replicate. The problem might be a documentation issue, as clients need to make sure the document body is empty when bulk deleting.

        Show
        Damien Katz added a comment - This is by design. Deleted documents are supposed be able to contain meta information about who deleted them, etc, because they replicate. The problem might be a documentation issue, as clients need to make sure the document body is empty when bulk deleting.
        Hide
        Paul Joseph Davis added a comment -

        Heh, if by documentation issue you mean eleven of twelve committers had no idea that it was intentional.

        Though it brings up the question, is this information accessible over the HTTP API in any manner? I can't think of anything but I can't say it's something I've checked.

        Show
        Paul Joseph Davis added a comment - Heh, if by documentation issue you mean eleven of twelve committers had no idea that it was intentional. Though it brings up the question, is this information accessible over the HTTP API in any manner? I can't think of anything but I can't say it's something I've checked.
        Hide
        Chris Anderson added a comment -

        This is proper behavior. (Or – what Damien said...)

        The intent is to allow users the option to save audit data like deleted_at and deleted_by when they set _deleted=true.

        To read these documents you have to fetch them with their rev num.

        Now, I haven't figured out how to find their rev num except via the changes feed, and THAT might be an improvement we should make.

        Show
        Chris Anderson added a comment - This is proper behavior. (Or – what Damien said...) The intent is to allow users the option to save audit data like deleted_at and deleted_by when they set _deleted=true. To read these documents you have to fetch them with their rev num. Now, I haven't figured out how to find their rev num except via the changes feed, and THAT might be an improvement we should make.
        Hide
        Robert Newson added a comment -

        This is fixed on trunk but needs porting to 1.1.x branch once 1.1 has shipped.

        The http layer now cleans out #doc.body if the doc is deleted. Additionally, the compactor cleans out the body too for all deleted documents (thanks to davisp for that piece). The compactor piece will need minor modification to apply cleanly to 1.1.x

        Show
        Robert Newson added a comment - This is fixed on trunk but needs porting to 1.1.x branch once 1.1 has shipped. The http layer now cleans out #doc.body if the doc is deleted. Additionally, the compactor cleans out the body too for all deleted documents (thanks to davisp for that piece). The compactor piece will need minor modification to apply cleanly to 1.1.x
        Hide
        Robert Newson added a comment -

        Well, that teaches me not to commit without reading my email.

        This is horrible behavior, imo. Since no one knew that this information was preserved other than Damien and Chris isn't it safe to assume no one is storing tombstone information deliberately? Instead, most people are surprised that whatever they post along with their deletion is preserved forever (but forever invisible).

        I think the default should be to clean out, as trunk does, perhaps we add a flag to bypass it for the case Damien mentions?

        Show
        Robert Newson added a comment - Well, that teaches me not to commit without reading my email. This is horrible behavior, imo. Since no one knew that this information was preserved other than Damien and Chris isn't it safe to assume no one is storing tombstone information deliberately? Instead, most people are surprised that whatever they post along with their deletion is preserved forever (but forever invisible). I think the default should be to clean out, as trunk does, perhaps we add a flag to bypass it for the case Damien mentions?
        Hide
        Filipe Manana added a comment -

        I remember when this intentional behaviour was added, and it's fine for me.

        https://github.com/apache/couchdb/commit/32c60cf5c554a1ff3819cc658ebf5846ccb9d6be

        Show
        Filipe Manana added a comment - I remember when this intentional behaviour was added, and it's fine for me. https://github.com/apache/couchdb/commit/32c60cf5c554a1ff3819cc658ebf5846ccb9d6be
        Hide
        Benoit Chesneau added a comment -

        well even if it' sintentional, would be good to document it somewhere so we make sure library writers remove the doc content before adding _deleted: true ALos agree, doc content should be at least removed at doc compaction.

        Show
        Benoit Chesneau added a comment - well even if it' sintentional, would be good to document it somewhere so we make sure library writers remove the doc content before adding _deleted: true ALos agree, doc content should be at least removed at doc compaction.
        Hide
        Benoit Chesneau added a comment -

        To add another client issue about this behaviour, if people want to make sure the doc content is emptied on bulk doc it will imply we remove all properties except _id, and _rev or create a new new empty doc before sending this doc, which would slow down the process a lot.

        Show
        Benoit Chesneau added a comment - To add another client issue about this behaviour, if people want to make sure the doc content is emptied on bulk doc it will imply we remove all properties except _id, and _rev or create a new new empty doc before sending this doc, which would slow down the process a lot.
        Hide
        Robert Newson added a comment -

        Hm, I don't recall a JIRA ticket related to that change in July 2010, did it just get changed without a ticket or discussion or am just forgetful?

        Show
        Robert Newson added a comment - Hm, I don't recall a JIRA ticket related to that change in July 2010, did it just get changed without a ticket or discussion or am just forgetful?
        Hide
        Paul Joseph Davis added a comment -

        That commit looks like its just the ability to see the body, not the enabling part. Unless I'm mistaken this feature has always existed, just no one has used it.

        I see the benefits of having such a capability but seeing as there's such a lack of awareness even by people that have worked with CouchDB internals, it seems like we need to have a behavior change so that it becomes a bit more obvious why things work the way they are.

        The simplest change that would make this fairly obvious is to just return the doc body in the 404 response to a GET request. So instead of

        {"not_found": "deleted"}

        we would return (usually)

        {"_id": "foo", "_rev": "1-234223424", "_deleted": true}

        And then if people had extra tombstone info it would just be in the doc body. Though we still haven't addressed whether attachments should be removed or kept.

        The second bit of functionality that seems like it'd be necessary is to be able to DELETE a deleted doc somehow. As it is, it seems like the only way to get rid of deleted doc bodies is to resurrect them and then DELETE them again. Seems like being able to issue a delete against the 404 would be better (though seems un HTTP, though I can't think of a clause in the RFC that addresses this directly, also, the RFC does make distinctions between unknown resources and resources that are known to not be available, so maybe precedent?)

        Anyone have thoughts? Open a new ticket?

        Show
        Paul Joseph Davis added a comment - That commit looks like its just the ability to see the body, not the enabling part. Unless I'm mistaken this feature has always existed, just no one has used it. I see the benefits of having such a capability but seeing as there's such a lack of awareness even by people that have worked with CouchDB internals, it seems like we need to have a behavior change so that it becomes a bit more obvious why things work the way they are. The simplest change that would make this fairly obvious is to just return the doc body in the 404 response to a GET request. So instead of {"not_found": "deleted"} we would return (usually) {"_id": "foo", "_rev": "1-234223424", "_deleted": true} And then if people had extra tombstone info it would just be in the doc body. Though we still haven't addressed whether attachments should be removed or kept. The second bit of functionality that seems like it'd be necessary is to be able to DELETE a deleted doc somehow. As it is, it seems like the only way to get rid of deleted doc bodies is to resurrect them and then DELETE them again. Seems like being able to issue a delete against the 404 would be better (though seems un HTTP, though I can't think of a clause in the RFC that addresses this directly, also, the RFC does make distinctions between unknown resources and resources that are known to not be available, so maybe precedent?) Anyone have thoughts? Open a new ticket?
        Hide
        Robert Newson added a comment -

        Instead of stripping potentially valuable data posted to a document with _deleted:true;

        1) the full document body will be returned when requested. The status code will still be 404 and the body will contain _deleted:true, either of which are generic tests for whether a document is deleted.
        2) A new view option will be added called include_deletes, allowing views to be built out of whatever information was stored along with the deletion request.

        Part 1 is not compatible with 1.0 which returns a body of

        {not_found:deleted}

        . This must be included in the API Changes or Breaking Changes section of the release nodes for 1.2

        A separate ticket should be created to address the behavior of bulkRemove in jquery.couch.js which does not adhere to best practices in the light of this feature.

        Show
        Robert Newson added a comment - Instead of stripping potentially valuable data posted to a document with _deleted:true; 1) the full document body will be returned when requested. The status code will still be 404 and the body will contain _deleted:true, either of which are generic tests for whether a document is deleted. 2) A new view option will be added called include_deletes, allowing views to be built out of whatever information was stored along with the deletion request. Part 1 is not compatible with 1.0 which returns a body of {not_found:deleted} . This must be included in the API Changes or Breaking Changes section of the release nodes for 1.2 A separate ticket should be created to address the behavior of bulkRemove in jquery.couch.js which does not adhere to best practices in the light of this feature.
        Hide
        Robert Newson added a comment -

        This change must not be applied to 1.1.1 or earlier.

        Show
        Robert Newson added a comment - This change must not be applied to 1.1.1 or earlier.
        Hide
        Robert Newson added a comment -

        We will not be changing this behavior. The contents of the body that you post with your DELETE will be preserved by design.

        Documentation should be improved so that this is clear and as many couch clients should delete by forming an object with just _id, _rev and deleted rather than naively adding _deleted:true to the full document.

        Show
        Robert Newson added a comment - We will not be changing this behavior. The contents of the body that you post with your DELETE will be preserved by design. Documentation should be improved so that this is clear and as many couch clients should delete by forming an object with just _id, _rev and deleted rather than naively adding _deleted:true to the full document.

          People

          • Assignee:
            Robert Newson
            Reporter:
            Wendall Cada
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development