CouchDB / COUCHDB-844

Documents missing after CouchDB restart

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: 1.0.1
    • Component/s: Database Core
    • Labels:
      None
    • Environment:

      Debian Version 5.0.5, Linux *** 2.6.29-xs5.5.0.17 #1 SMP Mon Aug 3 17:37:37 UTC 2009 i686 GNU/Linux, XenServer Guest

      Description

      After a CouchDB restart, recently added or changed documents and design documents (going back at least 2 weeks!) are missing and can't be accessed through REST calls or Futon.

      All documents that are still available through REST/Futon exist only in old revisions.

      All documents/revisions can be found by doing a manual search (less/egrep/...) in the data file (/var/lib/couchdb/<database>.couch)

      Example:

      strings dtap.couch | grep -i "226b2e6c-24b7-4336-92c7-257abf923b11"
      $226b2e6c-24b7-4336-92c7-257abf923b11h
      $226b2e6c-24b7-4336-92c7-257abf923b11l
      $226b2e6c-24b7-4336-92c7-257abf923b11l
      $226b2e6c-24b7-4336-92c7-257abf923b11l
      $226b2e6c-24b7-4336-92c7-257abf923b11l
      $226b2e6c-24b7-4336-92c7-257abf923b11l
      $226b2e6c-24b7-4336-92c7-257abf923b11l
      $226b2e6c-24b7-4336-92c7-257abf923b11l
      $226b2e6c-24b7-4336-92c7-257abf923b11l
      $226b2e6c-24b7-4336-92c7-257abf923b11h
      $226b2e6c-24b7-4336-92c7-257abf923b11h

      curl http://localhost:5984/dtap/226b2e6c-24b7-4336-92c7-257abf923b11

      {"error":"not_found","reason":"missing"}
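The manual search above can also be scripted. The following is a minimal sketch, not part of the original report: it counts raw byte occurrences of a document ID in a database file, roughly what `strings | grep` shows. The function name and chunk size are illustrative assumptions, and a match only shows that the bytes are on disk, not that CouchDB can reach them.

```python
def count_id_occurrences(path, doc_id, chunk_size=1 << 20):
    """Count raw byte occurrences of doc_id in the file at path.

    Hypothetical helper for illustration; it does not parse the
    CouchDB file format, it only scans bytes.
    """
    needle = doc_id.encode("utf-8")
    keep = len(needle) - 1  # bytes carried over between chunks
    count = 0
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            count += buf.count(needle)
            # Carry over fewer bytes than the needle length, so a match
            # spanning two chunks is found once and never counted twice.
            tail = buf[-keep:] if keep > 0 else b""
    return count
```

For the file in this report, the call would look like `count_id_occurrences("/var/lib/couchdb/dtap.couch", "226b2e6c-24b7-4336-92c7-257abf923b11")`.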

        Activity

        Adam Kocoloski added a comment -

        Randall Leeds wrote a test case and a patch, which I've committed to both trunk and the 1.0.x branch.

        Sascha Reuter added a comment -

        Feedback by mail didn't quote the questions... so, here we go....

        Is it possible either of you ran compaction as the last thing you did before restarting?

        Never ran any compaction in this environment ...

        Sascha: did you lose all documents written during those 2 weeks or only some?

        All.. pretty sure...

        Sascha, what is your TZ?

        CEST

        Also, can you run this command in the `erl` shell?

        $ erl
        Erlang R14A (erts-5.8) [source] [rq:1] [async-threads:0] [hipe] [kernel-poll:false]

        Eshell V5.8 (abort with ^G)
        1> httpd_util:rfc1123_date(erlang:localtime()).
        "Sat, 07 Aug 2010 22:00:06 GMT"
        2>

        Hope it helps guys...

        Randall Leeds added a comment -

        Sascha: did you lose all documents written during those 2 weeks or only some?

        Chris Anderson added a comment -

        We are investigating a potential issue with timezone bugs in Erlang as part of the cause. Sascha, what is your TZ? Lim Yue Chuan is GMT +8.

        Here is the TZ bug: https://issues.apache.org/jira/browse/COUCHDB-627

        Also, can you run this command in the `erl` shell?

        httpd_util:rfc1123_date(erlang:localtime()).

        If it returns a badarg error, that could help us narrow down the cause.

        Thanks

        Lim Yue Chuan added a comment -

        I am pretty sure that the database I gave Damien and Chris has not been compacted. It's a pretty small database, and I can't think of any reason why I would compact it.

        Looking at what's left of my data, I can see revision 1 of some documents I created really early in development. So it might have been compacted at some point, but not recently before the restart.

        I can, however, recall compacting another database numerous times while trying to track down a bug in the builtin reduce functions with Chris (https://mail.google.com/mail/?shva=1#inbox/12a47692440730ac). I'm afraid I can't say for sure that I have not lost data from that database, as it is a log database with mostly randomly generated data, but if I had to guess, I would say there was no data loss.

        Randall Leeds added a comment -

        Sascha, Lim:

        Is it possible either of you ran compaction as the last thing you did before restarting? That is, you made no further updates after the last compaction ran?

        Sascha Reuter added a comment -

        Hi Damien,

        The filesystem is ext3 and I'm using LVM. Filesystem corruption is rather unlikely, and there was always plenty of space available (about 50% usage)...

        As said before, I just restarted CouchDB via the provided init script.

        Let me know if you need any more information on this!

        Cheers,

        Sascha

        Lim Yue Chuan added a comment -

        As requested, I am having the same problem and have forwarded the database in question to J Chris. Do mail me if you need a copy.

        Disk consistency check :
        C:\>chkdsk c:
        The type of the file system is NTFS.
        ~ snip ~
        CHKDSK is verifying files (stage 1 of 3)...
        366848 file records processed.
        File verification completed.
        304 large file records processed.
        0 bad file records processed.
        2 EA records processed.
        179 reparse records processed.
        CHKDSK is verifying indexes (stage 2 of 3)...
        454806 index entries processed.
        Index verification completed.
        0 unindexed files scanned.
        0 unindexed files recovered.
        CHKDSK is verifying security descriptors (stage 3 of 3)...
        366848 file SDs/SIDs processed.
        Security descriptor verification completed.
        43980 data files processed.
        CHKDSK is verifying Usn Journal...
        36799960 USN bytes processed.
        Usn Journal verification completed.
        Windows has checked the file system and found no problems.

        209612799 KB total disk space.
        101378104 KB in 322461 files.
        136072 KB in 43981 indexes.
        0 KB in bad sectors.
        476195 KB in use by the system.
        65536 KB occupied by the log file.
        107622428 KB available on disk.

        4096 bytes in each allocation unit.
        52403199 total allocation units on disk.
        26905607 allocation units available on disk.

        At no point in time is the disk anywhere close to full.

        Running Windows 7 32-bit, NTFS filesystem; the hard disk is an ST31000528AS, a Seagate 1TB drive, partitioned into 200GB/731GB (200GB is the system partition, and also where CouchDB is installed). The installation directory is in Program Files.

        The write caching policy option in Windows is enabled; write-cache buffer flushing is NOT enabled. (These are the system defaults, typical of just about every Windows system I have ever encountered. Windows advises against turning on the second option unless a UPS is present, and I have never used it.)

        I am using Mark Hammond's installation of CouchDB (not the one off couch.io).

        Do let me know if you require any further details.

        Damien Katz added a comment -

        Hello Sascha.

        What file system are you running? Can you run a consistency check on it?

        It's strange. It looks like your file was either truncated or the header was never written. There is a bunch of data after the last header, and it contains your missing data, but none of it looks like a header for it. All the interval markers are set for data. This is consistent with a file that's been truncated. Still doing a bit more investigation to check the data regions to see if they might actually have a header.

        We have seen instances in the past (0.8.0 and earlier) where file systems have truncated the db file, making recovery difficult, which is why we switched to pure tail append format. As I recall, those reports were associated with the file system running out of space.

        Barring physical corruption or truncation, the only other possibility I can think of is that somehow there is a bug where CouchDB isn't writing the header. I don't know of any other instances of that happening, but if that's what it is, it's a very serious bug.
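To illustrate the failure mode described in this comment, here is a toy model of a tail-append file in which every block starts with a marker byte saying whether it is a header or data. It is only a sketch: the block size, the marker values, and all function names are assumptions made for the example, not CouchDB's actual on-disk layout. It shows why data appended after the last header is present in the file yet invisible to a reader that opens the database from that header.

```python
BLOCK_SIZE = 4096          # assumed block size for this toy model
HEADER_MARKER = b"\x01"    # assumed marker for a header block
DATA_MARKER = b"\x00"      # assumed marker for a data block


def make_block(marker, payload):
    """Build one fixed-size block: marker byte, payload, zero padding."""
    body = marker + payload
    if len(body) > BLOCK_SIZE:
        raise ValueError("payload too large for one block")
    return body.ljust(BLOCK_SIZE, b"\x00")


def find_last_header(blob):
    """Scan block boundaries backwards; return the offset of the last
    block marked as a header, or None if there is none."""
    full_blocks = len(blob) // BLOCK_SIZE
    for index in range(full_blocks - 1, -1, -1):
        offset = index * BLOCK_SIZE
        if blob[offset:offset + 1] == HEADER_MARKER:
            return offset
    return None


def bytes_after_last_header(blob):
    """Bytes written after the last header: on disk, but unreachable
    from that header."""
    offset = find_last_header(blob)
    if offset is None:
        return len(blob)
    return len(blob) - (offset + BLOCK_SIZE)
```

In this model, a file made of one data block, one header block, and two more data blocks reports 8192 bytes after the last header, which mirrors the situation described above: documents exist in the file, but no header points at them.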

        Sascha Reuter added a comment -

        If you mean the views I currently have access to via REST/Futon... no, because CouchDB no longer shows the design documents I've created within the last 2 weeks.

        I only have access to views I created before the "about 2 weeks" timeframe.

        If I grep through the .view files in the filesystem (.<dbname>_design) I also see rows containing data not accessible through CouchDB.... But as there is no reference to these .view files available through the CouchDB interface (because the design documents are missing), I can't access this data through CouchDB.

        Filipe Manana added a comment -

        Ah, my fault, I missed the 2 weeks detail.

        Then that's definitely weird.
        Do you have some view where there are rows based on those lost docs?

        Sascha Reuter added a comment -

        Hey Filipe,

        let me clarify some things, again...

        • It affects data added/changed within about two weeks (not 1 second) before the scheduled restart, which was done using the provided init script.
        • A document that is reported as "missing" by CouchDB still persists in multiple revisions on the filesystem (see the example in the bug report)

        Cheers,

        Sascha

        Filipe Manana added a comment -

        Sascha,

        This is likely because you have delayed commits turned on in your .ini config (they're enabled by default).
        With delayed commits on, Couch writes the doc to the DB file, then replies to the client with an HTTP 201 code, and then after 1 second it does an fsync.

        So if the crash/restart happened in less than 1 second after saving the doc...
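For the sub-second window described here, a client that cannot afford to lose a write can ask CouchDB to flush explicitly. The sketch below only builds the request for the `POST /{db}/_ensure_full_commit` endpoint and does not send it; the helper name is hypothetical, and the host and database name are taken from the example in the description. Whether this is appropriate depends on the deployment, so treat it as an illustration rather than a recommended fix.

```python
import urllib.request


def build_full_commit_request(base_url, db_name):
    """Build (but do not send) a request asking CouchDB to commit
    recent changes of db_name to disk.

    Hypothetical helper for illustration only.
    """
    url = "{}/{}/_ensure_full_commit".format(base_url.rstrip("/"), db_name)
    return urllib.request.Request(
        url,
        data=b"",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example (requires a running server, so it is left commented out):
# req = build_full_commit_request("http://localhost:5984", "dtap")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read())
```

An alternative with the same intent is to turn delayed commits off in the .ini config, trading write throughput for durability on every update.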


          People

          • Assignee:
            Unassigned
            Reporter:
            Sascha Reuter