Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      via dev@ by Peta Bogdan <bog495@gmail.com>

      Hello,

      I have a small database around 120 MB with approximately 16,000 documents.

      However, it happens (also from futon) that I get this error:

      [Tue, 17 Jan 2012 07:22:01 GMT] [error] [<0.185.0>] {error_report,<0.30.0>,
      {<0.185.0>,crash_report,
      [[{initial_call,{couch_file,init,['Argument__1']}},
      {pid,<0.185.0>},
      {registered_name,[]},
      {error_info,
      {exit,
      {{badmatch,
      {ok,
      9_MEGABYTES_BINARY}},
      [{couch_file,read_raw_iolist_int,3},
      {couch_file,maybe_read_more_iolist,4},
      {couch_file,handle_call,3},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]},
      [{gen_server,terminate,6},
      {proc_lib,init_p_do_apply,3}]}},
      {ancestors,[<0.184.0>]},
      {messages,
      [{'$gen_call',
      {<0.10840.18>,#Ref<0.0.3.20907>},
      bytes}]},
      {links,[<0.190.0>]},
      {dictionary,[]},
      {trap_exit,true},
      {status,running},
      {heap_size,1597},
      {stack_size,24},
      {reductions,65666}],
      [{neighbour,
      [{pid,<0.190.0>},
      {registered_name,[]},
      {initial_call,
      {couch_ref_counter,init,['Argument__1']}},
      {current_function,{gen_server,loop,6}},
      {ancestors,[<0.188.0>,<0.187.0>,<0.184.0>]},
      {messages,[]},
      {links,[<0.185.0>]},
      {dictionary,[]},
      {trap_exit,false},
      {status,waiting},
      {heap_size,610},
      {stack_size,9},
      {reductions,362}]}]]}}

      If this error occurs too frequently, couch_server reaches its maximum
      restart frequency, which shuts down the entire supervision tree, and
      the database server instance disappears.

      The function couch_file:read_raw_iolist_int/3 calls file:pread, which
      returns {ok, Binary}. This Binary is almost 9 megabytes in size, which
      is very strange.
      I think this means that file:pread/3 is being instructed to read from
      the wrong position.

      The only reason I can think of is that the value of 'TotalBytes' from line
      (1) doesn't match the value of 'TotalBytes' from line (2)

      (1) TotalBytes = calculate_total_read_len(BlockOffset, Len),
      (2) {ok, <<RawBin:TotalBytes/binary>>} = file:pread(Fd, Pos, TotalBytes),

      A possible explanation is that under certain conditions the function
      calculate_total_read_len/2 doesn't return the expected value.
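
      For reference, TotalBytes is a single bound variable, so it has the
      same value on lines (1) and (2). The match on line (2) can only fail
      when the binary returned by file:pread/3 has a size other than
      TotalBytes. A minimal sketch of that failure follows; the module and
      function names (badmatch_sketch, try_match, demo) are hypothetical and
      are not CouchDB code.

```erlang
-module(badmatch_sketch).
-export([try_match/2, demo/0]).

%% Apply the same pattern as line (2) to a reply shaped like the
%% result of file:pread/3, and report what happens.
try_match(Reply, TotalBytes) ->
    try
        {ok, <<RawBin:TotalBytes/binary>>} = Reply,
        {matched, RawBin}
    catch
        %% The pattern demands exactly TotalBytes bytes; any other
        %% size raises badmatch carrying the whole reply.
        error:{badmatch, {ok, Short}} ->
            {badmatch, byte_size(Short)}
    end.

%% First call: sizes agree. Second call: a 3-byte reply against an
%% 8-byte pattern, which is the shape of the crash in the report.
demo() ->
    {try_match({ok, <<"abcdefgh">>}, 8),
     try_match({ok, <<"abc">>}, 8)}.
```

      In the crash report the badmatch term carries the entire binary, which
      is why the log entry is several megabytes long.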

      Server: CouchDB/1.1.1 (Erlang OTP/R14B04)
      OS: OpenBSD 5.0 GENERIC.MP#63 amd64

      Now, the trouble is how to circumvent this situation.

      Thank you in advance,

      Bogdan

        Activity

        Jan Lehnardt added a comment -

        And Randall replied:

        I suspect you're right. One probable reason for the mismatch is that
        file:pread is reading off the end of the file due to an improperly
        huge TotalBytes value.
        It's not clear why things got to this state. It may be a classic case
        of data corruption. It's not anything I've seen reported before.

        If you have the inclination to dig through the Erlang terms and find
        anything interesting, please let us know.
        Alternatively, if you can share the database file someone else might
        be able to take a look. If necessary, it may be possible to send
        the data to a committer privately if it contains more sensitive
        information.

        If the problem is corruption, truncating the file before the corrupted
        data should allow the database to function again (at the cost of some
        data loss).

        -Randall

        Adam Kocoloski added a comment -

        I've seen this happen before. file:pread/3 will return fewer bytes than requested if it starts at a valid position and reaches eof. The {ok, <<RawBin:TotalBytes/binary>>} pattern fails to match because the size is less than TotalBytes and CouchDB ends up stupidly trying to log the binary it receives instead. Robert Newson wrote a patch for us a while ago to sensibly log the fact that a mismatch occurred and suppress the multi-megabyte log entry. I'll try to dig it up and see if it applies.

        A 9MB pread request is very unusual and does point to an incorrectly computed value of TotalBytes.
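
        The short-read behaviour described above can be sketched as follows. This is only an illustration, not the patch mentioned in this comment: the module name pread_sketch, the function read_exact/3, the short_read error term, and the temporary file name are all assumptions. The idea is to compare sizes explicitly and return an error that carries only byte counts, so the data itself never reaches the log.

```erlang
-module(pread_sketch).
-export([read_exact/3, demo/0]).

%% Read exactly Len bytes at Pos. On a short read, return an error
%% that holds only the sizes involved, never the binary itself.
read_exact(Fd, Pos, Len) ->
    case file:pread(Fd, Pos, Len) of
        {ok, Bin} when byte_size(Bin) =:= Len ->
            {ok, Bin};
        {ok, Bin} ->
            %% pread started inside the file but reached eof early.
            {error, {short_read, [{wanted, Len}, {got, byte_size(Bin)}]}};
        eof ->
            %% Pos was at or beyond the end of the file.
            {error, {short_read, [{wanted, Len}, {got, 0}]}};
        {error, Reason} ->
            {error, Reason}
    end.

%% Write a 10-byte scratch file, then request 4 bytes (satisfiable)
%% and 100 bytes starting at offset 6 (only 4 bytes remain).
demo() ->
    Path = "pread_sketch.tmp",
    ok = file:write_file(Path, <<"0123456789">>),
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    Full = read_exact(Fd, 0, 4),
    Short = read_exact(Fd, 6, 100),
    ok = file:close(Fd),
    ok = file:delete(Path),
    {Full, Short}.
```

        A caller can then log the short_read term, which stays a few dozen bytes long no matter how large the requested read was.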

        Paul Joseph Davis added a comment -

        I'd be quite interested to know what sort of fsync options were being used on the database/server in question where this occurred.

        Randall Leeds added a comment -

        And any history of solar storms in the area.
        Seriously, fsync is no match for a bad sector or a cosmic ray.

        Adam Kocoloski added a comment -

        True, and it only takes one bit flip to increase the number of bytes we try to read by a huge number. We don't have any checksumming on that piece.

        Oh, and we don't fsync .view files.
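
        To put a number on the bit-flip point, here is a small arithmetic sketch. It assumes a length stored as a plain integer; the 4096-byte starting value and the choice of bit 23 are made up for illustration, and nothing here describes where or how CouchDB actually stores the value. The names bitflip_sketch, flip/2 and demo/0 are hypothetical.

```erlang
-module(bitflip_sketch).
-export([flip/2, demo/0]).

%% Flip bit number Bit (0 = least significant) of a non-negative integer.
flip(Len, Bit) ->
    Len bxor (1 bsl Bit).

%% Returns the original length, the corrupted length, and the
%% corrupted length in whole megabytes.
demo() ->
    Len = 4096,                 %% made-up read length, for illustration
    Corrupt = flip(Len, 23),    %% a single flipped bit adds 2^23 bytes
    {Len, Corrupt, Corrupt div (1024 * 1024)}.
```

        With these made-up inputs a 4 KB read turns into a request of roughly 8 MB, the same order of magnitude as the read in this report.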


          People

          • Assignee: Unassigned
          • Reporter: Jan Lehnardt
          • Votes: 0
          • Watchers: 0