Derby / DERBY-4348

Copy table content with "INSERT INTO table SELECT FROM (...)" statement leads to corrupt data

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.0.2.1, 10.1.1.0, 10.2.1.6, 10.2.2.0, 10.3.1.4, 10.4.2.0, 10.5.1.1, 10.6.1.0
    • Fix Version/s: 10.5.3.1, 10.6.1.0
    • Component/s: SQL
    • Labels:
      None
    • Environment:
      Derby: embedded driver 10.5.1.1 - (764942); tested with 10.4 and with the client driver as well
      OS: Windows XP
      SQL statements executed using SQuirrel SQL Client, but behavior is the same with ij
    • Urgency:
      Urgent
    • Issue & fix info:
      Repro attached
    • Bug behavior facts:
      Data corruption

      Description

      I had to change the primary key of a table and used ddlutils to do so. Ddlutils recreated the table to perform this task.
      After the schema conversion, the row data of the changed table was corrupted.
      The values of the last table column were filled with values from other rows!

      After performing a few tests I could narrow the problem down to the SQL statement "INSERT INTO table SELECT FROM (...)"

      To reproduce the effect, do the following:

      1. unpack attached database 'rmdb.zip'

      2. connect to the database with embedded driver
      User: IGEL
      Password: test

      3. read data of an example row from database
      select * from stringrangenew where classname = 'x.xserver%.colordepth';

      result is
      x.xserver%.colordepth 2 [16] [24] [32]

      4. now copy the data to the second table (column INSTANCENR has been added to this table)
      INSERT INTO STRINGRANGENEW_ (CLASSNAME,FIRMWAREID,RVALUE) SELECT CLASSNAME,FIRMWAREID,RVALUE FROM STRINGRANGENEW;

      5. select data of example row from second table
      select * from stringrangenew_ where classname = 'x.xserver%.colordepth';

      result is
      x.xserver%.colordepth 2 -1 [CCW][CW][XX]

      -> the value of the last column is not the same as in the original table!

      Here is some additional information I worked out during my tests:

      • if you change the copy statement to include the additional column INSTANCENR, the copied data are correct.
        delete from STRINGRANGENEW_;
        INSERT INTO STRINGRANGENEW_ (CLASSNAME,FIRMWAREID, INSTANCENR, RVALUE) SELECT CLASSNAME,FIRMWAREID, -1, RVALUE FROM STRINGRANGENEW;
      • if you only run the select ('SELECT CLASSNAME,FIRMWAREID,RVALUE FROM STRINGRANGENEW'), the result shows correct data

      Note:
      The effect is not restricted to this row but also applies to other rows. However, it is always the same rows that get corrupted.
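
      For reference, steps 2 through 5 can also be driven from JDBC instead of SQuirrel or ij. Below is a minimal sketch, not part of the original report: it assumes the unpacked database directory is called 'rmdb' and is reachable from the working directory, and the class name CopyRepro is only illustrative; the SQL is taken verbatim from the steps above.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class CopyRepro {
          public static void main(String[] args) throws Exception {
              // Step 2: connect with the embedded driver.
              Connection c = DriverManager.getConnection("jdbc:derby:rmdb", "IGEL", "test");
              Statement s = c.createStatement();

              // Step 3: read the example row from the original table.
              printRow(s, "STRINGRANGENEW");

              // Step 4: copy the data to the second table.
              s.executeUpdate("INSERT INTO STRINGRANGENEW_ (CLASSNAME,FIRMWAREID,RVALUE) "
                      + "SELECT CLASSNAME,FIRMWAREID,RVALUE FROM STRINGRANGENEW");

              // Step 5: read the example row from the second table; in affected
              // versions the RVALUE column no longer matches the original.
              printRow(s, "STRINGRANGENEW_");

              c.close();
          }

          private static void printRow(Statement s, String table) throws Exception {
              ResultSet rs = s.executeQuery("SELECT * FROM " + table
                      + " WHERE CLASSNAME = 'x.xserver%.colordepth'");
              while (rs.next()) {
                  StringBuilder sb = new StringBuilder(table + ":");
                  for (int i = 1; i <= rs.getMetaData().getColumnCount(); i++) {
                      sb.append(' ').append(rs.getString(i));
                  }
                  System.out.println(sb);
              }
              rs.close();
          }
      }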

      Attachments

      1. rmdb.zip (391 kB) - Stefan Huber
      2. out.dat (152 kB) - Kristian Waagan
      3. derby-4348-1a.stat (0.2 kB) - Knut Anders Hatlen
      4. derby-4348-1a.diff (8 kB) - Knut Anders Hatlen
      5. d4348-import.sql (2 kB) - Kristian Waagan
      6. d4348.sql (0.6 kB) - Kristian Waagan
      7. D4348.java (1 kB) - Knut Anders Hatlen

        Activity

        Kristian Waagan added a comment -

        Verified in trunk with the provided repro database.

        Observations:

        • database is in 10.4 format
          (experiments below run with trunk)
        • consistency checks pass
        • exporting the data, importing it, and repeating the repro procedure seems to work (at least that exact row is not corrupt). I didn't create any indexes.
        • dropping and recreating STRINGRANGENEW_ without indexes works without corruption
        • dropping and recreating STRINGRANGENEW_ with primary key works without corruption

        This issue needs further qualification.
        When I ran the repro, 42 rows got corrupted.
        The attached SQL script executes the steps required to reproduce the problem (download and unzip the database, then run the script with ij: 'ij d4348.sql').

        I don't have time to follow up on this issue (finally some vacation), but maybe it would be good to figure out if any of the tables are corrupt, or if it is a single index or a combination of indexes that causes the corruption.

        Maybe the reporter can post the SQL that created the indexes for completeness?
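
        For anyone retracing these observations later, the consistency checks mentioned above can be run with Derby's built-in SYSCS_UTIL.SYSCS_CHECK_TABLE function. The following is only a sketch, not part of the original comment; it assumes the database was unpacked to 'rmdb', that the tables live in the IGEL schema (the connecting user's default), and the class name CheckTables is invented.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class CheckTables {
            public static void main(String[] args) throws Exception {
                Connection c = DriverManager.getConnection("jdbc:derby:rmdb", "IGEL", "test");
                Statement s = c.createStatement();
                for (String table : new String[] {"STRINGRANGENEW", "STRINGRANGENEW_"}) {
                    // SYSCS_CHECK_TABLE returns 1 if the table and its indexes
                    // are consistent, and raises an error otherwise.
                    ResultSet rs = s.executeQuery(
                            "VALUES SYSCS_UTIL.SYSCS_CHECK_TABLE('IGEL', '" + table + "')");
                    rs.next();
                    System.out.println(table + ": " + rs.getInt(1));
                    rs.close();
                }
                c.close();
            }
        }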

        Kristian Waagan added a comment -

        Attached repro script 'd4348.sql'.

        Stefan Huber added a comment -

        Original table create statements:

        CREATE TABLE STRINGRANGENEW
        (
        CLASSNAME VARCHAR(200) NOT NULL,
        FIRMWAREID INTEGER NOT NULL,
        RVALUE LONG VARCHAR NOT NULL,
        PRIMARY KEY (CLASSNAME, FIRMWAREID)
        );
        CREATE INDEX IDX_CLASSNAME_05 ON STRINGRANGENEW (CLASSNAME);

        Statements for second table:

        CREATE TABLE STRINGRANGENEW_
        (
        CLASSNAME VARCHAR(200) NOT NULL,
        FIRMWAREID INTEGER NOT NULL,
        INSTANCENR INTEGER DEFAULT -1 NOT NULL,
        RVALUE LONG VARCHAR NOT NULL,
        PRIMARY KEY (CLASSNAME, FIRMWAREID, INSTANCENR)
        );

        (All statements are created by DDLUTILS automatically)

        There are no explicit indexes created for table STRINGRANGENEW_.

        The original table STRINGRANGENEW has additional foreign keys (one on column CLASSNAME and another on column FIRMWAREID) in the production environment. If you need the complete production database for further examination, please let me know.

        Kristian Waagan added a comment -

        Thanks, Stefan.

        I'm not seeing the same behavior when I use your exact table definition, so the index is probably a dead end. Right now my suspicion is directed at the combination of DEFAULT and NOT NULL for the newly added column.

        Kristian Waagan added a comment -

        Attaching a new repro - which imports the data ('d4348-import.sql' contains the script, 'out.dat' the data to import).
        I removed all the indexes to simplify the script. Note the comments for the SQL creating the two tables.

        As it is now, the repro fails (by showing 42 rows where RVALUE differs from the original).
        By tweaking the DDL slightly, it passes.

        Kristian Waagan added a comment -

        Note that I haven't checked if the bug occurs in versions prior to 10.3 (didn't have the jars easily available at the moment).

        Knut Anders Hatlen added a comment -

        I checked three random samples out of the 42 corrupt rows, and all of them had an RVALUE identical to that of the row 15 positions earlier in the table. So at least there seems to be some kind of pattern.

        Knut Anders Hatlen added a comment -

        More info about the pattern (thanks to Dag for his DERBY-3634 patch, which made it a lot easier to find!):

        The corrupted rows in STRINGRANGENEW_ are all rows whose row number (counting from the beginning of the table, starting with 1) is 16*N, with N in {2,3,4,...,40} or N in {49,50,51}.

        Each corrupted row in STRINGRANGENEW_ with row number RN has the same RVALUE as the row in STRINGRANGENEW with row number RN-15.

        Knut Anders Hatlen added a comment -

        The attached class D4348.java reproduces the bug with a simpler schema and data set.
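
        The attachment itself is not reproduced here, but a minimal standalone sketch of the same kind of repro could look like the code below. It is an illustration under assumptions, not the actual D4348.java: the table names SRC and DST, the 64-row data set, and the database name d4348db are all made up; what it preserves from the report is a LONG VARCHAR column copied with INSERT INTO ... SELECT into a table that has an extra NOT NULL column with a DEFAULT.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class D4348Sketch {
            public static void main(String[] args) throws Exception {
                Connection c = DriverManager.getConnection("jdbc:derby:d4348db;create=true");
                Statement s = c.createStatement();
                s.execute("CREATE TABLE SRC (ID INT NOT NULL, V LONG VARCHAR NOT NULL)");
                s.execute("CREATE TABLE DST (ID INT NOT NULL, "
                        + "EXTRA INT DEFAULT -1 NOT NULL, V LONG VARCHAR NOT NULL)");

                // More than 16 rows, so that the bulk scan needs several
                // 16-row chunks; every row gets a distinct LONG VARCHAR value.
                PreparedStatement ins = c.prepareStatement("INSERT INTO SRC VALUES (?, ?)");
                for (int i = 1; i <= 64; i++) {
                    ins.setInt(1, i);
                    ins.setString(2, "row-" + i);
                    ins.executeUpdate();
                }

                // The statement pattern from the bug report: the column with
                // the DEFAULT clause is left out of the target column list.
                s.execute("INSERT INTO DST (ID, V) SELECT ID, V FROM SRC");

                // Compare the copied values on the client side (LONG VARCHAR
                // columns cannot be compared directly in SQL).
                ResultSet rs = s.executeQuery("SELECT S.ID, S.V, D.V FROM SRC S, DST D "
                        + "WHERE S.ID = D.ID ORDER BY S.ID");
                while (rs.next()) {
                    if (!rs.getString(2).equals(rs.getString(3))) {
                        System.out.println("corrupted row: " + rs.getInt(1));
                    }
                }
                c.close();
            }
        }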

        Knut Anders Hatlen added a comment -

        I'm able to reproduce this bug all the way back to 10.0.

        Knut Anders Hatlen added a comment -

        I've had a look at the repro in a debugger. It looks like the rows returned by the BulkTableScanResultSet that reads the source table are correct. The corruption seems to happen somewhere in the normalization after a chunk of 16 rows has been fetched from the BTSRS (BTSRS has an internal fetch buffer which holds 16 rows) and before the rows are inserted into the destination table. When the (16N+1)th row is normalized (N>0), the contents of the LONG VARCHAR column are actually copied into the SQLLongVarchar that holds the value for the (16N+16)th row. Why that happens is still not clear to me, but I suspect there's some kind of aliasing problem where two SQLLongVarchar references point to the same underlying instance.

        DataTypeDescriptor.normalize() has a special case for LONG VARCHAR (see below). If I comment out the special case, the repro doesn't produce a corrupt table, which is also an indication that the problem is related to the normalization.

        //doing the following check after normalize so that normalize method would get called on long varchs and long varbinary
        //Need normalize to be called on long varchar for bug 5592 where we need to enforce a lenght limit in db2 mode
        if ((jdbcId == Types.LONGVARCHAR) || (jdbcId == Types.LONGVARBINARY))
        {
            // special case for possible streams
            if (source.getClass() == cachedDest.getClass())
                return source;
        }
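
        The aliasing hazard can be illustrated without any Derby classes. In the toy program below (an illustration only, not Derby code), a mutable StringBuilder stands in for SQLLongVarchar, bufferSlot for one reused slot of the bulk scan's fetch buffer, and the reference kept by the special case plays the role of the cached destination that the next normalization writes into.

        public class AliasingToy {
            public static void main(String[] args) {
                // One slot of the fetch buffer, currently holding the last
                // value of chunk 1 (the row about to be normalized).
                StringBuilder bufferSlot = new StringBuilder("last value of chunk 1");

                // The LONG VARCHAR special case returns the source instead of
                // a private copy, so the destination cached for the next call
                // now aliases the buffer slot.
                StringBuilder cachedDestination = bufferSlot;

                // The scan refills the same slot with a row of chunk 2 that
                // has not been processed yet.
                bufferSlot.setLength(0);
                bufferSlot.append("unprocessed last value of chunk 2");

                // Normalizing the first row of chunk 2 copies its value into
                // the cached destination, which is still the buffer slot, and
                // thereby overwrites the unprocessed row.
                cachedDestination.setLength(0);
                cachedDestination.append("first value of chunk 2");

                // Prints "first value of chunk 2": a later row has silently
                // taken over the value of an earlier one.
                System.out.println(bufferSlot);
            }
        }
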
        Knut Anders Hatlen added a comment -

        Here's a patch (derby-4348-1a.diff) that adds a regression test case
        and fixes the problem.

        It turns out that there in fact is a problem with the special case for
        LONG VARCHAR and LONG VARBINARY when performing normalization of the
        values. Normally, DataTypeDescriptor.normalize() normalizes a
        DataValueDescriptor by copying it into another DataValueDescriptor and
        returning the copy. This destination DVD is cached and reused so that
        one doesn't need to reallocate it for every value to normalize.

        The special case for LONG VARCHAR and LONG VARBINARY changes this
        slightly by returning the source DVD instead of the destination DVD,
        apparently to avoid problems with shared streams.

        Now, NormalizeResultSet has an ExecRow field, called normalizedRow, in
        which the cached destination DVDs are stored. It is reused so that
        NormalizeResultSet.getNextRowCore() returns the exact same instance
        for every row. But since DataTypeDescriptor.normalize() returns the
        source DVD instead of the copy for LONG VARCHAR, the cached ExecRow
        will contain the original DVD and not the copy. When the next row is
        requested from the NormalizeResultSet, it will therefore use the
        source DVD for the previous row as the destination DVD for the call to
        normalize().

        Copying a column from the current row to the previous row is not a
        problem for most of the rows, as the previous row has already been
        processed. However, when processing the first row in a new chunk
        returned from BulkTableScanResultSet, the DVDs in the previous row
        have also been reused in the fetch buffer to hold the last row in the
        chunk. Since that row has not yet been processed, copying into it from
        the current row will affect what we see when we get to it later.

        The problem here is that NormalizeResultSet.normalizedRow serves two
        purposes: (1) Hold an ExecRow object that can be reused, and (2) hold
        one DataValueDescriptor per column that can be reused. This works fine
        as long as the actual DVD references in the ExecRow are not changed,
        but when one of the values is a LONG VARCHAR/LONG VARBINARY the
        references are changed.

        The patch addresses the problem by having a separate data structure
        for each of the two purposes. NormalizeResultSet.normalizedRow
        continues to cache the ExecRow object for reuse. A new field
        (cachedDestinations[]) is added to hold each individual
        DataValueDescriptor that should be reused. This way, changing the DVD
        references in normalizedRow does not change which destination DVD is
        used when processing the next row, and we don't end up modifying a DVD
        which is also present later in the fetch buffer of the bulk scan.
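
        A simplified sketch of that two-structure arrangement is shown below. It uses stand-in types rather than the actual Derby classes and signatures (Value plays the role of DataValueDescriptor, plain arrays play the role of ExecRow, and the class name NormalizeSketch is invented), so it only illustrates the shape of the fix, not the patch itself.

        final class NormalizeSketch {
            // Stand-in for a mutable DataValueDescriptor.
            interface Value {
                void copyFrom(Value source);
            }

            // Purpose (2): one reusable destination per column. These never
            // change identity, regardless of what the special case returns.
            private final Value[] cachedDestinations;

            // Purpose (1): the reusable row handed back to the caller. Its
            // references may be swapped freely without affecting which
            // destination is used for the next row.
            private final Value[] normalizedRow;

            NormalizeSketch(Value[] freshDestinations) {
                cachedDestinations = freshDestinations;
                normalizedRow = new Value[freshDestinations.length];
            }

            Value[] normalizeRow(Value[] sourceRow, boolean[] isLongType) {
                for (int i = 0; i < sourceRow.length; i++) {
                    Value dest = cachedDestinations[i];
                    if (isLongType[i]) {
                        // Special case: hand the source back, but only in the
                        // returned row; the cached destination is untouched.
                        normalizedRow[i] = sourceRow[i];
                    } else {
                        dest.copyFrom(sourceRow[i]);
                        normalizedRow[i] = dest;
                    }
                }
                return normalizedRow;
            }
        }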

        Description of changes:

        • NormalizeResultSet.java:
          • new field cachedDestinations which takes over some of the
            responsibility from normalizedRow
          • new helper methods getCachedDestination() and getDesiredType() to
            reduce the complexity of normalizeRow()
          • removed unneeded throws clause from fetchResultTypes() to prevent
            getDesiredType() from having to inherit the unneeded clause
        • DataTypeDescriptor.java:
          • removed code in normalize() that initializes the cached destination
            if it is null, since this is now handled by
            NormalizeResultSet.getCachedDestination()
        • InsertTest.java:
          • new JUnit test which exposes the bug

        The regression tests ran cleanly with this patch.

        Knut Anders Hatlen added a comment -

        Committed revision 808850.

        Knut Anders Hatlen added a comment -

        Merged fix to 10.5 and committed revision 809490.


          People

          • Assignee: Knut Anders Hatlen
          • Reporter: Stefan Huber
          • Votes: 0
          • Watchers: 0
