DERBY-3907: Save useful length information for Clobs in store

    Details

• Type: Improvement
• Status: Closed
• Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.5.1.1
    • Fix Version/s: 10.5.1.1
    • Component/s: JDBC, Store
    • Labels:
      None
    • Bug behavior facts:
      Performance

      Description

      The store should save useful length information for Clobs. This allows the length to be found without decoding the whole data stream.
      The following thread raised the issue on what information to store, and also contains some background information: http://www.nabble.com/Storing-length-information-for-CLOB-on-disk-tp19197535p19197535.html

      The information to store, and the exact format of it, is still to be discussed/determined.
      Currently two bytes are set aside for length information, which is inadequate.

      1. derby-3907-7a-write_new_header_format-PREVIEW.diff
        44 kB
        Kristian Waagan
      2. derby-3907-7a-write_new_header_format.stat
        0.8 kB
        Kristian Waagan
      3. derby-3907-7a-write_new_header_format.diff
        76 kB
        Kristian Waagan
      4. derby-3907-7a-write_new_header_format.diff
        76 kB
        Kristian Waagan
      5. derby-3907-7a3-use_new_header_format.stat
        0.3 kB
        Kristian Waagan
      6. derby-3907-7a3-use_new_header_format.diff
        27 kB
        Kristian Waagan
      7. derby-3907-7a2-use_new_framework.stat
        0.5 kB
        Kristian Waagan
      8. derby-3907-7a2-use_new_framework.diff
        29 kB
        Kristian Waagan
      9. derby-3907-7a1-write_new_header_format.diff
        20 kB
        Kristian Waagan
      10. derby-3907-6a-SQLClob_stream_descriptor_sync.diff
        1 kB
        Kristian Waagan
      11. derby-3907-5a-use_getStreamWithDescriptor.stat
        0.8 kB
        Kristian Waagan
      12. derby-3907-5a-use_getStreamWithDescriptor.diff
        30 kB
        Kristian Waagan
      13. derby-3907-4a-add_getStreamWithDescriptor.stat
        0.3 kB
        Kristian Waagan
      14. derby-3907-4a-add_getStreamWithDescriptor.diff
        13 kB
        Kristian Waagan
      15. derby-3907-3b-readertoutf8stream_cleanup.diff
        15 kB
        Kristian Waagan
      16. derby-3907-3a-readertoutf8stream_cleanup.stat
        0.3 kB
        Kristian Waagan
      17. derby-3907-3a-readertoutf8stream_cleanup.diff
        18 kB
        Kristian Waagan
      18. derby-3907-3a-readertoutf8stream_cleanup.diff
        18 kB
        Kristian Waagan
      19. derby-3907-2c-header_write_preparation-PREVIEW.stat
        0.7 kB
        Kristian Waagan
      20. derby-3907-2c-header_write_preparation-PREVIEW.diff
        38 kB
        Kristian Waagan
      21. derby-3907-2c-header_write_preparation.stat
        0.7 kB
        Kristian Waagan
      22. derby-3907-2c-header_write_preparation.diff
        48 kB
        Kristian Waagan
      23. derby-3907-2c-header_write_preparation.diff
        48 kB
        Kristian Waagan
      24. derby-3907-2b-header_write_preparation.stat
        0.5 kB
        Kristian Waagan
      25. derby-3907-2b-header_write_preparation.diff
        15 kB
        Kristian Waagan
      26. derby-3907-2b-header_write_preparation.diff
        14 kB
        Kristian Waagan
      27. derby-3907-1a-alternative_approach.diff
        12 kB
        Kristian Waagan

          Activity

          Kristian Waagan added a comment -

A few starting points for discussion follow (all about the meta-information for Clobs).
          I have assumed the following prerequisites:
          1) Clob modifications are done on a copy (i.e. TemporaryClob).
          2) The meta-information is of fixed length and at the start of the data stream (first page), so that it can be updated after the data has been streamed to store.

          a) Format specification byte
          Shall we use a format specification ("magic number") byte?

          b) Maximum Clob length (in characters)
          How many bits shall we use for the Clob length?
Is representing today's maximum (2G-1) enough, or should we leave some headroom?

          c) Storing byte length
          I mentioned storing the byte length as well, but haven't found any strong use cases.
          Opinions?

          [Optimizations]

          d) Bytes per character information
Use a few bits to save bytes-per-character information, which can be used to optimize positioning.
If the value is non-zero, one can calculate the byte position from the char position without decoding the stream (see the sketch after this list).
          This information must be obtained by looking at all the bytes in the Clob, typically when inserting it.
          Example with 2 bits:
          0 = unknown/mixed
          1 = one byte per char
          2 = two bytes per char
          3 = three bytes per char
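For concreteness, a minimal sketch of the positioning arithmetic this field would enable; the helper name, the headerLength parameter, and the 1-based character positions are illustrative assumptions, not Derby code.

    // Hypothetical helper: maps a 1-based character position to a byte
    // offset, possible only when the 2-bit field says every character
    // has the same encoded width.
    static long byteOffset(long charPos, int bytesPerChar, int headerLength) {
        if (bytesPerChar == 0) {
            return -1; // unknown/mixed: the stream must be decoded instead
        }
        return headerLength + (charPos - 1) * (long) bytesPerChar;
    }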

          e) Save "key positions" for the Clob
          For instance save the char/byte positions for 25%, 50% and 75% of the Clob.
          This increases space overhead, but reduces the decoding/positioning costs for large Clobs.
          Also adds some complexity to the positioning logic in upper layer code (i.e. above store).

          Please comment on these issues.
          Information about the upgrade issue is also appreciated.

          Mike Matrigali added a comment -

          upgrade comments:
o In keeping with the current Derby goals, I believe any change of format should continue to support
the hard/soft upgrade paradigm. Given that, any change will need code that supports both
the old and new formats.

o So then the decision is when to use the new format. At a high level, the options (though not sure all are feasible)
are:
          1) every different clob in existing table
          2) only new tables support new format, in hard upgraded db's support new format
          3) only newly created DB's support new format.

1) Offhand I don't think we have enough per-data-value information to figure out an old format clob
vs. a new format clob - but maybe someone else can come up with a magic sequence of initial
bytes. It would be nice if the new format made it so in the future other possible optimizations
could be handled on a per-row basis - with some sort of format byte.
2) This option seems the most flexible and readily implemented. By creating a new format id for
the new type of clob, the code can be made to easily know the difference between a table
of old clobs and new clobs. If someone really wants to data-upgrade an old db then an offline
compress in a hard upgraded db should automatically do it, with the added benefit of defragmenting
and compressing the tables.
          3) I think this is sort of like #2 and is not really any less code.

o Also there is a decision on what to do at hard upgrade time. Either the newly hard-upgraded db supports both formats, or one has to do a per-row data upgrade. No Derby release has done this
(and it was never done before the code was donated to Apache either). My opinion would be to
continue to support both formats.

          Mike Matrigali added a comment -

          I will post some more comments on your proposal, but need some clarification.

          What does the following mean? Will the changes apply to all sql which inserts clobs, or to only particular jdbc interfaces?
          1) Clob modifications are done on a copy (i.e. TemporaryClob).

What is the expected call sequence to store, and what is the goal performance characteristic? I.e.,
for an insert of a non-lengthed clob is it something like:
insert unlengthed clob into store, calculating length along the way
update clob in store, changing the N leading bytes

          Kristian Waagan added a comment -

          [Upgrade issues]

          Thanks for the information on the upgrade issue, Mike.
I think it sounds smart to follow existing patterns. Having never written this type of code, I'll need some more time and more advice when I start working on it.

          So to sum it up (correct me if I got this wrong):
          a) Support hard/soft upgrade
          b) Use new format in new tables (in newly created databases and in hard upgraded databases)
          c) Leave data as it is at hard upgrade and support both formats for the "stream header"

          Is the format id stored per table only, or per row?

Apart from the lower level format id code, I see that changes must be made in various other classes (i.e. ReaderToUTF8Stream and SQLClob).
          I'm not sure, but I think some of the newly added classes (10.3 and later) must be reviewed too, as they reset the stream to do positioning. Currently a reset means to position the UTF-8 stream from store at position 2 to skip the two length bytes.

          Kristian Waagan added a comment -

          [Header format]

          Mike wrote:


          What does the following mean? Will the changes apply to all sql which inserts clobs, or to only particular jdbc interfaces?
          1) Clob modifications are done on a copy (i.e. TemporaryClob).


          With Clob modifications I mean updates of parts of an existing Clob. To get into this state, one must first do a select to get the Clob that has already been stored in the database. I think updating parts of the Clob can only be done through the Clob interface. Is that correct?

          The ResultSet.updateXXX-methods can be seen as inserting a new Clob.
          My current hope is that all insertion will go through ReaderToUTF8Stream, which seems like a good place to count characters (and bytes) and obtain bytes per char statistics.

          There might be a slight complication as we allow using setString on Clob columns.


          What is the expected call sequence to store, and what is the goal performance characteristic?


          The expected call sequence is exactly as you describe it (see Mike's comment from 10/Oct/08 10:10 AM).
          Depending on the information we need to obtain, the header can be written at once or as the last step of insertion. Even if we only store length information, we need to support the latter due to the lengthless JDBC methods.

The goal performance characteristic for the length operation is that getting the length for the largest storable Clob should be as fast as for the shortest one (read first page and decode stream header bytes). This is not the case today, because the Clob data must be decoded to find the length. Besides Clob.getLength, this hurts us wherever other methods do argument checking using the Clob length.

          Positioning can be expressed with costs like this:
          [reset stream] + decode_chars + skip_bytes
          In certain cases, we can remove the decoding costs by knowing that all chars are represented by one, two or three bytes. In these cases, the positioning cost should be as for Blob. This is the motivation for the bytes per char information.
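As a hedged sketch of that cost model (the helper and its parameters are assumptions for illustration): with a known fixed bytes-per-char value the decode_chars term drops out, and positioning degenerates to skipping raw bytes.

    // Hypothetical positioning helper operating on the raw store stream.
    static void positionTo(java.io.InputStream raw, long charPos,
                           int bytesPerChar, int headerLength)
            throws java.io.IOException {
        long target = headerLength + (charPos - 1) * (long) bytesPerChar;
        long skipped = 0;
        while (skipped < target) {
            long n = raw.skip(target - skipped);
            if (n <= 0) {
                throw new java.io.EOFException("position past end of stream");
            }
            skipped += n;
        }
    }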

          Kristian Waagan added a comment -

          I'm a bit unsure how to handle the format id issue.

          It is not clear to me what the differences between the various clob fields in StoredFormatIds are:

          • CLOB_TYPE_ID
          • CLOB_COMPILATION_TYPE_ID
          • CLOB_TYPE_ID_IMPL
          • SQL_CLOB_ID

          In the current implementation, I think I need to access this information up to the JDBC level, i.e. in classes like StoreStreamClob, SQLClob and possibly a ResultSet class.
          To me it seems I need two versions of the SQL_CLOB_ID, but I'm not sure.
          Can anyone with more knowledge about the type system guide me?

          I do think we need a new format id, because the current format (two bytes that can have any values) makes it hard to add another stream header format.

          Mike Matrigali added a comment -

          > I'm a bit unsure how to handle the format id issue.
          >
          > It is not clear to me what the differences between the various clob fields in
          StoredFormatIds are:
          > - CLOB_TYPE_ID
          > - CLOB_COMPILATION_TYPE_ID
          > - CLOB_TYPE_ID_IMPL
          > - SQL_CLOB_ID
          For the reading and writing of the data to disk the SQL_CLOB_ID is the one.
          It is associated with the SQLClob class.

The SQL layer when it creates a table gives the format id of each of the
columns. This format id is used by store to tell what object it should
create to interpret the bytes on disk. So say you create a SQL_CLOB_VER2_ID,
and associate it with SQLClob2, and work the code such that new db's and/or
hard upgraded db's always use the new class/id when creating new tables.
          Then Store will use SQLClob2 to read data into, and count on
          SQLClob2.readExternal
          SQLClob2.readExternalFromArray
          SQLClob2.writeExternal

          to read and write your new format.

          Note that for current SQLClob these are all inherited from SQLChar currently.
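As a standalone sketch (not Derby code) of the read/write hooks listed above, assuming the new header simply carries a 32-bit character count; the class name and layout are placeholders only.

    import java.io.*;

    // Illustrative only: Derby's real SQLClob inherits these methods from
    // SQLChar, and the final header layout is still under discussion here.
    class ClobHeaderSketch implements Externalizable {
        long charLength; // 0 = unknown

        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt((int) charLength); // assumed: 32-bit char count
            // ... UTF-8 encoded data follows ...
        }

        public void readExternal(ObjectInput in) throws IOException {
            charLength = in.readInt() & 0xFFFFFFFFL;
            // ... data follows; no decoding needed to learn the length
        }
    }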

In general what has been done in the past is to switch the format ids in the
new code such that the "current" version has the existing name, and that
the old version has the new name. So SQLClob would get the new version id,
and a new class would get the old version id, something like SQLClob_10_4.java.

          For an example check out:
          java/engine/org/apache/derby/impl/store/access/btree/index/B2I.java
          java/engine/org/apache/derby/impl/store/access/btree/index/B2I_10_3.java
          java/engine/org/apache/derby/impl/store/access/btree/index/B2I_v10_2.java

          Again this is only talking about the issue of reading/writing disk at store
          level. There may be other code with stream dependencies that thinks it
          "knows" the format of the stream. The way this kind of thing has been handled
          in the past is to make the runtime version of the object always look like the
          "current" format. For past upgrades this has been easy as it usually is just
          another field in the object that one can pick a default for if it is an old
          version. In this way it isolates the upgrade code to just the to/from
          disk part of the code. With streams this may be more challenging.

          >
> In the current implementation, I think I need to access this information up to
the JDBC level, i.e. in classes like StoreStreamClob, SQLClob and possibly a
ResultSet class.
          > To me it seems I need two versions of the SQL_CLOB_ID, but I'm not sure.
          > Can anyone with more knowledge about the type system guide me?
          >
          > I do think we need a new format id, because the current format (two bytes that
          can have any values) makes it hard to add another stream header format.

          Mike Matrigali added a comment -

          The updating discussion has been a bit confusing. I am concentrating on the store interface to update an object on disk. I realize there are a number of in memory clob update paths, but that is a different issue. Currently there is no way to update "part" of the object in the store
          interface, only the whole object.

          So without store changes your normal use case for the lengthless updates would require the entire clob to be rewritten to disk.

I think easiest would be to support a new store interface that allowed the update of the initial n bytes by an exact same number of bytes. Currently for long clobs the usual case would be that we used all the bytes on the 1st page for the first piece of the linked list, so expanding by any number of bytes would be a worst case. We would log at least the entire "old" portion of the row. If the lengths don't exactly match there is more logging complication. So my guess at ease of implementation from easiest to hardest supporting some subset of bytes update:
          o exact leading number of bytes
          o shrinking leading number of bytes
          o expanding leading number of bytes
          o arbitrary positioned number of bytes

This probably would mean a new log record, and thus we need to be careful again about hard/soft upgrade. We have done log record updates (or new ones) in Derby so an example
exists.

          Kristian Waagan added a comment -

          Thanks a lot, Mike!

I think I know enough to start working on some code now. The first patch(es) will be far from complete, but hopefully they can take the issue forward.
          My initial plan is to create two subtasks:
          o add a new version of SQLClob.
          o add capability to update parts of an object to store

The first part hopefully allows me to create tables with the new Clob format, without actually storing length information. I'll just use the existing EOF marker and skip the header bytes. As a side effect, running the test suite should help me detect the places where the JDBC layer thinks it knows the store format.
          Maybe I can limit the knowledge about the exact format in the new version of SQLClob?

          Regarding the second part, I need to look at the example. I think replacing a fixed number of bytes should suffice, as Clobs are intended for large objects and therefore the header bytes overhead is acceptable. This means the header (which is part of the data stream) will take up just as many bytes for a 2 GB Clob as for a ~32K one.
          I have no intention to provide functionality to replace arbitrary positioned bytes, as this "conflicts" with the current LOB implementation and is a large undertaking (design and implementation).

Hopefully we can use the functionality added in the second part to improve the situation for "lengthless Blobs" as well, at least in some cases.

          Kristian Waagan added a comment -

          I got stuck trying to implement the original solution, so I tried an alternative approach.

It is a lot simpler, but people might not like it. Note, however, that it follows roughly the same pattern as Blob.
          Note the patch is a quick mash-up, and I want some feedback from the community.

          The alternative approach is to make all classes writing and reading data from store able to peek at it and determine which format it has to use to read/write the data.
          Including my second format, we have these two byte formats:

          • current: D1_D2_DATA
          • new: D4_D3_M_D2_D1_DATA

M is a magic byte, and is used to detect the new format. It is an illegal UTF-8 encoding, so it should not be possible to interpret it incorrectly as the first format and data.
I have set M to F0 (11110000), but I'm masking out the last four bits when looking for the magic byte. This makes it possible to have arbitrarily many formats, should that be necessary; the main point is to keep the four highest bits set.
          With respect to data corruption (i.e. one bit getting flipped), is this approach safe enough?

          So if we need to be able to store huge Clobs in the future, we could change M and use another format:

• future: D6_D5_M_D4_D3_D2_D1_DATA

The same approach could be used to store other meta-information.
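To make the detection concrete, a minimal sketch of the peek logic implied by the layouts above; the big-endian byte order and field widths are my reading of the comment, not verified against the patch.

    // Hypothetical decode of the two layouts above (big-endian assumed);
    // hdr must hold at least the first five bytes of the stored stream.
    static boolean isNewFormat(byte[] hdr) {
        // M keeps its four high bits set (0xF0..0xFF), an invalid first
        // byte of a UTF-8 sequence, so an old-format header followed by
        // character data can never match.
        return (hdr[2] & 0xF0) == 0xF0;
    }

    static long storedLength(byte[] hdr) {
        if (isNewFormat(hdr)) { // new: D4_D3_M_D2_D1 -> character count
            return ((hdr[0] & 0xFFL) << 24) | ((hdr[1] & 0xFFL) << 16)
                 | ((hdr[3] & 0xFFL) << 8) | (hdr[4] & 0xFFL);
        }
        // old: D1_D2 -> 16-bit byte count (0 when unknown)
        return ((hdr[0] & 0xFFL) << 8) | (hdr[1] & 0xFFL);
    }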

          The patch 'derby-3907-alternative_approach.diff' only changes behavior for small Clobs. To enable a new format for a larger Clob, the streaming classes have to be changed (ReaderToUTF8Stream, UTF8Reader).
          It should be noted that these classes are used to write other character types (CHAR, VARCHAR) as well, and I do not intend to change how they are represented. This means that I have to include enough information to be able to do the correct thing.

While the format can be detected on read, an informed decision must be made on write. Now I'm consulting the data dictionary to check the database version, and if it is less than 10.5 I use the old format. Is there a better way?

Regarding the original approach, I got stuck because the upper layers of Derby are sending down NULL values of the data types into store. The upper layer doesn't have any context information, and is unable to choose the correct implementation. The system doesn't seem to be set up for having multiple implementations of a single data type at this level.
          I ended up with a series of hacks, for instance having store override the Clob implementation type, but it just didn't work very well. At one point I had normal, soft- and hard-upgraded working, but compress table failed. I'm sure this isn't the only path that will fail.

          I might pick up the work again later, but right now I want to wait for a while and work on other issues.

          Kristian Waagan added a comment -

          Attached the wrong file. Adding the correct one.

          Knut Anders Hatlen added a comment -

          I think the magic byte sounds like an elegant solution to the problem. The invalid values for the first byte in the UTF-8-encoded sequence are used almost as if we had half a byte extra in the header reserved for future use. I think you are right that this will make the backwards-compatibility issues much easier to handle. As to your question about picking the right format when writing, I think your approach of checking the version of the data dictionary is the way it's normally done.

          Kristian Waagan added a comment -

          'derby-3907-2b-header_write_preparation.diff' is a preparation patch to lay the foundation for writing the new stream header format. Essentially, the behavior should be the same before and after the patch.

The patch adds a new method to the interface StringDataValue, since the header format will only apply to string data types. The method generateStreamHdr creates the stream header based on the version of the dictionary and the character length information, and also determines if the stream has to be ended with a Derby-specific end-of-stream marker.
The current implementation simply returns unknown length and instructs to terminate the value with a Derby EOF marker. This is because Derby pre 10.5 expects a byte count, not a char count. At this level, the byte count is generally unknown. The method generateStreamHdr will be overridden in SQLClob, and the new header format will be used there, unless we are running in soft-upgrade mode, where the old format will be used.
          The goal is that the knowledge about the exact format of the headers is contained in the DVD(s).
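A minimal sketch of the holder described here, with assumed fields (the actual patch may differ):

    // Hypothetical holder pairing the raw header bytes with the decision
    // on whether the Derby-specific end-of-stream marker is required.
    final class StreamHeaderHolder {
        final byte[] header;          // bytes to prepend to the stream
        final boolean writeEofMarker; // true when the length is unknown

        StreamHeaderHolder(byte[] header, boolean writeEofMarker) {
            this.header = header;
            this.writeEofMarker = writeEofMarker;
        }
    }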

          In ReaderToUTF8Stream I removed the instance variable maximumLength, because it is only needed in the constructor. Further, I added setHeader to allow the header to be overridden after the reader is instantiated. If the header isn't overridden, the reader should behave as before.

Note that things may have to be changed a bit if support for updating the header only is added (i.e. the lengthless scenario).

          Patch 2b ready for review.

          FYI, patch 2a is almost identical, except that it doesn't use a utility/holder class for the header. I think using the utility class is cleaner, but it does of course introduce yet another class.

          Kristian Waagan added a comment -

          Regression tests for patch 2b ran without failures with Java SE 6 on Solaris.

          Knut Anders Hatlen added a comment -

          I looked at the 2b patch, and it looks good to me. Please see some
          minor comments below:

          It seems like we always call ReaderToUTF8Stream.setHeader() right
          after we have created a ReaderToUTF8Stream object, and we never create
          a ReaderToUTF8Stream object without calling setHeader(). Would it be
cleaner to add an extra parameter to the constructor instead of
          requiring code that uses ReaderToUTF8Stream to call setHeader()? With
          the current patch, it seems like headerOverridden will always be
          true. Is that going to change? If not, the code would be simpler if
          headerOverridden was removed.

          I think StringDataValue.generateStreamHdr() should be renamed to
          generateStreamHeader() since it's a public method and a clear name is
          more important than saving three characters.

          EmbedResultSet.updateCharacterStreamInternal: Since the value returned
          by getDVDforColumnToBeUpdated() is now used twice, it is stored in a
          local variable, which is good. But since the method is quite long,
          perhaps it would be clearer to have the declaration of the variable
          closer to the end of the method where it's used?

          Kristian Waagan added a comment -

          I agree ReaderToUTF8Stream can be simplified quite a bit, and I'll take a look at it.

By adding the StreamHeaderHolder to the constructor, I think we'll lose one feature. The current implementation has the ability to go back and fill in the header if the value fits into the buffer (32768 bytes minus header size). The value written to the header is the byte count, and it is used as a hint for sizing the character array used when materializing the string value (CHAR, VARCHAR, LONG VARCHAR).
          This functionality can continue to exist, but I'm unsure how to do it. Some quick proposals:
          a) Duplicate header generation code in ReaderToUTF8Stream (trying to avoid this)
          b) Pass around a header generation object instead of a "pre-generated" header.
          c) Pass around reference to the StringDataValue/DVD object
          d) Special handling for all non-Clob data values (requires data type information)
          e) Simply check header length, if two bytes long then update header (hacky?)

I'm not sure how much performance will be affected by removing the feature, but options (d) and (e) seem pretty simple to implement. There are also some alternative "sizing heuristics", but I don't know how effective they are (I've seen InputStream.available() and the start-stop index on the page being used). When resizing, CHAR grows by 64 bytes, VARCHAR by 4 KB. Also note that the problem only affects values inserted with a stream/reader.
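As a concrete illustration of option (e), a small hedged sketch; the buffer layout and big-endian order are assumptions, not the patch's code.

    // Hypothetical back-fill: if the header slot is the old 2-byte form,
    // patch it with the now-known byte count once the buffer is drained.
    static void maybePatchHeader(byte[] buffer, int headerLength, int dataBytes) {
        if (headerLength == 2) {
            buffer[0] = (byte) (dataBytes >>> 8);
            buffer[1] = (byte) dataBytes;
        }
    }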

          Also, since we are already spending two bytes per string value to store meta information, it would be nice to actually use them optimally. I'm tempted to start using those two bytes for the character count, instead of the byte count. That way, the sizing hint would be exact for almost all CHAR, VARCHAR and LONG VARCHAR values (exceptions are those values inserted with a stream through the "lengthless overrides" where the byte representation exceeds ~32k bytes).
          The downsides of this are that some decoding loops must be modified and probably that the hint has to be ignored when reading old databases (pre 10.5). I also need to do a more thorough search for places where this information is used.

          I think using the new header format for all string types is a bad idea, as it will add an extra 3 bytes of overhead. Further, two of those bytes will never be used due to the maximum allowed length of the non-Clob data types.

          BTW, I have added a static variable for the unknown length stream header holder. A new patch will be uploaded later.

          Kristian Waagan added a comment -

Forget my rambling about the constructor, which has nothing to do with the issue at all. But the rest of the comment should have some substance.
It's time to get home...

          Kristian Waagan added a comment -

Patch 3a cleans up ReaderToUTF8Stream and changes the behavior in one way.
          It will now try to truncate CHAR and VARCHAR as well as CLOB. Only spaces are accepted as truncatable characters.
          Truncation is disallowed for LONG VARCHAR.

          In addition, I added some substance to the error message when truncation fails. I decided to not print the stream content, as it is not easily accessible and it may be very large.
          I also added quite a bit of JavaDoc/comments, removed the instance variable maximumLength and simplified the constructors.
          Since the type name is used to determine if truncation is allowed or not, I added a debug block verifying the name. As a consequence of this, I had to modify UTF8UtilTest.
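For illustration, a minimal sketch of the truncation rule just described (an assumed helper, not the patch's actual code):

    // Hypothetical check: truncation is allowed for CHAR, VARCHAR and CLOB,
    // and only succeeds when every character to be cut is a space.
    static boolean canTruncate(String typeName, char[] tail) {
        boolean allowed = typeName.equals("CHAR")
                || typeName.equals("VARCHAR") || typeName.equals("CLOB");
        if (!allowed) {
            return false; // e.g. LONG VARCHAR is disallowed
        }
        for (char c : tail) {
            if (c != ' ') {
                return false;
            }
        }
        return true;
    }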

          Regression tests ran with two failures (stressmulti and replication), and due to some changes I'm re-running the tests.
          Patch ready for review.

          Kristian Waagan added a comment -

          Wrong diff file... Reattaching 3a.

          Kristian Waagan added a comment -

          Attaching patch 3b.
          I had to pull parts of the patch out. The error reporting for truncation failure during insert from a stream isn't adequate, and requires additional work. I filed DERBY-4005 for some of it, but I expect there will be another Jira as well.

          The patch now changes ReaderToUTF8Stream and UTF8UtilTest only. See commit message for more detailed description of changes.

          Committed 3b to trunk with revision 732676.

          Knut Anders Hatlen added a comment -

          Does the 3b patch also change the behaviour with regards to truncation of CHAR and VARCHAR? Are these changes observable by users?

          Kristian Waagan added a comment -

          Patch 3b does change the behavior with regards to truncation of CHAR in one scenario, only possible through JDBC 4.0:
Inserting a stream that is too long (longer than the specified maximum width of the CHAR column) with a lengthless override. By a lengthless override, I mean a JDBC method taking a stream, but where the length of the stream is not specified.
          Before the change the user would get SQLState 22001, after the change the SQLState would be XJ001. A follow-up patch should revert this change of behavior (Jira not yet logged, is dependent on DERBY-4005).

          Further, I added CHAR to the list of allowed data types in ReaderToUTF8Stream.canTruncate, but the code using ReaderToUTF8Stream sets the number of characters to truncate to zero for all string data types except CLOB.
          The code changing the behavior described above is what I pulled out from patch 3a (see some related info in DERBY-4005).
Since the functionality is already in place, I think it makes sense to allow for truncation in the stream instead of materializing the data value and trying to truncate it later.
That said, the number of applicable use cases is limited, so it wouldn't be a disaster to leave it as it is.

Regarding the error reporting issue, it remains valid even if we drop the "extended truncation support".

          Kristian Waagan added a comment -

          Refreshed patch 2b, as the old one doesn't apply any more due to other changes.
          It has a few changes from the original version. The remaining comments from Knut Anders will be addressed in revision 2c.

          Kristian Waagan added a comment -

          Attached 'derby-3907-2c-header_write_preparation-PREVIEW.diff'.
          This is an early version of revision 2c. In general, Derby should behave as before, but it should now have the framework it needs to handle multiple header formats for the streams coming in.
          This is a partial solution only; Derby must also learn how to handle multiple header formats in non-streaming situations. I expect most of these changes will come in SQLChar and SQLClob.
          Finally, the new header format must be added.

          I'm thinking about making StreamHeaderHolder immutable again, because the holder for a 10.4 header for a stream with unknown length will be used a lot. That way we can keep an instance in SQLChar that can be shared. I'll look some more at it and comment further.

          Feel free to have a look and give me some early feedback.

          Kristian Waagan added a comment -

          Patch 'derby-3907-2c-header_write_preparation.diff' is ready for review.

          A description of the changes:
          o EmbedResultSet and EmbedPreparedStatement
          Started using the new ReaderToUTF8Stream constructor, where the stream header is passed in. Also started to treat the DataValueDescriptor as a StringDataValue, which should always be the case at this point in the code.

          o ReaderToUTF8Stream
          Added field 'header', which holds a StreamHeaderHolder coming from a StringDataValue object. Updated the constructors with a new argument. The first execution of fillBuffer now uses the header holder to obtain the header length, and the header holder object is consulted when checking if the header can be updated with the length after the application stream has been drained. Note that updating the header with a character count is not yet supported. This will be added in another patch together with the handling of the new header format.

          o StringDataValue
          Added new method generateStreamHeader.

          o SQLChar
          Implemented generateStreamHeader, which always returns a header for a stream with unknown length (see the constant). In most cases, the header will be updated after the stream has been drained. The exception is when the data values for VARCHAR and LONG VARCHAR are too long to be held in the byte buffer in ReaderToUTF8Stream, which can only happen when the string contains characters that require a two- or three-byte representation.

          o SQLClob
          Added a constant for a 10.5 stream header holder representing a stream with unknown character length. Also updated the use of the ReaderToUTF8Stream constructor.

          o StreamHeaderHolder
          Holder object for a stream header, containing the header itself and the following additional information: "instructions" on how to update the header with a new length, whether the length is expected to be in number of bytes or characters, and whether an EOF marker is expected to be appended to the stream. The object is considered immutable, but it does not copy the byte arrays passed to the constructor. I found this unnecessary because the code is called by internal code only, but should it still do defensive copying? (A sketch of the defensive-copying trade-off follows after this list.)

          o UTF8UtilTest
          Updated usage of the ReaderToUTF8Stream constructor, and replaced the hardcoded byte count to skip with a call to the header holder object.

          o ClobTest (jdbc4)
          Added some simple tests inserting and fetching Clobs to test the basics of stream header handling.

          o StreamTruncationTest
          New test exercising truncation of string data values when they are inserted as streams. Note the not-so-elegant error reporting (see the catch clause in insertSmall)...
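
          As mentioned under StreamHeaderHolder above, here is a minimal sketch of the defensive-copying trade-off for an immutable header holder. The class and member names are illustrative, not the actual StreamHeaderHolder API:

          import java.util.Arrays;

          // Illustrative immutable holder. Copying on the way in and out makes
          // the immutability watertight; skipping the copies is cheaper and safe
          // as long as all callers are trusted internal code -- the trade-off
          // discussed above.
          public final class HeaderHolderSketch {
              private final byte[] header;
              private final boolean lengthIsCharCount;
              private final boolean appendEofMarker;

              public HeaderHolderSketch(byte[] header, boolean lengthIsCharCount,
                                        boolean appendEofMarker) {
                  this.header = Arrays.copyOf(header, header.length); // copy in
                  this.lengthIsCharCount = lengthIsCharCount;
                  this.appendEofMarker = appendEofMarker;
              }

              public byte[] headerBytes() {
                  return Arrays.copyOf(header, header.length);        // copy out
              }

              public int headerLength() { return header.length; }
              public boolean lengthIsCharCount() { return lengthIsCharCount; }
              public boolean appendEofMarker() { return appendEofMarker; }
          }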

          Kristian Waagan added a comment -

          Uploaded a new version of patch 2c. The only change is in ReaderToUTF8Stream.checkSufficientData, where I had used the wrong header holder to determine if an EOF marker was to be written to the stream.

          With the exception of one failure in UTF8UtilTest, the regression tests ran successfully with JDK 1.6 on Solaris.
          I'm rerunning with the updated patch.

          Kristian Waagan added a comment -

          Committed patch 2c to trunk with revision 734065.

          Kristian Waagan added a comment -

          Patch 'derby-3907-4a-add_getStreamWithDescriptor.diff' adds the method getStreamWithDescriptor to StringDataValue.

          It is intended to be used when getting a stream from a StringDataValue for use with a Clob object, or with streaming of string data values in general. The DVD is responsible for returning a correct descriptor for the raw stream. The descriptor is in turn used by other classes to correctly configure themselves with respect to data offsets, buffering, repositioning and so on.
          This patch was part of a bigger patch, but I decided to split it into two to make it easier to review.
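
          To illustrate the kind of information such a descriptor carries, here is a hedged sketch; the class and field names are made up and do not mirror the actual CharacterStreamDescriptor API:

          // Illustrative read-only bundle of facts about a raw store stream,
          // which readers can use to position themselves past the header, honor
          // length limits, and decide whether repositioning is possible.
          public final class StreamDescriptorSketch {
              private final long dataOffset;    // first byte of user data
              private final long byteLength;    // 0 when unknown
              private final long charLength;    // 0 when unknown
              private final boolean positionAware;

              public StreamDescriptorSketch(long dataOffset, long byteLength,
                                            long charLength, boolean positionAware) {
                  this.dataOffset = dataOffset;
                  this.byteLength = byteLength;
                  this.charLength = charLength;
                  this.positionAware = positionAware;
              }

              public long dataOffset() { return dataOffset; }
              public long byteLength() { return byteLength; }
              public long charLength() { return charLength; }
              public boolean positionAware() { return positionAware; }
          }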

          Patch description:
          o CharacterStreamDescriptor
          Added a toString method and more verbose assert-messages.

          o StringDataValue
          Added method 'CharacterStreamDescriptor getStreamWithDescriptor()'.

          o SQLChar
          Made setStream non-final so it can be overridden in SQLClob.
          Added default implementation of getStreamWithDescriptor that always returns null. This means that all non-Clob string data types will be handled as strings instead of streams in situations where a stream is requested through getStreamWithDescriptor. I'll look into the performance implications of this a little later, when more of the final code is in place.
          Made throwStreamingIOException protected to access it from SQLClob.

          o SQLClob
          Implemented getStreamWithDescriptor, handling the old 2-byte format only.
          Overrode setStream to discard the stream descriptor when a new stream is set for the DVD.

          Patch ready for review.
          I will commit this shortly, but since the code isn't used yet it should be harmless. That shouldn't stop any reviewers though!
          I'll also post the next patch shortly.

          Kristian Waagan added a comment -

          Committed patch 4a to trunk with revision 734148.

          Kristian Waagan added a comment -

          Patch 'derby-3907-5a-use_getStreamWithDescriptor.diff' takes the new StringDataValue.getStreamWithDescriptor() into use.

          Description of changes:
          o EmbedClob
          Changed constructor to take a StringDataValue instead of a DataValueDescriptor.
          Updated call to the StoreStreamClob constructor.

          o EmbedResultSet
          Started using the getStreamWithDescriptor method and updated invocations of the UTF8Reader constructor.

          o StoreStreamClob
          Added a CharacterStreamDescriptor, and made the constructor take one as an argument.
          Adapted the class to use a CSD.

          o UTF8Reader
          Updated some comments.
          Fixed bug where the header length wasn't added to the byte length of the stream, and updated the class appropriately (adjusted utfCount, fixed the reset routine).
          Made sure the header bytes are skipped (either by skipping them in the constructor or by adjusting the position on the next reposition); see the sketch after this list.

          o ResultSetStreamTest
          Added a test for maxFieldSize, where truncation has to happen.

          o Various tests
          Adjusted tests to run with the new implementation.
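
          As referenced under UTF8Reader above, a minimal sketch of skipping the header bytes of a raw stream before decoding begins. This is illustrative, not the actual UTF8Reader code:

          import java.io.IOException;
          import java.io.InputStream;

          final class HeaderSkippingSketch {
              // Positions the raw stream at the first data byte by skipping
              // exactly headerLength bytes, falling back to read() because
              // InputStream.skip() may legally skip fewer bytes than requested.
              static void skipHeader(InputStream raw, int headerLength)
                      throws IOException {
                  long remaining = headerLength;
                  while (remaining > 0) {
                      long skipped = raw.skip(remaining);
                      if (skipped <= 0) {
                          if (raw.read() == -1) {
                              throw new IOException("EOF inside stream header");
                          }
                          skipped = 1;
                      }
                      remaining -= skipped;
                  }
              }
          }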

          Patch ready for review.
          Regression tests ran cleanly, but due to a last-minute change I have to rerun them. I will post the results tomorrow.

          Kristian Waagan added a comment -

          The rerun of the regression tests (derbyall and suites.All) was successful, with no errors/failures.

          Kristian Waagan added a comment - edited

          Committed patch 5a to trunk with revision 734630.

          A quick comment on the following piece of code from EmbedResultSet.getCharacterStream:


          CharacterStreamDescriptor csd = dvd.getStreamWithDescriptor();

          if (csd == null) {
              String val = dvd.getString();
              if (lmfs > 0) {
                  if (val.length() > lmfs) {
                      val = val.substring(0, lmfs);
                  }
              }
              java.io.Reader ret = new java.io.StringReader(val);
              currentStream = ret;
              return ret;
          }

          // See if we have to enforce a max field size.
          if (lmfs > 0) {
              csd = new CharacterStreamDescriptor.Builder().copyState(csd)
                      .maxCharLength(lmfs).build();
          }

          java.io.Reader ret = new UTF8Reader(csd, this, syncLock);


          The last "if (lmfs > 0)" statement will never be run as long as SQLChar.getStreamWithDescriptor() returns null, because then the value will be materialized (SQLChar.getString()).
          We can allow non-Clob values to be treated as streams as well, but I'm not sure it is worth it due to the limited size (max 32700 chars).
          Opinions?

          Kristian Waagan added a comment -

          Attached a bug fix (6a) for a problem where the stream could get out of sync with the descriptor.

          Committed to trunk with revision 734758.

          Kristian Waagan added a comment -

          Patch 'derby-3907-7a-write_new_header_format-PREVIEW.diff' enables the new header format.

          I ran into some problems with obtaining the version of the database being written into, so I had to change where the header is generated. To be able to use the context service to gain access to the data dictionary, there must be a context. No context is pushed in, for example, EmbedPreparedStatement.setCharacterStream. To solve this, I added the requirement that StringDataValue.generateStreamHeader has to be invoked when a context is set up.

          The new approach is to generate the header when the store asks for the data (during execute), which happens in ReaderToUTF8Stream.fillBuffer. The downside of this approach is that ReaderToUTF8Stream now takes a StringDataValue as a constructor argument. This makes a lot more code available in the reader, and it also makes the reader harder to test.

          I also considered adding a method to tell the DVD the version of the database, but I think it will be hard to make Derby invoke this method in all valid use-cases, and it breaks with the pattern used in the existing classes.
          A second option is to add a stream header generation object, which can be passed in to ReaderToUTF8Stream. I think this can be done by modifying/replacing the StreamHeaderHolder, and I think it can be done easily. The difference is that the header generation will be postponed.
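
          A hedged sketch of what such a header generation object could look like; the names below are illustrative, not the interface that was eventually committed:

          // Illustrative interface for postponed header generation: the object
          // is handed to the stream up front, and the header is only produced
          // (and possibly updated) once the length is known.
          public interface HeaderGeneratorSketch {
              /** Maximum header length in bytes, so buffers can reserve space
               *  up front. */
              int maxHeaderLength();

              /** Writes a header for the given value length into buf at offset
               *  and returns the number of bytes written. valueLength may be -1
               *  when the length is not yet known. */
              int generateInto(byte[] buf, int offset, long valueLength);

              /** Whether the header encodes a character count rather than a
               *  byte count. */
              boolean expectsCharCount();
          }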

          I'll continue the work on Monday, and I will most likely post another patch implementing the approach with a separate class generating the header.

          Please comment if you think I'm heading down the wrong road.

          Kristian Waagan added a comment -

          The patch 'derby-3907-7a-write_new_header_format.diff' is the second attempt at implementing the handling of the new stream header format.

          Hopefully, the performance of reading/writing CHAR, VARCHAR and LONG VARCHAR hasn't suffered, but I'll run some performance tests to confirm it. The performance of reading the old header format for Clobs (i.e. accessing old databases) may suffer a little, because in some situations too many bytes are read and the stream has to be reset before being read again. I need to test this as well.

          Description of the changes:

          • EmbedResultSet
            Adjusted the usage of the ReaderToUTF8Stream constructor. Note the usage of 'setSoftUpgradeMode', which is required because the stream can be read and written when the context stack hasn't been set up. This is true for updatable result sets. When there is no context, the context service fails to obtain the DatabaseContext used by the generator to get to the data dictionary.
          • EmbedPreparedStatement
            Adjusted the usage of the ReaderToUTF8Stream constructor.
          • ArrayInputStream
            Added an argument to 'readDerbyUTF'. The stream header is now read outside of this method.
          • StreamHeaderGenerator
            Added an interface for generating stream headers.
          • CharStreamHeaderGenerator
            New class generating old-style headers (two bytes long). Always used for CHAR, VARCHAR and LONG VARCHAR. It is also used for CLOB in pre-10.5 databases (i.e. in soft upgrade mode).
          • ClobStreamHeaderGenerator
            New class generating new-style headers (five bytes long). Used only for CLOBs written into a 10.5 database. If an old-style header is needed, the work is delegated to CharStreamHeaderGenerator. A rough sketch of the two header flavors follows at the end of this comment.
          • ReaderToUTF8Stream
            Added a StreamHeaderGenerator to the constructors, and updated the header writing logic to use it. Also added a constant to distinguish the first invocation of 'fillBuffer'. The header is generated on the first invocation, and possibly updated again in 'checkSufficientData'.
          • StringDataValue
            Replaced method 'generateStreamHeader(long)' with 'getStreamHeaderGenerator()'. Added method 'setSoftUpgradeMode'. The latter is used in situations where the generator itself is unable to determine whether the database being written into is in soft upgrade mode.
          • SQLChar
            Factored out code to write the modified UTF-8 format (see 'writeUTF'). Updated 'writeExternal', which will now only be invoked for non-CLOB data values. Added method 'writeClobUTF', which is used to write CLOB data values. It is kept in SQLChar to avoid having to make more of the internal state available to the subclasses. Added a second version of 'readExternal', which is the one doing the actual work. It takes both a byte count and a char count, where both can be unknown. Implemented the new method in StringDataValue.
          • SQLClob
            Added variable 'inSoftUpgradeMode', which tells whether the DVD is used in a database that is in soft upgrade mode. This must be known to generate the correct header format. Note that the mode may be unknown, in which case the header generator itself will try to determine it. Implemented 'getLength', which will obtain the length from the stream header, delegate the work to 'SQLChar.getLength' if the value is not a stream, or decode the stream data if the length is not stored in the header. The data value is not materialized. Added support for reading both stream header formats in 'getStreamWithDescriptor'. Implemented 'investigateStream' to decode the header. Added 'writeExternal', 'readExternal' and 'readExternalFromArray'. In general, some preparation steps are taken and then the work is delegated to SQLChar. Added utility class HeaderInfo.
          • StreamHeaderHolder
            Deleted the class.
          • UTF8UtilTest
            Updated code to use the new generator class.

          Patch ready for review.
          I have run the regression tests without failures, but due to a small last-minute change I will run them again tonight.
          I have also manually tried reading and writing Clobs from a 10.4 database, both in soft and hard upgrade mode.

          Based on Mike's suggestion, I was hoping that a table compress would update the old headers to the new header format after a hard upgrade. This is correct in principle, but the new header is written with "unknown length" encoded. I haven't investigated how best to solve this problem.
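
          As referenced in the ClobStreamHeaderGenerator description above, a hedged sketch of the difference between the two header flavors. The exact bit layout Derby writes is defined by the patch and is not reproduced here; the marker byte and field layout below are assumptions for illustration only:

          // Illustrative only: a 2-byte header carrying a byte count (old
          // style, as in modified UTF-8) versus a 5-byte header carrying a
          // character count (new style). The real encodings live in
          // CharStreamHeaderGenerator and ClobStreamHeaderGenerator.
          public final class HeaderFormatSketch {

              /** Old-style header: an unsigned 16-bit byte count. */
              static byte[] oldStyle(int byteLength) {
                  return new byte[] {
                      (byte) ((byteLength >>> 8) & 0xFF),
                      (byte) (byteLength & 0xFF)
                  };
              }

              /** New-style header (hypothetical layout): one marker byte
               *  followed by an unsigned 32-bit character count. */
              static byte[] newStyle(byte marker, long charLength) {
                  return new byte[] {
                      marker,
                      (byte) ((charLength >>> 24) & 0xFF),
                      (byte) ((charLength >>> 16) & 0xFF),
                      (byte) ((charLength >>> 8) & 0xFF),
                      (byte) (charLength & 0xFF)
                  };
              }
          }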

          Kristian Waagan added a comment -

          Replaced patch 7a (removed two lines of debugging code).

          Kristian Waagan added a comment -

          I decided to split up patch 7a due to its size.
          Patch 7a1 adds the new stream header generator classes (one interface, two implementations).

          Committed to trunk with revision 736612.

          Kristian Waagan added a comment -

          Patch 'derby-3907-7a2-use_new_framework.diff' is the second part of 7a.
          It prepares the code to deal with multiple stream header formats, but doesn't change the current behavior regarding stream headers.

          Committed to trunk with revision 736636.

          I will upload the next patch, but will wait a few days before I commit it to see if any problems are detected with patch 7a2.

          Kristian Waagan added a comment -

          Patch 7a3 is the patch that enables the new header format.
          I plan to commit this Tuesday or Wednesday next week.

          Patch ready for review.

          Remaining work:
          o Determine if the database version should be kept in the Database object.
          It all depends on how expensive it is to consult the data dictionary about the version.
          o Investigate upgrade further to understand whether long columns can easily be upgraded with a new header or not.
          o (stretch task) Investigate how to add the functionality to update only the first few bytes of a data value.

          NOTE: After this patch has been committed, databases containing CLOBs written with a Derby revision after the commit cannot be read with a Derby revision before the commit.

          Kristian Waagan added a comment -

          Committed patch 7a3 to trunk with revision 738408.

          I think some code in SQLChar can be removed now, but I'll wait a little while before I start looking at it.
          As noted earlier, databases created with revision 738408 or later containing Clobs cannot be used with earlier revisions any more.

          Myrna van Lunteren added a comment -

          It looks like most of the work for this issue has been completed for 10.5.
          Is it possible to log a new issue for the remaining tasks?
          Does this require a release note?

          Kristian Waagan added a comment -

          I'll log at least one new issue, tracking the final piece of functionality needed to insert the required metadata for Clobs inserted with one of the JDBC 4.0 "length-less overrides".

          I don't think this issue requires a release note, but I'll comment on a few things before I close the issue.

          Kristian Waagan added a comment -

          See DERBY-4652 for a possible improvement (dealing with JDBC 4.0 "length less overrides").

          Kristian Waagan added a comment -

          Closing.


            People

            • Assignee: Kristian Waagan
            • Reporter: Kristian Waagan
            • Votes: 0
            • Watchers: 0