Issue Details (XML | Word | Printable)

Key: DERBY-2336
Type: Sub-task Sub-task
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Mamta A. Satoor
Reporter: Mamta A. Satoor
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Derby
DERBY-1478

Enable collation based ordering for CHAR data type.

Created: 14/Feb/07 08:30 PM   Updated: 07/Jul/07 07:42 PM
Return to search
Component/s: SQL
Affects Version/s: 10.3.1.4
Fix Version/s: 10.3.1.4

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works DERBY_LocalFinder_CodeCleanup_diff_V01.txt 2007-03-02 04:46 PM Mamta A. Satoor 21 kB
Text File Licensed for inclusion in ASF works DERBY_LocalFinder_CodeCleanup_stat_V01.txt 2007-03-02 04:46 PM Mamta A. Satoor 0.8 kB
Issue Links:
Dependants
Incorporates
 

Resolution Date: 16/May/07 04:57 PM


 Description  « Hide
I am breaking down the Parent task DERBY-1478 (Add built in language based ordering and like processing to Derby) into multiple sub tasks. One of them is to concentrate on enabling the collation based ordering on (hopefully the simplest of all the character data types) CHAR data type. This task in itself might need subtasks if it is later found that it can be subdivided into multiple smaller steps.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Mamta A. Satoor added a comment - 02/Mar/07 04:46 PM
I am continuing to work on Language Based Ordering. One thing this feature requires is that character datatypes datatypes should have access to LocaleFinder to know what collation they need to use. Currently, this gets done by going through DatabaseContext (this code is in SQLChar's getLocalFinder method. For SQLTime, SQLTimestamp, SQLDate, the code is in setValue(String theValue) method). But the problem with this code is that during restart recovery time, the DatabaseContext will not be available and hence locale can't be determined at restart recovery time. In order to fix that problem, in my attached patch DERBY_LocalFinder_CodeCleanup_diff_V01.txt, I am removing the LocaleFinder related code from BasicDatabase and instead moving it into RawStore. This will make sure that during restart recovery time, LocaleFinder info will be available because store context is available on the context manager at that time.

Another change I am making in this patch is to have the ContextManager keep track of current LocaleFinder so that character datatypes could get LocalFinder information more quickly rather than having to go through the store context in ContextManager's stack of contexts.

In addition to the changes mentioned above, I have removed some imports that are not really used by the classes involved in this patch.

There are no visible end-user effects of any of the above mentioned changes but this groundwork will be required later on when the collation is enabled for SQLChar datatype.

svn stat -q output is as follows
M java\engine\org\apache\derby\impl\db\BasicDatabase.java
M java\engine\org\apache\derby\impl\store\raw\xact\XactFactory.java
M java\engine\org\apache\derby\impl\store\raw\RawStore.java
M java\engine\org\apache\derby\iapi\services\context\ContextManager.java
M java\engine\org\apache\derby\iapi\types\SQLChar.java
M java\engine\org\apache\derby\iapi\types\SQLDate.java
M java\engine\org\apache\derby\iapi\types\SQLTime.java
M java\engine\org\apache\derby\iapi\types\DataValueFactoryImpl.java
M java\engine\org\apache\derby\iapi\types\SQLTimestamp.java
M java\engine\org\apache\derby\iapi\db\Database.java
M java\engine\org\apache\derby\iapi\store\raw\RawStoreFactory.java
M java\engine\org\apache\derby\database\Database.java

The patch description per class wise is as follows
1)BasicDatabase -
a)removed unused imports
b)removed code which was implementing LocalFinder methods because this class is not implementing LocaleFinder anymore. The LocaleFinder implementation will be provided at the store level.
c)I didn't remove the getLocale method from this class becasue we have few tests that rely on this internal public Derby api. The use of this can be found in functionTests/conn/DefaultLocale.java's checkDatabaseLocale method. I will file a jira entry for this because I don't think our test should be going against internal Derby api.
2)RawStore
a)This class now implements LocaleFinder interface so later in XactFactory, RawStore instance can be passed to the ContextManager. This way, LocaleFinder info will be available at Derby's restart recovery time.
b)also removed some unused import statements.
3)XactFactory
The change in this class passes the RawStore instance to ContextManager's setLocaleFinder info.
4)ContextManager
This class now provides a getLocalFinder method to return the local info. Prior to this, the callers had to go through the stack of contexts to find the database context to get the LocaleFinder info.
5)SQLChar, SQLDate, SQLTime, SQLTimestamp, DataValueFactoryImpl
All of these classes now get the LocaleFinder info by directly calling the getLocaleFinder method on ContextManager.
6)iapi/db/Database.java
This interface doesn't extend LocaleFinder interface anymore. Also, took out some unused import statements.
7)RawStoreFactory now extends LocaleFinder interface. Also, took out some unused import statements.
8)database/Database.java
The existing getLocale method now can throw exception because it goes through ContextManager to get the locale info. I will file a jira entry for this method because the only purpose of this method is to allow some existing tests to get the locale info by calling this internal api.

Mike Matrigali added a comment - 02/Mar/07 06:37 PM
This patch seems a reasonable approach to make sure the information is available at the lowest level of the system and in an efficient manner. The collation info is definitely needed during restart recovery as logical undo of btree transactions may require traversing the tree and thus will need to make sure comparisons are done with the same collation as they were
created. Recovery runs with only the raw store context on the stack so the info needs to be available there.

Later when you get colllations actually working you should add a recovery scenario to make sure this gets tested. I'll think a little more about what should happen, but at least you should have an outstand transaction with insert/delete/updates that affect one or more character indexes (all with a collation setting in the db that is different from default one today). Make sure those log records get to the log (easiest is to either write a LOT of them or to have another xact commit), and then crash the server. Restart the server will then run through the code. The following are examples of using .sql tests to force restart
recover in old test harness - simply exits after doing work and then restarts on same db:
store/oc_rec1.java
store/oc_rec2.java
store/oc_rec3.java
store/oc_rec4.java

I am not sure how to handle this in junit tests, we have counted on crashing jvm providing the testing scenario we need.

Looks good as first step checkin, assuming tests pass I say you should go ahead and check in.

Daniel John Debrunner added a comment - 02/Mar/07 07:10 PM
Isn't the simple change for ContextManager sufficient (addition of getLocaleFinder())?
I don't see why you need to change the remaining code. Especially when in the storeless engine case there won't be a raw-store but there will be a database.

Mamta A. Satoor added a comment - 02/Mar/07 07:10 PM
Mike, thanks for such a quick review. I will plan on committing this soon.

BTW, I forgot to mention when I submitted the patch that the tests ran fine with my changes on Windows XP with Sun's jdk 1.4

Mamta A. Satoor added a comment - 02/Mar/07 07:22 PM
Dan, you are probably right because it doesn't matter who sets the LocaleFinder in the ContextManager (ie store or database) as long as it is available at the recovery time. I will rerun the existing tests with that change and see what happens.

Mamta A. Satoor added a comment - 03/Mar/07 06:21 AM
Dan, actually thinking more about it, I think it is not enough to make the simple change of adding getLocaleFinder() method to ContextManager, because it means that we are still relying on Database Context to set the LocaleFinder information on the ContextManager. At recovery time(if I understand the restart recovery logic correctly), there will be no DatabaseContext and hence there will be no way to put the correct LocaleFinder info into ContextManager through the DatabaseContext. But store context is available at recovery time and hence we can have the store context do the LocaleFinder setting in the ContextManager.

If we do go with store context approach, I recognize that there does exist a problem for storeless engine case that you mentioned because there is no store context in such a case and hence LocaleFinder info will never get set into ContextManager.

So, in order to solve the 2 problems ie 1)storeless engine and 2)the recovery time case, maybe we can duplicate the code for setting the LocaleFinder in both DatabaseContext and Store context so we are covered in both cases. Code duplicatiion is generally not a good idea but this can be an intermediate solution for now. Maybe it can be tackled as a separate Jira issue.

Daniel John Debrunner added a comment - 03/Mar/07 05:36 PM
Duplicate code is a bad idea, in this case the locale is a property of the database, thus the current location (BasicDatabase) to set the code is correct.

Maybe you could write up some quick design notes of how the data types will be accessing the correct locale information. I just posted in another e-mail that I think the current code that exists is taking the wrong approach, and now would be a god time to come up with a clean design.

Mike Matrigali added a comment - 06/Mar/07 06:44 PM
I don't think BasicDatabase is available to the raw store restart recovery code, which needs the collation information at a very low level
of the bootstrap recovery loop. Do you think this should booted to the context stack before log recovery starts?
So I think we are in a situation where we need the info set in one case for a bottom stack of the store
that does not have the upper half of the code running and in the storeless case we want it set when store does not exist. To me it looks
like either duplicate code or 2 different ways to set the same thing.


It would be nice if there was a solution that got us incrementally to the "clean design" you envision - while letting mamta get some
incremental steps in toward a locale based sort. I think some of the problems you would like to see fixed are not involved with sorting
but some of the other datatype interaction with locale which I don't think is part of this JIRA issue.

Daniel John Debrunner added a comment - 06/Mar/07 07:47 PM
Some of my thoughts in this area are biased by the existing implementation of (not-exposed) national character types, which was originally done quickly and in my opinion rather poorly. Typically in software engineering the second re-write is an improvement on the first, it seems there is an opportunity here to make a second re-write better rather than repeat the mistakes of the first.

Some of the mistakes of the first are:
  - using the context to obtain the locale (even commented as a HACK)
  - pushing the locale ordering code into the base class (SQLChar) so that non-locale based ordering has to deal with additional overhead
  - basing the ordering on a locale and not a Collator

I think there is also plenty of opportunity to make incremental progress here. Steps I could see happening are:

  1) Create the new character data value classes that perform ordering based upon a Collator. Have the collator be a field
        in the class and hard-code it to some language for now (say Norwegian :-)
  2) make those classes the ones used when locale based ordering is required.
  3) write some tests that ensure the order does change when the collation property is set on database creation
     (can be converted into real tests later by using a Norwegian database)
   4) Delete the old national character types and remove their ovverhead from the USC_BASIC ordering classes (SQLChar etc.)

   5) Figure out how to set the real Collator object for a DataValueDescriptor during runtime and recovery.

Of course step 5) doesn't have to be done last, it's independent of steps 1-4

   6) Write more tests for other languages.

I agree that some of the locale issues with data types are being confused here, Mamta found a valid bug where conversion of string to
a date time value is not being handled correctly. That is a separate issue, but it's being confused because the discussion so far has
not really described the actual problems, it's focussed on getting the locale based upon the old national character types code.
The real problems here are:

   1) How to get the correct Collator object for character comparisions
   2) How to get the correct object to parse date-time values from Strings

1) Is just locale based for this issue, but there is the chance to have a framework that works with more than locale based ordering,
such as case-insensitive ordering. Focusing on the locale increases the likelyhood that the solution will not be expandable to other types.
E.g. if we duplicate code for locale in the store, do we need to duplicate code again for case-insensitive searches, and then again for
another Collator style?

2) is easier because it doesn't need to worry about recovery and is more closely related to the solution used for the Calendar object.


Mike Matrigali added a comment - 06/Mar/07 08:09 PM
Note the only duplicate code I am talking about is the code that somehow gets the collation value for the db and puts it somewhere so that the datatype code can get it. So in the current "hack" this code pushes it into the context so that it
can then be read from the context. Moving it out of the context is fine, but I am not sure if putting it into the BasicDatabase
works for restart recovery. If we take it out of the context where do we put it so that it is available to restart recovery and
to storeless engine?

I expect all the code that actually compares the datatypes to be in one place, in the datatype code.

I may have missed the thread, do you have an opinion on the best way for the datatype code to get the collation setting?
Seems like there are a few options:
1) from context (seen as a hack)
2) included as parameter to all calls that need it. I think this will result in an extra parameter pushed for every call - may
     slow down non-locale sorting.
3) somehow passed into datatype when it gets instantiated. Note store recovery needs to create these things on the fly
     there is no compilation time for code to create these datatypes.
4) something else?


Daniel John Debrunner added a comment - 06/Mar/07 10:11 PM
I'm not sure of a solution yet, but I think that with this discussion we are getting closer to understanding the problem and hence exploring solutions.

"Store recovery needs to create these things on the fly"

What exactly does this mean? How does store know to create a SQLInteger for a column in a table during recovery instead of creating a SQLVarchar?
What other meta-data is stored at the store level and available for use during recovery?

Mamta A. Satoor added a comment - 06/Mar/07 10:39 PM
Thanks for your time on this Jira entry, Dan and Mike. Has been very helpful.

I talked to Dan offline on this issue and just wanted to summarize the conversation here.

One of the goals for this Jira entry is not to burden the existing SQLChar datatype with collation information because they work greate for the deault ordering which is UCS_BASIC. For the territory based collation, we should have new subclass(es) of existing char datatypes which will use the territory based collation to do the sorting. One of the tasks identified for the new subclasses is how to get the required collation information to them so that their sorting will be based on territory. I am going to tackle this in incremental step by may be hardcoding some kind of collation in new char datatypes and make sure that it works fine. The next task would be to get the collation infromation correctly from the compile phase into the new data types instances.

The next task is to see how recovery can be implemented so that the new datatypes will have the correction collation associated with them when they are instantiated during recovery. One way, which is not a practical solution, is to have these new datatypes write their collation information onto the disk. But doing so will mean that same collation information will be duplicated in all the instances of these new datatypes and that is not very efficient. Ideal solution would be to have this collation informaiton available at table level or databse level and use that information at recovery time to create the correct collation for the new char datatypes.

As a first step, I am going to concentrate on creating subclass of SQLChar which will have a specific collation hardcoded into it and have the compiler use those new subclasses.

If anyone has any comments, plese share them.

Mike Matrigali added a comment - 06/Mar/07 11:42 PM
The reality is that there isn't really "recovery" code involved. It is just store code that happens to be running at recovery time (one of the
good things about derby is that redo recovery path is really the same path through the code as forward running recovery). The problem
comes in that with the old system of pulling the collation info off the context stack is that the setup of the info onto the context stack can't
be from some code "above" store as at recovery time there is no code "above" store.

To answer dan's question about the meta data:
Store currently maintains an array of format id's per table, one for each table/index. Many of the store interfaces the store never needs to
create an object to read or write data to/from disk. Most of the interfaces include an empty or a full object passed in from the caller. But
in some cases store goes ahead and creates empty objects using format id's. The Monitor currently provides some interfaces to
generate empty datatype objects based on format id's.

For example the code in
java/engine/org/apache/derby/impl/store/access/conglomerate/TemplateRow.java uses the following to generate an object of the correct
type given a format id:
 Monitor.newInstanceFromIdentifier(format_ids[i])

So if we want to "pass" in the info into the datatype, rather than have the datatype pull it from a context then one option would be to
create something like Monitor.newInstanceFromIdentifier(format_id, collation_info); And similar changes to any other Monitor supported
interface to create a datatype. Then the problem just turns into getting the correct info into the various monitor calls. Above the store
maybe the compiler could generate 1 arg calls when it knew there was no collation info. In the store I think it would be least ugly to
always generate 2 arg calls, but it is extra overhead for the non-collation path.


Mike Matrigali added a comment - 07/Mar/07 12:01 AM
Logically it seems like at boot time what we want is to set a static variable in one or a few datatypes (the new datatypes that mamta just described above - I assume one below CHAR, VARCHAR, and CLOB) to store the collation info and then
that info would be efficiently available always to the those datatypes. The compiler and the format id's make it so that the extra path code
only happens in the new datatypes. It seems like anything that requires extra passed info on every
datatype creation is unnecessary overhead, until we really want to support different collations at less than a db unit (ie. datavalue, columns,
and/or table).

 Is there anyway we could get the datatype system to support something like that?

Daniel John Debrunner added a comment - 07/Mar/07 12:51 AM
On the metadata, I think the store stores more information that just the format identifiers for an index.
The class BTree has a number of fields that seem to be metadata for the index including the ascending/descending status for each column.
How flexible is that metadata? can additional information be added without requiring an upgrade of previous versions, e.g. if information
about column collation was added?

And does each index/table have a set of properties associated with it that could be used to store metadata?

Mike Matrigali added a comment - 07/Mar/07 01:43 AM
Currently both heap and btree has the 1st row on page 0 as a free format "metadata" row. There will need to be upgrade support but it can be handled "on-the-fly" depending on what new info you want to add (ie. basically it is easy if there is a workable default the missing data for old tables - so for instance in this discussion existing tables would somehow be set with the UCS_BASIC collation).
By on the fly I mean automatically when the table is accessed rather than as part of the boot upgrade process.

There is actually some dead code in java/engine/org/apache/derby/impl/store/access/btree/index/B2I.java :
(writeExternal_v36 and readExternal_v36) showing how we handled a pre-derby upgrade to this metadata when we added
the storing of ascending/descending information.
 I'll look at getting rid of the dead code.

The cost is disk space in page 0 which is the root page and thus a level is added to the btree sooner
The limitation is that all the info in the metadata has to fit on the page. So for instance I would not like to add a possibly long
string for every column.

Since the info we are setting is really one piece of info per db, setting it once per table or worse once per column seems overkill. We
may want to do that in the future, and we have the infrastructure to do it when we need to - but should we do it now?

So we could do this but I still wonder about getting this info and packaging it up for every template call to the datatype. Note that these
template creations don't just happen for recovery. They happen during other normal forward processing operations. Some examples
are that we need a scratch template to do a search on a btree, we need some scratch templates to do btree splits, we create some
templates for bulk operations from heap and btree (like hash scan result sets).

Mike Matrigali added a comment - 07/Mar/07 01:53 AM
I may have mispoke in the last comment, I am not sure if there is a limit on the length of the metadata. There is support for rows spanning pages and the metadata is just formatted up to look like a row and inserted using the standard row insert logic.

Daniel John Debrunner added a comment - 07/Mar/07 02:29 AM
I'm just looking at options and as you say at the moment the need is one per database, but at some point it will be per-something else, maybe schema, maybe column, maybe index column. I think it's worth looking at possibilities now to see if general solution can exist.

Some of the issue is that currently a single format identifier and hence single object specifies the data type and partially specifies the comparision logic.
For ascending/descending the format id does not specifiy the comparision logic, I guess instead the performance hit is taken on every compare if the comparision is descending and thus the comparison value needs to be switched.

Java took the path of separating the two, e.g. SortedMap objects use a Comparator interface.

One option is to take that path and make sure enough meta-data exists to create a correct Comparator like object.
This would change the logic for comparisions to be:

   comparator.compare(v1, v2)

instead of today's

   v1.compare(v2)

The comparator does look more expensive but the ascending/descending logic could be in the comparator, thus removing that check for the typical ascending case.

Daniel John Debrunner added a comment - 07/Mar/07 04:10 AM
I think I have a good interim solution, move all use of locale related to data types into DataValueFactory.

  - Database (BasicDatabase) continues to own the locale of the database

  - BasicDatabase after booting DataValueFactory calls a new method on DataValueFactory
              void setLocale(Locale)
  - Store uses DataValueFactory to get DataValueDescriptors from a format identifier instead of through the monitor (requires a new method on DataValueFactory like)
              DataValueDescriptor getNullValue(int format_id)

  - DataValueFactory does whatever it needs with Locale, e.g. has a single RuleBasedCollator object for the locale and passes that into the new SQLChar variants that use a Collator for comparision.

  - LocaleFinder as a class can probably go away, though that could be delayed until the new SQL Char variants exist and the old national character types have been removed. I think its only use is the old style national character collation (and was driven by the hack in gettting the locale from a context) which will not be needed with the new code.

This seems like a logical breakdown of responsibility, data type stuff in the data value factory :-)

While it doesn't handle the future issues of different collations per schema etc. I think it can be expanded to handle it e.g. a new set of format identifiers for collators like case-insensitive. At least the code is self-contained in a logical place.


Mike Matrigali added a comment - 07/Mar/07 06:09 PM
Dan's solution seems good - let the datatype system encapsulate everything.

The details I don't understand are:
1) How does Database (BasicDatabase) get set up in the runtime recovery case, we need the collation stuff before the database has been
     recovered - I think in the current system it only gets initialized later. Or in recovery case is store making the DatavalueFactory.setLocale()
     call?
2) Is it ok for store to boot DataValueFactory, before the database has been recovered?
      I think this is ok, that a lot of work has been done to move toward this, are we there now?

Daniel John Debrunner added a comment - 07/Mar/07 06:27 PM
For a database BasicDatabase is the code that boots the data value factory and then the store.
So I'd assumed that data value factory is already available to the booting store, in recovery mode.

A similar problem already exists where the correct data value factory has to be booted before recovery so that the correct implementation of the DECIMAL type can be loaded for recovery. See the comments in BasicDatabase around line 174.


Mamta A. Satoor added a comment - 07/Mar/07 07:39 PM
One of the subtasks for DERBY-2336 is to define new data type on top of SQLChar and this new data type will have collation based on territory. This work will be done as part of DERBY-2416.