Description
We hit an interesting bug in a case where datanucleus.identifierFactory = datanucleus2 .
The problem is that directSql handgenerates SQL strings assuming "datanucleus1" naming scheme. If a user has their metastore JDO managed by datanucleus.identifierFactory = datanucleus2 , the SQL strings we generate are incorrect.
One simple example of what this results in is the following: whenever DN persists a field which is held as a List<T>, it winds up storing each T as a separate line in the appropriate mapping table, and has a column called INTEGER_IDX, which holds the position in the list. Then, upon reading, it automatically reads all relevant lines with an ORDER BY INTEGER_IDX, which results in the list retaining its order. In DN2 naming scheme, the column is called IDX, instead of INTEGER_IDX. If the user has run appropriate metatool upgrade scripts, it is highly likely that they have both columns, INTEGER_IDX and IDX.
Whenever they use JDO, such as with all writes, it will then use the IDX field, and when they do any sort of optimized reads, such as through directSQL, it will ORDER BY INTEGER_IDX.
An immediate danger is seen when we consider that the schema of a table is stored as a List<FieldSchema> , and while IDX has 0,1,2,3,... , INTEGER_IDX will contain 0,0,0,0,... and thus, any attempt to describe the table or fetch schema for the table can come up mixed up in the table's native hashing order, rather than sorted by the index.
This can then result in schema ordering being different from the actual table. For eg:, if a user has a (a:int,b:string,c:string), a describe on this may return (c:string, a:int, b: string), and thus, queries which are inserting after selecting from another table can have ClassCastExceptions when trying to insert data in the wong order - this is how we discovered this bug. This problem, however, can be far worse, if there are no type problems - it is possible, for eg., that if a,b&c were all strings, that that insert query would succeed but mix up the order, which then results in user table data being mixed up. This has the potential to be very bad.
We should write a tool to help convert metastores that use "datanucleus2" to "datanucleus1"(more difficult, needs more one-time testing) or change directSql to support both(easier to code, but increases test-coverage matrix significantly and we should really then be testing against both schemes). But in the short term, we should disable directSql if we see that the identifierfactory is "datanucleus2"
Attachments
Attachments
Issue Links
- is cloned by
-
HIVE-11039 Write a tool to allow people with datanucelus.identifierFactory=datanucleus2 to migrate their metastore to datanucleus1 naming
- Open