Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
v2.3.2
Description
Hi, recently, I've met a problem in our product environment: Segments failed to merge because TrieDictionaryForest was disordered
java.lang.IllegalStateException: Invalid input data. Unordered data cannot be split into multi trees at org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:92) at org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:78) at org.apache.kylin.dict.DictionaryGenerator$StringTrieDictForestBuilder.addValue(DictionaryGenerator.java:214) at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:81) at org.apache.kylin.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:65) at org.apache.kylin.dict.DictionaryGenerator.mergeDictionaries(DictionaryGenerator.java:106)
After some analysis, we found out when there is large values in a dict-encoded column, iterating over a single TrieDictionaryTree will get unordered data.
Digging into the source code, the root cause is as described:
- Kylin will split a TrieTree Node into two parts when a single nodes's value length is more than 255 bytes
- Then, these tow parts of value will be added to build the TrieTree. In fact the splitted two parts should not be used as new values to add to the TrieTree.
- Step 2 will cause the TrieDictionaryTree build more leave nodes,and the extra leaf nodes will be 'end-value' of dictionary tree;
- It has no impact to the correctness of the dict tree itself, except for adding some additional nodes .
- But If you spit a UTF-8 word, you will get unordered data when iterating over the tree ( Something todo with Java UTF-8 String Serialize/Deserialize implementations. Please Refer to JDK sun.nio.cs.UTF_8.class)
How to re-produce ? Run test code :
TrieDictionaryForestBuilder builder = new TrieDictionaryForestBuilder(new StringBytesConverter()); String longUrl = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx你好~~~"; builder.addValue(longUrl); TrieDictionaryForest<String> dict = builder.build(); TrieDictionaryForestBuilder mergeBuild = new TrieDictionaryForestBuilder(new StringBytesConverter()); for (int i = dict.getMinId(); i <= dict.getMaxId(); i++) { String str = dict.getValueFromId(i); System.out.println("add value into merge tree"); mergeBuild.addValue(str); } The log output of this test code is: add value into merge tree add value into merge tree 16:59:36 [main] INFO org.apache.kylin.dict.TrieDictionaryForestBuilder.addValue(TrieDictionaryForestBuilder.java:127) values not in ascending order, previous 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xEF\xBF\xBD', current 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xE4\xBD\xA0\xE5\xA5\xBD~~~'
We can see from the test code's output:
- We only add 1 value but the tire dictionary tree turn out to have 2 end vlaues
- Iterating over the TrieDictionary Tree got unordered data
We address this problem by
- classify values which is a whole column value, which is splitted value,
- not mark splitted value as end-value of a TrieTree Node.
I wonder if there is something wrong, thanks.