To clarify, you meant the 1% is for rule-based syllable segmentation, correct?
Yes: it is unmodified, the same as before, but I did inspect it. It handles all the common structures but has no rules for the rarer cases mentioned in that study: syllable chaining, great sa, etc.
The rule-based syllable algorithm in Lucene is nearly perfect, and I'm satisfied with it. I'm also curious: where did you get the rules?
As I mentioned earlier, I created these informally almost 7 years ago. This is why I was eager to remove these rules: we know they are not perfect. They were created when Myanmar in Unicode was still changing rapidly, and I couldn't find any such formal algorithms at the time.
The rules are done in a "Unicode way": they really just use the base consonant and try to let Unicode properties take care of the rest (Word_Break=Extend, etc.). It is really not much more than this main part:
$Cons = [[:Other_Letter:]&[:Myanmar:]];
$Virama = [\u1039];
$Asat = [\u103A];
$ConsEx = $Cons ($Extend | $Format)*;
$AsatEx = $Cons $Asat ($Virama $ConsEx)? ($Extend | $Format)*;
$MyanmarSyllableEx = $ConsEx ($Virama $ConsEx)? ($AsatEx)*;
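To make the behavior of these three rules easier to follow, here is a rough Python sketch that mimics them with longest-match semantics. It is only an approximation and not the Lucene or ICU implementation: `is_cons` covers just the main Myanmar block (U+1000 to U+109F) instead of the full Myanmar script, `is_ext` stands in for `($Extend | $Format)` using general categories Mn, Mc, Me and Cf, and all helper names (`char`, `seq`, `opt`, `star`, `segment`) are made up for illustration. Each helper maps a set of start positions to the set of positions where a match can end, so an earlier greedy choice (such as `$Extend` swallowing the virama) cannot hide a longer match.

```python
import unicodedata

VIRAMA = "\u1039"
ASAT = "\u103A"

def is_cons(ch):
    # Assumption: approximates [[:Other_Letter:]&[:Myanmar:]] with the main block only.
    return "\u1000" <= ch <= "\u109F" and unicodedata.category(ch) == "Lo"

def is_ext(ch):
    # Assumption: rough stand-in for ($Extend | $Format).
    return unicodedata.category(ch) in ("Mn", "Mc", "Me", "Cf")

def char(pred):
    # Match exactly one character satisfying pred.
    def step(text, starts):
        return {i + 1 for i in starts if i < len(text) and pred(text[i])}
    return step

def seq(*parts):
    # Match each part in order.
    def step(text, starts):
        for part in parts:
            starts = part(text, starts)
        return starts
    return step

def opt(part):
    # Match part zero or one time.
    return lambda text, starts: starts | part(text, starts)

def star(part):
    # Match part zero or more times, collecting every reachable end position.
    def step(text, starts):
        seen = set(starts)
        frontier = set(starts)
        while frontier:
            frontier = part(text, frontier) - seen
            seen |= frontier
        return seen
    return step

virama = char(lambda ch: ch == VIRAMA)
asat = char(lambda ch: ch == ASAT)
ext_star = star(char(is_ext))

# $ConsEx = $Cons ($Extend | $Format)*;
cons_ex = seq(char(is_cons), ext_star)
# $AsatEx = $Cons $Asat ($Virama $ConsEx)? ($Extend | $Format)*;
asat_ex = seq(char(is_cons), asat, opt(seq(virama, cons_ex)), ext_star)
# $MyanmarSyllableEx = $ConsEx ($Virama $ConsEx)? ($AsatEx)*;
syllable = seq(cons_ex, opt(seq(virama, cons_ex)), star(asat_ex))

def segment(text):
    # Split text into longest-match syllables; anything the rules
    # do not cover falls back to a single-character piece.
    pieces, i = [], 0
    while i < len(text):
        ends = syllable(text, {i})
        j = max(ends) if ends else i + 1
        pieces.append(text[i:j])
        i = j
    return pieces

# "Myanmar" (U+1019 U+103C U+1014 U+103A U+1019 U+102C) splits into two syllables.
print(segment("\u1019\u103C\u1014\u103A\u1019\u102C"))
```

In this sketch a stacked pair such as ka + virama + ka stays together as one piece, because the optional `$Virama $ConsEx` branch yields a longer match than stopping after the virama.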
I didn't see the patch link though.
See the top of this issue: there is an Attachments section, underneath the Description section.