This patch adds an ignoreCaseChange option to WordDelimiterFilter. I have been using it, and it may be more generally useful.
For reference, can you clarify how this differs from generateWordParts?
I'm also wondering if the option shouldn't be inverted (i.e. delimitOnCaseChange) and defaulted to true. All of the existing options are "positive" in nature: they cause the filter to "do more" when true. The semantics of this option would be to "do less" when it's true, which may be a bit confusing for people.
I like splitOnCaseChange=false (default would be true)
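The difference from generateWordParts is as follows: gWP looks at adjacent subwords and checks whether they are both alpha, regardless of how they are delimited (case change or explicit delimiter). So if gWP=0, then PowerShot=power-shot=Power-Shot=powershot. If gWP=1 and ignoreCaseChange=1, then PowerShot=powershot, but Power-Shot=power-shot=power shot. In other words, ignoreCaseChange only stops the case change from acting as a delimiter; explicit delimiters such as "-" still split.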
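To make the PowerShot example concrete, here is a rough sketch of the splitting rule being discussed. It is a simplified model written for illustration, not the WordDelimiterFilter code: the function name `split_subwords` and its `split_on_case_change` parameter are hypothetical, it only handles ASCII letters and digits, and it ignores the other filter options (catenation, number parts, and so on).

```python
import re


def split_subwords(token, split_on_case_change=True):
    """Toy model of subword splitting (NOT the actual Solr implementation).

    Explicit delimiters always split. Lower-to-upper case transitions split
    only when split_on_case_change is True.
    """
    # Any run of non-alphanumeric characters (hyphen, space, ...) is a delimiter.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if split_on_case_change:
        expanded = []
        for part in parts:
            # Mark each lower-to-upper boundary with a space, then split on it.
            expanded.extend(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", part).split())
        parts = expanded
    # Lowercase so the variants can be compared directly.
    return [p.lower() for p in parts]


if __name__ == "__main__":
    for text in ("PowerShot", "Power-Shot", "power shot", "powershot"):
        print(text, split_subwords(text, True), split_subwords(text, False))
```

With case-change splitting turned off in this model, "PowerShot" stays a single subword while "Power-Shot" and "power shot" still split on their explicit delimiters, which is the distinction described above.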
For us, case changes were too "weak" a delimiter, and high idf subwords were inappropriately driving up relevancy on certain docs.
Just last week I considered adding a parameter to WDF to suppress splitting on case change; I lazed out by simply throwing a LowerCaseFilter in front of the WDF in my Analyzer chain, but at the time I was thinking that it would get me into trouble if I ever wanted to run the output of WDF into a case-dependent stopword or synonym table. So this is useful and should be committed... though I agree the parameter should be "positive" and I like Yonik's naming suggestion.
Do people think the example "text" fieldType should default to splitOnCaseChange="false"?
Many people use these fieldType definitions unchanged, until they run into a problem.
Might be a good idea. Case-based splitting is a relatively aggressive default (then again, I'd say the same about stemming).
I'll leave it out of the patch, and we can always change it later.
Committed in r545597 with inverted logic and Yonik's name suggestion.