I've been experimenting with the idea outlined above and I thought I should share some very early results.
The improvement here is basically to give the compound-splitting heuristic a better ability to split compounds that contain unknown words. Experiments I've run using our compound-splitting test cases suggest that the effect is indeed positive. The improved heuristic handles some of the test cases we couldn't do earlier, but all of this requires further experimentation and validation.
I've been able to segment トートバッグ (tote bag, with トート being unknown) and also ショルダーバッグ (shoulder bag) as you would like with some weight tweaks, but the same tweaks also segmented エンジニアリング (engineering) into エンジニア (engineer) + リング (ring).
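To make the trade-off concrete, here's a minimal sketch of the kind of cost comparison involved, assuming a MeCab-style lattice where a path's cost is the sum of word costs plus a connection cost per boundary. All of the numbers (the per-character unknown-word cost, the dictionary word costs, the connection cost) are invented for illustration and are not the real Kuromoji values; the point is only that weights loose enough to prefer トート + バッグ can also make エンジニア + リング cheaper than the single token.

{code:java}
// Illustrative only: all costs below are invented, not taken from the real model.
public class CompoundSplitCostSketch {

    /** Total path cost = sum of word costs plus one connection cost per boundary. */
    static int pathCost(int connectionCost, int... wordCosts) {
        int sum = 0;
        for (int c : wordCosts) {
            sum += c;
        }
        return sum + connectionCost * (wordCosts.length - 1);
    }

    public static void main(String[] args) {
        int connectionCost = 300;   // hypothetical boundary penalty
        int unknownPerChar = 1000;  // hypothetical per-character unknown-word cost

        // トートバッグ: whole run unknown vs. unknown トート + known バッグ.
        int wholeToteBag = pathCost(connectionCost, 6 * unknownPerChar);
        int splitToteBag = pathCost(connectionCost, 3 * unknownPerChar, 2000 /* バッグ */);

        // エンジニアリング: known whole word vs. known エンジニア + known リング.
        int wholeEngineering = pathCost(connectionCost, 4500 /* エンジニアリング */);
        int splitEngineering = pathCost(connectionCost, 2000 /* エンジニア */, 1500 /* リング */);

        System.out.println("トートバッグ split preferred: " + (splitToteBag < wholeToteBag));             // true (desired)
        System.out.println("エンジニアリング split preferred: " + (splitEngineering < wholeEngineering)); // true (undesired)
    }
}
{code}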
It might be possible to tune this further or develop a more advanced heuristic that remedies this, but I haven't had a chance to look into it yet. Any change here would also require extensive testing and validation; see the evaluation attached to LUCENE-3726 that was done on Wikipedia for search mode.
Please note that there will not be time to provide improvements here for 3.6, but we can follow up on katakana segmentation for 4.0.
With the above idea for katakana in mind, I'm thinking we can skip emitting katakana words that start with ン, ッ, or ー, since we don't want tokens that start with these characters. If this works well, we could consider adding it as an option to the tokenizer.
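As a rough illustration of the check I have in mind (a standalone helper only; how and where this would hook into the tokenizer's unknown-word handling is still open):

{code:java}
// Sketch of the proposed check: skip katakana candidates that start with a
// character we never want at the beginning of a token (ン, ッ, ー).
public final class KatakanaStartCheck {

    private static final char[] BAD_START = { 'ン', 'ッ', 'ー' };

    /** Returns true if this katakana candidate should be emitted. */
    public static boolean shouldEmit(CharSequence term) {
        if (term.length() == 0) {
            return false;
        }
        char first = term.charAt(0);
        for (char bad : BAD_START) {
            if (first == bad) {
                return false;
            }
        }
        return true;
    }
}
{code}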
Having said this, there are real limits to what we can achieve by hacking the statistical model (and it also affects our karma, you know...). The approach above also has a performance and memory impact. We'd need to introduce a fairly short limit on how long unknown words can be, and this could perhaps apply only to unknown katakana words. The length restriction would still be big enough not to have any practical impact on segmentation, though.
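For the length limit, something along these lines is what I'm picturing; the cap of 64 is a placeholder rather than a tested value, and it would only apply when enumerating unknown-word candidates over katakana runs:

{code:java}
// Sketch: when enumerating unknown-word candidates over a katakana run,
// cap the candidate length so the lattice doesn't blow up in size.
// MAX_UNKNOWN_KATAKANA_LEN is a placeholder, not a tested setting.
import java.util.ArrayList;
import java.util.List;

public class UnknownKatakanaCandidates {

    static final int MAX_UNKNOWN_KATAKANA_LEN = 64;

    /** Enumerate unknown-word candidate surfaces starting at {@code offset}. */
    static List<String> candidates(String katakanaRun, int offset) {
        List<String> result = new ArrayList<>();
        int maxLen = Math.min(katakanaRun.length() - offset, MAX_UNKNOWN_KATAKANA_LEN);
        for (int len = 1; len <= maxLen; len++) {
            result.add(katakanaRun.substring(offset, offset + len));
        }
        return result;
    }
}
{code}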
An alternative approach to all of this is to build some lexical assets. I think we'd get pretty far for katakana if we apply some of the corpus-based compound-splitting algorithms European NLP researchers have developed. Some of these algorithms are pretty simple and quite effective.
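One classic example is Koehn & Knight-style frequency-based splitting: split a word into parts that are attested in the corpus when the geometric mean of the parts' frequencies beats the frequency of the whole word. A toy sketch of that idea, with an invented frequency table since I haven't run this against any katakana corpus:

{code:java}
// Toy sketch of frequency-based compound splitting in the style of
// Koehn & Knight (2003). The frequency table in main() is invented.
import java.util.*;

public class FrequencyCompoundSplitter {

    private final Map<String, Integer> freq;

    FrequencyCompoundSplitter(Map<String, Integer> freq) {
        this.freq = freq;
    }

    /** Try a single binary split; return the parts, or the whole word if no split wins. */
    List<String> split(String word) {
        double bestScore = freq.getOrDefault(word, 0);
        List<String> best = Collections.singletonList(word);
        for (int i = 1; i < word.length(); i++) {
            String left = word.substring(0, i);
            String right = word.substring(i);
            Integer fl = freq.get(left);
            Integer fr = freq.get(right);
            if (fl == null || fr == null) {
                continue; // both parts must be attested in the corpus
            }
            double score = Math.sqrt((double) fl * fr); // geometric mean of the two parts
            if (score > bestScore) {
                bestScore = score;
                best = Arrays.asList(left, right);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        freq.put("ショルダー", 400);
        freq.put("バッグ", 900);
        freq.put("ショルダーバッグ", 50);
        freq.put("エンジニアリング", 800);
        freq.put("エンジニア", 600);
        freq.put("リング", 300);

        FrequencyCompoundSplitter splitter = new FrequencyCompoundSplitter(freq);
        System.out.println(splitter.split("ショルダーバッグ"));   // [ショルダー, バッグ]
        System.out.println(splitter.split("エンジニアリング"));   // [エンジニアリング]
    }
}
{code}

One nice property of this kind of approach is that it can be used offline to build a compound-splitting dictionary (lexical assets) rather than being wired into the tokenizer's runtime model.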