Type: New Feature
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Lucene Fields: New, Patch Available
A token/ngram-based language detector. Requires a paragraph of text to avoid false positive classifications.
Depends on contrib/analyzers/ngrams for tokenization and on Weka for classification (logistic support vector models), feature selection, and normalization of token frequencies. Optionally uses Wikipedia and NekoHTML for training-data harvesting.
Initialized like this:
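The original initialization snippet is not preserved in this excerpt. As a stand-in, here is a minimal self-contained sketch of the character-ngram idea the detector is built on; every class and method name below is hypothetical, illustrating the technique rather than the patch's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of ngram-profile language detection -- NOT the
// patch's API. Trains a per-language trigram frequency profile and
// classifies by smallest summed frequency difference.
public class NGramSketch {

    // language -> (trigram -> relative frequency)
    static Map<String, Map<String, Double>> profiles = new HashMap<>();

    // Count character trigrams and normalize to relative frequencies.
    static Map<String, Double> profile(String text) {
        Map<String, Double> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1.0, Double::sum);
        }
        double total = counts.values().stream().mapToDouble(Double::doubleValue).sum();
        counts.replaceAll((k, v) -> v / total);
        return counts;
    }

    static void train(String lang, String sample) {
        profiles.put(lang, profile(sample));
    }

    // Pick the trained language whose profile is closest to the input's
    // profile (simplified L1 distance over the input's trigrams).
    static String classify(String text) {
        Map<String, Double> p = profile(text);
        String best = null;
        double bestDist = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double dist = 0;
            for (Map.Entry<String, Double> t : p.entrySet()) {
                dist += Math.abs(t.getValue() - e.getValue().getOrDefault(t.getKey(), 0.0));
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        train("en", "the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        train("sv", "den snabba bruna raven hoppar over den lata hunden och katten satt");
        System.out.println(classify("the dog and the cat are lazy"));
    }
}
```

The real patch feeds Weka-selected, normalized token frequencies into a logistic SVM instead of this raw distance comparison, which is why it needs a paragraph of text rather than a short phrase.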
The training set built from Wikipedia consists of the pages describing the home country of each registered language, written in the language to be trained. The above example passes this test:
(testEquals is the same as assertEquals, just not required. Only one of them fails; see the comment.)
I don't know how well it works with lots of languages, but it fits my needs for now. I'll try to do more work on considering the language trees when classifying.
It takes a bit of time and RAM to build the training data, so the patch contains a pre-compiled ARFF file.
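For readers unfamiliar with the format: ARFF is Weka's plain-text dataset format, so the pre-compiled file has this general shape (the attribute names and values below are made up for illustration; the actual file in the patch will differ):

```
@relation language-detection

@attribute freq_the numeric
@attribute freq_och numeric
@attribute language {en,sv}

@data
0.042,0.0,en
0.0,0.031,sv
```

Each @attribute line declares one normalized token-frequency feature, and the class attribute lists the trained languages.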