Type: New Feature
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Lucene Fields:New, Patch Available
A formula 1A token/ngram-based language detector. Requires a paragraph of text to avoid false positive classifications.
Depends on contrib/analyzers/ngrams for tokenization, Weka for classification (logistic support vector models) feature selection and normalization of token freuencies. Optionally Wikipedia and NekoHTML for training data harvesting.
Initialized like this:
Training set build from Wikipedia is the pages describing the home country of each registred language in the language to train. Above example pass this test:
(testEquals is the same as assertEquals, just not required. Only one of them fail, see comment.)
I don't know how well it works on lots of lanugages, but this fits my needs for now. I'll try do more work on considering the language trees when classifying.
It takes a bit of time and RAM to build the training data, so the patch contains a pre-compiled arff-file.
|Workflow||Default workflow, editable Closed status [ 12564562 ]||jira [ 12584968 ]|
|Workflow||jira [ 12398943 ]||Default workflow, editable Closed status [ 12564562 ]|
|Resolution||Won't Fix [ 2 ]|
|Status||Open [ 1 ]||Closed [ 6 ]|