• Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Lucene Fields:
      New, Patch Available


      A formula 1A token/ngram-based language detector. Requires a paragraph of text to avoid false positive classifications.

      Depends on contrib/analyzers/ngrams for tokenization and on Weka for classification (logistic support vector models), feature selection, and normalization of token frequencies. Optionally uses Wikipedia and NekoHTML for harvesting training data.
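
To illustrate what "ngram tokens with normalized frequencies" can mean, here is a minimal sketch in plain Java. It does not use the contrib/analyzers/ngrams tokenizer or Weka; the class `NgramProfile` and its method are hypothetical names introduced only for this example, and dividing each count by the total number of n-grams is an assumed normalization, not necessarily the one the patch applies.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the patch: counts character n-grams
// and normalizes the counts to relative frequencies.
final class NgramProfile {

    /** Maps each n-gram of length n in text to (its count / total number of n-grams). */
    static Map<String, Double> relativeFrequencies(String text, int n) {
        String normalized = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        // Slide a window of width n over the text, one character at a time.
        for (int i = 0; i + n <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + n), 1, Integer::sum);
            total++;
        }
        Map<String, Double> frequencies = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // "banana" has five bigrams: ba, an, na, an, na
        System.out.println(relativeFrequencies("banana", 2));
    }
}
```

Short inputs produce only a handful of n-grams, so their frequency profile is noisy, which fits the note above that a paragraph of text is needed to avoid false positives.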

      Initialized like this:

          LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
          root.addBranch("fino-ugric", "uralic");
          root.addBranch("ugric", "uralic");
          root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
          root.addBranch("proto-indo european");
          root.addBranch("germanic", "proto-indo european");
          root.addBranch("northern germanic", "germanic");
          root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
          root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
          root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
          root.addBranch("west germanic", "germanic");
          root.addLanguage("west germanic", "eng", "english", "en", "UK");
          LanguageClassifier classifier = new LanguageClassifier(root);
          if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
            classifier.compileTrainingData(); // from wikipedia
          }
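
The issue does not show how LanguageRoot stores its branches, so the following is only a rough sketch of one way such a tree could be kept. `LanguageTreeSketch`, `pathTo` and `commonBranch` are hypothetical names; only the call shapes of `addBranch` and `addLanguage` are borrowed from the snippet above, with `addLanguage` trimmed to two arguments. Creating unknown parents on demand is an assumption, made because "uralic" is used as a parent above without being added first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the patch's LanguageRoot; the storage layout is assumed.
final class LanguageTreeSketch {
    // branch name -> parent branch name; top-level branches map to null
    private final Map<String, String> parentOf = new HashMap<>();
    // three-letter language code -> branch that holds the language
    private final Map<String, String> branchOfLanguage = new HashMap<>();

    void addBranch(String name) {
        if (!parentOf.containsKey(name)) {
            parentOf.put(name, null);
        }
    }

    void addBranch(String name, String parent) {
        addBranch(parent); // assumption: unknown parents are created on demand
        parentOf.put(name, parent);
    }

    void addLanguage(String branch, String iso3) {
        addBranch(branch);
        branchOfLanguage.put(iso3, branch);
    }

    /** Branch names from the top of the tree down to the language's own branch. */
    List<String> pathTo(String iso3) {
        List<String> path = new ArrayList<>();
        for (String b = branchOfLanguage.get(iso3); b != null; b = parentOf.get(b)) {
            path.add(b);
        }
        Collections.reverse(path);
        return path;
    }

    /** Deepest branch shared by two languages, or null if they share none. */
    String commonBranch(String isoA, String isoB) {
        List<String> a = pathTo(isoA);
        List<String> b = pathTo(isoB);
        String shared = null;
        for (int i = 0; i < Math.min(a.size(), b.size()) && a.get(i).equals(b.get(i)); i++) {
            shared = a.get(i);
        }
        return shared;
    }

    public static void main(String[] args) {
        LanguageTreeSketch tree = new LanguageTreeSketch();
        tree.addBranch("fino-ugric", "uralic");
        tree.addLanguage("fino-ugric", "fin");
        tree.addBranch("proto-indo european");
        tree.addBranch("germanic", "proto-indo european");
        tree.addBranch("northern germanic", "germanic");
        tree.addBranch("west germanic", "germanic");
        tree.addLanguage("northern germanic", "dan");
        tree.addLanguage("northern germanic", "nor");
        tree.addLanguage("west germanic", "eng");
        System.out.println(tree.commonBranch("dan", "nor")); // northern germanic
        System.out.println(tree.commonBranch("dan", "eng")); // germanic
        System.out.println(tree.commonBranch("fin", "eng")); // null
    }
}
```

A lookup like `commonBranch` is one conceivable way a classifier could take the tree into account, for example by reporting the shared branch when two closely related languages score almost equally. Nothing in the issue says the patch does this.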

      The training set built from Wikipedia consists of the pages describing the home country of each registered language, written in the language being trained. The example above passes this test:

      (testEquals is the same as assertEquals, just not required. Only one of them fails; see the comment.)

          assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
          testEquals("swe", classifier.classify(norway_in_swedish).getISO());
          testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
          testEquals("swe", classifier.classify(finland_in_swedish).getISO());
          testEquals("swe", classifier.classify(uk_in_swedish).getISO());
          testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
          assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
          testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
          testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
          testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
          testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
          testEquals("fin", classifier.classify(norway_in_finnish).getISO());
          testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
          assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
          testEquals("fin", classifier.classify(uk_in_finnish).getISO());
          testEquals("dan", classifier.classify(sweden_in_danish).getISO());
          // It is OK that this fails: dan and nor are very similar, and the document about Norway in Danish is very small.
          testEquals("dan", classifier.classify(norway_in_danish).getISO());
          assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
          testEquals("dan", classifier.classify(finland_in_danish).getISO());
          testEquals("dan", classifier.classify(uk_in_danish).getISO());
          testEquals("eng", classifier.classify(sweden_in_english).getISO());
          testEquals("eng", classifier.classify(norway_in_english).getISO());
          testEquals("eng", classifier.classify(denmark_in_english).getISO());
          testEquals("eng", classifier.classify(finland_in_english).getISO());
          assertEquals("eng", classifier.classify(uk_in_english).getISO());
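
The patch's own testEquals is not shown in the issue. As a rough sketch of what a "not required" equality check could look like, the hypothetical class `SoftChecks` below records mismatches instead of throwing, so one expected miss (such as the dan/nor case) does not abort the remaining checks. The class name, the failure message format and the `failures()` accessor are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of a non-fatal equality check; not the patch's implementation.
final class SoftChecks {
    private final List<String> failures = new ArrayList<>();

    /** Records a mismatch instead of throwing; returns true when the values are equal. */
    boolean testEquals(Object expected, Object actual) {
        boolean equal = Objects.equals(expected, actual);
        if (!equal) {
            failures.add("expected <" + expected + "> but was <" + actual + ">");
        }
        return equal;
    }

    /** All mismatches recorded so far, in the order they occurred. */
    List<String> failures() {
        return failures;
    }

    public static void main(String[] args) {
        SoftChecks checks = new SoftChecks();
        checks.testEquals("dan", "nor"); // recorded, does not abort the run
        checks.testEquals("swe", "swe"); // equal, nothing recorded
        System.out.println(checks.failures());
    }
}
```

The hard assertEquals calls in the test above would stay as they are; only the optional checks would go through a helper of this kind.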

      I don't know how well it works on lots of languages, but this fits my needs for now. I'll try to do more work on considering the language trees when classifying.

      It takes a bit of time and RAM to build the training data, so the patch contains a pre-compiled arff-file.

      1. ld.tar.gz
        7.73 MB
        Karl Wettin
      2. ld.tar.gz
        500 kB
        Karl Wettin


        Karl Wettin made changes -
        Resolution Won't Fix [ 2 ]
        Status Open [ 1 ] Closed [ 6 ]
        Karl Wettin made changes -
        Attachment ld.tar.gz [ 12352906 ]
        Karl Wettin made changes -
        Field Original Value New Value
        Attachment ld.tar.gz [ 12352807 ]
        Karl Wettin created issue -


          • Assignee:
            Karl Wettin
            Karl Wettin
          • Votes:
            0
          • Watchers:
            2


            • Created: