Details

    • New Feature
    • Status: Closed
    • Major
    • Resolution: Won't Fix
    • New, Patch Available

    Description

      A formula 1A token/ngram-based language detector. Requires a paragraph of text to avoid false positive classifications.

      Depends on contrib/analyzers/ngrams for tokenization, and on Weka for classification (logistic support vector models), feature selection, and normalization of token frequencies. Optionally uses Wikipedia and NekoHTML for harvesting training data.
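
      To illustrate what "normalization of token frequencies" can mean for an n-gram detector, here is a minimal sketch that counts character n-grams and converts the counts to relative frequencies. The class and method names (NGramProfile, relativeFrequencies) are hypothetical and not part of the patch; the patch itself relies on contrib/analyzers/ngrams and Weka for these steps.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not the patch's code: builds a character n-gram
// profile whose values are relative frequencies that sum to 1.
class NGramProfile {

    static Map<String, Double> relativeFrequencies(String text, int n) {
        Map<String, Double> freq = new HashMap<>();
        String normalized = text.toLowerCase(Locale.ROOT);
        int total = normalized.length() - n + 1; // number of n-grams in the text
        if (total <= 0) {
            return freq; // text is shorter than a single n-gram
        }
        // Count every overlapping n-gram.
        for (int i = 0; i < total; i++) {
            freq.merge(normalized.substring(i, i + n), 1.0, Double::sum);
        }
        // Normalize counts so that documents of different length are comparable.
        freq.replaceAll((gram, count) -> count / total);
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(relativeFrequencies("Sverige", 2));
    }
}
```

      A short input yields only a handful of n-grams, so its profile is noisy. This is consistent with the note above that a paragraph of text is needed to avoid false positives.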

      Initialized like this:

          LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
      
          root.addBranch("uralic");
          root.addBranch("fino-ugric", "uralic");
          root.addBranch("ugric", "uralic");
          root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
      
          root.addBranch("proto-indo european");
          root.addBranch("germanic", "proto-indo european");
          root.addBranch("northern germanic", "germanic");
          root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
          root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
          root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
      
          root.addBranch("west germanic", "germanic");
          root.addLanguage("west germanic", "eng", "english", "en", "UK");
      
          root.mkdirs();
      
          LanguageClassifier classifier = new LanguageClassifier(root);
          if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
            classifier.compileTrainingData(); // from wikipedia
          }
          classifier.buildClassifier();
      

      The training set built from Wikipedia consists of the pages describing the home country of each registered language, written in the language to train. The example above passes this test:

      (testEquals is the same as assertEquals, except that it is not required to pass. Only one of them fails; see the comment.)

          assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
          testEquals("swe", classifier.classify(norway_in_swedish).getISO());
          testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
          testEquals("swe", classifier.classify(finland_in_swedish).getISO());
          testEquals("swe", classifier.classify(uk_in_swedish).getISO());
      
          testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
          assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
          testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
          testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
          testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
      
          testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
          testEquals("fin", classifier.classify(norway_in_finnish).getISO());
          testEquals("fin", classifier.classify(denmark_in_finnish).getISO());
          assertEquals("fin", classifier.classify(finland_in_finnish).getISO());
          testEquals("fin", classifier.classify(uk_in_finnish).getISO());
      
          testEquals("dan", classifier.classify(sweden_in_danish).getISO());
          // It is OK that this one fails: dan and nor are very similar, and the document about Norway in Danish is very small.
          testEquals("dan", classifier.classify(norway_in_danish).getISO());
          assertEquals("dan", classifier.classify(denmark_in_danish).getISO());
          testEquals("dan", classifier.classify(finland_in_danish).getISO());
          testEquals("dan", classifier.classify(uk_in_danish).getISO());
      
          testEquals("eng", classifier.classify(sweden_in_english).getISO());
          testEquals("eng", classifier.classify(norway_in_english).getISO());
          testEquals("eng", classifier.classify(denmark_in_english).getISO());
          testEquals("eng", classifier.classify(finland_in_english).getISO());
          assertEquals("eng", classifier.classify(uk_in_english).getISO());
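
      The testEquals helper used in the test above is not shown in this issue. A minimal sketch of such a non-fatal check, assuming it only needs to record mismatches instead of throwing, could look like this (the SoftAssert class and its failures() method are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for the testEquals helper: compares like
// assertEquals, but records a mismatch instead of failing the test.
class SoftAssert {

    private static final List<String> FAILURES = new ArrayList<>();

    static boolean testEquals(Object expected, Object actual) {
        if (Objects.equals(expected, actual)) {
            return true;
        }
        FAILURES.add("expected <" + expected + "> but was <" + actual + ">");
        return false;
    }

    // Mismatches recorded so far, in the order they occurred.
    static List<String> failures() {
        return FAILURES;
    }
}
```

      With this arrangement the assertEquals lines (each language classifying its own home country page) remain hard requirements, while the testEquals lines only report how well the classifier generalizes to the other pages.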
      

      I don't know how well it works on lots of languages, but this fits my needs for now. I'll try to do more work on considering the language trees when classifying.
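
      As one illustration of how the language tree might be taken into account, the sketch below stores the branches from the initialization example as parent links and finds the closest branch shared by two languages. It is an assumption about a possible approach, not what the patch does; LanguageTree and commonAncestor are hypothetical names and are unrelated to the patch's LanguageRoot.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a language tree kept as child -> parent links.
class LanguageTree {

    private final Map<String, String> parentOf = new HashMap<>();

    void addBranch(String name) {
        parentOf.put(name, null); // root branch, no parent
    }

    void addBranch(String name, String parent) {
        parentOf.put(name, parent);
    }

    void addLanguage(String branch, String iso) {
        parentOf.put(iso, branch);
    }

    // Path from a node up to its root, starting with the node itself.
    List<String> pathToRoot(String node) {
        List<String> path = new ArrayList<>();
        for (String cur = node; cur != null; cur = parentOf.get(cur)) {
            path.add(cur);
        }
        return path;
    }

    // Closest node shared by both paths, or null if the languages
    // belong to different families.
    String commonAncestor(String a, String b) {
        List<String> pathA = pathToRoot(a);
        for (String candidate : pathToRoot(b)) {
            if (pathA.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

      With the tree from the example, commonAncestor("dan", "nor") is "northern germanic", so a confusion between Danish and Norwegian stays within one branch, while "dan" and "fin" share no ancestor. A classifier could use this to report the shared branch when its top two candidates are close.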

      It takes a bit of time and RAM to build the training data, so the patch contains a pre-compiled ARFF file.

      Attachments

        1. ld.tar.gz
          7.73 MB
          Karl Wettin
        2. ld.tar.gz
          500 kB
          Karl Wettin

          People

            karl.wettin Karl Wettin