  1. Joshua (Retired)
  2. JOSHUA-307

Java-based tokenization and normalization

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 6.2
    • Component/s: None
    • Labels: None

    Description

      Currently, Joshua expects input to be lowercased, normalized, and tokenized before it is passed in, consistent with the way the training data was prepared. In practice, this means running Perl scripts on the input data. It would be nice if these Perl scripts (located under $JOSHUA/scripts/preparation) were rewritten in Java (under org.apache.joshua.util) so that Joshua could do this preprocessing itself. This would be particularly useful for the language packs.
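
      To make the request concrete, here is a minimal sketch of what such a Java utility might look like. It is not a port of the existing Perl scripts: the class name InputPreprocessor, its method names, and the specific rules (NFKC normalization, mapping curly quotes to ASCII, splitting every non-alphanumeric character into its own token) are all assumptions chosen for illustration. A real implementation would need to reproduce the behavior of the scripts under $JOSHUA/scripts/preparation exactly, so that decoding input matches the training data.

```java
import java.text.Normalizer;
import java.util.Locale;

/**
 * Hypothetical preprocessing helper (not an existing Joshua class).
 * The rules below are deliberately naive placeholders.
 */
final class InputPreprocessor {

    private InputPreprocessor() {}

    /** Applies Unicode NFKC, maps curly quotes to ASCII, and collapses whitespace. */
    static String normalize(String line) {
        String s = Normalizer.normalize(line, Normalizer.Form.NFKC);
        s = s.replace('\u2018', '\'').replace('\u2019', '\'')
             .replace('\u201C', '"').replace('\u201D', '"');
        return s.replaceAll("\\s+", " ").trim();
    }

    /**
     * Puts spaces around every character that is not a letter, digit, or
     * whitespace. This also splits contractions and decimals ("3.5"),
     * which a production tokenizer would handle with extra rules.
     */
    static String tokenize(String line) {
        String s = line.replaceAll("([^\\p{L}\\p{N}\\s])", " $1 ");
        return s.replaceAll("\\s+", " ").trim();
    }

    /** Lowercases with a fixed locale so results do not depend on the JVM default. */
    static String lowercase(String line) {
        return line.toLowerCase(Locale.ROOT);
    }

    /** Runs the three steps in order: normalize, tokenize, lowercase. */
    static String prepare(String line) {
        return lowercase(tokenize(normalize(line)));
    }
}
```

      With this sketch, InputPreprocessor.prepare("Hello,   World!") yields "hello , world !". Keeping the three steps as separate methods would let a language pack enable only the steps that were applied to its training data.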

          People

            Assignee: Unassigned
            Reporter: Matt Post
            Votes: 0
            Watchers: 3
