Joshua (Retired) / JOSHUA-340

Revamp Tokenization and Normalization

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core, pipeline

    Description

      As part of preprocessing, Joshua tokenizes input sentences, for example by splitting punctuation off from words. This is currently done with a set of Perl preprocessing scripts [1], but it would be nice to move this step into the decoder itself.

      [1] : https://github.com/apache/incubator-joshua/tree/master/scripts/preparation
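
To illustrate the kind of step being discussed, here is a minimal Java sketch of punctuation splitting. It is not Joshua's API and does not reproduce the behavior of the Perl scripts: the class name `SimpleTokenizer` and its `tokenize` method are hypothetical, and the rule (put spaces around every character that is not a letter, digit, or whitespace) is a deliberately naive assumption.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Joshua.
final class SimpleTokenizer {
    // Any single character that is not a Unicode letter, digit, or whitespace.
    private static final Pattern PUNCT = Pattern.compile("([^\\p{L}\\p{N}\\s])");
    // Runs of whitespace, collapsed after punctuation has been padded.
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Returns the sentence with punctuation separated from words by single spaces.
    static String tokenize(String sentence) {
        String padded = PUNCT.matcher(sentence).replaceAll(" $1 ");
        return SPACES.matcher(padded).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world!")); // prints: Hello , world !
    }
}
```

This rule also splits contractions, decimal numbers, and URLs (for example, "3.5%" becomes "3 . 5 %"), so a tokenizer that lives in the decoder would likely need language-specific exceptions on top of it.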

          People

            Assignee: Unassigned
            Reporter: Tommaso Teofili (teofili)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated: