[JOSHUA-340] Revamp Tokenization and Normalization - ASF JIRA

Details

Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: core, pipeline
Labels:
- gsoc2019

Description

As part of preprocessing, Joshua tokenizes input sentences, for example by splitting punctuation off from words. This is currently done by a set of Perl preprocessing scripts [1], but it would be preferable to move this step into the decoder itself.

[1]: https://github.com/apache/incubator-joshua/tree/master/scripts/preparation
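
To make the tokenization step described above concrete, here is a minimal Java sketch of splitting punctuation off from words. It is an illustration only: the class name `SimpleTokenizer` and the method `tokenize` are hypothetical and are not part of Joshua's API, and the rule used (pad every Unicode punctuation character with spaces) is a deliberate simplification, not a port of the Perl scripts.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Joshua's actual API. */
final class SimpleTokenizer {
    // Matches any single Unicode punctuation character (general category P).
    private static final Pattern PUNCT = Pattern.compile("(\\p{P})");
    // Matches runs of whitespace so they can be collapsed to one space.
    private static final Pattern SPACES = Pattern.compile("\\s+");

    private SimpleTokenizer() {}

    /** Returns the sentence with punctuation separated from words by single spaces. */
    static String tokenize(String sentence) {
        // Surround every punctuation character with spaces.
        String spaced = PUNCT.matcher(sentence).replaceAll(" $1 ");
        // Collapse the extra whitespace this introduces and trim the ends.
        return SPACES.matcher(spaced).replaceAll(" ").trim();
    }
}
```

For example, `SimpleTokenizer.tokenize("Hello, world!")` returns `Hello , world !`. This naive rule also splits apostrophes and hyphens (so `don't` becomes `don ' t`), which is one reason a tokenizer built into the decoder would need language-specific handling for contractions, abbreviations, and numbers.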

People

Assignee: Unassigned

Reporter: Tommaso Teofili

Votes: 0

Watchers: 3

Dates

Created: 15/Mar/19 07:31

Updated: 18/Mar/19 21:19