  1. Joshua (Retired)
  2. JOSHUA-315

Thrax keeps all rules


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 6.1

    Description

      When extracting rules, Thrax keeps every target-side option for each source side. For large bitexts and common source sides (e.g., "de" for Spanish–English), there can be tens of thousands of translations, due to alignment errors and phenomena like garbage collection. The decoder throws out all but the top num_translation_options of these (default 20), but before doing so, it has to score every target-side option with all feature functions, including the language model. This slows down "warming up" of the model and means that the first sentences to use these rules are very slow to translate.

      I have updated scripts/training/filter-rules.pl to filter rules using Thrax's rarity penalty field, but it would be much better if Thrax itself kept only the 100 most frequent translation options for each source side.
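
      The requested behavior can be sketched roughly as follows. This is only an illustration of per-source top-k pruning, not Thrax's code or the logic of filter-rules.pl: the function name keep_top_k and the (source, target, count) tuple shape are hypothetical, and it assumes a raw frequency count is available for each rule.

```python
import heapq
from collections import defaultdict


def keep_top_k(rules, k=100):
    """Keep the k most frequent target sides for each source side.

    `rules` is an iterable of (source, target, count) tuples. This shape
    is a hypothetical stand-in, not Thrax's actual rule representation.
    """
    # Group every target-side option under its source side.
    by_source = defaultdict(list)
    for source, target, count in rules:
        by_source[source].append((count, target))

    # For each source side, retain only the k highest-count options.
    kept = {}
    for source, options in by_source.items():
        top = heapq.nlargest(k, options, key=lambda option: option[0])
        kept[source] = [(target, count) for count, target in top]
    return kept
```

      Pruning at extraction time like this would mean the decoder never has to score the long tail of rare target sides before applying num_translation_options.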


          People

            Assignee: Unassigned
            Reporter: Matt Post