  1. OpenNLP
  2. OPENNLP-515

Request for multi-word expression (MWE) support in serialization formats

      Description

      Multi-word expressions (MWEs) are expressions made up of whitespace-separated words, such as "traffic light", "in order to", "two thousand and one", or "Jules Verne".

      So far, when training a model through the CLI (a POS model in particular), there has been no way to indicate whether a token is a simple word or a multi-word expression.
      By convention, users join the words of an MWE with the underscore character so that the MWE becomes a single token.
      Consequently, a model trained through the API on the same data can differ, since the API does not require this preprocessing.
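
To make the convention concrete, here is a minimal Java sketch of the preprocessing users currently apply by hand. The class and method names are hypothetical and are not part of OpenNLP, and the sketch assumes the position of the MWE in the sentence is already known.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an OpenNLP class: illustrates the underscore convention only.
final class MweUnderscoreJoiner {

    /**
     * Returns a copy of tokens in which the words at positions
     * [start, end) are merged into one token joined by "_".
     */
    static List<String> join(List<String> tokens, int start, int end) {
        List<String> out = new ArrayList<>(tokens.subList(0, start));
        out.add(String.join("_", tokens.subList(start, end)));
        out.addAll(tokens.subList(end, tokens.size()));
        return out;
    }
}
```

For example, joining positions 1 to 3 of "the traffic light is red" turns the five whitespace-separated words into four tokens, with "traffic_light" as a single token.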

      We need to let users set, through a CLI parameter, the character sequence used as the MWE separator.

      This concerns both trainers and labelers.

      A default MWE separator should also be defined; it will be used when serializing data that contains MWEs.
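
One possible shape for this request is sketched below in Java. Everything here is an assumption made for illustration: the class name, the method names, and the choice of "_" as the default (taken from the convention described above) are hypothetical and do not correspond to any existing OpenNLP API or CLI option.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not an OpenNLP class.
final class MweSeparatorCodec {

    // Assumed default, mirroring the underscore convention; the issue does not fix a value.
    static final String DEFAULT_SEPARATOR = "_";

    /** Serializes the words of one MWE as a single token. */
    static String serialize(List<String> words, String separator) {
        return String.join(separator, words);
    }

    /** Splits a serialized token back into its words. */
    static List<String> deserialize(String token, String separator) {
        // Pattern.quote treats the separator literally, so "+" or "." are safe.
        return Arrays.asList(token.split(Pattern.quote(separator)));
    }
}
```

With a user-supplied separator such as "+", the words "in", "order", "to" would serialize to "in+order+to" and split back into the same three words. A trainer and a labeler would need to agree on the separator for the round trip to hold.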

            People

            • Assignee: Unassigned
            • Reporter: Nicolas Hernandez (nicolas.hernandez)