  1. OpenNLP
  2. OPENNLP-515

Request for multi-word expression (MWE) support in serialization formats

Details

    Description

      Multi-word expressions (MWEs) are expressions made up of several whitespace-separated words, such as "traffic light", "in order to", "two thousand and one", or "Jules Verne".

      So far, when training a model with the CLI (in particular a POS model), there has been no way to specify whether a token is a simple word or a multi-word expression.
      By convention, users join the words of an MWE with the underscore character so that the MWE becomes a single token.
      Consequently, a model trained through the API on the same data can differ, since the API does not require this preprocessing.

      We need to give users the possibility to set, through a CLI parameter, the character sequence used as the MWE separator.

      This concerns both trainers and labelers.

      A default MWE separator should be specified; it will be used when serializing data that contains MWEs.
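
      To make the request concrete, here is a minimal sketch of a configurable MWE separator with a default value. It is not OpenNLP code: the class name, the constant and both methods are hypothetical and only show the joining and splitting that the separator would control.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MweSeparatorSketch {

    // Hypothetical default, mirroring the underscore convention described above.
    public static final String DEFAULT_MWE_SEPARATOR = "_";

    // Join the words of one MWE into a single token using the given separator.
    public static String toToken(List<String> words, String separator) {
        return String.join(separator, words);
    }

    // Split a token back into its words. The separator is quoted so that it is
    // treated as a literal character sequence, not as a regular expression.
    public static List<String> toWords(String token, String separator) {
        return Arrays.asList(token.split(Pattern.quote(separator)));
    }

    public static void main(String[] args) {
        // Serialize with the default separator.
        String token = toToken(Arrays.asList("in", "order", "to"), DEFAULT_MWE_SEPARATOR);
        System.out.println(token);

        // Read the token back with the same separator.
        System.out.println(toWords(token, DEFAULT_MWE_SEPARATOR));

        // A user-supplied separator, as a CLI parameter might provide.
        System.out.println(toToken(Arrays.asList("Jules", "Verne"), "~~"));
    }
}
```

      With such a parameter, trainers and labelers would agree on the same separator. The default would apply only when the user does not set one.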

          People

            Assignee: Unassigned
            Reporter: Nicolas Hernandez (nicolas.hernandez)