Multi-word expressions (MWEs) are expressions made up of several whitespace-separated words, such as "traffic light", "in order to", "two thousand and one" or "Jules Verne".
So far, when training a model with the CLI (in particular a POS model), there has been no way to specify whether a token is a simple word or a multi-word expression.
By convention, users join the words of an MWE with the underscore character so that the whole MWE becomes a single token.
Consequently, a model trained through the API on the same data can differ from one trained through the CLI, since the API does not require this preprocessing.
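To make the underscore convention concrete, here is a minimal sketch in Python of the kind of preprocessing users currently do by hand. The function name `join_mwes` and its parameters are hypothetical and not part of any existing tool; the list of known MWEs is assumed to be supplied by the user.

```python
def join_mwes(tokens, mwes, separator="_"):
    """Merge every known MWE found in `tokens` into a single token.

    Hypothetical helper: `mwes` is a user-supplied list of
    whitespace-separated expressions, `separator` the joining sequence.
    """
    # Turn each MWE into a tuple of words, skip single words,
    # and try the longest expressions first.
    patterns = sorted(
        (tuple(m.split()) for m in mwes if len(m.split()) > 1),
        key=len,
        reverse=True,
    )
    result = []
    i = 0
    while i < len(tokens):
        for pattern in patterns:
            n = len(pattern)
            if tuple(tokens[i:i + n]) == pattern:
                # The words of the MWE become one token.
                result.append(separator.join(pattern))
                i += n
                break
        else:
            # No MWE starts here: keep the simple word as-is.
            result.append(tokens[i])
            i += 1
    return result
```

For example, `join_mwes("wait at the traffic light".split(), ["traffic light"])` merges the last two words into the single token `traffic_light`.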
We need to let users set the MWE separator character sequence through a CLI parameter.
This concerns both trainers and labelers.
A default MWE separator should also be specified; it will be used when serializing data that contains MWEs.
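The requested behaviour could look roughly like the following Python sketch. It is an illustration only: the names `DEFAULT_MWE_SEPARATOR`, `serialize` and `deserialize` are hypothetical, and the choice of the underscore as default is an assumption based on the current user convention, not a decision made in this issue.

```python
# Assumption: the default follows the existing underscore convention.
DEFAULT_MWE_SEPARATOR = "_"


def serialize(tokens, mwe_separator=DEFAULT_MWE_SEPARATOR):
    """Write tokens on one line, one whitespace between tokens.

    Words inside an MWE token are joined with `mwe_separator`.
    """
    if not mwe_separator or any(c.isspace() for c in mwe_separator):
        raise ValueError("the MWE separator must be non-empty and contain no whitespace")
    return " ".join(mwe_separator.join(token.split()) for token in tokens)


def deserialize(line, mwe_separator=DEFAULT_MWE_SEPARATOR):
    """Read tokens back, restoring the whitespace inside each MWE."""
    if not mwe_separator or any(c.isspace() for c in mwe_separator):
        raise ValueError("the MWE separator must be non-empty and contain no whitespace")
    return [" ".join(token.split(mwe_separator)) for token in line.split()]
```

With the default, `serialize(["Jules Verne", "wrote", "a lot"])` gives `Jules_Verne wrote a_lot`, and passing `mwe_separator="+"` gives `Jules+Verne wrote a+lot`. Note that a simple word that already contains the separator would be read back as an MWE, which is one reason to make the separator configurable.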