[OPENNLP-515] Request for multi-word expression (MWE) support in serialization formats - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: tools-1.5.3
Fix Version/s: None
Component/s: Chunker, Command Line Interface, Doccat, Name Finder, Parser, POS Tagger
Labels: None

Description

Multi-word expressions (MWEs) are expressions made up of whitespace-separated words, such as "traffic light", "in order to", "two thousand and one", or "Jules Verne".

So far, when training a model through the CLI (in particular a POS model), there has been no way to specify whether a token is a simple word or a multi-word expression.
By convention, users join the words of an MWE with the underscore character so that the MWE becomes a single token.
Consequently, a model trained through the API on the same data can differ, since the API does not require this preprocessing.
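
To illustrate the underscore convention described above, here is a minimal Python sketch. It assumes the one-sentence-per-line `token_tag` layout commonly used for POS training data, and the helper names `to_training_line` and `parse_training_line` are hypothetical; none of this is OpenNLP API code.

```python
# Hypothetical helpers for illustration only; not part of the OpenNLP API.

def to_training_line(tagged_tokens):
    """Serialize (words, tag) pairs into one `token_tag` training line.

    Each item is (list_of_words, tag). An MWE has more than one word;
    its words are joined with "_" so that it becomes a single token.
    """
    return " ".join("_".join(words) + "_" + tag for words, tag in tagged_tokens)


def parse_training_line(line):
    """Read a `token_tag` line back, splitting each item on its LAST underscore."""
    pairs = []
    for item in line.split():
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs


sentence = [(["the"], "DT"), (["traffic", "light"], "NN"), (["works"], "VBZ")]
line = to_training_line(sentence)
print(line)                       # the_DT traffic_light_NN works_VBZ
print(parse_training_line(line))  # the MWE comes back as the single token "traffic_light"
```

Because the same character joins the MWE words and separates the token from its tag, the MWE boundary is only implicit in the serialized line, which is the ambiguity this issue raises.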

We need to let users set the MWE separator character sequence through a CLI parameter.

This concerns both trainers and labelers.

A default MWE separator should also be defined, to be used when serializing data that contains MWEs.
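
A possible shape for the requested behavior is sketched below in Python: a separator that can be overridden, with a default used otherwise. The names `DEFAULT_MWE_SEPARATOR`, `join_mwe`, and `split_mwe` are hypothetical, and the choice of "_" as the default is an assumption; the issue does not specify either.

```python
# Hypothetical sketch; names and the default value are assumptions,
# not an existing OpenNLP option.
DEFAULT_MWE_SEPARATOR = "_"


def join_mwe(words, separator=DEFAULT_MWE_SEPARATOR):
    """Serialize the words of an MWE as a single token."""
    if not separator:
        raise ValueError("separator must be a non-empty string")
    if any(separator in word for word in words):
        # Joining would make the token impossible to split back unambiguously.
        raise ValueError("a word already contains the separator")
    return separator.join(words)


def split_mwe(token, separator=DEFAULT_MWE_SEPARATOR):
    """Recover the words of an MWE from its serialized token."""
    if not separator:
        raise ValueError("separator must be a non-empty string")
    return token.split(separator)


print(join_mwe(["in", "order", "to"]))                 # in_order_to
print(join_mwe(["Jules", "Verne"], separator="~~"))    # Jules~~Verne
print(split_mwe("Jules~~Verne", separator="~~"))       # ['Jules', 'Verne']
```

A trainer and a labeler would need to be given the same separator so that the tokens seen at training time match those produced at labeling time.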

People

Assignee: Unassigned

Reporter: Nicolas Hernandez

Votes: 0

Watchers: 1

Dates

Created: 27/Jun/12 14:46

Updated: 16/Jan/17 14:33