Description
It would be nice to make the data preprocessing phase pluggable, with a simple default Java implementation and, eventually, more advanced implementations based on external tools such as Apache OpenNLP.
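As a rough illustration of what "pluggable" could mean here, a minimal sketch follows. The names `Preprocessor` and `SimplePreprocessor` and the token-list return type are assumptions made for this example, not existing classes in the project; the default shown only lowercases and splits on non-alphanumeric characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical plug-in point for the preprocessing phase (name is an assumption). */
interface Preprocessor {
    /** Turns raw input text into a list of cleaned tokens. */
    List<String> preprocess(String rawText);
}

/** Hypothetical simple default: lowercases and splits on non-letter, non-digit characters. */
class SimplePreprocessor implements Preprocessor {
    @Override
    public List<String> preprocess(String rawText) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String token : rawText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // split() can yield an empty leading element; skip it.
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Under this sketch, a more advanced implementation (for example one backed by Apache OpenNLP) would implement the same interface, so callers would not need to change when the implementation is swapped.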
The pluggable phase should replace our script-based preprocessing: