I'm fine with porting the existing seq2sparse to Spark for 0.10.1 so that we have a complete pipeline. In the long term, though, we need to rethink how we want to do this. seq2sparse was the big bottleneck in the legacy MR pipeline, and it offered no way to incrementally update the term vectors as new documents stream in.
There have been discussions in the past about maybe using a Finite State Automaton (which has come with Lucene since 4.0), Word2Vec, etc. See the discussion in https://issues.apache.org/jira/browse/MAHOUT-1252