Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version: 2.1.1
- Fix Version: None
- Component: None
Description
Currently the Spark ML NGram transformer requires a single n-gram size (which defaults to 2).
This means that to tokenize to words, bigrams and trigrams (which is pretty common) you need a pipeline like this:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, NGram

tokenizer = Tokenizer(inputCol="text", outputCol="tokenized_text")
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="words")
bigram = NGram(n=2, inputCol=remover.getOutputCol(), outputCol="bigrams")
trigram = NGram(n=3, inputCol=remover.getOutputCol(), outputCol="trigrams")
pipeline = Pipeline(stages=[tokenizer, remover, bigram, trigram])
That's not terrible, but the big problem is that the words, bigrams and trigrams end up in separate columns, and the only way (in PySpark) to combine them is to explode each of the words, bigrams and trigrams columns and then union the results.
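Conceptually, that explode-and-union just flattens the three array columns back into one token list per row; a minimal plain-Python sketch of the equivalent operation (column names taken from the pipeline above, sample tokens invented for illustration):

```python
# One output row from the pipeline above: three separate array columns.
row = {
    "words": ["spark", "ml", "rocks"],
    "bigrams": ["spark ml", "ml rocks"],
    "trigrams": ["spark ml rocks"],
}

# Exploding each column and unioning the results amounts to
# concatenating the three lists into one flat list of features.
features = [t for col in ("words", "bigrams", "trigrams") for t in row[col]]
# → ['spark', 'ml', 'rocks', 'spark ml', 'ml rocks', 'spark ml rocks']
```

Doing this with three explodes and a union in Spark, however, incurs a shuffle-heavy plan for what is logically a per-row concatenation.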
In my experience this means feature extraction is slower this way than with a Python UDF. This seems preposterous!
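The Python UDF alternative alluded to above could look something like this sketch (the multi_ngrams helper and its sizes parameter are illustrative, not part of Spark ML):

```python
def multi_ngrams(tokens, sizes=(1, 2, 3)):
    """Return n-grams of every requested size as one flat list."""
    out = []
    for n in sizes:
        # slide a window of width n over the token list
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

multi_ngrams(["spark", "ml", "rocks"])
# → ['spark', 'ml', 'rocks', 'spark ml', 'ml rocks', 'spark ml rocks']
```

In PySpark this function would be registered with pyspark.sql.functions.udf, returning ArrayType(StringType()), and applied to the "words" column in a single pass instead of running separate NGram stages.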
Issue Links
- duplicates SPARK-19668 (Multiple NGram sizes) - Resolved