Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version: 2.1.1
- Fix Version: None
- Component: None
Description
Currently the Spark ML NGram transformer requires a single n-gram size (which defaults to 2).
This means that to tokenize to words, bigrams and trigrams (which is pretty common) you need a pipeline like this:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, NGram

tokenizer = Tokenizer(inputCol="text", outputCol="tokenized_text")
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="words")
bigram = NGram(n=2, inputCol=remover.getOutputCol(), outputCol="bigrams")
trigram = NGram(n=3, inputCol=remover.getOutputCol(), outputCol="trigrams")
pipeline = Pipeline(stages=[tokenizer, remover, bigram, trigram])
That's not terrible, but the big problem is that the words, bigrams and trigrams end up in separate columns, and the only way (in PySpark) to combine them is to explode each of the words, bigrams and trigrams columns and then union the results.
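Conceptually, that explode-and-union just flattens the three array columns back into one token list per row; a minimal plain-Python sketch of the equivalent operation (column names taken from the pipeline above, sample tokens invented for illustration):

```python
# One output row from the pipeline above: three separate array columns.
row = {
    "words": ["spark", "ml", "rocks"],
    "bigrams": ["spark ml", "ml rocks"],
    "trigrams": ["spark ml rocks"],
}

# Exploding each column and unioning the results amounts to
# concatenating the three lists into one flat list of features.
features = [t for col in ("words", "bigrams", "trigrams") for t in row[col]]
# → ['spark', 'ml', 'rocks', 'spark ml', 'ml rocks', 'spark ml rocks']
```

Doing this with three explodes and a union in Spark, however, incurs a shuffle-heavy plan for what is logically a per-row concatenation.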
In my experience this means feature extraction is slower this way than with a Python UDF. This seems preposterous!
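The Python UDF alternative alluded to above could look something like this sketch (the multi_ngrams helper and its sizes parameter are illustrative, not part of Spark ML):

```python
def multi_ngrams(tokens, sizes=(1, 2, 3)):
    """Return n-grams of every requested size as one flat list."""
    out = []
    for n in sizes:
        # slide a window of width n over the token list
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

multi_ngrams(["spark", "ml", "rocks"])
# → ['spark', 'ml', 'rocks', 'spark ml', 'ml rocks', 'spark ml rocks']
```

In PySpark this function would be registered with pyspark.sql.functions.udf, returning ArrayType(StringType()), and applied to the "words" column in a single pass instead of running separate NGram stages.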
Issue Links
- duplicates SPARK-19668 (Multiple NGram sizes) - Resolved