[SPARK-19668] Multiple NGram sizes - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.1.0
Fix Version/s: None
Component/s: ML
Labels:
- beginner
- bulk-closed
- easyfix
- newbie

Description

It would be nice to be able to specify a range (or perhaps a list) of n-gram sizes, as scikit-learn allows; see
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

This shouldn't be difficult to add: the code is very straightforward, and I can implement it. The only open question is the NGram API: should it accept a single number, a tuple, or a list?
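
To make the request concrete, here is a minimal sketch in plain Python of what a range of n-gram sizes could produce. It is not Spark code and not the proposed implementation: the function name ngrams_in_range and its min_n/max_n parameters are hypothetical, chosen to resemble scikit-learn's ngram_range=(min_n, max_n). N-grams are emitted as space-joined strings, which is how Spark's existing NGram transformer represents them.

```python
from typing import List, Sequence


def ngrams_in_range(tokens: Sequence[str], min_n: int, max_n: int) -> List[str]:
    """Return every n-gram with min_n <= n <= max_n as a space-joined string.

    Hypothetical helper for illustration only; not part of Spark or scikit-learn.
    """
    if min_n < 1 or max_n < min_n:
        raise ValueError("expected 1 <= min_n <= max_n")
    result = []
    for n in range(min_n, max_n + 1):
        # A token sequence shorter than n yields no n-grams of that size.
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


print(ngrams_in_range(["a", "b", "c"], 1, 2))
# prints ['a', 'b', 'c', 'a b', 'b c']
```

Taking an inclusive (min_n, max_n) pair is only one of the options raised above; a list of sizes would allow non-contiguous choices such as 1 and 3, at the cost of a slightly less compact parameter.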

Issue Links

is duplicated by

SPARK-20838 Spark ML ngram feature extractor should support ngram range like scikit (Resolved)

links to

[Github] Pull Request #19659 (mpetruska)


People

Assignee: Unassigned

Reporter: Jacek KK

Votes: 0

Watchers: 7

Dates

Created: 20/Feb/17 13:20

Updated: 16/Jan/20 00:08

Resolved: 21/May/19 04:14