[SPARK-4285] Transpose RDD[Vector] to column store for ML - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: None
Fix Version/s: None
Component/s: MLlib
Labels:
- bulk-closed

Description

For certain ML algorithms, a column store is more efficient than a row store, which is what MLlib currently uses everywhere. For example, deep decision trees can be faster to train when the data is partitioned by feature.

Proposal: Provide a method with the following API (probably in util/):
```
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
```
The input Vectors will be data rows/instances, and the output Vectors will be columns/features paired with column/feature indices.
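One way to read the proposed signature is sketched below. This is only an illustration under stated assumptions, not the implementation the ticket settled on (the ticket was closed as Incomplete). It assumes every row has the same length, that a single column (one value per row) fits in memory on one executor, and that densifying sparse rows is acceptable. The enclosing object name `ColumnStore` is hypothetical.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical holder object; the ticket only suggests "probably in util/".
object ColumnStore {

  /** Sketch: transpose rows into (columnIndex, column) pairs with one full shuffle. */
  def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)] = {
    data
      .zipWithIndex() // attach a stable row index: RDD[(Vector, Long)]
      .flatMap { case (row, rowIdx) =>
        // Emit one (colIdx, (rowIdx, value)) record per matrix entry.
        // Assumption: sparse rows are densified here via toArray.
        row.toArray.iterator.zipWithIndex.map { case (value, colIdx) =>
          (colIdx, (rowIdx, value))
        }
      }
      .groupByKey() // gather every entry of a column on one executor
      .mapValues { entries =>
        // Restore the original row order inside each column.
        Vectors.dense(entries.toArray.sortBy(_._1).map(_._2))
      }
  }
}
```

This version shuffles every matrix entry individually, so it is meant to show the semantics of the transpose rather than an efficient layout. A practical version would likely transpose blocks of rows within each partition before shuffling.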

*Question*: Is it important to maintain matrix structure? That is, should output Vectors in the same partition be adjacent columns in the matrix?
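If adjacency does turn out to matter, one possible approach is a custom `Partitioner` that assigns contiguous ranges of column indices to the same partition. The sketch below is an assumption about how that could look, not something the ticket specifies; the class name `ColumnBlockPartitioner` and its parameters are hypothetical.

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner: maps contiguous blocks of column indices to one partition,
// so columns that are adjacent in the matrix are stored together.
class ColumnBlockPartitioner(numCols: Int, val numPartitions: Int) extends Partitioner {
  require(numCols > 0 && numPartitions > 0, "numCols and numPartitions must be positive")

  // Ceiling division: the number of adjacent columns stored per partition.
  private val colsPerPartition: Int = (numCols + numPartitions - 1) / numPartitions

  override def getPartition(key: Any): Int = key match {
    case colIdx: Int =>
      require(colIdx >= 0 && colIdx < numCols, s"column index out of range: $colIdx")
      colIdx / colsPerPartition
    case other =>
      throw new IllegalArgumentException(s"expected an Int column index, got: $other")
  }
}
```

An `RDD[(Int, Vector)]` keyed by column index, such as the output of the proposed `rowToColumnStore`, could then be laid out with `partitionBy(new ColumnBlockPartitioner(numCols, numPartitions))`. With 10 columns and 4 partitions, columns 0 to 2 land in partition 0, columns 3 to 5 in partition 1, columns 6 to 8 in partition 2, and column 9 in partition 3.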

Issue Links

blocks: SPARK-3717 DecisionTree, RandomForest: Partition by feature (Resolved)

People

Assignee: Unassigned
Reporter: Joseph K. Bradley
Votes: 1
Watchers: 4

Dates

Created: 07/Nov/14 00:48
Updated: 08/Oct/19 05:41
Resolved: 08/Oct/19 05:41
