Details
- Type: Sub-task
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
Description
For certain ML algorithms, a column store is more efficient than the row store that is currently used everywhere. For example, deep decision trees can be faster to train when the data is partitioned by feature.
Proposal: Provide a method with the following API (probably in util/):
```
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
```
The input Vectors will be data rows/instances, and the output Vectors will be columns/features paired with column/feature indices.
*Question*: Is it important to maintain matrix structure? That is, should output Vectors in the same partition be adjacent columns in the matrix?
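To make the proposed semantics concrete, here is a minimal sketch of the transposition on local Scala collections. It is not the proposed implementation: `Seq` stands in for `RDD`, `Array[Double]` stands in for `Vector`, and the object name `RowToColumnSketch` is hypothetical. It also assumes dense rows of equal length. Returning columns sorted by index is an illustrative choice only and does not settle the question above about keeping adjacent columns in the same partition.

```scala
// Local-collection sketch of the row-to-column transposition.
// Assumptions: Seq stands in for RDD, Array[Double] for Vector,
// and all rows are dense with the same number of features.
object RowToColumnSketch {
  def rowToColumnStore(rows: Seq[Array[Double]]): Seq[(Int, Array[Double])] = {
    // Tag every value with its column index (the key) and its row index.
    val entries: Seq[(Int, (Int, Double))] =
      rows.zipWithIndex.flatMap { case (row, rowIdx) =>
        row.toSeq.zipWithIndex.map { case (value, colIdx) =>
          (colIdx, (rowIdx, value))
        }
      }

    // Group by column index, then restore the original row order
    // inside each column before dropping the row indices.
    entries
      .groupBy(_._1)
      .toSeq
      .map { case (colIdx, cells) =>
        val column = cells.map(_._2).sortBy(_._1).map(_._2).toArray
        (colIdx, column)
      }
      .sortBy(_._1)
  }
}
```

Each output pair holds a feature index and that feature's values across all instances, in the original row order. A distributed version would presumably follow the same key-by-column-then-group shape, at the cost of a full shuffle of the data.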
Issue Links
- blocks: SPARK-3717 DecisionTree, RandomForest: Partition by feature (Resolved)