[SYSTEMDS-3607] Split Balanced Internal - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: SystemDS 3.2
Component/s: None
Labels:
- StudentProject

Description

This task is to create a new internal split-balanced builtin that optimizes the existing splitBalanced builtin.

A list of optimizations initially proposed:

1. Avoid the remove-empty step after the split, so that the outputs are not allocated twice.
2. Avoid materializing the combined MatrixBlock of X and Y. Instead, sort internally and use the index produced by the sort to select the elements for the balanced split.
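
The second proposal can be illustrated with a small sketch. This is not the SystemDS implementation: it is plain Python over lists, the function names `split_balanced_indices` and `select_rows` are hypothetical, and it assumes that "balanced" means every class contributes the same fraction of its rows to the training set. It also does not shuffle within a class, which a real builtin might do. The point is only that sorting row indices by label is enough, so X and Y never have to be combined and no empty rows have to be removed afterwards.

```python
import math

def split_balanced_indices(y, split_ratio=0.7):
    """Return (train_idx, test_idx); each class puts ~split_ratio of its rows in train."""
    # Sort row indices by label (stable), instead of sorting a combined [Y, X] matrix.
    order = sorted(range(len(y)), key=lambda i: y[i])
    train_idx, test_idx = [], []
    start = 0
    while start < len(order):
        # Find the end of the run of rows that share the current label.
        end = start
        while end < len(order) and y[order[end]] == y[order[start]]:
            end += 1
        n_train = math.floor((end - start) * split_ratio)
        train_idx.extend(order[start:start + n_train])
        test_idx.extend(order[start + n_train:end])
        start = end
    return train_idx, test_idx

def select_rows(X, idx):
    """Gather the selected rows directly into an output of exactly the right size."""
    return [X[i] for i in idx]

# Toy data: 6 rows of class 1 and 4 rows of class 2.
X = [[float(i), float(i * 10)] for i in range(10)]
y = [1, 2, 1, 1, 2, 1, 2, 1, 1, 2]

train_idx, test_idx = split_balanced_indices(y, 0.5)
X_train, X_test = select_rows(X, train_idx), select_rows(X, test_idx)
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]
```

Because the outputs are built from index lists, each output is allocated once at its final size, which is the effect the first proposal asks for.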

Additionally, this task could look into the normal random split, which also calls remove empty, and move it into the same internal (overloaded) random split instruction.
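
The same index-based idea could apply to the random split. The sketch below is an assumption about how such a split might avoid the remove-empty step, not a description of the existing builtin: the name `split_random_indices` and its parameters are hypothetical. It permutes the row indices once and slices the permutation, so both outputs can be gathered directly.

```python
import random

def split_random_indices(n_rows, split_ratio=0.7, seed=13):
    """Return (train_idx, test_idx) from one seeded permutation of the row indices."""
    perm = list(range(n_rows))
    random.Random(seed).shuffle(perm)      # local generator, reproducible per seed
    n_train = int(n_rows * split_ratio)    # number of rows that go to the train set
    return perm[:n_train], perm[n_train:]

# Toy usage: split 8 rows in half and gather the rows by index.
rows = [[float(i)] for i in range(8)]
rand_train_idx, rand_test_idx = split_random_indices(len(rows), 0.5, seed=13)
rand_train = [rows[i] for i in rand_train_idx]
rand_test = [rows[i] for i in rand_test_idx]
```

A balanced and a random split would then differ only in how the index lists are produced, which is one way to read the suggestion of a single overloaded instruction.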

An example script to optimize (the purpose of this script is to generate a challenging dataset to classify):

x = rand(rows=$1, cols=$2, min=$3, max=$4, seed=13)
x = ceil(x)

source("nn/layers/tanh.dml") as tanh

xs = scale(x, TRUE, TRUE)
L1 = rand(rows=$2, cols=100, min=-1, max=1, seed=14)
L2 = rand(rows=100, cols=50, min=-1, max=1, seed=15)
L3 = rand(rows=50, cols=25, min=-1, max=1, seed=16)
L4 = rand(rows=25, cols=10, min=-1, max=1, seed=18)

x1 = tanh::forward(xs %*% L1)
x2 = tanh::forward(x1 %*% L2)
x3 = tanh::forward(x2 %*% L3)
x4 = x3 %*% L4

y = rowIndexMax(x4)

yt = table(y, 1)
print("Class Distribution")
print(toString(t(yt)))

[x, y, xt, yt] = splitBalanced(X=x,Y=y)
write(x, $5, format=$9)
write(y, $6, format=$9)
write(xt, $7, format=$9)
write(yt, $8, format=$9)

People

Assignee: Unassigned

Reporter: Sebastian Baunsgaard

Votes: 0

Watchers: 2

Dates

Created: 07/Aug/23 10:11

Updated: 21/Aug/23 12:36

Resolved: 21/Aug/23 12:36