Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
Description
This task is to create a new split balanced internal builtin that optimize the already existing builtin called splitBalanced.
A list of optimizations initially proposed:
1. Avoid having to remove empty elements after, to avoid double allocation
2. Avoid having to materialize the combined MatrixBlock of X and Y but internally sort them and use the index created from sorting to construct a selection of elements for the balanced split.
Additionally this task could look into our normal random split that also calls remove empty, and move this into the same internal (overloaded) random split instruction.
An example script to optimize (the purpose of this script is to generate a challenging dataset to classify):
x = rand(rows=$1, cols=$2, min=$3, max=$4, seed=13) x = ceil(x) source("nn/layers/tanh.dml") as tanh xs = scale(x, TRUE, TRUE) L1 = rand(rows=$2, cols=100, min=-1, max =1, seed=14) L2 = rand(rows=100, cols=50, min=-1, max =1, seed=15) L3 = rand(rows=50, cols=25, min=-1, max =1, seed=16) L4 = rand(rows=25, cols=10, min=-1, max =1, seed=18) x1 = tanh::forward(x %*% L1) x2 = tanh::forward(x1 %*% L2) x3 = tanh::forward(x2 %*% L3) x4 = x3 %*% L4 y = rowIndexMax(x4) yt= table(y, 1) print("Class Distribution") print(toString(t(yt))) [x, y, xt, yt] = splitBalanced(X=x,Y=y) write(x, $5, format=$9) write(y, $6, format=$9) write(xt, $7, format=$9) write(yt, $8, format=$9)