Details
Type: Wish
Priority: Minor
Status: Closed
Resolution: Not A Problem
Description
Traditionally I work with CSVs and really suffer from slow read/write times. Parquet and the Arrow project obviously give us huge speedups.
One thing I've noticed, however, is that there is a serious bottleneck when converting a DataFrame read in through pyarrow to a DMatrix used by xgboost. For example, I'm building a model with about 180k rows and 6k float64 columns. Reading into a pandas DataFrame takes about 20 seconds on my machine. However, converting that DataFrame to a DMatrix takes well over 10 minutes.
Interestingly, it takes about 10 minutes to read that same data from a CSV into a pandas DataFrame. Then, it takes less than a minute to convert to a DMatrix.
I'm sure there's a good technical explanation for why this happens (e.g. row- vs. column-oriented storage). Still, I imagine this use case is common enough that it would be great to improve these times, if possible.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import xgboost as xgb

# Reading from parquet:
table = pq.read_table('/path/to/parquet/files')  # 20 seconds
variables = table.to_pandas()  # 1 second
dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag'])  # takes 10-15 minutes

# Reading from CSV:
variables = pd.read_csv('/path/to/file.csv', ...)  # takes about 10 minutes
dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag'])  # less than 1 minute
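For what it's worth, a possible mitigation, sketched below under the assumption that the slowdown comes from the DMatrix constructor walking the DataFrame column by column, is to materialise the features as a single contiguous NumPy array before handing them to xgboost. Whether this actually helps here would need measuring; the path and the 'tag' column name are just placeholders from the example above.

import numpy as np
import pyarrow.parquet as pq
import xgboost as xgb

table = pq.read_table('/path/to/parquet/files')
variables = table.to_pandas()

# Assumed workaround: copy the feature columns into one contiguous
# float32 block up front, so DMatrix ingests a plain dense array
# instead of a wide pandas DataFrame.
features = np.ascontiguousarray(
    variables.drop(['tag'], axis=1).to_numpy(dtype=np.float32))
labels = variables['tag'].to_numpy()

dtrain = xgb.DMatrix(features, label=labels)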