Details
Type: Wish
Priority: Minor
Status: Closed
Resolution: Not A Problem
Description
Traditionally I work with CSVs and really suffer from slow read/write times. Parquet and the Arrow project obviously give us huge speedups.
One thing I've noticed, however, is that there is a serious bottleneck when converting a DataFrame read in through pyarrow to a DMatrix used by xgboost. For example, I'm building a model with about 180k rows and 6k float64 columns. Reading into a pandas DataFrame takes about 20 seconds on my machine. However, converting that DataFrame to a DMatrix takes well over 10 minutes.
Interestingly, it takes about 10 minutes to read that same data from a CSV into a pandas DataFrame. Then, it takes less than a minute to convert to a DMatrix.
I'm sure there's a good technical explanation for why this happens (e.g. row- vs. column-oriented storage). Still, I imagine this use case is common enough that it would be great to improve these times, if possible.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import xgboost as xgb

# Reading from parquet:
table = pq.read_table('/path/to/parquet/files')  # 20 seconds
variables = table.to_pandas()  # 1 second
dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag'])  # takes 10-15 minutes

# Reading from CSV:
variables = pd.read_csv('/path/to/file.csv', ...)  # takes about 10 minutes
dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag'])  # less than 1 minute
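For what it's worth, a possible mitigation, sketched below under the assumption that the slowdown comes from the DMatrix constructor walking the DataFrame column by column, is to materialise the features as a single contiguous NumPy array before handing them to xgboost. Whether this actually helps here would need measuring; the path and the 'tag' column name are just placeholders from the example above.

import numpy as np
import pyarrow.parquet as pq
import xgboost as xgb

table = pq.read_table('/path/to/parquet/files')
variables = table.to_pandas()

# Assumed workaround: copy the feature columns into one contiguous
# float32 block up front, so DMatrix ingests a plain dense array
# instead of a wide pandas DataFrame.
features = np.ascontiguousarray(
    variables.drop(['tag'], axis=1).to_numpy(dtype=np.float32))
labels = variables['tag'].to_numpy()

dtrain = xgb.DMatrix(features, label=labels)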