  1. Apache Arrow
  2. ARROW-1374

Compatibility with xgboost

Details

    • Type: Wish
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem

    Description

      Traditionally I work with CSVs and really suffer from slow read/write times. Parquet and the Arrow project obviously give us huge speedups.

      One thing I've noticed, however, is that there is a serious bottleneck when converting a DataFrame read in through pyarrow to a DMatrix used by xgboost. For example, I'm building a model with about 180k rows and 6k float64 columns. Reading into a pandas DataFrame takes about 20 seconds on my machine. However, converting that DataFrame to a DMatrix takes well over 10 minutes.

      Interestingly, it takes about 10 minutes to read that same data from a CSV into a pandas DataFrame. Then, it takes less than a minute to convert to a DMatrix.

      I'm sure there's a good technical explanation for why this happens (e.g. row vs column storage). Still, I imagine many users will run into this use case, and it would be great to improve these times, if possible.

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      import xgboost as xgb
      
      # Reading from parquet:
      table = pq.read_table('/path/to/parquet/files')  # 20 seconds
      variables = table.to_pandas()  # 1 second
      dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag'])  # takes 10-15 minutes
      
      # Reading from CSV:
      variables = pd.read_csv('/path/to/file.csv', ...)  # takes about 10 minutes
      dtrain = xgb.DMatrix(variables.drop(['tag'], axis=1), label=variables['tag'])  # less than 1 minute
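
One way to test the "row vs column storage" guess from the description is to inspect the memory layout of the DataFrame and, if needed, hand xgboost a row-major NumPy array instead of the DataFrame itself. The sketch below is only an illustration under stated assumptions: the helper names `describe_layout` and `to_row_major`, the synthetic data, and the choice of float32 are hypothetical and not part of the original report, and nothing in this issue establishes that memory layout is the actual cause of the slowdown.

```python
import numpy as np
import pandas as pd


def describe_layout(df: pd.DataFrame) -> dict:
    """Report dtype and memory-layout flags of the frame's 2-D NumPy view."""
    values = df.to_numpy()
    return {
        "dtype": str(values.dtype),
        "c_contiguous": bool(values.flags["C_CONTIGUOUS"]),
        "f_contiguous": bool(values.flags["F_CONTIGUOUS"]),
    }


def to_row_major(df: pd.DataFrame, dtype=np.float32) -> np.ndarray:
    """Copy the frame into a C-contiguous (row-major) array of the given dtype."""
    return np.ascontiguousarray(df.to_numpy(), dtype=dtype)


# Small synthetic stand-in for the real data (hypothetical column names).
demo = pd.DataFrame(np.random.rand(1000, 20), columns=[f"f{i}" for i in range(20)])
demo["tag"] = np.random.randint(0, 2, size=1000)

feature_frame = demo.drop(columns=["tag"])
layout = describe_layout(feature_frame)   # layout of the frame as pandas stores it
features = to_row_major(feature_frame)    # explicit row-major copy
labels = demo["tag"].to_numpy()

print(layout)
print(features.flags["C_CONTIGUOUS"], features.shape, features.dtype)
```

With the real data, `features` and `labels` could then be passed as `xgb.DMatrix(features, label=labels)`, and the conversion timed against the DataFrame-based call from the snippet above to see whether layout makes any difference.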
      

          People

            Assignee: Unassigned
            Reporter: Steven Anton (santon)
            Votes: 0
            Watchers: 3
