Apache Arrow / ARROW-17679

slow performance when reading data from GCP


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 9.0.0
    • Fix Version/s: None
    • Component/s: Parquet, Python
    • Labels: None

    Description

      I am using pyarrow and duckdb to query some parquet files in GCP. Thanks for making the experience so smooth, but I have an issue with the performance; see the code used below.
      import pyarrow.dataset as ds
      import duckdb
      import json

      # Open the GCS-backed datasets and register them with duckdb
      lineitem = ds.dataset("gs://xxxxx/lineitem")
      lineitem_partition = ds.dataset("gs://xxxx/yyy", format="parquet", partitioning="hive")
      lineitem_180 = ds.dataset("gs://xxxxx/lineitem_180", format="parquet", partitioning="hive")
      con = duckdb.connect()
      con.register("lineitem", lineitem)
      con.register("lineitem_partition", lineitem_partition)
      con.register("lineitem_180", lineitem_180)

      # HTTP handler: run the SQL passed in the request body and return the result as JSON
      def Query(request):
          SQL = request.get_json().get('name')
          df = con.execute(SQL).df()
          return json.dumps(df.to_json(orient="records")), 200, {'Content-Type': 'application/json'}
       
      The issue is that I am getting extremely slow throughput, around 30 MB per second; the same files on a local SSD laptop are extremely fast.
      I am not sure what the issue is. I tried querying with pyarrow compute directly and the performance is the same.
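
      For reference, a minimal timing sketch (my addition, assuming pyarrow 9.0.0; the bucket path is the same placeholder as above, and pre_buffer=True is just one setting worth testing, not a confirmed fix) that reads through pyarrow alone, without duckdb, so the GCS throughput can be measured in isolation:

      import time
      import pyarrow.dataset as ds

      # pre_buffer coalesces column-chunk reads into fewer, larger requests,
      # which tends to matter more for object stores than for a local SSD.
      parquet_format = ds.ParquetFileFormat(
          default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
      )

      lineitem = ds.dataset("gs://xxxxx/lineitem", format=parquet_format)

      start = time.time()
      table = lineitem.scanner(use_threads=True).to_table()
      elapsed = time.time() - start

      # Report the effective throughput of the scan
      print(f"rows={table.num_rows}, bytes={table.nbytes}, "
            f"throughput={table.nbytes / elapsed / 1e6:.1f} MB/s")

      If this already shows around 30 MB/s, the bottleneck is in the pyarrow/GCS reads themselves rather than in duckdb.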

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: mimoune djouallah
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated: