Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 9.0.0
Fix Version/s: None
Component/s: None
Description
I am using pyarrow and duckdb to query some Parquet files in GCP (thanks for making the experience so smooth), but I am running into a performance issue. The code I use is below.
import pyarrow.dataset as ds
import duckdb
import json

# Arrow datasets backed by Parquet files on GCS
lineitem = ds.dataset("gs://xxxxx/lineitem")
lineitem_partition = ds.dataset("gs://xxxx/yyy", format="parquet", partitioning="hive")
lineitem_180 = ds.dataset("gs://xxxxx/lineitem_180", format="parquet", partitioning="hive")

# Register the datasets so DuckDB can scan them directly
con = duckdb.connect()
con.register("lineitem", lineitem)
con.register("lineitem_partition", lineitem_partition)
con.register("lineitem_180", lineitem_180)

def Query(request):
    # Cloud Function entry point: run the SQL text passed in the request body
    SQL = request.get_json().get('name')
    df = con.execute(SQL).df()
    # serialize the result and return it as the HTTP response
    return json.dumps(df.to_json(orient="records")), 200, {'Content-Type': 'application/json'}
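For context, the function reads the SQL text from the 'name' key of the JSON request body; a hypothetical payload looks like this (the query text is just an example):

# hypothetical example of the request body consumed by Query(request)
payload = {"name": "SELECT l_returnflag, count(*) FROM lineitem GROUP BY l_returnflag"}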
The issue is that I am getting extremely slow throughput, around 30 MB per second, while reading the same files from my laptop's local SSD is extremely fast.
I am not sure what the problem is. I also tried running the query with pyarrow compute directly, and the performance is the same.
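For reference, this is roughly how the throughput number above can be reproduced with pyarrow alone, assuming a full scan of one of the datasets (the bucket path is a placeholder):

import time
import pyarrow.dataset as ds

dataset = ds.dataset("gs://xxxxx/lineitem", format="parquet")

start = time.time()
table = dataset.to_table()   # full scan through the GCS filesystem
elapsed = time.time() - start

# table.nbytes is the in-memory size of the materialized table
print(f"{table.nbytes / 1e6:.0f} MB in {elapsed:.1f} s "
      f"= {table.nbytes / (1e6 * elapsed):.0f} MB/s")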