Details
- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
Description
I have ~800 GB of CSVs distributed across ~1200 files and a mere 32 GB of RAM. My objective is to incrementally build a Parquet dataset holding the whole collection, since I can only hold a small subset of the data in memory at a time.
Following the docs as best I could, I was able to hack together a workflow that does what I need, but it seems overly complex. I hope my problem is not out of scope, so I would love it if there was an effort to:
1) streamline the APIs to make this more straightforward
2) provide better documentation on how to approach this problem
3) ship out-of-the-box CLI utilities that would do this without any effort on my part
Expanding on 3), I was imagining something like a `parquet-cat`, `parquet-append`, `parquet-sample`, `parquet-metadata` or similar that would allow interacting with these files from the terminal. As it is, they are just blobs that require additional tooling to get even the barest sense of what is within.
Reproducible example below. Happy to hear what I missed that would have made this more straightforward, or that would also generate the Parquet metadata at the same time.
EDIT: made the example generate random dataframes so it can be run directly. It was too close to my use case, where I was reading files from disk.
    import itertools

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.dataset as ds


    def gen_batches():
        NUM_CSV_FILES = 15
        NUM_ROWS = 25
        for _ in range(NUM_CSV_FILES):
            dataf = pd.DataFrame(np.random.randint(0, 100, size=(NUM_ROWS, 5)), columns=list("abcde"))
            # PyArrow dataset would only consume batches iterable
            for batch in pa.Table.from_pandas(dataf).to_batches():
                yield batch


    batches = gen_batches()

    # using the write_dataset method requires providing the schema, which is not accessible from a batch?
    peek_batch = batches.__next__()
    # needed to build a table to get to the schema
    schema = pa.Table.from_batches([peek_batch]).schema
    # consumed the first entry of the generator, rebuild it here
    renew_gen_batches = itertools.chain([peek_batch], batches)

    ds.write_dataset(renew_gen_batches, base_dir="parquet_dst.parquet", format="parquet", schema=schema)

    # attempting write_dataset with an iterable of Tables threw:
    # pyarrow.lib.ArrowTypeError: Could not unwrap RecordBatch from Python object of type 'pyarrow.lib.Table'
Issue Links
- relates to ARROW-14931 [Python] csv/orc format strings missing from some dataset docs (Resolved)