Apache Arrow / ARROW-14781

[Docs][Python] Improved Tooling/Documentation on Constructing Larger than Memory Parquet

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Components: Documentation, Python

    Description

      I have ~800 GB of CSVs spread across ~1,200 files and a mere 32 GB of RAM. My objective is to incrementally build a Parquet dataset holding the whole collection, so I can only ever hold a small subset of the data in memory.

      Following the docs as best I could, I was able to hack together a workflow that does what I need, but it seems overly complex. I hope my problem is not out of scope, and I would love to see an effort to:

      1) streamline the APIs to make this more straightforward,
      2) improve the documentation on how to approach this problem, and
      3) provide out-of-the-box CLI utilities that would do this without any effort on my part.

      Expanding on 3), I was imagining something like `parquet-cat`, `parquet-append`, `parquet-sample`, `parquet-metadata`, or similar, that would allow interacting with these files from the terminal. As it is, they are opaque blobs that require additional tooling to get even the barest sense of what is inside.
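
      To make 3) concrete, a rough sketch of what a hypothetical `parquet-metadata` command could look like if it were just a thin wrapper around the existing pyarrow.parquet API. The command name and interface are invented for illustration; only the pyarrow calls inside it exist today.

      import argparse

      import pyarrow.parquet as pq

      def main():
          # hypothetical CLI: print a human-readable summary of a Parquet file
          parser = argparse.ArgumentParser(
              description="Summarize the metadata of a single Parquet file.")
          parser.add_argument("path", help="path to a Parquet file")
          args = parser.parse_args()

          pf = pq.ParquetFile(args.path)
          meta = pf.metadata
          print(f"rows:        {meta.num_rows}")
          print(f"row groups:  {meta.num_row_groups}")
          print(f"columns:     {meta.num_columns}")
          print(f"created by:  {meta.created_by}")
          print("schema:")
          print(pf.schema_arrow)

      if __name__ == "__main__":
          main()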

      Reproducible example below. Happy to hear what I missed that would have made this more straightforward, or that would also generate the Parquet metadata at the same time.
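
      On the metadata point, a hedged sketch of one option: the pyarrow Parquet docs describe collecting per-file metadata through a metadata_collector and then writing a combined _metadata file with pq.write_metadata. Below, that pattern is applied chunk by chunk; the directory name and chunk sizes are illustrative, and I have not verified how repeated write_to_dataset calls interact with a shared collector.

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      root_path = "parquet_dst_with_metadata"
      metadata_collector = []
      schema = None

      for _ in range(15):  # stand-in for looping over the ~1200 CSV files
          dataf = pd.DataFrame(np.random.randint(0, 100, size=(25, 5)),
                               columns=list("abcde"))
          table = pa.Table.from_pandas(dataf)
          schema = table.schema
          # each call writes a file under root_path and appends its FileMetaData
          # (with row group statistics) to metadata_collector
          pq.write_to_dataset(table, root_path,
                              metadata_collector=metadata_collector)

      # combine everything collected so far into a single _metadata file
      pq.write_metadata(schema, f"{root_path}/_metadata",
                        metadata_collector=metadata_collector)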

      EDIT: made the example generate random dataframes so it can be run directly. The original was too close to my use case, where I was reading files from disk.

      import itertools
      
      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.dataset as ds
      
      def gen_batches():
          NUM_CSV_FILES = 15
          NUM_ROWS = 25
          for _ in range(NUM_CSV_FILES):
              dataf = pd.DataFrame(np.random.randint(0, 100, size=(NUM_ROWS, 5)), columns=list("abcde"))
      
          # when passed a generator, write_dataset only consumes RecordBatches,
          # so unpack each Table into its batches
          yield from pa.Table.from_pandas(dataf).to_batches()
      
      
      batches = gen_batches()
      
      # using the write_dataset method requires providing the schema, which is not accessible from a batch?
      peek_batch = next(batches)
      # needed to build a table to get to the schema
      schema = pa.Table.from_batches([peek_batch]).schema
      
      # consumed the first entry of the generator, rebuild it here
      renew_gen_batches = itertools.chain([peek_batch], batches)
      
      ds.write_dataset(renew_gen_batches, base_dir="parquet_dst.parquet", format="parquet", schema=schema)
      # attempting write_dataset with an iterable of Tables threw: pyarrow.lib.ArrowTypeError: Could not unwrap RecordBatch from Python object of type 'pyarrow.lib.Table'
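
      For reference, a hedged sketch of the shortest variant I could come up with: a RecordBatch does expose a .schema attribute, and recent pyarrow versions also have pa.RecordBatchReader.from_batches, which bundles schema and stream into one object. Whether either of these is the intended idiom is exactly the kind of thing better docs would settle; directory names are illustrative.

      # reuses gen_batches() and the imports from the example above
      batches = gen_batches()
      first_batch = next(batches)
      schema = first_batch.schema  # the schema is available directly on the batch
      all_batches = itertools.chain([first_batch], batches)

      # same write_dataset call as above, just without the Table round-trip
      ds.write_dataset(all_batches, base_dir="parquet_dst2.parquet",
                       format="parquet", schema=schema)

      # depending on the pyarrow version, write_dataset may also accept a
      # RecordBatchReader, which carries its own schema:
      # reader = pa.RecordBatchReader.from_batches(schema, gen_batches())
      # ds.write_dataset(reader, base_dir="parquet_dst3.parquet", format="parquet")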
      

            People

              Assignee: Unassigned
              Reporter: Damien Ready (ludicrous_speed)
              Votes: 0
              Watchers: 4
