Apache Arrow / ARROW-14781

[Docs][Python] Improved Tooling/Documentation on Constructing Larger than Memory Parquet

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Components: Documentation, Python

    Description

      I have ~800 GB of CSVs spread across ~1,200 files and a mere 32 GB of RAM. My objective is to incrementally build a Parquet dataset holding the whole collection, so I can only ever hold a small subset of the data in memory.

      Following the docs as best I could, I was able to hack together a workflow that does what I need, but it seems overly complex. I hope my problem is not out of scope, and I would love to see an effort to:

      1) streamline the APIs to make this more straightforward,
      2) improve the documentation on how to approach this problem, and
      3) provide out-of-the-box CLI utilities that would do this without any effort on my part.

      Expanding on 3), I was imagining something like `parquet-cat`, `parquet-append`, `parquet-sample`, `parquet-metadata`, or similar, that would allow interacting with these files from the terminal. As it is, they are opaque blobs that require additional tooling to get even the barest sense of what is inside.
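
      To make 3) concrete, a rough sketch of what a hypothetical `parquet-metadata` command could look like if it were just a thin wrapper around the existing pyarrow.parquet API. The command name and interface are invented for illustration; only the pyarrow calls inside it exist today.

      import argparse

      import pyarrow.parquet as pq

      def main():
          # hypothetical CLI: print a human-readable summary of a Parquet file
          parser = argparse.ArgumentParser(
              description="Summarize the metadata of a single Parquet file.")
          parser.add_argument("path", help="path to a Parquet file")
          args = parser.parse_args()

          pf = pq.ParquetFile(args.path)
          meta = pf.metadata
          print(f"rows:        {meta.num_rows}")
          print(f"row groups:  {meta.num_row_groups}")
          print(f"columns:     {meta.num_columns}")
          print(f"created by:  {meta.created_by}")
          print("schema:")
          print(pf.schema_arrow)

      if __name__ == "__main__":
          main()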

      Reproducible example below. Happy to hear what I missed that would have made this more straightforward, or that would also generate the Parquet metadata at the same time.
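
      On the metadata point, a hedged sketch of one option: the pyarrow Parquet docs describe collecting per-file metadata through a metadata_collector and then writing a combined _metadata file with pq.write_metadata. Below, that pattern is applied chunk by chunk; the directory name and chunk sizes are illustrative, and I have not verified how repeated write_to_dataset calls interact with a shared collector.

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      root_path = "parquet_dst_with_metadata"
      metadata_collector = []
      schema = None

      for _ in range(15):  # stand-in for looping over the ~1200 CSV files
          dataf = pd.DataFrame(np.random.randint(0, 100, size=(25, 5)),
                               columns=list("abcde"))
          table = pa.Table.from_pandas(dataf)
          schema = table.schema
          # each call writes a file under root_path and appends its FileMetaData
          # (with row group statistics) to metadata_collector
          pq.write_to_dataset(table, root_path,
                              metadata_collector=metadata_collector)

      # combine everything collected so far into a single _metadata file
      pq.write_metadata(schema, f"{root_path}/_metadata",
                        metadata_collector=metadata_collector)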

      EDIT: made the example generate random dataframes so it can be run directly. The original was too close to my use case, where I was reading files from disk.

      import itertools
      
      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.dataset as ds
      
      def gen_batches():
          NUM_CSV_FILES = 15
          NUM_ROWS = 25
          for _ in range(NUM_CSV_FILES):
              dataf = pd.DataFrame(np.random.randint(0, 100, size=(NUM_ROWS, 5)), columns=list("abcde"))
      
          # when passed a generator, write_dataset only consumes RecordBatches,
          # so unpack each Table into its batches
          yield from pa.Table.from_pandas(dataf).to_batches()
      
      
      batches = gen_batches()
      
      # using the write_dataset method requires providing the schema, which is not accessible from a batch?
      peek_batch = next(batches)
      # needed to build a table to get to the schema
      schema = pa.Table.from_batches([peek_batch]).schema
      
      # consumed the first entry of the generator, rebuild it here
      renew_gen_batches = itertools.chain([peek_batch], batches)
      
      ds.write_dataset(renew_gen_batches, base_dir="parquet_dst.parquet", format="parquet", schema=schema)
      # attempting write_dataset with an iterable of Tables threw: pyarrow.lib.ArrowTypeError: Could not unwrap RecordBatch from Python object of type 'pyarrow.lib.Table'
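
      For reference, a hedged sketch of the shortest variant I could come up with: a RecordBatch does expose a .schema attribute, and recent pyarrow versions also have pa.RecordBatchReader.from_batches, which bundles schema and stream into one object. Whether either of these is the intended idiom is exactly the kind of thing better docs would settle; directory names are illustrative.

      # reuses gen_batches() and the imports from the example above
      batches = gen_batches()
      first_batch = next(batches)
      schema = first_batch.schema  # the schema is available directly on the batch
      all_batches = itertools.chain([first_batch], batches)

      # same write_dataset call as above, just without the Table round-trip
      ds.write_dataset(all_batches, base_dir="parquet_dst2.parquet",
                       format="parquet", schema=schema)

      # depending on the pyarrow version, write_dataset may also accept a
      # RecordBatchReader, which carries its own schema:
      # reader = pa.RecordBatchReader.from_batches(schema, gen_batches())
      # ds.write_dataset(reader, base_dir="parquet_dst3.parquet", format="parquet")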
      

            People

              Assignee: Unassigned
              Reporter: Damien Ready (ludicrous_speed)
              Votes: 0
              Watchers: 4
