Details
- Type: Task
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 0.15.1
- Fix Version: None
- Environment: Mac OSX
Description
My datasets are a somewhat unusual case: they contain large strings (1-100 MB each).
Let's take a single row as an example.
43mb.csv is a 1-row CSV with 10 columns; one of the columns contains a 43 MB string.
When I read this CSV with pandas and then dump it to Parquet, my script consumes roughly 10x the 43 MB.
As the number of such rows increases, the relative memory overhead diminishes, but I want to focus on this specific case.
Here's the footprint after running it under memory_profiler:
Line #    Mem usage    Increment   Line Contents
================================================
     4     48.9 MiB     48.9 MiB   @profile
     5                             def test():
     6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
     7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
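For completeness, a minimal reproduction sketch matching the profiled lines above. The make_43mb_csv helper and the exact column layout are my assumptions (the original 43mb.csv is not attached); only the read_csv/to_parquet calls come from the profiler output.

import pandas as pd
from memory_profiler import profile


def make_43mb_csv(path='43mb.csv', big_len=43 * 1024 * 1024):
    # Hypothetical generator: 9 small columns plus one column holding
    # a single ~43 MB string, written as a 1-row CSV.
    row = {f'col{i}': ['x'] for i in range(9)}
    row['big'] = ['a' * big_len]
    pd.DataFrame(row).to_csv(path, index=False)


@profile
def test():
    # The two lines measured in the profiler output above.
    data = pd.read_csv('43mb.csv')
    data.to_parquet('out.parquet')


if __name__ == '__main__':
    make_43mb_csv()
    test()

Running the script prints the per-line memory report shown above (to_parquet assumes pyarrow is installed as the Parquet engine).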
Is this typical for Parquet in the case of big strings?
Attachments
Issue Links
- relates to: ARROW-6994 [C++] Research jemalloc memory page reclamation configuration on macOS when background_thread option is unavailable (Resolved)