I am trying to convert a very sparse dataset to parquet (~3% of rows in a range are populated). The file I am working with spans up to ~63M rows. I decided to iterate in batches of 500k rows, 127 batches in total. Each row batch is a RecordBatch. I create 4 batches at a time and write to a parquet file incrementally. Something like this:
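(What follows is a trimmed sketch rather than the attached script; `TOTAL_ROWS`, `schema`, and `make_batch` stand in for the pysam-backed code.)

```python
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_ROWS = 500_000       # rows per RecordBatch
BATCHES_PER_WRITE = 4      # RecordBatches combined into one Table per write

# `schema`, `TOTAL_ROWS`, and `make_batch(start, stop)` are placeholders for
# the real pysam-backed code in the attached script.
with pq.ParquetWriter("variants.parquet", schema) as writer:
    step = BATCH_ROWS * BATCHES_PER_WRITE
    for start in range(0, TOTAL_ROWS, step):
        batches = [
            make_batch(lo, min(lo + BATCH_ROWS, TOTAL_ROWS))
            for lo in range(start, min(start + step, TOTAL_ROWS), BATCH_ROWS)
        ]
        writer.write_table(pa.Table.from_batches(batches, schema=schema))
```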
I was getting a segmentation fault at the final step, and I narrowed it down to a specific iteration. That iteration contained empty batches; the per-batch row counts were [0, 0, 2876, 14423]. The number of rows in each RecordBatch for the whole dataset is listed below:
If I exclude the empty RecordBatches, the segfault goes away (see the short sketch below), but unfortunately I couldn't create a proper minimal example with synthetic data.
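For reference, the workaround is just a filter on `num_rows` before the write, applied inside the loop sketched above:

```python
# Workaround: drop zero-row RecordBatches before assembling the Table.
non_empty = [b for b in batches if b.num_rows > 0]
if non_empty:
    writer.write_table(pa.Table.from_batches(non_empty, schema=schema))
```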
The data I am using is from the 1000 Genomes Project, which has been public for many years, so we can be reasonably sure the data is good. The following steps should help you replicate the issue.
- Download the data file (and index), about 330MB:
- Install the Cython library pysam, a thin wrapper around the reference implementation of the VCF file spec. You will need the zlib headers, but that's probably not a problem.
- Now you can use the attached script to replicate the crash; a trimmed sketch of its per-window batch building follows this list.
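For orientation, this is roughly what the script does per 500k-position window (a sketch only; the column set, `input.vcf.gz`, and `CONTIG` are placeholders, the real file is the one downloaded above):

```python
import pysam
import pyarrow as pa

schema = pa.schema([("pos", pa.int64()), ("ref", pa.string()), ("alt", pa.string())])

vcf = pysam.VariantFile("input.vcf.gz")  # placeholder for the downloaded file (+ index)
CONTIG = "1"                             # placeholder: the chromosome being exported

def make_batch(start, stop):
    """Collect one window of variant records into a RecordBatch.

    Because the data is sparse (~3% of positions carry a record), a window
    can come back with zero rows, i.e. an empty RecordBatch.
    """
    pos, ref, alt = [], [], []
    for rec in vcf.fetch(CONTIG, start, stop):
        pos.append(rec.pos)
        ref.append(rec.ref)
        alt.append(",".join(rec.alts) if rec.alts else None)
    return pa.RecordBatch.from_arrays(
        [pa.array(pos, pa.int64()), pa.array(ref, pa.string()), pa.array(alt, pa.string())],
        schema=schema,
    )
```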
I have tried attaching gdb; the backtrace when the segfault occurs is shown below (maybe it helps; this is how I realised the empty batches could be the reason).