Details

Type: Bug
Status: Resolved
Priority: P1
Resolution: Won't Fix
Affects Version: 2.22.0
Fix Version: None
Labels: Important
Description
I have this small sample:
import csv

import apache_beam as beam
import apache_beam.io.filebasedsource


class CsvFileSource(apache_beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, range_tracker):
        with open(file_name, 'r') as f:
            reader = csv.DictReader(f)
            print("Load CSV file")
            for rec in reader:
                yield rec


if __name__ == '__main__':
    with beam.Pipeline() as p:
        count_feature = (
            p
            | 'create' >> beam.io.Read(CsvFileSource("myFile.csv"))
            | 'count element' >> beam.combiners.Count.Globally()
            | 'Print' >> beam.Map(print)
        )
For some reason, if the CSV file is large enough, it is loaded several times. For example, with a file of 80,000 rows (18.5 MB), the file is loaded 5 times, and in the end I have 400,000 elements (5 × 80,000) in my PCollection.
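For reference, a likely cause (my assumption, not confirmed in this issue): FileBasedSource is splittable by default, and the read_records above ignores range_tracker, so each byte-range split re-reads the whole file. A minimal sketch of a workaround under that assumption, marking the source as unsplittable via the splittable constructor argument:

import csv

import apache_beam as beam
from apache_beam.io.filebasedsource import FileBasedSource


class CsvFileSource(FileBasedSource):
    def __init__(self, file_pattern):
        # splittable=False tells Beam not to split the file into byte
        # ranges, so read_records runs exactly once per matched file.
        super().__init__(file_pattern, splittable=False)

    def read_records(self, file_name, range_tracker):
        with open(file_name, 'r') as f:
            for rec in csv.DictReader(f):
                yield rec  # each CSV row becomes one element

The alternative would be to honor range_tracker (claiming byte offsets with range_tracker.try_claim) so that parallel reads of the same file do not overlap.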