Details
Type: Bug
Status: Open
Priority: P3
Resolution: Unresolved
Description
After running the pipeline for a while in streaming mode (reading from Pub/Sub and writing to BigQuery, Datastore, and another Pub/Sub topic), I noticed drastic memory growth in the process. Profiling with guppy, I got the following results:
At start:
INFO *** MemoryReport Heap:
Partition of a set of 240208 objects. Total size = 34988840 bytes.
 Index  Count   %      Size   %  Cumulative  %  Kind (class / dict of class)
     0  88289  37   8696984  25     8696984 25  str
     1  53333  22   4897352  14    13594336 39  tuple
     2   5083   2   2790664   8    16385000 47  dict (no owner)
     3   1939   1   1749656   5    18134656 52  type
     4    699   0   1723272   5    19857928 57  dict of module
     5  12337   5   1579136   5    21437064 61  types.CodeType
     6  12403   5   1488360   4    22925424 66  function
     7   1939   1   1452616   4    24378040 70  dict of type
     8    677   0    709496   2    25087536 72  dict of 0x1e4d880
     9  25603  11    614472   2    25702008 73  int
<1103 more rows. Type e.g. '_.more' to view.>
After several hours of running:
INFO *** MemoryReport Heap:
Partition of a set of 1255662 objects. Total size = 315029632 bytes.
 Index   Count   %       Size   %  Cumulative  %  Kind (class / dict of class)
     0   95554   8   99755056  32    99755056 32  dict of apache_beam.runners.direct.bundle_factory._Bundle
     1  117943   9   54193192  17   153948248 49  dict (no owner)
     2  161068  13   27169296   9   181117544 57  unicode
     3   94571   8   26479880   8   207597424 66  dict of apache_beam.pvalue.PBegin
     4  126461  10   12715336   4   220312760 70  str
     5   44374   4   12424720   4   232737480 74  dict of apitools.base.protorpclite.messages.FieldList
     6   44374   4    6348624   2   239086104 76  apitools.base.protorpclite.messages.FieldList
     7   95556   8    6115584   2   245201688 78  apache_beam.runners.direct.bundle_factory._Bundle
     8   94571   8    6052544   2   251254232 80  apache_beam.pvalue.PBegin
     9   57371   5    5218424   2   256472656 81  tuple
<1187 more rows. Type e.g. '_.more' to view.>
It looks like every bundle, along with all of its data, is still sitting in memory. Why aren't they garbage-collected?
What is the garbage-collection policy for the Dataflow processes?
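For reference, the reports above appear to come from guppy's `hpy().heap()`, which partitions the live heap by object kind. A similar periodic memory report can be sketched with the standard-library `tracemalloc` module instead (all names here, such as `top_allocations`, are my own illustration, not part of the original pipeline's code):

```python
import tracemalloc


def top_allocations(limit=10):
    """Return the top memory-allocating source lines as (location, size, count)
    tuples -- grouped by allocation site rather than by class, but similar in
    spirit to a guppy heap partition."""
    snapshot = tracemalloc.take_snapshot()
    stats = snapshot.statistics("lineno")
    return [(str(s.traceback[0]), s.size, s.count) for s in stats[:limit]]


# Tracing must be started before the allocations you want to observe.
tracemalloc.start()

# Allocate something measurable so the report is non-empty.
data = [str(i) * 100 for i in range(1000)]

report = top_allocations()
for location, size, count in report:
    print(f"{location}: {size} bytes in {count} blocks")
```

Logging such a report on a timer (as the `MemoryReport` lines above suggest) makes it possible to compare snapshots over time and see which allocation sites grow without bound.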