Beam / BEAM-3622

DirectRunner memory issue with Python SDK

Details

    • Type: Bug
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Component: sdk-py-core

    Description

      After running a pipeline in streaming mode for a while (reading from Pub/Sub and writing to BigQuery, Datastore, and another Pub/Sub topic) I noticed drastic memory growth in the process. Using guppy as a profiler, I got the following results:

      start

       INFO *** MemoryReport Heap:
       Partition of a set of 240208 objects. Total size = 34988840 bytes.
       Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
           0  88289  37  8696984  25   8696984  25 str
           1  53333  22  4897352  14  13594336  39 tuple
           2   5083   2  2790664   8  16385000  47 dict (no owner)
           3   1939   1  1749656   5  18134656  52 type
           4    699   0  1723272   5  19857928  57 dict of module
           5  12337   5  1579136   5  21437064  61 types.CodeType
           6  12403   5  1488360   4  22925424  66 function
           7   1939   1  1452616   4  24378040  70 dict of type
           8    677   0   709496   2  25087536  72 dict of 0x1e4d880
           9  25603  11   614472   2  25702008  73 int
      <1103 more rows. Type e.g. '_.more' to view.>
      

      after several hours of running

      INFO *** MemoryReport Heap:
       Partition of a set of 1255662 objects. Total size = 315029632 bytes.
       Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
           0  95554   8 99755056  32  99755056  32 dict of
                                                   apache_beam.runners.direct.bundle_factory._Bundle
           1 117943   9 54193192  17 153948248  49 dict (no owner)
           2 161068  13 27169296   9 181117544  57 unicode
           3  94571   8 26479880   8 207597424  66 dict of apache_beam.pvalue.PBegin
           4 126461  10 12715336   4 220312760  70 str
           5  44374   4 12424720   4 232737480  74 dict of apitools.base.protorpclite.messages.FieldList
           6  44374   4  6348624   2 239086104  76 apitools.base.protorpclite.messages.FieldList
           7  95556   8  6115584   2 245201688  78 apache_beam.runners.direct.bundle_factory._Bundle
           8  94571   8  6052544   2 251254232  80 apache_beam.pvalue.PBegin
           9  57371   5  5218424   2 256472656  81 tuple
      <1187 more rows. Type e.g. '_.more' to view.>
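
      The reports above came from guppy. For reference, a similar periodic heap report can also be produced with the standard-library tracemalloc module; this is only a sketch with a simulated workload, not the profiler actually used for the dumps above:

      ```python
      import tracemalloc

      tracemalloc.start()

      # Simulate a workload that accumulates objects between reports.
      retained = [str(i) * 50 for i in range(20000)]

      # Take a snapshot and log the top allocation sites, analogous to
      # the "MemoryReport Heap" lines above.
      snapshot = tracemalloc.take_snapshot()
      for stat in snapshot.statistics("lineno")[:5]:
          print(stat)

      current, peak = tracemalloc.get_traced_memory()
      print("current=%d bytes, peak=%d bytes" % (current, peak))
      ```

      Running such a snapshot on a timer would show whether total traced memory keeps climbing between reports.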
      

       

      I see that every bundle, along with all of its data, still sits in memory. Why aren't they gc-ed?

      What is the garbage-collection policy for the Dataflow processes?
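
      One way to chase this down is to ask the gc module which objects still reference a bundle. This is a hedged, self-contained sketch: the Bundle class and `committed` list below are hypothetical stand-ins for apache_beam.runners.direct.bundle_factory._Bundle and whatever structure retains it, not actual Beam code:

      ```python
      import gc

      class Bundle(object):
          """Hypothetical stand-in for the DirectRunner's _Bundle."""
          def __init__(self, data):
              self.data = data

      # A container that keeps a reference to every bundle it ever saw,
      # which is enough to prevent collection.
      committed = []

      def process(data):
          bundle = Bundle(data)
          committed.append(bundle)  # lingering reference
          return len(bundle.data)

      for i in range(100):
          process([i] * 10)

      gc.collect()
      live = [o for o in gc.get_objects() if isinstance(o, Bundle)]
      print(len(live))  # -> 100: every bundle is still reachable

      # gc.get_referrers shows what is holding them: the `committed` list.
      holders = gc.get_referrers(live[0])
      print(any(h is committed for h in holders))  # -> True
      ```

      Running gc.get_referrers against one of the live _Bundle objects in the real process should point at whatever structure in the DirectRunner keeps them pinned.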

      People

        Assignee: Unassigned
        Reporter: yuri krnr
        Votes: 0
        Watchers: 4