Beam / BEAM-3622

DirectRunner memory issue with Python SDK

Details

    • Type: Bug
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Component: sdk-py-core

    Description

      After running a pipeline in streaming mode for a while (reading from Pub/Sub and writing to BigQuery, Datastore, and another Pub/Sub topic) I noticed drastic memory growth in the process. Using guppy as a profiler, I got the following results:

      start

       INFO *** MemoryReport Heap:
       Partition of a set of 240208 objects. Total size = 34988840 bytes.
       Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
           0  88289  37  8696984  25   8696984  25 str
           1  53333  22  4897352  14  13594336  39 tuple
           2   5083   2  2790664   8  16385000  47 dict (no owner)
           3   1939   1  1749656   5  18134656  52 type
           4    699   0  1723272   5  19857928  57 dict of module
           5  12337   5  1579136   5  21437064  61 types.CodeType
           6  12403   5  1488360   4  22925424  66 function
           7   1939   1  1452616   4  24378040  70 dict of type
           8    677   0   709496   2  25087536  72 dict of 0x1e4d880
           9  25603  11   614472   2  25702008  73 int
      <1103 more rows. Type e.g. '_.more' to view.>
      

      after several hours of running

      INFO *** MemoryReport Heap:
       Partition of a set of 1255662 objects. Total size = 315029632 bytes.
       Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
           0  95554   8 99755056  32  99755056  32 dict of
                                                   apache_beam.runners.direct.bundle_factory._Bundle
           1 117943   9 54193192  17 153948248  49 dict (no owner)
           2 161068  13 27169296   9 181117544  57 unicode
           3  94571   8 26479880   8 207597424  66 dict of apache_beam.pvalue.PBegin
           4 126461  10 12715336   4 220312760  70 str
           5  44374   4 12424720   4 232737480  74 dict of apitools.base.protorpclite.messages.FieldList
           6  44374   4  6348624   2 239086104  76 apitools.base.protorpclite.messages.FieldList
           7  95556   8  6115584   2 245201688  78 apache_beam.runners.direct.bundle_factory._Bundle
           8  94571   8  6052544   2 251254232  80 apache_beam.pvalue.PBegin
           9  57371   5  5218424   2 256472656  81 tuple
      <1187 more rows. Type e.g. '_.more' to view.>
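
      The reports above came from guppy. For reference, a similar periodic heap report can also be produced with the standard-library tracemalloc module; this is only a sketch with a simulated workload, not the profiler actually used for the dumps above:

      ```python
      import tracemalloc

      tracemalloc.start()

      # Simulate a workload that accumulates objects between reports.
      retained = [str(i) * 50 for i in range(20000)]

      # Take a snapshot and log the top allocation sites, analogous to
      # the "MemoryReport Heap" lines above.
      snapshot = tracemalloc.take_snapshot()
      for stat in snapshot.statistics("lineno")[:5]:
          print(stat)

      current, peak = tracemalloc.get_traced_memory()
      print("current=%d bytes, peak=%d bytes" % (current, peak))
      ```

      Running such a snapshot on a timer would show whether total traced memory keeps climbing between reports.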
      

       

      I see that every bundle, along with all of its data, still sits in memory. Why aren't they gc-ed?

      What is the garbage-collection policy for the Dataflow processes?
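
      One way to chase this down is to ask the gc module which objects still reference a bundle. This is a hedged, self-contained sketch: the Bundle class and `committed` list below are hypothetical stand-ins for apache_beam.runners.direct.bundle_factory._Bundle and whatever structure retains it, not actual Beam code:

      ```python
      import gc

      class Bundle(object):
          """Hypothetical stand-in for the DirectRunner's _Bundle."""
          def __init__(self, data):
              self.data = data

      # A container that keeps a reference to every bundle it ever saw,
      # which is enough to prevent collection.
      committed = []

      def process(data):
          bundle = Bundle(data)
          committed.append(bundle)  # lingering reference
          return len(bundle.data)

      for i in range(100):
          process([i] * 10)

      gc.collect()
      live = [o for o in gc.get_objects() if isinstance(o, Bundle)]
      print(len(live))  # -> 100: every bundle is still reachable

      # gc.get_referrers shows what is holding them: the `committed` list.
      holders = gc.get_referrers(live[0])
      print(any(h is committed for h in holders))  # -> True
      ```

      Running gc.get_referrers against one of the live _Bundle objects in the real process should point at whatever structure in the DirectRunner keeps them pinned.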

      People

        Assignee: Unassigned
        Reporter: yuri krnr
        Votes: 0
        Watchers: 4