Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.90.2
    • Fix Version/s: None
    • Component/s: Thrift
    • Labels:
      None

      Description

      The scannerMap in ThriftServer relies on the user to clean it by closing the scanner. If that doesn't happen, the ResultScanner will stay in the thrift server's memory and if any pre-fetching was done, it will also start accumulating Results (with all their data).

      1. HBASE-3852-0.94.patch
        10 kB
        Jean-Daniel Cryans
      2. thrift2-scanner.patch
        4 kB
        Bob Copeland
      3. 3852.txt
        0.5 kB
        Ted Yu

        Issue Links

          Activity

          Hide
          stack added a comment -

          Good one J-D. How'd you figure it? Did you restart the thrift server? Jack would be interested in this one?

          Show
          stack added a comment - Good one J-D. How'd you figure it? Did you restart the thrift server? Jack would be interested in this one?
          Hide
          Alex Newman added a comment -

          Should we close scanners on disconnect from the thrift server?

          Show
          Alex Newman added a comment - Should we close scanners on disconnect from the thrift server?
          Hide
          Jean-Daniel Cryans added a comment -

          @Alex, I'm not sure I understand what you're saying, by "on disconnect" you mean when client is done with the server and closes the socket? How do you know that? And what if the client was scanning but then never comes back?

          Show
          Jean-Daniel Cryans added a comment - @Alex, I'm not sure I understand what you're saying, by "on disconnect" you mean when client is done with the server and closes the socket? How do you know that? And what if the client was scanning but then never comes back?
          Hide
          Alex Newman added a comment -

          Yes, when the client closes the connection to the thrift server. I'm just trying to understand when you think the socket should be cleaned up.

          Show
          Alex Newman added a comment - Yes, when the client closes the connection to the thrift server. I'm just trying to understand when you think the socket should be cleaned up.
          Hide
          Ted Yu added a comment -

          In scannerGetList():

                      if (null == results) {
                          return new ArrayList<TRowResult>();
                      }
          

          Can we close the scanner before returning ?

          We still leave the scanner in scannerMap.
          HTable.ClientScanner would check whether the scanner has been closed.

          Show
          Ted Yu added a comment - In scannerGetList(): if ( null == results) { return new ArrayList<TRowResult>(); } Can we close the scanner before returning ? We still leave the scanner in scannerMap. HTable.ClientScanner would check whether the scanner has been closed.
          Hide
          Ted Yu added a comment -

          TestThriftServer passes.

          Show
          Ted Yu added a comment - TestThriftServer passes.
          Hide
          Jean-Daniel Cryans added a comment -

          I'm not sure it fixes the problem, since the issue here is about users not getting back to close even tho there's still more data. We patched in something at SU that we've been running for more than a month, but it is specific to our usecase. Anyways, feel free to inspire yourself with those two commits:

          https://github.com/stumbleupon/hbase/commit/7bcc63ee22f7e0218fb7225f387ffc0bd2279f59#src/main/java/org/apache/hadoop/hbase/thrift/ThriftServer.java
          https://github.com/stumbleupon/hbase/commit/3d08cc8ad5abfec11ee0340d5a75a2b308854e17#src/main/java/org/apache/hadoop/hbase/thrift/ThriftServer.java

          Show
          Jean-Daniel Cryans added a comment - I'm not sure it fixes the problem, since the issue here is about users not getting back to close even tho there's still more data. We patched in something at SU that we've been running for more than a month, but it is specific to our usecase. Anyways, feel free to inspire yourself with those two commits: https://github.com/stumbleupon/hbase/commit/7bcc63ee22f7e0218fb7225f387ffc0bd2279f59#src/main/java/org/apache/hadoop/hbase/thrift/ThriftServer.java https://github.com/stumbleupon/hbase/commit/3d08cc8ad5abfec11ee0340d5a75a2b308854e17#src/main/java/org/apache/hadoop/hbase/thrift/ThriftServer.java
          Hide
          Ted Yu added a comment -

          I was thinking about timeout-driven expiration policy.
          SU's implementation looks nice. We can introduce a new parameter (sigh) to control whether ScannerCleaner is instantiated at startup.

          @J-D: do you happen to know how many scanners were closed by ScannerCleaner in the past month ?

          Show
          Ted Yu added a comment - I was thinking about timeout-driven expiration policy. SU's implementation looks nice. We can introduce a new parameter (sigh) to control whether ScannerCleaner is instantiated at startup. @J-D: do you happen to know how many scanners were closed by ScannerCleaner in the past month ?
          Hide
          Jean-Daniel Cryans added a comment -

          do you happen to know how many scanners were closed by ScannerCleaner in the past month ?

          I used to print that out but it was really spammy. Probably tens of thousands.

          Show
          Jean-Daniel Cryans added a comment - do you happen to know how many scanners were closed by ScannerCleaner in the past month ? I used to print that out but it was really spammy. Probably tens of thousands.
          Hide
          Ted Yu added a comment -

          SU has a solution already.

          Show
          Ted Yu added a comment - SU has a solution already.
          Hide
          Ted Yu added a comment -

          Moving out of 0.92 for now.

          Show
          Ted Yu added a comment - Moving out of 0.92 for now.
          Hide
          Bob Copeland added a comment -

          We hit this running thrift2 in production. With a lot of scanners buffering a lot (100) of rows per call, thrift servers would OOME in no time. This patch just periodically expires old ones. I can respin/rework the patch if the approach is sound.

          Show
          Bob Copeland added a comment - We hit this running thrift2 in production. With a lot of scanners buffering a lot (100) of rows per call, thrift servers would OOME in no time. This patch just periodically expires old ones. I can respin/rework the patch if the approach is sound.
          Hide
          Ted Yu added a comment -
          +      timestamp = System.currentTimeMillis();
          

          We have EnvironmentEdgeManager which provides the above service.

          +      for (Integer key: toremove)
          +        removeScanner(key);
          

          Please use curly braces around the removeScanner() call.

          Show
          Ted Yu added a comment - + timestamp = System .currentTimeMillis(); We have EnvironmentEdgeManager which provides the above service. + for ( Integer key: toremove) + removeScanner(key); Please use curly braces around the removeScanner() call.
          Hide
          stack added a comment -

          Patch looks good. Thanks Bob. Suggest picking up the coding convention of surrounding class. Rather than hardcode 60 seconds, maybe use value of hbase.regionserver.lease.period gotten from Configuration instance?

          Show
          stack added a comment - Patch looks good. Thanks Bob. Suggest picking up the coding convention of surrounding class. Rather than hardcode 60 seconds, maybe use value of hbase.regionserver.lease.period gotten from Configuration instance?
          Hide
          Lars Hofhansl added a comment -

          Can we commit this? Or should this jira move to 0.96?

          Show
          Lars Hofhansl added a comment - Can we commit this? Or should this jira move to 0.96?
          Hide
          Lars Hofhansl added a comment -

          Let's try for 0.94.1

          Show
          Lars Hofhansl added a comment - Let's try for 0.94.1
          Hide
          Lars Hofhansl added a comment -

          In the remove case, should we also close the scanner?

          Show
          Lars Hofhansl added a comment - In the remove case, should we also close the scanner?
          Hide
          Lars Hofhansl added a comment -

          Any update on this?

          Show
          Lars Hofhansl added a comment - Any update on this?
          Hide
          Lars Hofhansl added a comment -

          Moving on to 0.94.3.
          Also, since we had for quite a while it does not appear to be critical.

          Show
          Lars Hofhansl added a comment - Moving on to 0.94.3. Also, since we had for quite a while it does not appear to be critical.
          Hide
          Lars Hofhansl added a comment -

          @J-D: Are you still working on this?

          Show
          Lars Hofhansl added a comment - @J-D: Are you still working on this?
          Hide
          Lars Hofhansl added a comment -

          And on to 0.94.4

          Show
          Lars Hofhansl added a comment - And on to 0.94.4
          Hide
          Lars Hofhansl added a comment -

          And on to 0.94.5

          Show
          Lars Hofhansl added a comment - And on to 0.94.5
          Hide
          Lars Hofhansl added a comment -

          This was moved from release to release, with nobody "owning" it.
          I am removing this from 0.94 altogether. Should probably be marked as "Won't fix" (I wish there was a "Nobody cares" state).

          Pull back in if you disagree.

          Show
          Lars Hofhansl added a comment - This was moved from release to release, with nobody "owning" it. I am removing this from 0.94 altogether. Should probably be marked as "Won't fix" (I wish there was a "Nobody cares" state). Pull back in if you disagree.
          Hide
          Lars George added a comment -

          Jean-Daniel Cryans The two commits you posted a while back are now gone. Is that still worth looking at or should I just go ahead with what Bob Copeland attached (plus the minor tweaks suggested)?

          Show
          Lars George added a comment - Jean-Daniel Cryans The two commits you posted a while back are now gone. Is that still worth looking at or should I just go ahead with what Bob Copeland attached (plus the minor tweaks suggested)?
          Hide
          Jean-Daniel Cryans added a comment -

          Ah yeah, that commit it here now: https://github.com/stumbleupon/hbase/commit/985508a185c51b47319541d9d151fcc108a0e590

          It's been running in prod for months except that hbase.thrift.scanner.cleaner.timeout was set down to 180000.

          Want me to put it in good shape for review here?

          Show
          Jean-Daniel Cryans added a comment - Ah yeah, that commit it here now: https://github.com/stumbleupon/hbase/commit/985508a185c51b47319541d9d151fcc108a0e590 It's been running in prod for months except that hbase.thrift.scanner.cleaner.timeout was set down to 180000. Want me to put it in good shape for review here?
          Hide
          Lars George added a comment -

          That would be awesome. Thanks Jean-Daniel Cryans.

          Show
          Lars George added a comment - That would be awesome. Thanks Jean-Daniel Cryans .
          Hide
          Jean-Daniel Cryans added a comment -

          Attaching patch.

          It got a lot messier then I expected but that's because I forgot that that code relies on the new scanner stuff that we used at SU.

          So kind, of like the thrift2 patch, I use a wrapper for the scanners. This is what you store in the scanner map. When you use it you first remove it from the map and then later you put it back with a new timestamp.

          Removing the scanner from the list has 2 nice properties: we don't race when scanner the map for timeouts and if you hit this code in scannerGetList:

          if (null == results) {
            return new ArrayList<TRowResult>();
          }
          

          Then the scanner is already removed from the map so the leak there is fixed too.

          Show
          Jean-Daniel Cryans added a comment - Attaching patch. It got a lot messier then I expected but that's because I forgot that that code relies on the new scanner stuff that we used at SU. So kind, of like the thrift2 patch, I use a wrapper for the scanners. This is what you store in the scanner map. When you use it you first remove it from the map and then later you put it back with a new timestamp. Removing the scanner from the list has 2 nice properties: we don't race when scanner the map for timeouts and if you hit this code in scannerGetList: if ( null == results) { return new ArrayList<TRowResult>(); } Then the scanner is already removed from the map so the leak there is fixed too.
          Hide
          Lars George added a comment -

          Just ran across this in 0.89-fb:

          commit b5f787b2836ce9750087392a4a3092f9a94f7360
          Author: Mikhail Bautin <mbautin@apache.org>
          Date:   Tue Aug 28 21:15:39 2012 +0000
          
              [HBASE-6673] Clear up the scanner in the thrift server whenever these scanners failed to read data
              
              Author: liyintang
              
              Summary:
              We have run into some memory leak problem in the thrift server. It is caused either by the application client never closed the outstanding scanner or the thr
              
              Test Plan: not tested; will tested in dev cluster;
              
              Reviewers: kannan, kranganathan
              
              Reviewed By: kannan
              
              CC: arjen, hbase-eng@, davejwatson
              
              Differential Revision: https://phabricator.fb.com/D541137
              
              git-svn-id: https://svn.apache.org/repos/asf/hbase/branches/0.89-fb@1378350 13f79535-47bb-0310-9956-ffa450edef68
          

          We should check what FB did there as well.

          Show
          Lars George added a comment - Just ran across this in 0.89-fb: commit b5f787b2836ce9750087392a4a3092f9a94f7360 Author: Mikhail Bautin <mbautin@apache.org> Date: Tue Aug 28 21:15:39 2012 +0000 [HBASE-6673] Clear up the scanner in the thrift server whenever these scanners failed to read data Author: liyintang Summary: We have run into some memory leak problem in the thrift server. It is caused either by the application client never closed the outstanding scanner or the thr Test Plan: not tested; will tested in dev cluster; Reviewers: kannan, kranganathan Reviewed By: kannan CC: arjen, hbase-eng@, davejwatson Differential Revision: https://phabricator.fb.com/D541137 git-svn-id: https://svn.apache.org/repos/asf/hbase/branches/0.89-fb@1378350 13f79535-47bb-0310-9956-ffa450edef68 We should check what FB did there as well.
          Hide
          Andrew Purtell added a comment -

          Reported against now unsupported version 0.90. Closing as Wont Fix.

          Show
          Andrew Purtell added a comment - Reported against now unsupported version 0.90. Closing as Wont Fix.

            People

            • Assignee:
              Unassigned
              Reporter:
              Jean-Daniel Cryans
            • Votes:
              1 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development