Bug 53040 - Crash in mod_socache_shmcb due to data misalignment in shared memory
Status: RESOLVED FIXED
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_socache_(dbm|dc|memcache|shmcb)
Version: 2.4.2
Hardware: Sun Solaris
Importance: P2 normal with 3 votes
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
URL:
Keywords: FixedInTrunk
Depends on:
Blocks:
 
Reported: 2012-04-05 16:44 UTC by Wenyan Peng
Modified: 2012-08-21 15:51 UTC



Attachments
Experimental patch of mod_socache_shmcb.c (1.97 KB, patch)
2012-08-08 15:10 UTC, Georg Schaudy
Patch using inter struct padding instead of padding struct members. (3.91 KB, patch)
2012-08-13 20:25 UTC, Rainer Jung

Description Wenyan Peng 2012-04-05 16:44:16 UTC
The problem occurs whether the ssl module is compiled as shared or static. The OpenSSL version used to compile is 1.0.0g.

When I try to reach the SSL virtual hosts after the server has been idle for a while, the browser shows the following:
"
The connection was reset
The connection to the server was reset while the page was loading.

  The site could be temporarily unavailable or too busy. Try again in a few moments.
  If you are unable to load any pages, check your computer's network connection.
  If your computer or network is protected by a firewall or proxy, make sure that Firefox is permitted to access the Web.

Try again.
"
After clicking "Try again" to refresh the page, the page loads.
In error_log we got:

[Thu Apr 05 11:11:20.272279 2012] [core:notice] [pid 25398:tid 1] AH00094: Command line: '/usr/local/apache2.4.1/bin/httpd'
[Thu Apr 05 11:21:00.983966 2012] [core:notice] [pid 25398:tid 1] AH00052: child pid 25400 exit signal Bus error (10)
[Thu Apr 05 11:21:08.054928 2012] [core:notice] [pid 25398:tid 1] AH00052: child pid 25401 exit signal Bus error (10)
Comment 1 Stefan Fritsch 2012-04-06 21:14:04 UTC
A backtrace of a crashing process would be very useful. See http://httpd.apache.org/dev/debugging.html and especially the part about solaris, there.
Comment 2 Wenyan Peng 2012-04-09 13:54:34 UTC
(In reply to comment #1)
> A backtrace of a crashing process would be very useful. See
> http://httpd.apache.org/dev/debugging.html and especially the part about
> solaris, there.

After I commented out SSLSessionCache and SSLSessionCacheTimeout, it became stable.
I will do the backtrace when I get a chance. Thanks.
Comment 3 Jon Hadfield 2012-04-17 15:42:43 UTC
I'm able to reproduce with httpd 2.4.1, mpm worker, OpenSSL 0.9.7d on Solaris (sparc T5220).
Intermittent connection reset messages are returned to the browser after a restart, with the following in error_log:

[Tue Apr 17 16:38:41.033550 2012] [core:notice] [pid 15959:tid 1] AH00051: child pid 16109 exit signal Bus error (10), possible coredump in /usr/local/apache-2.4.1

The ssl module is compiled as static. The error occurs with the shmcb storage type for SSLSessionCache, but not with dbm. I can still reproduce the error with different ssl-cache mutex mechanisms.
Comment 4 Rainer Jung 2012-04-17 22:03:36 UTC
Wild guess: Bus error for shmcb on Solaris Sparc rings a bell: there were alignment problems in shm a few years ago. Those were fixed at that time, but maybe some type of this error is back.

Still, a core dump that can be inspected with gdb would be helpful.
Comment 5 Georg Schaudy 2012-08-08 15:10:35 UTC
Created attachment 29187
Experimental patch of mod_socache_shmcb.c

That guess hits the nail on the head! This is indeed a memory alignment issue on SPARC systems!
See http://blog.jgc.org/2007/04/debugging-solaris-bus-error-caused-by.html for a good description of the basic problem.
We compiled httpd 2.4.2 (32-bit) on SPARC Solaris 10 using gcc, and experienced the "Bus error" issue described above.

It turns out to be an issue with the shmcb implementation (modules/cache/mod_socache_shmcb.c).
The problem is that the SHMCBIndex structure has a member "expires" of type apr_time_t (long long). This member needs to be 8-byte aligned; otherwise accessing it causes a Bus error!
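
To make the alignment requirement concrete, here is a minimal C sketch, not code from mod_socache_shmcb.c. The helpers is_aligned and load_i64 are hypothetical names introduced only for illustration. The first tests whether an address is a multiple of a given boundary; the second reads a 64-bit value through memcpy, which is the portable way to read from an address that may be misaligned. A direct load through an int64_t pointer at such an address is what can raise SIGBUS on SPARC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if p is a multiple of align.
 * align must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: reads a 64-bit value from any address.
 * memcpy copies byte by byte as far as the C language is concerned,
 * so this is safe even when p is not 8-byte aligned. */
static int64_t load_i64(const void *p)
{
    int64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

With these helpers, an address that is 4 bytes past an 8-byte boundary fails is_aligned(p, 8), yet load_i64(p) still returns the stored value.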

However, the SHMCBIndex objects are nested within subcaches, which are preceded by an SHMCBHeader structure.
The memory allocation inside the SHM segment looks more or less like this:

                              [ expires, ...] [ expires, ...]
              [ SHMCBSubcache | SHMCBIndex   ,  SHMCBIndex   , ... | Data ]
[ SHMCBHeader | subcache 0                                                 ,
---------------------------------------------------------------------------->
Alignment:                      ^8 byte         ^8 byte

                 [ expires, ...]
 [ SHMCBSubcache | SHMCBIndex   , ... ]
   subcache 1                          , ... ]
>----------------------------------------------
Alignment:         ^8 byte


Having got this far, I found a fix that is not pretty, but quite simple and unintrusive. On the downside, it depends heavily on the details of the current implementation.
The obvious way to ensure correct alignment of the SHMCBIndex structures seems to be to make sure that the size of every building block is a multiple of 8 bytes.
This is already the case for SHMCBSubcache (size 16) and SHMCBIndex (size 24), but not for SHMCBHeader (size 52).
The size of the subcache's "Data" segment (subcache_data_size) is calculated at startup and depends on the configured SSLSessionCache size. (For the default it is 14180 (14180 % 8 = 4), which means that every other subcache will fail!)
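
The arithmetic above can be checked with a small C sketch. It assumes the layout from the diagram (header, then subcaches laid end to end, each consisting of an SHMCBSubcache, an array of SHMCBIndex, and the data area) and assumes the shared memory segment itself starts on an 8-byte boundary. The function index_offset and the parameter idx_count are hypothetical; the number of indexes per subcache is not given in this report, and it does not affect the result because 24 is a multiple of 8.

```c
#include <stddef.h>

/* Sizes quoted in this report for the 32-bit SPARC build. */
#define HEADER_SIZE        52u     /* SHMCBHeader */
#define SUBCACHE_SIZE      16u     /* SHMCBSubcache */
#define INDEX_SIZE         24u     /* SHMCBIndex */
#define SUBCACHE_DATA_SIZE 14180u  /* default subcache_data_size */

/* Hypothetical helper: byte offset, from the start of the segment,
 * of the first SHMCBIndex in subcache number subcache_no.
 * idx_count (indexes per subcache) is an assumed parameter. */
static size_t index_offset(size_t subcache_no, size_t idx_count)
{
    size_t stride = SUBCACHE_SIZE + idx_count * INDEX_SIZE + SUBCACHE_DATA_SIZE;
    return HEADER_SIZE + subcache_no * stride + SUBCACHE_SIZE;
}
```

Under these assumptions, index_offset(n, idx_count) % 8 alternates between 4 and 0 as n increases: the header contributes a remainder of 4, and each subcache stride adds another 4. That matches the observation that every other subcache fails.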

The attached patch works on our system.
Maybe you can find a more elegant / robust fix for this issue!
Comment 6 Rainer Jung 2012-08-13 20:25:13 UTC
Created attachment 29220
Patch using inter struct padding instead of padding struct members.

Please check the following alternative patch. It does not add padding to the structures; instead it aligns their positions in shm correctly.
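
As a rough illustration of the idea, and not the contents of the attached patch, the following C sketch rounds each building block's size up to the next multiple of 8 before computing offsets. This is the same rounding that APR's APR_ALIGN_DEFAULT macro performs. The names align_up and aligned_index_offset and the parameter idx_count are hypothetical, and the sizes are the ones quoted in comment 5.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rounds n up to the next multiple of boundary.
 * boundary must be a power of two. */
static size_t align_up(size_t n, size_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (n + (boundary - 1)) & ~(boundary - 1);
}

/* Hypothetical helper: offset of the first index in subcache number
 * subcache_no when every block size is rounded up to 8 bytes.
 * idx_count (indexes per subcache) is an assumed parameter. */
static size_t aligned_index_offset(size_t subcache_no, size_t idx_count)
{
    size_t header   = align_up(52, 8);     /* SHMCBHeader: 52 becomes 56 */
    size_t subcache = align_up(16, 8);     /* SHMCBSubcache: stays 16 */
    size_t index    = align_up(24, 8);     /* SHMCBIndex: stays 24 */
    size_t data     = align_up(14180, 8);  /* data area: 14180 becomes 14184 */
    size_t stride   = subcache + idx_count * index + data;
    return header + subcache_no * stride + subcache;
}
```

Because every term is now a multiple of 8, the offset of each index array is a multiple of 8 for all subcaches, provided the segment itself starts on an 8-byte boundary. The struct definitions stay unchanged; only the spacing between them grows by a few bytes.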
Comment 7 Rainer Jung 2012-08-15 09:21:30 UTC
Fixed in trunk in r1373270.
Proposed for backport to 2.4.x.
Comment 8 Georg Schaudy 2012-08-20 10:22:40 UTC

(In reply to comment #7)
Finally had the opportunity to test your latest patch. It works on our system!
Comment 9 Rainer Jung 2012-08-21 15:51:10 UTC
Fixed in 2.4 with r1373439.
Released with version 2.4.3.
Does not apply to 2.2 versions.