No fileevents fire for me. Test script: --test_socket_async.tcl-- set q I set h [open /usr/tmp/test.txt w] fileevent $h writable "set q W" after 1000 "set q T" vwait q close $h puts $q file delete /usr/tmp/test.txt --eof-- This script will output "T" when started in Apache Rivet normal mode and Output "W" when: -started by tclsh8.6 -started by apache single process mode (/usr/sbin/httpd -X) Platform: - Centos 64 bit - TCL 8.6.0 - Rivet 2.1.2rc2
I asked on clt for help: https://groups.google.com/forum/#!topic/comp.lang.tcl/Ja01kv75tVk Thread name: "Async sockets hang under Rivet but work in tclsh 8.6" Start Date: 27.6.2013 Here is the essence: - CentOS 6.2 64 bit - /usr/sbin/httpd -V Server version: Apache/2.2.15 (Unix) Server built: May 13 2013 22:11:16 Server's Module Magic Number: 20051115:25 Server loaded: APR 1.3.9, APR-Util 1.3.9 Compiled using: APR 1.3.9, APR-Util 1.3.9 Architecture: 64-bit Server MPM: Prefork threaded: no forked: yes (variable process count) - Apache was configured to only have one thread (/etc/httpd/conf/httpd.conf: <IfModule prefork.c> StartServers 1 MinSpareServers 1 MaxSpareServers 20 ServerLimit 256 MaxClients 256 MaxRequestsPerChild 4000 </IfModule> ) - Test file "test_file_async.tcl" in "/var/www/html". File contents as in the original bug. - Start http: service httpd start - start strace: ps awux | grep sbin/httpd | awk '{print"-p " $2}' | xargs strace -tt -f -o straceall.txt - the resulting trace file is attached to the bug. Here are the comments by Alexandre Ferieux to the strace log from clt: OK here is what I can see: (1) open the file, and wake up the notifier (by writing a single byte to the pipe) thread so that it takes the new fd into account for its select() mask: > 29969 12:04:52.368364 open("/usr/tmp/test.txt", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 21 > ... > 29969 12:04:52.368806 write(17, "\0", 1) = 1 (2) In the notifier: detect that activity on the control pipe... > 29965 12:04:52.368843 <... select resumed> ) = 1 (in [16]) > 29965 12:04:52.368889 read(16, <unfinished ...> > 29965 12:04:52.368922 <... read resumed> "\0", 1) = 1 (3) ... and normally restart a select with the new fd (21) in the write mask, but here, FAIL TO DO SO: > 29965 12:04:52.368963 select(17, [16], [], [], NULL <unfinished ...> (4) So, sleep stupidly (only monitoring the control pipe) until the timeout expires: > 29962 12:04:53.218193 <... select resumed> ) = 0 (Timeout) I ran your script in vanilla Tcl (which works, as we already know), and verified that the difference is exactly this (ignore the difference in fds): 2297 17:31:15.800767 open("/tmp/test.txt", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 5 2298 17:31:15.803090 <... read resumed> "\0", 1) = 1 2298 17:31:15.803165 select(6, [3], [5], [], NULL) = 1 (out [5]) So here, somehow the communication with the notifier is incomplete: we do get the control signal through ("hey, new mask to recompute"), but the recomputation miserably fails. Next step, I guess, is to look in detail at the data structures that are shared between the caller and notifier. Perhaps the maintainer knows how the Rivet environment is different in that respect ? --- Comment 2 --- > 29965 12:04:52.368963 select(17, [16], [], [], NULL <unfinished ...> > ... > 29962 12:04:53.218193 <... select resumed> ) = 0 (Timeout) Hmmm, just noticed the different thread IDs: here we have a dance of 3 threads ! - 29969 is your main thread - 29965 is the notifier - 29962 is a mysterious other thread that does a 1-second periodic task Apart from the (remote) similarity between this 1-second period and your [after 1000], this does not really match what we see in the script. Definitely something to be explained by a Rivet-savvy person... -Alex
Created attachment 30522 [details] strace log of test file Alexandre Ferrieux based its analysis on this log.
Am Donnerstag, 4. Juli 2013 14:30:27 UTC+2 schrieb Harald Oehlmann: > Am Mittwoch, 3. Juli 2013 13:06:31 UTC+2 schrieb Alexandre Ferrieux: > > > I'd bet the next thing to do is to recompile Tcl with debug symbols, break in gdb in Tcl_CreateFileHandler, and single-step there to see what doesn't pass through. I suppose I have found something. When Apache starts, there are (at least) 2 processes, one owned by root and one owned by user apache: ps -L aux USER PID LWP %CPU NLWP %MEM VSZ RSS TTY STAT START TIME COMMAND root 1997 1997 0.0 2 1.4 393536 14704 ? Tsl 15:21 0:00 /usr/sbin/httpd root 1997 2000 0.0 2 1.4 393536 14704 ? Tsl 15:21 0:00 /usr/sbin/httpd apache 1999 1999 0.0 1 0.7 391472 7292 ? S 15:21 0:00 /usr/sbin/httpd apache 2001 2001 0.0 1 0.9 394648 9724 ? S 15:21 0:00 /usr/sbin/httpd The process 1997 is the controler process, the others are final handler processes. I run two gdb, one attached to 1997, one attached to 2001. The first has 2 threads: (gdb) info thread * 2 Thread 0x7fb15f77b700 (LWP 2000) NotifierThreadProc (clientData=0x0) at /home/admin/test/tcl8.6.0/unix/tclUnixNotfy.c:1186 1 Thread 0x7fb16db9c7e0 (LWP 1997) 0x00007fb16c1d24f3 in select () at ../sysdeps/unix/syscall-template.S:82 The Thread 2 (LWP 2000) runs the TCL notifier function "NotifierThreadProc". The second (ID 2001) has only one thread and runs the tcl interpreter: (gdb) info threads * 1 Thread 0x7fb16db9c7e0 (LWP 2001) 0x00007fb16c1d9f03 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82 In thread 1 of process 2001, the procedure Tcl_CreateFileHandler is called, the file handle to monitor is written in a thread specific data memory area and "" is written to the communication pipe. In thread 2 of process 1997 (LWP2000), in NotifierThreadProc the write to the pipe is recognized by the select and the shared memory is checked. As it is another process, the data does not pass. So, the pipe connection works, but there is no shared memory and thus passing the file descriptor to select on does not work. In addition, the file description of process 2001 would propably not by valid in process 1997, LWP 2000. More in detail, there is a shared variable "waitingListPtr". This is set to a value in one process and remains zero on the other side. Here are some details: *** tsdPtr in procedure Tcl_CreateFileHandler, Process 2001: 523 ThreadSpecificData *tsdPtr = TCL_TSD_INIT(&dataKey); (gdb) print tsdPtr $1 = (ThreadSpecificData *) 0x7fb16eb9f6a0 *** and the memory configuration just before writing to the pipe: Procedure Tcl_WaitForEvent in Process 2001: 907 tsdPtr->prevPtr = 0; 908 waitingListPtr = tsdPtr; (gdb) print &waitingListPtr $5 = (ThreadSpecificData **) 0x7fb16260ee88 (gdb) print waitingListPtr $3 = (ThreadSpecificData *) 0x7fb16eb9f6a0 (gdb) print *waitingListPtr $4 = {firstFileHandlerPtr = 0x7fb16ec8d7b0, checkMasks = {readable = {__fds_bits = {0 <repeats 16 times>}}, writable = {__fds_bits = {1048576, 0 <repeats 15 times>}}, exception = {__fds_bits = {0 <repeats 16 times>}}}, readyMasks = {readable = {__fds_bits = {0 <repeats 16 times>}}, writable = {__fds_bits = {0 <repeats 16 times>}}, exception = {__fds_bits = {0 <repeats 16 times>}}}, numFdBits = 21, onList = 1, pollState = 0, nextPtr = 0x0, prevPtr = 0x0, waitCV = 0x7fb16ec83850, eventReady = 0} 909 tsdPtr->onList = 1; (gdb) print tsdPtr $2 = (ThreadSpecificData *) 0x7fb16eb9f6a0 910 911 if ((write(triggerPipe, "", 1) == -1) && (errno != EAGAIN)) { *** and now the notifier thread in Process 1997, thread 2 (LWP 2000): procedure NotifierThreadProc 1186 Tcl_MutexLock(¬ifierMutex); 1187 for (tsdPtr = waitingListPtr; tsdPtr; tsdPtr = tsdPtr->nextPtr) { (gdb) print &waitingListPtr $3 = (ThreadSpecificData **) 0x7fb16260ee88 (gdb) print waitingListPtr $2 = (ThreadSpecificData *) 0x0 --- The address of waitingListPtr is identical in the two threads but not the data. I suppose, this is due to the fork of the parent thread... --- Well, I suppose, this thread configuration is not as intended. Each tcl worker process should have its notifier thread... Thanks anybody to look into this. Thank you, Harald
great job Harald after your findings the question should be put on TCLCORE. Can you test a modified mod_rivet where Rivet_ChildInit creates an interpreter anew? I think that disregarding initialization scripts issues the test should be preatty easy, just replacing the interpreter inherited from the parent process with a new one just coined by Tcl_CreateInterpreter -- M
Created attachment 30592 [details] reintroducing child specific intepreter to see if the bug with fileevent can in case be cured Harald, please try apply this patch to see if the expected behavior for fileevent is restored. The patch makes void any initialization to the interpreter you may have done in a RivetServerInit directive script, but at least restores in the child process a pristine interpreter which is not a copy obtained by forking the parent process
I am sorry, Moassimo, but the patch does not change anything for me: a) the issue is not cured b) if I look via "ps -L aux" to the thread configuration, this did not change: ps -L aux USER PID LWP %CPU NLWP %MEM VSZ RSS TTY STAT START TIME COMMAND root 19610 19610 0.1 2 1.4 393540 14692 ? Ssl 13:13 0:00 /usr/sbin/httpd root 19610 19613 0.0 2 1.4 393540 14692 ? Ssl 13:13 0:00 /usr/sbin/httpd apache 19612 19612 0.0 1 0.7 391476 7292 ? S 13:13 0:00 /usr/sbin/httpd apache 19614 19614 0.0 1 1.0 393960 10432 ? S 13:13 0:00 /usr/sbin/httpd Thus, the notifier thread still runs one in the root process and not in the worker threads. I have looked a bit, where the thread creation function "Tcl_InitNotifier" is called. It eaven has a stub entry: tclStubInit.c: Tcl_InitNotifier, /* 307 */ Here is the general call: generic\tclNotify.c: procedure: TclInitNotifier() which is called by the procedure in the same file : TclInitSubsystems() This is called by by all of those: tclBasic.c: Tcl_CreateInterp() tclEncoding: Tcl_FindExecutable() The patch proposed to call: Rivet_CreateTclInterp which itself calls: Tcl_FindExecutable() and Tcl_CreateInterp() Thus the thread creation routine should be executed twice. So I tried to first end the notifier for the case, that the count variable "notifierCount" within Tcl_CreateNotifier() is already 1 which would cause that no notifier is created: Rivet_InitTclStuff(server_rec *s, apr_pool_t *p) { rivet_server_conf *rsc = RIVET_SERVER_CONF( s->module_config ); Tcl_FinalizeNotifier(NULL); Tcl_Interp *interp = Rivet_CreateTclInterp(s); ... This removed also the one nothifier thread owned by root but does not create one owned by apache used: In the upper ps -L aux, it removes LWP 19613 I also tried to add an explicite "Tcl_CreateNotifier(): Rivet_InitTclStuff(server_rec *s, apr_pool_t *p) { rivet_server_conf *rsc = RIVET_SERVER_CONF( s->module_config ); Tcl_FinalizeNotifier(NULL); Tcl_Interp *interp = Rivet_CreateTclInterp(s); Tcl_InitNotifier(); but this does not help... Sorry, Harald
I don't know how far we can get doing these tests, but let's try anyway. I don't understand what's wrong with the interpreter created within the child process, I can only surmise there are shared resources not handled consistently with our picture and in order to understand we need at least a way to create a sound interpreter in our child process. So what would be the correct way to release those supposedly conflicting resources? Rivet_InitHandler is the callback where the server master interpreter is created. Perhaps you could try releasing all the resource explicitly before returning OK by adding Tcl_Finalize(); "Tcl_Finalize is similar to Tcl_Exit except that it does not exit from the current process. It is useful for cleaning up when a process is finished using Tcl" (man 3tcl Tcl_Finalize) I suppose it's not essential to delete the interpreter
Massimo, thank you for the comment. Yes this might work. I had another idea: zclUnixNotify.c: Tcl_InitNotifier Modify this function, that it checks, if the Process has changed, e.g. we are in a fork. If so: - close the communication pipe (as we have copies of the file handlers) - restart the notifier thread This would preserve the capability of Rivet to: - first load - then fork which is currently the case and helped to spped-up the boot time of FlightAware around 40 Minuites... Do you think this is feasable? Harald
TCL was modified, that Tcl_InitNotifier also starts the thread if we are forked. Using that, fileevents do fire. An alternate solution I see would to create the Tcl interpreter after the fork, which would vanish advantages by the general init script... Required modifications in Rivet are: mod_rivet.c : Put Tcl_InitNotifier(); somewhere at the beginning of function: Rivet_InitTclStuff Alternate test script is: --test_file_async.tcl-- set q I set h [open /usr/tmp/test.txt w] set idafter [after 100000 "set q T"] fileevent $h writable "set q W;after cancel [after 1000 "set q T"]" vwait q close $h puts $q file delete /usr/tmp/test.txt -EOF Here is the tcl RFE: http://core.tcl.tk/tcl/info/a0bc856dcd Have fun ;-) Harald
Brilliant Harald My patch with Rivet_CreateTclInterp(s) was meant to create the Tcl interpreter anew hoping for the best about all the subsystems involved. It seems it didn't work. I tried it myself also Finalizing Rivet's master interpreter before Rivet_InitHandler exits, but I got a segmentation fault when the first request is served Please, would you share your patch to Tcl? Have you considered about a TIP?
There are some related mails on the rivet dev mailing list: --- Damon Courtney: issue does only show in threadded tcl --- Jeff Lawson: master interpreter preinitialisation is a big time saver: It is definitely important for us at FlightAware to be able to do Tcl interpreter initialization before forking. Our webserver environment uses multiple physical servers each with 300 Apache children processes. During the Tcl global init and before forking, we pre-load several gigabytes of infrequently changed data from our database for performance. The inability to do this pre-loading would cause each child to have to load this data from the database (which causes a massive database overhead for 300 connections to simultaneously request that much data), which indeed adds many minutes of startup time to our daily release cycle. Since our release cycle requires us to fully restart one physical server at a time so that we can continue to serve production traffic on the remaining physical servers, any increase in restart time cascades linearly with the number of physical servers we have. The added database overhead of these 300 Apache children requesting gigabytes of data would also compromise our ability to handle normal database traffic from our other servers while we are restarting one. Additionally, the RAM usage on the webservers is significantly impacted since this static data is now actually allocated and duplicated in each child, rather than being loaded in the parent and cloned via copy-on-write pages to each child. Several gigabyte per child multiplied by 300 processes is a significant memory footprint change for our webservers. --- Massimo about some tests: I got fileevent working again by inserting Tcl_FinalizeNotifier in Rivet_InitHandler and Tcl_InitNotifier in Rivet_InitTclStuff (at the cost of breaking the inheritance of an interpreter from the parent to the child process)
E-Mail by Jan Nijtmans on clt: 2013/7/17 Harald Oehlmann <harald.oehlmann@elmicron.de>: > Q1) Is this use-case reasonable to cover by the tcl core? Yes > Q2) May a wizard please review the code Compare your solution with the solution already present in the OSX-version of the notifier: <http://core.tcl.tk/tcl/artifact/8c82932603?ln=512-521> It's very similar. Looks OK to me, even though I'm not a wizard regarding the UNIX notifier. > Q3) Is this a bug fix or is a TIP required ? Give that the bug only manifests itself on UNIX with --enable-threads, I would say it's a bug. OSX already has a similar solution to what you propose and Windows doesn't have fork at all. Any Tcl application which doesn't use threads should behave the same whether threads are enabled or not. Failing that it's a bug, and the best way to demonstrate that is with a test-case. > Q4) If reasonable, we should also back-port it to tcl8.5 Bugs are normally fixed in core-8-5-branch first, then merged to trunk. I rebased "rfe-notifier-fork" to core-8-5-branch now and added a "testfork" test command. Adding a test-case is all that's left to be approved for the core. The OSX solution using pthread_atfork() disables the OSX notifier during the fork. I don't think that's necessary for the UNIX notifier, but maybe someone else can comment on that.
Fixed in *TCL* core "trunk" and "core-8-5-branch": http://core.tcl.tk/tcl/info/4e3f024bd7 http://core.tcl.tk/tcl/info/0107005970c This fix should be included in upcoming TCL 8.6.1 and 8.5.15 --- If pthread_atfork works (TCL configure reports YES), there is no need to modify Rivet. Otherwise, an additional Tcl_InitNotifier call is required. If an older TCL version is used, one may choose of the following options: - do not use threaded tcl (default for tcl 8.5.x) - do not use fileevents with rivet The init script feature should be well documented not to leave any file handles open. Thank you all, Harald
What could be done in addition: - include the pthread_atfork test from tcl in the configure script - include a Tcl_InitNotifier() if pthread_atfork is not available #if !defined(HAVE_PTHREAD_ATFORK) Tcl_InitNotifier(); #endif This would do it the same way as the tcl core test case which is IMHO a good thing. Thank you, Harald
OK Harald, I guess Rivet_InitTclStuff is place where Tcl_InitNotifier should be called