Since I expect this to be a long post, I’ll give you the answer at the top: both are in-effect critical sections, you should avoid performing unnecessary operations there at all cost. Or else it will cost you and other people days, months, years of wasted time.
POSIX Signals
I’ll start with a story about one of my successful open source projects: Asunder. I took over the project from a nice but busy guy at version 0.1. Most of it was written (pretty well too), and I’ve been fixing things here and there, improving it one bit at a time.
One of the things I added was logging, to make sure that I can fix problems experienced by others in their own environments. It was very useful (I now very rarely get bug reports) but I made one mistake: I added an fprintf in a SIGCHLD signal handler. It took me literally years to figure out that was a terrible mistake. For at least two years I kept getting bug reports about random, unrelated freezes and the log never provided any answers. This is what was happening:
- The app was running, starting sub-processes and waiting for them to complete.
- When a sub-process completed – it sent a SIGCHLD to the parent. That signal was handled in a signal handler, which interrupted whatever code was currently running in the parent.
- The above is expected and rarely a problem, except it turned out that the printf function makes some kind of global lock while it’s doing it work. So when:
- A signal handler itself was interrupted by another signal, the printf in the new signal handler was waiting for the old printf to complete, which would never happen because the original printf was interrupted by the new one.
When I figured that out I cried a little in my mind. But I fixed it and took it as a good addition to my industry experience.
MediaWiki
Fast forward a few years where I maintain a MediaWiki instance for our school. I migrated it to a new server, updated the PHP to the newest version, updated the database, etc. All worked well.
But then the semester started and new students were trying to register to get accounts on the wiki. And disaster struck. It turned out that new users could not register. To make thing worse – existing users couldn’t get their passwords changed or reset. Right in the beginning of the semester. When I did the migration I tested everything, but have not considered that operations on the user table were in any way special. Turns out they’re not, except they are. Here’s what was happening:
- The first person since the servers were rebooted tried to register for an account. The web interface would just hang there, with the spinning circle until the end of time. Not time outs or error messages.
- MediaWiki started an SQL transaction on the MySQL backend. To record that a user is being created.
- Before committing the said SQL transaction – MediaWiki would attempt to send an email to the new user via some PEAR library, via the server configured in $wgSMTP.
- $wgSMTP was not configured correctly, and the step above never completed.
- Which means the SQL transaction was never committed.
- Which means the users table remained locked, permanently.
I spent so much time (including three overnighters) figuring this out! I ended up nearly desperate, asking for help on the MediaWiki-l mailing list. One guy (Brian Wolff) replied saying he doesn’t know what the problem is but he offered what turned out to be the straw I needed ti figure it out myself: enabling the MediaWiki debug log. I had a bunch of logging enabled already, but this is the one that showed me the deadlock.
Before that, I would stare at MySQL’s “SHOW FULL PROCESSLIST” and wonder how it’s possible that even though no queries were being executed – new ones would result in a timeout like this:
MySQL [cdotwiki_db]> SELECT user_id FROM `mw_user` WHERE user_name = 'Asmith20' LIMIT 1 LOCK IN SHARE MODE ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transactio
I would look at the output of “SHOW ENGINE INNODB STATUS;” and wonder why there are multiple transactions there that have been sitting for hours but not causing a deadlock, even though it looked like a deadlock. I spent hours trying to decipher memory dumps like this:
---TRANSACTION 4748, ACTIVE 62 sec 8 lock struct(s), heap size 1136, 4 row lock(s), undo log entries 8 MySQL thread id 1345, OS thread handle 140329361524480, query id 20771 web-cdot1.sparc 10.7.0.45 cdotwiki_usr cleaning up Trx read view will not see trx with id >= 4742, sees < 4742 TABLE LOCK table `cdotwiki_db`.`mw_user` trx id 4748 lock mode IS RECORD LOCKS space id 70 page no 157 n bits 624 index user_name of table `cdotwiki_db`.`mw_user` trx id 4748 lock mode S locks gap before rec Record lock, heap no 244 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 0: len 6; hex 41736f583139; asc AsoX19;; 1: len 4; hex 000001ac; asc ;; Record lock, heap no 558 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 0: len 8; hex 41736d6974683230; asc Asmith20;; 1: len 4; hex 000036c2; asc 6 ;; TABLE LOCK table `cdotwiki_db`.`mw_user` trx id 4748 lock mode IX RECORD LOCKS space id 70 page no 436 n bits 112 index PRIMARY of table `cdotwiki_db`.`mw_user` trx id 4748 lock_mode X locks rec but not gap Record lock, heap no 42 PHYSICAL RECORD: n_fields 17; compact format; info bits 0 0: len 4; hex 000036c2; asc 6 ;; 1: len 6; hex 00000000128c; asc ;; 2: len 7; hex 21000001362118; asc ! 6! ;; 3: len 8; hex 41736d6974683230; asc Asmith20;; 4: len 0; hex ; asc ;; 5: len 30; hex 3a70626b6466323a7368613235363a31303030303a3132383a2f66545962; asc :pbkdf2:sha256:10000:128:/fTYb; (total 222 bytes); 6: len 0; hex ; asc ;; 7: len 21; hex 61736d6974683230406c6974746c657376722e6361; asc asmith20 at littlesvr.ca;; 8: len 14; hex 3230313630393135303530303137; asc 20160915050017;; 9: len 30; hex 623061346535323762613365336462656133323035633666343564663163; asc b0a4e527ba3e3dbea3205c6f45df1c; (total 32 bytes); 10: SQL NULL; 11: len 30; hex 396561386335613365663263623666353062303736646165393934393331; asc 9ea8c5a3ef2cb6f50b076dae994931; (total 32 bytes); 12: len 14; hex 3230313630393232303530303130; asc 20160922050010;; 13: len 14; hex 3230313630393135303530303130; asc 20160915050010;; 14: SQL NULL; 15: len 4; hex 80000000; asc ;; 16: SQL NULL; TABLE LOCK table `cdotwiki_db`.`mw_watchlist` trx id 4748 lock mode IX RECORD LOCKS space id 70 page no 157 n bits 624 index user_name of table `cdotwiki_db`.`mw_user` trx id 4748 lock_mode X locks rec but not gap Record lock, heap no 558 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 0: len 8; hex 41736d6974683230; asc Asmith20;; 1: len 4; hex 000036c2; asc 6 ;; TABLE LOCK table `cdotwiki_db`.`mw_logging` trx id 4748 lock mode IX TABLE LOCK table `cdotwiki_db`.`mw_recentchanges` trx id 4748 lock mode IX
and getting no closer to figuring it out. In the end – how I found the problem was a single log line on a debug instance of the server. What an adventure!
Conclusion
The bug in Asunder happened because I ignored the warnings in the glibc manual that told me to keep unnecessary code out of signal handlers. I did not at the time know (or even consider) that printf could lock some kind of global structure, which could eventually cause a deadlock.
The bug in MediaWiki happened for the same reason, except they ignored the MySQL manual: “keep transactions that insert or update data small enough that they do not stay open for long periods of time”. I’m sure their code is a lot more complicated, but at the end of the day – they are sending an email in the middle of an SQL transaction, which is just a disaster waiting to happen. There’s no way I’m the only one who ran into this problem.
I’ll report the bug and we’ll see if they take it as seriously as I took my Asunder bug.