Heartbeat timeout in Windows does not lead to timely reconnect

Review Request #4383 - Created March 16, 2012 and updated

Cliff Jansen
QPID-3759
Reviewers
qpid
astitcher, chug, shuston, tross
qpid
The cause of the hang was an outstanding read-side completion while the AsynchIO object in charge of the socket was in the queuedClose state.

The completion handler drains outstanding async requests before closing the socket. Since the cable had been pulled, the async read would not complete until Windows gave up on the socket altogether (some time much later).

This patch remembers the last aio read and cancels it, if the AsynchIO object is in the queuedClose state, before blocking again.
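As an illustration only, the "remember the last read, cancel it when closing" idea can be modelled in portable C++ as below. This is a sketch under stated assumptions, not the code in the patch: PendingRead, ReadTracker and all of their members are hypothetical names, and the cancel() method merely stands in for whatever cancellation call the real code uses (on Windows something like CancelIoEx could play that role, but the description above does not name the API).

```cpp
#include <cassert>

// Hypothetical stand-in for one outstanding asynchronous read.
struct PendingRead {
    bool cancelled = false;
    // Stand-in for the real cancellation call (an assumption of this sketch).
    void cancel() { cancelled = true; }
};

// Hypothetical model of the bookkeeping described above.
class ReadTracker {
public:
    // Remember the read that was just issued.
    void readStarted(PendingRead& read) {
        // Assumption of this sketch: at most one read is outstanding at a time.
        assert(lastRead == nullptr);
        lastRead = &read;
    }

    // Forget the read once its completion has been delivered.
    void readCompleted() { lastRead = nullptr; }

    // Mirrors entering the queuedClose state.
    void queueClose() { queuedClose = true; }

    // Call before blocking to wait for outstanding operations to drain.
    // Returns true if a remembered read was cancelled.
    bool cancelReadIfClosing() {
        if (!queuedClose || lastRead == nullptr)
            return false;
        lastRead->cancel();
        lastRead = nullptr;
        return true;
    }

private:
    PendingRead* lastRead = nullptr;
    bool queuedClose = false;
};
```

With this shape, a read that can never complete (for example after a cable pull) no longer holds up the drain: the closing side cancels it explicitly instead of waiting for the operating system to time out the socket.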



Beyond the fix described in the Jira, I also removed an unused test for restartRead. Removing it doesn't change the logic of the section, but the test may indicate an intention that wasn't fully coded, or something left over from a previous change.
qpid-perftest, qpid-send, qpid-receive, cable pulls, broker pause/resumes
Review request changed
Updated (April 19, 2012, 6:52 a.m.)
Load tests over a period of time reveal a threading bug when closing a connection.

The test for opsInProgress == 0 and for the queuedDelete and queuedClose states occurs outside the lock. If an IO thread is suspended right after releasing the lock (with opsInProgress == 1) and resumes some time later, after another IO thread has decremented opsInProgress to zero, both threads will conclude that they are the last IO completion. This results variously in double deletes of the underlying socket or of the AsynchIO object itself.

This patch moves the test inside the lock.

It also uses the same lock to protect the setting of either queuedDelete or queuedClose and the handoff (if any) to the IO thread. This has the effect of adding two additional lock acquisitions over the life of the connection, but should have no effect on throughput or latency.
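The locking change can be sketched roughly as follows. This is a minimal model, assuming a plain std::mutex and hypothetical names (ClosableIo, queueForClose, completeOp); only opsInProgress and queuedClose are taken from the description above, and the real AsynchIO code is structured differently. The point it tries to show is that the "am I the last completion?" decision is made while the lock is still held, so exactly one caller can observe the count reaching zero.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical model of the close handshake; not the actual AsynchIO class.
class ClosableIo {
public:
    explicit ClosableIo(int outstandingOps) : opsInProgress(outstandingOps) {}

    // Request a close. The flag is set under the same lock as the counter.
    // Returns true if nothing is outstanding, i.e. the caller must arrange
    // the close itself (the "handoff" case) because no completion will.
    bool queueForClose() {
        std::lock_guard<std::mutex> guard(lock);
        queuedClose = true;
        return opsInProgress == 0;
    }

    // Called by an IO thread when one async operation completes.
    // Returns true only for the single caller that must perform the close.
    bool completeOp() {
        std::lock_guard<std::mutex> guard(lock);
        assert(opsInProgress > 0);
        --opsInProgress;
        // Decided inside the lock: a second thread cannot also see zero
        // here, because it has to decrement from a different value first.
        return opsInProgress == 0 && queuedClose;
    }

private:
    std::mutex lock;
    int opsInProgress;
    bool queuedClose = false;
};
```

Had completeOp() released the lock before evaluating opsInProgress == 0, a thread suspended between the decrement and the test could later read a zero produced by another thread, which is the double-delete scenario described above.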
Ship it!
Posted (April 20, 2012, 1:50 p.m.)
I spun up VS2008 and VS2010, x86 and x64, debug and release versions of the C++ and .NET binding tools and ran tens of thousands of these executables against each other with no problems. Tests built with earlier versions of the patches on this review (on 64-bit Server 2008 R2 Datacenter) usually showed some executable failures before this many executions.