Wait/synchronize redone
The 1.0 DRMAA spec wasn’t completely clear on multithreaded behavior, so I went to the drmaa-wg mailing list to ask a couple questions:
- What happens if two threads try to wait simultaneously for the same job to complete? Do they both get the job info data back, or does the earlier call get the data while the later call gets an invalid job error? (Answer: only one call gets the data.)
- What about synchronize? If one thread is waiting on a job, and another thread is synchronizing on a bunch of jobs including that one, and the first thread gets the job info back, should the synchronize call get an error? (Answer: no error in this case. Since synchronize doesn’t get data back, it’s fine as long as the job finished.)
These are edge cases, things that would probably never happen in the real world (you’d have to be a bit crazy to be querying about the same job from a whole bunch of threads), but it should still be done the Right Way.
The right way to do this is to maintain a queue of all the calls that have come in from different threads, so that the ordering of the calls is a known quantity when the KVO observer is notified of a job’s state change. Without the queue, you can’t be sure which thread you’re supposed to wake up.
When a notification comes in from Xgrid that a job has finished, the observer method does the following to the call queue:
- Find the first call in the queue, if any, that reaps the job info: this could be a wait call on the specific job id, a wait(any) call, or a synchronize call with dispose=true. If one is found, reap the job info, and notify the calling thread if it’s a wait call.
- Find all subsequent wait calls (not including wait(any) calls) that are waiting on this job. Set an error on each, and wake their threads up as well.
- Look through all the synchronize calls to see if any care about this job. Remove this job from their list of jobs to monitor. If they have no more jobs to monitor, wake their threads up.
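To make the queue handling above concrete, here is a minimal single-threaded sketch in Python. It is not the actual implementation, just a model of the bookkeeping under some assumptions: the names `Call`, `ANY`, `reaps` and `job_finished` are all made up for illustration, "waking a thread" is reduced to setting a flag, and the error is a placeholder string rather than a real DRMAA error code.

```python
from dataclasses import dataclass
from typing import Optional

ANY = "ANY"  # hypothetical stand-in for the "wait for any job" id


@dataclass
class Call:
    """One pending wait or synchronize call, in arrival order in the queue."""
    kind: str                      # "wait" or "synchronize"
    jobs: set                      # job ids of interest ({ANY} for wait(any))
    dispose: bool = False          # only meaningful for synchronize
    result: Optional[dict] = None  # job info handed to a wait call
    error: Optional[str] = None    # placeholder error text
    woken: bool = False

    def wake(self):
        # A real implementation would signal the blocked thread here.
        self.woken = True


def reaps(call, job_id):
    """True if this call would consume the job info for job_id."""
    if call.kind == "wait":
        return job_id in call.jobs or ANY in call.jobs
    return call.dispose and job_id in call.jobs


def job_finished(queue, job_id, job_info):
    """Update the call queue when job_id finishes; return the reaping call."""
    # Step 1: the first call that reaps the job info, if any.
    idx = next((i for i, c in enumerate(queue) if reaps(c, job_id)), None)
    reaper = queue[idx] if idx is not None else None
    if reaper is not None and reaper.kind == "wait":
        reaper.result = job_info
        reaper.wake()

    # Step 2: later wait calls on this specific job get an error.
    # wait(any) calls are skipped: their job set holds ANY, not job_id.
    if idx is not None:
        for c in queue[idx + 1:]:
            if c.kind == "wait" and job_id in c.jobs:
                c.error = "invalid job"  # placeholder, not a real error code
                c.wake()

    # Step 3: every synchronize call drops the job; wake it when none remain.
    for c in queue:
        if c.kind == "synchronize" and job_id in c.jobs:
            c.jobs.discard(job_id)
            if not c.jobs:
                c.wake()

    # Calls that have been woken are done and leave the queue.
    queue[:] = [c for c in queue if not c.woken]
    return reaper


# Two threads wait on job "1"; a third synchronizes on jobs "1" and "2".
first = Call("wait", {"1"})
second = Call("wait", {"1"})
sync = Call("synchronize", {"1", "2"})
pending = [first, second, sync]
job_finished(pending, "1", {"exit_status": 0})
```

After the call at the bottom, the first wait call holds the job info, the second has been woken with an error, and the synchronize call stays in the queue until job "2" finishes. That matches the answers from the mailing list: only one wait gets the data, and synchronize does not error as long as the job finished.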
This is now implemented. Not tested heavily, but it works for my simple single-threaded tests. (Yes, that’s a very bad way to test multithreaded behavior. More tests to come. :))