Archive for June, 2006

Almost there…

Wednesday, June 28th, 2006

I’m very close to a full DRMAA implementation for Xgrid (still just in Objective-C), or as full an implementation as is currently possible with Xgrid. The only major missing feature right now is bulk jobs.

The biggest hurdle has been the fact that Xgrid doesn’t support a number of things needed by the specification. The most important of those are: (1) setting the working directory, and (2) actually getting useful information about job execution, exit status, etc.

The only way I saw to do this was to wrap each and every Xgrid job in a proxy executable, xgrid_drmaa_proxy. This proxy sets the environment, arguments, and stdin for the command being run; runs it; and retrieves resource usage data using the wait4 system call.

Some interesting and frustrating things I learned along the way:

  • I knew that NSTask is a great class for running other processes. Makes things so easy. But you can’t use wait4() on that process to get usage info. Apparently NSTask is doing funny things on another thread that interfere.
  • The combination of fork(), dup2(), execve() and wait() is very powerful, as long as you remember the following: (1) close one end of each of the redirected pipes; and (2) manually set argv[0] to contain the launch path.
  • Running an NSRunLoop recursively from something called back by running the run loop works, until you start dealing with finicky networking code to download files from Xgrid. Re-trying calls with -[NSObject performSelector:withObject:afterDelay:] is far more effective. I plan to switch all my recursive running of run loops to this model (or, if easier, condition-waits with NSConditionLock).
  • My biggest annoyance: XgridFoundation will accept @"YES" and @"NO" as values for whether a submitted file is executable or not, but not, say, [NSNumber numberWithBool:YES]. That’s stupid. Consider this the first (second?) in a long series of rants (and bug reports to Apple) about XgridFoundation. This one took me a *long* time (and a trip to Charles’s GridStuffer source code) to figure out.
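For reference, here is a minimal C sketch of the fork()/dup2()/execve()/wait4() pattern from the list above. It is not the actual xgrid_drmaa_proxy source: the function name run_with_usage and its parameters are made up for illustration, it only redirects stdin, and error handling is kept to the bare minimum.

```c
#define _DEFAULT_SOURCE /* exposes wait4() on glibc under strict -std modes */
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: run `path` with `argv`, feed `input` (may be NULL) to
 * its stdin, and report the exit status and resource usage.
 * The caller must set argv[0] itself, typically to the launch path.
 * Returns 0 on success, -1 on error. */
int run_with_usage(const char *path, char *const argv[], const char *input,
                   int *exit_status, struct rusage *usage)
{
    int in_pipe[2];
    if (pipe(in_pipe) == -1) return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(in_pipe[0]);
        close(in_pipe[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: the read end becomes stdin, then both pipe fds are closed. */
        if (dup2(in_pipe[0], STDIN_FILENO) == -1) _exit(126);
        close(in_pipe[0]);
        close(in_pipe[1]);
        execve(path, argv, environ);
        _exit(127); /* only reached if execve() failed */
    }

    /* Parent: close the end it does not use, write the input, then close
     * the write end so the child sees end-of-file on stdin. */
    close(in_pipe[0]);
    if (input) {
        size_t left = strlen(input);
        while (left > 0) {
            ssize_t n = write(in_pipe[1], input, left);
            if (n <= 0) break;
            input += n;
            left -= (size_t)n;
        }
    }
    close(in_pipe[1]);

    /* wait4() is like waitpid(), but also fills in a struct rusage. */
    int status = 0;
    if (wait4(pid, &status, 0, usage) == -1) return -1;
    *exit_status = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return 0;
}
```

Calling it with an argv of { "/bin/sh", "-c", "exit 3", NULL } should report an exit status of 3, with the CPU times available in usage->ru_utime and usage->ru_stime. One caveat of the sketch: if the child exits without reading its stdin, the write can raise SIGPIPE in the parent, so a real proxy would want to ignore or handle that signal.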

I’m going to take a break from this until Monday—work on some eco-stuff. Come Monday, bulk jobs, C bindings, and Java bindings will be the only things left (aside from a few detailed loose ends). Hopefully a release with installer early next week!

Wait/synchronize redone

Tuesday, June 27th, 2006

The 1.0 DRMAA spec wasn’t completely clear on multithreaded behavior, so I went to the drmaa-wg mailing list to ask a couple questions:

  • What happens if two threads try to wait simultaneously for the same job to complete? Do they both get the job info data back, or does the earlier call get the data while the later call gets an invalid job error? (Answer: only one call gets the data.)
  • What about synchronize? If one thread is waiting on a job, and another thread is synchronizing on a bunch of jobs including that one, and the first thread gets the job info back, should the synchronize call get an error? (Answer: no error in this case. Since synchronize doesn’t get data back, it’s fine as long as the job finished.)

These are edge cases, things that would probably never happen in the real world (you’d have to be a bit crazy to be querying about the same job from a whole bunch of threads), but it should still be done the Right Way.

The right way to do this is to maintain a queue of all the calls that have come in from different threads, so that the ordering of the calls is a known quantity when the KVO observer gets notification about the state change of a job—without the queue, you can’t be sure which thread you’re supposed to wake up.

When a notification comes in from Xgrid that a job has finished, the observer method does the following to the call queue:

  1. Find the first call in the queue, if any, that reaps the job info: this could be a wait call on the specific job id, a wait(any) call, or a synchronize call with dispose=true.
  2. If one is found, reap the job info, and notify the calling thread if it’s a wait call.
  3. Find all subsequent wait calls (not including wait(any) calls) that are waiting on this job. Set an error, and wake their threads up as well.
  4. Look through all the synchronize calls to see if any care about this job. Remove this job from their list of jobs to monitor. If they have no more jobs to monitor, wake their threads up.
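The four steps above can be sketched in plain C. This is only an illustration of the queue logic, not the XgridDRMAA code: the struct, enum, and function names are invented here, "waking a thread" is reduced to setting a result field, and I am assuming a dispose=true synchronize call only reaps jobs that are on its own list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum call_kind   { CALL_WAIT, CALL_WAIT_ANY, CALL_SYNCHRONIZE };
enum call_result { CALL_PENDING = 0, CALL_GOT_INFO, CALL_INVALID_JOB, CALL_SYNC_DONE };

#define MAX_SYNC_JOBS 8

/* One blocked DRMAA call; the queue is an array in arrival order. */
struct pending_call {
    enum call_kind   kind;
    const char      *job_id;                   /* CALL_WAIT only */
    const char      *sync_jobs[MAX_SYNC_JOBS]; /* CALL_SYNCHRONIZE: jobs still monitored */
    size_t           sync_count;
    bool             dispose;                  /* CALL_SYNCHRONIZE only */
    enum call_result result;                   /* anything but CALL_PENDING means "wake up" */
};

static bool sync_includes(const struct pending_call *c, const char *job_id)
{
    for (size_t i = 0; i < c->sync_count; i++)
        if (strcmp(c->sync_jobs[i], job_id) == 0) return true;
    return false;
}

static bool sync_remove(struct pending_call *c, const char *job_id)
{
    for (size_t i = 0; i < c->sync_count; i++) {
        if (strcmp(c->sync_jobs[i], job_id) == 0) {
            c->sync_jobs[i] = c->sync_jobs[--c->sync_count];
            return true;
        }
    }
    return false;
}

/* Called when job `job_id` finishes. Returns true if some call reaped the job info. */
bool job_finished(struct pending_call *q, size_t n, const char *job_id)
{
    /* Step 1: find the first pending call that reaps the job info. */
    size_t reaper = n;
    for (size_t i = 0; i < n && reaper == n; i++) {
        if (q[i].result != CALL_PENDING) continue;
        bool reaps =
            (q[i].kind == CALL_WAIT && strcmp(q[i].job_id, job_id) == 0) ||
            q[i].kind == CALL_WAIT_ANY ||
            (q[i].kind == CALL_SYNCHRONIZE && q[i].dispose && sync_includes(&q[i], job_id));
        if (reaps) reaper = i;
    }

    if (reaper < n) {
        /* Step 2: hand the info to the reaper if it is a wait call. */
        if (q[reaper].kind != CALL_SYNCHRONIZE)
            q[reaper].result = CALL_GOT_INFO;

        /* Step 3: later waits on this specific job get an error. */
        for (size_t i = reaper + 1; i < n; i++) {
            if (q[i].result == CALL_PENDING && q[i].kind == CALL_WAIT &&
                strcmp(q[i].job_id, job_id) == 0)
                q[i].result = CALL_INVALID_JOB;
        }
    }

    /* Step 4: synchronize calls drop the job and wake once their list is empty. */
    for (size_t i = 0; i < n; i++) {
        if (q[i].kind != CALL_SYNCHRONIZE || q[i].result != CALL_PENDING) continue;
        if (sync_remove(&q[i], job_id) && q[i].sync_count == 0)
            q[i].result = CALL_SYNC_DONE;
    }
    return reaper < n;
}
```

With a queue of two waits on job "7", a wait(any), and a synchronize on jobs "7" and "8", finishing job "7" gives the first wait the info and the second an invalid-job error, and leaves the other two calls blocked. Finishing job "8" afterwards satisfies both of them.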

This is now implemented. Not tested heavily, but it works for my simple single-threaded tests. (Yes, that’s a very bad way to test multithreaded behavior. More tests to come. :))

PrefPane complete

Monday, June 26th, 2006

I just put together that preference pane for controlling XgridDRMAA grid selection. It’s pretty straightforward: it lets you browse for a grid advertised via Zeroconf/Bonjour, or specify a hostname/IP address, or just choose to use the standard Xgrid environment variables, which requires you to set them up elsewhere (in your shell configuration for command-line programs; in ~/.MacOSX/environment.plist for Mac OS X-environment programs). It looks like this:



If you click the top radio button, a simple network browser shows up in a sheet attached to the System Preferences window. It supports not just the usual local. domain browsing, but also any other available domains for those with wide-area Bonjour configured. The picture:



If you click the second radio button, you get this simple form:



Either way, when you click Continue, it tries to connect to the server. If authentication succeeds, then we don’t need a password, and we go straight to grid selection. If it fails, then you have to set authentication information (this one might look familiar):



Finally, you select a grid. The default grid is in bold:



When you click Finish, your settings are magically saved in the user defaults database. Except the password, that is: if you’re using password authentication, the password gets saved to the user keychain, so it’s not sitting on the hard drive in cleartext.

The settings are now reflected in a little text below the radio button:



Now, whenever an application starts an XgridDRMAA session, it takes these settings, which are set in NSGlobalDomain. They can all be overridden on a per-application basis, of course, using the standard NSUserDefaults API. But most DRMAA apps won’t even need to know they’re connecting to Xgrid. Or running on a Mac. That’s the idea, anyway.

Job waiting

Thursday, June 22nd, 2006

Now in place: waiting for jobs to complete.

How it works: after a job starts, it’s added to a list of jobs being monitored, and is observed via Cocoa Key-Value Observing for its state key path. An NSConditionLock object is also created for each job being monitored, and when the state change is observed, the condition is set to correspond to the new state.

When the client code calls -[DRMAASession waitForJobId:timeout:error:], it simply tries to obtain a lock on the job with the XgridDRMAAJobConditionDoneOrFailed condition (using NSConditionLock’s timeout locking methods if a timeout is specified). Once the lock is obtained, it asks the Xgrid thread for a job info object, which right now consists of basically no data because I haven’t implemented an executable wrapper to actually collect the data. (Xgrid doesn’t collect it on its own.)

Job status & control

Thursday, June 22nd, 2006

Today: I implemented the job status and control functions in DRMAA. Pretty straightforward: for status, I map the Xgrid status to a corresponding DRMAA value (this mapping having been run by the xgrid-users list). For control, I just grab the XGJob object corresponding to the job and send it the right message.

I had to leave out two DRMAA control options: hold and release. Apparently “hold” means “leave this job in the queue, but don’t allow it to run yet”; “release” puts it into a runnable state. Xgrid doesn’t support this. (Similarly, I previously had to ignore the job submission state attribute—on other systems, you can have a job start out in the “hold” state.)
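As a rough illustration of the status mapping and the unsupported control actions, here is a small C sketch. The DRMAA_* values are the ones I believe the DRMAA 1.0 C binding defines in drmaa.h (check the header before relying on them). Everything prefixed XG_ is a made-up stand-in for the Xgrid side, and the mapping itself is an assumption, not the one that was actually vetted on xgrid-users.

```c
/* Subset of DRMAA job states and control actions (values assumed from drmaa.h). */
enum {
    DRMAA_PS_UNDETERMINED   = 0x00,
    DRMAA_PS_QUEUED_ACTIVE  = 0x10,
    DRMAA_PS_RUNNING        = 0x20,
    DRMAA_PS_USER_SUSPENDED = 0x22,
    DRMAA_PS_DONE           = 0x30,
    DRMAA_PS_FAILED         = 0x40
};

enum {
    DRMAA_CONTROL_SUSPEND   = 0,
    DRMAA_CONTROL_RESUME    = 1,
    DRMAA_CONTROL_HOLD      = 2,
    DRMAA_CONTROL_RELEASE   = 3,
    DRMAA_CONTROL_TERMINATE = 4
};

/* Hypothetical stand-ins for Xgrid job states and job actions. */
enum xg_state {
    XG_STATE_PENDING, XG_STATE_STARTING, XG_STATE_RUNNING, XG_STATE_SUSPENDED,
    XG_STATE_CANCELED, XG_STATE_FAILED, XG_STATE_FINISHED, XG_STATE_OTHER
};

enum xg_action {
    XG_ACTION_UNSUPPORTED = -1,
    XG_ACTION_SUSPEND,
    XG_ACTION_RESUME,
    XG_ACTION_STOP
};

/* Status: translate an Xgrid-side state into a DRMAA job state. */
int drmaa_state_for_xgrid(enum xg_state state)
{
    switch (state) {
    case XG_STATE_PENDING:   return DRMAA_PS_QUEUED_ACTIVE;
    case XG_STATE_STARTING:
    case XG_STATE_RUNNING:   return DRMAA_PS_RUNNING;
    case XG_STATE_SUSPENDED: return DRMAA_PS_USER_SUSPENDED;
    case XG_STATE_FINISHED:  return DRMAA_PS_DONE;
    case XG_STATE_CANCELED:
    case XG_STATE_FAILED:    return DRMAA_PS_FAILED;
    default:                 return DRMAA_PS_UNDETERMINED;
    }
}

/* Control: hold and release have no Xgrid counterpart, so they are rejected. */
enum xg_action xgrid_action_for_control(int drmaa_control)
{
    switch (drmaa_control) {
    case DRMAA_CONTROL_SUSPEND:   return XG_ACTION_SUSPEND;
    case DRMAA_CONTROL_RESUME:    return XG_ACTION_RESUME;
    case DRMAA_CONTROL_TERMINATE: return XG_ACTION_STOP;
    case DRMAA_CONTROL_HOLD:
    case DRMAA_CONTROL_RELEASE:
    default:                      return XG_ACTION_UNSUPPORTED;
    }
}
```

A caller would turn XG_ACTION_UNSUPPORTED into whatever error the binding uses for an unsupported operation, and otherwise send the corresponding message to the job object.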

The one trickiness with today’s code is that jobs don’t show up immediately in the grid’s jobs list, even after the XGActionMonitor for monitoring submission indicates success. So I added some code in -[XgridDRMAASession submitXgridJobWithSpecification:] to wait for the job’s presence in the grid’s list before returning. That way, any subsequent calls regarding that job are guaranteed to find it as long as it hasn’t been removed from the grid.

Side note: as required by DRMAA, all the methods are thread-safe. So far, this just means putting @synchronized(self) { } around the bodies of all the methods; all the Xgrid stuff is handled on a dedicated Xgrid thread.

Submitting jobs

Tuesday, June 20th, 2006

Two pieces of good news today.

First, the honorable Charlotte W. Woolard deemed my predicament of having multiple large software projects, a Monday Summer of Code check-in deadline, and a San Francisco move-out date of August 1 worthy of getting excused from a jury.

Which made possible the second piece: job submission, at least in its basic form, is working. Jobs with arguments—but not with stdin or environment settings—can be successfully submitted via XgridDRMAA.

It wasn’t too complicated. In short, this is what happens:

  1. The client code constructs a job template, as per the DRMAA spec, which includes details of the job like what command to run, what arguments to pass the command, etc.
  2. The client code then calls -[DRMAASession runJobWithJobTemplate:error:].
  3. The runJobWithJobTemplate:error: method does the fancy work. It first asks the job template instance (which is actually an instance of the subclass XgridDRMAAJobTemplate) for an Xgrid-style job specification dictionary. It passes this on, via Distributed Objects, to a helper method.
  4. The helper method, submitXgridJobWithSpecification:, submits the job using -[XGController performSubmitJobActionWithJobSpecification:gridIdentifier:], getting back an XGActionMonitor instance. It runs the run loop on the secondary thread until either failure or success has been recorded. If successful, the job identifier is returned from the results dictionary; if not, an NSError instance is generated in the DRMAAError domain, encapsulating the BEEP error generated by Xgrid.

(For the record, sometimes I hate how long Objective-C method names are. When I use ObjC, I miss Java. And vice-versa.)

To see it in code, a super-simple OCUnit test (which assumes a working Xgrid setup and valid XgridDRMAA user defaults settings) looks like this:

- (void)testRunSimpleJob
{
	[_session begin:nil];

	DRMAAJobTemplate *jobTemplate = [_session jobTemplate];

	[jobTemplate setRemoteCommand:@"/bin/ps"];
	[jobTemplate setJobName:@"testRunSimpleJob"];

	NSError *error = nil;
	NSString *jobId = [_session runJobWithJobTemplate:jobTemplate error:&error];
	STAssertNotNil(jobId, @"job run failed");

	if(jobId)
	{
		NSLog(@"jobId: %@", jobId);
	}
	else
	{
		STAssertNotNil(error, @"no error generated on failure");
		if(error) NSLog(@"error code: %ld", (long)[error code]);
	}

	[_session end:nil];
}

Making Xgrid synchronous

Thursday, June 15th, 2006

I finally got around to making the DRMAA code actually, um, talk to Xgrid. Right now all it does is open and close connections, but everything else will follow pretty much the same pattern.

From the client code, it’s very easy to begin a DRMAA session:

DRMAASession *session = [[DRMAASystem system] session];
NSError *error = nil;
BOOL success = [session begin:&error];

That translates into a whole pile of XgridFoundation code which, in summary, does the following:

  1. First, the code spawns a second thread, passing it a couple of ports through which to send Distributed Objects messages, and setting up a DO proxy object on the main thread. The second thread runs a standard NSRunLoop loop, which does two things: (1) gives the XgridFoundation objects the opportunity to do their thing, and (2) receives DO messages from the first thread, removing the need for dealing with mutex locks.
  2. The Xgrid thread having started its run loop, the main thread sends it a message to establish a connection with the Xgrid controller. In the Xgrid thread, a connection is opened with the usual XgridFoundation calls (using settings stored in user defaults as described in the previous post). Instead of using the delegate method callbacks, which would mean I’d need to signal the first thread with a condition variable, I wrote the code to simply run the run loop until the Xgrid connection was in the “open” state (or the “closed” state on account of an error). I was surprised this worked: even though the method is being called as part of -[NSRunLoop run...] (since it’s a DO message), it can itself run the run loop.
  3. Once the connection has (hopefully) opened successfully, the second thread’s DO-called method returns, letting the main thread continue. No callbacks, no nothing—one call, and the main thread knows whether the connection succeeded or not.

(The session-closing code just calls a method via DO in similar fashion.)
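The "one blocking call, serviced on another thread" shape can be sketched without Cocoa at all. The C version below uses pthreads in place of Distributed Objects and NSRunLoop, so it is only an analogy. All names are invented, it assumes a single calling thread at a time, and fake_open_connection stands in for the real XgridFoundation work.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* A dedicated worker thread that runs one request at a time. */
struct worker {
    pthread_t       thread;
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    int           (*request)(void *); /* pending work, or NULL */
    void           *arg;
    int             result;
    bool            done;
    bool            quit;
};

static void *worker_main(void *p)
{
    struct worker *w = p;
    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (!w->request && !w->quit)
            pthread_cond_wait(&w->wake, &w->lock);
        if (w->quit) break;
        w->result = w->request(w->arg); /* runs on the worker thread */
        w->request = NULL;
        w->done = true;
        pthread_cond_broadcast(&w->wake);
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

int worker_start(struct worker *w)
{
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->wake, NULL);
    w->request = NULL;
    w->arg = NULL;
    w->result = 0;
    w->done = false;
    w->quit = false;
    return pthread_create(&w->thread, NULL, worker_main, w);
}

/* Synchronous from the caller's point of view: returns only after the
 * worker thread has run fn(arg), and hands back fn's return value. */
int worker_call(struct worker *w, int (*fn)(void *), void *arg)
{
    pthread_mutex_lock(&w->lock);
    w->request = fn;
    w->arg = arg;
    w->done = false;
    pthread_cond_broadcast(&w->wake);
    while (!w->done)
        pthread_cond_wait(&w->wake, &w->lock);
    int result = w->result;
    pthread_mutex_unlock(&w->lock);
    return result;
}

void worker_stop(struct worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->quit = true;
    pthread_cond_broadcast(&w->wake);
    pthread_mutex_unlock(&w->lock);
    pthread_join(w->thread, NULL);
}

/* Placeholder for "open the connection": records which thread ran it
 * and reports success. */
int fake_open_connection(void *arg)
{
    *(pthread_t *)arg = pthread_self();
    return 1;
}
```

From the main thread, worker_call(&w, fake_open_connection, &tid) returns 1 once the worker has run the function, with no callbacks visible to the caller. Build with -pthread where the platform needs it.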

XgridDRMAA should make it really easy for procedural Objective-C (no, not necessarily an oxymoron) programs to use Xgrid. (C and Java too, of course, once the wrapping is done.) GridEZ is much better for real interactive Cocoa GUI apps—you don’t want to block your main thread waiting for a response from Xgrid, obviously—but for many scientists, this model is useful.

“Contact strings” and browsing for Xgrids

Tuesday, June 13th, 2006

The DRMAA specification includes the possibility of having the API return a list of “contacts”: identifiers for different grids. This seemed like a pretty natural place to return a list of controllers/grids discovered via Bonjour, so I went ahead and implemented the service browsing on a separate thread, managing it correctly with NSRunLoop, etc. When I was done, I realized that authentication would not work at all in the context of DRMAA unless no authentication was required. So I commented all that code out; I’ll bring snippets of the run-loop management back for the actual XgridFoundation code.

Instead, I have returned fully to the idea of just using the defaults database and a preference pane to store all this data. I’ve added string constants for all the data that will be needed to store a selected grid to XgridDRMAATypesAndConstants.h:

extern NSString *XgridDRMAAIdentificationMethod;
extern NSString *XgridDRMAANetServiceIdentificationMethod;
extern NSString *XgridDRMAAHostnameIdentificationMethod;

extern NSString *XgridDRMAANetServiceDomain;
extern NSString *XgridDRMAANetServiceName;

extern NSString *XgridDRMAAHostnameOrIP;
extern NSString *XgridDRMAAPortNumber;

extern NSString *XgridDRMAAGridName;

extern NSString *XgridDRMAAAuthenticationMethod;
extern NSString *XgridDRMAANoAuthenticationMethod;
extern NSString *XgridDRMAAPasswordAuthenticationMethod;
extern NSString *XgridDRMAAKerberosAuthenticationMethod;

extern NSString *XgridDRMAAUsername;
extern NSString *XgridDRMAAPassword;

Not all of these need to be set, obviously: for initial development, I’ve just set these values:

defaults write NSGlobalDomain XgridDRMAAIdentificationMethod XgridDRMAANetServiceIdentificationMethod
defaults write NSGlobalDomain XgridDRMAANetServiceName Astor    # my G5
defaults write NSGlobalDomain XgridDRMAAAuthenticationMethod XgridDRMAANoAuthenticationMethod

To respond to Charles’s comment about OCUnit:

For one, I’m right now using OCUnit as a way to automatically run “tests” that aren’t really unit tests, because there are no (or only trivial) OCUnit assertions—they’re just simple short programs with some debugging output. This just means I don’t have to create a separate executable target and manually run that.

As for real unit tests, I think I’m just going to make the assumption that a grid is selected properly via the user defaults mechanism (eventually, through the prefpane) before the tests are run. There’s no reason that someone building the code on their own machine needs to run the unit tests—they can just build the framework target by itself.

Minor developments & design notes

Saturday, June 10th, 2006

It’s been a little slow going the last few days with my sister and another friend in town, but I’ve added a few little touches:

OCUnit Testing: There’s really no complex code to test yet, but I added an OCUnit bundle target to the Xcode project. It’s pretty nice: tests automatically get run as part of the build process.

Xgrid Bonjour Browsing: Really the first part of the code that actually, well, does something (rather than being purely structural). On initialization, the XgridDRMAASystem class starts up a Bonjour service browser for _xgrid._tcp on a secondary thread, and an array gets updated as services get discovered. Pretty standard stuff. (It’s always fun to learn from sample code I wrote four years ago as an Apple Tech Pubs intern…)

This basic structure of having a secondary thread with an active NSRunLoop will carry over to actual Xgrid-to-DRMAA communication: from DRMAA’s perspective, everything’s nice and procedural; in parallel on another thread, an event-based run loop will be getting delegate messages from the Xgrid system and (in a thread-safe manner) updating the data structures read by the DRMAA method calls. It should be pretty straightforward, and pretty much what Charles suggested off the top of his head on the Xgrid list a few months ago.

A few notes on the class hierarchy of my Objective-C DRMAA bindings vs. the Java bindings:

The Java bindings mimic the C bindings quite closely, which makes sense—the closest thing to a reference implementation is a JNI wrapper around the SGE C implementation. One downside to this is that where the C bindings lack elegant object-orientation, so do the Java bindings.

Case in point: the getDrmaaImplementation(), getDrmSystem() and getContact() methods return different things depending on if they’re called before or after init()—beforehand, they return a list of possibilities; afterwards, they return the choice selected. I set up a class relationship so that the possibilities are available at the appropriate level of representation, and the choices made are attached to an object corresponding to that choice.

From the top, you have class (“static” in Java parlance) methods of the DRMAASystem class: systems, which returns an array of available systems; as well as systemsString, implementationsString, and contactsString, included only to make the mapping to the standard C bindings a little easier. There will also be methods to retrieve a specific DRM system (so far just system, which returns the default Xgrid implementation; eventually this will be more configurable).

On the next level, once you have a specific DRM system, you can query that system for its specific systemString, implementationString, or contactStrings, and retrieve a particular DRMAASession object, which contains the methods for actually interacting with a session: begin:, end:, jobTemplate, controlJobId:withAction:error:, etc.

Now, it might seem like this adds unnecessary verbosity to the code, but for the default case, it’s really not that bad:

DRMAASession *session = [[DRMAASystem system] session];

etc.

XgridDRMAA ObjC interface

Monday, June 5th, 2006

I put together some headers for the Objective-C DRMAA interface. They follow the Java bindings pretty closely, with some name changes to match Cocoa naming conventions better, plus the use of NSError instead of exceptions. They also will reuse the constants defined in the C headers where relevant.

The most glaring DRMAA requirement I’ve noticed missing from Xgrid is the ability to change the working directory before running the command. For missing things like this, I think it will be best to simply leave the implementation incomplete, tell Apple, and hope that the feature appears in the next release. This won’t hinder my ability to write GridSweeper as a pure-DRMAA app, so I can live without it for now.

Browse the code here.