Archive for the ‘XgridDRMAA’ Category

XgridDRMAA 0.1.1, With Examples

Tuesday, August 22nd, 2006

I silently released XgridDRMAA 0.1 the other day; today I’m releasing XgridDRMAA 0.1.1 and announcing it to the appropriate mailing lists.

You can download it here:

http://code.edbaskerville.com/xgrid_drmaa/XgridDRMAA.dmg

I have now included simple example code in all three supported languages (C, Objective-C, Java). For people working in Mac-only environments, I highly recommend the Objective-C interface. If you want your code to work with, e.g., Sun Grid Engine as well, the Java interface is convenient. If you’re attached to C, go ahead, but the API is a little cumbersome. All three languages use the Objective-C implementation at their core. (In the case of Java, there are actually two levels of wrapping: Java wraps C; C wraps Objective-C; this is so I could re-use Sun’s Java bindings for the Grid Engine.)

The code doesn’t take too much explanation, but here’s a basic outline of what it does:

  1. Obtain an object representing a DRMAA session (Objective-C and Java only).
  2. Open the session.
  3. Create a job template object describing various aspects of the job: in this case, it simply echoes a single number and writes stdout to the user’s desktop on the client machine.
  4. Run a “bulk job”—that is, a job with a parameterized index—with an index running from 1 to 5. The corresponding sub-job simply echoes the parameterized index.
  5. “Synchronize” all five jobs—wait for them all to complete.
  6. Clean up memory (C and Java only).
  7. Close the session and exit.

Development work will be on hold for the next few weeks as I work on GridSweeper and travel around Europe, but please send along feedback for when I return to it in early September! For bug reports, please use the bug database.

Summer of Code wrap-up

Monday, August 21st, 2006

My mixed-up brain thought the end of Summer of Code was August 26; it’s actually right now, so it’s time to wrap things up for this program. The code will be architecturally complete in the next couple of days.

The complete list of items that should be done:

  • XgridDRMAA 0.1 (already released)
  • Blog entry introducing the use of XgridDRMAA for developers
  • GridSweeper 0.0.1, with the following features:
    • Support for Drone-compliant models, and a standard interface for adding additional types of models
    • Support for file transfer via FTP, and a standard interface for adding additional filesystems
    • Command-line interface for running batches

The following GridSweeper features will be implemented in a post-SoC release:

  • Direct support for Repast models
  • Graphical user interface
  • Full plug-in support and developer documentation for plug-in interfaces

Additionally, XgridDRMAA will improve with user feedback and additional testing.

XgridDRMAA 0.1

Thursday, August 17th, 2006

I’m pleased to announce the first development release of XgridDRMAA. This version should be useful enough to do basic job submission and monitoring tasks, but will probably have some problems to work out. Due partially to limitations in Xgrid and partially to time, some features are still missing (see the readme).

Over the next few days (between Paris gigs with The April Fishes) I’ll upload some tutorials on how to actually use the framework. For now, you can consult the DRMAA website for general information.

You can download the file here:

http://code.edbaskerville.com/xgrid_drmaa/XgridDRMAA.dmg

Java bindings working

Tuesday, July 11th, 2006

Thanks to Dan Templeton’s Java bindings in the Sun Grid Engine, and his porting instructions on his blog, I have XgridDRMAA basically working in Java. I’ll still have to look through for some minor implementation differences (such as supported attributes), but basic things, including the DrmaaExample.java code included with SGE, are working.

Besides fixing bugs I found along the way, I had to do a couple additional things to make things work right:

  • Change the class names. This meant changing the package name in the source files, but also meant fixing a couple lines in the actual code where classes are looked up by name.
  • Find DRMAA Java implementation. The org.ggf.drmaa SessionFactory class uses a couple methods to try to find a DRMAA implementation: first, it tries System.getProperty to see if a class name has been set; if not, it looks for a setting in the classpath’s META-INF/services/org.ggf.drmaa.SessionFactory file. If there, it uses that. I just added this file to the XgridDRMAA jar file.
  • Find DRMAA library (JNI). The DRMAA Java implementation basically just maps onto the JNI, which is compiled into the XgridDRMAA framework. On Mac OS X, JNI libraries are just Mach-O dylibs containing the right C code. Mac OS X frameworks are also simply Mach-O dylibs wrapped in a nice directory structure. So it’s just a matter of having the magic line of code (System.loadLibrary("drmaa")) find the right library. As it turns out, you have to symlink the XgridDRMAA executable to a file called libdrmaa.jnilib, add the enclosing directory to DYLD_LIBRARY_PATH, and everything works.
  • Fix exit-status analysis. Apparently Darwin does exit-status values differently than whatever the SGE DRMAA code was written for: at first, the example code kept telling me that jobs were finishing “with unclear conditions.” I fixed JobInfoImpl.java to use the same semantics as Darwin’s wait.h file.
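
For the curious, here is a minimal C sketch of what "the same semantics as Darwin’s wait.h" amounts to: classify a wait status with the standard POSIX macros instead of picking bits out by hand. This is not the JobInfoImpl.java fix itself, and every name in it (job_outcome, decode_wait_status, outcome_of_child_exiting_with) is made up for illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result type; these names are not part of DRMAA or SGE. */
typedef enum { OUTCOME_EXITED, OUTCOME_SIGNALED, OUTCOME_UNCLEAR } outcome_kind;

typedef struct {
    outcome_kind kind;
    int exit_code;   /* meaningful only when kind == OUTCOME_EXITED */
    int signal_no;   /* meaningful only when kind == OUTCOME_SIGNALED */
} job_outcome;

/* Classify a status value as filled in by wait(), waitpid() or wait4(). */
job_outcome decode_wait_status(int status)
{
    job_outcome out = { OUTCOME_UNCLEAR, 0, 0 };
    if (WIFEXITED(status)) {
        out.kind = OUTCOME_EXITED;
        out.exit_code = WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        out.kind = OUTCOME_SIGNALED;
        out.signal_no = WTERMSIG(status);
    }
    return out;
}

/* Demo helper: fork a child that exits with `code`, then decode its status. */
job_outcome outcome_of_child_exiting_with(int code)
{
    job_outcome unclear = { OUTCOME_UNCLEAR, 0, 0 };
    int status = 0;
    pid_t pid;

    assert(code >= 0 && code <= 255);   /* only the low 8 bits survive */
    pid = fork();
    if (pid == -1)
        return unclear;
    if (pid == 0)
        _exit(code);                    /* child: terminate immediately */
    if (waitpid(pid, &status, 0) == -1)
        return unclear;
    return decode_wait_status(status);
}
```

Calling outcome_of_child_exiting_with(3) should come back as a normal exit with code 3. Anything that is neither a normal exit nor a signal falls through to the "unclear" case.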

For the XgridDRMAA installer, I’ll just have it put symlinks in /Library/Java/Extensions to both XgridDRMAA.jar (so it’s in the classpath) and XgridDRMAA (as libdrmaa.jnilib, so it’s in the library path). That way, the user of Java DRMAA apps won’t have to do any additional work (besides setting things up in the prefpane) to use Xgrid.

C bindings complete

Thursday, July 6th, 2006

The Objective-C DRMAA implementation has now been wrapped in C as per the 1.0 DRMAA C binding spec. Far more code than expected, but all very straightforward code.

All that’s left is some real testing of the C layer, filling in a couple holes in the Objective-C code (most notably supporting file transfer via scp from other hosts), and doing the Java bindings, which will consist essentially of code lifted from the Sun Open Source Grid Engine code base.

Almost there…

Wednesday, June 28th, 2006

I’m very close to a full DRMAA implementation for Xgrid (still just in Objective-C), or as full an implementation as is currently possible with Xgrid. The only major missing feature right now is bulk jobs.

The biggest hurdle has been the fact that Xgrid doesn’t support a number of things needed by the specification. The most important of those are: (1) setting the working directory, and (2) actually getting useful information about job execution, exit status, etc.

The only way I saw to do this was to wrap each and every Xgrid job in a proxy executable, xgrid_drmaa_proxy. This proxy sets the environment, arguments, and stdin for the command being run; runs it; and retrieves resource usage data using the wait4 system call.

Some interesting and frustrating things I learned along the way:

  • I knew that NSTask is a great class for running other processes. Makes things so easy. But you can’t use wait4() on that process to get usage info. Apparently NSTask is doing funny things on another thread that interfere.
  • The combination of fork(), dup2(), execve() and wait() is very powerful, as long as you remember the following: (1) close one end of each of the redirected pipes; and (2) manually set argv[0] to contain the launch path.
  • Running an NSRunLoop recursively from something called back by running the run loop works, until you start dealing with finicky networking code to download files from Xgrid. Re-trying calls with -[NSObject performSelector:withObject:afterDelay:] is far more effective. I plan to switch all my recursive running of run loops to this model (or, if easier, condition-waits with NSConditionLock).
  • My biggest annoyance: XgridFoundation will accept @"YES" and @"NO" as values for whether a submitted file is executable or not, but not, say, [NSNumber numberWithBool:YES]. That’s stupid. Consider this the first (second?) in a long series of rants (and bug reports to Apple) about XgridFoundation. This one took me a *long* time (and a trip to Charles’s GridStuffer source code) to figure out.
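
To make the fork()/dup2()/execve()/wait4() point concrete, here is a rough C sketch of that combination. It is not the xgrid_drmaa_proxy source: the function name run_and_measure and its parameters are hypothetical, it only redirects stdout, and it assumes the output fits in the caller's buffer.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `path` with the given argv/envp, capture up to cap-1 bytes of its
   stdout into `out`, and fill `usage` with the child's resource usage.
   Returns the raw wait status, or -1 on error. By convention the caller
   sets argv[0] to the launch path; execve() does not do that for you. */
int run_and_measure(const char *path, char *const argv[], char *const envp[],
                    char *out, size_t cap, struct rusage *usage)
{
    int fds[2], status = 0;
    size_t len = 0;
    ssize_t n;
    pid_t pid;

    assert(path != NULL && argv != NULL && out != NULL && cap > 0);
    if (pipe(fds) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);                        /* child never reads the pipe */
        if (dup2(fds[1], STDOUT_FILENO) == -1)
            _exit(126);
        close(fds[1]);                        /* stdout now refers to the pipe */
        execve(path, argv, envp);
        _exit(127);                           /* reached only if execve failed */
    }

    close(fds[1]);     /* parent must close the write end or read() never sees EOF */
    while (len + 1 < cap && (n = read(fds[0], out + len, cap - 1 - len)) > 0)
        len += (size_t)n;
    out[len] = '\0';
    close(fds[0]);

    /* wait4() reaps the child and reports its resource usage in one call. */
    if (wait4(pid, &status, 0, usage) == -1)
        return -1;
    return status;
}
```

With argv set to { "/bin/sh", "-c", "echo hello", NULL }, the buffer should end up holding "hello" plus a newline, and the rusage struct carries the CPU times that Xgrid does not report on its own.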

I’m going to take a break from this until Monday—work on some eco-stuff. Come Monday, bulk jobs, C bindings, and Java bindings will be the only things left (aside from a few detailed loose ends). Hopefully a release with installer early next week!

Wait/synchronize redone

Tuesday, June 27th, 2006

The 1.0 DRMAA spec wasn’t completely clear on multithreaded behavior, so I went to the drmaa-wg mailing list to ask a couple questions:

  • What happens if two threads try to wait simultaneously for the same job to complete? Do they both get the job info data back, or does the earlier call get the data while the later call gets an invalid job error? (Answer: only one call gets the data.)
  • What about synchronize? If one thread is waiting on a job, and another thread is synchronizing on a bunch of jobs including that one, and the first thread gets the job info back, should the synchronize call get an error? (Answer: no error in this case. Since synchronize doesn’t get data back, it’s fine as long as the job finished.)

These are edge cases—things that would probably never happen in the real world—you’d have to be a bit crazy to be querying about the same job from a whole bunch of threads—but it should still be done the Right Way.

The right way to do this is to maintain a queue of all the calls that have come in from different threads, so that the ordering of the calls is a known quantity when the KVO observer gets notification about the state change of a job—without the queue, you can’t be sure which thread you’re supposed to wake up.

When a notification comes in from Xgrid that a job has finished, the observer method does the following to the call queue:

  1. Find the first call in the queue, if any, that reaps the job info: this could be a wait call on the specific job id, or a wait(any) call, or a synchronize call with dispose=true.
  2. If one is found, reap the job info, and notify the calling thread if it’s a wait call.
  3. Find all subsequent wait calls (not including wait(any) calls) that are waiting on this job. Set an error, and wake their threads up as well.
  4. Look through all the synchronize calls to see if any care about this job. Remove this job from their list of jobs to monitor. If they have no more jobs to monitor, wake their threads up.
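
The four steps above can be sketched in plain C. This is only a model of the bookkeeping, not the Objective-C code in XgridDRMAA: there are no real threads here, a result field stands in for "wake this thread with that outcome," and all the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SYNC_JOBS 8

enum call_kind   { CALL_WAIT_JOB, CALL_WAIT_ANY, CALL_SYNCHRONIZE };
enum call_result { CALL_PENDING, CALL_GOT_INFO, CALL_INVALID_JOB, CALL_SYNC_DONE };

/* One blocked call, in arrival order within the queue array. */
struct pending_call {
    enum call_kind   kind;
    int              job_id;                    /* CALL_WAIT_JOB only */
    int              sync_jobs[MAX_SYNC_JOBS];  /* CALL_SYNCHRONIZE only */
    size_t           sync_count;
    bool             dispose;                   /* CALL_SYNCHRONIZE only */
    enum call_result result;                    /* CALL_PENDING = still blocked */
};

static bool sync_contains(const struct pending_call *c, int job_id)
{
    for (size_t i = 0; i < c->sync_count; i++)
        if (c->sync_jobs[i] == job_id)
            return true;
    return false;
}

/* Remove job_id from a synchronize call's list; true if it was there. */
static bool sync_remove(struct pending_call *c, int job_id)
{
    for (size_t i = 0; i < c->sync_count; i++) {
        if (c->sync_jobs[i] == job_id) {
            c->sync_count--;
            c->sync_jobs[i] = c->sync_jobs[c->sync_count];
            return true;
        }
    }
    return false;
}

void job_finished(struct pending_call *queue, size_t n, int job_id)
{
    size_t i, reaper = n;                       /* n means "no reaper found" */

    assert(queue != NULL || n == 0);

    /* Step 1: the first pending call that reaps this job's info. */
    for (i = 0; i < n && reaper == n; i++) {
        const struct pending_call *c = &queue[i];
        if (c->result != CALL_PENDING)
            continue;
        if ((c->kind == CALL_WAIT_JOB && c->job_id == job_id) ||
            c->kind == CALL_WAIT_ANY ||
            (c->kind == CALL_SYNCHRONIZE && c->dispose && sync_contains(c, job_id)))
            reaper = i;
    }

    if (reaper < n) {
        /* Step 2: a reaping wait call is woken with the job info. */
        if (queue[reaper].kind != CALL_SYNCHRONIZE)
            queue[reaper].result = CALL_GOT_INFO;
        /* Step 3: later waits on this specific job are woken with an error. */
        for (i = reaper + 1; i < n; i++)
            if (queue[i].result == CALL_PENDING &&
                queue[i].kind == CALL_WAIT_JOB && queue[i].job_id == job_id)
                queue[i].result = CALL_INVALID_JOB;
    }

    /* Step 4: synchronize calls drop the job; an empty list means done. */
    for (i = 0; i < n; i++) {
        struct pending_call *c = &queue[i];
        if (c->result == CALL_PENDING && c->kind == CALL_SYNCHRONIZE &&
            sync_remove(c, job_id) && c->sync_count == 0)
            c->result = CALL_SYNC_DONE;
    }
}
```

Because the queue is kept in arrival order, "first" and "subsequent" are well defined, which is the whole reason for having the queue.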

This is now implemented. Not tested heavily, but it works for my simple single-threaded tests. (Yes, that’s a very bad way to test multithreaded behavior. More tests to come. :))

PrefPane complete

Monday, June 26th, 2006

I just put together that preference pane for controlling XgridDRMAA grid selection. It’s pretty straightforward: it lets you browse for a grid advertised via Zeroconf/Bonjour, or specify a hostname/IP address, or just choose to use the standard Xgrid environment variables, which requires you to set them up elsewhere (in your shell configuration for command-line programs; in ~/.MacOSX/environment.plist for Mac OS X-environment programs). It looks like this:



If you click the top radio button, a simple network browser shows up in a sheet attached to the System Preferences window. It supports not just the usual local. domain browsing, but also any other available domains for those with wide-area Bonjour configured. The picture:



If you click the second radio button, you get this simple form:



Either way, when you click Continue, it tries to connect to the server. If authentication succeeds, then we don’t need a password, and we go straight to grid selection. If it fails, then you have to set authentication information (this one might look familiar):



Finally, you select a grid. The default grid is in bold:



When you click Finish, your settings are magically saved in the user defaults database. Except the password, that is: if you’re using password authentication, the password gets saved to the user keychain, so it’s not sitting on the hard drive in cleartext.

The settings are now reflected in a little text below the radio button:



Now, whenever an application starts an XgridDRMAA session, it takes these settings, which are set in NSGlobalDomain. They can all be overridden on a per-application basis, of course, using the standard NSUserDefaults API. But most DRMAA apps won’t even need to know they’re connecting to Xgrid. Or running on a Mac. That’s the idea, anyway.

Job waiting

Thursday, June 22nd, 2006

Now in place: waiting for jobs to complete.

How it works: after a job starts, it’s added to a list of jobs being monitored, and is observed via Cocoa Key-Value Observing for its state key path. An NSConditionLock object is also created for each job being monitored, and when the state change is observed, the condition is set to correspond to the new state.

When the client code calls -[DRMAASession waitForJobId:timeout:error:], it simply tries to obtain a lock on the job with the XgridDRMAAJobConditionDoneOrFailed condition (using NSConditionLock’s timeout locking methods if a timeout is specified). Once the lock is obtained, it asks the Xgrid thread for a job info object, which right now consists of basically no data because I haven’t implemented an executable wrapper to actually collect the data. (Xgrid doesn’t collect it on its own.)

Job status & control

Thursday, June 22nd, 2006

Today: I implemented the job status and control functions in DRMAA. Pretty straightforward: for status, I map the Xgrid status to a corresponding DRMAA value (this mapping having been run by the xgrid-users list). For control, I just grab the XGJob object corresponding to the job and send it the right message.

I had to leave out two DRMAA control options: hold and release. Apparently “hold” means “leave this job in the queue, but don’t allow it to run yet”; “release” puts it into a runnable state. Xgrid doesn’t support this. (Similarly, I previously had to ignore the job submission state attribute—on other systems, you can have a job start out in the “hold” state.)

The one trickiness with today’s code is that jobs don’t show up immediately in the grid’s jobs list, even after the XGActionMonitor for monitoring submission indicates success. So I added some code in -[XgridDRMAASession submitXgridJobWithSpecification:] to wait for the job’s presence in the grid’s list before returning. That way, any subsequent calls regarding that job are guaranteed to find it as long as it hasn’t been removed from the grid.

Side note: as required by DRMAA, all the methods are thread-safe. So far, this just means putting @synchronized(self) { } around the bodies of all the methods; all the Xgrid stuff is handled on a dedicated Xgrid thread.