Summer of Code wrap-up

August 21st, 2006

My mixed-up brain thought the end of Summer of Code was August 26; it’s actually right now, so it’s time to wrap things up for this program. The code will be architecturally complete in the next couple of days.

The complete list of items that should be done:

  • XgridDRMAA 0.1 (already released)
  • Blog entry introducing the use of XgridDRMAA for developers
  • GridSweeper 0.0.1, with the following features:
    • Support for Drone-compliant models, and a standard interface for adding additional types of models
    • Support for file transfer via FTP, and a standard interface for adding additional filesystems
    • Command-line interface for running batches

The following GridSweeper features will be implemented in a post-SoC release:

  • Direct support for Repast models
  • Graphical user interface
  • Full plug-in support and developer documentation for plug-in interfaces

Additionally, XgridDRMAA will improve with user feedback and additional testing.

XgridDRMAA 0.1

August 17th, 2006

I’m pleased to announce the first development release of XgridDRMAA. This version should be useful enough to do basic job submission and monitoring tasks, but will probably have some problems to work out. Due partially to limitations in Xgrid and partially to time, some features are still missing (see the readme).

Over the next few days (between Paris gigs with The April Fishes) I’ll upload some tutorials on how to actually use the framework. For now, you can consult the DRMAA website for general information.

You can download the file here:

http://code.edbaskerville.com/xgrid_drmaa/XgridDRMAA.dmg

From control files to experiment runs…

July 20th, 2006

Here’s how a set of parameter sweeps will get translated into an actual experiment run…

  1. Generate a tree of Sweep objects from a control file and/or command-line arguments. (In the case of the GUI, the Sweep objects will be generated live as the model backing the view.)
  2. Get a list of parameter maps by calling generateMaps() on the top-level sweep.
  3. Convert that list of parameter maps into a list of jobs for the grid system, querying the plugin for the model system (Repast, Drone, etc.) along the way to generate the job submission data (and do things like stage files if there isn’t a shared filesystem).
  4. Submit the jobs to the grid.
  5. On the client/submit end, monitor progress of jobs using output from CLI tool or via GUI.
  6. On the agent end, run the jobs by passing the information to the plug-in. If required, stage files back to the FTP, etc. server when the job is done.

Sweeps

July 20th, 2006

The basic model code for parameter sweeps is done. There’s a standard interface (Sweep) for all sweep types that contains a single method:

public List generateMaps()

The returned List is simply a sequence of parameter settings. Each item in the list is a ParameterMap object, which is just a subclass of HashMap with some convenience constructors.
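
As a rough illustration of the shape described above, here is a minimal sketch of the interface, a ParameterMap, and the simplest sweep. It uses generics for readability where the signature above uses a raw List; the constructors and field names are assumptions for illustration, not the actual GridSweeper source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a HashMap subclass with a few convenience constructors.
class ParameterMap extends HashMap<String, Object> {
    ParameterMap() { super(); }
    ParameterMap(String name, Object value) { put(name, value); }
    ParameterMap(Map<String, ?> other) { super(other); }
}

// The single-method interface shared by all sweep types.
interface Sweep {
    List<ParameterMap> generateMaps();
}

// Simplest case: one parameter, one value, so a list of length one.
class SingleValueSweep implements Sweep {
    private final String name;
    private final Object value;

    SingleValueSweep(String name, Object value) {
        this.name = name;
        this.value = value;
    }

    public List<ParameterMap> generateMaps() {
        List<ParameterMap> maps = new ArrayList<>();
        maps.add(new ParameterMap(name, value));
        return maps;
    }
}
```

Calling new SingleValueSweep("beta", "0.1").generateMaps() yields a one-element list whose only map assigns "0.1" to beta.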

Currently, there are six different concrete subclasses of Sweep, plus a few abstract subclasses defining common elements.

SingleValueSweep Pretty simple: assigns a single value to a single parameter.

ListSweep The first nontrivial type: assigns a list of values to a single parameter.

RangeListSweep Probably the most useful type: lets you assign a range of values, defined with a start, end, and increment, to a single parameter. The values are represented using the arbitrary-precision BigDecimal class, so there’s no possibility for rounding error when adding values together.
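
To make the BigDecimal point concrete, here is a small sketch of range expansion. The class and method names are hypothetical, and it assumes a positive increment and an inclusive end; the real RangeListSweep may differ.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

class RangeValues {
    // Values from start to end (inclusive), stepping by a positive increment.
    // BigDecimal is built from strings so "0.1" is stored as exactly 0.1.
    static List<BigDecimal> range(String start, String end, String increment) {
        BigDecimal step = new BigDecimal(increment);
        if (step.signum() <= 0) {
            throw new IllegalArgumentException("increment must be positive");
        }
        BigDecimal last = new BigDecimal(end);
        List<BigDecimal> values = new ArrayList<>();
        for (BigDecimal v = new BigDecimal(start); v.compareTo(last) <= 0; v = v.add(step)) {
            values.add(v);
        }
        return values;
    }
}
```

With doubles, 0.1 + 0.2 does not compare equal to 0.3, so a loop like this can silently drop or add an endpoint; with BigDecimal, range("0.3", "0.6", "0.1") gives exactly 0.3, 0.4, 0.5, 0.6.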

LinearCombinationSweep Combines two other sweeps “linearly”—that is, in parallel, so the first parameter map in sweep 1’s list gets combined with the first parameter map in sweep 2’s list, and so on. For example, combining beta=0.1,0.2,0.3 with gamma=0.4,0.5,0.6 would result in a length-3 LinearCombinationSweep with (beta,gamma)=(0.1,0.4), (0.2,0.5), (0.3,0.6).
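
A sketch of the parallel combination, using plain maps of strings rather than ParameterMap to keep it self-contained. Rejecting lists of unequal length is an assumption; the post does not say what the real class does in that case.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LinearCombination {
    // Merge the i-th map of each list into one map per position.
    static List<Map<String, String>> combine(List<Map<String, String>> a,
                                             List<Map<String, String>> b) {
        if (a.size() != b.size()) {
            throw new IllegalArgumentException("sweeps must have the same length");
        }
        List<Map<String, String>> result = new ArrayList<>();
        for (int i = 0; i < a.size(); i++) {
            Map<String, String> merged = new LinkedHashMap<>(a.get(i));
            merged.putAll(b.get(i));
            result.add(merged);
        }
        return result;
    }

    // Helper for the example: one single-parameter map per value.
    static List<Map<String, String>> values(String name, String... vals) {
        List<Map<String, String>> maps = new ArrayList<>();
        for (String v : vals) {
            Map<String, String> m = new LinkedHashMap<>();
            m.put(name, v);
            maps.add(m);
        }
        return maps;
    }
}
```

Combining values("beta", "0.1", "0.2", "0.3") with values("gamma", "0.4", "0.5", "0.6") produces the three pairs from the example above.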

MultiplicativeCombinationSweep This is what most people want when varying multiple parameters: generate every combination of each parameter/value pair. So, to reuse the last example, combining beta=0.1,0.2,0.3 with gamma=0.4,0.5,0.6 results in a length-9 MultiplicativeCombinationSweep with (beta,gamma) = (0.1,0.4), (0.1,0.5), (0.1,0.6), (0.2,0.4), (0.2,0.5), (0.2,0.6), (0.3,0.4), (0.3,0.5), (0.3,0.6).
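
The all-combinations case is a nested loop over the two lists. Again this is a sketch with hypothetical names and plain string maps, not the GridSweeper class itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MultiplicativeCombination {
    // Pair every map from the first list with every map from the second.
    static List<Map<String, String>> combine(List<Map<String, String>> a,
                                             List<Map<String, String>> b) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> x : a) {
            for (Map<String, String> y : b) {
                Map<String, String> merged = new LinkedHashMap<>(x);
                merged.putAll(y);
                result.add(merged);
            }
        }
        return result;
    }

    // Helper for the example: one single-parameter map per value.
    static List<Map<String, String>> values(String name, String... vals) {
        List<Map<String, String>> maps = new ArrayList<>();
        for (String v : vals) {
            Map<String, String> m = new LinkedHashMap<>();
            m.put(name, v);
            maps.add(m);
        }
        return maps;
    }
}
```

The output length is the product of the input lengths, so three beta values and three gamma values give nine maps, in the same order as the list above.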

UniformDoubleSweep The first in a series of stochastic sweeps (more to be written), this sweep generates a number (provided) of values uniformly distributed within a range, so a parameter space can be explored stochastically. If you’re exploring your parameter space from 0 to 1 in increments of 0.1, and it just so happens that interesting spikes happen at 0.15, 0.25, and 0.35, you’re not going to notice them unless you explore the space stochastically.
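
A sketch of the sampling step, assuming java.util.Random as the generator. The explicit seed parameter is an addition for reproducibility in this example; the post does not say how the real class is seeded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class UniformValues {
    // Draw `count` values uniformly from [min, max).
    static List<Double> sample(double min, double max, int count, long seed) {
        Random rng = new Random(seed);
        List<Double> values = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            values.add(min + (max - min) * rng.nextDouble());
        }
        return values;
    }
}
```

Unlike a fixed grid at 0.0, 0.1, 0.2 and so on, repeated draws from sample(0.0, 1.0, n, seed) can land anywhere in the interval, including near 0.15 or 0.25.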

CLI usage scenarios

July 12th, 2006

The most important piece of unfinished business in the GridSweeper design is what exactly the command-line interface will look like. It’s funny—I grew up on an old-school Mac, scoffing at my primitive DOS-using fourth-grade schoolmates. What an awful way to interact with a computer: remember arcane commands and type them in! But as soon as you start doing software development, or system administration, or anything that needs to be automated, the command line is often more efficient.

I have the same goal for the GridSweeper command-line tools as for the graphical interface: make it easy to run parameter sweeps of models. More accurately, make the most common types of parameter sweeps very easy to do; and make other types of sweeps possible, and as easy as possible.

Scenario 1: Multiple Parameters, Ranges, All Combinations

The most common usage scenario is to vary one or more parameters, and run the model one or more times for each combination of parameters. So if there are 3 parameters being varied, each with 4 different values, and the model is being run 10 times with different random seeds, there will be a total of 4 x 4 x 4 x 10 = 640 runs.

Let’s say a model has three parameters, beta, gamma, and nu. Beta will go from 0.3 to 0.6 in increments of 0.1; gamma from 1.0 to 1.3; and nu from 0.1 to 0.4. The model will be run 10 times with different random seeds. Let’s say the

The syntax will go something like this:

grepast mymodel -n10 beta=0.3:0.1:0.6 gamma=1.0:0.1:1.3 nu=0.1:0.1:0.4

A breakdown of the pieces:

  • grepast will be a tool that just calls “gridsweeper repast”, telling the gridsweeper tool that this is a repast model, so the parts of the process that need to be handled by the repast plug-in will be.
  • mymodel says to use mymodel.jar in the current directory. If there’s a shared filesystem (this will be settable in a configuration file or in the GUI), nothing will be transferred over the network except the complete path to the file; if FTP is being used, this file will be staged to the FTP server before running the job, and downloaded by the job on the execution machines.
  • beta=0.3:0.1:0.6 etc. are the key: you can specify ranges of values with super-simple syntax: [start]:[increment]:[end].

Scenario 2: Multiple Parameters, Specified Values, All Combinations

Sometimes you don’t want to specify ranges & increments, but simply particular combinations of values. You’ll be able to specify a vector of values using commas:

grepast mymodel -n10 beta=0.1,0.4,0.7 gamma=1.0,1.4,1.9

Or, if you want, you can mix range/increment lists with specific values:

grepast mymodel -n10 beta=0.1:0.1:0.5,0.7,1.3
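
One way a single name=values argument from Scenarios 1 and 2 might be parsed is sketched below. The class name and error handling are hypothetical, and since the syntax itself is still a proposal, treat this as an illustration rather than the tool's actual parser.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

class ParameterSpec {
    final String name;
    final List<BigDecimal> values;

    ParameterSpec(String name, List<BigDecimal> values) {
        this.name = name;
        this.values = values;
    }

    // Parse "name=item,item,..." where each item is either a single value
    // or a range written as start:increment:end (end inclusive).
    static ParameterSpec parse(String arg) {
        int eq = arg.indexOf('=');
        if (eq < 1) {
            throw new IllegalArgumentException("expected name=values: " + arg);
        }
        String name = arg.substring(0, eq).trim();
        List<BigDecimal> values = new ArrayList<>();
        for (String item : arg.substring(eq + 1).split(",")) {
            String[] parts = item.trim().split(":");
            if (parts.length == 1) {
                values.add(new BigDecimal(parts[0]));
            } else if (parts.length == 3) {
                BigDecimal step = new BigDecimal(parts[1]);
                BigDecimal end = new BigDecimal(parts[2]);
                if (step.signum() <= 0) {
                    throw new IllegalArgumentException("increment must be positive: " + item);
                }
                for (BigDecimal v = new BigDecimal(parts[0]); v.compareTo(end) <= 0; v = v.add(step)) {
                    values.add(v);
                }
            } else {
                throw new IllegalArgumentException("bad item: " + item);
            }
        }
        return new ParameterSpec(name, values);
    }
}
```

Under these assumptions, "beta=0.1:0.1:0.5,0.7,1.3" expands to 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 1.3, and each of the three arguments in Scenario 1 expands to four values.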

Scenario 3: Multiple Parameters, Specific Combinations

Another common need is to run certain combinations of parameters, but not others. For example, beta=0.3/gamma=0.5 and beta=0.4/gamma=0.6, but not beta=0.3/gamma=0.6. This is accomplished by separating parameter names with semicolons (quotes inserted so the shell sees the whole thing as one argument):

grepast mymodel -n10 "beta;gamma = 0.3;0.5, 0.4;0.6"

If you’d rather specify lists of values with all beta values together and all gamma values together, that’s fine too—just remember that commas separate parameter values for a particular parameter; semicolons separate values for different parameters:

grepast mymodel -n10 "beta;gamma = 0.3,0.4; 0.5,0.6"

Extending this a step further, you’ll be able to combine range/increment lists with this syntax:

grepast mymodel -n10 "beta;gamma = 0.3:0.1:0.6; 0.6:0.1:0.9"

is equivalent to

grepast mymodel -n10 "beta;gamma = 0.3;0.6, 0.4;0.7, 0.5;0.8, 0.6;0.9"

and to

grepast mymodel -n10 "beta;gamma = 0.3,0.4,0.5,0.6; 0.6,0.7,0.8,0.9"
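
The two explicit forms above (tuples separated by commas, or one comma-separated list per parameter) can be told apart mechanically. The sketch below is one possible approach, not the planned implementation: it assumes that a body which splits on semicolons into exactly as many pieces as there are parameter names is in the per-parameter form, and it leaves range/increment items unexpanded.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CombinedSpec {
    // Parse "p1;p2 = ..." into one map per run. Values stay as strings.
    static List<Map<String, String>> parse(String arg) {
        int eq = arg.indexOf('=');
        if (eq < 1) {
            throw new IllegalArgumentException("expected names = values: " + arg);
        }
        String[] names = arg.substring(0, eq).trim().split("\\s*;\\s*");
        String body = arg.substring(eq + 1).trim();
        String[] bySemicolon = body.split("\\s*;\\s*");
        List<String[]> tuples = new ArrayList<>();
        if (bySemicolon.length == names.length) {
            // Per-parameter form: "0.3,0.4; 0.5,0.6"
            String[][] columns = new String[names.length][];
            for (int p = 0; p < names.length; p++) {
                columns[p] = bySemicolon[p].split("\\s*,\\s*");
                if (columns[p].length != columns[0].length) {
                    throw new IllegalArgumentException("lists differ in length: " + body);
                }
            }
            for (int i = 0; i < columns[0].length; i++) {
                String[] tuple = new String[names.length];
                for (int p = 0; p < names.length; p++) {
                    tuple[p] = columns[p][i];
                }
                tuples.add(tuple);
            }
        } else {
            // Tuple form: "0.3;0.5, 0.4;0.6"
            for (String t : body.split("\\s*,\\s*")) {
                String[] tuple = t.split("\\s*;\\s*");
                if (tuple.length != names.length) {
                    throw new IllegalArgumentException("bad tuple: " + t);
                }
                tuples.add(tuple);
            }
        }
        List<Map<String, String>> maps = new ArrayList<>();
        for (String[] tuple : tuples) {
            Map<String, String> m = new LinkedHashMap<>();
            for (int p = 0; p < names.length; p++) {
                m.put(names[p], tuple[p]);
            }
            maps.add(m);
        }
        return maps;
    }
}
```

With this rule, "beta;gamma = 0.3;0.5, 0.4;0.6" and "beta;gamma = 0.3,0.4; 0.5,0.6" parse to the same two runs.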

This is as much complexity as command-line syntax will support, though. Beyond this, it’s probably time to use a control file anyway (to be covered in a later post).

Java bindings working

July 11th, 2006

Thanks to Dan Templeton’s Java bindings in the Sun Grid Engine, and his porting instructions on his blog, I have XgridDRMAA basically working in Java. I’ll still have to look through for some minor implementation differences (such as supported attributes), but basic things, including the DrmaaExample.java code included with SGE, are working.

Besides fixing bugs I found along the way, I had to do a couple additional things to make things work right:

  • Change the class names This meant changing the package name in the source files, but also meant fixing a couple lines in the actual code where classes are looked up by name.
  • Find DRMAA Java implementation The org.ggf.drmaa SessionFactory class uses a couple methods to try to find a DRMAA implementation: first, it tries System.getProperty to see if a class name has been set; if not, it looks for a setting in the classpath’s META-INF/services/org.ggf.drmaa.SessionFactory file. If there, it uses that. I just added this file to the XgridDRMAA jar file.
  • Find DRMAA library (JNI) The DRMAA Java implementation basically just maps onto the JNI, which is compiled into the XgridDRMAA framework. On Mac OS X, JNI libraries are just Mach-O dylibs containing the right C code. Mac OS X frameworks are also simply Mach-O dylibs wrapped in a nice directory structure. So it’s just a matter of having the magic line of code (System.loadLibrary("drmaa")) find the right library. As it turns out, you have to symlink the XgridDRMAA executable to a file called libdrmaa.jnilib, add the enclosing directory to DYLD_LIBRARY_PATH, and everything works.
  • Fix exit-status analysis Apparently Darwin does exit-status values differently than whatever the SGE DRMAA code was written for—at first, the example code kept telling me that jobs were finishing “with unclear conditions.” I fixed JobInfoImpl.java to use the same semantics as Darwin’s wait.h file.
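
For reference on the exit-status point in the list above, here is a sketch of decoding a raw wait status using the traditional BSD layout that Darwin's wait.h macros follow, as far as I understand them. This is not the code from JobInfoImpl.java, and the class and method names are hypothetical.

```java
class WaitStatus {
    private static final int STOPPED = 0177; // _WSTOPPED (octal)

    // Low seven bits: 0 for a normal exit, otherwise a signal number (or _WSTOPPED).
    private static int low7(int status) { return status & 0177; }

    // WIFEXITED: the process called exit() or returned from main().
    static boolean exited(int status) { return low7(status) == 0; }

    // WEXITSTATUS: the exit code lives in the second byte.
    static int exitStatus(int status) { return (status >> 8) & 0xff; }

    // WIFSIGNALED: terminated by a signal (neither stopped nor exited).
    static boolean signaled(int status) {
        int s = low7(status);
        return s != STOPPED && s != 0;
    }

    // WTERMSIG: the terminating signal number.
    static int termSignal(int status) { return low7(status); }

    // WCOREDUMP: bit 0200 marks a core dump.
    static boolean coreDumped(int status) { return (status & 0200) != 0; }
}
```

A status of 3 << 8 decodes as a normal exit with code 3, while a status of 9 decodes as termination by signal 9.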

For the XgridDRMAA installer, I’ll just have it put symlinks in /Library/Java/Extensions to both XgridDRMAA.jar (so it’s in the classpath) and XgridDRMAA (as libdrmaa.jnilib, so it’s in the library path). That way, the user of Java DRMAA apps won’t have to do any additional work (besides setting things up in the prefpane) to use Xgrid.

GridSweeper design overview

July 7th, 2006

Although XgridDRMAA has not quite stabilized yet, it’s time to move on to serious work on GridSweeper. (I’ll use it as a test suite for XgridDRMAA—I can run the code using Grid Engine’s DRMAA and XgridDRMAA, and problems in the latter will no doubt emerge.)

I spent a while at my whiteboard scrawling a mind map; here’s a simplified version in more legible form:

Development priorities for this software:

  1. Parameter control This is sort of the point: converting compact representations of parameter combinations into big long lists of parameter settings to be run.
  2. Plug-in interface This is how parameter settings get translated into control parameters for specific classes of models—e.g., Repast models, general command-line parameters, etc.
  3. Grid control This is the other part of the point: submitting lists of parameter settings to the grid. Very straightforward, thanks to DRMAA.
  4. CLI I need some way of interacting with the system (aside from writing new main() methods) as early as possible.
  5. Preferences Good to be able to save settings to shortcut things for both the CLI and the GUI—e.g.,
  6. File transfer interface Unfortunately, you can’t count on having a shared filesystem. (In fact, I don’t have a shared filesystem for my “grid” of two Macs.) So you need a way to transfer output files that aren’t stdin/out/err (which is provided for by DRMAA). I think the simplest solution is to just support FTP servers, my previous ramblings about having a custom file-transfer daemon notwithstanding. Most bang for my coding time, thanks to the Jakarta Commons Net FTP library.
  7. GUI This is the most open-ended component, so I’ll leave it for the end, and it can be as sophisticated or as simple as I have time for.

C bindings complete

July 6th, 2006

The Objective-C DRMAA implementation has now been wrapped in C as per the 1.0 DRMAA C binding spec. Far more code than expected, but all very straightforward code.

All that’s left is some real testing of the C layer, filling in a couple holes in the Objective-C code (most notably supporting file transfer via scp from other hosts), and doing the Java bindings, which will consist essentially of code lifted from the Sun Open Source Grid Engine code base.

Almost there…

June 28th, 2006

I’m very close to a full DRMAA implementation for Xgrid (still just in Objective-C), or as full an implementation as is currently possible with Xgrid. The only major missing feature right now is bulk jobs.

The biggest hurdle has been the fact that Xgrid doesn’t support a number of things needed by the specification. The most important of those are: (1) setting the working directory, and (2) actually getting useful information about job execution, exit status, etc.

The only way I saw to do this was to wrap each and every Xgrid job in a proxy executable, xgrid_drmaa_proxy. This proxy sets the environment, arguments, and stdin for the command being run; runs it; and retrieves resource usage data using the wait4 system call.

Some interesting and frustrating things I learned along the way:

  • I knew that NSTask is a great class for running other processes. Makes things so easy. But you can’t use wait4() on that process to get usage info. Apparently NSTask is doing funny things on another thread that interfere.
  • The combination of fork(), dup2(), execve() and wait() is very powerful, as long as you remember the following: (1) close one end of each of the redirected pipes; and (2) manually set argv[0] to contain the launch path.
  • Running an NSRunLoop recursively from something called back by running the run loop works, until you start dealing with finicky networking code to download files from Xgrid. Re-trying calls with -[NSObject performSelector:withObject:afterDelay:] is far more effective. I plan to switch all my recursive running of run loops to this model (or, if easier, condition-waits with NSConditionLock).
  • My biggest annoyance: XgridFoundation will accept @"YES" and @"NO" as values for whether a submitted file is executable or not, but not, say, [NSNumber numberWithBool:YES]. That’s stupid. Consider this the first (second?) in a long series of rants (and bug reports to Apple) about XgridFoundation. This one took me a *long* time, and a trip to Charles’s GridStuffer source code, to figure out.

I’m going to take a break from this until Monday—work on some eco-stuff. Come Monday, bulk jobs, C bindings, and Java bindings will be the only things left (aside from a few detailed loose ends). Hopefully a release with installer early next week!

Wait/synchronize redone

June 27th, 2006

The 1.0 DRMAA spec wasn’t completely clear on multithreaded behavior, so I went to the drmaa-wg mailing list to ask a couple questions:

  • What happens if two threads try to wait simultaneously for the same job to complete? Do they both get the job info data back, or does the earlier call get the data while the later call gets an invalid job error? (Answer: only one call gets the data.)
  • What about synchronize? If one thread is waiting on a call, and another thread is waiting on a bunch of calls including that one, if the first thread gets the job info back, should the synchronize call get an error? (Answer: no error in this case. Since synchronize doesn’t get data back, it’s fine as long as the job finished.)

These are edge cases—things that would probably never happen in the real world—you’d have to be a bit crazy to be querying about the same job from a whole bunch of threads—but it should still be done the Right Way.

The right way to do this is to maintain a queue of all the calls that have come in from different threads, so that the ordering of the calls is a known quantity when the KVO observer gets notification about the state change of a job—without the queue, you can’t be sure which thread you’re supposed to wake up.

When a notification comes in from Xgrid that a job has finished, the observer method does the following to the call queue:

  1. Find the first call in the queue, if any, that reaps the job info: this could be a wait call on the specific job id, or a wait(any) call, or a synchronize call with dispose=true.
  2. If one is found, reap the job info, and notify the calling thread if it’s a wait call.
  3. Find all subsequent wait calls (not including wait(any) calls) that are waiting on this job. Set an error, and wake their threads up as well.
  4. Look through all the synchronize calls to see if any care about this job. Remove this job from their list of jobs to monitor. If they have no more jobs to monitor, wake their threads up.
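
The four steps above can be sketched as a single pass over the queue. The real implementation is Objective-C and wakes actual threads; this Java sketch is only a model of the bookkeeping, with hypothetical names, where "waking" a call means recording an outcome and removing it from the queue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class PendingCall {
    enum Kind { WAIT, WAIT_ANY, SYNCHRONIZE }

    final Kind kind;
    final Set<String> jobIds;  // WAIT: one id; SYNCHRONIZE: ids still pending
    final boolean dispose;     // only meaningful for SYNCHRONIZE
    String outcome;            // set when the call's thread should be woken

    private PendingCall(Kind kind, boolean dispose, String... ids) {
        this.kind = kind;
        this.dispose = dispose;
        this.jobIds = new HashSet<>(Arrays.asList(ids));
    }

    static PendingCall forJob(String id) { return new PendingCall(Kind.WAIT, false, id); }
    static PendingCall forAny() { return new PendingCall(Kind.WAIT_ANY, false); }
    static PendingCall forSync(boolean dispose, String... ids) {
        return new PendingCall(Kind.SYNCHRONIZE, dispose, ids);
    }

    // Would this call consume the job info for the given job?
    boolean reaps(String jobId) {
        if (kind == Kind.WAIT_ANY) return true;
        if (kind == Kind.WAIT) return jobIds.contains(jobId);
        return dispose && jobIds.contains(jobId);
    }
}

class CallQueue {
    final List<PendingCall> calls = new ArrayList<>();  // in arrival order

    void jobFinished(String jobId) {
        // Step 1: the first call in the queue that reaps the job info.
        PendingCall reaper = null;
        for (PendingCall c : calls) {
            if (c.reaps(jobId)) { reaper = c; break; }
        }
        for (Iterator<PendingCall> it = calls.iterator(); it.hasNext();) {
            PendingCall c = it.next();
            if (c.kind == PendingCall.Kind.SYNCHRONIZE) {
                // Step 4: stop monitoring this job; wake when nothing is left.
                c.jobIds.remove(jobId);
                if (c.jobIds.isEmpty()) c.outcome = "done";
            } else if (c == reaper) {
                // Step 2: a wait call gets the job info.
                c.outcome = "info:" + jobId;
            } else if (c.kind == PendingCall.Kind.WAIT && c.jobIds.contains(jobId)) {
                // Step 3: later waits on the same job get an error.
                c.outcome = "error";
            }
            if (c.outcome != null) it.remove();
        }
    }
}
```

Because the reaper is the first call that could reap, every other wait on the same job necessarily comes after it in the queue, which is why the loop can treat "not the reaper" as "subsequent".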

This is now implemented. Not tested heavily, but it works for my simple single-threaded tests. (Yes, that’s a very bad way to test multithreaded behavior. More tests to come. :))