XgridDRMAA overview

June 4th, 2006

Because GridSweeper implementation details will take longer to hash out, and because I’d love to get the system working with Xgrid as soon as possible (for selfish personal reasons), the first code I will write will be the Xgrid DRMAA implementation, affectionately and creatively called XgridDRMAA for short. Here’s an overview of the design.

Components

Cocoa DRMAA Implementation
Although there is no official Objective-C/Cocoa binding specification for DRMAA (for obvious reasons), XgridFoundation is a Cocoa API, so the DRMAA implementation will inevitably be Cocoa-based at some level. So, I thought, why not just create an Objective-C DRMAA interface? The structure will mirror the Java interface very closely. I’ll see if the DRMAA Working Group folks want to make this a standard binding. If so, great; if not, understandable (as probably only Xgrid people will be using it).

C DRMAA Implementation
Easy part #1: wrap the Cocoa implementation in C, as per the DRMAA C Bindings document. Use the SGE implementation as a supplemental reference.

Java DRMAA Implementation
Easy part #2: wrap the Cocoa implementation in Java, as per the DRMAA Java Bindings document. (Here is version 0.6.2; version 1.0 will be updated for JDK 1.5 and nice things like generics and typesafe enums.) This will actually be a more natural mapping, thanks to the stronger object-orientation. I predict lots of JNI calls to objc_msgSend(). Again, use the SGE implementation as a supplemental reference.

XgridDRMAA preference pane
For reasons described below, it makes a lot of sense to let each user choose his/her favorite grid, and have DRMAA automatically use that one unless special steps are taken to use something else. This would fit nicely in a preference pane. Addendum: Charles notes in the comments that there are environment variables for specifying a controller host. But there doesn’t seem to be one for specifying a specific grid on that host, so you might get the wrong one if there are multiple available grids.

Packaging

XgridDRMAA.pkg
A standard Mac OS X installer package to install XgridDRMAA.framework (in /Library/Frameworks/) and XgridDRMAA.prefPane (in /Library/PreferencePanes).

XgridDRMAA.framework
The three APIs will be packaged in a single Mac OS X umbrella framework, XgridDRMAA.framework, which will contain one “real” framework for each language binding.

XgridDRMAA-Cocoa.framework
The Cocoa/Objective-C DRMAA interface and implementation. This is the meat of the package, because this is where all the code interacting with XgridFoundation lives.

XgridDRMAA-C.framework
The C interfaces (wrapping the Objective-C code).

XgridDRMAA-Java.framework
The Java implementation (also wrapping the Objective-C code, via JNI), in the Java package com.edbaskerville.xgrid_drmaa. A version of Dan Templeton’s org.ggf.drmaa classes will also be included, modified to default to Xgrid rather than SGE, but still with the capability to select the SGE DRMAA at runtime.

XgridDRMAA.prefPane
The grid-selection preference pane (see notes below).

Why a Preference Pane

Sun Grid Engine has a very simple, effective method for selecting a grid/cell combination: the SGE_ROOT and SGE_CELL environment variables. These selections, nicely enough, carry over directly into DRMAA, so there is in fact no grid selection/authentication code whatsoever in the DRMAA API. Pretty nice.

Addendum, cont’d: Xgrid has the XGRID_CONTROLLER_HOSTNAME and XGRID_CONTROLLER_PASSWORD environment variables, which work if there’s only one grid on the controller. Inexplicably, there’s no XGRID_CONTROLLER_GRID, however (the equivalent to SGE_CELL). Furthermore, there’s no enforcement in the XgridFoundation API that applications use, or even default to, these settings.

The easy and simple solution: make a preference pane that lets the user select his/her grid of choice, and have DRMAA just use that one. The DRMAA-based application, then, won’t need to know anything about Xgrid grid selection or authentication. There might be good reasons, however, why different applications might want to use different grids, so I’ll also provide supplemental API to select a different grid before making any DRMAA calls. For most applications and people, though, I bet being able to select a standard grid on a per-user basis will be good enough.
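To make the selection idea concrete, here is a minimal Java sketch of one possible precedence order: an explicit per-application override first, then the per-user preference, then the XGRID_CONTROLLER_HOSTNAME environment variable. The GridSelection and Grid names, the controller.example host, and the ordering itself are assumptions for illustration; none of this is XgridDRMAA or XgridFoundation API.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of grid-selection precedence; not actual XgridDRMAA code. */
final class GridSelection {

    /** A controller host plus an optional grid name on that controller. */
    static final class Grid {
        final String controllerHost;
        final String gridName; // null means "whatever grid the controller treats as default"

        Grid(String controllerHost, String gridName) {
            this.controllerHost = controllerHost;
            this.gridName = gridName;
        }

        @Override
        public String toString() {
            return controllerHost + "/" + (gridName == null ? "(default grid)" : gridName);
        }
    }

    /**
     * Assumed precedence: explicit API override, then the per-user preference,
     * then the XGRID_CONTROLLER_HOSTNAME environment variable.
     */
    static Optional<Grid> resolve(Optional<Grid> apiOverride,
                                  Optional<Grid> userPreference,
                                  Map<String, String> env) {
        if (apiOverride.isPresent()) {
            return apiOverride;
        }
        if (userPreference.isPresent()) {
            return userPreference;
        }
        String host = env.get("XGRID_CONTROLLER_HOSTNAME");
        if (host != null && !host.isEmpty()) {
            // No environment variable names a grid, so only the host is known here.
            return Optional.of(new Grid(host, null));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for whatever the preference pane would have stored.
        Optional<Grid> preferred = Optional.of(new Grid("controller.example", "Modeling"));
        Optional<Grid> chosen = resolve(Optional.empty(), preferred, System.getenv());
        System.out.println(chosen.map(Grid::toString).orElse("no grid configured"));
    }
}
```

The point of passing the environment in as a map is that the DRMAA-facing code never has to know where the answer came from; it just asks for the resolved grid.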

With all of this XgridDRMAA work, the hope is that Apple will bring the code, or at least the concepts, into Xgrid itself at some point in the future. Not for Leopard, I don’t imagine, but for whatever cat comes next perhaps, after the thing has been field-tested for a while.

DRMAA Java: first run

June 2nd, 2006

I got a basic DRMAA program running. It lets you execute any command + arguments via the grid.

The code is here.

Compile with:

javac -cp $SGE_ROOT/lib/drmaa.jar DrmaaTest.java

Run with:

java -cp .:$SGE_ROOT/lib/drmaa.jar DrmaaTest [command] [args]

On a shared-filesystem SGE setup, stdout and stderr will show up as files in the current directory. Pretty spiffy. DRMAA appears to be really simple to use, and should be pretty simple to implement for Xgrid. I’m highly optimistic!

File transfer

June 2nd, 2006

In a sophisticated network setup like a typical Sun Grid Engine installation, a GridSweeper user will have the luxury of a shared filesystem, a network home directory, etc., etc., meaning that no files will need to be transferred as part of job submission. However, this isn’t always the case. With extra work, it’s apparently possible to set up SGE without a shared filesystem. And many Xgrid users, especially if they’re installing Xgrid for the sake of using GridSweeper on their simple Repast model and network of four Macs, won’t have any shared file system at all.

Although Xgrid provides built-in facilities for transferring files, SGE and DRMAA do not; today, the typical user of these systems is on a well-managed network. But I want GridSweeper to be easy to set up for any Joe Repast modeler with a few computers. It might seem like too much network overhead to send a model’s executable code plus input data, and then retrieve output data on the other end. But compared to typical runtimes for ABMs, the time it takes to transfer a little executable is nothing. So providing a general, easy solution to this problem, I think, is vital.

All the obvious solutions to the file transfer problem come down to requiring the user to set up some kind of file-transfer infrastructure—NFS, AFP, FTP, SCP, etc., etc. But this defeats the whole purpose of ease of use: now they have to set something up!

All roads, in my view, point to including a simple file server as part of GridSweeper. This daemon will typically run on the same machine as, say, the Xgrid controller or SGE qmaster. The client GUI and command line will provide tools to add files to the GridSweeper file daemon. Additionally, clients will be able to upload files on a per-batch basis. When the agent/execution host* starts running, first it will see if it needs to download any files from the file daemon. (Timestamp checking & caching will ensure that if five runs on the same host all need the same file, it will only be downloaded once.)
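The timestamp check could be as simple as the following Java sketch, which assumes the file daemon can report a last-modified time in milliseconds for each file it holds. FileCacheCheck and needsDownload are hypothetical names; how the agent actually asks the daemon for that timestamp is left out entirely.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

/** Hypothetical sketch of the agent-side cache check; not GridSweeper code. */
final class FileCacheCheck {

    /**
     * Returns true when the cached copy is missing or older than the copy
     * on the file daemon, i.e. when the agent has to download it again.
     */
    static boolean needsDownload(Path cachedCopy, long daemonModifiedMillis) throws IOException {
        if (!Files.exists(cachedCopy)) {
            return true;
        }
        long cachedMillis = Files.getLastModifiedTime(cachedCopy).toMillis();
        return cachedMillis < daemonModifiedMillis;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a cached file with a known timestamp.
        Path cached = Files.createTempFile("model", ".jar");
        Files.setLastModifiedTime(cached, FileTime.fromMillis(1_000_000L));

        System.out.println(needsDownload(cached, 2_000_000L)); // daemon copy is newer
        System.out.println(needsDownload(cached, 1_000_000L)); // cache is current

        Files.delete(cached);
    }
}
```

With a check like this, five runs on the same host that need the same file would only trigger a download on the first one, as long as the downloaded copy keeps (or postdates) the daemon’s timestamp.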

At the end of a run, the GridSweeper monitor will send any output files back to the file daemon. Using a monitoring tool/GUI, the user will be able to download any results.

* this terminology difference between SGE and Xgrid is really starting to get to me. I bet DRMAA has its own set of terms.

SGE basics

June 2nd, 2006

Sadly, things never quite worked right on my local install of SGE. But it turns out UM CSCS already has one set up that I can use. So that’s what I’m doing.

A summary of basic commands in SGE…

qsub
Submits a job in a shell script. If you try to submit an executable binary, it won’t work.
-m b|e|a|s|n
Option to qsub that says when to send mail: at the beginning (b), at the end (e), on abort/rescheduling (a), on suspension (s), or not at all (n).
qstat
Displays queue status.
qmon
Really ugly GUI for submitting and controlling/monitoring jobs.
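Since qsub wants a shell script rather than a binary, one workaround is to generate a tiny wrapper script around the real command. Here is a minimal Java sketch of that idea; QsubWrapperScript, shellQuote, and forCommand are hypothetical names, and the sketch only builds the script text (writing it to a file and handing it to qsub is not shown).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that wraps a command line in a /bin/sh script for qsub. */
final class QsubWrapperScript {

    /** Quotes one word for /bin/sh by wrapping it in single quotes. */
    static String shellQuote(String word) {
        // A single quote inside the word is closed, escaped, and reopened: ' -> '\''
        return "'" + word.replace("'", "'\\''") + "'";
    }

    /** Builds a two-line script that replaces the shell with the given command. */
    static String forCommand(List<String> commandAndArgs) {
        if (commandAndArgs.isEmpty()) {
            throw new IllegalArgumentException("need at least a command");
        }
        StringBuilder script = new StringBuilder("#!/bin/sh\nexec");
        for (String word : commandAndArgs) {
            script.append(' ').append(shellQuote(word));
        }
        return script.append('\n').toString();
    }

    public static void main(String[] args) {
        // Example command; the path and jar name are placeholders.
        System.out.print(forCommand(Arrays.asList("/usr/bin/java", "-jar", "model.jar")));
    }
}
```

Using exec means the job’s exit status is the command’s own exit status, with no extra shell process left hanging around.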

Grid Engine installation, Episode III

May 30th, 2006

Now that the qmaster is set up and NFS is set up, I can finally set the machines up as execution hosts. I’m doing it in parallel on both machines. cd /usr/local/gridengine, sudo -s, and finally ./install_execd, and we’re on our way to another bulleted list describing lots of screens…

  • Welcome: Why, thank you!
  • Checking directory: Looks good: /usr/local/gridengine
  • Cells: Also good: algore
  • Checking hostname resolving: This worked fine out of the box on the qmaster machine, but on the PowerBook the Bonjour name wasn’t getting resolved properly. So I added an entry to /etc/hosts on astor.local: [local IP address] darwin.local. That fixed things.
  • Local spool directory configuration: No local spools.
  • Creating local configuration: Done!
  • execd startup script: Yes! Done!
  • execution daemon startup: Started up!
  • Adding a queue for this host: Done on both. Looks like the 2-processor G5 detected two processors, and the 1-processor PowerBook detected 1. Smarty smarty. But a problem: “unable to resolve host [‘darwin’ | ‘astor’]”…I hope this doesn’t mean everything breaks.
  • The rest… is just information already shown during the other installation. I hope that name-resolution problem doesn’t bite me in the ass.

Well, it looks like everything’s done. Testing…tomorrow. Time to sleep.

Setting up NFS

May 30th, 2006

Turns out you have to have an NFS share for your SGE_ROOT directory. So I set up NFS.

I followed the GUI instructions from this one. In short, you add an /exports entry to NetInfo with settings for the directory you want to export. I couldn’t get the exports to show up right for a long time, but restarting the machine fixed that problem.

To set up an NFS automount on the PowerBook, there’s some more setup to be done, described here. The gist is to set up a NetInfo entry in /mounts for the server.

End result: my /usr/local/gridengine on darwin.local maps to the same directory on astor.local.

Grid Engine installation installation

May 29th, 2006

Fresh from a nice Memorial Day picnic lunch in Dolores Park, it feels like time to take a nap. But I’m going to install the Grid Engine instead! Here comes the installation part of the installation process.

Getting the Software

I downloaded the Grid Engine 6.0u8 common files and Mac OS X binaries linked from here and unpacked the contents of each into /usr/local/gridengine on both of my machines.

Then I set the $SGE_ROOT environment variable in the system-wide /etc/bashrc file, and added the binary directory to the standard $PATH:


export SGE_ROOT=/usr/local/gridengine
export PATH=$SGE_ROOT/bin/darwin:$PATH

and did a source /etc/bashrc to update my session’s environment variables.

Setting up the Master Host

Making sure I was in the $SGE_ROOT directory and in a sudo -s session, I ran this on good old Astor:

./install_qmaster

I followed through some screens:

  • Admin user: At the first screen, I said OK to use ebaskerv (my user account) as the admin user.
  • root directory: The root directory was right.
  • TCP/IP services: As requested, I added sge_qmaster to my /etc/services file, and in anticipation added one for sge_execd:
    sge_qmaster 781/tcp
    sge_execd 782/tcp
  • Cells: Named my cell algore, as promised.
  • qmaster spool directory: Default is fine: /usr/local/gridengine/algore/spool/qmaster
  • Windows Execution Host Support: Are you going to install Windows Execution Hosts? Are you kidding me? At least, by Judas, the default is no.
  • File permissions: I said no when asked if I had already verified and set file permissions. My guess is these would need fixing. I said yes at the next screen (please verify and set my permissions) and all looked hunky-dory (“Your file permissions were set”).
  • Hostname resolving method: This asks if all my hosts are in one DNS domain. I’m going to cross my fingers and hope that the zeroconf pseudo-domain local. will work, and answer yes.
  • Making directories: This seemed to go fine. (“Mrs. Crabapple and Principal Skinner were in the closet making directories, and I saw one of the directories, and the directory looked at me!”) RETURN!
  • Setup spooling: I chose classic spooling, because I had this suspicion that BerkeleyDB wasn’t ever installed properly on my machine. I’m looking for simplicity, not performance. The spooling database seemed to be initialized properly on the next screen.
  • Group id range: For some strange reason, the Grid Engine needs a range of UNIX group ids to assign dynamically to jobs. I’m pretty sure the example range 20000-20100 is free and large enough, so I’ll use that.
  • Cluster configuration: First up: execd_spool_dir. The default seems fine. Then, administrator email: I gave it my email, but I don’t think email sending is even set up right on my machine, so it probably won’t work.
  • Creating local configuration: This seemed to work…
  • qmaster/scheduler startup script: Apparently, it knows how to set up a startup script. I’ll let it go ahead and try…wow! It put something in /Library/StartupItems! Clever girl.
  • qmaster and scheduler startup: Started up successfully!
  • hosts: This is easy: just two for now. astor.local. and darwin.local., maybe more later. (They misspelled “separated” in “Please enter a blank seperated list of hosts.”) This seemed to go correctly. I said no to a shadow host, partially because I like to live dangerously, and mostly because my grid consists of two computers. Then, the default queue and hostgroup were added: just astor.local. (maybe I have to add darwin.local. manually later).
  • Scheduler tuning: Went with Normal.
  • Using gridengine: Looks like they provide a nice script to set all the environment variables. So I replaced my old bashrc line with:
    . /usr/local/gridengine/algore/common/settings.sh
  • Messages: FYI, messages logged to:
    /tmp/qmaster_messages
    /tmp/execd_messages
    /usr/local/gridengine/algore/spool/qmaster/messages
    [execd_spool_dir]/[hostname]/messages
    and startup scripts are at:
    /usr/local/gridengine/algore/common/sgemaster (qmaster and scheduler)
    /usr/local/gridengine/algore/common/sgeexecd (execd)
  • Almost done: “Your Grid Engine qmaster installation is now completed” says the friendly screen. Now I get to start the execution host installation. Next post.

Grid Engine installation preparation

May 29th, 2006

Here goes trying to install the open-source Grid Engine 6.0u8 on Tiger. It would be nice if there were a Mac OS X installer package…if I have extra time (ha) maybe I’ll put one together.

I can already see that Xgrid is an infinitely simpler system. Apple wins on ease-of-use already—just based on the instructions in the Plan the Installation section of the Grid Engine manual.

SGE, on the other hand, looks way more powerful. Sophisticated scheduling, intelligent matching of available resources to job needs, etc., etc. I like.

For my own personal use, Xgrid looks great. But I’m going to slog through, because I think I’d better get some hands-on use of the reference implementation of DRMAA before writing my own new implementation.

First, some preliminary notes on how the Grid Engine works…

Definitions

master host
Runs master daemon and scheduler daemon—basically, controls the system. Equivalent to the Xgrid controller. By default, also an administration host and submit host.
shadow master host
A system that can detect a failure of master and take over. Despite my mission-critical enterprise-grade infrastructure, I won’t bother dealing with these.
execution host
Systems that execute jobs. Equivalent to an Xgrid agent.
administration host
Systems that carry out any “administrative activity.” I guess this means editing jobs, adjusting controller settings, etc.?
submit host
Systems that allow users to submit batch jobs. Like an Xgrid client.
queue
Container for jobs that can run on one or more hosts concurrently. Sort of a sub-grid. Can include any subset of hosts on the system.

Daemons

sge_qmaster
The master daemon. Handles all controller activity except scheduling decisions.
sge_schedd
The scheduling daemon: decides where to send jobs and how to order and prioritize them.
sge_execd
Execution daemon—actually runs jobs. Runs on execution hosts.

With this background, I can actually start thinking about how the hell to set up my own system! Here are the decisions I made for my giant 2-host grid:

Decisions

  • Single cluster: My system will be a single cluster, rather than a collection of sub-clusters. My system consists, at last count, of my personal machines: a G5 and a four-year-old PowerBook. I’ll try to convince my roommates to let me use their machines too. At least they’re all connected via InfiniBand! Ha, just kidding.
  • Hosts: The G5 will be everything: master, administration, submit, and execution. The PowerBook will be everything except a master.
  • Users: “Ensure that all users of the grid engine system have the same user names on all submit and execution hosts.” This isn’t a decision! It’s an order!
  • Software Directories: I guess I’ll put a full directory tree on both machines so I don’t have to think about what to install and what not to install.
  • Queue Structure: One grid, one cluster, one queue; will include all (2) execution hosts. Easy peasy.
  • Network Services: I have no idea what an NIS file is (Solaris thing?), so I guess that means I’ll set things up as “local to each workstation in /etc/services”.
  • Gathering Information: Another command: “Use the information in this chapter to gather the information necessary to complete the installation worksheet.” Decisions my ass.

I guess I’ll fill out their silly little worksheet. It looks like it might be useful…

Necessary Information

Parameter: Value
sge-root directory: /usr/local/gridengine
cell name: George W. Bush! My hero! Er, no, I’ll call it Al Gore.
administrative user: ebaskerv (c’est moi)
sge_qmaster port number: Uh…we’ll see what they use in the default file.
sge_execd port number: Ditto.
master host: astor.local., G5 of my heart
shadow master hosts: Nada.
execution hosts: astor.local. darwin.local.
administration hosts: astor.local. darwin.local.
submit hosts: astor.local. darwin.local.
group ID range for jobs: I have no freaking clue. With one grid, probably doesn’t matter.
spooling mechanism: Classic spooling sounds easier than messing with Berkeley DB.
Berkeley DB server host: NA
Berkeley DB spooling directory: NA
scheduler tuning profile: “Normal” sounds good to me.
installation method: automated?
If you are going to install N1GE 6 on a Windows system, acquire and install Microsoft Services for UNIX. See Appendix A for more information. What is this Windows you speak of?
If you are going to install N1GE 6 on a Windows system, create the required CSP certificates before installing N1GE. See the section called “How to Install a CSP-secured System” in Chapter 4 for information about CSP certificates. I see, it must be an operating system for people who want things to be even more complicated.
Check the Other Installations Appendix for applicability. Aigoo!

This post is getting very long. Ah well, I press on.

Aw, fuck, I just noticed they have a guide to all of those table entries. Let’s see if that changes anything…well, they use 536 and 537 as ports in their example. Maybe those are free. And perhaps interactive installation will be better.

Well, it looks like it’s time to start installing. I’ll cover that in the next post.

GridSweeper preliminaries

May 29th, 2006

Today I begin work on GridSweeper. (Which, I learned through Google, shares a name with what looks like a MineSweeper clone, detailed about halfway down this page. I’m not particularly worried about confusion.)

I’m not going to write any code of significance this week: rather, I’m going to get really familiar with Sun’s Grid Engine system, DRMAA, and test out doing manual runs of Repast, etc. with Xgrid and the Grid Engine, just to see what it will take.

I was worried about how the Grid Engine DRMAA implementation worked in Java. At first glance, I saw classes in org.ggf.drmaa and worried that Sun’s DRMAA packages were implemented inside that package. In fact, it’s nicely separated: the DRMAA interface is in org.ggf.drmaa, and com.sun.grid.drmaa contains the Sun implementation. So I can just put an Xgrid implementation in com.edbaskerville.xgrid_drmaa or something like that, and you’ll be able to select between the two grid systems *at runtime*!
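The interface/implementation split is what makes runtime selection possible. As a rough illustration of the general technique (not the real org.ggf.drmaa classes), here is a Java sketch that picks an implementation class by name via reflection. GridSession, XgridSession, SgeSession, and the grid.session.impl system property are all made-up names for this example.

```java
/** Hypothetical sketch of choosing a grid implementation at runtime by class name. */
final class RuntimeSelection {

    /** Stand-in for a DRMAA-style session interface. */
    interface GridSession {
        String systemName();
    }

    /** Stand-in for an Xgrid-backed implementation. */
    static final class XgridSession implements GridSession {
        @Override
        public String systemName() {
            return "Xgrid";
        }
    }

    /** Stand-in for an SGE-backed implementation. */
    static final class SgeSession implements GridSession {
        @Override
        public String systemName() {
            return "SGE";
        }
    }

    /** Loads the named class and instantiates it through its no-argument constructor. */
    static GridSession create(String implementationClassName) throws ReflectiveOperationException {
        Class<? extends GridSession> type =
                Class.forName(implementationClassName).asSubclass(GridSession.class);
        return type.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        // Default to the Xgrid stand-in unless the (made-up) property says otherwise.
        String name = System.getProperty("grid.session.impl", XgridSession.class.getName());
        System.out.println(create(name).systemName());
    }
}
```

Application code only ever sees the interface, so swapping grid systems comes down to changing one class name, for example with a -D flag on the java command line.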

To elaborate on the deliverables listed in the proposal, these are the pieces I plan to build:

  • GridSweeper: The actual project. See the proposal.
  • Xgrid DRMAA implementation (C): This will be the meat of my DRMAA work. It will wrap the Objective-C XgridFoundation library in the standard DRMAA C interface, all packaged up in XgridDRMAA.framework.
  • Xgrid DRMAA implementation (Java): Mirroring the SGE Java implementation, this will just be a JNI wrapper for the Xgrid DRMAA implementation in C. Packaged in com.edbaskerville.xgrid_drmaa, class files included in XgridDRMAA.framework.
  • Objective-C DRMAA interface: If I have time, an Objective-C wrapper for the C API. (Not useful for GridSweeper, but just a nice thing to have, and not very much work!) I’ll propose this to the drmaa-wg as a standard interface. Yes, that’s right: this will be an Objective-C wrapper for the DRMAA C interface layer to the Objective-C XgridFoundation API. But I’ll also make this play nice with SGE: in short, mirror the structure of the Java APIs, allowing you to select from different implementations at runtime. This will be included in XgridDRMAA.framework.

The goal with the Xgrid DRMAA stuff is to have Apple roll it into Mac OS X someday, replacing my C wrapper to their Objective-C API with something that connects directly to the Xgrid internals. It would be nice for all this stuff to be in XgridFoundation someday.

MIDI reimplemented

May 29th, 2006

Last night I reimplemented MIDI support in LilyPad. It turned out to be really easy to use the AudioToolbox framework’s MusicSequence and MusicPlayer opaque C data types + functions. One bug to be fixed: if a new preview happens while you’re playing, the old sequence continues to play. This can be comical, but not useful.