Archive for the ‘GridSweeper’ Category

CLI usage scenarios

Wednesday, July 12th, 2006

The most important piece of unfinished business in the GridSweeper design is what exactly the command-line interface will look like. It’s funny—I grew up on an old-school Mac, scoffing at my primitive DOS-using fourth-grade schoolmates. What an awful way to interact with a computer: remember arcane commands and type them in! But as soon as you start doing software development, or system administration, or anything that needs to be automated, the command line is often more efficient.

I have the same goal for the GridSweeper command-line tools as for the graphical interface: make it easy to run parameter sweeps of models. More accurately, make the most common types of parameter sweeps very easy to do; and make other types of sweeps possible, and as easy as possible.

Scenario 1: Multiple Parameters, Ranges, All Combinations

The most common usage scenario is to vary one or more parameters, and run the model one or more times for each combination of parameters. So if there are 3 parameters being varied, each with 4 different values, and the model is being run 10 times with different random seeds, there will be a total of 4 x 4 x 4 x 10 = 640 runs.

Let’s say a model has three parameters, beta, gamma, and nu. Beta will go from 0.3 to 0.6 in increments of 0.1; gamma from 1.0 to 1.3; and nu from 0.1 to 0.4. The model will be run 10 times with different random seeds.

The syntax will go something like this:

grepast mymodel -n10 beta=0.3:0.1:0.6 gamma=1.0:0.1:1.3 nu=0.1:0.1:0.4

A breakdown of the pieces:

  • grepast will be a tool that just calls “gridsweeper repast”, telling the gridsweeper tool that this is a repast model, so the parts of the process that need to be handled by the repast plug-in will be.
  • mymodel says to use mymodel.jar in the current directory. If there’s a shared filesystem (this will be settable in a configuration file or in the GUI), nothing will be transferred over the network except the complete path to the file; if FTP is being used, this file will be staged to the FTP server before running the job, and downloaded by the job on the execution machines.
  • beta=0.3:0.1:0.6 etc. are the key: you can specify ranges of values with super-simple syntax: [start]:[increment]:[end].
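
To make the arithmetic concrete, here is a minimal Python sketch of how range arguments like these could be expanded into the full list of runs. It is an illustration only, not GridSweeper code: the names expand_range and all_combinations are made up for this example, and it assumes the end value of a range is inclusive, which is what the 640-run count implies.

```python
from decimal import Decimal
from itertools import product

def expand_range(spec):
    """Expand 'start:increment:end' into a list of Decimals (end inclusive)."""
    start, step, end = (Decimal(part) for part in spec.split(":"))
    if step <= 0:
        raise ValueError("increment must be positive")
    values = []
    value = start
    while value <= end:
        values.append(value)
        value += step  # Decimal avoids binary floating-point drift (0.1 + 0.2 etc.)
    return values

def all_combinations(assignments, runs_per_combination):
    """Yield (settings, run_index) for every combination of parameter values."""
    names = []
    value_lists = []
    for assignment in assignments:
        name, spec = assignment.split("=", 1)
        names.append(name)
        value_lists.append(expand_range(spec))
    for combo in product(*value_lists):
        for run in range(runs_per_combination):
            yield dict(zip(names, combo)), run

runs = list(all_combinations(
    ["beta=0.3:0.1:0.6", "gamma=1.0:0.1:1.3", "nu=0.1:0.1:0.4"], 10))
print(len(runs))  # 4 * 4 * 4 * 10 = 640
```

Decimal is used here so that repeatedly adding 0.1 lands exactly on the end value; with plain floats the last value of a range can be silently dropped.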

Scenario 2: Multiple Parameters, Specified Values, All Combinations

Sometimes you don’t want to specify ranges and increments, but simply a particular set of values for each parameter. You’ll be able to specify a vector of values using commas:

grepast mymodel -n10 beta=0.1,0.4,0.7 gamma=1.0,1.4,1.9

Or, if you want, you can mix range/increment lists with specific values:

grepast mymodel -n10 beta=0.1:0.1:0.5,0.7,1.3
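
As a rough sketch of how such a mixed list might be expanded, the Python below treats each comma-separated item as either a single value or a start:increment:end range. The function name expand_values is hypothetical, and inclusive range ends are an assumption carried over from Scenario 1.

```python
from decimal import Decimal

def expand_values(spec):
    """Expand a comma-separated list of single values and start:increment:end ranges."""
    values = []
    for item in spec.split(","):
        parts = [Decimal(piece) for piece in item.strip().split(":")]
        if len(parts) == 1:
            # A plain value such as 0.7
            values.append(parts[0])
        elif len(parts) == 3:
            # A range such as 0.1:0.1:0.5, end value included
            value, step, end = parts
            if step <= 0:
                raise ValueError("increment must be positive")
            while value <= end:
                values.append(value)
                value += step
        else:
            raise ValueError(f"cannot parse {item!r}")
    return values

print([str(v) for v in expand_values("0.1:0.1:0.5,0.7,1.3")])
# ['0.1', '0.2', '0.3', '0.4', '0.5', '0.7', '1.3']
```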

Scenario 3: Multiple Parameters, Specific Combinations

Another common need is to run certain combinations of parameters, but not others. For example, beta=0.3/gamma=0.5 and beta=0.4/gamma=0.6, but not beta=0.3/gamma=0.6. This is accomplished by separating parameter names with semicolons (quotes inserted so the shell sees the whole thing as one argument):

grepast mymodel -n10 "beta;gamma = 0.3;0.5, 0.4;0.6"

If you’d rather specify lists of values with all beta values together and all gamma values together, that’s fine too—just remember that commas separate parameter values for a particular parameter; semicolons separate values for different parameters:

grepast mymodel -n10 "beta;gamma = 0.3,0.4; 0.5,0.6"

Extending this a step further, you’ll be able to combine range/increment lists with this syntax:

grepast mymodel -n10 "beta;gamma = 0.3:0.1:0.6; 0.6:0.1:0.9"

is equivalent to

grepast mymodel -n10 "beta;gamma = 0.3;0.6, 0.4;0.7, 0.5;0.8, 0.6;0.9"

and to

grepast mymodel -n10 "beta;gamma = 0.3,0.4,0.5,0.6; 0.6,0.7,0.8,0.9"

This is as much complexity as command-line syntax will support, though. Beyond this, it’s probably time to use a control file anyway (to be covered in a later post).
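
The equivalence of the three forms above can be checked with a small Python sketch. Everything in it is an assumption made for illustration: the function names, and in particular the rule for telling the two layouts apart (if splitting the values on semicolons gives exactly one group per parameter name, the groups are per-parameter lists; otherwise the values are read as comma-separated combinations).

```python
from decimal import Decimal

def expand_values(spec):
    """Expand comma-separated values, each a single value or start:increment:end."""
    values = []
    for item in spec.split(","):
        parts = [Decimal(piece) for piece in item.strip().split(":")]
        if len(parts) == 1:
            values.append(parts[0])
        elif len(parts) == 3:
            value, step, end = parts
            if step <= 0:
                raise ValueError("increment must be positive")
            while value <= end:
                values.append(value)
                value += step
        else:
            raise ValueError(f"cannot parse {item!r}")
    return values

def linked_combinations(argument):
    """Parse 'a;b = ...' into a list of dicts, one per specific combination."""
    names_part, values_part = argument.split("=", 1)
    names = [name.strip() for name in names_part.split(";")]
    groups = values_part.split(";")
    if len(groups) == len(names):
        # Layout "0.3,0.4; 0.5,0.6": one value list per parameter, paired up in order.
        columns = [expand_values(group) for group in groups]
        if len({len(column) for column in columns}) != 1:
            raise ValueError("linked parameters need value lists of equal length")
        rows = list(zip(*columns))
    else:
        # Layout "0.3;0.5, 0.4;0.6": one semicolon-separated combination per comma.
        rows = [tuple(Decimal(v.strip()) for v in combo.split(";"))
                for combo in values_part.split(",")]
        if any(len(row) != len(names) for row in rows):
            raise ValueError("each combination needs one value per parameter")
    return [dict(zip(names, row)) for row in rows]

a = linked_combinations("beta;gamma = 0.3:0.1:0.6; 0.6:0.1:0.9")
b = linked_combinations("beta;gamma = 0.3;0.6, 0.4;0.7, 0.5;0.8, 0.6;0.9")
c = linked_combinations("beta;gamma = 0.3,0.4,0.5,0.6; 0.6,0.7,0.8,0.9")
print(a == b == c)  # True
```

Unlike Scenarios 1 and 2, the value lists are zipped together rather than multiplied out, so four beta values and four gamma values give four combinations, not sixteen.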

GridSweeper design overview

Friday, July 7th, 2006

Although XgridDRMAA has not quite stabilized yet, it’s time to move on to serious work on GridSweeper. (I’ll use it as a test suite for XgridDRMAA—I can run the code using Grid Engine’s DRMAA and XgridDRMAA, and problems in the latter will no doubt emerge.)

I spent a while at my whiteboard scrawling a mind map; here’s a simplified version in more legible form:

Development priorities for this software:

  1. Parameter control: This is sort of the point. It converts compact representations of parameter combinations into big long lists of parameter settings to be run.
  2. Plug-in interface: This is how parameter settings get translated into control parameters for specific classes of models (e.g., Repast models or general command-line parameters).
  3. Grid control: This is the other part of the point, submitting lists of parameter settings to the grid. Very straightforward, thanks to DRMAA.
  4. CLI: I need some way of interacting with the system (aside from writing new main() methods) as early as possible.
  5. Preferences: Good to be able to save settings to shortcut things for both the CLI and the GUI.
  6. File transfer interface: Unfortunately, you can’t count on having a shared filesystem. (In fact, I don’t have a shared filesystem for my “grid” of two Macs.) So you need a way to transfer output files that aren’t stdin/out/err (which are provided for by DRMAA). I think the simplest solution is to just support FTP servers, my previous ramblings about having a custom file-transfer daemon notwithstanding. Most bang for my coding time, thanks to the Jakarta Commons Net FTP library.
  7. GUI: This is the most open-ended component, so I’ll leave it for the end, and it can be as sophisticated or as simple as I have time for.
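
To illustrate what the plug-in interface in item 2 might amount to, here is a deliberately tiny Python sketch. Nothing in it comes from the actual GridSweeper design: the class names, the build_command method, and the --name=value argument convention are all hypothetical, chosen only to show one set of parameter settings being turned into something a grid job can run.

```python
from abc import ABC, abstractmethod

class ModelPlugin(ABC):
    """Hypothetical plug-in: turns one set of parameter settings into a command."""

    @abstractmethod
    def build_command(self, model, settings, seed):
        """Return the argument list for a single run."""

class CommandLinePlugin(ModelPlugin):
    """Passes each parameter as a --name=value argument (an assumed convention)."""

    def build_command(self, model, settings, seed):
        args = [f"--{name}={value}" for name, value in sorted(settings.items())]
        return [model, f"--seed={seed}"] + args

plugin = CommandLinePlugin()
command = plugin.build_command("./mymodel", {"beta": "0.3", "gamma": "1.0"}, seed=1)
print(command)  # ['./mymodel', '--seed=1', '--beta=0.3', '--gamma=1.0']
```

A plug-in for a different class of model would implement the same method but produce whatever that kind of model expects, such as a parameter file instead of arguments.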

DRMAA Java: first run

Friday, June 2nd, 2006

I got a basic DRMAA program running. It lets you execute any command + arguments via the grid.

The code is here.

Compile with:

javac -cp $SGE_ROOT/lib/drmaa.jar DrmaaTest.java

Run with:

java -cp .:$SGE_ROOT/lib/drmaa.jar DrmaaTest [command] [args]

On a shared-filesystem SGE setup, stdout and stderr will show up as files in the current directory. Pretty spiffy. DRMAA appears to be really simple to use, and should be pretty simple to implement for Xgrid. I’m highly optimistic!

File transfer

Friday, June 2nd, 2006

In a sophisticated network setup like a typical Sun Grid Engine installation, a GridSweeper user will have the luxury of a shared filesystem, a network home directory, etc., etc., meaning that no files will need to be transferred as part of job submission. However, this isn’t always the case. With extra work, it’s apparently possible to set up SGE without a shared filesystem. And many Xgrid users, especially if they’re installing Xgrid for the sake of using GridSweeper on their simple Repast model and network of four Macs, won’t have any shared file system at all.

Although Xgrid provides built-in facilities for transferring files, SGE and DRMAA do not; today, the typical user of these systems is on a well-managed network. But I want GridSweeper to be easy to set up for any Joe Repast modeler with a few computers. It might seem like too much network overhead to send a model’s executable code, plus input data, and retrieve output data on the other end. But compared to typical runtimes for ABMs, the time it takes to transfer a little executable is nothing. So providing a general, easy solution to this problem, I think, is vital.

All the obvious solutions to the file transfer problem come down to requiring the user to set up some kind of file-transfer infrastructure—NFS, AFP, FTP, SCP, etc., etc. But this defeats the whole purpose of ease of use: now they have to set something up!

All roads, in my view, point to including a simple file server as part of GridSweeper. This daemon will typically run on the same machine as, say, the Xgrid controller or SGE qmaster. The client GUI and command line will provide tools to add files to the GridSweeper file daemon. Additionally, clients will be able to upload files on a per-batch basis. When the agent/execution host* starts running, first it will see if it needs to download any files from the file daemon. (Timestamp checking & caching will ensure that if five runs on the same host all need the same file, it will only be downloaded once.)

At the end of a run, the GridSweeper monitor will send any output files back to the file daemon. Using a monitoring tool/GUI, the user will be able to download any results.
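
The timestamp check on the execution host could look roughly like the Python sketch below. It is only a stand-in: a local directory plays the role of the file daemon, and the function name fetch_if_stale is invented for this example. The point is just the rule "transfer only when the cached copy is older than the daemon's copy".

```python
import os
import shutil
import tempfile

def fetch_if_stale(remote_path, cache_dir):
    """Copy remote_path into cache_dir unless the cached copy is already current.

    Returns (local_path, downloaded), where downloaded says whether a copy happened.
    """
    local_path = os.path.join(cache_dir, os.path.basename(remote_path))
    remote_mtime = os.path.getmtime(remote_path)
    if os.path.exists(local_path) and os.path.getmtime(local_path) >= remote_mtime:
        return local_path, False  # cached copy is at least as new: skip the transfer
    shutil.copy2(remote_path, local_path)  # copy2 also preserves the timestamp
    return local_path, True

# Five runs on one host asking for the same file: only the first one transfers it.
with tempfile.TemporaryDirectory() as daemon_dir, tempfile.TemporaryDirectory() as cache_dir:
    jar = os.path.join(daemon_dir, "mymodel.jar")
    with open(jar, "w") as handle:
        handle.write("placeholder")
    downloads = [fetch_if_stale(jar, cache_dir)[1] for _ in range(5)]
print(downloads)  # [True, False, False, False, False]
```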

* this terminology difference between SGE and Xgrid is really starting to get to me. I bet DRMAA has its own set of terms.

SGE basics

Friday, June 2nd, 2006

Sadly, things never quite worked right on my local install of SGE. But it turns out UM CSCS already has one set up that I can use. So that’s what I’m doing.

A summary of basic commands in SGE…

qsub
Submits a job in a shell script. By default, if you try to submit an executable binary, it won’t work (the -b y option tells qsub to treat the command as a binary).
  -m b|e|a|s|n
  Tells qsub when to send mail: at the beginning, end, abort/rescheduling, suspension, or not at all.
qstat
Displays queue status.
qmon
Really ugly GUI for submitting and controlling/monitoring jobs.

Grid Engine installation, Episode III

Tuesday, May 30th, 2006

Now that the qmaster is set up and NFS is set up, I can finally set the machines up as execution hosts. I’m doing it in parallel on both machines. cd /usr/local/gridengine, sudo -s, and finally ./install_execd, and we’re on our way to another bulleted list describing lots of screens…

  • Welcome: Why, thank you!
  • Checking directory: Looks good: /usr/local/gridengine
  • Cells: Also good: algore
  • Checking hostname resolving: This worked fine out of the box on the qmaster machine, but on the PowerBook the Bonjour name wasn’t getting resolved properly. So I added an entry to /etc/hosts on astor.local: [local IP address] darwin.local. That fixed things.
  • Local spool directory configuration: No local spools.
  • Creating local configuration: Done!
  • execd startup script: Yes! Done!
  • execution daemon startup: Started up!
  • Adding a queue for this host: Done on both. Looks like the 2-processor G5 detected two processors, and the 1-processor PowerBook detected 1. Smarty smarty. But a problem: “unable to resolve host [‘darwin’ | ‘astor’]”…I hope this doesn’t mean everything breaks.
  • The rest… is just information already shown during the other installation. I hope that name-resolution problem doesn’t bite me in the ass.

Well, it looks like everything’s done. Testing…tomorrow. Time to sleep.

Setting up NFS

Tuesday, May 30th, 2006

Turns out you have to have an NFS share for your SGE_ROOT directory. So I set up NFS.

I followed the GUI instructions from this one. In short, you add an /exports entry to NetInfo with settings for the directory you want to export. I couldn’t get the exports to show up right for a long time, but restarting the machine fixed that problem.

To set up an NFS automount on the PowerBook, there’s some more setup to be done, described here. The gist is to set up a NetInfo entry in /mounts for the server.

End result: my /usr/local/gridengine on darwin.local maps to the same directory on astor.local.

Grid Engine installation installation

Monday, May 29th, 2006

Fresh from a nice Memorial Day picnic lunch in Dolores Park, it feels like time to take a nap. But I’m going to install the Grid Engine instead! Here comes the installation part of the installation process.

Getting the Software

I downloaded the Grid Engine 6.0u8 common files and Mac OS X binaries linked from here and unpacked the contents of each into /usr/local/gridengine on both of my machines.

Then I set the $SGE_ROOT environment variable in the system-wide /etc/bashrc file, and added the binary directory to the standard $PATH:


export SGE_ROOT=/usr/local/gridengine
export PATH=$SGE_ROOT/bin/darwin:$PATH

and did a source /etc/bashrc to update my session’s environment variables.

Setting up the Master Host

Making sure I was in the $SGE_ROOT directory and in a sudo -s session, I ran this on good old Astor:

./install_qmaster

I followed through some screens:

  • Admin user: At the first screen, I said OK to use ebaskerv (my user account) as the admin user.
  • root directory: The root directory was right.
  • TCP/IP services: As requested, I added sge_qmaster to my /etc/services file, and in anticipation added one for sge_execd:
    sge_qmaster 781/tcp
    sge_execd 782/tcp
  • Cells: Named my cell algore, as promised.
  • qmaster spool directory: Default is fine: /usr/local/gridengine/algore/spool/qmaster
  • Windows Execution Host Support: Are you going to install Windows Execution Hosts? Are you kidding me? At least, by Judas, the default is no.
  • File permissions: I said no when asked if I had already verified and set file permissions. My guess is these would need fixing. I said yes at the next screen (please verify and set my permissions) and all looked hunky-dory (“Your file permissions were set”).
  • Hostname resolving method: This asks if all my hosts are in one DNS domain. I’m going to cross my fingers and hope that the zeroconf pseudo-domain local. will work, and answer yes.
  • Making directories: This seemed to go fine. (“Mrs. Krabappel and Principal Skinner were in the closet making directories, and I saw one of the directories, and the directory looked at me!”) RETURN!
  • Setup spooling: I chose classic spooling, because I had this suspicion that BerkeleyDB wasn’t ever installed properly on my machine. I’m looking for simplicity, not performance. The spooling database seemed to be initialized properly on the next screen.
  • Group id range: For some strange reason, the Grid Engine needs a range of UNIX group ids to assign dynamically to jobs. I’m pretty sure the example range 20000-20100 is free and large enough, so I’ll use that.
  • Cluster configuration: First up: execd_spool_dir. The default seems fine. Then, administrator email: I gave it my email, but I don’t think email sending is even set up right on my machine, so it probably won’t work.
  • Creating local configuration: This seemed to work…
  • qmaster/scheduler startup script: Apparently, it knows how to set up a startup script. I’ll let it go ahead and try…wow! It put something in /Library/StartupItems! Clever girl.
  • qmaster and scheduler startup: Started up successfully!
  • hosts: This is easy: just two for now. astor.local. and darwin.local., maybe more later. (They misspelled “separated” in “Please enter a blank seperated list of hosts.”) This seemed to go correctly. I said no to a shadow host, partially because I like to live dangerously, and mostly because my grid consists of two computers. Then, the default queue and hostgroup were added: just astor.local. Maybe I have to add darwin.local. manually later.
  • Scheduler tuning: Went with Normal.
  • Using gridengine: Looks like they provide a nice script to set all the environment variables. So I replaced my old bashrc line with:
    . /usr/local/gridengine/algore/common/settings.sh
  • Messages: FYI, messages logged to:
    /tmp/qmaster_messages
    /tmp/execd_messages
    /usr/local/gridengine/algore/spool/qmaster/messages
    [execd_spool_dir]/[hostname]/messages
    and startup scripts are at:
    /usr/local/gridengine/algore/common/sgemaster (qmaster and scheduler)
    /usr/local/gridengine/algore/common/sgeexecd (execd)
  • Almost done: “Your Grid Engine qmaster installation is now completed” says the friendly screen. Now I get to start the execution host installation. Next post.

Grid Engine installation preparation

Monday, May 29th, 2006

Here goes trying to install the open-source Grid Engine 6.0u8 on Tiger. It would be nice if there were a Mac OS X installer package…if I have extra time (ha) maybe I’ll put one together.

I can already see that Xgrid is an infinitely simpler system. Apple wins on ease-of-use already—just based on the instructions in the Plan the Installation section of the Grid Engine manual.

SGE, on the other hand, looks way more powerful. Sophisticated scheduling, intelligent matching of available resources to job needs, etc., etc. I like.

For my own personal use, Xgrid looks great. But I’m going to slog through, because I think I’d better get some hands-on use of the reference implementation of DRMAA before writing my own new implementation.

First, some preliminary notes on how the Grid Engine works…

Definitions

master host
Runs master daemon and scheduler daemon—basically, controls the system. Equivalent to the Xgrid controller. By default, also an administration host and submit host.
shadow master host
A system that can detect a failure of master and take over. Despite my mission-critical enterprise-grade infrastructure, I won’t bother dealing with these.
execution host
Systems that execute jobs. Equivalent to an Xgrid agent.
administration host
Systems that carry out any “administrative activity.” I guess this means editing jobs, adjusting controller settings, etc.?
submit host
Systems that allow users to submit batch jobs. Like an Xgrid client.
queue
Container for jobs that can run on one or more hosts concurrently. Sort of a sub-grid. Can include any subset of hosts on the system.

Daemons

sge_qmaster
The master daemon. Handles all controller activity except scheduling decisions.
sge_schedd
The scheduling daemon: decides where to send jobs, and how to order and prioritize them.
sge_execd
Execution daemon—actually runs jobs. Runs on execution hosts.

With this background, I can actually start thinking about how the hell to set up my own system! Here are the decisions I made for my giant 2-host grid:

Decisions

  • Single cluster: My system will be a single cluster, rather than a collection of sub-clusters. My system consists, at last count, of my personal machines: a G5 and a four-year-old PowerBook. I’ll try to convince my roommates to let me use their machines too. At least they’re all connected via InfiniBand! Ha, just kidding.
  • Hosts: The G5 will be everything: master, administration, submit, and execution. The PowerBook will be everything except a master.
  • Users: “Ensure that all users of the grid engine system have the same user names on all submit and execution hosts.” This isn’t a decision! It’s an order!
  • Software Directories: I guess I’ll put a full directory tree on both machines so I don’t have to think about what to install and what not to install.
  • Queue Structure: One grid, one cluster, one queue; will include all (2) execution hosts. Easy peasy.
  • Network Services: I have no idea what an NIS file is (Solaris thing?), so I guess that means I’ll set things up as “local to each workstation in /etc/services”.
  • Gathering Information: Another command: “Use the information in this chapter to gather the information necessary to complete the installation worksheet.” Decisions my ass.

I guess I’ll fill out their silly little worksheet. It looks like it might be useful…

Necessary Information

Parameter | Value
sge-root directory | /usr/local/gridengine
cell name | George W. Bush! My hero! Er, no, I’ll call it Al Gore.
administrative user | ebaskerv (c’est moi)
sge_qmaster port number | Uh…we’ll see what they use in the default file.
sge_execd port number | Ditto.
master host | astor.local., G5 of my heart
shadow master hosts | Nada.
execution hosts | astor.local. darwin.local.
administration hosts | astor.local. darwin.local.
submit hosts | astor.local. darwin.local.
group ID range for jobs | I have no freaking clue. With one grid, probably doesn’t matter.
spooling mechanism | Classic spooling sounds easier than messing with Berkeley DB.
Berkeley DB server host | NA
Berkeley DB spooling directory | NA
scheduler tuning profile | “Normal” sounds good to me.
installation method | automated?
If you are going to install N1GE 6 on a Windows system, acquire and install Microsoft Services for UNIX. See Appendix A for more information. | What is this Windows you speak of?
If you are going to install N1GE 6 on a Windows system, create the required CSP certificates before installing N1GE. See the section called “How to Install a CSP-secured System” in Chapter 4 for information about CSP certificates. | I see, it must be an operating system for people who want things to be even more complicated.
Check the Other Installations Appendix for applicability. | Aigoo!

This post is getting very long. Ah well, I press on.

Aw, fuck, I just noticed they have a guide to all of those table entries. Let’s see if that changes anything…well, they use 536 and 537 as ports in their example. Maybe those are free. And perhaps interactive installation will be better.

Well, it looks like it’s time to start installing. I’ll cover that in the next post.

GridSweeper preliminaries

Monday, May 29th, 2006

Today I begin work on GridSweeper. (Which, I learned through Google, shares a name with what looks like a MineSweeper clone, detailed about halfway down this page. I’m not particularly worried about confusion.)

I’m not going to write any code of significance this week: rather, I’m going to get really familiar with Sun’s Grid Engine system, DRMAA, and test out doing manual runs of Repast, etc. with Xgrid and the Grid Engine, just to see what it will take.

I was worried about how the Grid Engine DRMAA implementation worked in Java—at first glance, I saw classes in org.ggf.drmaa and worried that Sun’s DRMAA packages were implemented inside that package. In fact, it’s nicely separated: the DRMAA interface is in org.ggf.drmaa, and com.sun.grid.drmaa contains the Sun implementation. So I can just put an Xgrid implementation in com.edbaskerville.xgrid-drmaa or something like that—and you’ll be able to select between the two grid systems *at runtime*!

To elaborate on the deliverables listed in the proposal, these are the pieces I plan to build:

  • GridSweeper: The actual project. See the proposal.
  • Xgrid DRMAA implementation (C): This will be the meat of my DRMAA work. It will wrap the Objective-C XgridFoundation library in the standard DRMAA C interface, all packaged up in XgridDRMAA.framework.
  • Xgrid DRMAA implementation (Java): Mirroring the SGE Java implementation, this will just be a JNI wrapper for the Xgrid DRMAA implementation in C. Packaged in com.edbaskerville.xgrid-drmaa, class files included in XgridDRMAA.framework.
  • Objective-C DRMAA interface: If I have time, an Objective-C wrapper for the C API. (Not useful for GridSweeper, but just a nice thing to have, and not very much work!) I’ll propose this to the drmaa-wg as a standard interface. Yes, that’s right: this will be an Objective-C wrapper for the DRMAA C interface layer to the Objective-C XgridFoundation API. But I’ll also make this play nice with SGE: in short, mirror the structure of the Java APIs, allowing you to select from different implementations at runtime. This will be included in XgridDRMAA.framework.

The goal with the Xgrid DRMAA stuff is to have Apple roll it into Mac OS X someday, replacing my C wrapper to their Objective-C API with something that connects directly to the Xgrid internals. It would be nice for all this stuff to be in XgridFoundation someday.