Manual wolf

This document gives step-by-step instructions for building a Beowulf cluster. After reading all of the available documentation, I felt there were enough gaps and omissions that a document of my own, accurately describing how to build a Beowulf cluster, would be beneficial.

I first saw Thomas Sterling’s article in Scientific American and immediately got the book, because its title was “How to Build a Beowulf”. NO DOUBT, it was a valuable reference, but it really does not walk you through exactly what to do.

So next, I consulted the web. There were hundreds of web pages, and still, certain important details were vaguely implied, or just plain left out.

After working it out on my own, off and on, for several months, I re-consulted the web, and found a very good web page that stated all of the important details in no uncertain terms: http://www.phy.duke.edu/brahma/beowulf_book/node62.html through node6n.html. It also made me feel better, because that author did the same things I did. Thanks, Brahma!

If you visit the above web pages, you will see that the author suggests configuring each box manually at first; later, once you get the feel of the whole “wolfing up” procedure, you can set up new nodes automatically, which I will describe in a later document.

So here is a description of what I got to work. It is only one example – my example. You may choose a different message passing interface; you may choose a different Linux distribution. You may also spend as much time as I did researching and experimenting, and learn on your own.

I have verified that these instructions work on Red Hat 8 and 6; I am currently experimenting with making them work on other distributions.

Let’s briefly outline your requirements:

1. More than one box, each equipped with a network card.

2. A switch or hub to connect them.

3. Linux

4. A message passing interface. [I used lam]

I recall, during my research, seeing someone use a binary tree of KVM switches so he could switch between every single box in his cluster. A KVM switch is not a requirement, but even a simple TWO-port switch is convenient while setting up and/or debugging.

So let’s get wolfing. Choose the most powerful box to be the head node. Install Linux on it, and choose every package you want. The only requirement is that you [in RH speak] choose “Network Servers”, because you need NFS and rsh. That’s all you need. In my case, though, I was going to develop the Beowulf application on it, so I added X and C development.

Those of you researching Beowulf systems will also know that you can put a second network card in the head node so you can access it from the outside world. Pretty good advice, but in my case I didn’t. The only thing on MY wolf connected to the outside world is the power cord, so I dispensed with all of the extra work of setting up firewalls.

Log on to the head node as root, because you will be doing sysadmin commands.

If you use lam as your message passing interface, the manual tells you to turn OFF the firewalls, because lam uses random port numbers to communicate between nodes. Here is a rule: If the manual tells you to do something, DO IT! The lam manual also tells you to run as a non-root user. Create the same user on every box: every box on the cluster will have the same user “wolf” with the same password.

Another thing I learned the hard way: use a password that obeys the strong password constraints for your distribution. I used an easily typed password like “a” for my user, and the whole thing did not work. When I changed it to a strong one like “B1l3M5l2a”, it worked.

To set up your cluster responsibly, you should have some measure of security. After you create your user, create a group and add the user to it. Then you can restrict your files and directories so that only members of that group can access them:

groupadd beowulf

usermod -g beowulf wolf

… and add the following to .bash_profile:

umask 007

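Now any files created by the user “wolf” [or any group member with the same umask] will automatically be readable and writable only by their owner and the group “beowulf”; everyone else gets no access at all. (Because usermod -g made “beowulf” wolf’s primary group, new files belong to that group.)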
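If you want to see what umask 007 actually does before relying on it, the following shell sketch creates a file and a directory under that umask in a throwaway directory (made with mktemp, so nothing real is touched) and prints their permission bits. It assumes GNU coreutils for stat -c, and demo_umask is just an illustrative name.

```sh
# demo_umask: show the permission bits that "umask 007" gives new files
# and directories. Works in a temporary directory and cleans up after itself.
demo_umask() {
    local dir
    dir=$(mktemp -d) || return 1
    (
        umask 007                 # clear all "other" bits; keep owner and group
        touch "$dir/file"         # files are created from 666, so this gives 660
        mkdir "$dir/subdir"       # directories start from 777, so this gives 770
        stat -c '%a %n' "$dir/file" "$dir/subdir"
    )
    rm -rf "$dir"
}

demo_umask
```

660 means read/write for the owner and the group and nothing for anyone else; 770 adds the execute (search) bit that directories need.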

My network is 192.168.0.nnn because it is one of the “private” IP ranges; Thomas Sterling talks about it on page 106 of his book. It is inside my firewall and works just fine. My head node, which I call “wolf00”, is 192.168.0.100, and every other node is named “wolfnn”, with an IP address of 192.168.0.(100 + nn). I am following the sage advice of many of the web pages out there and setting myself up for an easier task when scaling up my cluster.

Refer to the following web site:

http://www.ibiblio.org/mdw/HOWTO/NFS-HOWTO/server.html#CONFIG

Print that out and keep it at your side. I will walk you through modifying your system to create an NFS server, but I have found this site invaluable, and you may too.

Make a directory for everybody to share:

mkdir /mnt/wolf

chmod 770 /mnt/wolf

chown wolf:beowulf /mnt/wolf -R

Go to the /etc directory, and add your “shared” directory to the exports file:

cd /etc

cat exports

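cat >> exports

/mnt/wolf 192.168.0.0/255.255.255.0(rw)

Type the line, then press Ctrl-D to end the input. The client is written as network/netmask, and there must be no space before “(rw)”; with a space, the options would apply to every host instead of just your network.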
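Since you will add export lines more than once (the install directory later gets one too), here is a small shell sketch that appends a line to an exports file only if it is not already there. The name add_export is my own invention, and the example line uses the network/netmask form with no space before the options, as the exports(5) manual page describes. On a running head node, exportfs -ra (or restarting the nfs service) makes NFS pick up the change.

```sh
# add_export FILE LINE: append LINE to FILE unless an identical line exists.
# Hypothetical helper; FILE is normally /etc/exports on the head node.
add_export() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example, run as root on the head node:
#   add_export /etc/exports '/mnt/wolf 192.168.0.0/255.255.255.0(rw)'
#   exportfs -ra
```

Running it twice leaves a single copy of the line, so it is safe to put in a setup script.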

Now modify hosts. You will see the comments telling you to leave the “localhost” line alone. I blatantly ignored that advice and fixed it to not include my hostname as the loopback address.

The line used to say: 127.0.0.1 wolf00 localhost.localdomain localhost

It now says: 127.0.0.1 localhost.localdomain localhost

Then I added all the boxes on my network. Note: This is not required for the operation of a Beowulf cluster; only convenient for me, so that I may type a simple “wolf01” instead of 192.168.0.101:

192.168.0.100 wolf00

192.168.0.101 wolf01

192.168.0.102 wolf02

192.168.0.103 wolf03
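Typing these entries by hand gets old as the cluster grows. This shell sketch prints hosts lines for a range of nodes using the naming scheme described earlier (wolfnn at 192.168.0.(100 + nn)); print_hosts is a hypothetical helper name. It refuses to go past wolf154, since wolf155 would land on 192.168.0.255, the broadcast address.

```sh
# print_hosts FIRST LAST: print "IP name" lines for wolfFIRST..wolfLAST,
# where wolfNN lives at 192.168.0.(100 + NN).
print_hosts() {
    local n=$1
    if [ "$2" -gt 154 ]; then
        echo "print_hosts: wolf$2 would fall outside 192.168.0.1-254" >&2
        return 1
    fi
    while [ "$n" -le "$2" ]; do
        printf '192.168.0.%d wolf%02d\n' $((100 + n)) "$n"
        n=$((n + 1))
    done
}
```

For example, print_hosts 0 3 >> /etc/hosts appends the four lines above, and print_hosts 0 3 | cut -d' ' -f2 gives just the names, which is what hosts.equiv needs.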

When you use rsh to log in to another box, you will be prompted for a user ID and password. You can fix that with hosts.equiv:

cat > hosts.equiv

wolf00

wolf01

wolf02

wolf03

Make sure that the services you want are enabled:

chkconfig --add rsh

chkconfig --add telnet

chkconfig --add nfs

chkconfig --add rexec

chkconfig --add rlogin

chkconfig --level 3 rsh on

chkconfig --level 3 telnet on

chkconfig --level 3 nfs on

chkconfig --level 3 rexec on

chkconfig --level 3 rlogin on
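These ten chkconfig commands follow one pattern, so a loop can do the same job. This is a shell sketch; enable_services is a hypothetical name, and setting CHKCONFIG=echo gives a dry run that only prints the arguments it would pass. It interleaves the --add and --level calls per service, which should not matter to chkconfig.

```sh
# enable_services SERVICE...: register each service with chkconfig and
# turn it on for runlevel 3, as in the commands above.
# Set CHKCONFIG=echo to print the calls instead of running them.
enable_services() {
    local svc
    for svc in "$@"; do
        ${CHKCONFIG:-/sbin/chkconfig} --add "$svc"
        ${CHKCONFIG:-/sbin/chkconfig} --level 3 "$svc" on
    done
}

# As root:  enable_services rsh telnet nfs rexec rlogin
```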

Telnet? I added this just as a convenience. It is not needed, but it is nice to have while debugging your NFS setup. How are you going to log into a box if you can’t rsh to it? This is the only reason I used the KVM switch: it is useful for going back and forth between the head node and the node I am currently setting up.

…And during startup I saw some services that I know I don’t want, and that in my opinion can be removed:

chkconfig --del atd

chkconfig --del sendmail

To be responsible, we make ssh work.

chkconfig --add sshd

chkconfig --level 3 sshd on

Add to the end of /etc/rc.d/rc.local:

sshd &

Once, do this:

ssh-keygen -b 1024 -f filename -t rsa -N "big fat passphrase"

… and log on once with ssh; it will ask you a question or two (such as whether to accept the other box’s host key) to initialize. After that you should be able to log on with ssh without being asked anything [except the password, but we will fix this].

Lastly, put your message passing interface on the box. You can either build it using the supplied source, or use their precompiled package. It is not in the scope of this document to describe that – I just got the source and followed the directions, and in another experiment I installed their rpm; both of them worked fine. Remember the whole reason we are doing this is to learn – go forth and learn.

Okay, get your network cables out. Install Linux on the first non-head node. Going with my example node names and IP addresses, this is what I chose during setup:

Workstation

auto partition

remove all partitions on system

use LILO as the boot loader

put boot loader on the MBR

host name wolf01

ip address 192.168.0.101

add the user “wolf” with the same password as on all other nodes

NO firewall

ONLY package installed: Network Servers. Unselect all other packages.

I don’t care what else you choose; this is the minimum of what you need. Many Beowulf-ers are using legacy hand-me-down boxes with limited resources, so why fill a box up with non-essential software you will never use? My research has concentrated on finding the minimal configuration to get up and running.

Here’s another very important point. When you move on to an automated install and configuration, you really will NEVER log in to the box. Only during setup and install do I type anything directly on the box. It makes me laugh when I think of the guy with his pile of n-1 KVM switches.

When the computer starts up, it will complain if it does not have a keyboard connected. I was not able to modify the BIOS, because I had older discarded boxes with no documentation, so I just connected a “fake” keyboard. I am in the computer industry and see hundreds of keyboards come and go, and some occasionally end up in the garbage. I take an old dead keyboard out of the garbage and remove JUST the cord with the tiny circuit board up in the corner, where the Num Lock and Caps Lock lights are. Then I plug the cord in, and the computer thinks it has a complete keyboard without incident. Again, you would be better off changing the BIOS setting if you are able to; this is just a trick for when you can’t get into the BIOS setup.

After your newly installed box reboots, log on as root again, and…

1. do the same chkconfig commands stated above to set up the right services.

2. modify hosts: remove “wolf0n” from the localhost line, and add entries for just wolf0n and wolf00.

3. install lam

4. make the dir /mnt/wolf, chmod 777 /mnt/wolf

Up to this point, each node is set up pretty much the same as the head node, except that I do NOT modify the exports file. And I do a new thing or two:

cat >> /etc/fstab

wolf00:/mnt/wolf /mnt/wolf nfs rw,hard,intr 0 0

Then I modify /etc/lilo.conf. The second line of this file says timeout=nn.

This is where my wondrous use of cat and the redirection operators breaks down. Notice every modification I have done so far has been done with cat. Here you need to change that second line to say “timeout=1200”. LILO counts this value in tenths of a second, so 1200 means 120 seconds, or 2 minutes. I broke down and used vi, but you can do it however you want; if you hate vi enough, modify a copy of lilo.conf, put it on a floppy, and just copy it to your newly created system.
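If you would rather not open vi on every node, the same edit can be scripted. This is a sketch with a hypothetical helper name; it uses plain sed with a temporary file instead of sed -i, since the sed on older systems may not support -i, and it writes the result back with cat so the file keeps its original owner and permissions. You still have to run /sbin/lilo afterwards.

```sh
# set_lilo_timeout FILE TENTHS: replace the "timeout=" line in a lilo.conf.
# LILO counts in tenths of a second, so 1200 waits 2 minutes.
set_lilo_timeout() {
    sed "s/^timeout=.*/timeout=$2/" "$1" > "$1.new" &&
        cat "$1.new" > "$1" &&
        rm -f "$1.new"
}

# As root:  set_lilo_timeout /etc/lilo.conf 1200 && /sbin/lilo
```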

After it is modified, run /sbin/lilo as root to make the change take effect. It will say “Added linux *”.

Why do I do this lilo modification? If you have been researching Beowulf on the web and understand everything I have done so far, you may wonder, “I don’t remember reading anything about lilo.conf.”

My Beowulf cluster all sits on a single power strip. I turn on the power strip, and every box on the cluster starts up immediately. As the startup procedure progresses, it mounts file systems. Since the non-head nodes mount the shared directory from the head node, they have to wait until the head node is up with NFS ready to go. So I make each non-head node wait 2 minutes at the lilo step. Meanwhile, the head node comes up and makes the shared directory available. When the 2 minutes are up, the non-head nodes finally boot and find the shared directory waiting for them.

All done! You are almost ready to start wolfing. Reboot your boxes. Did they all come up? Can you ping the head node from each box? Can you ping each node from the head node? Can you telnet? Can you rsh? Don’t worry about doing rsh as root; only as wolf. If you are logged in as wolf, and rsh to a box, does it go automatically, without prompting for password?
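Checking box by box gets tedious, so here is a shell sketch of the ping and rsh checks for one node; check_node is a hypothetical name. Run it as wolf on the head node. It relies on rsh with a command not prompting for a password: if hosts.equiv is not doing its job, the command simply fails and the sketch reports it.

```sh
# check_node HOST: ping HOST once, then run "hostname" on it through rsh.
# Prints "HOST: ok" on success, or says which step failed.
check_node() {
    if ! ping -c 1 "$1" >/dev/null 2>&1; then
        echo "$1: no ping reply"
        return 1
    fi
    if ! rsh "$1" hostname </dev/null >/dev/null 2>&1; then
        echo "$1: rsh failed"
        return 1
    fi
    echo "$1: ok"
}

# As wolf on wolf00:
#   for h in wolf00 wolf01 wolf02 wolf03; do check_node "$h"; done
```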

After the node boots up, log in and run “mount”. Does it show wolf00:/mnt/wolf mounted? On the head node, copy a file into /mnt/wolf. Can you read and write that file from the node box? The shared directory is really not required; it is merely convenient to have a common directory residing on the head node. You can easily use rcp to copy files between boxes. Also, as Sterling states on page 119 of his book, a single NFS server is a serious obstacle to scaling up to large numbers of nodes. I learned this when I went from a small number of boxes to a large number.
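The mount check can be scripted too. On Linux, /proc/mounts lists each mounted file system as device, mount point, type and options; this sketch (is_nfs_mounted is a hypothetical name) looks for the share there, assuming it was mounted under the same wolf00:/mnt/wolf name used in fstab. The optional third argument lets you point it at another file for testing.

```sh
# is_nfs_mounted SOURCE MOUNTPOINT [TABLE]: succeed if SOURCE is mounted on
# MOUNTPOINT with an nfs file system type, according to TABLE (/proc/mounts).
is_nfs_mounted() {
    awk -v src="$1" -v mnt="$2" '
        $1 == src && $2 == mnt && $3 ~ /^nfs/ { found = 1 }
        END { exit !found }
    ' "${3:-/proc/mounts}"
}

# On a node:  is_nfs_mounted wolf00:/mnt/wolf /mnt/wolf && echo "share is up"
```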

Once you can do all the tests shown above, you should be able to run a program. From here on in, the instructions are lam specific. Go back to the head node, log in as wolf, and:

cat > /mnt/wolf/lamhosts

wolf00

wolf01

wolf02

wolf03

wolf04

Go to the lam examples directory, and compile “hello.c”:

mpicc -o hello hello.c

cp hello /mnt/wolf

Then, as shown in the lam documentation, start up lam:

[wolf@wolf00 wolf]$ lamboot -v lamhosts

LAM 7.0/MPI 2 C++/ROMIO - Indiana University

n0<2572> ssi:boot:base:linear: booting n0 (wolf00)

n0<2572> ssi:boot:base:linear: booting n1 (wolf01)

n0<2572> ssi:boot:base:linear: booting n2 (wolf02)

n0<2572> ssi:boot:base:linear: booting n3 (wolf04)

n0<2572> ssi:boot:base:linear: finished

So we are now finally ready to run an app. [Remember, I am using lam; your message passing interface may have different syntax].

[wolf@wolf00 wolf]$ mpirun n0-3 /mnt/wolf/hello

Hello, world! I am 0 of 4

Hello, world! I am 3 of 4

Hello, world! I am 2 of 4

Hello, world! I am 1 of 4

[wolf@wolf00 wolf]$

Recall my point about NFS above. Here I am telling all the nodes to run the executable from the NFS shared directory, which will get pretty bad with a larger number of boxes. You could very easily copy the executable to each box, and then have the mpirun command tell each box to use its own local copy. In fact I have done this, and it worked better than using the NFS-shared executable. Of course, this approach breaks down if my cluster application needs to modify a file shared across the cluster. To this, I say, “Do ‘man autofs’ and see how it says ‘The documentation leaves much to be desired.’” Then you will know what I mean.
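Here is a shell sketch of that copy step. copy_to_nodes is a hypothetical helper, and /home/wolf/hello is just an example destination; whatever path you pick has to exist on every node, including wolf00, because mpirun asks each node to run the same path.

```sh
# copy_to_nodes FILE DEST HOST...: rcp FILE to the path DEST on each HOST.
# Stops at the first host that fails.
copy_to_nodes() {
    local file=$1 dest=$2 host
    shift 2
    for host in "$@"; do
        rcp "$file" "$host:$dest" || { echo "copy to $host failed" >&2; return 1; }
    done
}

# As wolf on wolf00, from the directory holding hello:
#   copy_to_nodes hello /home/wolf/hello wolf00 wolf01 wolf02 wolf03
#   mpirun n0-3 /home/wolf/hello
```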

So now you know how I did it; hope this helps; have fun on your own project.

You may wonder – “Why does he say ‘wolf’ and not ‘beowulf’ like it is supposed to be called?” I say to you – “Because it’s fun. Nobody else calls their cluster a ‘bay wolf’ so I am.”

Auto-wolf

Now let’s automate the install so you may create a node by merely inserting a floppy, and the box will completely build itself with no user interaction.

On the head node, make another directory, an install directory, for everybody to share:

mkdir /mnt/install

Go to the /etc directory, and add your “shared” directory to the exports file:

cd /etc

cat exports

cat >> exports

/mnt/install 192.168.0.0/255.255.255.0(ro)

Go to your original distribution, and note the directory structure. A directory called RedHat contains a directory called RPMS, which contains all the packages that you choose from when doing an install. Copy the whole RedHat directory tree into the /mnt/install directory. When you are done, modify its security so it is accessible only to the group “beowulf”:

chmod 770 /mnt/install -R

chown wolf:beowulf /mnt/install -R

Remember, on the non-head nodes we installed only the Network Servers package and nothing else. So you can carefully choose just the RPMs that are necessary; to see which ones those are, refer to /root/install.log on one of your manually built non-head nodes.
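Here is a rough shell sketch for that comparison. It assumes install.log records each package on a line of the form “Installing NAME-VERSION-RELEASE.”, which is what I would expect from a Red Hat install of this era but is worth checking against your own log, and that the matching file in RPMS is named NAME-VERSION-RELEASE.ARCH.rpm. missing_rpms is a hypothetical name.

```sh
# missing_rpms LOG RPMDIR: list packages named in an install.log that have
# no matching NAME-VERSION-RELEASE.*.rpm file in RPMDIR.
# ASSUMPTION: log lines look like "Installing bash-2.05b-5."
missing_rpms() {
    sed -n 's/^Installing \(.*\)\.$/\1/p' "$1" | while read -r pkg; do
        ls "$2/$pkg".*.rpm >/dev/null 2>&1 || echo "$pkg"
    done
}

# Example, with a copy of a node's /root/install.log on the head node:
#   missing_rpms install.log /mnt/install/RedHat/RPMS
```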

Go to your install CD and, just as you created your original install floppy for your first Linux install, use the “bootnet.img” file to create a network install floppy. Use that floppy to install your next node. Instead of asking you to insert a CD-ROM, it will ask you for the IP address of your head node and the location of the shared install directory. Doing this build also makes sure you have copied all of the right RPMs into your shared install directory: if any file is missing, you will see an error saying so. Copy the missing file to your shared install directory, and continue the install.

Next, install DHCP on your head node. Go to http://tldp.org/HOWTO/DHCP/x369.html and follow the instructions. Here is a basic summary of what I did:

Gunzip and untar the DHCP source file.

Go to the directory that it just created: dhcp-3.0pl2

Do the following commands:

./configure

make

make install

mv work.linux-2.2/server/dhcpd /usr/sbin

route add -host 255.255.255.255 dev eth0

mkdir /var/state/dhcp

touch /var/state/dhcp/dhcpd.leases

and add the following line to the end of /etc/rc.d/rc.local:

/usr/sbin/dhcpd -d -f > /var/log/dhcp.log 2>&1 &

I found that I had to add the following line to dhcpd.conf:

ddns-update-style ad-hoc;

… after the “max-lease-time” line.

After successfully installing the dhcp server on your head node, you may now install a node again with your network install floppy like before. When it comes to the screen that has you enter an IP address, you may now choose “use bootp / dhcp”. After the install is complete, the newly installed box should have a dynamically assigned IP address.

I noticed that when I specified a range of IP addresses in the dhcpd.conf file, the addresses were handed out in reverse order, that is:

First box: x.x.x.255

2nd box: x.x.x.254

3rd box: x.x.x.253

and so on. The web site mentioned before speaks about statically assigning IPs to specific MAC addresses: http://www.phy.duke.edu/brahma/beowulf_book/node64.html

Really, it does not matter which method you use; the goal is to build a box without any user intervention, including the assignment of the IP address. I chose static IPs because I stick a label on the front of each box saying what its IP address is. So when 192.168.0.103 has a problem, I know which box 103 is, and don’t have to hunt down the physical box to do debugging or reloading.
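If you go the static route, each node needs a host declaration in dhcpd.conf that ties its network card’s MAC address to a fixed IP. This shell sketch prints one in ISC dhcpd’s syntax; dhcp_host is a hypothetical name, and the MAC address in the example is made up, so substitute the real address of each card. dhcpd reads its configuration at startup, so restart it after editing dhcpd.conf.

```sh
# dhcp_host NAME MAC IP: print an ISC dhcpd "host" declaration that always
# gives IP to the network card with hardware address MAC.
dhcp_host() {
    printf 'host %s {\n  hardware ethernet %s;\n  fixed-address %s;\n}\n' "$1" "$2" "$3"
}

# Example with a made-up MAC address:
#   dhcp_host wolf03 00:50:56:00:00:03 192.168.0.103 >> /etc/dhcpd.conf
```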

So now we are at a point where we can completely automate our install. By now you probably have reinstalled your nodes enough times to be bored with it, and truly would appreciate if a program would perform these repetitive tasks for us.

Run the kickstart configurator program, which will ask you all the steps you have performed in your installs. It will save your answers in a file, ks.cfg.

In my experience, after making this file, I also had to manually modify it like so:

#Disk partitioning information

part swap --recommended

part / --fstype ext3 --size=1 --grow --maxsize 100000

Sure, there may well have been a proper set of responses within the configurator to create this output, but I found it easier to just modify the file manually as shown.

On your network install floppy, modify the file syslinux.cfg. As you recall, after the box boots off this floppy, it gives you a minute or so to enter any parameters, and then automatically goes into the interactive install. We want to force it to use the kickstart file we generated.

Change the line default linux to say default ks

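Change the line timeout nnn to say timeout 2 (SYSLINUX counts the timeout in tenths of a second, so the floppy waits only 0.2 seconds before booting)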

On the label ks section, add “append ks=floppy” like so:

label ks

kernel vmlinuz

append ks=floppy initrd=initrd.img lang= devfs=nomount ramdisk_size=9216
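The first two of these edits can be scripted once the floppy is mounted (for example with mount -t vfat /dev/fd0 /mnt/floppy). This sketch assumes the file has Unix line endings and contains the lines exactly as “default linux” and “timeout nnn”; the label ks append line is left for you to edit, because its exact contents vary. use_kickstart is a hypothetical name.

```sh
# use_kickstart FILE: in a syslinux.cfg, boot the "ks" label by default and
# cut the prompt timeout to 2 (SYSLINUX counts in tenths of a second).
use_kickstart() {
    sed -e 's/^default linux$/default ks/' \
        -e 's/^timeout .*/timeout 2/' "$1" > "$1.new" &&
        cat "$1.new" > "$1" &&
        rm -f "$1.new"
}

# Example, with the network install floppy mounted on /mnt/floppy:
#   use_kickstart /mnt/floppy/syslinux.cfg
```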

Copy this modified syslinux.cfg on to the network install floppy, along with your ks.cfg file, and do another install. You may sit back and watch the show on your monitor. You should not have to type a thing. It should go through the whole install process, answering every question the way you would have manually entered it.

Lastly, we will add the “post” section to the ks.cfg file. As the earlier instructions show, we do a series of modifications to our newly installed system to make it a Beowulf node: install lam, change files in /etc, start and stop services, and so on.

Here is the resulting ks.cfg file:

#Generated by Kickstart Configurator

#System language

lang en_US

#Language modules to install

langsupport en_US

#System keyboard

keyboard us

#System mouse

mouse none

#System timezone

timezone --utc America/Chicago

#Root password

rootpw --iscrypted big garbage string

#Reboot after installation

reboot

#Use text mode install

text

#Install Red Hat Linux instead of upgrade

install

#Use NFS installation media

nfs --server 192.168.0.100 --dir /mnt/install

#System bootloader configuration

bootloader --useLilo --linear --location=mbr --append ks=floppy

#Clear the Master Boot Record

zerombr yes

#Clear only Linux partitions from the disk

clearpart --linux --initlabel

#Disk partitioning information

part swap --recommended

part / --fstype ext3 --size=1 --grow --maxsize 100000

#Use DHCP networking

network --bootproto dhcp

#System authorization information

auth --useshadow --enablemd5

#Firewall configuration

firewall --disabled

#Do not configure the X Window System

skipx

%packages --resolvedeps

@Network Servers

%post

mkdir /mnt/wolf

chmod 777 /mnt/wolf

/usr/sbin/useradd wolf

sleep 20

/usr/sbin/usermod -p 'big garbage string' wolf

/sbin/chkconfig --add telnet

/sbin/chkconfig --add rsh

/sbin/chkconfig --add nfs

/sbin/chkconfig --add rexec

/sbin/chkconfig --add rlogin

/sbin/chkconfig --level 3 telnet on

/sbin/chkconfig --level 3 rsh on

/sbin/chkconfig --level 3 nfs on

/sbin/chkconfig --level 3 rexec on

/sbin/chkconfig --level 3 rlogin on

/sbin/chkconfig --del atd

/sbin/chkconfig --del sendmail

/sbin/chkconfig --del sshd

/sbin/chkconfig --del kudzu

mkdir /mnt/inst

chmod 777 /mnt/inst

mount 192.168.0.100:/mnt/install /mnt/inst

ls -l /mnt/inst >> /home/wolf/proof1.txt

cp /mnt/inst/etc/lilo.conf /etc

/sbin/lilo >> /home/wolf/proof2.txt 2>&1

ls -l /mnt/inst/etc/ho* >> /home/wolf/proof3.txt

cp /mnt/inst/etc/ho* /etc

cat /mnt/inst/etc/fstab >> /etc/fstab

ls -l /mnt/inst/home >> /home/wolf/proof4.txt

cp /mnt/inst/home/* /home/wolf

rpm -i /home/wolf/lam.rpm

As you see, I chose “reboot” at the end, which puts a little responsibility on your shoulders. If you put the floppy in, turn on the box, and walk away, it will go through all the steps and then reboot. Upon rebooting, it will boot from the floppy and start the whole process over again.

You could do one of two things:

1. Skip the reboot, and let the box just sit there when it is done installing. Then you will have to eject the floppy, and reboot the box with the reboot command, or the “trip over the power cord” reboot, which has its own ups and downs.

2. I timed how long it took to read all of its data off the floppy, which was a minute or two, and then ejected the floppy. The long part, approximately 20 minutes, goes on and on and then reboots safely, because the floppy is out of the way. That gives me an 18-minute window to remember to eject the floppy.

There is another issue worthy of mention: this easy “insert a floppy” install implies that you have saved off any important data from the box, because it will get completely erased and rebuilt. But as the Beowulf documentation out there describes, you should not be saving important data on the nodes – you should view each node as an expendable resource, and at the moment it acts up, you would have no hard feelings in completely destroying and rebuilding it.