This document gives step-by-step instructions for building a Beowulf cluster. After going through all of the documentation that was available, I felt there were enough gaps and omissions that a document of my own, one that accurately describes how to build a Beowulf cluster, would be beneficial.
I first saw Thomas Sterling’s article in Scientific American, and immediately got the book, because its title was “How to Build a Beowulf”. NO DOUBT - it was a valuable reference, but it really does not walk you through instructions on exactly what to do.
So next, I consulted the web. There were hundreds of web pages, and still, certain important details were vaguely implied, or just plain left out.
After working it out on my own, off and on, for several months, I re-consulted the web, and found a very good web page that stated all of the important details in no uncertain terms: http://www.phy.duke.edu/brahma/beowulf_book/node62.html through node6n.html. It also made me feel better, because that author did the same things I did. Thanks, Brahma!
If you go to the above web page, you will see that the author suggests that you manually configure each box, and then, later on, after you get the feel of doing this whole “wolfing up” procedure, you can set up new nodes automatically, which I will describe in a later document.
So here is a description of what I got to work. It is only one example – my example. You may choose a different message passing interface; you may choose a different Linux distribution. You may also spend as much time as I did researching and experimenting, and learn on your own.
I have proven these instructions to work on Red Hat 8 and 6; I am currently doing experiments on making it work on other distributions.
Let’s briefly outline your requirements:
1. More than one box, each equipped with a network card.
2. A switch or hub to connect them.
4. A message passing interface. [I used lam]
I recall, during my research, seeing someone use a binary tree of KVM switches so he could switch between every single box in his cluster. A KVM switch is not a requirement, but even a TWO-port switch is convenient while setting up and/or debugging.
So let’s get wolfing. Choose the most powerful box to be the head node. Install Linux on it, and choose every package you want. The only requirement is that you [in RH speak] choose “Network Servers”, because you need to have NFS and rsh. That’s all you need. But in my case, I was going to develop the Beowulf application there, so I added X and C development.
Those of you researching Beowulf systems will also know how you can have a 2nd network card on the head node so you can access it from the outside world. Pretty good advice, but in my case I didn’t. The only thing on MY wolf connected to the outside world is the power cord. Therefore I dispensed with all of the extra work of setting up firewalls.
Log on to the head node as root, because you will be doing sysadmin commands.
If you use lam as your message passing interface, you will read in the manual to turn OFF the firewalls, because they use random port numbers to communicate between nodes. Here is a rule: If the manual tells you to do something, DO IT! The lam manual also tells you to run as a non-root user. Make the same user for every box. Every box on the cluster will have the same user “wolf” with the same password.
Another thing I learned the hard way: use a password that obeys the strong password constraints for your distribution. I used an easily-typed password like “a” for my user, and the whole thing did not work. When I changed my password to a good one, “B1l3M5l2a” or something, it worked.
In order to responsibly set your cluster up, you should have some measure of security. After you create your user, create a group, and add the user to the group. Then, you may modify your files and directories to only be accessible by the users within that group:
usermod -g beowulf wolf
… and add the following to .bash_profile:
Now any files created by the user “wolf” [or any user within the group] will automatically be only readable by the group “beowulf”.
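The group-creation command and the .bash_profile line themselves are not shown above. Here is a minimal sketch of what they could look like. The groupadd line and the umask value of 007 are my assumptions, chosen to match the 770 permissions used elsewhere in this document; they are not taken from the original text. The small function only demonstrates, in a scratch directory, what that umask does to a newly created file.

```shell
# Run once as root (assumed commands, not confirmed by the text above):
#   groupadd beowulf
#   usermod -g beowulf wolf
#
# Assumed line for the .bash_profile of user wolf:
#   umask 007

# Demonstrate the effect of umask 007 on a new file, in a scratch directory.
show_umask_effect() {
    (
        dir=$(mktemp -d) || exit 1
        umask 007
        touch "$dir/sample"
        stat -c '%a' "$dir/sample"   # GNU stat: print the octal permission bits
        rm -rf "$dir"
    )
}
```

Calling show_umask_effect prints 660: read and write for the owner and the group, nothing for anybody else.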
My network is 192.168.0.nnn because it is one of the “private” network IP ranges. Thomas Sterling talks about it on page 106 of his book. It is inside my firewall, and works just fine. My head node, which I call “wolf00”, is 192.168.0.100, and every other node is named “wolfnn”, with an IP of 192.168.0.100 + nn. I am following the sage advice of many of the web pages out there, and setting myself up for an easier task of scaling up my cluster.
Refer to the following web site:
Print that up, and have it at your side. I will be directing you how to modify your system in order to create an NFS server, but I have found this site invaluable, as you may also.
Make a directory for everybody to share:
mkdir /mnt/wolf
chmod 770 /mnt/wolf
chown wolf:beowulf /mnt/wolf -R
Go to the /etc directory, and add your “shared” directory to the exports file:
cat >> exports
/mnt/wolf 192.168.0.0/255.255.255.0(rw)
Now modify hosts. You will see the comments telling you to leave the “localhost” line alone. I blatantly ignored that advice and fixed it to not include my hostname as the loopback address.
The line used to say: 127.0.0.1 wolf00 localhost.localdomain localhost
It now says: 127.0.0.1 localhost.localdomain localhost
Then I added all the boxes on my network. Note: This is not required for the operation of a Beowulf cluster; only convenient for me, so that I may type a simple “wolf01” instead of 192.168.0.101:
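The host lines themselves are not shown above. Following the naming scheme described earlier (wolfnn at 192.168.0.100 + nn), a small helper can print them. This is only a sketch: the function name and the node range in the example are my own illustrative choices.

```shell
# Print /etc/hosts lines for nodes first..last using the wolfNN scheme:
# node NN gets the address 192.168.0.(100 + NN).
hosts_entries() {
    nn=$1
    last=$2
    while [ "$nn" -le "$last" ]; do
        printf '192.168.0.%d wolf%02d\n' $((100 + nn)) "$nn"
        nn=$((nn + 1))
    done
}

# Example (as root, on each box): hosts_entries 0 3 >> /etc/hosts
```

Generating the lines keeps the name and the address in step, which is the whole point of the 100 + nn convention.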
When you use rsh to remote login to another box, you will be prompted for userid and password. You can fix that with hosts.equiv:
cat > hosts.equiv
Make sure that services that I want are up:
chkconfig --add rsh
chkconfig --add telnet
chkconfig --add nfs
chkconfig --add rexec
chkconfig --add rlogin
chkconfig --level 3 rsh on
chkconfig --level 3 telnet on
chkconfig --level 3 nfs on
chkconfig --level 3 rexec on
chkconfig --level 3 rlogin on
Telnet? I added this just as a convenience. It is not needed, but it is nice to have while debugging your NFS stuff. How are you going to log into a box if you can’t rsh to the box? This is the only reason why I used the KVM switch. It is useful for going back and forth between the head node and the node I am currently setting up.
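The chkconfig commands above all follow one pattern, so they can also be written as a loop. This is a sketch, not a replacement for the list: the function name and the CHKCONFIG override (handy for a dry run) are my own additions, and the loop runs the add and level commands per service rather than in two batches.

```shell
# Register each service and turn it on for runlevel 3.
# CHKCONFIG can be overridden (for example with "echo chkconfig") for a dry run.
enable_services() {
    for svc in "$@"; do
        ${CHKCONFIG:-/sbin/chkconfig} --add "$svc" || return 1
        ${CHKCONFIG:-/sbin/chkconfig} --level 3 "$svc" on || return 1
    done
}

# Example (as root): enable_services rsh telnet nfs rexec rlogin
```

The same function can be reused on every non-head node, where the identical set of services is needed.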
…And, during startup, I saw some services that I know I don’t want, and in my opinion, could be removed:
To be responsible, we make ssh work.
chkconfig --add sshd
chkconfig --level 3 sshd on
add to the end of /etc/rc.d/rc.local:
once, do this:
ssh-keygen -b 1024 -f filename -t rsa -N "big fat passphrase"
… and log on once, where it will ask you a question or two to initialize. After that, you should be able to log on with ssh and not be asked anything [except the password, but we will fix this].
Lastly, put your message passing interface on the box. You can either build it using the supplied source, or use their precompiled package. It is not in the scope of this document to describe that – I just got the source and followed the directions, and in another experiment I installed their rpm; both of them worked fine. Remember the whole reason we are doing this is to learn – go forth and learn.
Okay, get your network cables out. Install Linux on the first non-head node. Going with my example node names and IP addresses, this is what I chose during setup:
remove all partitions on system
use LILO as the boot loader
put boot loader on the MBR
host name wolf01
ip address 192.168.0.101
add the user “wolf” with the same password as on all other nodes
ONLY package installed: network servers. UN select all other packages.
I don’t care what else you choose; this is the minimum of what you need. Like I mentioned earlier, many Beowulf-ers are using legacy hand me down boxes with limited resources, so why fill the box up with non-essential software you will never use? My research has been concentrated on finding that minimal configuration to get up and running.
Here’s another very important point. When you move on to an automated install and config, you really will NEVER log in to the box. Only during setup and install do I type anything directly on the box. It makes me laugh when I think of the guy with his pile of n-1 kvm switches.
When the computer starts up, it will complain if it does not have a keyboard connected. I was not able to modify the BIOS, because I had older discarded boxes with no documentation, so I just connected a “fake” keyboard. I am in the computer industry, and see hundreds of keyboards come and go, and some occasionally end up in the garbage. I get the old dead keyboard out of the garbage, and remove JUST the cord with the tiny circuit board up there in the corner, where the num lock and caps lock lights are. Then I plug the cord in, and the computer thinks it has a complete keyboard and boots without incident. Again, you would be better off modifying your BIOS settings, if you are able to. This is just a trick to use in case you don’t have a BIOS setup program.
After your newly installed box reboots, log on as root again, and…
1. do the same chkconfig commands stated above to set up the right services.
2. modify hosts; remove “wolf0n” from localhost, and just add wolf0n and wolf00.
3. install lam
4. make the dir /mnt/wolf, chmod 777 /mnt/wolf
Up to this point, we are pretty much the same as the head node. I do NOT do the modification of the exports file. And, I do a new thing or two:
cat >> /etc/fstab
wolf00:/mnt/wolf /mnt/wolf nfs rw,hard,intr 0 0
Then I modify /etc/lilo.conf. The 2nd line of this file says timeout=nn
This is where my wondrous use of cat and the redirection operators breaks down. Notice every modification I have done so far has been using cat. So you need to somehow modify that 2nd line to say “timeout=1200”. I broke down and used vi, but you can do it however you want, and if you hate vi enough, modify and copy this lilo.conf file on a floppy and just copy it to your newly created system.
After it is modified, as root, say /sbin/lilo, and it will make the changes take effect. It will say “Added linux *”.
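If you would rather not open vi, the same one-line change can be made with sed. LILO counts its timeout in tenths of a second, so 1200 means 120 seconds, which is the 2 minute wait explained below. This is a sketch under the assumption that the file contains a line starting with timeout=, as described above; the function name is mine.

```shell
# Rewrite the "timeout=" line of a lilo.conf-style file.
# LILO counts the timeout in tenths of a second, so 1200 means 120 seconds.
set_lilo_timeout() {
    conf=$1
    tenths=$2
    tmp="$conf.new"
    # Write the edited copy first, then copy it back over the original so the
    # original file's ownership and permissions are kept.
    sed "s/^timeout=.*/timeout=$tenths/" "$conf" > "$tmp" &&
        cat "$tmp" > "$conf" &&
        rm -f "$tmp"
}

# Example (as root): set_lilo_timeout /etc/lilo.conf 1200 && /sbin/lilo
```

I avoided sed's in-place option on purpose, since older versions of sed do not have it.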
Why do I do this lilo modification? If you were researching Beowulf on the web, and understand everything I have done so far, you would wonder, “I don’t remember reading anything about lilo.conf.”
My Beowulf cluster all sits on a single power strip. I turn on the power strip, and every box on the cluster starts up immediately. As the startup procedure progresses, it mounts file systems. Seeing that the non-head nodes mount the shared directory from the head node, they all will have to wait a little bit until the head node is up, with NFS ready to go. So, I make each non-head node wait 2 minutes in the lilo step. Meanwhile, the head node is coming up, and making the shared directory available. By then, the non-head nodes finally start booting up because lilo has waited 2 minutes.
All done! You are almost ready to start wolfing. Reboot your boxes. Did they all come up? Can you ping the head node from each box? Can you ping each node from the head node? Can you telnet? Can you rsh? Don’t worry about doing rsh as root; only as wolf. If you are logged in as wolf, and rsh to a box, does it go automatically, without prompting for password?
After the node boots up, log in, and say “mount”. Does it show wolf00:/mnt/wolf mounted? On the head node, copy a file into /mnt/wolf. Can you read and write that file from the node box? This is really not required; it is merely convenient to have a common directory reside on the head node. You can easily do rcp to copy files between boxes.
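The ping part of the checks above is easy to script once you have more than a couple of nodes. A sketch, assuming the node names used in this document; the function name and the PING override are my own additions, and the rsh and NFS checks still have to be done by hand.

```shell
# Ping each named host once and report which ones answer.
# Returns non-zero if any host failed. PING can be overridden for testing.
check_nodes() {
    failed=0
    for host in "$@"; do
        if ${PING:-ping} -c 1 "$host" > /dev/null 2>&1; then
            echo "$host up"
        else
            echo "$host DOWN"
            failed=1
        fi
    done
    return $failed
}

# Example (from the head node): check_nodes wolf01 wolf02 wolf03
```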
Once you can do all the tests shown above, you should be able to run a program. From here on in, the instructions are lam specific. Go back to the head node, log in as wolf, and:
cat > /mnt/wolf/lamhosts
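The contents of the lamhosts file are not shown above. A LAM boot schema is simply a list of host names, one per line, so a sketch of creating it could look like this. The function name is mine, and the host list in the example is inferred from the lamboot output further down, not copied from the original file.

```shell
# Write a LAM boot schema: one host name per line, head node first.
write_lamhosts() {
    file=$1
    shift
    printf '%s\n' "$@" > "$file"
}

# Example, using the four nodes that appear in the lamboot output below:
#   write_lamhosts /mnt/wolf/lamhosts wolf00 wolf01 wolf02 wolf04
```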
Go to the lam examples directory, and compile “hello.c”:
mpicc -o hello hello.c
cp hello /mnt/wolf
Then, as shown in the lam documentation, start up lam:
[wolf@wolf00 wolf]$ lamboot -v lamhosts
LAM 7.0/MPI 2 C++/ROMIO -
n0<2572> ssi:boot:base:linear: booting n0 (wolf00)
n0<2572> ssi:boot:base:linear: booting n1 (wolf01)
n0<2572> ssi:boot:base:linear: booting n2 (wolf02)
n0<2572> ssi:boot:base:linear: booting n3 (wolf04)
n0<2572> ssi:boot:base:linear: finished
So we are now finally ready to run an app. [Remember, I am using lam; your message passing interface may have different syntax].
[wolf@wolf00 wolf]$ mpirun n0-3 /mnt/wolf/hello
Hello, world! I am 0 of 4
Hello, world! I am 3 of 4
Hello, world! I am 2 of 4
Hello, world! I am 1 of 4
Recall I mentioned the use
So now you know how I did it; hope this helps; have fun on your own project.
You may wonder – “Why does he say ‘wolf’ and not ‘beowulf’ like it is supposed to be called?” I say to you – “Because it’s fun. Nobody else calls their cluster a ‘bay wolf’ so I am.”
Now let’s automate the install so you may create a node by merely inserting a floppy, and the box will completely build itself with no user interaction.
On the head node, make another directory, an install directory, for everybody to share:
mkdir /mnt/install
Go to the /etc directory, and add your “shared” directory to the exports file:
cat >> exports
/mnt/install 192.168.0.0/255.255.255.0(ro)
Go to your original distribution, and note the directory structure. A directory called RedHat contains a directory called RPMS, which contains all the packages that you choose from when doing an install. Copy the whole RedHat directory tree into the /mnt/install directory. After you are complete, modify their security so they are only accessible to the group “beowulf”:
chmod 770 /mnt/install -R
chown wolf:beowulf /mnt/install -R
Remember, for non-head nodes, we installed only the Network Servers package, and no others. So you may carefully choose just the RPMs that are necessary. To see which ones those are, you can refer to /root/install.log on one of your manually built non-head nodes.
Go to your install CD, and just like you created your original install floppy to do your first Linux install, choose the “bootnet.img” file to create a network install floppy. Use that floppy to install your next node. Instead of asking for you to insert a CD ROM, it will ask you the ip address of your head node, and where the shared install directory is. Doing this build will also make sure you have copied all of the right RPMs into your shared install directory. If any of the files are missing, you will see an error stating that a file is missing. Copy the missing file to your shared install directory, and continue the install.
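Writing the image to the floppy is typically done with dd. A sketch, assuming the CD is mounted at /mnt/cdrom with the image under images/, and that the floppy drive is /dev/fd0; both paths are assumptions about your setup, and the function name is mine.

```shell
# Copy a boot image onto a target device (or file) block for block.
write_boot_image() {
    image=$1
    target=$2
    dd if="$image" of="$target" bs=1440k
}

# Example (as root; both paths are assumptions):
#   write_boot_image /mnt/cdrom/images/bootnet.img /dev/fd0
```

Be careful with the second argument: dd will happily overwrite whatever device you point it at.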
Next, install DHCP on your head node. Go to http://tldp.org/HOWTO/DHCP/x369.html and follow the instructions. Here is a basic summary of what I did:
Gunzip and untar the file.
Go to the directory that it just created: dhcp-3.0p12
Do the following commands:
mv dhcp-3.0p12/work.linux-2.2/server/dhcpd /usr/sbin
route add -host 255.255.255.255 dev eth0
and add the following line to the end of /etc/rc.d/rc.local:
/usr/sbin/dhcpd -d -f > /var/log/dhcp.log 2>&1 &
In my experience, I found that I had to add the following line in dhcpd.conf:
… after the “max-lease-time” line.
After successfully installing the dhcp server on your head node, you may now install a node again with your network install floppy like before. When it comes to the screen that has you enter an IP address, you may now choose “use bootp / dhcp”. After the install is complete, the newly installed box should have a dynamically assigned IP address.
I saw, when specifying a range of IP addresses in the dhcpd.conf file, the addresses were handed out in reverse order – that is:
First box: x.x.x.255
2nd box: x.x.x.254
3rd box: x.x.x.253
and so on. The web site mentioned before speaks about statically assigning IPs to specific MAC addresses: http://www.phy.duke.edu/brahma/beowulf_book/node64.html
Really, it does not matter if you use one method over another – the goal is to build a box without any user intervention, including the assignment of the IP address. The reason I chose to use static IPs is that I stick a label on the front of each box saying what its IP address is. So when 192.168.0.103 has a problem, I know which box it is.
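For the static approach, dhcpd.conf takes one host stanza per box, tying a MAC address to a fixed IP. Here is a sketch of a helper that prints such a stanza. The function name is mine and the MAC address in the example is made up; you would substitute the real one from each network card.

```shell
# Print a dhcpd.conf host stanza that pins one MAC address to one IP.
dhcp_host_entry() {
    name=$1
    mac=$2
    ip=$3
    printf 'host %s {\n' "$name"
    printf '    hardware ethernet %s;\n' "$mac"
    printf '    fixed-address %s;\n' "$ip"
    printf '}\n'
}

# Example (the MAC address is made up):
#   dhcp_host_entry wolf03 00:11:22:33:44:55 192.168.0.103 >> /etc/dhcpd.conf
```

With one stanza per node, the label on the front of the box and the address the DHCP server hands out always agree.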
So now we are at a point where we can completely automate our install. By now you probably have reinstalled your nodes enough times to be bored with it, and truly would appreciate if a program would perform these repetitive tasks for us.
Run the kickstart configurator program, which will ask you all the steps you have performed in your installs. It will save your answers in a file, ks.cfg.
In my experience, after making this file, I also had to manually modify it like so:
#Disk partitioning information
part swap --recommended
part / --fstype ext3 --size=1 --grow --maxsize 100000
Sure, there could have possibly been a proper set of responses within the configurator to create this output, but I found it easier to just manually modify the file as shown.
On your network install floppy, modify the file syslinux.cfg. As you recall, after the box boots off this floppy, it gives you a minute or so to enter any parameters, and then automatically goes into the interactive install. We want to force it to use the kickstart file we generated.
Change the line default linux to say default ks
Change the line timeout nnn to say timeout 2
On the label ks section, add “append ks=floppy” like so:
append ks=floppy initrd=initrd.img lang= devfs=nomount ramdisk_size=9216
Copy this modified syslinux.cfg on to the network install floppy, along with your ks.cfg file, and do another install. You may sit back and watch the show on your monitor. You should not have to type a thing. It should go through the whole install process, answering every question the way you would have manually entered it.
Lastly, we will add the “post” section to the ks.cfg file. As the earlier instructions show, we do a series of modifications to our newly installed system to make it a Beowulf node: install lam, change files in /etc, start and stop services, and so on.
Here is the resulting ks.cfg file:
#Generated by Kickstart Configurator
#Language modules to install
timezone --utc America/Chicago
rootpw --iscrypted big garbage string
#Reboot after installation
#Use text mode install
#Install Red Hat Linux instead of upgrade
nfs --server 192.168.0.100 --dir /mnt/install
#System bootloader configuration
bootloader --useLilo --linear --location=mbr --append ks=floppy
#Clear the Master Boot Record
#Clear only Linux partitions from the disk
clearpart --linux --initlabel
#Disk partitioning information
part swap --recommended
part / --fstype ext3 --size=1 --grow --maxsize 100000
#Use DHCP networking
network --bootproto dhcp
#System authorization information
auth --useshadow --enablemd5
#Do not configure the X Window System
chmod 777 /mnt/wolf
/usr/sbin/usermod -p 'big garbage string' wolf
/sbin/chkconfig --add telnet
/sbin/chkconfig --add rsh
/sbin/chkconfig --add nfs
/sbin/chkconfig --add rexec
/sbin/chkconfig --add rlogin
/sbin/chkconfig --level 3 telnet on
/sbin/chkconfig --level 3 rsh on
/sbin/chkconfig --level 3 nfs on
/sbin/chkconfig --level 3 rexec on
/sbin/chkconfig --level 3 rlogin on
chmod 777 /mnt/inst
mount 192.168.0.100:/mnt/install /mnt/inst
ls -l /mnt/inst >> /home/wolf/proof1.txt
cp /mnt/inst/etc/lilo.conf /etc
/sbin/lilo >> /home/wolf/proof2.txt 2>&1
ls -l /mnt/inst/etc/ho* >> /home/wolf/proof3.txt
cp /mnt/inst/etc/ho* /etc
cat /mnt/inst/etc/fstab >> /etc/fstab
ls -l /mnt/inst/home >> /home/wolf/proof4.txt
cp /mnt/inst/home/* /home/wolf
rpm -i /home/wolf/lam.rpm
As you see, I chose “reboot” on the end, which puts a little responsibility on your shoulders. If you put the floppy in, turn on the box, and walk away, it will go through all the steps, and then reboot. Upon rebooting, it will boot from floppy, and start the whole process over again.
You could do two things:
1. Skip the reboot, and let the box just sit there when it is done installing. Then you will have to eject the floppy, and reboot the box with the reboot command, or the “trip over the power cord” reboot, which has its own ups and downs.
2. I timed the amount of time it took to read all of its data off the floppy, which was a minute or two, and then ejected the floppy. The long part, approximately 20 minutes, will go on and on, and reboot itself safely, because I got the floppy out of the way. I have an 18 minute window to remember to eject the floppy.
There is another issue worthy of mention: this easy “insert a floppy” install implies that you have saved off any important data from the box, because it will get completely erased and rebuilt. But as the Beowulf documentation out there describes, you should not be saving important data on the nodes – you should view each node as an expendable resource, and at the moment it acts up, you would have no hard feelings in completely destroying and rebuilding it.