May 14, 2007
Franklin Annex Power Outage May 19th 8am - 12pm
POWER OUTAGE on Saturday May 19, 8AM-12NOON. There is no confirmation that
this will or will not affect FBA P114 and FBA P121 machine rooms. Please
plan accordingly; CETS' plan is to shut down our clusters, and be on hand
(and in communication with facilities) during the outage to confirm whether
or not this particular riser affects those machine rooms. This will prevent
future guesswork and confusion.
Regards,
Dan Widyono
----- Forwarded message from Mike Ferraiolo <mikeferr@pobox.upenn.edu> -----
From: Mike Ferraiolo <mikeferr@pobox.upenn.edu>
Subject: RE: Franklin Building electrical riser shut down (MAY 19TH)
Reply-To: mikeferr@pobox.upenn.edu
Please be advised that the electrical riser shut down in the Franklin
Building will occur on Saturday May 19, 2007 from 8:00AM to 12:00NOON.
This is a 480V buss duct riser that feeds the 277v lighting and 120-208
outlets. The work will entail the installation of a 60A isolation breaker
that will be for the Development Office 5th fl Server.
Please assume that all power will be affected in both Franklin Building and
Franklin Annex for the maximum duration of 4 hours.
----- End forwarded message -----
Blogged with Flock
Posted by ssc_upenn at 12:29 PM | Comments (0) | TrackBack
May 1, 2007
Important: McNeil Server Outage 5/01/2007 4pm.
Due to an electrical malfunction in the McNeil Server room, all SSC servers will be shut down at 4pm. We do not expect the downtime to exceed two hours.
Please ensure that you are not logged onto any of these servers at 4pm. If you remain logged on at shutdown time, you may lose data.
Do not hesitate to contact us with any questions or concerns, particularly if you are unsure whether you are using network drives or other resources hosted on these servers. Thank you for your patience and cooperation on this matter.
The following UNIX servers and services will be affected:
porter.ssc.upenn.edu -- Most SSC Web services
lambic.ssc.upenn.edu -- SSC UNIX file shares
stout.ssc.upenn.edu -- SSC Statistical Computing applications
mailman.ssc.upenn.edu -- SSC-hosted mailing lists
max.econ.upenn.edu -- Econ Beowulf Cluster
icod.econ.upenn.edu -- Econ Beowulf Cluster
Posted by ssc_upenn at 11:47 AM | Comments (0) | TrackBack
January 29, 2007
So you want to build a cluster...
Here's a neat web application that can help you spec out your cluster:
http://cgi.aggregate.org/cgi-bin/cdr.cgi
And here's a write-up on it at Cluster Monkey.
http://www.clustermonkey.net//content/view/181/29/
Posted by ssc_upenn at 12:49 PM | Comments (0) | TrackBack
November 6, 2006
Power Outage Scheduled for Clusters 11/11 2006
This just in...
-------------------------------
Dear FBA Cluster Owners:
Please take note of the scheduled power outage for both
FBA P121 and FBA P114. You should assume that it will
happen unless I post otherwise.
It is the responsibility of each cluster owner to
protect their own equipment in the manner that they
wish. For a two hour outage, this probably means
shutting down the cluster over this time period.
Thanks, John
The Details:
All,
I would like to propose *Saturday, November 11th from 7AM-9AM* as the rescheduled date and time for the electrical shutdown. Details below:
Buildings: Franklin Building and Franklin Building Annex
Utility: Electric
Date: Saturday, November 11, 2006
Time/Duration: 7:00AM – 9:00AM
Areas Affected: All areas of both buildings. All lights and power will be out for 2 hours.
Reason for Shutdown: Landair Wireless and Liberty Electric will install a new circuit breaker into existing switchgear in the 8th floor mechanical room, which is needed to provide power to the Nextel / Sprint wireless antenna installation on the roof.
This means both the old cluster room (jove and momax) and the new cluster room (nemeth) will be affected. I will shut down the clusters at close of business on Friday the 10th, and start them up at the beginning of the day on Monday the 13th.
Posted by ssc_upenn at 3:32 PM | Comments (0) | TrackBack
October 26, 2006
Another course for would-be Cluster-jocks...
Introduction to Beowulf Design, Planning, Building and Administering
The ARC team at Georgetown is putting this course on November 7th through 10th in Washington. At $800, it seems like a reasonable deal. If only I hadn't already spent my travel allowance...
;-)
technorati tags: Beowulf, training, clustering, Linux
Posted by ssc_upenn at 6:17 PM | Comments (0) | TrackBack
October 11, 2006
SuperComputing 06!
The SuperComputing 06 Conference is happening in Tampa. Here's a link to the registration page...
http://sc06.supercomp.org/registration/
Posted by ssc_upenn at 12:02 PM | Comments (0) | TrackBack
September 20, 2006
New Cluster Monkey link...
I've mentioned ClusterMonkey (www.clustermonkey.net) before, but here's a nice article from them....
http://www.clustermonkey.net//content/view/158/32/
Posted by ssc_upenn at 10:18 AM | Comments (0) | TrackBack
September 8, 2006
Welcome Doug McKee
We'd like to take a moment to welcome Doug McKee, who is a postdoc working with Beth Soldo in the Population Studies Center. Doug arrives from UCLA (won't he be surprised by Philadelphia weather...) and has extensive cluster computing experience.
Posted by ssc_upenn at 12:01 PM | Comments (1) | TrackBack
July 31, 2006
How do I kill processes on a cluster compute node?
I'm trying to troubleshoot my cluster program, and after a couple of tries, I realize (by using 'ps -aux | grep $myusername') that I have lots of processes running on the compute nodes that aren't in use. How do I kill these processes, so that I can start with a clean slate?
The answer is to run the command /usr/local/bin/terminate_processes. This is a custom script of ours that terminates all of your processes except for those associated with your connection (i.e., ssh, bash, etc.). It's a good idea to run terminate_processes before starting your next run; it's in your $PATH, so you should just have to type the program name.
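Before and after running terminate_processes, it can help to see which of your processes are actually left over. The sketch below is not the terminate_processes script (its contents are not shown here). It is a hypothetical filter, filter_leftovers, that reads "PID COMMAND" lines and drops the ones that look like part of your login session. The list of session command names (sshd, bash, ps, awk) is an assumption and may need adjusting for your shell.

```shell
# Hypothetical helper, NOT the SSC terminate_processes script.
# Reads "PID COMMAND" lines on stdin and prints only the lines whose
# command is not part of a typical login session (assumed names below).
filter_leftovers() {
    awk '$2 != "sshd" && $2 != "bash" && $2 != "ps" && $2 != "awk"'
}

# Example (run on a compute node):
#   ps -u "$(id -un)" -o pid=,comm= | filter_leftovers
```

If the example prints nothing after you run terminate_processes, the node is clean for your next run.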
Posted by ssc_upenn at 9:42 AM | Comments (0) | TrackBack
July 15, 2006
32-bit intel version of mpich installed on nemeth
I have built and installed a version of mpich compiled with the 32-bit intel compilers, so that you can build 32-bit versions of your programs. It is installed at:
/usr/local/mpich/1.2.7/x86/intel/ssh
The man pages found therein have been added to MANPATH.
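To build against this version, one approach is to put its tools ahead of any other mpich in your PATH. The sketch below assumes the install follows the usual mpich layout, with the compiler wrappers and mpirun in a bin subdirectory; the post does not confirm that, so check the directory first. The function name use_mpich32 is made up for illustration.

```shell
# Install prefix taken from the post above.
MPICH32_HOME=/usr/local/mpich/1.2.7/x86/intel/ssh

# Hypothetical helper: put the 32-bit build first in PATH for this shell.
# Assumption: mpicc and mpirun live in $MPICH32_HOME/bin.
use_mpich32() {
    PATH="$MPICH32_HOME/bin:$PATH"
    export PATH
}

# Example:
#   use_mpich32
#   mpicc -o myjob myjob.c
```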
Posted by ssc_upenn at 6:19 PM | Comments (0) | TrackBack
June 30, 2006
How do I login to the clusters?
To connect to a cluster controlled by SSC, you use a program (on Windows) called Secure CRT:
Hostname: The name of the cluster you're connecting to, provided either by the sysadmin or your prof
Username: Your username, usually the same as your Pennname
Protocol: SSH2
More information on Penn's supported version of Secure CRT can be found at
http://www.upenn.edu/computing/product/specs/securecrt.html
Posted by ssc_upenn at 11:56 AM | Comments (0) | TrackBack
June 29, 2006
MPI class?
Was browsing the web, and ran across this:
Ohio Supercomputing Center MPI Course
Posted by ssc_upenn at 11:51 AM | Comments (0) | TrackBack
June 19, 2006
Another power outage?!?!?!
Just got this from John Yates, SAS Cluster majordomo...
Folks:
As you may know by visiting the room, the next build
out is in progress. There has been an upgrade redesign
in the power provided so that the current PDU will need
to be upgraded for higher capacity. We are looking at
a probable one day power outage in the room in the
last half of July so that the PDU's guts can be
replaced.
I will post follow ups as the schedule is narrowed
down. This is just an early warning.
[Overhead trays will be placed for the full 10 rows
available in the room, a second PDU will be installed
as well as the existing one upgraded, and more Liebert
A/C units will be installed].
Thanks, John
Posted by ssc_upenn at 4:10 PM | Comments (0) | TrackBack
February 7, 2006
Here's a fun link...
As I was perusing the net this morning, I ran across this link. Check it out...
http://www.clustermonkey.net//content/view/16/33/
Posted by ssc_upenn at 8:19 AM | Comments (0) | TrackBack
January 30, 2006
momax node 4 unavailable
Momax compute node 4 is unavailable; it will not POST. More news later...
Posted by ssc_upenn at 10:17 AM | Comments (0) | TrackBack
nemeth.pop.upenn.edu -- rebooted 1/30/2006
Nemeth was rebooted by 10am on 1/30/2006. The node failures on Saturday morning were due to a power outage in the Franklin Annex.
Posted by ssc_upenn at 10:12 AM | Comments (0) | TrackBack
January 28, 2006
FYI: nemeth
Apparently, most of the compute nodes on the nemeth cluster have ceased responding to network requests. I'll reboot the cluster first thing, Monday morning.
This message also posted to ssc-cluster-contacts.
Posted by ssc_upenn at 8:40 AM | Comments (0) | TrackBack
January 17, 2006
nemeth-s8 back online
The rebuild of nemeth-s8 has completed. The reason it didn't start on Friday was that it was unable to get a DHCP offer from the master node, which was busy (both network and disk) servicing the nemeth-s15 rebuild; rule of thumb: one rebuild at a time.
This notice also posted to ssc-cluster-contacts.
Posted by ssc_upenn at 1:52 PM | Comments (0) | TrackBack
January 14, 2006
nemeth update
New hard drives were installed on nemeth-s8 and nemeth-s15. Unfortunately, the auto rebuild routine didn't kick off on slave 8 (although it did on slave 15), so there's still work to be done.
Additionally, the terminate_processes script has been installed on nemeth; please check to make sure it works for you.
This message also posted to the ssc-cluster-contacts list.
Posted by ssc_upenn at 12:32 AM | Comments (0) | TrackBack
December 12, 2005
nemeth node 8 dead -- failed hard drive
Compute node 8 has a failed hard drive, which has been reported to the manufacturer.
Posted by ssc_upenn at 12:24 PM | Comments (0) | TrackBack
nemeth.pop.upenn.edu -- rebooted 12/12/2005
Nemeth was finally rebooted and online at 12:00pm today. Compute node 8 has a failed hard drive, which has been reported to the manufacturer. We continue to try and solve the problems which cause nemeth to unpredictably go down.
This message also sent to the ssc-cluster-contacts mailing list.
Posted by ssc_upenn at 12:22 PM | Comments (0) | TrackBack
November 15, 2005
intel compilers on jove.pop.upenn.edu
I have installed the intel version 9 compilers on jove.pop.upenn.edu. They are installed into /opt/intel, so the next step is to build mpich against them. I will file updates as they are available.
Remarkably, there were no dependency problems with the installation of the compilers themselves.
Update: mpich has been compiled using the intel compilers; the mpich build is in /usr/local/mpich/mpich_intel, so those who want to test the intel-built mpi can do so. I'll work on porting the pgi environment scripts to intel, so that this testing will be easier.
Posted by ssc_upenn at 1:47 PM | Comments (0) | TrackBack
nemeth node 15 dead -- failed hard drive
The hard drive on nemeth node 15 has failed. I have reported the failure to Western Scientific, and await their response. When I have their response, I will post it here.
This message also sent to the ssc-cluster-contacts list.
Posted by ssc_upenn at 10:27 AM | Comments (0) | TrackBack
November 14, 2005
nemeth.pop.upenn.edu -- reboot scheduled 1115 11/14/2005
nemeth.pop.upenn.edu has again ceased accepting new ssh connections; please stop your jobs in anticipation of this reboot. We continue to work with Western Scientific on this issue, which is resource starvation on the IDE bus. AMD-64 chips have a feature called an IOMMU (IO Memory Management Unit), which seems to be at the core of the issue.
Update: nemeth was rebooted at 12:10pm this afternoon.
This message also posted to the ssc-cluster-contacts list.
Posted by ssc_upenn at 10:49 AM | Comments (0) | TrackBack
November 9, 2005
What is the difference between a processor and a node?
When discussing the Beowulf clusters, the distinction between a processor and a node is a fundamental one.
A cluster can be thought of as simply a stack of computers. A node is simply a different name for a computer. For example, jove.pop.upenn.edu is a cluster made up of 13 nodes, or computers. In mpich, which we use on the Econ clusters, you identify a node in the context of the machine file.
The processor is the part of the computer that does the computation work; you may also hear it referred to as the CPU. A node can have more than one CPU in it; for example, jove.pop.upenn.edu and momax.pop.upenn.edu have two CPUs in each node, while nemeth has four CPUs in each node. mpich allows you to customize exactly how many processors you want to use to do a particular job with the -np option to mpirun.
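The number of processors available to a job is the sum of the CPU counts of the nodes it may use. As a rough sketch, the hypothetical helper below adds up those counts from a machine file, assuming the file has one node per line with the node name followed by its CPU count. That layout is an assumption; check it against the machine files on your cluster.

```shell
# Hypothetical helper: add up the CPU counts in a machine file whose lines
# look like "nodename cpus" (this layout is an assumption).
count_processors() {
    awk 'NF >= 2 { total += $2 } END { print total + 0 }' "$1"
}

# Example:
#   count_processors ~/machines.max5
```

The result is the largest processor count that makes sense to request from mpirun for that machine file.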
Posted by ssc_upenn at 11:42 AM | Comments (1) | TrackBack
How do I tell the cluster to only use certain machines when running my code?
My professor has told me that I can only use the first five nodes of the cluster max.econ.upenn.edu. How do I make my code obey this edict?
On the ssc.upenn.edu-managed Beowulf clusters, we use the mpich package to run jobs across multiple nodes. The way to solve your problem is to create a machine file. An example machine file which would work in this instance would look like this:
maxsl1-d 2
maxsl2-d 2
maxsl3-d 2
maxsl4-d 2
maxsl5-d 2
You could save this file to your home directory on max.econ.upenn.edu, calling it machines.max5. Then when you wanted to start running your program, you would use the following command:
# mpirun -machinefile ~/machines.max5 -np 10 myjob
This command says "Run myjob using 10 processors from the list of machines contained in the file machines.max5 at the top level of my home directory."
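If you are given a different node limit later, the machine file can be generated rather than typed by hand. This is a minimal sketch, assuming the node names follow the maxsl<N>-d pattern and the "name cpus" layout from the example above; the function name make_machinefile is made up for illustration.

```shell
# Hypothetical helper: print machine-file lines for the first N nodes.
# The name pattern (maxsl<N>-d) and the "name cpus" layout are copied from
# the example above; check them against your own cluster.
make_machinefile() {
    nodes=$1
    cpus=$2
    i=1
    while [ "$i" -le "$nodes" ]; do
        printf 'maxsl%d-d %d\n' "$i" "$cpus"
        i=$((i + 1))
    done
}

# Example:
#   make_machinefile 5 2 > ~/machines.max5
```

Five nodes with two CPUs each give the 10 processors requested in the mpirun command.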
Posted by ssc_upenn at 11:31 AM | Comments (0) | TrackBack
November 8, 2005
jove.pop.upenn.edu node 6 returns
Node 6 for the cluster jove.pop.upenn.edu has been returned to us from Aspen.
Node 6 will be reinstalled (and restored to the machines.LINUX file) tomorrow morning by 11am.
This message also posted to the ssc-cluster-contacts list.
Update: node 6 was replaced before 11am, and has been made available in the machines_ files for use.
Posted by ssc_upenn at 3:22 PM | Comments (0) | TrackBack
November 7, 2005
nemeth.pop.upenn.edu -- problems logging in
You may experience trouble logging into nemeth.pop.upenn.edu.
The symptoms are the same as the previous issues regarding an exhaustion of available memory space for the IDE controller on the master node. SSC is working with the vendor to assess and troubleshoot the issue.
As soon as there is something to report, it will be posted here, as well as to the ssc-cluster-contacts mailing list.
This message also posted to the ssc-cluster-contacts list
Update: nemeth was rebooted at approximately 4pm on Tuesday afternoon, after appropriate care was taken to ensure that no jobs were being unsafely interrupted.
Posted by ssc_upenn at 3:55 PM | Comments (0) | TrackBack
October 27, 2005
Maintenance, McNeil clusters
To recable the clusters for easier troubleshooting, icod and jove will be down for one hour while I move the network switch from the front to the back of the rack.
I will send a note when this is finished, and the clusters are back online.
This message also posted to the ssc-cluster-contacts list.
Update: The clusters were back online at 12:15pm; icod-sl6 and icod-sl8 are unavailable, as is max-sl6. Check back in this section for updates regarding these nodes.
Update II: icod-sl8 is back online; icod-sl6 has a failed hard drive and failed power supply. I will post an updated time frame for replacement of these parts when it is available.
Posted by ssc_upenn at 11:03 AM | Comments (0) | TrackBack
October 25, 2005
FBA Cluster Outage, Part II
The electrical service which was supposed to have been performed in Franklin Annex this morning did not occur. I have been informed that it *will* occur tomorrow morning. This requires the shutdown of the cluster prior to 6am. The cluster will be restarted at 9am tomorrow morning. I will send a note when the clusters are available again.
The affected clusters are nemeth, jove, and momax.
This message was also posted to the ssc-cluster-contacts list.
Update: The affected clusters were brought online at 9am this morning.
Posted by ssc_upenn at 2:30 PM | Comments (0) | TrackBack
October 24, 2005
Cluster Outages 10/25 FBA
The clusters housed in the Franklin Building Annex:
jove.pop.upenn.edu
momax.pop.upenn.edu
nemeth.pop.upenn.edu
will be unavailable between 6am and 8am tomorrow morning while electrical service is performed.
Please make sure that you have stopped your jobs before 6am.
This notice also has gone to the ssc-cluster-contacts list
Update: The FBA clusters were restarted and operational by 9 am.
Posted by ssc_upenn at 8:28 PM | Comments (0) | TrackBack
April 1, 2005
I thought Beowulf was an epic poem...
I see references to Beowulf clusters on this site. What are they?
A Beowulf cluster is a computer system which is made up of several individual computers (known as 'compute nodes') which work in concert to solve a given problem. The compute nodes' efforts are coordinated by a 'master node', which handles assigning subtasks to the compute nodes, as well as serving as a centralized storage repository for the system as a whole.
We have five Beowulf clusters in SSC (four until nemeth.pop.upenn.edu went live in October). If you have questions about the use of the Beowulf clusters, please contact the SSC Helpdesk.
Posted by ssc_upenn at 2:25 PM | Comments (0)
Cluster Backups
Does SSC back up the information in my home directory on the Beowulf clusters? If not, how can I do it myself?
SSC does not back up the home directories on the Beowulf clusters. It is certainly advisable that cluster users take steps to make sure that their data are backed up to another machine in the event of a cluster failure.
One way to do this is by using a UNIX utility called 'rsync' in concert with SSH to automatically copy your home directory to another UNIX location (your home directory on lambic, say...) and then to synchronize the changes nightly. Below are instructions on how to set this up. Substitute your userid for $username and the name of the cluster for $cluster; lines beginning with # are shell commands you type.
Step 1
Log into lambic.ssc.upenn.edu (or another UNIX server, if you have access to one) using SecureCRT or another SSH client. This is the server you will be backing your cluster files to.
# cd /home/$username
Step 2
Check to see if you already have a public/private key pair on this server.
# cd .ssh
# ls *.pub
Look for a file called id_rsa.pub or id_dsa.pub -- these are public keyfiles. If you find a public key, skip to step 3, below. If not, follow step 2a.
Step 2a
Generate a public key on the UNIX server.
# ssh-keygen -t dsa -N "" -b 1024
This generates a public/private keypair for your userid, and places it in the expected location, above.
Step 3
Copy your public key to the cluster you wish to back up ($cluster).
# scp ~/.ssh/id_dsa.pub $username@$cluster.upenn.edu:/home/$username/.ssh/authorized_keys2
(This copies your public key from the UNIX server and places it on the cluster, which allows you to ssh in to the cluster without having to type a password. This will prove useful in the next step.)
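One caution: copying the key straight to authorized_keys2 with scp replaces any file of that name already on the cluster, along with any keys in it. If you already have keys there, appending is safer. The sketch below is a hypothetical helper, append_key, meant to be run on the cluster after copying the public key to a temporary file; the file names in the example are assumptions.

```shell
# Hypothetical helper: append a public key to an authorized-keys file only
# if that exact line is not already present, leaving existing keys intact.
append_key() {
    pubkey_file=$1
    auth_file=$2
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -x whole line, -F fixed string, -f read the pattern from the key file
    grep -qxFf "$pubkey_file" "$auth_file" || cat "$pubkey_file" >> "$auth_file"
}

# Example (on the cluster, after copying the key to ~/id_dsa.pub.new):
#   append_key ~/id_dsa.pub.new ~/.ssh/authorized_keys2
```

Running it twice with the same key leaves the file unchanged the second time.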
Step 4
Back up your home directory on the cluster to the UNIX server you logged into in Step 1, above.
# mkdir /home/$username/$cluster_backup
This step creates a directory on the local UNIX server where your backup files will be stored.
# rsync -avze ssh $username@$cluster.upenn.edu:/home/$username/ /home/$username/$cluster_backup
Once you run the second command, you should see the message 'generating file list...' and then the names of the files that are being backed up whiz by. Because you've made your public key from the local UNIX server one of the cluster's authorized keys for your username, you should not have to provide your password.
Once you're satisfied that this has worked properly, you can add an entry to cron along these lines (using the command 'crontab -e'):
18 3 * * * /path/to/rsync -avze ssh $username@$cluster.upenn.edu:/home/$username/ /home/$username/$cluster_backup
which would run the backup at 3:18am every morning.
(Remember to substitute your username and the name of the cluster you wish to back up in the above examples.)
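To avoid retyping the substitutions, the backup can be wrapped in a few small shell functions. This is only a sketch: the function names are made up, and it assumes the backup directory lives under your home directory on the local server and is named after the cluster.

```shell
# Hypothetical helpers wrapping the rsync backup described above.
# Arguments in each case: USERNAME CLUSTER (e.g. jdoe max.econ).

# Remote source; the trailing slash copies the directory's contents.
backup_source() {
    printf '%s@%s.upenn.edu:/home/%s/\n' "$1" "$2" "$1"
}

# Local destination (assumed layout: /home/USERNAME/CLUSTER_backup).
backup_dest() {
    printf '/home/%s/%s_backup\n' "$1" "$2"
}

# Create the destination if needed, then run the backup over ssh.
run_backup() {
    src=$(backup_source "$1" "$2")
    dest=$(backup_dest "$1" "$2")
    mkdir -p "$dest" && rsync -avz -e ssh "$src" "$dest"
}

# Example:
#   run_backup jdoe max.econ
```

Saved as a script that ends with a call to run_backup, this could be what the cron entry invokes instead of the long rsync line.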
If there are problems or questions, please don't hesitate to contact SSC-help.
Posted by ssc_upenn at 10:47 AM | Comments (0)