Tuesday, May 20, 2008

OpenSolaris 2008.05 installation and the fault manager

Last week I had to install an OpenSolaris 2008.05 system at work. The first system I tried wouldn't boot the kernel. After some searching, I found that adding -kv to the kernel command line in GRUB loads the kernel debugger (-k) and enables more verbose output during boot (-v). Adding -d as well drops into the debugger early in the boot process. The system was hanging so early in boot that I wasn't able to diagnose much. BIOS and firmware updates didn't help either.

After that, I moved to another system. That was where I had my first encounter with the Solaris Fault Manager. During a reboot after an image update, it reported an error on a PCI device, but the system continued working fine. After the next reboot, there was a message on the console about a device being retired, and the system did not come up on the network. A look at dmesg showed that the e1000g0 device had been unregistered. After some searching, I came across a Sun documentation page that describes the faulty device retirement feature. Basically, you can use the fmadm faulty command to list devices that have shown failures. In my case, I issued the fmadm repair command with the fault ID from that listing. After another reboot, the system came back up with the e1000g0 device working properly.
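
For reference, here is a rough sketch of those recovery steps as shell helpers. The function names and the FMADM override variable are my own inventions for illustration; only the fmadm faulty and fmadm repair subcommands come from the steps above. The fault ID is a placeholder, so substitute whatever fmadm faulty reports on your system.

```shell
# Sketch of the recovery steps; run as root on the affected system.
# FMADM is a hypothetical override (for example FMADM=echo) to preview
# the commands without touching the fault manager.
FMADM="${FMADM:-fmadm}"

# List the resources the fault manager currently considers faulty.
list_faults() {
    "$FMADM" faulty
}

# Mark a fault as repaired. $1 is the fault ID shown by `fmadm faulty`.
repair_fault() {
    if [ -z "$1" ]; then
        echo "usage: repair_fault <fault-id>" >&2
        return 1
    fi
    "$FMADM" repair "$1"
}

# Typical session (the ID is a placeholder, not a real value):
#   list_faults
#   repair_fault <fault-id>
#   reboot
```

Marking a fault as repaired only tells the fault manager to stop treating the device as faulty. If the hardware really is failing, I would expect the fault to be diagnosed again.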
