ConnectionAlarm

Just to review what happened this morning and how it was fixed...

The switch serving both Spectrum network management servers, apple and peach, went down. Spectrum isn't too happy about losing contact with everything. When the switch came back, OneClick (Spectrum's modern client) showed a landscape alarm for peach.

Assuming peach had a problem (in retrospect, I'm not so sure), we restarted processd (the controller for all things Spectrum, and the parent process of the other Spectrum processes). The landscape alarm was still there. Since a OneKick is so often effective, we tried that. Then peach vanished from OneClick. I launched SpectroGraph, the ancient, end-of-life, Motif-based Spectrum client that preceded OneClick, and it saw the same thing. When OneClick (1C) and SpectroGraph (SG) agree, there's no need for a OneKick.

I used netstat -an and saw that a connection from apple:56063 to peach was sitting in TIME_WAIT, while peach had no such connection. I checked all the logs on peach, and one said the connection to the Location Service on port 56063 had failed. So I killed LocServer on apple, it restarted automatically, and the landscape alarm finally went away!

Notice the TIME_WAIT connection on apple with peach, and note the port number 56063 in particular!

hope@apple$ netstat -na | grep 152.2.145.20
152.2.145.17.48879   152.2.145.20.50025   49640      0 49640      0 ESTABLISHED
152.2.145.17.56063   152.2.145.20.50116   49640      0 49640      0 TIME_WAIT

That connection isn't there on peach. (In fact, that turns out to be the problem.)

hope@peach$ netstat -na | grep 152.2.145.17
152.2.145.20.50025   152.2.145.17.48879   49640      0 49640      0 ESTABLISHED

TIME_WAIT is supposed to last 2xMSL (twice the maximum segment lifetime). Unless 2xMSL is in the range of hours, there's something wrong with that TIME_WAIT connection on apple, because it blocked the legitimate connection for hours!
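
The side-by-side comparison above can be scripted. What follows is only a minimal sketch, not anything Spectrum provides: it assumes the netstat output from each host has been saved to a file (the file names apple.txt and peach.txt are made up for illustration) and that lines are in the format shown above, with the local address first, the remote address second, and the state last. The function name one_sided is hypothetical.

```sh
# Print connections listed in LOCAL_FILE that have no mirror-image entry
# in REMOTE_FILE. Both files hold saved "netstat -na" output.
# Usage: one_sided LOCAL_FILE REMOTE_FILE
one_sided() {
    awk '
        # Skip headers and anything whose first two fields are not
        # numeric address.port pairs.
        $1 !~ /^[0-9.]+$/ || $2 !~ /^[0-9.]+$/ { next }
        # Remote host file (read first): store each connection the way
        # the local host would list it, with the two addresses swapped.
        FILENAME == ARGV[1] { seen[$2 " " $1] = 1; next }
        # Local host file: report entries the remote host does not have.
        !(($1 " " $2) in seen) { print $1, $2, $NF }
    ' "$2" "$1"
}
```

With the two listings above saved as apple.txt and peach.txt, running one_sided apple.txt peach.txt would print only the 56063 line and its TIME_WAIT state, since the 48879/50025 connection appears on both hosts.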

hope@peach$ cat /opt/spectrum/NS/NSAGENT.OUT
Failure to connect to Location Service at host: peach.net.unc.edu, port #: 56063.

NSAgent is now ready on port 0xf00d...

After LocServer restarted on apple, the connection from port 56063 is ESTABLISHED:

hope@apple$ netstat -na | grep 152.2.145.20
152.2.145.17.48879   152.2.145.20.50025   49640      0 49640      0 ESTABLISHED
152.2.145.17.56063   152.2.145.20.50129   49640      0 49640      0 ESTABLISHED

Rebooting apple would have fixed it, but that's overkill. Likewise, stopping processd on apple, letting all its processes exit, and then starting processd again would have worked, but we'd still be waiting for apple to contact its models. Instead, the "connection to landscape" alarm made me check the connection with netstat -an, and I know TIME_WAIT shouldn't last more than 4 minutes. The only way to clear this one was to kill the process behind that port (LocServer here, which can be found with lsof or, in my case, by log file serendipity).
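
For the lsof route, here is a minimal sketch. The function name listener_pids is made up, and the -sTCP:LISTEN filter needs a reasonably recent lsof. A socket in TIME_WAIT typically no longer belongs to any process, so the sketch asks for the process listening on the port instead, which in this incident should be LocServer on apple.

```sh
# Print the PIDs of the processes listening on a TCP port.
# Run as root to see sockets owned by other users.
# Usage: listener_pids PORT
listener_pids() {
    # -t           print PIDs only, one per line
    # -nP          skip host name and port name lookups
    # -iTCP:PORT   select TCP sockets using this port
    # -sTCP:LISTEN keep only sockets in the LISTEN state
    lsof -t -nP -iTCP:"$1" -sTCP:LISTEN
}
```

On apple this would be called as listener_pids 56063. Check the PID against ps before killing anything.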

It's quite possible that peach and OneClick didn't need attention.

- Joni

Page last modified on February 18, 2009, at 03:59 PM EST