IT@UT

An Informal History of the UT Austin Tech Community


Reasons Systems Went Down

Alcohol

Drunken student wandered out of the Main Building into the courtyard after midnight. Gate closed. Seeking attention, found the breakers. Killed the power to the mainframe. Circa 1983. Bryan Wilcox says “I was the operator that found the blood in the Main basement when the drunk shut our generator power to the mainframe. He had cut his self when he broke into the basement. Kinda scary finding blood in that old building. Scared the police too.”

Beautiful Days

[Possibly an apocryphal story; confirmation and details needed] In the mid-1980s, Steve Holland figured out that he could bring down the mainframe by issuing a particular command string. He did this on beautiful afternoons so people could leave early. After this happened a few times, the Systems Staff fixed the problem, but instead of advertising the fix, they set a trap for the culprit: a screen that would pop up only for the person issuing that command string. The next time Steve issued it, he got a screen instructing him to call the Systems Staff and report a special error code. When he did, they knew who had been bringing down the system.

Chuck says, yes, well-known story. Probably 1982. And John Wheat had his swim suit in a bag so he would be ready to go to the lake when this happened.

Steve, BTW, was a little wild in other respects. If he were to drive the wrong way at night on a one-way street, he would do it with his lights off. But when it came to security testing, he was always invited to try to hack. The System Staff didn't necessarily trust him, but they did trust that if he had a good secret, he wouldn't share it with anyone.

Disaster Testing

Excessive Literalness

Disaster drill, mid-2000s. Followed checklist up to “simulate pulling the plug”. Actually pulled the plug. Twelve hours to recover. Most admin employees sent home. (Date, please?)

Silent Alarm

Circa 1995. Phil Erickson was checking things out in the machine room and decided to test the smoke detectors. He lit a match and held it under one of the detectors, but as far as he could tell nothing happened. That was because, since nobody was normally in the machine room, the alarm went off at the police department. Anyway, he decided to test another one to see if just the one was “bad”. However, the room had been wired so that if two detectors went off it would cut all electricity to the room, so once he held up the second match everything went down. When we asked him what he was thinking, he said he’d applied the rules he’d used moving nuclear missiles around Europe while in the Army: “Will it cause a diplomatic incident? Will it embarrass the President? Will it start a war?” The answer to all these questions was “No”, so he went ahead. (Colonel Erickson had been chief of logistical planning for Operation Desert Storm, the war in Iraq in 1991.)

Regression

Someone other than Jim Ferrero rebooted EntireX Broker, causing it to revert its character set to EBCDIC. This caused all users with “!” in their passwords to be locked out of the system after too many invalid login attempts. Circa 2008?
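A rough illustration of how a character-set change can break only some passwords. The page does not record which encodings were involved, so the two code pages below (`cp037` and `cp500`, both EBCDIC variants shipped with Python) are purely stand-ins, and `as_seen_by_host` is a hypothetical helper, not anything from EntireX. The point is only that letters and digits sit at the same byte values in both pages while “!” does not.

```python
# Assumption: cp037 and cp500 stand in for whichever character sets were
# actually in play; the original incident's encodings are not recorded.
HOST_PAGE = "cp037"    # page the password check expects (hypothetical)
BROKER_PAGE = "cp500"  # page the broker fell back to (hypothetical)


def as_seen_by_host(password: str) -> str:
    """Encode with the broker's page, then decode with the host's page."""
    return password.encode(BROKER_PAGE).decode(HOST_PAGE)


for pw in ("hunter2", "hunter2!"):
    seen = as_seen_by_host(pw)
    status = "match" if seen == pw else "MISMATCH"
    print(f"{pw!r} arrives as {seen!r}: {status}")
```

A password made only of letters and digits survives the round trip; one containing “!” arrives as a different string, fails every attempt, and eventually trips the lockout.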

Proving a Negative

A system went down because its servers came unplugged. While the system was brought back up, Data Center employees attempted to identify a root cause. One suggested that perhaps the plug had been kicked out of its socket. Another employee insisted that was impossible and kicked the plug to prove his point. The servers immediately died… and the system went down for the second time that day. Circa 2010.

Load Testing Should Be Intentional

Back in February of 2010, a relatively recent graduate of the training program ran a batch program at about 6pm with a HUGE array (something ridiculous like 1 million occurrences of an A30000) which sucked up all the memory on the mainframe and stopped new batch jobs from being kicked off. The resources used also slowed down UT Direct significantly and systems staff had to IPL the mainframe, which took something like 30-40 minutes. A couple months later, Curtis Pew gave an FYI on “effective memory usage” for the mainframe: https://dpdev1.dp.utexas.edu/epd/continuing/fyi/resources/materials/2010/memory_usage.pdf.

Environmental Sensitivity

I think this was sometime last year (2014). I (Curtis Pew) was preparing a change that required an IPL (restarting the mainframe). It was Friday afternoon, and our next IPL was scheduled for Sunday morning. In COM 2 we have a workstation where we can bring up consoles and perform IPLs. I was making changes on our test system and then IPLing that system to see how they worked. I had made my last change and went back to COM 2 and started bringing down the system, but didn’t notice that while I was back in my own office someone had brought the production console to the front. I entered the command to run the script that brings down the system, when suddenly Scott said, “It looks like COM-PLETE just went down,” and I realized what I had done. We quickly brought things back up, and the next week Jon changed the script so that you have to tell it the name of the system you’re trying to bring down before it will stop anything. (Interesting trivia: the name of the production system is “V470” because when the University converted to MVS the mainframe was an Amdahl v470 and we’ve never changed it.)
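The safeguard Jon added can be sketched roughly like this. It is not the actual script (which would not have been Python); `bring_down`, `stop_action`, and the test system name “TEST1” are all made up for illustration. Only “V470” comes from the story above. The idea is that the script compares the name the operator types against the system the console is really attached to, and refuses to do anything if they differ.

```python
from typing import Callable


def bring_down(attached_system: str, typed_name: str,
               stop_action: Callable[[str], None]) -> str:
    """Stop the attached system only if the operator named it correctly.

    attached_system: the system this console is actually connected to.
    typed_name: what the operator typed when asked which system to stop.
    stop_action: hypothetical callback that performs the real shutdown.
    """
    if typed_name.strip().upper() != attached_system.strip().upper():
        return (f"Refused: console is attached to {attached_system}, "
                f"but you named {typed_name.strip()!r}. Nothing was stopped.")
    stop_action(attached_system)
    return f"Stopping {attached_system}."


stopped = []
# Operator thinks they are on the test system, but production is in front.
print(bring_down("V470", "TEST1", stopped.append))
# Operator really does mean production.
print(bring_down("V470", "v470", stopped.append))
```

With a check like this, the Friday-afternoon mistake would have ended with a refusal message instead of COM-PLETE going down.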

Backup Systems Go Wild

Power outage at UT, February 17, 2015. Although UT generates its own electricity, it contracts with Austin Energy for backup. Austin Energy threw the wrong switch and brought down UT power for four hours.

Backlash Over Feral Cats

Christmas 1994 or 1995. Over Christmas break, Groundskeeping staff were dispatched to catch all the feral cats which hung around the Main Building and were fed by employees. And the cats were caught and dispatched. In the preceding years, two improvements had been made to the *TXMAIL system. (1) It supported transmission to the Internet, and forwarding of logon IDs to the Internet. (2) It allowed creation of user-created bulletin boards and discussion groups, permitting individual subscription. A list @USPETS had been established. The interface to receive incoming mail from the Internet and upload it to *TXMAIL was a batch job run every 30 minutes, and quite sensitive to load. The outrage over the euthanization of the cats was enormous, and the bounce-backs from messages forwarded to bad Internet addresses overwhelmed the system.

Dropped Screwdriver

Ca. 2000. I (Phil) administered the campus-wide Document Imaging system. Images were all stored on optical platters, and the associated servers and storage devices were kept in Main 22 (aka “The Vault”), which also housed the TEX phone bank. The primary near-line storage devices were optical jukeboxes which held up to 24 optical platters. I got a call from an office/department that they were unable to view images. I went to the vault, and noticed a workman doing some line-stringing or some such in there; I checked the jukeboxes and they seemed to be OK. Then I looked behind them; the SCSI connector to one was broken off. The worker guy looked at me looking at it and said, “Oh, I dropped my screwdriver behind that and I had to move it but I moved it back.” That jukebox was down for at least a day for repairs.

systems_went_down.txt · Last modified: 2022/11/16 19:22 by phil_g