Folklore of programmers and engineers (part 1)





This is a collection of stories from the Internet about how bugs sometimes have incredible manifestations. Perhaps you also have a story to tell.



Car allergy to vanilla ice cream



A story for engineers who understand that the obvious is not always the solution, and that no matter how implausible the facts are, they are facts. General Motors Corporation's Pontiac Division received a complaint:



"This is the second time I have written to you, and I don't blame you for not answering, because I probably sound crazy. But it is a fact that our family has a tradition of ice cream for dessert after dinner every night. Every evening the family votes on the flavor and I drive to the store to get it. I recently bought a new Pontiac, and since then my trips to the store have become a problem. Every time I buy vanilla ice cream, the car won't start when I come back out. If I buy any other flavor, it starts just fine. I want you to know I'm serious about this question, however silly it sounds: what is it about a Pontiac that makes it refuse to start when I buy vanilla ice cream, and start easily with any other flavor?"


As you can imagine, the division president was skeptical about the letter. However, just in case, he sent an engineer to check. The engineer was surprised to be met by a wealthy, well-educated man living in a nice neighborhood. They agreed to meet right after dinner and drive to the ice cream store together. It was a vanilla night, and when they got back into the car, it wouldn't start.



The engineer came three more evenings. The first time the ice cream was chocolate. The car started up. The second time there was strawberry ice cream. The car started up. On the third evening, he asked for vanilla. The car did not start.



Being a reasonable man, the engineer refused to believe that the car was allergic to vanilla ice cream. So he arranged with the owner to continue his visits until he found a solution. Along the way he began taking notes: he wrote down everything, the time of day, the type of gasoline, the time of arrival at and departure from the store, and so on.



Soon, the engineer realized that the owner of the car was spending less time buying vanilla ice cream. The reason was the layout of the product in the store. Vanilla ice cream was the most popular and was kept in a separate freezer at the front of the store to make it easier to find. And all the other varieties were in the back of the store, and it took much more time to find the right variety and pay.



Now the engineer had a new question: why wouldn't the car start when less time had passed since the engine was shut off? Since the problem was time, not vanilla ice cream, the engineer quickly found the answer: vapor lock. It occurred every evening, but when the owner spent more time looking for ice cream, the engine had time to cool down enough to start normally. When the man bought vanilla ice cream, the engine was still too hot and the vapor lock had not had time to dissipate.



Moral: Even completely insane problems can sometimes be real.



Crash Bandicoot



It's painful to go through this. As a programmer, you get used to blaming your code first, second, third... and only somewhere in ten-thousandth place do you blame the compiler. Even further down the list comes the hardware.



Here is my story about a hardware bug.



For Crash Bandicoot, I wrote the code for loading from and saving to the memory card. For a smug game developer like me, it seemed like a walk in the park: I figured the work would take a couple of days. I ended up debugging that code for six weeks. I worked on other things along the way, but every few days I came back to it for a few hours. It was agony.



The symptom looked like this: you save your current progress and access the memory card, and almost always everything goes fine... but occasionally the read or write operation times out for no obvious reason. A short write would often corrupt the memory card. So when a player tried to save, not only would the save fail, it would also destroy the card. Damn.



After a while, our producer at Sony, Connie Booth, began to panic. We couldn't ship the game with this bug, and after six weeks I still had no idea what was causing it. Through Connie we asked other PS1 developers: had anyone encountered anything similar? No. Nobody had any problems with the memory card.



When you have no ideas left for debugging, about the only remaining approach is "divide and conquer": keep removing code from the failing program until a relatively small fragment remains that still causes the problem. In other words, you carve the program away piece by piece until only the part containing the bug is left.



The trouble is that it is very hard to cut pieces out of a video game. How do you run it if you've removed the code that emulates gravity? Or the code that draws the characters?



So you have to replace entire modules with stubs that pretend to do something useful but actually do something trivially simple that cannot contain errors. You have to write these hacks just to keep the game running. It is a slow and painful process.



In short, I did it. I kept removing pieces of code until I was left with the startup code that sets up the system to launch the game, initializes the rendering hardware, and so on. Of course, at that stage I couldn't show the save-and-load menu, since I would have had to stub out all the graphics code. But I could pretend to be a user on the (invisible) save-and-load screen who asks to save and then writes to the memory card.



I ended up with a small piece of code that still had the problem - but the failure was still random! Most of the time everything worked fine, but occasionally it crashed. I had removed almost all of the game code, yet the bug lived on. This was puzzling: the remaining code didn't really do anything.



At some point, probably at three in the morning, a thought occurred to me. Read and write (I/O) operations depend on precise timing. Whether you are working with a hard drive, a memory card or a Bluetooth module, the low-level code responsible for reading and writing does it in step with clock pulses.



The clock keeps a device that is not directly connected to the processor in sync with the code running on the processor. The clock determines the baud rate, the speed at which data is transferred. If the timing gets confused, then the hardware, the software, or both get confused as well. And that is very bad, because data can get corrupted.



What if something in our code was messing up the timing? I checked everything related to timing in the test program and noticed that we set the programmable timer in the PS1 to 1 kHz (1000 ticks per second). That is quite a lot: by default, when the console starts, it runs at 100 Hz, and most games use that frequency.



Andy, the game's developer, set the timer to 1 kHz so that motion would be computed more precisely. Andy tends to overdo things: if we were going to emulate gravity, we were going to do it as accurately as possible!



But what if speeding up the timer somehow affected the overall timing of the program, and hence the clock that adjusts the baud rate for the memory card?



I commented out the timer code. The error did not happen again. But that did not mean it was fixed, since the crash had always been random. What if I had just gotten lucky?



A few days later I experimented with the test program again. The bug did not reappear. I went back to the full game codebase and changed the save-and-load code so that the programmable timer was reset to its default value (100 Hz) before accessing the memory card and then set back to 1 kHz afterwards. There were no more crashes.



But why did this happen?



I went back to the test program again and tried to find some pattern in when the error occurred with the 1 kHz timer. Eventually I noticed that the error happened when someone was fiddling with a PS1 controller. Since I rarely did that myself - why would I need a controller while testing save-and-load code? - I had never noticed the connection. But one day one of our artists was waiting for me to finish testing - I was probably swearing at the time - and nervously twirling the controller in his hands. The error occurred. "Wait, what?! Do that again!"



Once I realized the two events were connected, I could easily reproduce the error: start writing to the memory card, move the controller, corrupt the memory card. It looked like a hardware bug to me.



I went to Connie and told her about my discovery. She relayed the information to one of the engineers who had designed the PS1. "Impossible," he replied, "it can't be a hardware problem." I asked Connie to arrange a conversation for us.



The engineer called me, and we argued in his broken English and my (extremely) broken Japanese. Finally I said, "Let me just send you my 30-line test program where moving the controller triggers the bug." He agreed. He said it was a waste of time and that he was terribly busy with a new project, but he would do it because we were a very important developer for Sony. I cleaned up my test program and sent it to him.



The next evening (we were in Los Angeles, and he was in Tokyo) he called me and sheepishly apologized. It was a hardware problem.



I don't know exactly what the bug was, but from what I heard from Sony headquarters, setting the timer to a high enough rate caused interference with components on the motherboard near the timer crystal. One of those was the baud rate controller for the memory card, which also set the baud rate for the controllers. I'm not an engineer, so I may have gotten some of this wrong.



The bottom line is that there was interference between components on the motherboard. When data was transmitted through the controller port and the memory card port at the same time, with the timer running at 1 kHz, bits were dropped, data was lost, and the card was corrupted.



Bad cows



In the 1980s, my mentor Sergei wrote software for the SM-1800, a Soviet clone of the PDP-11. The microcomputer had just been installed at a railway station near Sverdlovsk, an important transport hub in the USSR. The new system was designed to route wagons and freight flows. But it had an annoying bug that led to random crashes and freezes. The crashes always happened after everyone had gone home for the evening. Despite careful investigation the next day, the computer worked correctly on all manual and automated tests. This usually points to a race condition or some other concurrency bug that only manifests under certain conditions. Tired of late-night calls, Sergei decided to get to the bottom of it, and first of all to understand which conditions at the marshalling yard led to the computer breaking down.



First, he collected statistics on all the unexplained crashes and plotted them by date and time. The pattern was obvious. After observing for a few more days, Sergei realized that he could easily predict the times of future system failures.



He soon learned that the failures occurred only when the station was sorting wagonloads of cattle from northern Ukraine and western Russia bound for a nearby slaughterhouse. This was odd in itself, because the slaughterhouse was normally supplied by farms much closer by, in Kazakhstan.



The Chernobyl nuclear power plant had exploded in 1986, and radioactive fallout made the surrounding area uninhabitable. Large areas of northern Ukraine, Belarus and western Russia were contaminated. Suspecting high radiation levels in the arriving wagons, Sergei devised a way to test the theory. Civilians were not allowed to own dosimeters, so Sergei befriended some of the soldiers stationed at the railway station. After a few shots of vodka, he managed to convince a soldier to measure the radiation level in one of the suspect wagons. It turned out the level was several times higher than normal.



Not only were the cattle strongly radioactive, the level was so high that it caused random bit flips in the memory of the SM-1800, which sat in a building next to the yard.



Food was in short supply in the USSR, and the authorities decided to mix the "Chernobyl" meat with meat from other regions of the country. This lowered the overall radioactivity without wasting valuable resources. Upon learning of this, Sergei immediately filled out his emigration papers. And the computer crashes stopped on their own as the radiation levels declined over time.



Through the pipes



Movietech Solutions once made software for movie theaters: ticketing and general management. The DOS version of the flagship application was quite popular with small and medium-sized theater chains in North America. So it's not surprising that when the Windows 95 version was announced - integrated with the latest touchscreens and self-service kiosks, and equipped with all sorts of reporting tools - it quickly became popular too. Most of the time the upgrade went smoothly. On-site IT staff installed the new hardware, migrated the data, and business continued. Except when it didn't. When that happened, the company sent in James, "the Cleaner".



Although the nickname suggests a shady character, the cleaner was just a combination of trainer, installer and jack of all trades. James would spend a few days at the client's site putting all the components together, then a couple more days teaching the staff to use the new system, fixing whatever hardware problems came up, and generally helping the software through its teething period.



So it is hardly surprising that during that hectic time James walked into the office one morning and, before he could even reach his desk, was greeted by his boss, who was running on considerably more caffeine than usual.



- I'm afraid you need to get to Annapolis, Nova Scotia, as soon as possible. Their whole system went down, and after a night of working with their engineers we can't figure out what happened. It looks like the server is losing the network. But only after the system has been running for a few minutes.



- Couldn't they go back to the old system? - James asked quite seriously, although inwardly his eyes widened in surprise.



- That's just it: their IT guy "reprioritized" and decided to leave with their old server. James, they've installed the system at six sites, they've just paid for premium support, and their business is now running like it's the 1950s.



James straightened slightly.



- This is another matter. Okay, let's get started.



When he arrived in Annapolis, the first thing he did was find the client's first movie theater, the one with the problem. On the map he had picked up at the airport everything looked respectable, but the neighborhood around the address looked dubious. Not a ghetto, but something out of film noir. As James parked at the curb downtown, a prostitute approached him. Given the size of Annapolis, she was quite possibly the only one in the whole town. Her appearance immediately brought to mind the famous on-screen character who offered sex for money. No, not Julia Roberts - Jon Voight [a reference to the movie "Midnight Cowboy" - translator's note].



After sending the prostitute on her way, James went to the cinema. The surroundings got better, but the place still felt seedy. Not that James was too worried; he had been to squalid places before. And this was Canada, where even the robbers are polite enough to say thank you after taking your wallet.



The side entrance to the cinema was in a dank alley. James went to the door and knocked. Soon it creaked open a crack.



- Are you the cleaner? - a hoarse voice came from inside.



- Yes, that's me... I've come to fix everything.



James walked into the cinema lobby. Left with no other choice, the staff had gone back to issuing paper tickets to customers. That made financial reporting difficult, to say nothing of the more interesting details. But the staff greeted James with relief and immediately took him to the server room.



At first glance everything was in order. James logged into the server and checked the usual suspects. No problems. Still, as a precaution, James shut the server down, replaced the network card and brought the system back up. It immediately started working at full capacity. The staff began selling tickets again.



James called Mark and reported the situation. It's not hard to guess that James decided to stick around and see whether anything unexpected would happen. He went downstairs and started asking the staff what had happened. Clearly, the system had stopped working. They had turned it off and on again, and it worked. But after about 10 minutes it dropped out again.



Just then, something of the sort happened. Suddenly the ticketing system started throwing errors. The staff sighed and reached for the paper tickets, while James hurried to the server room. The server looked fine.



Then one of the employees entered.



- The system is working again.



James was puzzled because he hadn't done anything. More precisely, nothing that would make the system work. He logged out, picked up the phone, and called his company's support team. Soon the same employee entered the server room.



- The system is down again.



James glanced at the server. An interesting and familiar pattern of multicolored shapes danced across the screen - chaotically twisting and intertwining pipes. We've all seen that screensaver at some point. It was beautifully rendered and literally hypnotic.





James pressed a key and the pattern disappeared. He hurried toward the box office and on the way met the employee coming back to him.



- The system is working again.



If it is possible to facepalm mentally, that is exactly what James did. The screensaver. It uses OpenGL. And so, while it runs, it eats all of the server's CPU. As a result, every request to the server times out.



James went back to the server room, logged in and replaced the beautiful pipes screensaver with a blank screen - that is, instead of a screensaver that consumes 100% of the CPU, he set one that consumes nothing. Then he waited 10 minutes to confirm his guess.



When James arrived at the next movie theater, he wondered how to explain to his supervisor that he had just flown 800 km to turn off the screensaver.



Crash at a specific moon phase



True story. There was once a software bug that depended on the phase of the moon. A little subroutine commonly used in various MIT programs calculated an approximation of the true phase of the moon. GLS built this subroutine into a LISP program which, when writing a file, would output a timestamped line almost 80 characters long. Very occasionally, the first line of the message turned out to be too long and wrapped onto the next line. When the program later read the file back, it choked. The length of the first line depended on the exact date and time, as well as on the length of the phase description at the moment the timestamp was printed. So the bug literally depended on the phase of the moon!



The first paper edition of the Jargon File (Steele-1983) contained a sample of such a line that triggered the bug, but the typesetter "fixed" it. The bug has since been described as the "phase of the moon bug".



However, be careful with assumptions. Several years ago, engineers at CERN (the European Center for Nuclear Research) encountered errors in experiments run on the Large Electron-Positron Collider. Since computers heavily process the enormous amount of data produced by the machine before showing results to the scientists, many assumed the software was somehow sensitive to the phase of the moon. A few desperate engineers got to the truth. The error was caused by a slight change in the geometry of the 27 km ring due to the deformation of the Earth as the Moon passed overhead! The story has entered physics folklore as "Newton's revenge on particle physics" and as an example of the link between the simplest, oldest physical laws and the most advanced scientific concepts.



Flushing the toilet stops the train



The best hardware bug I have ever heard of was on a high-speed train in France. The bug triggered the train's emergency brakes, but only if there were passengers on board. Each time, the train was taken out of service and inspected, but nothing was found. It was then sent back onto the line and promptly made another emergency stop.



During one of these checks, an engineer riding on the train went to the toilet. He flushed, and BOOM! Emergency stop.



The engineer contacted the driver and asked:



- What did you do just before braking?



- Well, I slowed down on the descent ...



That was strange, because in normal running the train brakes on descents dozens of times. The train went on, and on the next descent the driver gave a warning:



- I'm going to slow down.



Nothing happened.



- What were you doing during the last emergency stop? - asked the driver.



- Well ... I was in the toilet ...



- Well, then go to the toilet and do what you did when we go down again!



The engineer went to the toilet, and when the driver warned, "I'm braking," he flushed. Sure enough, the train stopped immediately.



Now they could reproduce the problem and needed to find the cause.



Two minutes later they noticed that the remote-control cable for engine braking (the train had an engine at each end) had come loose from the wall of the electrical cabinet and was lying across the relay that controlled the toilet flush solenoid... When the relay switched, it induced interference in the brake cable, and the fail-safe protection simply triggered emergency braking.



The gateway that hated FORTRAN



A few months ago, we noticed that network connections to the mainland [this was in Hawaii] were getting very, very slow. It could last 10-15 minutes and then suddenly clear up. After a while, a colleague complained to me that the connections to the mainland weren't working at all. He had some FORTRAN code that needed to be copied to a machine on the mainland, but it couldn't be done because "the network didn't stay up long enough for the FTP transfer to complete."



Indeed, it turned out that the network failures happened whenever my colleague tried to FTP his FORTRAN source file to a mainland machine. We tried compressing the file: then it copied over just fine (but there was no way to unpack it on the target machine, so that didn't solve the problem). Finally we "split" the FORTRAN code into very small chunks and sent them one at a time. Most of the chunks copied without problems, but a few would not go through, or only went through after many attempts.



Examining the problem chunks, we found they had something in common: they all contained comment blocks that began and ended with lines made up entirely of capital C's (that was how my colleague liked to set off comments in FORTRAN). We emailed the network specialists on the mainland and asked for help. Of course, they wanted to see samples of the files that couldn't be sent over FTP... but our emails never reached them. Eventually we came up with a plain description of what the undeliverable files looked like. That worked :) [Do I dare include an example of one of the problematic FORTRAN comments here? Probably better not!]



In the end, we figured it out. A new gateway had recently been installed between our part of the campus and the mainland network. It had HUGE difficulty transmitting packets containing repeated capital C's! Just a few such packets could soak up all of the gateway's resources and keep most other packets from getting through. We complained to the gateway's manufacturer... and they told us, "Oh yes, you've hit the repeated-C bug! We already know about it." We ultimately solved the problem by buying a new gateway from another manufacturer (in the old one's defense, an inability to transfer FORTRAN programs might count as an advantage for some!).



Hard times



Several years ago, while working on a Perl ETL system built to reduce the cost of Phase 3 clinical trials, I needed to process about 40,000 dates. Two of them failed validation. That didn't bother me much, since the dates came from client-provided data, which was often, let's say, surprising. But when I checked the source data, it turned out the dates were January 1, 2011 and January 1, 2007. I assumed the bug was in the program I had just written, but it turned out to be 30 years old. This may sound mysterious to those unfamiliar with the software ecosystem. Because of a long-ago business decision by another company, my client paid me to fix a bug that one company had introduced by accident and another had introduced deliberately. To explain, I need to tell you about the company that added the feature which eventually became a bug, plus a few other curious events that contributed to the mysterious bug I fixed.



In the good old days, Apple computers would sometimes spontaneously reset their date to January 1, 1904. The reason was simple: a battery-backed "system clock" kept track of the date and time, and the computer stored the date as the number of seconds since the epoch. The epoch is the reference starting date, and for the Macintosh it was January 1, 1904. So when the battery died, the current date was reset to that starting point. But why did that happen?



At the time, Apple used 32 bits to store the number of seconds since that starting date. One bit can hold one of two values, 0 or 1. Two bits can hold one of four values: 00, 01, 10, 11. Three bits hold one of eight: 000, 001, 010, 011, 100, 101, 110, 111, and so on. And 32 bits can hold one of 2^32 values, that is, 4,294,967,296 seconds. For Apple's dates, that comes to roughly 136 years, which is why older Macs cannot handle dates after 2040. And if the system battery dies, the date resets to 0 seconds since the epoch, and you have to set the date by hand every time you turn the computer on (or until you buy a new battery).
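As a quick sanity check of that arithmetic, here is a small Python sketch (an illustration only, assuming the classic Mac epoch of midnight, January 1, 1904; the real hardware clock is of course not involved):

from datetime import datetime, timedelta

# 32 bits give 2^32 distinct second counts starting from the Mac epoch.
mac_epoch = datetime(1904, 1, 1)
max_seconds = 2**32

print(max_seconds)                                     # 4294967296
print(max_seconds / (365.25 * 24 * 3600))              # ~136.1 years
print(mac_epoch + timedelta(seconds=max_seconds - 1))  # 2040-02-06 06:28:15

The last line lands in early 2040, which is why older Macs cannot handle dates past that year.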



However, Apple's decision to store dates as seconds since the epoch meant that dates before the epoch could not be represented, which, as we'll see, had far-reaching implications. What Apple introduced was a feature, not a bug. Among other things, it meant the Macintosh operating system was immune to the "millennium bug" (which cannot be said of many Mac applications that had their own date systems to get around the limitations).



Moving on. Lotus 1-2-3 was the "killer app" for the IBM PC that helped launch the PC revolution, much as VisiCalc on the Apple had made personal computers successful in the first place. To be fair, if 1-2-3 hadn't come along, PCs would hardly have taken off, and the history of personal computing could have gone very differently. Lotus 1-2-3 incorrectly treated 1900 as a leap year. When Microsoft released Multiplan, its first spreadsheet, it had only a small market share. And when Microsoft started the Excel project, they decided not only to copy the row-and-column naming scheme from Lotus 1-2-3, but also to be bug-for-bug compatible, deliberately treating 1900 as a leap year. That problem persists to this day. In 1-2-3 it was a bug; in Excel it was a deliberate decision guaranteeing that all 1-2-3 users could import their spreadsheets into Excel without the data changing, even when the data was wrong.



But there was another wrinkle. Microsoft first released Excel for the Macintosh, which did not recognize dates before January 1, 1904. In Excel on Windows, January 1, 1900 was treated as the start of the epoch. So the developers made a change: Excel recognizes which epoch a workbook uses and stores dates internally according to that epoch. Microsoft even wrote an explanatory article about it. And that decision led to my bug.



My ETL system received Excel spreadsheets from clients, usually created on Windows but sometimes on a Mac. So a spreadsheet's epoch could be either January 1, 1900 or January 1, 1904. How do you tell? The Excel file format exposes the necessary information, but the parser I was using didn't expose it (it does now) and assumed you knew the epoch for a given spreadsheet. I probably could have spent more time understanding the Excel binary format and submitting a patch to the parser's author, but I had plenty else to do for the client, so I quickly wrote a heuristic to determine the epoch. It was simple.



In Excel, the date July 5, 1998 can be displayed as "07-05-98" (the useless American format), "Jul 5, 98", "July 5, 1998", "5-Jul-98" or some other equally useless format (ironically, one format my version of Excel did not offer was the ISO 8601 standard). Internally, however, the unformatted date was stored as either "35981" for the 1900 epoch or "34519" for the 1904 epoch (the numbers are days since the start of the epoch). So I used a simple parser to extract the year from the formatted date, and the Excel parser to extract the year from the unformatted value. If the two values differed by four years, I knew the spreadsheet was using the 1904 epoch.
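To make the two serial numbers concrete, here is a rough Python sketch of the conversion (not the original Perl; anchoring the 1900 system at December 30, 1899 is the usual trick that absorbs Excel's phantom February 29, 1900 for any later date):

from datetime import datetime, timedelta

def from_1900_epoch(serial):
    # Windows-style workbooks: days counted in the 1900 date system.
    return datetime(1899, 12, 30) + timedelta(days=serial)

def from_1904_epoch(serial):
    # Mac-style workbooks: days since January 1, 1904.
    return datetime(1904, 1, 1) + timedelta(days=serial)

print(from_1900_epoch(35981))  # 1998-07-05
print(from_1904_epoch(34519))  # 1998-07-05, the same day under the other epoch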



Why not just use the formatted dates? Because July 5, 1998 might be formatted as "July, 98", with the day of the month missing. We received spreadsheets from so many companies, created in so many different ways, that it fell to us (in this case, me) to make sense of the dates. Besides, if Excel can get it right, so should we!



Then I ran into 39082. Remember that Lotus 1-2-3 treated 1900 as a leap year, and that this was faithfully copied into Excel. Since that adds one day to 1900, many date functions can be off by exactly one day. That means 39082 could be January 1, 2011 (on a Mac) or December 31, 2006 (on Windows). If my "year parser" extracted 2011 from the formatted value, fine. But since the Excel parser doesn't know which epoch to use, it defaults to the 1900 epoch and returns 2006. My application saw a five-year difference, treated it as an error, logged it, and returned the unformatted value.



To get around this, I wrote this (pseudocode):



diff = formatted_year - parsed_year
if 0 == diff
    assume 1900 date system
if 4 == diff
    assume 1904 date system
if 5 == diff and month is December and day is 31
    assume 1904 date system


And then all 40,000 dates were parsed correctly.
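For illustration, that heuristic can be rendered as a short runnable Python sketch (the real code was Perl inside a larger ETL pipeline, and the names here are invented):

from datetime import datetime, timedelta

def detect_epoch(formatted_year, serial):
    # Interpret the raw serial as if the workbook used the 1900 epoch.
    as_1900 = datetime(1899, 12, 30) + timedelta(days=serial)
    diff = formatted_year - as_1900.year
    if diff == 0:
        return 1900
    if diff == 4:
        return 1904
    # A serial that falls on December 31 under the 1900 epoch lands on
    # January 1 under the 1904 epoch, so the year gap becomes 5 instead of 4.
    if diff == 5 and as_1900.month == 12 and as_1900.day == 31:
        return 1904
    raise ValueError("unable to determine epoch")

print(detect_epoch(1998, 35981))  # 1900: formatted and raw years agree
print(detect_epoch(1998, 34519))  # 1904: four-year gap
print(detect_epoch(2011, 39082))  # 1904: the troublesome leap-day edge case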



In the midst of large print jobs



In the early 1980s, my father worked at Storage Technology, in a now-defunct division that built tape drives and the pneumatic systems that fed the tape at high speed.



They redesigned the drives so that a single central "A" drive could be connected to seven "B" drives, and the small in-RAM OS controlling the "A" drive could delegate read and write operations to all the "B" drives.



Each time drive "A" was started, a floppy disk had to be inserted into the peripheral drive connected to "A" in order to load the operating system into its memory. It was extremely primitive: the processing power was provided by an 8-bit microcontroller.



The target audience for such equipment was companies with very large data stores - banks, retail chains, etc. - that needed to print many address labels or bank statements.



One client had a problem. In the middle of a print job, one particular "A" drive would stop working, bringing the whole job to a halt. To get the drive back up, the staff had to reboot everything. And if that happened in the middle of a six-hour job, an enormous amount of expensive computer time was lost and the schedule of the whole operation was thrown off.



Storage Technology sent technicians. Despite their best efforts, they could not reproduce the bug under test conditions: it seemed to happen only in the middle of large print jobs. The problem wasn't the hardware; they replaced everything they could - the RAM, the microcontroller, the floppy drive, every conceivable part of the tape drive - and the problem persisted.



Then the technicians called headquarters and summoned the Expert.



The Expert grabbed a chair and a cup of coffee, sat down in the computer room - in those days there were rooms dedicated to computers - and watched as the staff queued up a large print job. The Expert waited for the failure - and it came. Everyone looked at the Expert - and he had no idea why it had happened. So he ordered the job queued again, and all the staff and technicians went back to work.



The Expert sat down in his chair again and waited for the failure. It took about six hours, and the failure occurred. Again the Expert had no ideas, except that everything happened in a room full of people. He ordered the job restarted, sat down again and waited.



By the third failure, the Expert had noticed something. The failure happened while the staff were changing tapes in a peripheral drive. Moreover, the crash occurred the moment one of the employees walked across a particular floor tile.



The raised floor was made of aluminum tiles set 6-8 inches above the concrete. The computers' many cables ran underneath it so that no one would accidentally step on an important cable. The tiles fit together very tightly to keep debris from getting underneath.



The Expert realized that one of the tiles was warped. When an employee stepped on its corner, the tile's edges rubbed against the neighboring tiles. The plastic fittings that joined the tiles rubbed as well, producing static micro-discharges that generated radio-frequency interference.



RAM today is far better shielded against radio-frequency interference, but back then it was not. The Expert realized that the interference was corrupting memory, and with it the operation of the operating system. He called facilities, ordered a new tile, installed it himself, and the problem went away.



It's the tide!



The story took place in a server room on the fourth or fifth floor of an office in Portsmouth (I think), in the dockside area.



One day the Unix server hosting the main database crashed. It was rebooted, but it cheerfully kept falling over again and again. We decided to call someone from the support company.



The support guy... I think his name was Mark, but it doesn't matter... I don't think I even know him. It really doesn't matter. Let's go with Mark, okay? Excellent.



So, a few hours later Mark arrived (it's a long way from Leeds to Portsmouth, you know), turned on the server, and everything worked without a problem. Typical bloody support call, and the customer gets quite annoyed about it. Mark looked through the log files and found nothing objectionable. Then Mark got back on the train (or whatever mode of transport he used; for all I know it could have been a lame cow... anyway, it doesn't matter, okay?) and headed back to Leeds, having wasted the day.



The server crashes again that evening. Same story... the server won't come back up. Mark tries to help remotely, but the client can't get the server started.



Another train, bus, lemon meringue or whatever the hell, and Mark is back in Portsmouth. And look, the server boots up without any problems! A miracle. Mark spends several hours verifying that the operating system and software are fine, and heads back to Leeds.



Around midday, the server crashes (keep calm!). This time it seems sensible to bring in hardware support and replace the server. But no: after about 10 hours, the new one falls over too.



The situation repeated itself for several days: the server comes up, crashes after about 10 hours, and won't start for the next 2 hours. They checked the cooling, they checked for memory leaks, they checked everything, and found nothing. Then the crashes stopped.



A week went by without a care... everyone was happy. Happy until it all started again. The picture was the same: 10 hours of uptime, 2-3 hours of downtime...



And then someone (I think they told me that this person had nothing to do with IT) said:



"This is the tide!"



The exclamation was met with blank stares, and someone's hand probably hovered over the button to call security.



"He stops working with the tide."



The concept of tides is rather foreign to IT support staff, who are unlikely to be reading tide almanacs over their coffee. They explained that it couldn't have anything to do with the tide, because the server had run for a whole week without a hitch.



"The tide was low last week and high this week."



A bit of terminology for those without a yacht licence. Tides depend on the lunar cycle: as the Earth rotates, roughly every 12.5 hours the combined gravitational pull of the Sun and Moon produces a tidal bulge. At the start of the 12.5-hour cycle there is a high tide, in the middle a low tide, and at the end a high tide again. But as the Moon moves through its orbit, the difference between high and low water changes too. When the Moon is between the Sun and the Earth, or on the far side of the Earth (full moon or new moon), we get spring tides - the highest highs and the lowest lows. At half moon we get neap tides - the smallest tides, when the difference between the two extremes is greatly reduced. The lunar cycle lasts about 28 days: spring - neap - spring - neap.



When tidal forces were explained to the techies, they immediately thought of calling the police, which is quite understandable. But it turned out the guy was right. Two weeks earlier, a destroyer had moored near the office. Whenever the tide lifted it to a certain height, the ship's radar mast ended up level with the server room floor. And the radar (or the electronic-warfare gear, or some other military toy) wreaked havoc on the computers.



Flight mission for a rocket



I was assigned to port a large (about 400 thousand lines) missile launch control and monitoring system to new versions of the operating system, compiler and language: specifically, from Solaris 2.5.1 to Solaris 7, and from the Verdix Ada Development System (VADS), written in Ada 83, to the Rational Apex Ada system, written in Ada 95. VADS had been bought by Rational, and its product was obsolete, although Rational had tried to implement compatible versions of the VADS-specific packages to ease the transition to the Apex compiler.



Three people helped me just to get the code to compile cleanly. That took two weeks. Then I worked on my own to get the system running. In short, it was the worst architecture and implementation of a software system I have ever encountered, so it took another two months to finish the port. The system was then handed over for testing, which took several more months. I fixed the bugs found in testing as they came up, but their number dropped off quickly (the source code was a production system, so its functionality was quite reliable; I just had to remove the bugs introduced by adapting it to the new compiler). Eventually, when everything worked as it should, I was moved to another project.



And on the Friday before Thanksgiving, the phone rang.



A rocket launch test was scheduled about three weeks later, and in a laboratory run of the countdown, the command sequence had locked up. In real life this would abort the test, and if the lockup occurred within a few seconds of engine start, several irreversible actions would take place in the auxiliary systems, making re-preparation of the rocket long and expensive. The rocket wouldn't have launched, but a lot of people would have been very upset about the lost time and the very, very large amount of money. Don't let anyone tell you that the Department of Defense spends money carelessly - I have yet to meet a contract manager for whom budget wasn't the first or second priority, with schedule right behind.



In previous months, this countdown test had been run hundreds of times in many variations, with only a few minor hitches. So the likelihood of this happening was very low, but its consequences were very significant. Multiply both of these factors, and you will understand that the news predicted a ruined holiday week for me and dozens of engineers and managers.



And attention turned to me, as the person who had ported the system.



As in most safety-critical systems, a lot of parameters were logged, so it was fairly easy to identify the few lines of code that had executed before the system froze. And of course there was absolutely nothing unusual about them; the same statements had executed successfully literally thousands of times during that same run.



We brought in the Apex people from Rational, since they had developed the compiler and some of the routines called in the suspect code were theirs. It was impressed upon them (and everyone else) that they had to find the cause of a problem of literally national significance.



Since there was nothing interesting in the logs, we decided to try to reproduce the problem in the local lab. That was no easy task, since the event occurred roughly once in every 1000 runs. One suspected cause was that a call to the vendor-supplied mutex Unlock function (part of the VADS migration package) was not actually unlocking. The calling thread processed heartbeat messages, which nominally arrived every second. We raised the rate to 10 Hz, that is, 10 per second, and started running. After about an hour the system locked up. In the log we saw that the sequence of recorded messages was the same as in the failed test. We made several more runs; the system reliably locked up 45-90 minutes after starting, and each time the log showed the same trace. Even though we were technically executing different code now - the message rate was different - the system's behavior repeated, so we were confident that this load scenario was triggering the same problem.



Now we had to find out exactly where in the sequence of statements the lockup occurred.



The implementation used Ada tasking, and used it incredibly badly. Tasks are Ada's high-level concurrency construct, something like threads of execution, but built into the language itself. When two tasks need to interact, they "rendezvous", exchange whatever data they need, then end the rendezvous and go back to their independent execution. This system, however, was implemented differently. After one task rendezvoused with another, that task would rendezvous with a third, which would rendezvous with a fourth, and so on until some piece of processing was complete. Then all these rendezvous would unwind and each task would return to its own execution. In effect, we were dealing with the world's most expensive function-call mechanism, one that stopped the entire "multitasking" process while it handled a piece of input data. It had never caused problems before only because the throughput was so low.



I describe this tasking mechanism because a "task switch" can occur whenever a rendezvous is requested or is waiting to complete. That is, the processor may start running another task that is ready to execute. So when one task becomes ready to rendezvous with another, a completely different task may begin executing, with control eventually returning to the original rendezvous. Other events can also cause a task switch; one of them is a call to a system function, such as printing or operating on a mutex.



To understand which line of code was causing the problem, I needed a way to record progress through the sequence of statements without triggering a task switch, since a task switch might prevent the hang from occurring. So I couldn't use Put_Line(), and I couldn't perform any I/O at all. I could set a counter variable or something of the sort, but how would I see its value if I couldn't print it to the screen?



Also, examination of the log showed that although the processing of heartbeat messages had frozen - which blocked all of the process's I/O and prevented the rest of its processing - other, independent tasks kept running. So the work was not blocked entirely; only the (critical) chain of tasks was.



That was the hook I needed to pinpoint the blocking statement.



I wrote an Ada package containing a task, an enumerated type, and a global variable of that type. The enumeration literals corresponded to specific statements in the problematic sequence (e.g. Incrementing_Buffer_Index, Locking_Mutex, Mutex_Unlocked), and I then inserted assignment statements that set the global variable to the corresponding enumeration value. Since the object code for this simply stored a constant into memory, a task switch caused by executing it was extremely unlikely. We mainly suspected statements that could trigger a task switch, since (for several reasons) the lockup seemed to occur during execution rather than when the task was switched back.



The monitoring task simply ran in a loop, periodically checking whether the value of the global variable had changed. On every change, it saved the value to a file, then waited briefly and checked again. I could write the variable to a file because this task only ran when the system chose to run it, on a task switch away from the problem area. Whatever happened in this task could not affect the other, unrelated, deadlocked tasks.



The expectation was that as the system reached the problematic code, the global variable would be updated by each successive statement. Then something would happen that caused a task switch, and since that code ran at a lower rate (10 Hz) than the monitoring task, the monitor could capture the value of the global variable and write it out. In normal operation I would get a repeating sequence of some subset of the enumerations: the last values of the variable at the moment of each task switch. When the hang occurred, the global variable would stop changing, and the last value written would show which statement had not completed.
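For readers who don't write Ada, here is the same idea sketched in Python with invented names: a global "last step" marker that the suspect code updates with cheap assignments, and a monitoring thread that writes every observed change to a file. (In the original, the whole point was that an Ada assignment of an enumeration constant was too cheap to cause a task switch; a Python thread is only a loose analogy of that.)

import enum
import threading
import time

class Step(enum.Enum):
    IDLE = 0
    INCREMENTING_BUFFER_INDEX = 1
    LOCKING_MUTEX = 2
    MUTEX_UNLOCKED = 3

last_step = Step.IDLE  # written by the suspect code, read by the monitor

def monitor(path="trace.log", poll_interval=0.01):
    # Poll the global marker and append every observed change to a file.
    previous = None
    with open(path, "w") as log:
        while True:
            current = last_step
            if current is not previous:
                log.write(f"{time.time():.3f} {current.name}\n")
                log.flush()
                previous = current
            time.sleep(poll_interval)

threading.Thread(target=monitor, daemon=True).start()

# The instrumented code then just does cheap assignments between real steps:
#   global last_step
#   last_step = Step.LOCKING_MUTEX
#   ... call the mutex ...
#   last_step = Step.MUTEX_UNLOCKED

The last value recorded before a hang points at the statement that never completed, which is exactly what the log showed in the story.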



I launched the run with the tracing code in place. It hung. And the monitoring worked like clockwork.



The log contained the expected sequence, ending with a value showing that the mutex Unlock had been called and the task was hanging there - just as it had on thousands of previous calls.



Meanwhile, the Apex engineers were frantically analyzing their code and found a place in the mutex where, in theory, it could lock up. But the probability was very small, since only a specific sequence of events with specific timing could lead to the deadlock. Murphy's law, boys, Murphy's law.



To protect that piece of code, I replaced the calls to the vendor's mutex functions (which were built on top of the OS mutex functionality) with a small native Ada mutex package that controlled access to that section.



I put it into the code and ran the test. Seven hours later, the code was still running.



My code was sent to Rational, where they compiled it, disassembled it, and verified that it did not use the same approach as the problematic mutex functions.



It was the most crowded code review in my career :) There were about ten engineers and managers in the room with me, a dozen more people connected on a conference call - and they all examined about 20 lines of code.



The code was reviewed, new executable files were built and submitted for formal regression testing. A couple of weeks later, the countdown tests were successful and the rocket took off.



Okay, this is all good and beautiful, but what's the point of this story?



It was an utterly nasty problem. Hundreds of thousands of lines of code, concurrent execution, more than a dozen interacting processes, poor architecture and poor implementation, embedded-system interfaces, and millions of dollars spent. No pressure, right.



I was not the only one working on the problem, though as the person who had done the port I was in the spotlight. But having done the port doesn't mean I had understood all those hundreds of thousands of lines of code, or even skimmed them. Engineers all over the country were analyzing the code and the logs, yet when they told me their hypotheses about the cause of the failure, it took me only half a minute to refute them. And when I was asked to analyze theories myself, I passed them on to someone else, because it was obvious to me those engineers were on the wrong track. Sounds arrogant? Yes, it does, but I rejected the hypotheses and the requests for a different reason.



I understood the nature of the problem. I didn't know exactly where it was or why, but I knew exactly what was happening.



Over the years I had accumulated a great deal of knowledge and experience. I was one of the early adopters of Ada and understood its strengths and weaknesses. I know how Ada's runtime libraries handle tasking and concurrency. And I'm comfortable with low-level work at the level of memory, registers and assembler. In other words, I have deep knowledge of my field, and I used it to find the cause of the problem. I didn't just work around the bug; I figured out how to hunt it down in a very delicate execution environment.



Stories of battles with code like this aren't very interesting to those unfamiliar with the particulars and the circumstances of the fight. But they do help you understand what it takes to solve really hard problems.



You need to be more than just a programmer to solve really difficult problems. You need to understand the “fate” of the code, how it interacts with its environment, and how the environment itself works.



And then you have your ruined holiday week.






To be continued.


