2016-11-24

Expensive integer overflows, part N+1

Now the European Space Agency has published its preliminary report into what happened with the Schiaparelli lander, it confirms what many had suspected:

As Schiaparelli descended under its parachute, its radar Doppler altimeter functioned correctly and the measurements were included in the guidance, navigation and control system. However, saturation – maximum measurement – of the Inertial Measurement Unit (IMU) had occurred shortly after the parachute deployment. The IMU measures the rotation rates of the vehicle. Its output was generally as predicted except for this event, which persisted for about one second – longer than would be expected. [My italics]
This is a classic software mistake - of which more later - where a stored value becomes too large for its storage slot. The lander was spinning faster than its programmers had estimated, and the measured rotation speed exceeded the maximum value which the control software was designed to store and process.
When merged into the navigation system, the erroneous information generated an estimated altitude that was negative – that is, below ground level.
The stream of estimated altitude reading would have looked something like "4.0km... 3.9km... 3.8km... -200km". Since the most recent value was below the "cut off parachute, you're about to land" altitude, the lander obligingly cut off its parachute, gave a brief fire of the braking thrusters, and completed the rest of its descent under Mars' gravitational acceleration of 3.8m/s^2. That's a lot weaker than Earth's, but 3.7km of freefall gave the lander plenty of time to accelerate; a back-of-the-envelope calculation (v^2 = 2as) suggests a terminal velocity of 167 m/s, minus effects of drag.

Well, there goes $250M down the drain. How did the excessive rotation speed cause all this to happen?

When dealing with signed integers, if - for instance - you are using 16 bits to store a value then the classic two's-complement representation can store values between -32768 and +32767 in those bits. If you add 1 to the stored value 32767 then the effect is that the stored value "wraps around" to -32768; sometimes this is what you actually want to happen, but most of the time it isn't. As a result, everyone writing software knows about integer overflow, and is supposed to take account of it while writing code. Some programming languages (e.g. C, Java, Go) require you to manually check that this won't happen; code for this might look like:

/* Will not work if b is negative */
if (INT16_MAX - b >= a) {
   /* a + b will fit */
   result = a + b
} else {
   /* a + b will overflow, return the biggest
    * positive value we can
    */
   result = INT16_MAX
}
Other languages (e.g. Ada) allow you to trap this in a run-time exception, such as Constraint_Error. When this exception arises, you know you've hit an overflow and can have some additional logic to handle it appropriately. The key point is that you need to consider that this situation may arise, and plan to detect it and handle it appropriately. Simply hoping that the situation won't arise is not enough.

This is why the "longer than would be expected" line in the ESA report particularly annoys me - the software authors shouldn't have been "expecting" anything, they should have had an actual plan to handle out-of-expected-value sensors. They could have capped the value at its expected max, they could have rejected the use of that particular sensor and used a less accurate calculation omitting that sensor's value, they could have bounded the calculation's result based on the last known good altitude and velocity - there are many options. But they should have done something.

Reading the technical specs of the Schiaparelli Mars Lander, the interesting bit is the Guidance, Navigation and Control system (GNC). There are several instruments used to collect navigational data: inertial navigation systems, accelerometers and a radar altimeter. The signals from these instruments are collected, processed through analogue-to-digital conversion and then sent to the spacecraft. The spec proudly announces:

Overall, EDM's GNC system achieves an altitude error of under 0.7 meters
Apparently, the altitude error margin is a teeny bit larger than that if you don't process the data robustly.

What's particularly tragic is that arithmetic overflow has been well established as a failure mode for ESA space flight for more than 20 years. The canonical example is the Ariane 5 failure of 4th June 1996 where ESA's new Ariane 5 rocket went out of control shortly after launch and had to be destroyed, sending $500M of rocket and payload up in smoke. The root cause was an overflow while converting a 64 bit floating point number to a 16 bit integer. In that case, the software authors had actually explicitly identified the risk of overflow in 7 places of the code, but for some reason only added error handling code for 4 of them. One of the remaining cases was triggered, and "foom!"

It's always easy in hindsight to criticise a software design after an accident, but in the case of Schiaparelli it seems reasonable to have expected a certain amount of foresight from the developers.

ESA's David Parker notes "...we will have learned much from Schiaparelli that will directly contribute to the second ExoMars mission being developed with our international partners for launch in 2020." I hope that's true, because they don't seem to have learned very much from Ariane 5.

No comments:

Post a Comment

All comments are subject to retrospective moderation. I will only reject spam, gratuitous abuse, and wilful stupidity.