2013-08-23

Regulation is not the answer to IT failure

Widespread gloom and despondency today as the NASDAQ exchange shut down for three hours due to a "glitch":

"I would not want to speculate other than to say this is huge. Everything is halted in the market," said Sal Arnuk at Themis Trading in Chatham, New Jersey. Options trading was also halted, the exchange said.
Guess what? This is what happens when you have a SPOF (single point of failure). If you can only trade your Apple or Facebook shares on NASDAQ, then NASDAQ is a single point of failure for your trading. If you don't want a single point of failure, you have to make it easy to trade on multiple exchanges.

The WSJ has a few more details:

Nasdaq said it plans to work with other exchanges to investigate Thursday's outage, which centered on a problem with the data feed supplying U.S. markets with trade information, and supports "any necessary steps to enhance the platform."
Nasdaq officials internally pointed to a "connectivity" problem with rival NYSE Arca, according to people familiar with the matter, that led to price quotes not being reported.
Nice muddying of the waters there, NASDAQ. "Work with other exchanges", forsooth. If the problem affected feeds in general, and not just an isolated feed to one exchange, then the problem was at NASDAQ's end. In theory you could get improved robustness by having each exchange report feed problems back to NASDAQ, but in practice each exchange's clients would notice quickly enough that something was up. And if the problem were at Arca's end, it seems odd that NASDAQ would be the one to suspend operations.

Rumour has it that Arca somehow "locked" an order, causing the NASDAQ side to freeze. Unfortunately, that doesn't let NASDAQ off the hook. If you are designing a client-server system, you should plan for clients to do arbitrary and crazy things, especially if you don't control the client code. Your server should tolerate badly-behaved clients, or at the very least alert you to them and give you the option to ignore an offending client until it has sorted itself out. Letting a single client freeze the whole system, with no work-around, for three hours in the middle of the trading day - when all your techs are in the office - is terrible design.
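To illustrate the principle (not NASDAQ's actual architecture, which isn't public - all names and thresholds here are hypothetical), a fault-isolating server can count failures per client and quarantine a misbehaving one rather than letting it stall everyone else:

```python
from collections import defaultdict

QUARANTINE_THRESHOLD = 3  # consecutive failures before ignoring a client


class TolerantFeedServer:
    """Toy sketch of per-client fault isolation.

    A real exchange feed handler is vastly more complex; this only
    shows the design principle: one badly-behaved client should be
    sidelined, never allowed to freeze the whole processing loop.
    """

    def __init__(self):
        self.failures = defaultdict(int)
        self.quarantined = set()

    def handle(self, client_id, message):
        if client_id in self.quarantined:
            return None  # drop messages until an operator re-enables the client
        try:
            result = self._process(message)
            self.failures[client_id] = 0  # healthy again, reset the count
            return result
        except Exception:
            self.failures[client_id] += 1
            if self.failures[client_id] >= QUARANTINE_THRESHOLD:
                self.quarantined.add(client_id)
                # in production: alert an operator here, don't fail silently
            return None

    def _process(self, message):
        # placeholder for real order/quote handling
        if message == "locked-order":
            raise ValueError("malformed order state")
        return ("ack", message)
```

The key choice is that a failure is contained to the offending `client_id`: other clients keep getting served, and a human gets the option to re-admit the quarantined one once it has sorted itself out.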

For reference, since NASDAQ's market opening hours are 6.5 hours per day, 5 days a week, it is open for about 1690 hours per year. This three-hour downtime was therefore about 0.18% of its trading time, dropping it below "three nines" (99.9%) of availability. If it avoids any outage next year, the two-year average will climb back above three nines - but it gives you some perspective on the limits of reliability, even for a firm where time is, quite literally, money.
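The arithmetic is easy to check in a few lines, using the figures from the post:

```python
# Back-of-the-envelope check of the availability figures above.
hours_per_year = 6.5 * 5 * 52        # NASDAQ trading hours: 1690 per year
outage_hours = 3

availability = 1 - outage_hours / hours_per_year
print(f"trading hours/year: {hours_per_year:.0f}")
print(f"availability this year: {availability:.4%}")   # ~99.82%, below three nines

# Averaged over two years with no further outages:
two_year = 1 - outage_hours / (2 * hours_per_year)
print(f"two-year availability: {two_year:.4%}")        # ~99.91%, back above 99.9%
```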

What's the answer to this outage? Regulation!

Currently, exchanges can voluntarily choose to have their backup plans reviewed by the SEC, which then audits their technological systems. One potential rule, known as Regulation SCI, would require major exchanges to submit to the audits. That regulation is pending comment.
Lauer said the shutdown should be a wake-up call to regulators to monitor exchanges, which he said have not kept up with the speed of current technology. "We have an overly complex system, and it's complex to the point of dysfunction," he said.
FFS. Lauer may be right about the complexity - actually, I'm pretty sure he is right - but his solution blows chunks. Adding regulators very seldom improves technological problems. For a start, if you have a complex and broken software system, you're not going to be able to evolve it into a reliable system. You're going to have to develop (carefully, with due respect for the Second System Effect) a new system in parallel with the old one, and very slowly and carefully migrate traffic from old to new while detecting and fixing the inevitable bugs and scaling issues. A regulator might be able to make you initiate that process, but, sure as little green apples, it won't make that process any more reliable.

Why not? Let's be brutally honest. What really good software engineer or technical project manager would work for a regulator, employed on a government-standard seniority-based competence-hostile salary scheme, battling with much more highly paid software engineers to make them try to do the right thing? Even if their employer (here, the SEC) has anti-poaching arrangements with the major banks, forbidding them from poaching an SEC tech who works or has recently worked on their compliance, there's nothing to stop Fred from Bank of America informally recommending to his ex-colleague Jim at Goldman Sachs that they hire Sheila from the SEC, who's been doing compliance testing on BofA and showing unusual technical competence. In return, Jim could tell Fred to look at hiring Sophie from the SEC, who's been working with Goldman Sachs, but to avoid Hermann at all costs as he's a talentless box-ticking drone.

NASDAQ clearly has software architecture problems, but regulator intervention is not going to fix them. Only commercial competition is going to help. If another firm is willing to set up a small exchange for (say) the top 50 NASDAQ-listed firms, and can persuade some major banks to act as market makers, it will slightly increase liquidity in those firms and (more importantly) provide redundancy in the event that NASDAQ trading fails. They may have to plan and provision for NASDAQ failure, handling several times their normal traffic until NASDAQ gets back on its feet, but that's feasible. A beneficent side-effect will be that NASDAQ will realise that downtime will no longer just delay trades, but will actually move trades to their competitor and lose them money. If that's not an incentive to improve, I don't know what would be.
