Testing software fast and hard

A version of this blog was originally posted by Test Magazine in March 2019. We’re sharing it once more for those who missed it the first time around!

Developers like flame wars. Last Friday, when the afternoon was slowing in the office, an inadvertent GIF on Slack sparked some friendly debate about the right way to test. Because winning flame wars is important, here is my “well, actually”…

At some point in the 2000s, when PHP was no longer considered just a templating language and Ruby had just got on Rails, the programming community decided that dynamically typed languages were a great way to reduce cognitive load on the programmer, to stay ruthlessly pragmatic, and to avoid the factory-like culture of Java.

The Facebook mantra of “Move fast and break things” has since been perpetuated all over the startup scene, and is looked at as the way you build businesses, products, houses, or anything, really, that can collapse on you. And if you’ve ever worked on software, you know that what can happen, will happen.

Along the way somewhere Clean Code arrived, and the rise of TDD gathered a crowd around Writing Tests First, the belief in The Coverage, and other mantras that are fun to recite, but significantly less fun to do.

A lot of self-help books like Clean Code are vaguely based on personal experience, and bring certain programming patterns into domains where they had traditionally not been used: for instance, simplifying C++-esque object orientation with functional programming concepts such as small, simple, pure functions.

In reality, there has been a lot of research on the software crisis and how to get out of the mess we’re in, and it often contradicts the wisdom of the crowd. Let’s take a look at different strategies that drive software quality, and where they actually make a difference.

From bottom to top, I generally look at software verification at the following layers: type system, unit tests, integration tests, and organizational management structure.

Wait what? Organizational management structure? Well, maybe we can start with that then.

Management

Organizational behaviour is a social science in its own right, and studies the subtle art of people management structures. Software is usually made by humans, and humans have needs, internal and external motivators, and occasionally have to work together to deliver something.

Google ran a study on its teams as working units to identify what made them more effective, while Microsoft’s research focused on how organizational structure determined software failure rates. Both are interesting approaches in their own way, and Microsoft’s study gives us an interesting view into the development of Windows Vista. The study tells us that smaller, more focused teams produce more reliable software. High churn of engineers lowers software quality, while tighter collaboration between teams working on the same project results in lower failure rates.

These might seem like statements coming from Captain Obvious (or Melvin Conway). However, quantifiable results from the study show that team and collaboration structure can be a better predictor of quality than tooling, testing strategies, or other code-based metrics.

The way any individual team controls the quality of its output is the next step from here. Code reviews in particular are a great way to create and maintain a common set of standards. Written code reviews force engineers to communicate their concerns clearly. This increased technical communication helps everyone on the team learn about different styles and perspectives, and at the same time helps level the skills across the team. Maybe avoid the style of Linus Torvalds, though.

Integration tests

Integration tests, surprise-surprise, test integration of components or modules in a system. You can also test the integration of integrated modules, and there are turtles all the way down.

It’s often easier to write correct code in isolation, so a large number of bugs occur at system boundaries: validating inputs and formatting outputs, failing to check permission levels, or badly implementing interface schemas. This problem is amplified by the current trend of microservices, where interface versions can fall out of sync between various services within the system.

At this level, we’re best off writing pass-through end-to-end tests for features, and leaning on the fact that we have so many other layers of protection against failures that something will eventually trip the wires. In fact, a study by Niedermayr, Juergens and Wagner shows that code coverage in integration tests is not a reliable indicator of failure rates.

If you look at it another way, production is just one big integration test. A trick I love to do is create a mute-production instance, which receives a portion of the actual production traffic, but will never generate responses to the users. With enough investment in a stateless orchestration layer, we can even mute-test subtrees of services at strategic places, then make them active and discard the old subtree once the workload is gone.

Coupled with principles behind building highly observable systems, this kind of test environment removes a lot of anxiety around what happens when we deploy to production, because the mute-prod will receive precisely the same data. The more knobs and probes we expose in live systems, the better visibility we get into the internals.
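To make the mute-production idea a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: ShadowRouter, the two handlers and the sample_rate parameter are hypothetical names, and a real setup would mirror traffic asynchronously at the load balancer or orchestration layer rather than inline in the request path.

```python
import random
from typing import Callable, Dict, List

Request = Dict[str, str]
Handler = Callable[[Request], str]


class ShadowRouter:
    """Answer users from the live handler; mirror a sample to a muted one."""

    def __init__(self, live: Handler, muted: Handler, sample_rate: float,
                 rng: random.Random) -> None:
        self.live = live
        self.muted = muted
        self.sample_rate = sample_rate
        self.rng = rng
        self.mismatches: List[Request] = []

    def handle(self, request: Request) -> str:
        response = self.live(request)
        if self.rng.random() < self.sample_rate:
            try:
                # The muted instance gets a copy and its answer is only compared.
                shadow = self.muted(dict(request))
                if shadow != response:
                    self.mismatches.append(request)
            except Exception:
                # A crash in the muted instance must never reach the user.
                self.mismatches.append(request)
        return response


def live_handler(request: Request) -> str:
    return request["name"].upper()


def candidate_handler(request: Request) -> str:
    # Hypothetical new build that (wrongly) strips whitespace.
    return request["name"].strip().upper()


router = ShadowRouter(live_handler, candidate_handler, sample_rate=1.0,
                      rng=random.Random(0))
responses = [router.handle({"name": n}) for n in ["ada", " bob ", "eve"]]
```

Users only ever see the answers in responses, which come from the live handler. The behaviour change in the candidate shows up in router.mismatches instead, where we can look at it without anyone getting paged.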

Unit testing

So in order to integrate modules, we want some confidence that the modules themselves work according to specification. This is where unit testing enters the picture.

Unit tests are usually fast, and more or less comprehensive tests of isolated pieces of easy-to-grasp building blocks. How fast? Ruby on Rails’s master repo runs about 67 tests, and 176 assertions per second.

As a rule of thumb, one test should cover one scenario that can happen to a module. In comparison to integration testing, the same study by Niedermayr, Juergens and Wagner shows that code coverage on a unit testing level does influence failure rates, if done well.
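As an illustration of “one test, one scenario”, here is a small sketch using Python’s built-in unittest module. The parse_port function is a hypothetical module under test, invented for this example; the point is only that each test method names and exercises exactly one thing that can happen to it.

```python
import unittest


def parse_port(text: str) -> int:
    """Hypothetical module under test: parse a TCP port number."""
    port = int(text)  # raises ValueError for non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortTest(unittest.TestCase):
    def test_accepts_a_valid_port(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_rejects_non_numeric_input(self):
        with self.assertRaises(ValueError):
            parse_port("http")

    def test_rejects_port_zero(self):
        with self.assertRaises(ValueError):
            parse_port("0")

    def test_rejects_port_above_range(self):
        with self.assertRaises(ValueError):
            parse_port("65536")


def run_suite() -> unittest.TestResult:
    """Run the four scenarios and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParsePortTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

When a test like test_rejects_port_zero fails, the name alone tells you which scenario broke, which is most of the value of keeping them this small.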

A study from ’94 by Hutchins et al. claims that coverage levels over 90% showed better fault detection rates than smaller test sets, and “significant” improvements occurred as coverage increased from 90% to 100%.

The BDD movement has this fun practice of developing specifications and turning them straight into unit tests. Clear, human-readable unit tests help document the code and ease some of the knowledge transfer that needs to happen when developers inevitably come and go, or when the requirements of existing components change.

Unit testing in my book also includes QuickCheck-style generated tests. The idea of QuickCheck is that instead of having some imperative code of if-this-then-that to walk through, the programmer can list assumptions that need to hold true for the output of the function given some inputs. QuickCheck then generates tests that try and falsify these assumptions using the implementation, and, if it finds one, reduces it to a minimal input that proves them wrong.
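The generate-then-shrink loop is simple enough to sketch in a few lines of Python. This is a toy, not QuickCheck’s actual algorithm or API: it only knows how to generate lists of integers, and find_counterexample, minimise, shrink and buggy_max are all names invented for this example.

```python
import random
from typing import Callable, List, Optional

Property = Callable[[List[int]], bool]


def shrink(xs: List[int]) -> List[List[int]]:
    """Simpler candidates: drop one element, or halve one toward zero."""
    candidates = [xs[:i] + xs[i + 1:] for i in range(len(xs))]
    for i, x in enumerate(xs):
        if x != 0:
            halved = int(x / 2)  # int() truncates toward zero
            candidates.append(xs[:i] + [halved] + xs[i + 1:])
    return candidates


def minimise(prop: Property, failing: List[int]) -> List[int]:
    """Greedily replace a failing input with a simpler one that still fails."""
    progress = True
    while progress:
        progress = False
        for candidate in shrink(failing):
            if not prop(candidate):
                failing = candidate
                progress = True
                break
    return failing


def find_counterexample(prop: Property, rng: random.Random,
                        runs: int = 200) -> Optional[List[int]]:
    """Try random inputs; return a minimised failing input, or None."""
    for _ in range(runs):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 10))]
        if not prop(xs):
            return minimise(prop, xs)
    return None


def buggy_max(xs: List[int]) -> int:
    best = 0  # planted bug: should start from the first element
    for x in xs:
        if x > best:
            best = x
    return best


def max_agrees(xs: List[int]) -> bool:
    # The assumption that has to hold for every non-empty input.
    return not xs or buggy_max(xs) == max(xs)


def reverse_twice_is_identity(xs: List[int]) -> bool:
    return list(reversed(list(reversed(xs)))) == xs
```

Calling find_counterexample(max_agrees, random.Random(1)) first stumbles on some list of negative numbers, then whittles it down to [-1], which is about as small a hint as you can get that the bug is in how negatives are handled. The reverse-twice property survives all 200 attempts and returns None.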

Interestingly, the number of scenarios a unit test normally has to cover is heavily influenced by the programming language it’s written in. Which leads me to necessarily discuss the holy flamewar about static and dynamic typing.

Type systems

Hindley-Milner

Languages with Hindley-Milner-style static type systems, such as Haskell or Rust, force the programmer to establish contracts that are checked before the program can be run.

What the programmer finds, then, is that they’re suddenly programming in two languages in parallel: the type system provides a proof for the program, while there is also a program that fulfills the requirements.

This enables a style of reasoning about correctness that, coupled with a helpful compiler, lets the programmer focus on the things that cannot be proven by the type system: business logic.

Of course, this is not a net win. In many cases, a solution using a dynamic type system is much more straightforward and elegant, or, in other cases, the type system constraints make certain implementations basically impossible.

In other cases, using Haskell allowed writing a much smaller program significantly faster compared to the alternatives, so much so that the US Navy had to repeat the test in disbelief.

Elegance is in the eye of the beholder, and a beautifully typed abstraction that reduces to a simple state machine during compilation can be just as attractive as a quick and dirty LISP macro. Sometimes, all that complexity is hard to justify only to please the compiler. It comes through experience, taste, and applying reasoned judgement in the right situation. People often look at programming as craftsmanship. Yes, we do know how to do the math, but with this neat trick we get 90% there with 10% of the effort, and it may just be good enough. And it will explode in that edge case I think would never actually happen. But I digress.

Weak but static

Somewhat more approachable, but providing less stringent verification, are languages in the C++/Java/C#-style OOP family, as well as the likes of C and Go.

The type systems here allow for a different kind of flexibility, and more desirable escape hatches to the dynamic world.

A weaker, but still static type system provides fewer guarantees about the correctness of the programs, something that we have to make up for in testing, and/or coding standards. NASA’s Jet Propulsion Lab, a mass manufacturer of Mars rovers, maintains a set of safe programming guidelines for C. Their guidelines seem to be effective. Opportunity exceeded its originally planned 90 days of activity by 14 years by careful maintenance. Curiosity is still cruising the surface of Mars, and is being patched on a regular basis.

Dynamic

Speaking of JPL, the internet folklore preserves the tale of using LISP at the NASA lab, a dynamic, functional programming language from the 60s, that’s still looked at as one of the most influential inventions, or even discoveries, in computing science. Today’s most commonly used LISP dialect is Clojure, which sees increasing popularity in data science circles.

Dynamic languages provide ultimate freedom, and little safety. Most commonly, the only way to determine if a piece of code is in any way reasonable is to run it, which means our testing strategy needs to be more principled, and, indeed, thorough, as there’s no next layer to fall back to.
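Here is about the smallest example I can think of for “the only way to find out is to run it”, sketched in Python. The shipping_cost function is hypothetical, and the typo in it is deliberate.

```python
def shipping_cost(weight_kg, express=False):
    if express:
        # Deliberate typo: `wieght_kg` does not exist. Python accepts the
        # definition anyway and only complains when this line is executed.
        return 10 + 2 * wieght_kg
    return 5 + weight_kg


def express_path_fails() -> bool:
    """Report whether the rarely used branch blows up when exercised."""
    try:
        shipping_cost(3, express=True)
    except NameError:
        return True
    return False
```

The module imports fine and shipping_cost(3) happily returns 8. The broken branch stays hidden until a test, or a customer, takes the express path, which is exactly why every branch needs a test of its own in this world.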

Probably the most widespread use of a dynamic type system today comes from JavaScript. Interestingly, as companies strive for lowering the entry barrier for contributors while scaling up development and maintaining quality, Google, Microsoft, and Facebook all came up with their own solutions to introduce some form of static type checking into the language.

Even though Google’s Dart hasn’t seen significant adoption, TypeScript from Microsoft has, and its use is driven by large and popular projects such as VSCode. Both approaches introduce a language with static type checking that compiles to JavaScript, making it easy to gradually introduce to existing projects, too.

In contrast, Facebook’s Flow is a static analyser purely built on JavaScript, which introduces its own kind of type annotations. The idea is that if there are type annotations at strategic places, the type checker should be able to figure out if there are any type errors in a part of the program by tracing the data flow.

Enthusiastic programmers are going to tell you that both approaches to static typing in JavaScript are flawed in their own way, and they would be right. In the end, a lot of the arguments boil down to subjective ideas and tastes about software architecture. It seems difficult to deny, however, that some form of static type checking provides several benefits to scaling and maintaining software projects.

The little things that slip away

The list of things we can do to ensure correctness of software is far from over. The state of the art keeps pushing further, and new approaches gain popularity quickly, especially within the security community.

In absolutely critical modules, such as anything cryptography or safety related, formal verification can increase confidence in parts of the system, but it’s hard to scale.

A familiar sentiment can be seen behind the principles of LangSec. In many cases, the power and expressiveness of our languages allow inadvertent bugs to creep in. LangSec says, make all the invalid states unrepresentable by the language itself. Make the language limit what the programmer can do, so they can avoid what they shouldn’t. This is also the motivation behind coding standards such as JPL’s, which make reasoning about state and data flow throughout the program code easier.
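A rough sketch of that idea in Python: recognise the input completely at the boundary, and let the rest of the program only ever see a value that is already known to be valid. Colour, parse_colour and brightness are hypothetical names for this example, and since Python does not enforce type hints at runtime this is a convention rather than a guarantee.

```python
import re
from dataclasses import dataclass

# The whole input language: "#" followed by exactly three hex byte pairs.
_HEX_COLOUR = re.compile(r"#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})")


@dataclass(frozen=True)
class Colour:
    red: int
    green: int
    blue: int

    def __post_init__(self) -> None:
        # Refuse to construct a value that is outside the valid range.
        for channel in (self.red, self.green, self.blue):
            if not 0 <= channel <= 255:
                raise ValueError(f"channel out of range: {channel}")


def parse_colour(text: str) -> Colour:
    """Accept the input in full or reject it; never half-process it."""
    match = _HEX_COLOUR.fullmatch(text.lower())
    if match is None:
        raise ValueError(f"not a colour: {text!r}")
    red, green, blue = (int(group, 16) for group in match.groups())
    return Colour(red, green, blue)


def brightness(colour: Colour) -> float:
    # No validation needed here: a Colour cannot hold invalid channels.
    return (colour.red + colour.green + colour.blue) / (3 * 255)
```

Because the class is frozen and checks itself on construction, code like brightness never needs an “is this input sane?” branch, and that is one less place for a bug to hide.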

When we’re reasonably sure that what we need is good enough, we can start fuzzing it. Fuzzing is great. It is all about feeding unexpected states into a system, and waiting for them to cause failures. This simple idea helps discover security holes in popular software or can help engineer chaos in the cloud.
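A bare-bones fuzzer fits in a dozen lines of Python. The sketch below throws random bytes at a hypothetical parse_record function with a planted bug; real fuzzers are far smarter about how they pick inputs (coverage guidance, mutation of known-good samples), so treat this as the idea only.

```python
import random
from typing import Callable, List, Tuple


def parse_record(data: bytes) -> bytes:
    """Hypothetical parser: 1 length byte, the payload, 1 checksum byte."""
    if len(data) < 1:
        raise ValueError("empty input")
    length = data[0]
    payload = data[1:1 + length]
    # Planted bug: reads the checksum without checking that it is there.
    checksum = data[1 + length]
    if sum(payload) % 256 != checksum:
        raise ValueError("bad checksum")
    return payload


def fuzz(target: Callable[[bytes], object], rng: random.Random,
         runs: int = 1000) -> List[Tuple[bytes, Exception]]:
    """Feed random inputs to target and collect unexpected failures."""
    crashes = []
    for _ in range(runs):
        size = rng.randrange(0, 8)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except ValueError:
            pass  # a documented rejection, not a crash
        except Exception as error:  # anything else is a finding
            crashes.append((data, error))
    return crashes


crashes = fuzz(parse_record, random.Random(0))
```

Every entry in crashes pairs an input with the IndexError it triggered, because almost any short input claims a longer payload than it actually carries. A hand-written test suite built around well-formed records could easily miss that.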

As always, producing a stable and secure system requires principled engineering, in software just as much as architecture. We need to understand the pieces that make up the whole, analyze then verify their interactions internally, and with the environment. Despite our best efforts, bugs will always creep in, and all we can do is try to ensure the ones that remain are not catastrophic.

However, once software goes live, verification does not stop. Designing for observability by exposing knobs, tracing, alerting, and collecting a set of operational metrics all help us reason about the state of the system while it’s running, which is the ultimate test of it all.

Software development is a process, and it’s practically impossible to achieve perfection. As long as the team has a plan to approximate it, and everybody is committed, we can call it good enough, then get out of the office to enjoy the sun.

If you’re interested in debating the rights and wrongs of software testing some more, drop me a line (@rhapsodhy or by email). I’d love to hear your thoughts!

Here at Red Sift, when we’re not debating over the right way to test, we’re enabling security-first organizations to successfully communicate with and ensure the trust of their employees, vendors, and customers. Check out our homepage to find out more about what we do.
