If you have done any kind of interview for a software developer/engineer position, you probably got asked a few questions about tests: “What is a unit test? What is an integration test? Have you ever done system tests?” Your interviewer, more likely than not, also asked you “And how did you handle test data?”

If you ignored that question you may have missed one of the greatest opportunities to shine and separate yourself from the rest of the pack. Handling test data is something you must do when writing complex tests, and how well someone handles it is a very good heuristic for identifying an excellent software developer.

Understanding the issue

Doing unit tests is hard. Doing integration and system tests is exponentially harder because you’re now handling much larger parts of your application and your application often requires a complex configuration to work properly – this is especially true for database-backed applications.

You can just say “ship it” and shove all of this setup logic into your application’s lifecycle, but then you’re creating additional points of failure. When dealing with system tests you have to deploy the application, wait for it to start up, and only then can you start running your tests. While this logic should be part of your application’s pipeline, it shouldn’t be coupled with your build cycles.

Strategy #1: Stubs/Mocks

Mocking at the unit test level is usually pretty simple. It is much less so at the integration/system level, since the mocks’ complexity grows quickly with the complexity of the system itself. This is, nevertheless, the best possible strategy and should be preferred.

Although it requires deep knowledge of the interfaces between your application and outside components, this solution is ideal for new applications which have either few interfaces or well-designed ones, and a very clearly defined set of boundaries.
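As a minimal sketch of stubbing at a boundary, the snippet below uses Python's standard `unittest.mock`. The `monthly_interest` function, the `rate_provider` object, and its `current_rate` method are hypothetical names made up for illustration; the only assumption is that the outside component is passed in, so a stub can take its place.

```python
from unittest.mock import Mock


# Hypothetical application code: the rate provider is the boundary with an
# outside component, so it is passed in rather than created internally.
def monthly_interest(balance, rate_provider):
    annual_rate = rate_provider.current_rate("mortgage")
    return round(balance * annual_rate / 12, 2)


# In a test, a stub stands in for the real external service.
rate_provider = Mock()
rate_provider.current_rate.return_value = 0.06

interest = monthly_interest(100_000, rate_provider)

# The stub records how it was called, so the interaction can be verified too.
rate_provider.current_rate.assert_called_once_with("mortgage")
```

The stub is only as trustworthy as your knowledge of the real interface, which is why this approach favors applications with few, well-defined boundaries.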

Strategy #2: Generate your data

The second best option is to generate your own data, often with SQL scripts which fill tables with specific values. It gives you a fairly decent amount of control over what exactly is being tested, while removing a very significant part of your application’s complexity. You are no longer required to know all the possible interactions, but it has a few caveats.

One of them is having a much bigger dependency on your stack than with mocking. While this solution enables you to run more comprehensive tests, it also means your test suite will probably take much longer to run and failures will be harder to debug properly.
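A seed script of this kind could look roughly like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database so that it is self-contained; the `customers` and `accounts` tables and their values are hypothetical, and a real suite would run against the same database engine as the application.

```python
import sqlite3

# Hypothetical schema and seed values, written as the kind of SQL script
# a test suite would load before running.
SEED_SQL = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    kind TEXT NOT NULL
);
INSERT INTO customers (id, name) VALUES (1, 'Test Customer');
INSERT INTO accounts (customer_id, kind)
VALUES (1, 'orange'), (1, 'orange'), (1, 'mortgage');
"""


def seeded_connection():
    """Create a fresh in-memory database and fill it with known values."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SEED_SQL)
    return conn


def count_accounts(conn, customer_id, kind):
    """Count the accounts of one kind that belong to a customer."""
    row = conn.execute(
        "SELECT COUNT(*) FROM accounts WHERE customer_id = ? AND kind = ?",
        (customer_id, kind),
    ).fetchone()
    return row[0]


conn = seeded_connection()
orange_count = count_accounts(conn, 1, "orange")
```

Because every value comes from the script, a test knows exactly what to expect from each query. The price is that the database is now part of every test run.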

Strategy #3: Copy your data from production

This is probably the only economical solution when your database schema is very complex or simply too laborious to set up properly. Some examples are setups with multiple databases and legacy applications. What you lose in control, you gain in richness: having the production data allows you to run your tests with real-world data sets.

How we did it

We opted for option #3. As long as we take great care to ensure no personal data is floating around, it enables us to write system tests with very little knowledge of the underlying models, which significantly reduces the ramp-up time new developers need before they can start testing the features being developed.
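How personal data is kept out of the copied data set is not described here. One common approach, sketched below purely as an assumption, is to replace personal fields with stable tokens while copying. The field names, the sample row, and the salt are all hypothetical.

```python
import hashlib

# Hypothetical set of columns treated as personal data.
PERSONAL_FIELDS = {"name", "email"}


def pseudonymize(row, salt):
    """Return a copy of row with personal fields replaced by stable tokens."""
    masked = dict(row)
    for key in PERSONAL_FIELDS:
        if key in row:
            raw = (salt + str(row[key])).encode("utf-8")
            digest = hashlib.sha256(raw).hexdigest()
            # Same input and salt always give the same token, so rows that
            # referred to the same person before still match after masking.
            masked[key] = f"{key}-{digest[:12]}"
    return masked


production_row = {"id": 42, "name": "Jane Doe", "email": "jane@example.com", "accounts": 2}
test_row = pseudonymize(production_row, salt="per-environment-secret")
```

Non-personal fields such as `accounts` pass through untouched, which preserves the richness of the real data. Salted hashing is pseudonymization rather than full anonymization, so the salt still has to be kept secret.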

We actually dialed it up to 11 and created a service which indexes customers based on specific criteria. This enables us to run our tests by specifying customers based on characteristics instead of using specific users.

Our tests look like “a customer with two orange accounts and one mortgage is able to sign up for a new mortgage”. As you can see, we are no longer coupled to a specific user for a test case: the service will always return a new user which fulfills the desired characteristics.
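The idea of asking for a customer by characteristics can be sketched as follows. This is not the actual service: `Customer`, `CustomerIndex`, and `take` are hypothetical names, the index lives in memory, and "a new user" is modeled by never handing out the same customer twice.

```python
from dataclasses import dataclass, field


@dataclass
class Customer:
    customer_id: int
    # Maps a product kind to how many of that product the customer holds.
    products: dict = field(default_factory=dict)


class CustomerIndex:
    """Hands out customers that match the requested characteristics."""

    def __init__(self, customers):
        self._available = list(customers)

    def take(self, **required):
        """Return and remove the first customer holding exactly these counts."""
        for position, customer in enumerate(self._available):
            if all(customer.products.get(kind, 0) == count
                   for kind, count in required.items()):
                return self._available.pop(position)
        raise LookupError(f"no customer matching {required}")


index = CustomerIndex([
    Customer(1, {"orange_account": 1}),
    Customer(2, {"orange_account": 2, "mortgage": 1}),
    Customer(3, {"orange_account": 2, "mortgage": 1}),
])

# "A customer with two orange accounts and one mortgage", requested twice.
first = index.take(orange_account=2, mortgage=1)
second = index.take(orange_account=2, mortgage=1)
```

Each request yields a different matching customer, so a test never depends on a hard-coded user. When nothing matches, the sketch raises `LookupError`.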

The next step is to have this service actually create new users when no test data is found (which is the case when developing new features).