UCL for Code in Research
The companion podcast for courses on programming from the Advanced Research Computing Centre of University College London, UK.
5/9 Research Software Engineering with Python (COMP233) - Testing with Python
Testing your software is part of development. In this episode I talk about different types of testing, automated tests, legacy code, and more. My guest is Stef Piatek from UCL, who tells us how he approaches testing in his daily work.
Links
- https://docs.python.org/3/library/unittest.html
- https://docs.pytest.org/
- https://docs.pytest.org/en/7.1.x/how-to/monkeypatch.html PyTest Monkeypatch
- https://agiledata.org/essays/tdd.html test driven development TDD
- https://en.wikipedia.org/wiki/Extreme_programming
- https://joss.readthedocs.io/en/latest/review_criteria.html criteria for open source software reviews, which includes a section on testing
- https://www.freecodecamp.org/news/a-practical-guide-to-start-opensource-contributions/
- https://docs.github.com/en/actions GitHub Actions
- https://martinfowler.com Martin Fowler's great web site. Also look out for his book 'Refactoring - Improving the Design of Existing Code':
- Working Effectively With Legacy Code - Michael Feathers, 2004, ISBN: 8601400968741
This podcast is brought to you by the Advanced Research Computing Centre of University College London, UK.
Producer and Host: Peter Schmidt
This episode is about testing, and specifically testing with Python. My guest this time is Stefan Piatek from the Advanced Research Computing Centre at University College London. You’re going to hear from Stef a bit later.
But first, fasten your seat belts, there is a lot to cover this session. Because:
- first, we need to talk about what we mean by testing and the kind of different tests that exist,
- then there are the kind of tools and libraries Python has available for you to write and run tests
- and how this will affect the way you write your code
- and finally, the question of how and when to run your tests.
Testing has a long history in software engineering. An important job of testing is, of course, finding any errors in your program and making sure that it actually does what it says on the tin. By errors I don’t mean programming language errors: those are usually caught either by the compiler, in a language such as C or C++, or by the Python interpreter when you run your script. An error in this context is when your Python script doesn’t return the results you expect from it.
In the olden days, there were precious few tools available for finding problems in your code. It was basically going through the code line by line, perhaps with a few print statements scattered through it to help you work out what’s going on. It’s a little bit like a detective hunting for clues.
When debuggers were introduced, they let you step through a program line by line. That’s helpful, of course, and debuggers are among the essential tools of developers today, available in every integrated development environment I know. But debugging can be very time consuming, particularly when you’re dealing with a very large code base.
On top of that, for a long time software was developed on the basis of writing your code first and testing it afterwards. In some companies I worked for, tests were even done by a different team from the engineers who wrote the code.
There are a number of problems with this approach. One is that by the time you and your teammates finish writing your code, a number of bugs may have slipped in - and let’s be honest - they usually did.
So in the end it isn’t just a question of finding one bug, but a few - or even many. And that, as the code continues to grow, becomes a problem. Not only because fixing bugs after you wrote your code takes longer and longer, but also because it gets harder to find them in the first place.
The other problem is, it affects the way you design and write your code. This argument may be a bit difficult to see at first, and perhaps even against our intuition. You may tell yourself: Oh, let’s not worry about testing now, I can do this easily after I am done writing my code. In my experience, this makes us write code that is long, sometimes even a bit messy. In some cases, it may lead to what some engineers call “spaghetti code”. And I have seen such code. Maybe you have, too. Code that is hundreds, thousands or even tens of thousands of lines long. And one day, the inevitable happens: something goes wrong. Maybe a value gets returned that doesn’t make sense. Maybe the code unexpectedly crashes. And then you’re all spending days, weeks or even months trying to find what went wrong. And where.
Avoiding spaghetti code, and writing robust code right from the start is something many engineers have thought about for years. And lucky for us, they came up with a few techniques and methods to look at testing and coding in different ways.
The bottom line is this:
- preventing bugs is better than finding and fixing them.
- coding and testing go hand in hand
In the next section I am going to talk about how we can do this.
[transition]
Making testing part of your code development is a lot easier today. There are plenty of tools and libraries to choose from in most programming languages. And Python is no exception.
But let’s first look at what kind of tests there are.
Perhaps some of you have already heard about so-called unit tests. The term comes from looking at a piece of code as a unit: for instance a function that produces a value, or a class method that returns true or false. A unit test goes down to this functional level of your code and checks each small piece in isolation.
Next level up are so-called integration tests and system tests. These levels of testing are relevant for software that consists of different modules and applications. For instance, you could have an application that runs as a number of Docker containers, and you want to make sure that the interaction between components works and, of course, that the whole system works end to end. Integration tests focus on the interfaces between different components; system tests cover entire applications end to end. Sometimes you may find that the boundary between the two is a bit blurry. Let’s just say that in addition to unit tests, which check that individual functions work, they check whether your components and your entire application work as expected.
In the interest of time, I am not going into the big wide world of user interface testing or A/B testing. There is a lot that can be said about those too - and plenty of tools to go with them.
For now, I want to focus a bit more on unit testing.
Writing unit tests is relatively straightforward these days. Python has built-in support for this through the unittest module, and there are other testing libraries and frameworks available as well.
These unit test frameworks are typically based on the concept of assertions. And the core function to do this is called - assert.
An assertion checks whether the statement passed to it holds; if it doesn’t, the assertion fails and the test is reported as failing. Assertions can compare different kinds of values - booleans, integers, floats or strings - and often more complex objects can be compared as well.
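Here is a sketch of how this looks in practice - the `mean` function is a made-up example, not something from the episode:

```python
def mean(values):
    # Made-up function under test
    return sum(values) / len(values)

# A passing assertion is silent...
assert mean([1, 2, 3]) == 2

# ...a failing one raises AssertionError, which a test runner
# reports as a test failure.
try:
    assert mean([1, 2, 3]) == 99, "mean returned an unexpected value"
except AssertionError as err:
    print(err)  # mean returned an unexpected value
```

In a test suite you would not catch the AssertionError yourself; the testing framework does that and collects the failures for you.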
A set of unit test functions is grouped together in a so-called Test Case. And in languages like Python, there is a base class called TestCase available, which you extend to write your own specific TestCase implementation. A set of Test Cases in turn is called a Test Suite.
More often than not you end up with at least one TestSuite containing several TestCase implementations each with their set of Unit test functions.
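A minimal sketch of this structure, using Python's built-in unittest module (`is_even` is an invented function for illustration):

```python
import unittest

def is_even(n):
    # Invented function under test
    return n % 2 == 0

class TestIsEven(unittest.TestCase):
    # A test case: a group of related unit tests
    def test_even_number(self):
        self.assertTrue(is_even(4))

    def test_odd_number(self):
        self.assertFalse(is_even(7))

# A suite collects test cases so they can be run together
suite = unittest.TestLoader().loadTestsFromTestCase(TestIsEven)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Normally you would let `python -m unittest` discover and run your tests rather than building the suite by hand; the explicit version just makes the TestCase/TestSuite relationship visible.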
There are a lot of other features too, such as setting up known values or state before a particular test runs. These setups are called fixtures, and they can be defined for individual unit test functions or for entire test cases or suites.
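In pytest, for example, a fixture is a decorated function whose return value gets injected into any test that names it as an argument. A minimal sketch, with invented data and an invented function under test:

```python
import pytest

@pytest.fixture
def sample_scores():
    # Fixture: fixed input data, rebuilt fresh for each test that requests it
    return {"alice": 3, "bob": 5}

def top_scorer(scores):
    # Invented function under test
    return max(scores, key=scores.get)

def test_top_scorer(sample_scores):
    # pytest matches the argument name to the fixture defined above
    assert top_scorer(sample_scores) == "bob"
```

Running `pytest` against this file executes `test_top_scorer` with the fixture's data supplied automatically.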
And then there is the ability to mimic certain functionalities, which is called mocking. Here’s an example: let’s say you want to write a unit test for a function that relies on data coming from a database. You don’t really want to spin up a database for a single unit test, or mess up your data tables with lots of test data. Instead, you would create a so-called mock object, which acts as a placeholder. It is set up to mimic the behaviour of the database - well, at least the behaviour you want to see in your unit test.
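With Python's built-in unittest.mock, that placeholder can look like this - the database client and `count_active_users` are invented for the example:

```python
from unittest.mock import Mock

def count_active_users(db):
    # Invented function under test: expects a database client object
    return sum(1 for row in db.fetch_users() if row["active"])

# The mock stands in for the real database and returns canned rows
fake_db = Mock()
fake_db.fetch_users.return_value = [
    {"name": "alice", "active": True},
    {"name": "bob", "active": False},
]

assert count_active_users(fake_db) == 1
# Mocks also record how they were used
fake_db.fetch_users.assert_called_once()
```

No database ever runs here: the mock simply answers the one call the function makes, and lets you verify that the call happened.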
Incidentally, the frameworks you use for unit tests can also be used for integration tests. In the episode notes you will find some of the popular testing frameworks used in Python. But in short summary: in addition to the built-in testing facilities of the Python standard library, you have frameworks such as pytest, which includes the monkeypatch fixture for mocking. Each of these frameworks comes with a rich set of documentation, tutorials and examples you can get your teeth into.
Ok, so far so good. In the next section I want to talk about how unit tests can be integrated with development. And what shall we do with legacy code?
[transition]
I said already that coding and testing go hand in hand and should start at the same time. In fact, there is a development method that goes even further than that. It’s called test driven development, or TDD for short. It’s a concept that goes back to an engineer called Kent Beck, who also developed an agile software methodology called extreme programming, or XP for short.
He argues that test driven development, TDD, doesn’t only make sure you have sufficient tests for your code - it also drives better software design.
Of course, when you start writing tests first, they will all fail - until you put some meat on the bone so to speak and have implemented the actual method.
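In miniature, that cycle might look like this (the `slugify` function is an invented example):

```python
import re

# Step 1 ("red"): write the test first. At this point slugify does
# not exist yet, so running the test fails.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"

# Step 2 ("green"): write the minimal implementation that passes.
def slugify(title):
    # Lowercase, collapse runs of non-alphanumerics into single hyphens
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# Step 3: re-run the test; with it passing, you can refactor safely.
test_slugify()
```

The point is the order: the test pins down the behaviour you want before a single line of the implementation exists.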
I have experienced test driven development in only one engineering team in my career. In all other teams we wrote unit tests, but usually in parallel with the actual method, rather than the test first.
Test driven development is hard, maybe because it feels counter intuitive. But it is also hard because more often than not, you are working on an existing code base. A code base that may already have been around for a while by the time you start working on it. A code base, that may or may not have tests available.
And that brings me to another interesting question regarding tests: what to do with legacy code. There is - lucky for us - quite a bit of literature around it, and I will touch briefly on it in my chat with Stef in a moment. At this stage all I can say is that retrofitting unit tests into existing code is a thankless task, and sometimes impossible, unless you first restructure the code without changing its behaviour - a practice known as ‘refactoring’. Which brings me to a book published in 1999 that has lost none of its value in modern software engineering: ‘Refactoring - Improving the Design of Existing Code’ by Martin Fowler. Martin Fowler also has an interesting blog on all things engineering that I encourage you to look at.
But whether you are starting a brand new project and decide on test driven development or you are working on a piece of legacy code - the message is clear: writing tests will drive writing better code and a better design.
Open source projects have caught on to that. And when you plan to contribute to one of them, you should make sure that your code is tested and testable - otherwise it may not be accepted.
There are also projects and journals that review and help publish open source research software, like pyOpenSci or the Journal of Open Source Software, called JOSS. In its submission guidelines, JOSS says that authors are strongly encouraged to include an automated test suite covering the core functionality of their software; if a software package is not testable or has no tests, it will be marked as not acceptable.
So, the sooner you get into the mode of writing tests as part of your development, the better.
The final question is this: what shall I do when I have written all these tests?
[transition]
Automation, automation, automation… Don’t let it depend on you remembering to kick off a test run. Integrate it into your software build process.
But if you do have to run tests manually, you can, of course. Unit tests written with any of the Python testing frameworks, whether the built-in unittest module or pytest, can be run from the command line.
But the true power of testing lies in automating the tests. And in particular, when you work in a team that uses repository services such as GitLab, BitBucket or GitHub, which is used at UCL.
We already talked a little bit about Git in previous sessions. In order to share your code changes with your team mates you will either push the changes to a branch on GitHub - or, more often than not, create a pull request to be reviewed by your colleagues.
The good news is, that GitHub allows you to run your unit tests each time you create a pull request or each time you push code to your GitHub repo. The feature for this is called GitHub Actions. There is a lot more you can do with GitHub actions, but let’s focus on tests right now.
You create and update GitHub Actions workflows directly in your repository on GitHub. A workflow is configured using a file in YAML format, which defines the jobs the workflow should execute. The machine that runs the jobs defined in your YAML file is called - a runner.
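A minimal workflow that runs pytest on every push and pull request might look roughly like this - the file name, action versions and Python version are illustrative, so check the GitHub Actions documentation for current ones:

```yaml
# .github/workflows/tests.yml (illustrative sketch)
name: tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pytest
      - run: pytest
```

Each `run` step is just a shell command, so anything you can run locally can run here too.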
The whole process of automating your software builds and testing is called Continuous Integration. Continuous, because the idea is, that you and your colleagues submit small changes often. And with automated testing at each code submission, you’ll catch anything wrong early.
Ok - this is a lot of talking about what an engineer SHOULD do. But how do engineers use testing in actual projects and in their daily work?
This brings me to my conversation with Stef, who’s been doing it for quite some time. As you will hear, you will find yourself in situations where you need to balance what you SHOULD do with what you CAN do. So, let me hand over to my conversation with Stef, now.
PETER Hi, Stef. Thanks for taking your time to talk to me today about testing in Python. But before we do that, could you quickly introduce yourself, please?
STEF I have been working at UCL, the Advanced Research Computing Centre and I work mostly in the health care space, moving a lot of healthcare data around
PETER And you do a lot of that with Python, or at least some of it in Python.
STEF Yeah, probably a decent amount for my sins. I’ve been doing a lot of Java in the group, but yeah, roughly, maybe like half of projects I worked on are in Python
PETER Well, I think testing is important for Java or any language, not just Python, of course. But how do you test your Python code, or what kind of strategies do you use for testing in general?
STEF I probably don’t follow as much the model of sort of trying to do unit tests for everything. Mostly, because I feel like if you then end up refactoring your unit, suddenly you’ve got to refactor all your tests. So I try to do much more effort on integration level tests when it’s not a huge performance hit to actually do that. But otherwise, almost always have used PyTest as a library for all of my tests.
PETER That’s unit tests with PyTest?
STEF And integration testing as well. Even sometimes sort of system level testing - we have a hilarious situation where we spin up different Docker containers and then run our tests to make sure an entire endpoint that goes through a couple of services gives you the results. But it’s a lot of work and we haven’t done that much.
PETER I think that’s quite an important point you make, that very often you end up with several Docker containers spinning up. It’s not just one application. And a test needs to cover end to end. Are there any particular tricks to look out for to make this happen?
STEF Yeah, I wish I had a really nice answer for you. Often that’s a really tricky part in a project. I have worked where we’ve had much more static data, so I would just assume that the database already has this and then we make this one action - and you end up having lots of different ways that you set up your database specifically for a test. But for real integration level testing across multiple systems end to end... yeah, I wish I had a nicer answer for you.
PETER Take us through what the differences are between a unit test and an integration test. I think we talked a little bit about it already, but maybe you can make it clearer for the listener.
STEF To me, a unit test is very much testing an individual function, right down to even private functions in Python; testing a really tiny bit of functionality, possibly the smallest unit of work that you can do, and you want to make sure that works. You can have that super tiny level, or you could only test, let’s say, the methods on a class that you expect a user to use, rather than digging into all the fine grained ones. And then integration testing is much more testing that different classes or different functions interact well with each other. With integration level testing, you want to make sure that parts of your system - which you hopefully already know work well on their own - actually give the expected results together. So it’s maybe a little bit less “it works technically” and more “it also does the right thing you want it to do when you’re actually using it”.
PETER Which also brings us to the subject of mocking. Very often, particularly for integration tests but also for unit tests, we may have to make some assumptions about the state of the system that we want to test. How do you approach that?
STEF Yeah. So I’ve used monkeypatch and mock within PyTest. A classic case would be if you’re trying to mock out what a frontend is going to be receiving. So you might mock out an API request: well, for this test you will always just return this, or each time I query it’ll give me this data. Because you don’t want to be making those calls, especially against a live API. I find managing those mocks quickly gets out of hand. Within PyTest you can have fixtures at the module level, so I’m only using these fixtures, which mock these things out, in this individual file or these sets of files - that can make separating out a bit of that easier. Whereas other times it just turns into spaghetti really quickly, because you just don’t know where things are coming from.
PETER I mean, testing is all very well, but who’s testing the tests in the end? The other question: you mentioned earlier that you don’t write tests for every single function or every single class. So what are your criteria to decide whether to write tests or not?
STEF Interesting point. Ideally you’d aim for very close to 100% test coverage, but there are some cases, especially at the pure integration level, where it’s really difficult to test - let’s say a web application written in Django. You can go through and make the user click on all these buttons, but actually it’s much easier to abstract all of the logic that’s not in the framework into a separate package. Then you test that package really well and maybe have much lighter testing across the entire web application. It’s often a pragmatic choice about how easy it is to test something and how important it is. For example, you might not want to test someone else’s library.
PETER Yeah. Which brings up another question, now that you mention other people’s libraries: very often we don’t really write code from scratch. We inherit some legacy code that may or may not have tests. So what’s your approach to that? Are there any tips you can give on how to deal with legacy code you’ve inherited that has no or very little testing, where you think, I really need to do that, but it costs too much?
STEF It’s a tricky one. The archetype I usually run into is a monolith that’s grown out of control - one file that’s 10,000 lines long is not easily testable in itself. I’m not terribly happy with any solution that I’ve come across, but I think trying to get something that’s roughly an end to end test, with expected input to expected output, is usually possible. After you’ve got that, then you’d maybe start breaking it out and try to get it to a structure where you could test a whole chunk of code at once, rather than the entire script end to end. I’d be very concerned trying to refactor code when it has no tests. On that aspect, it’s a really tricky catch-22, because if there were tests for the code, it would be a lot more testable and structured in a way that you can test it. If you’re not used to writing code that is testable, you can structure it in a very different way that just makes it really painful and arduous.
PETER Hmm. The message here clearly is that when you write new code, you need to have testing in mind. And there’s this test driven development approach that, quite frankly, I’ve never seen anybody apply. We all say, yeah, it makes sense to write the test before we actually write the code, but we usually do it the other way round.
STEF The times I’ve seen it work best is if you’re onboarding someone else to a project: oh, here’s some tests, I already know what I want it to do, so I’ve written you the tests and you can implement the code and get introduced to the code base. Yeah, sometimes I’m good and do it, but definitely not always. Well, I write the test, then I do the minimal implementation and do the classic red, green, refactor cycle on it.
PETER Okay. Finally, when we are running tests: there is obviously the approach of running them manually, but nowadays with GitHub and GitHub Actions we have the option of doing that automatically. Is that what you’re using, or are there any particular tips you can share with people?
STEF I think get continuous integration testing in as early as possible in your project. You could go, oh, I always run my tests, and yeah, I know what I’m working on - of course I understand this, you might have written the entire code base, how could you not know what it’s doing? And then you realise that you’ve essentially killed half of your functionality by accident, because you didn’t think this function was being used anywhere else. Other than that, GitHub Actions in our group has probably become the de facto place where all integration tests are run. But I think any of the platforms are basically fine. Just make sure that you set something up.
PETER Yeah. Is it actually easy to set up GitHub actions?
STEF I would say yes. There’s always the curse of knowledge, but I think there are a lot of really good tutorials out there, and if you can run the tests locally, you should be able to set it up to run in GitHub Actions. With roughly 20 lines of YAML configuration you should be able to get something that works. Debugging it when things go wrong can be a bit painful, but you can do a couple of things - you can even connect remotely to the runner while the action is running. And there’s always the classic print statements…
PETER Yeah, you’re right. You can actually go into GitHub and sort of see what’s happening there. Okay, Stef, very interesting. Thanks very much for your time. All the best with your programming and see you soon.
STEF Thanks very much.
Well, I hope you’ll find this session on testing useful. As I said there are plenty of frameworks, tools and tutorials out there. But in the end there are two points to remember:
- start testing, RIGHT NOW. Don’t put it off until later
- don’t rely on running tests manually. Automate as much as you can - like with GitHub Actions
Better testing, better code.