UCL for Code in Research
The companion podcast for courses on programming from the Advanced Research Computing Centre of University College London, UK.
9/9 Research Software Engineering with Python (COMP233) - Performance
In this last episode of the course, I talk to Itamar Turner-Trauring, who created the website PythonSpeed and has spent considerable time finding ways to make Python code faster and more efficient. Python and its ecosystem also offer great tools for measuring performance.
Links:
- https://pythonspeed.com a set of articles and recommendations on how to improve the performance of your Python code
- https://blog.sentry.io/python-performance-testing-a-comprehensive-guide/ a general blog post on performance testing
- https://uwpce-pythoncert.github.io/SystemDevelopment/profiling.html
- https://en.wikipedia.org/wiki/Computer_performance
- https://python-102.readthedocs.io/en/latest/performance.html
- https://docs.python.org/3/tutorial/datastructures.html
- https://www.green-algorithms.org
- https://doi.org/10.1145/356635.356640 Donald Knuth's paper containing the famous remark on premature optimisation
- https://wiki.python.org/moin/TimeComplexity
- https://blog.jetbrains.com/dataspell/2023/08/polars-vs-pandas-what-s-the-difference/ comparing Polars with Pandas
Profiling tools
- https://pyinstrument.readthedocs.io/en/latest/
- https://docs.python.org/3/library/profile.html
- https://docs.python.org/3/library/time.html the time module in Python
- https://docs.python.org/3/library/timeit.html the timeit module for measuring execution time in Python
- https://jiffyclub.github.io/snakeviz/ a graphical viewer for profiling output
- https://bloomberg.github.io/memray/ flexible memory profiler
- https://github.com/benfred/py-spy
- https://github.com/P403n1x87/austin-python The Python wrapper for the Austin profiler
This podcast is brought to you by the Advanced Research Computing Centre of University College London, UK.
Producer and Host: Peter Schmidt
In this last episode of this course I want to focus on the subject of - performance.
[sound effect]
My guest is Itamar Turner-Trauring from the US, who is the author of a website called PythonSpeed and who will indeed be publishing on this very subject of performance some time next year.
Performance in computing can be a number of things:
- you might think of performance in terms of speed. That is, the time it takes for a computer program to run.
- it could also mean how much memory and resources, like the number of processors, a program uses
- and in machine learning and data science, performance often means how closely the data predicted by a model match the actual data.
- finally, it also refers to the environmental costs of running an application.
Improving performance and optimising an application is not a question of convenience, let alone luxury. All computer programs are constrained by the limits of the machines they run on, whether this is your private laptop, a server in the cloud or a supercomputer in your research centre or university. They are also constrained by the time you have available and your ability to complete a programming task. And, as I mentioned just earlier, let’s not forget that computer programs need energy. Particularly for large-scale applications, cloud services or programs that run on supercomputers, the energy costs and carbon footprint can be considerable.
In short, there are pretty good incentives to make your program run faster and with fewer resources.
So, when should we start thinking about making our apps run faster with fewer resources? There are some who would quote Donald Knuth, the well-known computer scientist. In 1974, he published a paper called “Structured Programming with go to Statements”. On page 8 you will find the following paragraph:
“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”
The reason I bring up this quote is that I’ve seen it on some websites that cover performance, like one of the links listed in the episode notes. And Donald Knuth has a point: you should first improve the code that has the biggest impact on performance. And before you do any optimisation at all, you should think about what problem you’re trying to solve.
But premature optimisation doesn’t mean that you shouldn’t think about writing faster and better code right from the beginning. In fact, it is a pretty good idea to do so.
In my conversation with Itamar a bit later, he talks about different layers when it comes to improving performance. Which is something I’ll be turning to next.
[transition]
In the previous episode I talked about code refactoring principles and software design patterns. And arguably, they are the starting point for writing code that is fast and uses resources efficiently. Questions about software design and architecture include things like: am I going to use a third-party library or package, or shall I write the code myself? And if I go for an external package, which one should I use?
Let’s say your program reads in data from a CSV file. It’s relatively straightforward to write code yourself that reads lines of text from the file and turns them into an array.
But there are tons of packages out there that do the same thing, like, for instance, Pandas or NumPy. And they do it far more efficiently than you would be able to in pure Python.
NumPy and Pandas achieve better performance because, under the hood, they are written in C and C++. Compiled languages can be optimised in different and often better ways than interpreted languages like Python.
Which means you do well not to reinvent the wheel when something already exists that is proven to work very well. Spare yourself the trouble!
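To make this concrete, here is a minimal sketch of the two approaches. The file name measurements.csv is just a placeholder for this example.

    import csv
    import pandas as pd

    # Hand-rolled: read each line yourself and split it into fields.
    def read_csv_manually(path):
        with open(path, newline="") as f:
            return [row for row in csv.reader(f)]

    # With pandas: a single call, and the parsing happens in optimised C code.
    rows = read_csv_manually("measurements.csv")   # list of lists of strings
    frame = pd.read_csv("measurements.csv")        # typed DataFrame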
But not all packages are created equal. Sometimes it pays off to compare them and find out which one works better for your specific use case. This can give your program an extra boost.
For instance, in a previous episode on data handling, my interview partner Nick Radcliffe pointed out that the Python package Polars can perform better than Pandas when it comes to machine learning and data science operations. This is down to some optimisations built into Polars, for instance the so-called lazy execution of queries.
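As a rough illustration of what lazy execution looks like: the file and column names below are made up for the example, and in older Polars versions the group_by method was spelled groupby.

    import polars as pl

    # Eager: read the whole file first, then filter.
    eager = pl.read_csv("measurements.csv").filter(pl.col("value") > 0)

    # Lazy: scan_csv only builds a query plan; Polars can optimise it, for
    # example by pushing the filter down into the read, and nothing is
    # executed until .collect() is called.
    lazy = (
        pl.scan_csv("measurements.csv")
        .filter(pl.col("value") > 0)
        .group_by("sensor")
        .agg(pl.col("value").mean())
        .collect()
    )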
One of the reasons Python has become the go-to language in scientific programming is that it comes with a huge set of libraries and packages. Chances are that a solution to the problem you are trying to solve has already been made available as a Python package.
But software design and picking a good 3rd party library is just one layer of producing performant code.
The other aspect is picking the right algorithms and data structures in your Python code.
[transition]
In my conversation with Itamar in a moment, he makes a great point: the important thing about how well an algorithm performs is how it scales.
Let’s take, for instance, a simple for loop in Python. For a single loop, the execution time grows linearly with the size of the range you loop over. But now take a nested loop, for instance when you read in or write out a matrix. There the execution time is the product of the number of items in the inner loop and the number of items in the outer loop. That is no longer linear growth, but quadratic growth.
In fact, there is a notation that expresses how algorithms scale, called Big O, usually written as a capital O followed by an expression in brackets. O(n), for instance, stands for a linear algorithm: the time a function takes grows proportionally with the size of its input. Similarly, O(n²) stands for quadratic growth in execution time.
The episode notes contain a link that lists the Big O values for a number of common Python operations, and I encourage you to take a look at it to get a feel for how much time typical Python operations take.
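As a small illustration of why those values matter, compare looking up an element in a list, which is O(n), with looking it up in a set, which is O(1) on average:

    import timeit

    n = 1_000_000
    as_list = list(range(n))
    as_set = set(as_list)

    # O(n): Python scans the list element by element.
    print(timeit.timeit(lambda: n - 1 in as_list, number=100))

    # O(1) on average: a single hash lookup.
    print(timeit.timeit(lambda: n - 1 in as_set, number=100))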
Going back to loops: they are pretty basic operations and used frequently. But how you use them makes a huge difference. An explicit for loop is usually not the most efficient way to build up a list; often the same thing can be expressed in a single line with a technique called ‘list comprehension’.
Let’s say you want to create a list of 10 elements, where each value is the square of its index. Using a for loop, you would start by declaring an empty list, then write a for loop over range(10) and, inside it, append the square of each index to the list. With a list comprehension you can do all of that in one line, and it is more efficient.
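In code, the two versions just described look like this:

    # With an explicit for loop: declare an empty list, then append in each pass.
    squares = []
    for i in range(10):
        squares.append(i ** 2)

    # With a list comprehension: the same result in a single line, and usually
    # faster, because the looping happens inside the interpreter's optimised
    # machinery.
    squares = [i ** 2 for i in range(10)]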
Another, and sometimes far more effective, way to increase the performance of your code is - and perhaps you can guess what I am going to say - to use packages like NumPy, Pandas or Polars.
In many cases their algorithms are better optimised than what you would be able to write in pure Python. In fact, Robin Wilson, my guest from the episode on Python, pointed out that if you want to harness the full performance of Pandas and NumPy, you must use the functions they provide.
The same, by the way, holds for accessing data in a memory-efficient way. Again, packages like Pandas and Polars offer optimised approaches that are hard to beat.
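As a minimal sketch of what “using the functions they provide” means in practice, compare a Python-level loop over a NumPy array with a single vectorised operation:

    import numpy as np

    values = np.arange(1_000_000)

    # Slow: a Python loop touches every element one at a time.
    doubled = np.empty_like(values)
    for i in range(len(values)):
        doubled[i] = values[i] * 2

    # Fast: one vectorised expression, executed in NumPy's compiled C code.
    doubled = values * 2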
So now that we have talked about two layers, architecture and algorithms, how can we be sure that our code actually performs better?
And this takes us to measuring performance.
[transition]
This is the part where I will finally hand over to Itamar. But let’s just say that Python has a number of built-in tools that let you measure at least the execution time of your code, such as the time and timeit modules.
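A quick, minimal way to time a snippet with both modules might look like this:

    import time
    import timeit

    # time.perf_counter is a high-resolution clock for manual timing.
    start = time.perf_counter()
    total = sum(i ** 2 for i in range(100_000))
    print(f"took {time.perf_counter() - start:.4f} seconds")

    # timeit runs the snippet many times, which smooths out noise from
    # other processes on the machine.
    print(timeit.timeit("sum(i ** 2 for i in range(100_000))", number=100))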
Measuring time is often not enough, though. You also want to check how often certain parts of the code are called and how much time is spent on each call. For that you need something called a ‘profiler’.
Python comes with a built-in profiler called cProfile (note that the P in cProfile is capitalised).
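A minimal example of running a function under cProfile could look like this; the simulate function is just a stand-in for your own code.

    import cProfile

    def simulate():
        return sum(i ** 2 for i in range(1_000_000))

    # Prints, for every function involved, how often it was called and how
    # much time was spent in it.
    cProfile.run("simulate()")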
Itamar, however, points out a number of other tools, including profilers for memory usage. I found Itamar through a website he created, which is dedicated to the performance of Python code. How he got to that point is an interesting story, as he explains in the following conversation.
[Interview with Itamar ]
ITAMAR Hi. My name is Itamar. I have been writing software for quite a while now, and in recent years, I’ve been doing a lot of research and writing about Python performance, Python memory usage, mostly from the perspective of data processing. So scientific computing, data science, that sort of thing.
PETER Thank you. Well, that’s right down our street, because scientific computing is exactly what this course is about. So why is it important that we actually worry about it? Maybe we can sort of warm up with that a little bit.
ITAMAR Okay, so I can start from, like, my experience, which is more in the commercial world. I was working for a company building a spatial gene sequencing device. So you can see which genes are expressed, not just in terms of the tissue or culture sample, but where they were in the sample, so you could say this cell had this gene expression. There was a microscope, and it took a series of pictures at different frequencies, and I was working on combining those and outputting the actual sequences. So it was image processing for scientific computing. It was a commercial company, and at one point I was basically prototyping this based on the algorithms that an applied mathematician and a biologist had developed, and I was just making it work, and it wasn’t very fast. It was running in the cloud. The idea was, this is going to be a commercial service, so people would pay the company to do the sequencing. And so I went and, you know, figured out for management how much we expected to get customers to pay for sequencing something. And then I did the maths on how much cloud computing resources we were using. And we were going to be spending something like 70% of our revenue on cloud computing, based on the code I’d written. And that’s bad.
PETER And it can be quite expensive.
ITAMAR That is not a viable company if that’s how it’s going to happen. And so I spent, not that much time actually, making it faster. I got it, I don’t know, ten times faster, and cut the memory usage massively. And this was a few years ago; these days I could do a lot better than that. So, you know, just in terms of making this affordable, that was really important. But also just in terms of speed: the speed of your feedback loop is sort of critical for your ability to experiment and develop whatever you’re working on. And if it takes
PETER Indeed.
ITAMAR you, like, an hour to run your software, and then you discover you have a crash, then you can do like five, six iterations a day, and that’s it. If it takes you two days to find a problem, it’s, you know, two iterations a week kind of thing. Get it down to a minute, and now your feedback loop is much tighter, and that means getting things working is much faster, but also doing more experiments is much faster. So basically, the faster your software, the better your feedback loop is, and the easier it is for you to try different things and find problems. That’s speed. In terms of memory usage, if your memory usage is too high, then you hit the situation where your computer starts writing everything to disk and everything grinds to a halt, and then either your program is killed or it just takes effectively forever to run. There’s a sort of catastrophic point where your software basically cannot run with too much memory usage. Speed is like a continuum: it could be slow or fast, and if it’s very slow, it will still finish eventually. With memory, there’s more of a hard cut-off, where once you’re past a certain point, you cannot run your program. One approach people have is to say, okay, we’ll just run this thing on multiple computers, bringing in a whole bunch of complexity. And it turns out that with the right techniques that’s not always necessary. People who do, like, earth sciences, the data they get is just immense, and so they have no choice but to use very large clusters in many cases. But there are many domains where you just buy a computer with 64 gigabytes of RAM, and then you write the software the right way, and you can process really large amounts of data without running out of memory, if you do it right.
PETER What do you look for when it comes to making your code faster? What are the common techniques that you could recommend people look out for? I mean, it’s probably a very wide field, I appreciate that. It’s like asking how long is a piece of string. But are there some common practices that people could use?
ITAMAR Yeah. So you can think of it in terms of different layers of how you write your software. The first layer is choosing an architecture. It’s about thinking about how the problem works and how you make it fast. And if you’re writing your own processing, you won’t necessarily be writing a whole framework like Polars, but you might be choosing which libraries to use. And so understanding what possibilities they give you for performance…
And the second layer is choosing a good algorithm and data structure. And here measurement isn’t necessarily quite as relevant, because an algorithm is less about how slow or fast it is for a particular input and more about how it scales. If your algorithm scales with n squared, where your input is of size n, then it doesn’t matter how fast the implementation is: as soon as your data gets a little bit bigger, it’s going to get much slower. And that tends to overwhelm how fast the implementation is. So if you have an n squared algorithm that’s written in a really fast language like C or Rust or Fortran, and you have a linear algorithm in Python, then even for inputs that are not that large, Python will be faster just because the algorithm scales better.
So you’ve picked your architecture and your algorithm, and at this point you start thinking about, okay, how am I going to make things faster at a more granular level? Depending on where you’re optimizing, there are different tools. Py-Spy is a sampling profiler, so it will take samples as the program is running and just see where the time is spent. Austin is another profiler, similar to Py-Spy; I find sometimes one works and the other doesn’t. Both of them are designed to run on a full program. So if you want to measure something smaller, like just a function, PyInstrument is another profiler that allows you to profile only a specific function.
And all of these won’t be that great if you have a lot of concurrency and you want to understand what the different threads are doing, because the visualization they use doesn’t really visualize concurrency very well. Depending on the options and which one you’re using, they will let you see this code was in this thread and that code was in that thread, but you won’t see what was running at the same time. There is another profiler [name unclear in the recording] that has rather more overhead, because it traces every single function you run, but it gives you more of a timeline, so you can see more of this concurrency. And I have created a profiler called Sciagraph, S-c-i-a-graph, which gives you a timeline as well; it’s more of a sampling profiler than a tracer. And then memory profiling is a whole other topic…
PETER Yeah, exactly. And I wanted to get to that, because we’ve talked about speed… And I find it quite interesting that you haven’t mentioned cProfile, which is the one that actually comes built in with Python.
ITAMAR So cProfile adds a bunch of overhead, which can be more of a problem if you have lots of little Python functions. If all the heavy lifting is done in something like NumPy or Pandas, it’s less of a problem. And also the built-in visualization just isn’t quite good enough. There is a third-party tool I’ve tried that puts the visualization in a web UI, and it fell over once you had enough data. And there are scripts that will turn the cProfile output into other formats. So yeah, cProfile you can use. But the other nice thing about Py-Spy or Austin is that they can give you native code stacks too: they can give you stack traces of the C code or Rust code as well. Depending on the code you’re writing, it can be helpful to have a combination of both. And cProfile can’t do that; it’s just the Python layer.
PETER So memory. That’s a whole different ball game, isn’t it? And maybe we can finish off with that.
ITAMAR Memory is a bit trickier, because different kinds of applications have very different memory issues. A lot of Python developers use Python for web applications, and for a web application the kind of memory problem you typically have is a memory leak: every time you get a request, some memory somehow doesn’t get freed. And so you end up with this slow slope, where memory usage just creeps up and up and up over time, until you use too much memory, your program gets killed, memory drops back down, and then the pattern repeats.
Whereas in scientific computing and data processing, the reason you often have memory issues is that you’re loading a lot of data, or you’re copying an array by mistake and you don’t realize it, or you’re keeping an array around for, like, 100 milliseconds, but in that moment you’re using 30 or 50 gigabytes of RAM. It’s very easy to have these really huge spikes in memory usage. But many memory profiling tools for Python are designed more for the first use case, for memory leaks. memory_profiler is the rather bad name of one memory profiler for Python. It shows you, line by line, the difference between the memory at the beginning of the line and at the end of the line. And the problem is: say a line calls a function, and the memory is the same at the beginning and at the end of that function. Great. But it may be that inside that function it allocated three gigabytes and then freed it, and that was never seen. So for scientific computing, the kind of memory profiler you want is one that shows you peak memory. It finds the point where you had the most memory, which might have been only a very brief moment in your application, but that moment is the bottleneck that means your program will never finish. So you want something that finds that peak memory and tells you the sources of the memory allocations there. The most flexible profiler here is Memray, M-e-m-r-a-y, and I believe it works with your native code as well, giving you those stack traces, and it has a variety of options for working with the UI. That’s a useful tool, and it gives you this view. Before Memray was written, I wrote my own memory profiler called Fil, F-i-l, and Sciagraph is the newer one I worked on, which also does memory profiling. Both of them, I would say, have an easier-to-use interface than Memray, but Memray is more general purpose and more powerful.
PETER You yourself, I mean, you already alluded to the kinds of tools that you’re producing, like Sciagraph and Fil. But you also created this website, PythonSpeed. And I was just interested in how you got into that, because you mentioned that you worked for a company earlier. How did that start?
ITAMAR I got interested in scientific computing, and so I took this job at a local biotech company. And I had this experience, as I said, with optimizing the software, and I just found that the tooling was not very good. At the time there were no memory profilers for Python at all that would give you the information you needed, so I had to do it the hard way, which was not fun. And just the experience of optimizing the code, and sort of learning how to optimize it, was something I enjoyed a lot. So I ended up deciding to spend a bunch of time researching and learning more about the topic, started writing articles and building tools, and it just accumulated over the years…
[interview end]
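To give you a rough feel for how one of the profilers Itamar mentioned is used, here is a minimal sketch with PyInstrument’s Python API. The process_data function is just a stand-in for your own code, and the exact API is documented at the PyInstrument link in the notes.

    from pyinstrument import Profiler

    def process_data():
        return sorted(str(i) for i in range(500_000))

    profiler = Profiler()
    profiler.start()
    process_data()
    profiler.stop()

    # Prints a call tree showing where the time was spent.
    profiler.print()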
As per usual, you will find a number of relevant links on the subject of performance in the episode notes, including links to the tools Itamar mentioned, like Py-Spy, Austin and Memray, as well as his own Fil and Sciagraph.
Well, this is it. I hope you enjoyed the course on Research Software Engineering with Python, and I hope you found the episodes for each class useful. If you haven’t done so yet, it would be great to get your feedback on what you think worked well and what you think we should do better next time.
I would also like to take this opportunity to thank David, Will and the team from UCL for their great support. And with that, good-bye.