UCL for Code in Research
The companion podcast for courses on programming from the Advanced Research Computing Centre of University College London, UK.
3/9 Research Software Engineering with Python (Comp233) - Introduction to Python
Python is one of the most widely used programming languages in research and science. How did it come to that? And what makes Python special? It's something I discuss with my guest in this episode, Robin Wilson, who also takes us through some of the less favourable or more complex aspects of the language.
Links
- http://www.rtwilson.com Robin Wilson
- https://blog.rtwilson.com Robin's blog posts
- https://www.python.org
- https://anaconda.org Anaconda Python distribution
- https://jupyter.org all things Jupyter
- https://ipython.org IPython
- https://inference-review.com/article/the-origins-of-python
- https://en.wikipedia.org/wiki/Literate_programming
Python libraries for science
This podcast is brought to you by the Advanced Research Computing Centre of the University College London, UK.
Producer and Host: Peter Schmidt
PETER And now for something completely Python…
[Music]
Recognise the tune? No? Well, to be fair, you’d have to step back into the mists of time. To the end of the 1960s to be precise. The time when a British comedy troupe called Monty Python aired a TV programme called ‘Flying Circus’, whose opening jingle you’ve just heard.
So what on earth is the link between that and programming?
Between Christmas and New Year 1989/1990 a keen Dutch programmer called Guido van Rossum presented a new programming language to his colleagues. Not knowing what to call it he turned to his favourite comedy show for inspiration. And so, in addition to an iconic TV show filled with black and at times surreal humour we now have a programming language named after it as well: Python.
Maybe that was an omen. Because, like the Monty Python TV programme, Python, the programming language, has become very popular indeed. Since the 1990s it has become the default and go-to programming language for many scientists and researchers and is - more or less - the lingua franca for all things around data science, machine learning and artificial intelligence.
For this episode I invited Robin Wilson, an experienced Python engineer, and we talk about why it has become so popular. Without giving too much away, a lot of its popularity has to do with the fact that Python is relatively easy to learn. So much so that Robin will later on refer to it as “syntactic sugar”.
Syntactic sugar aside, the other reason Python is so attractive to users is the fact that there is such a vast set of tools, libraries and development environments available. And it ranges from libraries used for data processing such as numpy or pandas, machine learning libraries such as scikit-learn, science packages like scipy but also libraries for web apps such as flask or django.
For instance, in a medical imaging app my colleagues and I developed a couple of years ago, we used a bit of everything: medical imaging and data libraries, libraries for image processing and conversion, database management as well as flask for serving a web app.
Python itself is Open Source and the official site for this is Python dot org, where you can also download Python to your computer and find a rich set of documentation and tutorials. Python is extremely flexible and runs on a range of different hardware and operating systems including Linux distributions, macOS and Windows, on individual computers as well as parallel computing platforms. Python dot org is not the only one providing you with a Python version. There are other distributors as well. A very popular choice is Anaconda dot com, or simply conda. It’s a commercial product, but has free downloads and some free features for individual users.
Talking about Python distributions: One thing to look out for is the version you are using on your machine. At the time of recording, in 2023, the default version in Python distributions is based on the major release version 3. Specifically, the latest release of Python in October 2023 is 3 point 12. But that doesn’t mean that all Python distributions will support that. For instance, at this moment in time, the latest version of the conda distribution is 3 point 11.
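A quick way to find out which version you are actually running is to ask the interpreter itself. The following is a minimal sketch rather than anything from the episode; the minimum version `(3, 8)` and the helper name `meets_minimum` are made up for illustration.

```python
import sys

# sys.version_info is a named tuple: (major, minor, micro, releaselevel, serial)
print(f"Running Python {sys.version_info.major}.{sys.version_info.minor}")

# Hypothetical minimum version for this example; choose whatever your project needs.
MINIMUM = (3, 8)


def meets_minimum(version=sys.version_info, minimum=MINIMUM):
    """Return True if the (major, minor) part of `version` is at least `minimum`."""
    return tuple(version[:2]) >= tuple(minimum)


if meets_minimum():
    print("This interpreter is new enough for the example.")
else:
    print("This interpreter is older than the example assumes.")
```

On the command line, `python --version` reports the same information.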
Generally, this shouldn’t cause any problems and within the Python 3 release there should be enough backward compatibility to ensure your existing code still runs on newer versions.
But version management of your code gets a little bit more complicated when you include third party libraries and packages. Some may support features that may not be available with the Python version installed on your computer. Or they may use older Python versions that are no longer supported. A good example are libraries based on Python 2 releases. There are significant differences between Python 2 and 3, which means it’s unlikely you can run Python 2 code with a Python 3 interpreter. Lucky for us, most packages these days will be based on Python 3. But, as I said, it can still be tricky to manage different versions of your code and that of your dependencies even within version 3.
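To give a flavour of the differences between Python 2 and 3, here is a small sketch that runs under Python 3, with comments describing what Python 2 did instead. It only covers three well-known changes and is not a complete list.

```python
# 1. Division of two integers.
half = 1 / 2         # Python 3: true division. Python 2 gave 0 for two ints.
floor_half = 1 // 2  # Floor division behaves the same in both versions.

# 2. print is a function in Python 3.
print("hello")       # Python 2 used a statement: print "hello"

# 3. Text and bytes are separate types in Python 3.
text = "caf\u00e9"            # str holds Unicode text ("café")
data = text.encode("utf-8")   # bytes holds the encoded form

print(half, floor_half, len(text), len(data))
```

The second line of output shows four characters of text but five bytes of data, because the accented letter needs two bytes in UTF-8.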
And for that reason, Python has the concept of so-called virtual environments. This enables you to run different versions of Python loaded with different packages or dependencies independent from each other. Robin and I will talk about virtual environments and Python distributions a bit later.
I talked about Python as a programming language. But there is an important difference between Python and other languages such as Fortran, C and C++. Python is an interpreted language. What does that mean? With languages like Fortran, C or C++ you have to translate your code into a binary format first. Only then can you run the program - as a binary executable. The process for this is called compilation and linking - to bring in other libraries and packages. In short, there is an extra step or two between writing and running your code.
No such thing is necessary with Python. You run the code as is. It is the Python interpreter that will take care of reading your code and interpreting it into something your machine can run.
The Python interpreter allows you to run Python programmes in different ways. And it is this flexibility that adds to the strength of Python.
You start the Python interpreter with the python command in your application or terminal. The python command comes with different options that make execution of Python code very flexible. Often, the default command for the interpreter is simply called python. But this is often just a placeholder name which is linked to the actual python command installed. And that command includes the version string in its name. So for instance your actual python program may be called python 3 dot 12. This allows you to install several versions of Python. Which in turn enables the use of the virtual environments I mentioned earlier.
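As a rough illustration of those options, the sketch below starts the interpreter from within Python itself, using the standard `subprocess` module. The helper `run_python` is a made-up name for this example; `--version` and `-c` are real options of the python command.

```python
import subprocess
import sys

# sys.executable is the full path of the interpreter running this script.
# It often ends in a versioned name such as python3.12.
print(sys.executable)


def run_python(*args):
    """Run the current interpreter with the given arguments and return its output."""
    result = subprocess.run(
        [sys.executable, *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Same as typing `python --version` in a terminal.
version_line = run_python("--version")

# Same as `python -c "print(6 * 7)"`: run a short piece of code given as text.
answer = run_python("-c", "print(6 * 7)")

print(version_line, answer)
```

Other common ways to start the interpreter are `python script.py` to run a file, `python -m module` to run a module as a script, and plain `python` for an interactive session.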
Before I hand over to Robin Wilson there is one final thing I’d like to mention. You can’t really talk about Python without mentioning Jupyter and in particular Jupyter notebook. And often you will hear people talk about it in the same breath - to the point that you might think, Jupyter is just an extension of Python that allows you to write Python code in a web browser.
In fact, Jupyter is much more than that. It started with an extension of the Python interpreter and shell called IPython in 2001. Fernando Perez and others developed it and started to create something called an IPython notebook. Further development led to a separation of IPython into what we now call Jupyter in 2014. Which makes a lot of sense, because Jupyter does not only support Python as a programming language. The hint is in the name. It was initially based on three languages: Julia, Python and R. Ju Pyt R. More languages have been added since and you can now code in C++ and Fortran if you wish.
Having said that, Python does play a key role in Jupyter and in particular an application inside Jupyter called Jupyter notebooks. More than just Python code, Jupyter notebooks allow you to add rich metadata and documentation. Which has made Jupyter notebooks very popular in the classroom as interactive course material. And perhaps you have used Jupyter notebooks yourself. But they go even further than that: the American Geophysical Union AGU announced a new initiative called Notebooks Now in 2023. The aim is to work with scientific publishers to encourage and offer researchers different ways to publish and share their research results using digital and interactive notebooks, like Jupyter notebook. Like Python itself, Jupyter has become a huge community worldwide with its own conference and rich set of tools and applications.
Some online courses and online applications provide Jupyter as an online service. But you can also install it locally on your machine as a Python package.
I have been waxing lyrical about Jupyter. Which brings me to my interview partner Robin Wilson, who issues a note of caution when it comes to Jupyter notebooks.
So let’s hand over to my conversation with Robin.
PETER Hi, Robin. Thanks very much for your time today to talk about Python. But first, let’s start with some introductions from yourself.
ROBIN Yeah. Hi. So I’m Robin Wilson. I’m currently a freelance geospatial software engineer and data scientist. I’ve got a background in academia. I did a PhD in satellite imaging. I then worked in academia for a while. I now work, as I said, freelance for a range of clients, from tiny two-person companies to multinational corporations with tens of thousands or hundreds of thousands of people, doing stuff mostly in Python, and mostly to do with geographic data of some sort or other.
PETER How did you get into Python in the first place?
ROBIN I did programming from quite a young age. I started with BASIC when I was really quite young, and then through my teenage years and through doing my A-levels I did things like Visual Basic. When I came to university, I think I picked up Python a little bit. I did geography at university and we weren’t taught programming as part of the course, but I picked up a little bit of that, and then it was really in my Master’s and my PhD that I really got into Python. Before that, I’d been using a language called IDL, which is a bit unusual these days. It’s sort of a Fortran-like language, I guess, and it’s used quite a bit in the physics and astronomy community. But then I realised that Python could do nearly all of that and more, and has a wide range of libraries and so on. So I moved over to Python and that’s what I spend 90% of my time in these days.
PETER Is that Python version two that you started using then?
ROBIN Yeah. I would have started with version two. Yeah. 2.7, I think, was what I started with many years ago. But now I’m entirely Python three. Although the transition was rather painful, I must admit.
PETER I wanted to talk a little bit about that transition, because I think there was quite a seismic shift between Python two and Python three.
ROBIN Yes. And it was particularly tricky for scientific software
PETER Why?
ROBIN When you have scientific software in Python, it’s all built on a massive hierarchy of Python modules. Sitting at the base is Python itself, and then you’ve got modules like numpy, scipy and matplotlib, which are kind of the key scientific Python libraries for handling arrays and doing basic operations and things. Everyone had to wait for those to get upgraded to Python three. And then when those were upgraded to Python three, you then had to wait for the next level of modules and the next level of modules. And you have so many dependencies built upon dependencies that if you’re using some kind of satellite imaging library that’s got various functions that might be useful to you, you find it’s based on about eight of these different libraries and you’ve got to wait for all of them to be ready for Python three before you can get that final one ready for Python three. So it was kind of an iterative process and took a while.
PETER So it’s like a daisy chain of events, basically.
ROBIN Exactly. I stayed longer on Python two than probably people who are doing things like web development did, because with the web development libs, I think there’s less of that building on top of things. You know, you’re using Flask; once Flask is upgraded to work with Python three, all the rest of the stuff slots in fairly easily. Yeah.
PETER I think nowadays it’s Python version three that people would be using and certainly learning from scratch. So what do you think is the attraction of Python and why is it used so widely? Because it is used quite widely, particularly in scientific software and data science in particular.
ROBIN Yes, there are some benefits to the Python language itself. It’s easy to learn. It looks a bit like executable pseudocode, really. It’s quite nicely structured, quite simple and so on. But the real reason it’s used in so many different places, I think, is the breadth of libraries available. Python was really picked up by scientists initially, before data science was really a thing, and people built these foundational modules like numpy, scipy, matplotlib and so on. Then people brought in things like pandas and other libs like that, and it really took off in data science and scientific data processing. But there are a lot of other areas in which Python is used as well. It’s very commonly used for web backends. Nowadays something like FastAPI is very common for that, a Python library designed for doing API backends. There’s also things like Flask and Django, and so it’s quite widely used in that as well. And I find when I’m using Python, I use it for a whole range of things. Most of my work is writing data science code or data processing of some sort. I’ve also written APIs using FastAPI, which link in to the data science code that I’m doing. So you have an API you call that does a bit of processing for you. I use it for writing simple little system scripts, running things on my server. I use it for, you know, a huge range of things. And that’s really one of its benefits: it’s very, very versatile. And with the wide range of libs that are available, you can turn it to almost any task.
PETER Python is a scripting language, and scripting languages often have the reputation of being rather slow in terms of performance. What do you think: is Python underperforming in certain aspects?
ROBIN Right. That is a common view, and it is true in some ways, in that pure Python code is going to be slower than running the equivalent code in C or something. However, there are a number of factors that mean that in the real world that isn’t so much of a problem. One of those is about developer speed versus running speed. Often the majority of your time is spent writing the code. It doesn’t matter if it takes a second or two longer to run. If the alternative is going to take you weeks longer to write, it’s still a worthwhile proposition. That’s one of the benefits of Python: it is quite quick to write. There’s a lot of nice syntactic sugar, nice easy-to-use language aspects that really help you write code fast. So that’s one side of it. The other side is that there’s been a lot of work to easily integrate Python with compiled languages of various kinds: C, C++, Fortran and so on. So you’ll find that most of the foundational scientific Python libraries actually use C code under the hood for the expensive, time-consuming operations. NumPy does a lot of C stuff for all of its arrays. You’ll find pandas does a lot of C stuff for various manipulations, so those bits are all fast. But they’re easy to call from an interpreted scripting language like Python, and you’ll find that it’s all completely transparent to the users. Most users of pandas don’t know that behind the scenes it’s using C or Cython, which is kind of a hybrid of C and Python. You don’t really have to worry about it. It just works, and gets you the full speed with still the user-friendliness of Python.
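The effect Robin describes can be seen even without installing NumPy or pandas. The sketch below, which is not from the episode, assumes the standard CPython interpreter, where the built-in `sum` is implemented in C. It times that against a hand-written loop doing the same work; `python_sum` is a made-up helper name, and the actual timings will vary from machine to machine.

```python
import timeit


def python_sum(values):
    """Add up the values with an explicit loop executed by the interpreter."""
    total = 0
    for value in values:
        total += value
    return total


numbers = list(range(100_000))

# Time 20 repetitions of each approach.
loop_time = timeit.timeit(lambda: python_sum(numbers), number=20)
builtin_time = timeit.timeit(lambda: sum(numbers), number=20)

print(f"pure Python loop: {loop_time:.4f} s")
print(f"built-in sum:     {builtin_time:.4f} s")
```

Both approaches return the same result, and the calling code looks equally simple. That is the same transparency Robin mentions for the C code inside NumPy and pandas.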
PETER Where do IPython, Jupyter and Jupyter notebooks sit in this universe? Because a number of people get introduced to Python, for instance, through
ROBIN Yes. So Jupyter notebooks are great and I use them a lot. I do have some concerns about them, particularly if you’re introduced to programming purely through a Jupyter notebook environment, which I must admit I have done sometimes when I’ve taught Python, and there are potentially some downsides to that. But for those who don’t know, Jupyter is a notebook environment. You have cells of code and you can have cells of text and images and things, and you can join them all together and run them, and you can move around and interact with things and so on. It’s a lovely environment for developing. There are problems related to things like the maintainability of the code in a Jupyter notebook, and the fact that you can run cells out of order. If you’re running a script, it runs from top to bottom, but in a notebook you can run a later cell, then an earlier one, and then cell three, and it will still work. But if you do it in the wrong order in future, it’s going to give you a different answer or whatever. Jupyter notebooks really give you the opportunity to combine your code with explanations, reasoning and other context for the code. It goes back to what Knuth produced years ago, the idea of literate programming, where you’re combining code and documentation in the same place. But they also give you a lovely interactive environment. There’s lots of things you can use really quite easily within Jupyter, so you can get a little slider to control a parameter for your model and then it will automatically update the graph below as you change the slider. And that’s really useful for explaining these models to people who aren’t programmers. And the other great thing about Jupyter is it provides a really easy way to use Python on someone else’s computer.
There are a lot of cloud environments where you can easily bring up a Jupyter notebook and run your code on some massively more powerful machine, maybe with access to a cluster of computers for parallel processing, and you’re interacting with it in exactly the same way that you would interact with Jupyter on your local computer.
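The point about running cells out of order can be illustrated without a notebook. In the sketch below, which is an illustration and not a real Jupyter API, each "cell" is an ordinary function that reads and changes a shared dictionary, much as notebook cells share one set of variables. The cell contents and the `run` helper are invented for this example.

```python
def cell_1(state):
    # "Cell 1": load some data.
    state["values"] = [1, 2, 3]


def cell_2(state):
    # "Cell 2": transform the data in place.
    state["values"] = [v * 2 for v in state["values"]]


def cell_3(state):
    # "Cell 3": compute a result from whatever the data currently is.
    state["total"] = sum(state["values"])


def run(cells):
    """Execute the given cells in order against a fresh, empty state."""
    state = {}
    for cell in cells:
        cell(state)
    return state["total"]


top_to_bottom = run([cell_1, cell_2, cell_3])          # like running a script
cell_2_twice = run([cell_1, cell_2, cell_2, cell_3])   # cell 2 re-run by accident
cell_2_skipped = run([cell_1, cell_3])                 # cell 2 never run

print(top_to_bottom, cell_2_twice, cell_2_skipped)  # prints: 12 24 6
```

All three runs finish without an error, yet they give three different totals. A common safeguard is to restart the notebook and run every cell from top to bottom before trusting a result.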
PETER Are there any aspects that you don’t like about Python that you wish were different?
ROBIN Python packaging is recognized as a problem. This is packaging up code as a library to be installed and used by other people. Python was relatively early in providing a defined, language-specific library of packages. It’s called PyPI. But due to some decisions that were taken early on in the life of Python, packaging can be tricky. Packages use a file called setup.py, which gives information about the package, but that file can in fact contain arbitrary Python code, which leads to all sorts of issues when installing a package. People are moving away from that now, but you still have to support the old way because many packages still use it. You run into problems also with packaging on certain systems. As I mentioned, a lot of scientific Python libraries use some sort of C or Cython or something underneath the hood, and then you sometimes need to compile that for your platform.
And so if you’re installing on Windows and you haven’t easily got a C compiler available, which is often the case, then you can run into problems installing those packages. Again, those problems are a lot less difficult to deal with now than they were maybe five years ago. But they are still a problem. One of the things that was developed in the last few years that’s really helped with that is the Conda package manager, which manages both Python packages and the system libraries that underlie some of these packages. So for example, in geospatial data we use a package called GDAL, which has a Python interface but also has a big C and C++ library behind it. And Conda is quite good at managing these on various different systems so you can get reliable environments. As for other potential problems with Python: the Python two to Python three transition, as I mentioned, was quite challenging. That put a number of people off and annoyed a number of people quite a lot. So bits of that are still reverberating around a little bit. And I guess the other thing that I would say is that the features that are being added to Python now, many of them are great and I like using them, but they do make the language more complex.
PETER Could you give an example?
ROBIN Yes. So Python doesn’t have… well, it didn’t have a select/case statement. They then added this match statement, which is really powerful. It does proper pattern matching and lets you sort of unpack objects into variables as part of the matching process. But it is quite complex, and some things inside a match statement behave in a slightly less intuitive way that isn’t the same as how they behave outside of a match statement. So it’s great. I would love to use it, but I’m wary of using it because I’m thinking that people who are less familiar with Python, who are trying to use my code or modify my code, would struggle with it. And it’s not well covered in all of the tutorials and the books because it was only introduced last year.
There’s a lot of stuff you can do these days with typing. Python is supposedly dynamically typed, but you can specify types for objects, and there are other libraries like mypy that will go and check your types and give you errors if you try and do things that aren’t allowed with those types. And that’s again very powerful, and it allows a lot of very clever stuff to happen in libraries like FastAPI, where you annotate the types of things and then it does clever stuff with validation automatically for your APIs. But again, it adds an extra level of complexity and goes away from Python as the fairly simple language that anyone can get to grips with, almost diverging into two languages, where you kind of need to know both even if you only want to use one.
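A small sketch of what type annotations look like, using only the standard library and assuming Python 3.9 or newer for the `list[float]` syntax. The function names are invented for illustration. The example also shows that Python itself does not enforce the annotations when the code runs; catching the mistake is the job of a separate checker such as mypy.

```python
from typing import get_type_hints


def mean(values: list[float]) -> float:
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def double(x: int) -> int:
    """Return twice the input. The annotation says int, but nothing enforces it."""
    return x * 2


# The annotations are stored on the function and can be inspected at runtime.
hints = get_type_hints(mean)
print(hints)

# This call violates the annotation, yet it runs without an error at runtime.
# A static type checker would be expected to report it before the code is run.
result = double("ab")
print(result)  # prints: abab
```

Libraries that validate input automatically read these same annotations at runtime, in a similar way to the `get_type_hints` call above.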
PETER All right, well, that’s a good explanation. I would like to finish today with tooling. Like any software development environment, you need to have the proper tools available. And I believe there are a number of tools available for Python. So what are you using and what would you recommend?
ROBIN A lot of this I think is very much down to individual taste. You don’t necessarily have to use what I use; you use whatever you feel comfortable with. I tend to go for a reasonably simple editor, something like VS Code, Visual Studio Code, which works with a lot of languages. So you only need to learn it once, and it’s got plug-ins for the various different languages. I use Jupyter quite a lot. As I mentioned, a lot of my stuff is done on remote clusters and things, so I’m using Jupyter notebooks there. If you’re using Conda, I’d strongly recommend actually using something called Mamba instead, which is basically a faster version of Conda. If you’re using Conda, you might be aware that sometimes it can take a long time to resolve your packages, because there are so many different packages and so many different versions of them. It can take quite a while to work out which ones to install. Mamba is exactly the same as Conda, except that it has a faster dependency resolver, and so that can speed up your Conda installs really quite significantly.
PETER The very final question, I know that I said this was the final question, but the very final one is about virtual environments, because that can be quite confusing to people. What is a virtual environment and how would you use that in Python?
ROBIN A virtual environment is a self-contained environment with a Python installation and a load of Python packages. So you enter a virtual environment and then you’re using that version of Python with those packages, and you’re completely separate from the other Python installations and the other Python environments that you have with different packages. So for my work, I would have a virtual environment for my work with one client, and I’d have a virtual environment for my development of a library that I maintain. I’d have another virtual environment for work with a different client. So I switch between these so that if I upgrade the software in one virtual environment, upgrade a couple of libraries, for example, it won’t affect any of the other virtual environments. They can stay entirely isolated from each other. And I don’t run into problems with different versions of libraries that don’t do what I expect in different places.
PETER I hope you enjoyed this little intro to Python and I hope you’ll find the links in the episode notes interesting and helpful. The music is indeed the jingle for Flying Circus, the TV series created by Monty Python. In fact, it’s a US military march called the Liberty Bell march, composed by John Philip Sousa in 1893. Monty Python allegedly used it because the music is in the public domain and royalty free. Which - and let’s be honest - is why it’s in this episode as well.
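As a sketch of the setup Robin describes, the example below uses the standard library `venv` module to create two separate environments in a temporary folder. The environment names and the helper `make_env` are invented for illustration, and pip is left out of the environments to keep the example quick and offline.

```python
import tempfile
import venv
from pathlib import Path


def make_env(parent: Path, name: str) -> Path:
    """Create a virtual environment called `name` inside `parent`."""
    env_dir = parent / name
    # with_pip=False skips installing pip, an assumption made to keep this fast.
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_a = make_env(Path(tmp), "client-a")
    env_b = make_env(Path(tmp), "client-b")

    # Each environment gets its own folder with its own configuration file.
    created = [(env / "pyvenv.cfg").exists() for env in (env_a, env_b)]
    print(created)
```

In day-to-day work the same thing is usually done from the terminal with `python -m venv .venv`, followed by `source .venv/bin/activate` on Linux or macOS to enter the environment. Packages installed while it is active stay inside that environment.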