UCL for Code in Research
The companion podcast for courses on programming from the Advanced Research Computing Centre of University College London, UK.
1/9 Research Software Engineering with Python (COMP233) - Git Part 1
Peter Schmidt - the host of this podcast - interviews Sam Harrison, an environmental modeller at the UK Centre for Ecology and Hydrology. Is version control important for his research? What tool does he use? How did he learn it? Links and transcript available in the show notes.
About Sam
About Git
- https://git-scm.com/
- https://github.com/git/git/
- https://github.com/git/git/tree/v0.99 (almost) first version of Git from 2005
- https://github-pages.ucl.ac.uk/rsd-engineeringcourse/ Lecture notes
This podcast is brought to you by the Advanced Research Computing Centre of the University College London, UK.
Producer and Host: Peter Schmidt
PETER SCHMIDT: Hello and welcome to the course on research software engineering with Python at University College London. In this first class and episode, I’ll be introducing you to a version control system called Git. It’s arguably one of the most popular version control systems there is to date, if not the most popular one. After a bit of background and history, you’ll be hearing from Sam Harrison. Sam is an environmental modeller at the UK Centre for Ecology and Hydrology in Lancaster, and Git plays an important part in his daily work, as you will hear later.
But let’s start with some background and history of version control systems. How to manage and track different versions of your research material and research output is probably a question as old as research itself. It affects everything from the draft of your thesis, background material or a research paper to your experimental analysis and nowadays, of course, the source code of the software you write. There are simply times when you need to go back to a previous version, or work with others on the same documents without overwriting each other’s changes.

Over time people have found different ways to solve this. A very hand-wavy and pedestrian way is using a naming scheme. So, for instance, people have been adding version numbers or date and time stamps to file names, and I must confess I’ve done the same myself at some point. And it kind of works, at least initially. But after a while it starts getting messy and at some point downright unmanageable, in particular when you work with others.

Software engineers have been thinking about this problem as well, and from quite early on they came up with a number of solutions, dating back to the 1960s, when a team at IBM developed a software update management tool. In 1972, Marc Rochkind, also from IBM, introduced a system called the Source Code Control System, or SCCS. A decade later, Walter F. Tichy from Purdue University in the US created the Revision Control System, RCS, which was followed by the Concurrent Versions System, CVS. A few years and a number of acronyms later, in the year 2000, another version control system called Subversion was made available, which still exists and is now part of the Apache project. And of course there are other version control systems as well, such as, for instance, Perforce. Some commercial vendors offered solutions as part of software development environments. So, for instance, Microsoft’s Visual Studio had a system called Visual SourceSafe, which my colleagues and I used in the early 2000s to develop C++ code.
So where is Git, you ask? Well, Git was released for the first time as late as 2005, and the creator of Git might actually be somebody you know: Linus Torvalds, or “Lee-nus”, as some people say. You might recognise the name Linus Torvalds as the creator of the popular Linux operating system, which he released in the early 1990s. As with Linux, Linus made Git available as open source, and in the episode notes you will find the link to the Git repository itself, from where you can still download the first release version from the year 2005, should you so wish.

In the first release notes, Linus also tells us why Git is called Git, or rather he doesn’t really, because what he writes is, and I’m quoting: “Git can mean anything, depending on your mood”. And he continues, saying it could be a “random three-letter combination that is pronounceable and not actually used by any common Unix command. The fact that it is a mispronunciation of get may or may not be relevant.” It could also mean “stupid, contemptible and despicable, simple. Take your pick from the dictionary of slang.” It could also mean “global information tracker”. Well, he’s got a sense of humour, has our Linus.

And, by the way, Linus wasn’t the only one releasing a new version control system in 2005. In the same year, Olivia Mackall launched another version control system called Mercurial, and both Git and Mercurial are still in use today. But let’s continue with Git.
Since 2005, Git has become popular very quickly, as it has been adopted in operating systems such as Linux and macOS, where it is actually preinstalled. What also helped the spread and use of Git was the launch of a cloud-based source code management service called GitHub only three years later, in 2008. GitHub and Git work very well in tandem, and in the following episode you’ll find out why. Nowadays, we have a bunch of cloud-based source control services apart from GitHub, like for instance GitLab, which was created in 2011. As I mentioned, Git is preinstalled on Linux and macOS systems. It is, of course, also available for Windows; however, you’d need to download and install a version of it yourself.

When it comes to the Git version control system, there’s quite a bit of lingo and terminology to get used to, and Sam Harrison and I talk about it in the interview shortly. But let’s start with a few of those terms and discuss some of the very basic Git commands you’ll be using quite often. First, there’s the term repository, which is used by all version control systems, by the way. It is meant to be a store of files and of any changes you have made to them, that is, their version control history. Git itself is a so-called distributed version control system, which means that each developer or Git user has a local copy of the repository on their computer. There is no single or central repository. Being a distributed system has important consequences, and this takes me right to the first set of Git commands: the commands that allow you to get an existing repository or to create a new one. So let’s say you join a research team and they use Git to manage their source code and documents. And let’s say that they also use GitHub as a joint cloud-based service. For you to get started, what you would need to do is to get a copy of the repository and all the files in it.
It’s a process that’s called cloning, and therefore in your command line terminal you would use the following Git command: git clone followed by the URL, that is, the link to the team’s repository on GitHub or wherever it is. This will download everything in a Git repository, including the change and version history. Working with existing repositories is something you’ll be doing quite often, but equally you sometimes need to create a new repository from scratch, say right at the start of a new project. The Git command for that is simply git init. With both the git clone and the git init commands, you create repositories on your computer. Once you’ve done that, you are ready to actually add new files and change existing ones, with Git keeping track of all the changes.

This takes us to the last set of commands I’m going to talk about in this episode, which is how you can add new or changed files to your Git repository. This is a two-step process, and it’s important to know how it works. Rather than putting new and changed files directly into your local Git repository, you have to add them to a temporary holding space first. This process is called staging, and the command for it is git add followed by the files you want to add, or you can simply do a git add . to add all changes in your files. Once you have added them to the staging area, you can then move them into the repository for good, and this second command in the process is called git commit. The git commit command requires you to provide a text comment to describe the changes you’ve made, which kind of makes sense. You can add that comment directly in the command with the -m option.
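The two-step process described above can be tried out safely in a throwaway directory. The following is only a minimal sketch: the directory name demo-project, the file data.csv, the commit message and the name and e-mail address are placeholders chosen for illustration, not part of the course material.

```shell
set -eu

# Work in a temporary directory so nothing else on your machine is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a new, empty repository (the alternative is git clone <URL>).
git init demo-project
cd demo-project

# Tell Git who you are, for this repository only (placeholder values).
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

# Create a file, then stage it: step one of the two-step process.
echo "temperature,depth" > data.csv
git add data.csv
git status --short

# Commit the staged file with a message: step two.
git commit -m "Add empty data table"
git log --oneline
```

Running git status --short between the two steps is a handy way to see what is currently sitting in the staging area before you commit it.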
So in summary, the basic Git commands we talked about today are git clone, for getting a copy of an existing repository onto your computer, and git init, for creating a new and empty repository. With git add, you move new and changed files into the Git staging area, and with git commit -m you move files from the staging area into your repository with a comment attached. In the class you’ll have the chance to practise all of these commands, along with some other Git commands I haven’t mentioned yet, such as how to configure Git on your machine. If you are new to version control systems, and Git in particular, I appreciate there’s quite a bit to take in. Popular as it may be, it isn’t always straightforward to get your head around it. And it’s something I touched on in my conversation with Sam Harrison. So over to Sam now.
Hello, Sam. Thanks very much for your time today. Let’s start with brief introductions.
SAM HARRISON: So my name is Sam Harrison. I work at the UK Centre for Ecology and Hydrology. I’m based up in Lancaster in the North West of the UK. My main specialism is environmental exposure modelling, so I model the flows, the transport and the chemistry of different potential contaminants, like microplastics or pharmaceuticals: how they move around the environment, and how that might affect the environment and ecosystems. I’ve got a bit of a background in software-type stuff, and obviously a large part of that modelling work now is basically writing code, building models of environmental transport of contaminants.
PETER SCHMIDT: And it’s the software bit that we would like to talk about today, in particular version control systems. So from your perspective, how important are version control systems?
SAM HARRISON: Yeah. I mean, they’re pretty fundamental in what I do. I can’t imagine being able to write the models, the code, the scripts, et cetera, that I write without the use of version control systems, because it would be incredibly difficult to keep track of versions. It would be very difficult to collaborate with colleagues. Yeah, they’re very fundamental.
PETER SCHMIDT: And you’re using Git, I assume, as your version control system. So have you ever used anything other than Git before?
SAM HARRISON: Not in anger, no. I’ve looked into other options before, but I think when I started my journey with version control it seemed to be the de facto tool to use. So I turned to that straight away and haven’t looked back since.
PETER SCHMIDT: So Git is usually used in conjunction with cloud services such as GitLab and GitHub and Bitbucket as well. These two are not the only ones. Which one are you using?
SAM HARRISON: Predominantly GitHub. I’ve used others a little bit, but again, when I started with version control, GitHub seemed to be the go-to online repository, so that’s what I’m most experienced with and what I’m most used to.
PETER SCHMIDT: So from your point of view, what is it that makes git so special? Is it special, and if so, why?
SAM HARRISON: I think it is. I mean, taking version control systems generally, what makes a version control system special is the ability to have different versions of your code and never lose old code. If you imagine the world before version control systems, you might end up commenting out chunks of code because you don’t necessarily want to lose them, but you don’t want to use them anymore, so you end up with a messy script file with loads of comments everywhere. With version control systems you can do away with that, because that code is always going to be there in the version control system. It also makes collaboration much easier. You don’t have to send files between people, you don’t have to e-mail files over, and you don’t have to be working on the same file, which can cause conflicts. It enables you to work with collaborators in parallel and merge those changes. And it really helps with things like reproducibility. So, for instance, if you run some scripts and produce some outputs and go and publish those, or write a report based on those, then it’s useful to know what version of the code was used to produce those outputs, and being able to tag versions in version control is a great way of doing that. In terms of Git over other version control systems, as I say, I haven’t used others a huge amount, but I think a lot of the power of Git comes from how comprehensive it is. It can do a lot. That also makes it quite difficult to get started with and quite difficult to get a full grasp of. Yeah, that’s a trade-off.
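Sam’s point about tagging the version of the code that produced a set of outputs can be sketched as follows. This is an illustration only: the repository name analysis, the file run_model.py, the tag name v1.0 and the messages are all invented for the example.

```shell
set -eu

# Throwaway repository with one script in it (hypothetical names).
repo=$(mktemp -d)/analysis
git init -q "$repo"
cd "$repo"
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

echo 'print("model v1")' > run_model.py
git add run_model.py
git commit -q -m "Add model script"

# Mark this exact commit as the version used to produce some outputs.
git tag -a v1.0 -m "Version used for the published outputs"

# Keep developing after the tag.
echo 'print("model v2")' > run_model.py
git add run_model.py
git commit -q -m "Update model"

# List the tags, then show the script exactly as it was at v1.0.
git tag
git show v1.0:run_model.py
```

The last command prints the tagged version of the file even though the working copy has moved on, which is what makes it possible to say later which code produced which result.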
PETER SCHMIDT: I would like to hone in a little bit on the difficulties in working with Git. Nowadays we have user-interface-based tools, such as, for instance, the one that comes for free with GitHub, which is GitHub Desktop. But a lot of the time, well, maybe not a lot of the time, but sometimes, you may want to use the command line. So how easy was it for you to get into Git, and how long do you think it took to get your head around it?
SAM HARRISON: Yeah, I probably still haven’t got my head round it fully. It’s an ever-learning process, I suppose, is the best way of looking at it, because as I say, it’s very complex and powerful. I think it’s not too difficult to get your head around the basics of adding files, committing files, pushing them up to remote repositories and pulling down changes. It’s a little bit confusing at first because there’s lots of new terminology flying around: branches, staging areas, commits, tags, remote repositories. There are all these things that you kind of need to learn the meaning of first. If you start out by learning the basic setup of how Git works, in terms of having a staging area, adding files to the staging area and then making commits, if you learn that paradigm and those concepts, it helps a little bit and gets you going. And I think how most people learn it is just through practice, through experience. I suppose my tip in that area would be to just try and practise with Git as much as you can. The more you use it, the more familiar you get with it, in the knowledge that you will end up breaking things. But that’s a learning experience.
PETER SCHMIDT: Is that how you learned it? By just practising it?
SAM HARRISON: Absolutely. Yeah. I mean, it was probably at the start of my PhD that I started using it in anger. And I used it not just for code. I decided, right, I’m gonna invest in this. And so, for instance, I wrote my thesis using version control. I wrote it in LaTeX, which is text-based files, and I used version control for that. Not necessarily because I needed to, although actually it was quite useful, but it was a great way of just learning stuff on my own without having to worry about messing things up for collaborators.
PETER SCHMIDT: That’s quite an important message actually, because it’s not only used for source code, it can also be used for documents, for text-based documents, right? I think for binary files we need to be a little bit careful, because it probably doesn’t work that well. You can store them in Git, but you can’t do things like comparing older versions with new ones, because they’re in a binary format. Can I just ask how you are using Git? Is it through the command line, or do you use a graphical user interface application?
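Comparing an older version of a text-based document with a newer one, as discussed above, might look like the sketch below. The file name thesis.tex and its contents are made up for illustration; the same comparison is far less useful for a binary file, because Git cannot show line-by-line changes for it.

```shell
set -eu

# Throwaway repository holding a small text document (hypothetical content).
thesis_dir=$(mktemp -d)
cd "$thesis_dir"
git init -q .
git config user.name "Example Researcher"
git config user.email "researcher@example.org"

printf 'First draft of the introduction.\n' > thesis.tex
git add thesis.tex
git commit -q -m "Start introduction"

printf 'Second draft of the introduction.\n' > thesis.tex
git add thesis.tex
git commit -q -m "Revise introduction"

# Compare the previous commit with the current one, line by line.
git diff HEAD~1 HEAD -- thesis.tex
```

In the output, lines starting with a minus sign were removed and lines starting with a plus sign were added between the two versions.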
SAM HARRISON: I use the command line. It’s kind of what I’m most familiar with, because I do quite a lot of coding and I like using the command line for as much as I can. Basically, I find that easier. I’ve tried using things like SourceTree before, but I think I’ve always reverted to the command line. That’s very much a personal preference, though. I just prefer typing commands and pressing enter. I know lots of colleagues who much prefer using the graphical versions because they find them more intuitive.
PETER SCHMIDT: So you mentioned that you tried graphical versions like SourceTree. Are there any others that you have heard of or could recommend?
SAM HARRISON: Yeah, I’m not so much of an expert here. GitHub, I think, has their own graphical desktop version, but to be honest, I’ve never tried that. SourceTree, when I started, was one of the free options that seemed to be the most powerful. It seemed to have a lot of features, and that’s why I gave that a go.
PETER SCHMIDT: I’m not quite sure how free it is these days. When it comes to graphical Git applications, there are the freebies like GitHub Desktop, and then there are others that you have to purchase or buy a subscription for these days. In terms of learning to use Git on the command line, what kind of tips would you give people? How do you memorise all these different commands? Because there are quite a few, aren’t there?
SAM HARRISON: Yeah, it kind of goes back to the practice point a little bit. You use it for as much as you can, because I think that’s the best way of committing stuff to memory. I mean, there are a few things you could do. So, for instance, cheat sheets are quite good resources, and if you Google “Git cheat sheet”, you’ll get tonnes of results where people have just compiled a little cheat sheet of the common commands, and you can have that as a reference. You could print it out and put it on your desk, and that’s a way of quickly being able to learn the commands. Actually, what I started doing, and I found this more useful, was making my own cheat sheet. So as I was going along, if I learned a new command, I’d add it to the cheat sheet. The way my brain works is that I’m much more likely to remember something if I write it down.
PETER SCHMIDT: So what are the basic commands in Git that you’re using, and that you’d recommend people start with, out of that whole pool of different commands?
SAM HARRISON: Yeah, I think the key things are adding, so git add and then a dot after that, which basically adds all of the files you’re currently working on to what they call a staging area. Then git commit -m and then, in quotation marks, a message. That commits the files you’ve just staged and attaches a message describing the commit. So you’ve got git add and git commit, and then if you have a remote repository set up, you can push those files to that remote repository. So git push, then the name of the remote repository, which is often origin, and then the name of the branch you’re pushing to. And by the way, this is why I say it starts getting complex: with simple commands you start delving into what remotes are and what branches are. But the common one there would be git push origin main, or potentially git push origin master, depending on what your branch names are. And then if you’re working with collaborators, there’s the opposite of that: getting code off the remote repository is git pull, again with origin main or origin master. For instance, if I was collaborating on a repository, my workflow would generally be to go to that code and, if I’m starting anew, git pull, so I get code from any of my collaborators who have written code whilst I’ve been away. Then I do my edits, and then I’ll git add, git commit and git push. The other important ones are git clone and then a URL, which pulls code from a remote repository that you don’t already have, and git init, which creates a new blank repository. They’re probably the key ones.
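The pull, edit, add, commit, push workflow Sam describes can be rehearsed without a GitHub account. In the sketch below, a local bare repository called shared.git stands in for the remote; that stand-in, the directory names sam and colleague, the file notes.txt and the messages are all assumptions made for illustration. Because the default branch may be main or master depending on your setup, the script reads the branch name instead of hard-coding it.

```shell
set -eu

base=$(mktemp -d)
cd "$base"

# A bare repository plays the role of the shared remote (e.g. on GitHub).
git init -q --bare shared.git

# First collaborator: clone, make a change, then add, commit and push.
git clone -q shared.git sam
cd sam
git config user.name "Example Researcher"
git config user.email "researcher@example.org"
branch=$(git symbolic-ref --short HEAD)   # main or master, depending on setup
echo "model notes" > notes.txt
git add .
git commit -q -m "Add notes"
git push -q origin "$branch"

# Second collaborator: clone the shared repository, extend the file, push.
cd "$base"
git clone -q shared.git colleague
cd colleague
git config user.name "Example Colleague"
git config user.email "colleague@example.org"
echo "an addition from a collaborator" >> notes.txt
git add .
git commit -q -m "Extend notes"
git push -q origin "$branch"

# Back to the first collaborator: pull before starting new work.
cd "$base/sam"
git pull -q origin "$branch"
git log --oneline
```

After the final git pull, the first clone contains both commits, which mirrors the habit of pulling collaborators’ changes before starting your own edits.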
PETER SCHMIDT: That’s quite a few. There’s git init, which is probably where you start with a brand new Git repository. Then git add . to add files to the staging area, and git commit, which is where you commit them locally to the database, because that’s what Git maintains. And then you have the interaction with the remote repository, so it’s quite a bit to get your head around. I quite like the idea of creating your own cheat sheet. Are you still adding to the cheat sheet?
SAM HARRISON: I’m not actually anymore. That’s a good question. I probably should, because I find myself Googling problems that I’ve encountered lots of times before and can’t remember how to solve them, so maybe this should be a prompt to me to go back to that cheat sheet and carry on updating it.
PETER SCHMIDT: Well, I think that creating and using your own cheat sheet is a jolly good idea. And maybe you could start one right now, as you start with Git. But Sam mentioned a number of commands I haven’t touched on yet, in particular how Git interacts with cloud-based services such as GitHub or GitLab, which I already mentioned a little bit earlier. All of this and more will be for the next episode, where you will hear from Sam again and from Irene Solba from the Alan Turing Institute. You will find the material for this course on the Moodle platform at UCL.
This podcast is produced by Peter Schmidt. The music comes from Daniel Lindenblatt in Berlin, Germany. And finally, thank you for listening. I hope you enjoy the course, and with that, goodbye.