UCL for Code in Research

2/9 Research Software Engineering with Python (COMP233) - Git Part 2

Peter Schmidt Season 2 Episode 2

In this episode we look into more essential Git commands, such as branching and merging. Branching and merging are key concepts that help you develop code or even text documents in a team. They help you maintain different versions of files and work on them independently.

Another element of collaborative working is provided by GitHub: the pull request. Pull requests are a great way to do code reviews, which help you avoid introducing bugs and also learn from each other.

In my conversations, Sam and Eirini talk about these key features and their experiences with them.

This podcast is brought to you by the Advanced Research Computing Centre of the University College London, UK.
Producer and Host: Peter Schmidt

PETER SCHMIDT: Hello and welcome to the course on research software engineering with Python at the University College London. There’s quite a lot to talk about when it comes to Git and GitHub, so much, in fact, that you get another episode on the subject. And in this session I want to focus on some key features of Git that will make your life as a developer easier, whether you work on your own or in a team. In particular, the Git features I want to focus on are branching and merging, and a bit later I’ll also talk about GitHub and how that fits into the world of Git. Helping me out in this episode is, again, Sam Harrison, who you’ve met in the last episode, and also Eirini Zormpa from the Alan Turing Institute in London. As I said, there is a lot to cover, so let’s go straight into it.

Branching and merging: what is it and why does it matter? In short, with branching you can maintain and work on separate versions of your code independently, and merging is kind of the inverse operation, where you bring changes of your files from one branch to another. Well, that all sounds very airy-fairy and very general, and may not make a lot of sense initially. So let’s do an example. Let’s say you have a Python script that analyses some data. It’s a crucial piece of your research work. You spent a long time developing it, and you have to use it regularly to get results for your project. But at the same time, there are also some features you want to add. On the other hand, you really need this script to work and you can’t afford to break it. So what do you do?
This is exactly where Git branches come to the rescue. To begin with, Git always comes with a single branch by default. It’s usually called main or master. This main or master branch is like the trunk of a tree from which you can create as many branches as you like. So let’s do that. The Git command for this is git branch, followed by the name of the new branch, and for this example we use the name development. So the command would be git branch development and voila, you have a new branch called development. From now on Git maintains two versions of your script: you can use one unchanged to run your analysis on the main branch and add some new features to the script on the development branch. To switch between branches, you would use the command git checkout followed by the name of the branch you want to switch to. So you do git checkout main if you want to run the stable Python script, and you do git checkout development if you want to make changes. If at any time you’re unclear which branch you’re on, you can simply type in git status and it will tell you.
Well, so far so good. But at some stage your work on the development branch is done, you’re happy with the changes, and you want to run the changed script from now on. There is no need to keep separate branches anymore. In short, what you want to do is to merge your changes from the development branch back into the script on the main branch. The Git command for this is called git merge, followed by the name of the branch from where you want to bring changes in. This means that if you do a merge, you need to be on the right branch. This can create some confusion, and I myself have fallen victim to merging changes into the wrong branch in the past. So before you do a git merge, make sure you’re on the right one with git status. So back to the example from earlier: if you want to merge the development branch into the main branch, you would do the following sequence of commands. First, git status to check which branch you’re on. Second, do git checkout main to switch to the main branch. Third, you merge with the command git merge development to bring the changes into the main branch. The git merge command has an option -m to add a text description. Finally, the git branch command you use to create a branch can also be used to delete it, and the reason I mention this is that it’s good practise to remove any branches you no longer use and no longer need. The command to remove the development branch is git branch -D development.
OK? There’s quite a lot of commands to take in in one sweep: git branch, git merge, git checkout and git status. Git branch, git checkout and git status are quick and easy to use. In principle, the same holds for git merge. In most cases it works without a glitch. I said in most cases, not in all of them. The thing is that sometimes the git merge algorithm cannot resolve differences between different versions. And rather than overwriting changes automatically, what Git does is to stop you from merging in the first place, with a message about what’s called a merge conflict. It sounds a bit more dramatic than it is, and in most instances merge conflicts can be resolved easily and quickly, but there are also techniques you can use in your development to lower the risk, which is what Sam Harrison explains in this brief chat I had with him. Sam already appeared in the previous episode, and here he talks about merging and handling merge conflicts.
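
Editor's note: the sequence of commands described above can be tried out in a throwaway repository. The following is only a minimal sketch, assuming Git is installed and a bash-like shell is available. The file name analysis.py, the commit messages and the user details are made up for illustration and are not part of the episode.

```shell
#!/usr/bin/env bash
# Sketch of the create -> switch -> change -> merge -> delete cycle.
# Everything happens in a temporary directory, so no real project is touched.
set -euo pipefail

repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Researcher"          # hypothetical identity, local to this repo
git config user.email "researcher@example.com"

# A first commit with the "stable" script (hypothetical file name).
echo 'print("stable analysis")' > analysis.py
git add analysis.py
git commit -q -m "Add analysis script"
git branch -M main                                 # make sure the default branch is called main

# Create the development branch, switch to it and change the script there.
git branch development
git checkout -q development
echo 'print("new feature")' >> analysis.py
git commit -q -am "Add new feature"

# Switch back, check where we are, then bring the changes into main.
git checkout -q main
git status                                         # reports: On branch main
git merge -q development

# Tidy up. -d refuses to delete a branch that has not been merged yet,
# whereas the -D mentioned in the episode deletes it regardless.
git branch -d development
```

Using -d rather than -D here is a deliberate choice for the sketch: it acts as a safety net if the merge was forgotten.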

PETER SCHMIDT: Does Git actually work very well when things go wrong? Because one of the things that we haven’t talked about are conflicts, and that probably is a whole different conversation to have. But let’s say that you want to merge something into a file from a different branch, say, and you have a conflict. So have you ever run into a problem that’s not actually that easy to resolve?

SAM HARRISON: Yeah, it can be. If you do your commits, your adds and your commits, in quite small chunks, for want of a better description, so, you know, if you write a little bit of code, add and commit it, that’s a good practise, because if your collaborators do that as well, then there’s less chance of getting conflicts if you’re, you know, just adding little bits of code at a time. What can happen is that you or your collaborator has just committed a huge chunk of code, and maybe you’ve been writing some other code in similar files at the same time as they’ve committed that huge chunk of code. You might end up with, as you say, conflicts. Git tries to automatically merge changes to files, so if you’ve both been working on the same file, it will try and automatically merge them. But if you, for instance, change the same line of code, then it doesn’t know what to do there. It doesn’t know which is the source of truth. So you get conflicts, and how it denotes this is, it literally adds bits of text to your source code saying: big red flag, here’s a conflict, you need to deal with this. You need to figure out which one is the best one to go with. If that’s only just one line of code, that’s pretty easy to resolve. You just choose the line of code. You can either just delete the extra bits, or if you’re using an IDE like VS Code or something like that, there are actually interactive interfaces to do that, and things like Sourcetree give you interactive interfaces to do this. That’s OK, that’s not too difficult. When you get loads of commits across loads of different files, it can be a real nightmare. And I mean, I’ve encountered situations where one of my collaborators has done a huge commit and it’s like a month’s worth of work, and they’ve just put it all into one commit and it’s created 100 files with lots of conflicts in, and you have to go through all those manually and that can be time consuming. But as a strategy for avoiding that, just doing your adds, commits, etcetera in small chunks is a good way to avoid that.
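
Editor's note: to see what the "bits of text" Sam describes look like, a conflict can be provoked on purpose in a throwaway repository. This is a sketch under assumptions: Git is installed with default settings, and the file settings.py, its contents and the user details are invented for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch: two branches change the same line, so git merge stops with a conflict.
set -euo pipefail

repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Researcher"          # hypothetical identity, local to this repo
git config user.email "researcher@example.com"

echo 'threshold = 1' > settings.py                 # hypothetical one-line file
git add settings.py
git commit -q -m "Add settings"
git branch -M main

# Change the line on the development branch ...
git branch development
git checkout -q development
echo 'threshold = 2' > settings.py
git commit -q -am "Set threshold to 2"

# ... and change the very same line on main.
git checkout -q main
echo 'threshold = 3' > settings.py
git commit -q -am "Set threshold to 3"

# Git cannot decide which version is the source of truth, so the merge
# exits with an error and writes conflict markers into the file.
if git merge development; then
    echo "Merged cleanly (not expected in this sketch)"
else
    echo "Merge stopped. settings.py now contains:"
    cat settings.py
    # With default settings the file looks like this:
    #   <<<<<<< HEAD
    #   threshold = 3
    #   =======
    #   threshold = 2
    #   >>>>>>> development
fi

# Resolve by keeping one version (which removes the markers), then add and commit.
echo 'threshold = 3' > settings.py
git add settings.py
git commit -q -m "Merge development, keep threshold = 3"
```

With a single conflicting line the resolution is one edit. The same steps apply per file when a large commit conflicts in many places, which is why small, frequent commits keep this manageable.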

PETER SCHMIDT: What Sam mentioned here, commit your changes in small chunks and as often as it makes sense, cannot be repeated enough. It doesn’t avoid merge conflicts, but it reduces the risk of them and the work involved in sorting them out. There is a lot more that can be said about branching and merging. For instance, a lot of developers have thought about branching patterns for development, and depending on the nature of the project, these branching patterns can actually be quite complex. There is an excellent blog post, for which you will find the link in the episode notes. It’s by Martin Fowler, who, by the way, is also a great source for other development practises. Git is very powerful and has many other commands. I can’t possibly cover them all in this episode, and the notes also have links to Git references and some other documentation. But I’d like to move on to another great tool called GitHub. GitHub was created in 2008. It’s now part of Microsoft. It’s a hugely successful and popular cloud-based service for managing source code repositories. It has well over 100 million repositories and over 30 million visits per month, so it’s huge. GitHub also has an ever-expanding set of features, some of which include automation of software builds and delivery, providing a space for static web sites, code inspection and automated testing, some project management tools, and, and, and. But how does the interaction between GitHub, the remote service, and your local Git repository work? In essence, there are two Git commands that allow you to share code and code changes between repositories, including remote ones on GitHub. They are git push origin, which sends changes from your local repository to GitHub and the corresponding branch, and on the other hand git pull origin, which merges any changes from GitHub into your local repository. In short, git pull and git push are a way of sharing between a local and a remote Git repository.
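
Editor's note: the push and pull round trip can be rehearsed without a GitHub account. The sketch below assumes Git is installed and uses a local bare repository as a stand-in for the repository on GitHub. The directory names laptop and colleague, the file name and the user details are hypothetical. With a real GitHub repository, origin would point to its URL instead.

```shell
#!/usr/bin/env bash
# Sketch: sharing commits between two local clones through a common remote.
set -euo pipefail

work="$(mktemp -d)"
git init -q --bare "$work/remote.git"              # stand-in for the GitHub repository

# First clone: make a commit and push it to the remote.
git clone -q "$work/remote.git" "$work/laptop"
cd "$work/laptop"
git config user.name "Example Researcher"          # hypothetical identity
git config user.email "researcher@example.com"
echo 'print("hello")' > analysis.py
git add analysis.py
git commit -q -m "Add analysis script"
git branch -M main
git push -q origin main                            # local commits -> remote

# Second clone, playing the role of a collaborator.
git clone -q --branch main "$work/remote.git" "$work/colleague"
cd "$work/colleague"
git config user.name "Example Colleague"           # hypothetical identity
git config user.email "colleague@example.com"
echo 'print("from colleague")' >> analysis.py
git commit -q -am "Add a line"
git push -q origin main

# Back in the first clone: fetch and merge the collaborator's change.
cd "$work/laptop"
git pull -q origin main                            # remote commits -> local
cat analysis.py
```

The extra step mentioned in the episode is visible here: a commit only reaches the other clone after an explicit push on one side and a pull on the other.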
All of this means that there is an extra step when you want to make changes in your local Git repository and make them available on GitHub as well. So where do pull requests come into this? Pushing changes directly into a branch on your remote repository can be quite a risk. If you’re working in a team, you could overwrite someone else’s work, or they could overwrite yours, or errors could slip in, which is why development teams often introduce rules for how code can be merged to GitHub repositories. Rather than allowing you to push the changes directly onto the branch, the pull request follows a two-step process. First, you submit a request for your branch with your changes to be merged into another branch, and you need to do this on the web page of your GitHub repository directly, with a button called compare and create pull request. That’s step one. The second and final step is for someone to accept the request and pull your changes into the target branch, hence the name pull request. That someone is called the reviewer, and who that is depends on how the repository is set up. I’ll return to that in a minute. The reviewer will be able to see how many files have changed and look at the changes in detail. They can also leave comments, and on top of that, GitHub also warns you if there is a potential merge conflict. All of this makes reviewing code easy. Code reviews are indeed a useful software development practise. Their purpose is not only to avoid introducing bugs, but also to be a learning experience for both the reviewer and the developer. Of course, the pull request on its own is not a code review. It’s up to you, the developer, and the reviewer to make the most of it and turn it into something useful. I personally have benefited hugely from pull requests. GitHub allows you to set up pull requests in many different ways. How depends on the project and, at the end of the day, on you and your team. Should every single change require a pull request? Who should approve it? Should there be exceptions, and could people bypass the requests, and under what circumstances? These are questions Git cannot answer, but only you and your team.
Well, that’s quite a lot to take in in this episode, and I believe it’s time for a break. In this final section, I’d like to turn over to Eirini from the Alan Turing Institute. I met her recently to talk to her about Git, GitHub and her experiences as a teacher. So let’s hand over to Eirini.

EIRINI ZORMPA: So my name is Eirini. I am a research community manager at the Alan Turing Institute and yeah, I’m here to talk about Git, I guess. So I’ll talk about my experience with Git. I started learning how to use this when I was doing a PhD in psycholinguistics, so like experimental psychology, and after my PhD I transitioned to a trainer role at the Delft University of Technology, where I taught research data management and open science to early career researchers. This is when I really got into teaching Git through The Carpentries, Software Carpentry. And I kind of carried that forward in my current position at the Turing, where I work with health researchers, mostly, that use AI methods to study multiple long-term conditions. A lot of them have very diverse backgrounds, and I do a lot of work teaching kind of foundational computational skills, including version control with Git.

PETER SCHMIDT: Talking about your teaching experiences, what has your teaching experience been like for Git and GitHub? How easy is it for the students to get their heads around them?

EIRINI ZORMPA: I think it can be really difficult. I think the terminologies used in the software themselves don’t make it necessarily easy for people to know what is meant. You know, that Git and GitHub are different is already something that is really important, and one of the main kind of misconceptions that I tried to disabuse people of when teaching. So the first thing I had, in terms of misconceptions in my teaching, was that Git and GitHub are not the same thing. Another was that GitHub is not only for programmers, so you can use both Git and GitHub for all sorts of things, all sorts of files that are text based. In my current position we use this a lot for documentation as well, which is really important in research. And the other thing was that I encourage people to collaborate with other people when using GitHub, but for the shier amongst them, I do remind them that using Git and GitHub is a good way for you to collaborate with yourself as well. It’s not something that you only need to use if you’re planning on creating the next big open source tool or whatever, but it is a really useful tool to be using even if you’re working on something small, even if you’re working by yourself most of the time. Because I think people are like, oh, is this over-engineered, or is this too much for this little script that I’m writing? No, you know, this can give you a lot of benefit even if you’re working by yourself.

PETER SCHMIDT: So you mentioned the terminology. Are there any other issues that you find, from your experience of teaching, that people find difficult? Are there any particular commands or parts of the process that people find difficult to get into?

EIRINI ZORMPA: Yes. And I will say that I teach beginners and I teach the basics, so I have thankfully never had to teach anyone how to do an interactive rebase or whatever, you know, fancy things people do. But honestly, something that I struggled with when I was teaching is how often GitHub changes and how often it does things for you that it thinks you want, but you actually don’t. I will give an example of this.

I was trying to teach people about collaborating with each other on GitHub and I wanted to start with branching. I thought that we had given everyone the correct access to the repository. We used an organisation structure, and I thought if people were added to the organisation they would have access to create branches. That is not correct. You need to have write access. So when people tried to create a branch in that repository, GitHub automatically created a fork instead, without ever asking you or letting you know that this is what it did. So that was very confusing for people, because when they were trying to submit the pull request, et cetera, it just looked different. And we were all very confused, because this was an online course, so I couldn’t see what they were seeing and they couldn’t understand why things looked different between what I had and what they had. So that was something that you learn from experience, right? Like the differences between branching and forking, and when you’d use one and when you’d use the other, is something that I think people do struggle with. Something else that I think is more common now is that you work with so many documents, like Google Docs or whatever, that autosave. Five to ten years ago, when you were writing your Word document and you had to click on the floppy disc to be like, please save this version of the file, right, it was much easier to create a link between that and the process of committing something. But now, when you have the autosave, people do sometimes find it a little bit, not jarring, but weird that, oh, I have to make an active decision to save this. It’s a difference in how technology has changed how we interact with files, in a way that I was not necessarily expecting.

PETER SCHMIDT: You mentioned the keyword that I wanted to talk about, which is branching. I think you highlighted a very important difference between branches and forks, but maybe you can highlight or explain that a bit more in detail. 

EIRINI ZORMPA: Yeah, totally. So the way that I like to think about the difference between branching and forking is that a branch is something you would create when you are already a collaborator in that repository and you want the changes that you have made to be incorporated into the original repository. You would create a fork when you’re someone external to the team, and perhaps you want the work that you are doing to be independent of the original. You always have the option to suggest your changes back to the original repository, but you don’t have to. And if that’s what you want to do, it does make a lot more sense to create a fork rather than to create a branch. So one of the differences is a little bit social: are you already part of the developer team? And the other is a little bit about the intention that you have with the work that you’re creating.

PETER SCHMIDT: I mean, you can create branches locally in Git as well as in GitHub, so I think it’s good to keep that in mind: branching doesn’t work only in GitHub, it’s actually a Git feature. How easy is it to create branches and under what circumstances would you recommend we use them?

EIRINI ZORMPA: Yeah, I would recommend creating branches a lot. And yes, as you said, you can create branches in Git and GitHub, but you can only have forks in GitHub, like that’s the only context in which that makes sense. So, for branches, yes, I think frequent is sensible. You would create a branch when you want to test out a feature. So say that you’re developing an analysis pipeline and maybe you want to change how you cleaned your data at the beginning, and you don’t know if that is going to work. So you don’t want to break this beautiful code that you’ve already written, that importantly works, by changing something that could then make your pipeline not work. So you want to experiment with that in a safe way, in a parallel production line. So that is when you create a branch. And you’re like, OK, I’m going to try this out over here. I’m going to see if it works. And if it does work and it does what I want it to do, and if it is better than what I already had, then I can merge my changes into the main branch such that it is now my main development. So this is the case in Git in general. Going into the very specific circumstances of GitHub, the reason why I love branches is that a lot of the work that I do is really collaborative. The way that we work is that our main branch tends to be protected, such that you can’t actually just commit directly into the main branch. You can only commit a change to a new branch, which then one of your collaborators has to review and either approve or ask for some changes. If you’re working in that kind of collaborative space, I think having a lot of those branches is really, really nice, because then you can have a branch per kind of feature. So it’s really nice and well organised. But also it’s very easy for people to have a look at what you’re working on, and when you want feedback, you submit that pull request. It’s a great mechanism for updating people on what you’ve done and using it as quality control, essentially.
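
Editor's note: the branch-per-feature way of working described above might look roughly like this on the command line. This is a sketch only. A local bare repository stands in for GitHub, and the branch name data-cleaning-notes, the file README.md and the user details are hypothetical. Branch protection and the pull request itself are set up on the GitHub web page and are not shown here.

```shell
#!/usr/bin/env bash
# Sketch: keep main untouched, do the work on a feature branch, publish that
# branch, and look at the commits a reviewer would be asked to pull in.
set -euo pipefail

work="$(mktemp -d)"
git init -q --bare "$work/remote.git"              # stand-in for the GitHub repository
git clone -q "$work/remote.git" "$work/local"
cd "$work/local"
git config user.name "Example Researcher"          # hypothetical identity
git config user.email "researcher@example.com"

echo "Project notes" > README.md
git add README.md
git commit -q -m "Start documentation"
git branch -M main
git push -q origin main

# One branch per feature (hypothetical name); commits go here, not on main.
git branch data-cleaning-notes
git checkout -q data-cleaning-notes
echo "Describe the data cleaning step." >> README.md
git commit -q -am "Document data cleaning"

# Publish the branch. -u records the remote branch as its upstream.
git push -q -u origin data-cleaning-notes

# These are the commits that a pull request from this branch into main would contain.
git checkout -q main
git log --oneline main..origin/data-cleaning-notes
```

On GitHub, the next step would be to open a pull request from the pushed branch into main and ask a collaborator to review it.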

PETER SCHMIDT: That’s the pull request subject, which is very closely linked to branching and branching patterns in GitHub, because you can set it up so that nobody can submit anything to a particular branch. And you mentioned main, which I think is the default branch that GitHub creates for you. Nobody can commit to it without having approval by at least one other person. And that’s the beauty of the pull requests. It makes it a little bit harder at first, you know, because you’re forced to actually discuss things with other people.

EIRINI ZORMPA: Also, I spoke about confusing terminology before. How confusing is the term pull request? I’m sorry, it took me quite some time to wrap my head around it. OK, I am requesting that you pull my changes. It’s like, what? OK, surely there’s some…

PETER SCHMIDT: I know, it’s like double negation. Anyway, so you push something to pull something and it’s kind of weird. Pull requests are actually very easy to set up, and you have to do that in GitHub. Are there any particular branching patterns that you have come across that you would recommend? You mentioned one already: you have the main branch, which is protected. It could be something like your release branch, for instance. And then you have another branch, a feature branch, where you put the actual updates and then pull them in. Are there any other patterns that you’ve observed?

EIRINI ZORMPA: That is the one that we mostly work with. I will kind of say that I probably have a somewhat strange experience with using Git and GitHub, just because when I was using it for the most common thing that people use Git and GitHub for, writing code, I was mostly doing it by myself, and now that I do use Git and GitHub very collaboratively, it is almost exclusively for documentation and planning documents. I don’t know if that creates some kind of difference in the patterns of branches that people use when working with these tools, but it is something that we’ve been working with in the team I’m part of, and it has served us very well.

PETER SCHMIDT: Yeah, as Eirini said, branching patterns really depend on the project, and it’s probably a good idea to start with something simple first. But there are also three things she mentioned that I’d like to finish with in this episode. The first is that Git and GitHub are not only for code and software development. Take, for instance, the Turing Way Eirini mentioned, which consists of documents sitting in a GitHub repo. The second is that GitHub and Git are not only good for working with others. You can also use them effectively for yourself, and from my own experience I can confirm that Git and GitHub are really essential for a lot of things that I do, like, for instance, this podcast. And finally, the language and the terms may sound strange if you are new to Git, but give it a try. Practise it and be patient, it will pay off. You will find the material for this course on the Moodle platform at UCL. This podcast is produced by Peter Schmidt. The music comes from Daniel Lindenblatt in Berlin, Germany. And finally, thank you for listening, and I hope you enjoy the course. And with that, goodbye.
