UCL for Code in Research
The companion podcast for courses on programming from the Advanced Research Computing Centre of University College London, UK.
6/9 Research Software Engineering with Python (COMP233) - Documentation
Documenting software is part of the life of software engineers. But what kind of documentation do we need? In this episode I take you through three levels of documentation: the basic README and LICENSE files everyone should have, how to write good git commit messages, and how to use tools that turn your source code comments into browsable documentation.
Links
- https://www.sphinx-doc.org/
- https://pdoc.dev/docs/pdoc.html
- https://wiki.python.org/moin/DocumentationTools
- https://peps.python.org/pep-0287/ reStructuredText Docstring Format
- https://github.com/matiassingers/awesome-readme some README examples
- https://www.gitkraken.com/learn/git/best-practices/git-commit-message GIT commit messaging
- https://www.warp.dev/terminus/git-commit-history Git Commit history
This podcast is brought to you by the Advanced Research Computing Centre of the University College London, UK.
Producer and Host: Peter Schmidt
In this episode we’re getting busy with documentation. So, let’s get the keyboard out. Aaaaand off we go…
[typewriter song Leroy Anderson]
So, why documentation? Why bother? Haven’t we got enough to do writing code? Well, hang in there. No one’s asking you to write a bestseller or your memoirs.
Just a little something that tells people what your code is actually supposed to do and how people can use it. Or maybe you just want a little reminder of what you’ve done - in case you need to go back to it at some stage later. In short: Writing software documentation is always a little bit like washing up. It’s far less exciting than cooking and eating. But it’s got to be done.
And it’s a job as old as engineering and computing themselves. We had documentation of computers, computer programs and programming languages right from the start. After all, computers in the 1950s - right until the days when the first personal computers arrived - required a lot of experience and knowledge so that engineers could actually work on them. And then there were the first programming languages themselves and the programs that were written in them. A well-known tutorial at the time was by Grace Mitchell. Grace Mitchell was an engineer at IBM who worked on developing a programming language called Fortran. Her main focus was on version two of the language. She put together a book called ‘Programmer’s Primer for FORTRAN’ to help people get up to speed with Fortran. The book was published in 1957 and they say it inspired a lot of future engineers.
In those days manuals and documents were printed. And some of them occupied entire shelves, they were that big. In fact, having printed manuals and documentation went on for a long time.
I remember a project in medical imaging from 1997: we asked a contractor to develop a hardware driver for a tape recorder we used in the hospital. They came back with two thick folders filled with detailed specifications and descriptions of functions, all printed on shiny, glossy paper. And, of course, software applications were installed from floppy discs and later CDs or DVDs and came with little booklets on how to install and use them.
Documentation was often a big job. So much so that engineers didn’t have the time to write it themselves - as Grace Mitchell did in the 1950s. Instead, organisations hired technical writers who put it all together.
Since then, things have changed. Software doesn’t get delivered in boxes with thick manuals anymore. Instead, you download and install it. Documentation is mostly found online. And with modern integrated development environments you often have documentation built in and at your fingertips - right when you write code.
But there has also been another shift: namely, that engineers write a lot of the documentation themselves - rather than outsourcing it to a technical writer. Of course, we still have a lot of technical writers - as you can see from the number of books you can buy.
And yet, the fact that engineers can write a lot of the documentation themselves is not a bad thing.
- Firstly, they are the ones writing the code and therefore know what’s important for users to know
- Secondly, there are now a number of tools available that help you write documentation, even automatically. And I am going to talk about them in a minute.
And finally, software development is mostly a team effort. So, even if you don’t intend to write manuals, you need to tell your team mates what you’re doing and why.
In short: software documentation has got a lot easier these days and much of it is part of the process of writing code.
So, let’s dive into what kind of documentation we’re talking about and how to create it.
[transition]
The first thing I think about when talking about documentation is describing how a function or a class works. And that’s all very well. But for a new user, or someone who just joined your team, that’s already a step too far. They need a high level introduction first. On top of that, and this is something less obvious perhaps, they need to be clear under which circumstances they can use the code.
Which brings me to the two files every project on GitHub should have: a License file and a file called README. Let’s start with the license. Many think that it doesn’t matter if you don’t have one. But that’s wrong. Without a license others can’t use the code - at least not legally.
In 2015, GitHub did a survey and found that just 20% of repositories had licenses. Since then, the situation doesn’t seem to have improved much: I listened to a talk early in 2023 where someone reported similar figures.
And yet, GitHub makes it very easy to create a license for your repository. GitHub has a set of standard licenses that you can choose from. And in fact, whenever you create a new repository GitHub will offer to add a license. Even if you already have a repository but no license - you can easily add one: create a new file called LICENSE in the web interface and GitHub will offer you a license template to choose from.
I believe the most common license is the one from the Massachusetts Institute of Technology - MIT. Other popular ones are the Apache license from the Apache Software Foundation and the Berkeley Software Distribution license - BSD. I should also mention the GNU General Public License - GPL - by the Free Software Foundation. Its use is more restrictive in the sense that if you distribute software that builds on GPL-licensed code, your own software has to be released under the GPL as well. Which is why most of us pick either an MIT or Apache license.
Once you pick a license, GitHub will store it in the root directory of your project. The file itself is a plain text file and it’s typically called simply license, with the word license in capital letters. Some projects call it license dot md or license dot txt instead.
The license file is not the only one GitHub can create for you. The other is the so-called README file. On GitHub it is usually a text file in markdown format, called README dot md, and like the license file it sits in your root directory. And - again like the license file - the name is all in capital letters.
The README file wasn’t invented by GitHub, by the way. They say that the practice of creating a text file called README dates back to as early as the 1970s. Back then it was just a standard text file. With the arrival of the markdown format by John Gruber in 2004, the README file got more structure and can be made easier to read.
Which helps, when you publish a README - or any markdown files - on GitHub or similar services like GitLab.
The clue of the README file is in the name. It begs you to read it. Again, repository services such as GitHub help, because they show the README file - nicely formatted - on the web page of whichever folder they find it in.
In effect, this turns your README file into the shop window of your repository. It’s the first thing any user will see when they check out your project.
Which brings me to what should go into the README. Like any good shop window, you want any visitor and user to find out what this project is all about.
Getting the level of details right depends on your project. But I would say, the README should contain
- the name of the project, library or application
- what the purpose of it is
- how people can install it
- how people can run it and
- how they can get in touch in case they need more info or want to report a bug
A shop window doesn’t show all the stuff that’s in the store. And likewise, the README usually focuses on the bare essentials - like the ones I just mentioned. So if there is more detailed and longer technical information available, it’s good to put a link to that in the README - rather than overloading it.
There are some guidelines and templates for how to write catchy READMEs and I included them in the episode notes.
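As a rough starting point, the five items listed above can be turned into a README skeleton. This is only a sketch: the helper `readme_skeleton`, the section headings and all the example values are made up for illustration and are not taken from any of the linked templates.

```python
def readme_skeleton(name, purpose, install, run, contact):
    """Return markdown text for a README covering the five basic items.

    Hypothetical helper for illustration; the headings are one possible choice.
    """
    return "\n".join([
        f"# {name}",
        "",
        purpose,
        "",
        "## Installation",
        "",
        install,
        "",
        "## Usage",
        "",
        run,
        "",
        "## Contact",
        "",
        contact,
        "",
    ])


# All values below are placeholders, not a real project.
text = readme_skeleton(
    name="example-project",
    purpose="A one-sentence description of what the project is for.",
    install="pip install example-project",
    run="python -m example_project",
    contact="Please report bugs through the issue tracker.",
)
print(text)
```

Writing the result to a file called README.md in the root directory would give you something to edit rather than a blank page.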
Now that we have a README and a LICENSE file it’s time to move on to documenting the code itself. And how this can be done in an automated way.
[transition]
From early on, programming languages had the ability to include simple text in plain language. Text that compilers and interpreters ignored. This text could be an explanation. Or it could be a reminder that this code needs more work. I am talking, of course, about comments in source code.
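Each language has its own way to add comments. In Python, you do that with the hash character: everything from the hash to the end of the line is a comment. For text that spans several lines, you can enclose it with triple quotes at the start and end. Strictly speaking that is a string rather than a comment, and when it is the first thing in a module, class or function, Python keeps it as the so-called docstring.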
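Here is a minimal sketch of both forms. The function `celsius_to_kelvin` is just an example made up for illustration. Note that the triple-quoted text is a string that Python keeps as the function's docstring, whereas the hash comments are ignored completely.

```python
# A comment starts with the hash character and runs to the end of the line.
# The interpreter ignores it completely.

def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to kelvin.

    A triple-quoted string placed first in a function is a docstring.
    Unlike a comment, Python keeps it and stores it in ``__doc__``.
    """
    return celsius + 273.15  # a comment can also follow code on the same line


# The docstring is available at run time; the hash comments are not.
print(celsius_to_kelvin.__doc__)
```

That the docstring survives at run time is what makes it useful to documentation tools.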
For a long time, comments were meant for the programmers who worked on the code directly. And, as I mentioned earlier, technical writers were the ones to create the reference books and other help. Writing comments had little to do with technical documentation.
Not any more. And so sometime in the late 1990s engineers started developing tools that can turn comments in source code into browsable documentation. It makes a lot of sense. First, the engineers writing the code know about the functions, classes and overall structure. And then, each time the code changes - the documentation can be updated with it.
Because, in the end a lot of reference documentation is along the lines of this is the name of function X and it takes that many parameters - and perhaps an extra line or two briefly explaining what it does.
But you had to put some structure around the way comments were written. A way that allows a tool to parse the comment and format it in a way that can be easily published as for instance a markdown file or an HTML page.
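To give an idea of what such structure looks like, here is a small sketch. The docstring uses the `:param name:` field style from reStructuredText, and the hypothetical helper `docstring_to_markdown` picks those fields out and formats them as markdown. It is a toy written for this example, not how Sphinx or pdoc work internally, and the function `scale` is made up as well.

```python
import inspect
import re


def scale(values, factor):
    """Multiply every value in a list by a constant factor.

    :param values: the numbers to scale
    :param factor: the constant to multiply by
    :returns: a new list with the scaled numbers
    """
    return [value * factor for value in values]


def docstring_to_markdown(func):
    """Turn a reStructuredText-style docstring into a small markdown snippet.

    Hypothetical helper for illustration only; real tools handle far more cases.
    """
    # inspect.getdoc returns the docstring with its indentation cleaned up.
    doc = inspect.getdoc(func) or ""
    summary = doc.splitlines()[0] if doc else ""
    # Each ":param name: text" line becomes one bullet point.
    params = re.findall(r"^:param (\w+): (.+)$", doc, flags=re.MULTILINE)
    lines = [f"### {func.__name__}", "", summary, ""]
    lines += [f"- `{name}`: {text}" for name, text in params]
    return "\n".join(lines)


markdown = docstring_to_markdown(scale)
print(markdown)
```

Because the comment follows a fixed pattern, a few lines of code are enough to find the parameter names and their descriptions. Without that pattern, a tool would have to guess.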
The first time I came across such a tool was in the early 2000s - and it was called Javadoc. A documentation system for the Java language. It is still there and still used, of course.
Python has a few documentation generators on offer - in true Python fashion. One of the tools that gets mentioned a lot is called Sphinx. Other popular packages are pdoc and pydoc. Take a look at the episode notes.
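Of these, pydoc is the one that comes with Python itself, so it is the easiest to try out. The sketch below uses it to render the reference text for a single function. The function `area` is made up for the example; Sphinx and pdoc need to be installed separately and have their own ways of being run, which are not shown here.

```python
import pydoc


def area(width, height):
    """Return the area of a rectangle with the given width and height."""
    return width * height


# render_doc builds the same kind of text that the built-in help() shows.
# The plaintext renderer leaves out terminal formatting characters.
reference = pydoc.render_doc(area, renderer=pydoc.plaintext)
print(reference)
```

The output combines the function's name and parameters with its docstring, which is essentially what a reference page is.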
HTML is no longer the only format being used as a documentation output. In fact, markdown format is often supported or is even the default. Which makes it very suitable for repository services like GitHub or GitLab.
And talking of repository services. The nice thing about tools like pdoc and Sphinx is that you can run them in your code delivery workflow. For instance inside a GitHub Action. You don’t have to remember to run the documentation tool, it is automatically done each time you open a pull request.
And that really is the big PLUS of document generators.
It’s not all plain sailing, of course. You still need to install and set up the tools. Then you need to conform to a particular style of writing a comment in your source code - not to mention remembering that you should write a comment in the first place. But once you’re in the habit of doing that, you’ll get the benefit of having an up to date reference of your software.
Ok, so far I have talked about README files, LICENSE files and source code documentation.
There is one aspect of documentation, however, that I think is also important. And that’s documenting the code changes themselves using Git. And that’s what I am going to finish with in the next section.
[transition]
Here’s the thing: you are getting close to a release and a deadline. But just in the last moment, you find there is a bug you or someone in your team introduced in one of the last commits. The problem is, you can’t remember which one. And so you start looking into the list of commits on GitHub, which in itself is straightforward. But, hang on, what’s this? Every commit has the same message, that simply says: ‘Updated stuff’. That’s not helping, is it?
So you think: wouldn’t it be nice if the commits had meaningful descriptions? And you wouldn’t be the first to think so.
In fact, there are quite a few sites that talk about git commit etiquette and why you should care about it. The bottom line: the git commit message should tell you what this commit is about. By convention git messages have two parts: a short title and a longer description.
The title should give a brief indication of what this commit contains. Brief in the sense of one line long and 72 characters max. You should
- name the file that changed, or a component in case of several file changes
- give the number or ID in case the commit is in response to a ticket on GitHub or JIRA - and
- indicate if it’s an update or an addition
The description that follows can be a bit more verbose. Here you should list all the files that changed, together with a note on why each one has changed. No harm is done by repeating the issue number here as well.
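These conventions are simple enough to check automatically. The sketch below is a hypothetical helper, `check_commit_message`, that tests a message against the layout just described: a title of at most 72 characters, followed by a blank line before any description. Git itself does not enforce any of this, and the file name and issue number in the example message are invented.

```python
def check_commit_message(message, max_title_length=72):
    """Return a list of problems found in a commit message (empty if none).

    Hypothetical helper: it only checks the title and description layout,
    nothing that Git itself enforces.
    """
    problems = []
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["the title line is missing"]
    if len(lines[0]) > max_title_length:
        problems.append(f"the title is longer than {max_title_length} characters")
    # A second line, if present, must be blank to separate title and description.
    if len(lines) > 1 and lines[1].strip():
        problems.append("the title is not followed by a blank line")
    return problems


# Invented example: file name and issue number are placeholders.
good = (
    "parser.py: fix crash on empty input (#42)\n"
    "\n"
    "parser.py: return early when the file has no rows."
)
print(check_commit_message(good))
print(check_commit_message("Updated stuff\nand more stuff"))
```

A check like this could run as part of a team's workflow, but whether that is worth it depends on the team.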
And, if I haven’t mentioned it in one of the Git episodes already: make sure you commit often and in smaller bits. If nothing else, it helps make the commit messages shorter and clearer.
There is one thing that might puzzle some Git users though. Particularly those of you who prefer to use Git on the command line of a terminal: The git commit command actually has only one message option: minus lowercase m.
So what’s this about title and description? It’s got something to do with the way messages are being read and interpreted. The convention is, to use multiple lines in messages, where the first line is the title. Then you leave a blank line after which you add a longer description.
And so, on the command line you can do one of the following:
- if you use git commit without the message option, you will get a text editor in your terminal. In there type the title in the first line. Leave a blank line and then add a new line or lines with the description
- or, on the command line you use the message option of the git command twice. First for the title, then for the description.
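The second option can be sketched as follows. The helper `build_commit_command` is made up for this example and only builds the command, it does not run it. It relies on the fact that Git joins the values of repeated `-m` options as separate paragraphs, which gives you the blank line between title and description for free. The file name in the message is a placeholder.

```python
import shlex


def build_commit_command(title, description):
    """Build the argument list for a commit with a title and a description.

    Git joins the values of repeated -m options as separate paragraphs,
    so the first becomes the title and the second the description.
    """
    return ["git", "commit", "-m", title, "-m", description]


command = build_commit_command(
    "README.md: add installation steps",
    "README.md: describe how to install the package with pip.",
)
# shlex.join shows the command as it could be typed in a shell.
print(shlex.join(command))
```

Inside a repository with staged changes, the same list could be passed to `subprocess.run(command, check=True)` to make the commit from a script.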
Users of git desktop clients, such as the one from GitHub, don’t have to do this kind of gymnastics. The commit option comes with two text fields, one for the title and one for the description by default.
Ok, in summary:
- source code projects should contain a License file and a README in their root directories
- take care how you write Git commit messages: make clear what the commit is about
- and think about using tools to turn your source code comments into your very own source code reference
There is more to documentation, like writing tutorials, introductions to technologies and similar content. But that usually goes beyond the scope of what engineers can do during their working hours. Which is, of course, why we need technical writers.