UCL for Code in Research

4/9 Research Software Engineering with Python (COMP233) - Data Formats

Peter Schmidt Season 2 Episode 4

In this episode I'll be discussing data formats such as CSV, JSON and YAML. My guest is Nick Radcliffe from Stochastic Solutions and the University of Edinburgh. Nick's expertise is in data science, and he has a lot to share about data, data formats and how to use them.


This podcast is brought to you by the Advanced Research Computing Centre at University College London, UK.
Producer and Host: Peter Schmidt

PETER In this episode I'll be talking about data and data formats. My guest is Nick Radcliffe, CEO of Stochastic Solutions and a visiting professor of mathematics at the University of Edinburgh. Nick has worked in data science for a long time and is the author of a Python library called test-driven data analysis, or tdda for short. You'll hear from Nick a bit later. 

But before that, let’s talk about some of the popular data formats and where they came from. 

Data formats have existed for a long time. After all, computers need to be able to understand the kinds of data you give them, which means you need to set out the structure of the data you pass in. A very simple structure is a list or table, where each row has a set of columns into which you can insert some values. One of those list-based formats is the so-called Comma Separated Values format, or CSV for short. It goes back to the early 1970s and was used by IBM in a compiler for the Fortran programming language. The name CSV itself seems to have appeared in the early 1980s. 

And CSV files have stayed with us ever since. They have a very simple structure: basically a list of rows, where each row has a set of values separated by a comma. Commas are not the only way to separate values, though. You could also use a space, a tab, a semicolon or other characters. The values themselves can be anything, as long as they can be expressed as a string, that is, as text. 
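To make that concrete, here is a minimal sketch using Python's built-in csv module (the file name and values are just illustrative):

```python
import csv

# Write a small table to a CSV file ("experiments.csv" is just an example name)
rows = [
    ["sample", "temperature", "reading"],   # header row
    ["A1", 21.5, 0.034],
    ["A2", 22.0, 0.036],
]
with open("experiments.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back: every value comes back as a plain string
with open("experiments.csv", newline="") as f:
    for row in csv.reader(f):            # pass delimiter=";" or "\t" for other separators
        print(row)                        # e.g. ['A1', '21.5', '0.034']
```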

And that, of course, is one of the downsides of CSV files. In each row, your CSV file may contain a wild mix of different data: simple integers, real numbers (that is, floating-point numbers), strings, or complex types such as embedded lists or maps. To help programs deal with this, you could add a header row to indicate the type of the values. But it remains a very weak binding, and your CSV file, even when it is machine generated, can contain the wrong kind of data in the wrong place. 

Since then, other data formats have come along that allow better and more complex structures and type definitions. JSON, for example. JSON stands for JavaScript Object Notation. As the name suggests, it started with the JavaScript language, one of the core technologies used in web applications and platforms. Two people, Douglas Crockford and Chip Morningstar, are credited with publishing it in the early 2000s. It was first standardised in 2013. Around the same time that JSON appeared, another data format was published, called YAML, which originally stood for Yet Another Markup Language (these days it officially stands for "YAML Ain't Markup Language"). YAML also allows for deeper structures and data types, but the format is somewhat easier to write by hand. 

CSV, YAML and JSON are all text based data formats and, therefore, easily portable between different machines, different applications and different operating systems. 

But each of them tends to be used for different purposes. CSV is usually associated with spreadsheet applications like Excel. Sometimes you can write out database tables as CSV files as well. But when you have relational databases or more complex data structures, CSV is not a good choice. 

Which is where JSON comes in. It may have started as a data format for web based applications, but it is used in a much wider range of applications today. Often, JSON is used as a format for exchanging messages and data between applications. It’s also used by databases to import and export data tables. 

YAML also allows for structured data and data types. I am not sure how widely it is used to store data or as a data exchange format. However, it is used a lot as a format for configuration files, be that for systems or for individual applications. 

Python has a rich set of tools for parsing and writing all of the formats I have mentioned so far: CSV, JSON and YAML. In addition, libraries such as numpy and pandas have their own readers and writers for some of these formats. 
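As a rough sketch of what that looks like in practice: the csv and json modules ship with Python itself, while YAML support needs a third-party package such as PyYAML (the file names below are just examples):

```python
import json

# JSON: dictionaries, lists, numbers and booleans survive the round trip typed
config = {"runs": 3, "tolerance": 0.01, "labels": ["a", "b"], "verbose": True}
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
with open("config.json") as f:
    loaded = json.load(f)

# YAML: needs a third-party library, e.g. PyYAML ("pip install pyyaml")
import yaml
with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f)
with open("config.yaml") as f:
    loaded_yaml = yaml.safe_load(f)      # same nested structure as the JSON version
```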

There are a lot more file formats out there, of course, including proprietary formats in commercial applications. The one thing we haven't talked about yet is what to do when you want to store and process really big data sets, as in data science, machine learning and artificial intelligence applications. We're not talking large Excel files here, with their limit of just over a million rows. These are datasets measured in gigabytes or terabytes; datasets that are stored on cloud services or maybe processed by supercomputers. Storing them as CSV, JSON or YAML files is usually not practical. 

There are two formats that I'd like to mention briefly: HDF5 and Parquet. 

Parquet is an open source project and part of the Apache Software Foundation. Parquet files are stored in a format that allows for quick access, and one aspect of this is that, unlike CSV files, Parquet files are stored by column rather than by row. Parquet is used, for instance, in cloud-based services and solutions. 
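As a small, hedged illustration with pandas (assuming a Parquet engine such as pyarrow is installed; file and column names are made up):

```python
import pandas as pd

# A tiny data frame standing in for a much larger dataset
df = pd.DataFrame({"sensor": ["a", "b", "a"], "value": [0.1, 0.4, 0.2]})

# pandas delegates Parquet I/O to an engine such as pyarrow ("pip install pyarrow")
df.to_parquet("readings.parquet")

# Because Parquet is column-oriented, you can read back just the columns you need
values_only = pd.read_parquet("readings.parquet", columns=["value"])
```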

HDF5 - or to give it its full name, Hierarchical Data Format version 5 - is also capable of dealing with large data sets. It started life in 1987 as a portable scientific data format, the All Encompassing Hierarchical Object Oriented Format, at the National Center for Supercomputing Applications (NCSA) in the US. And there is the clue to its use: supercomputers, high performance computing and parallel computing. The format is in fact just one part of a whole package that consists of a data model, libraries, tools and more. It is maintained by a non-profit organisation, the HDF Group, and has an open source licence. 
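A minimal sketch of the hierarchical idea, using the third-party h5py library (group, dataset and attribute names here are purely illustrative):

```python
import numpy as np
import h5py   # third-party package: "pip install h5py"

data = np.random.rand(1000, 3)

# HDF5 files are hierarchical: groups act like folders, datasets hold arrays
with h5py.File("simulation.h5", "w") as f:
    run = f.create_group("run_001")
    run.create_dataset("positions", data=data, compression="gzip")
    run.attrs["temperature"] = 300.0           # metadata stored alongside the data

with h5py.File("simulation.h5", "r") as f:
    first_rows = f["run_001/positions"][:100]  # read a slice without loading everything
```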

There is a lot more that could be said about both Parquet and HDF5, but that would be beyond the scope of this episode. I have provided some links in the episode notes if you'd like to follow up on them. 

And now it’s time for my little chat with Nick Radcliffe. 

PETER Hello, Nick. Nice to meet you again, 

NICK Hello, 

PETER Maybe you can quickly introduce yourself. 

NICK Sure. So I run a company called Stochastic Solutions, which is, I guess, what we'd call a data science company these days. We write data analysis software and do data analysis for people. I'm also a professor of maths at Edinburgh and an organiser of PyData Edinburgh, and various other things like that. I've been doing this for a long time now. 

PETER Indeed. And you're just the man to talk to about how to deal with data. You've been working with machine learning for a long while. So let's start with this: what are the typical data formats that you've come across? 

NICK There are many, of course, and varied. But the sad truth is that the lowest common denominator remains CSV files, what we used to call comma-separated value files, with Microsoft renaming them character-separated value files. You kind of have to be able to deal with that because that's the only thing that absolutely everyone understands. These days JSON is a lot more common, particularly for web-related things. Obviously, databases are an important source of data, and you can get things in a much more structured format that way, which helps with metadata and formats. There are also modern document-based databases, things like Mongo, which kind of use something more like JSON as their storage format and deal with somewhat more loosely structured data. And then of course there's all the proprietary stuff, which can be anything at all. So there's binary files, image files, video files, audio files and just dumps of arrays, pixel arrays, whatever. And I guess you face all of those depending on what you're doing. 

PETER Yeah. CSV has been with us for quite some time. So where does it actually come from? I encountered it first back in my spotty youth when I dealt with Excel. 

NICK I don't know who first used it or where it first came from, but it's kind of a very natural format for anyone writing data to disk or through a pipe. So the advantage of it is that pretty much anything can read it. And these days, even if that's Unicode data, maybe in a UTF-8 encoding or something, that's still easy. The disadvantages are, of course, that it's very ambiguous and by default it doesn't carry any metadata with it. And so, you know, there's the question of how you know what a valid row was. How do you know the types? What happens when the apparent type changes halfway down the file? It's kind of the format that everyone can use but everyone hates, because so many problems come from data being read incorrectly. 

PETER But JSON should make it a little bit easier these days, shouldn’t it? Because I think it’s more structured, and I think you can encode things like the type a little bit more clearly than you do with CSV. 

NICK So JSON has a couple of advantages, you know. I think some people would say it's the best thing that came out of JavaScript, actually. But it has a few simple types, and it only has one kind of number. So it doesn't distinguish between floating-point values and integers, but because it's text based you can put the decimal point in, and that gives a strong indication. It understands booleans, true and false. It understands strings, and it understands lists and dictionaries, as long as the dictionary key, in the case of JSON, is a string, a quoted string, mapping to some other value, which can then be any of its basic types, or another dictionary, or a list. It's pretty flexible, and it has very unforgiving parsers. So if there's anything wrong in a file, almost all parsers will refuse it, which in my view is a very good thing. It's the opposite of HTML. The fact that it's unforgiving, as in if there's anything wrong with it at all, a missing quote, a missing comma or an extra comma, whatever, it will refuse it, makes it slightly more difficult to write by hand, but it means that there is almost no bad JSON in the world. The format is specified as UTF-8 as well, so it's very good from that point of view. One of its limitations is that it doesn't have any kind of understanding of dates or times or timestamps. So they have to be encoded as text, and then you have to have some way of deciding when you're going to interpret a field. The other thing that's very interesting about JSON is that it allows hierarchy, and therefore it's good for not so much unstructured data as more variable data, tree-type data and things like that. The other downside of it, which is a fairly small downside, is that by default it's quite verbose, because each data item typically has a label in a dictionary, and the labels can be repeated very many times. That's not a very big problem because you can just compress them, but it does mean that a raw dump will typically be very large compared with a CSV file. The other thing that can be slightly challenging about JSON is that, typically, if you have a large number of records in a file, you have an opening delimiter at the start of a list or a dictionary or something, and a close at the end, which makes it just a bit more difficult to process bit by bit. The way people sometimes work around that is to use what's sometimes called JSON Lines format, where each line is a JSON dictionary, often with common keys, but there doesn't have to be an opening delimiter for the whole collection at the start and a closing delimiter at the end. But other than that, it's a very useful format. 
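To illustrate the JSON Lines idea Nick mentions, here is a small sketch using only the standard json module (the record contents are made up):

```python
import json

records = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.7}]

# JSON Lines: one JSON object per line, with no [ ... ] wrapping the whole collection
with open("records.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# That makes it easy to process a large file one record at a time
with open("records.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        print(rec["id"], rec["score"])
```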

PETER It's interesting you haven't mentioned YAML, because YAML is used in computing as well. But is it actually used in anger when it comes to data processing and machine learning? 

NICK Yes. Yeah. So I suppose it's not a coincidence that I haven't mentioned YAML. I hate YAML. A lot of people love it; I mean, a lot of people have a very great fondness for YAML. So YAML has very similar capabilities to JSON, but it's easier for humans to type, and it's a very forgiving language. It's quite hard to type stuff that isn't valid YAML, and in the same way that I was saying a good thing about JSON is that its parsers tend to be relatively unforgiving, it's not exactly that YAML is forgiving, it's that most things are valid YAML and it will find a way to interpret them. And very small differences, like the presence of spacing or whatever, can have a very big effect on a YAML file and change its meaning very dramatically. Some people love that. If you think about the two classic things that people know that are forgiving and unforgiving: HTML has always been incredibly forgiving. You can put almost any old rubbish in an HTML file and browsers will bend over backwards to interpret it in some useful way. And XML is totally unforgiving. Again, if there's a single stray character, it's actually mandated that an XML processor will reject the data. YAML is much more like HTML: in practice it is very forgiving, and easy to type, therefore, while JSON is unforgiving. 

There's a law, actually, I think it's Postel's law, which is that things should be forgiving in what they read and strict in what they write. And that's not a bad principle. But my view is that, in the end, strict formats mean that there's less bad data around and, you know, knowing you've read something correctly ends up being more beneficial. But there are many opinions. 

PETER I know. But I think when it comes to machine learning or AI, cleaning up the data is very often a laborious task. From that point of view, having a mechanism that enforces strictness, adhering to the rules of how you write a file, is probably welcome, and it makes the life of data scientists and data processors easier. 

NICK Yes. And of course, with any format, you can extend it with more metadata. So you can have a description of your CSV file saying what the separator is, what the quoting rules are and so forth. That can make everything much more reliable. It's something I always try to do. 
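One way to do that, sketched here with a purely hypothetical JSON "sidecar" file that records the separator, quoting rules and column types of a CSV file:

```python
import csv
import json

# Write a small CSV with explicit settings (semicolon separator, UTF-8 encoding)
with open("sensor_readings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";", quotechar='"')
    writer.writerow(["sample", "reading"])
    writer.writerow(["A1", 0.034])

# ...and a hypothetical metadata sidecar describing how it was written
meta = {
    "delimiter": ";",
    "quotechar": '"',
    "encoding": "utf-8",
    "columns": {"sample": "str", "reading": "float"},
}
with open("sensor_readings.meta.json", "w") as f:
    json.dump(meta, f, indent=2)

# A reader can then use the description instead of guessing
with open("sensor_readings.meta.json") as f:
    meta = json.load(f)
with open("sensor_readings.csv", newline="", encoding=meta["encoding"]) as f:
    reader = csv.DictReader(f, delimiter=meta["delimiter"], quotechar=meta["quotechar"])
    rows = [{"sample": r["sample"], "reading": float(r["reading"])} for r in reader]
```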

PETER The next question is around Python and how Python can help with digesting data. What kind of libraries are used, particularly when it comes to large data sets? Because I think AI and machine learning very often deal with large datasets. So what does Python bring to the table? 

NICK Python brings an enormous amount to the table, which is the reason why it's one of the two major, and probably the pre-eminent, ecosystems in use in data science; R is really the other. Python is a surprising choice for data processing or data analysis at some level, because it's an incredibly slow interpreted language. And if you write all of your processing code in native Python without anything to make it fast, it will be potentially hundreds or even thousands of times slower than a compiled language like Rust or C or something. But in practice, the reason it's good is that what people should do, and do do in almost all cases, is use very highly optimised C or Rust or whatever libraries to do the processing. The granddaddy of those in Python is a library called numpy, from NumFOCUS. There is some dispute about how it should be pronounced, but I think it's numpy, for numerical Python. It has support, fundamentally, for arrays of data of various basic types, and very fast operations for manipulating those arrays. Those can be one-dimensional arrays or multi-dimensional arrays. It's a very powerful library and has been used for as long as Python has been important in data science; it was probably the thing that made Python important in data science. So that's a very good thing to use. And it's important to realise that when you're using numpy, you only really get the benefit if you use numpy functions for doing things like adding two columns together, summing a column, or anything like that. If you iterate over the elements yourself in Python, it will potentially be slower than just using a Python list, actually. So numpy is incredibly important. More recently, pandas has become very important 
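A minimal sketch of the point about using numpy's own operations rather than looping in Python (the array sizes are arbitrary):

```python
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Fast: the addition and the sum run inside numpy's compiled code
total = (a + b).sum()

# Slow: iterating element by element in Python throws the benefit away
total_slow = 0.0
for x, y in zip(a, b):
    total_slow += x + y
```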

PETER I would like to talk about pandas, because I think a lot of people are using that now. 

NICK Yeah, so pandas is a library I don't use and don't like, but it's very popular with data scientists, because it started off being very closely linked to numpy. It basically provides a data frame, which you can think of as being like a database table, essentially. So, at least in the simple case, it's a set of named columns for a number of rows, potentially a very large number of rows, as with numpy, along with operations for processing that data frame, for displaying that data frame, and so forth. That's become very important as well and is used incredibly widely in Python. And then there's a library called SciPy, or at least that's how I say it, for scientific Python. I had a colleague who would call it Skippy, but 

PETER That’s for another day. 

NICK SciPy provides a great number of scientific and statistical functions, and it's compatible with both pandas and particularly numpy. So that's very useful for distributions and sampling and all sorts of stuff like that. And then there's a library called scikit-learn, which is, again, incredibly widely used, and that has a lot of machine learning paradigms that are efficient and which will typically operate over numpy arrays or pandas data frames. So those are probably the most important things historically, and all of those are written in C or some other compiled language and can be incredibly efficient, just as good as, in fact probably better than, what you would typically write yourself. More recently, a library that's becoming very popular and has some of the same sorts of capabilities as pandas is a library called Polars. Polars is written in Rust. It's incredibly fast, very efficient and very safe. But it also has Python bindings, so again, you can use it directly from Python and 

PETER How’s it spelled? 

NICK It's spelled P-O-L-A-R-S, and sometimes people put a dot before the RS. The RS comes from Rust. 

I think, and it has adopted many of the lessons from databases. It has a query optimisation stage and it also supports parallelism. You know, we've talked about what we mean by big or large in the context of large data, and there are various different scales that are interesting. Obviously, Excel started with a limit of 65,000 rows and then had this important moment about ten years ago where it expanded to cope with a million rows of data. But those are tiny datasets by the standards of modern data science. In terms of the sizes that we really think about, one question is: will all the data fit in memory on whatever computer you're running on? 

PETER Indeed. 

NICK And really, when you say it fits in memory, it probably needs to fit in memory a few times over, because you typically have multiple copies of things lying around as you process it. If it will fit in memory, then your life is relatively easy. If it won't fit in memory, then everything becomes harder, and at that point you need either to write fancier code to bring it in in pieces, or to use a database that will just do that for you. And there are some very high performance databases; that's another way of doing all of this stuff. Or you need a library like Polars, which will kind of handle all that for you. So one dimension of the scale of data is just storing and accessing it. The other question is how you process it. And obviously the machines that we're sitting in front of: I'm sitting in front of a Mac Studio that has 20 processors; even my iPhone has about four processors these days. So we all have these shared-memory parallel machines, and again, if you just write code in Python without taking very special care, you will only use one of those cores. What you need, in order to process really large amounts of data, is to use multiple processors or multiple cores or whatever, either in a single machine like we all have, or in a much bigger cluster somewhere, where potentially the data won't even necessarily be stored in one contiguous bit of memory; it might be distributed across various machines or whatever. And of course, the other libraries that we should probably talk about are the various neural network libraries, which are important both because they are a bit specialist, but also because they are key to a lot of the image processing and speech processing and video processing. These are obviously neural networks, or deep learning models if you prefer. And again, there are a number of libraries like TensorFlow which are parallel and are available with bindings in Python. There are quite a lot to choose from these days, actually. Those will also potentially use GPUs for computation; the graphics processing units turn out to be pretty good for the kinds of matrix multiplications that are the basis of neural networks. 
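As a hedged sketch of how a library like Polars helps with some of this: its lazy API builds a query plan before running anything, so it can optimise the plan and spread the work across cores, and with scan_csv it only reads what the query actually needs (file and column names here are invented; it needs "pip install polars"):

```python
import polars as pl   # third-party package: "pip install polars"

# Create a small CSV just so the example is self-contained
pl.DataFrame({
    "sensor": ["a", "b", "a", "b"],
    "value": [0.1, 0.4, 0.2, 0.5],
}).write_csv("sensor_data.csv")

# scan_csv builds a *lazy* query: nothing is read or computed until .collect(),
# so Polars can optimise the whole plan and parallelise it across cores
result = (
    pl.scan_csv("sensor_data.csv")
      .filter(pl.col("value") > 0.15)
      .group_by("sensor")
      .agg(pl.col("value").mean().alias("mean_value"))
      .collect()
)
print(result)
```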

PETER Yeah, that's quite a lot, actually. Probably people won't start with all of them, but I think numpy is probably a good place to start. 

NICK I think most people these days would start with pandas. 

PETER All right. 

NICK I would personally think that pandas gives you quite a lot more, though there are a number of things that I think are not fantastic about pandas. Its null handling has always been very strange, and it has a very complicated indexing scheme that can be confusing. But nevertheless, either pandas or numpy are definitely the right places to start. And as I say, the key thing is to make sure that you don't just use them to store your data, but that you use their functions and methods to process the data, because that's where you get the performance from. 
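A minimal sketch of that point with pandas (column names and values are invented): operations expressed on whole columns run in the library's compiled code, rather than in a Python loop.

```python
import pandas as pd

# A data frame: a set of named columns, rather like a small database table
df = pd.DataFrame({
    "sample": ["A1", "A2", "A3"],
    "temperature": [21.5, 22.0, 21.8],
    "reading": [0.034, 0.036, 0.035],
})

# Column-wise operations are executed by compiled code under the hood
df["reading_mv"] = df["reading"] * 1000
print(df.describe())                    # summary statistics per numeric column
print(df[df["temperature"] > 21.7])     # filter rows by a condition
```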

PETER Thank you very much for your time today, Nick. That was very insightful. And I wish you all the best for the future. 

NICK Okay. Nice talking to you again, Peter. 

So you heard it from Nick: if you use libraries like numpy and pandas, make sure you use them for the data processing part and not just for reading in and storing data. These libraries have highly optimised code, written in C or in other languages like Rust (as in Polars), that sits beneath the veneer of Python that you, the end user, will be dealing with. 
