First steps in data exploration
Ideas on how to start with a new data set
import requests
import os

data_url = 'http://www.berlin.de/sen/finanzen/dokumentendownload/haushalt/haushaltsplan-/haushaltsplan-2018-2019/20180125_dhh18_19.csv'
file = 'berlin_budget.csv'

if not os.path.exists(file):
    with open(file, 'wb') as file:
        resp = requests.get(data_url)
        file.write(resp.content)
For sharing more complex data analyses on fastPages, a step like this might not be very interesting to the reader and would quite likely be hidden by starting the cell with #hide.
Now even with a .csv file locally present, there are a few annoying issues that might prevent somebody from simply importing it. What we would all like to do is simply call pd.read_csv('berlin_budget.csv').
Encoding
This is probably one of the most annoying data aspects that data scientists sometimes have to deal with. Unfortunately, when encoding errors occur one has to deal with them, and manually inspecting the data files in a text editor or Excel generally does not help with loading them into Python. In practical terms most files today are encoded in what is called UTF-8, and that is the standard that Python applications assume. The good thing is that most functions that deal with files, like open or pd.read_csv, let you specify the encoding. Unfortunately it can be overwhelming to figure out what to pick when confronted with this issue for the first time. Legacy Windows applications, which have been and probably still are present in many public institutions and agencies, used different encodings. Most commonly (for western files) an encoding called CP1252 was used. Therefore we will also specify this encoding when reading in the data.
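If the encoding is not documented anywhere, one pragmatic option is to simply try a few common candidates. A minimal sketch (the candidate list below is just an assumption, not an exhaustive set):

# Try a few common encodings and report the first one that can decode the file.
# Note: latin-1 never fails, because every byte sequence is valid latin-1.
for enc in ['utf-8', 'cp1252', 'latin-1']:
    try:
        with open('berlin_budget.csv', encoding=enc) as f:
            f.read()
        print(f'{enc} works')
        break
    except UnicodeDecodeError:
        print(f'{enc} does not work')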
Separator
The next issue with this file is the separator, which is actually not ,. In the best case one can still read in the file and only realize this later on. Sometimes the read can fail outright, however, as is the case with our budget data set. In such a case the file either needs to be inspected to find the right separator, or common separators can simply be tried (for instance [',',';','\t']). In this case ; lets us load the data, which is a typical separator for .csv files in Germany.
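A rough sketch of that trial-and-error approach could look like this (only a few rows are read per attempt; the candidate list is again an assumption):

import pandas as pd

# Read a small preview with each candidate separator and keep the first one
# that yields more than one column.
for sep in [',', ';', '\t']:
    try:
        preview = pd.read_csv('berlin_budget.csv', sep=sep, encoding='CP1252', nrows=5)
    except pd.errors.ParserError:
        continue
    if preview.shape[1] > 1:
        print(f'separator {sep!r} yields {preview.shape[1]} columns')
        break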
Number formatting
While being annoying as well, number formatting usually does not prevent one from loading data, it simply makes manipulating it more difficult. Here we again set the typical German values decimal = ',' and thousands = '.'.
import pandas as pd
berlin_budget_raw = pd.read_csv('berlin_budget.csv', sep=';', decimal=',', thousands='.', encoding='CP1252')
First steps with raw data
Now that we have imported the raw data, we'll go through some typical starting points. These are:
- checking the shape
- looking at example rows
- looking at the data types
- checking for missing data
- checking the different values of discrete variables
Shape
A first helpful step is usually to look at the shape of the data with berlin_budget_raw.shape. For this data set it shows that we have 20828 rows and 25 columns. This means that there are more rows than we can look at manually and more columns than we can conveniently look at. This is often the case, and the data was chosen as an example of how to deal with this. The strategy for the rows is to use functions that help us analyze the entire content to get an impression. The strategy for the columns is to use any additional information we find to exclude as many columns as possible from initial analyses, to make the data easier to handle.
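As with the checks below, this is a single call (the output shown in the comment is the one reported above):

berlin_budget_raw.shape  # (20828, 25)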
berlin_budget_raw.head()
The first 5 rows show many numeric and string columns, and indicate that string columns seem to come paired with similarly named numeric columns.
berlin_budget_raw.dtypes
This shows again that many columns with similar names come as a float column together with an object (string) column. We can also see that there are two columns containing €, which likely contain the actual budget data. This already suggests the following strategy for reducing the columns: for each pair of columns that seem to belong together, we might be able to drop one of the two for initial analyses. For the two budget columns, it is probably a good idea to start with only one of them.
berlin_budget_raw.isna().sum()
This shows that there is exactly one missing value in each column, which is not too bad. Additionally it suggests that there is probably one empty row, rather than randomly missing data. And true enough, if we check berlin_budget_raw.tail() we see that every value in the last row is missing. Such cases are easy to deal with, because we can just drop that row.
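One minimal way to do that is pandas' dropna with how='all', which only removes rows in which every value is missing (the actual dropping happens as part of the simplification below):

# Returns a new frame without the completely empty last row.
berlin_budget_raw.dropna(how='all')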
berlin_budget_raw.nunique().sort_values()
There is a lot of interesting information here. First of all, Titelart only has two different values. For budget data this means that it is a good candidate to indicate whether something is an expense or a funding source. Furthermore, many of the categories with similar names have very similar distinct value counts, supporting our earlier idea of only using one of them in first analyses. We also see that some categories have only a small number of different values. These are well suited for initial analyses, because they keep things simple and allow an easy inspection of aggregated data, as will be illustrated below.
pd.concat can help to summarize these overviews in a single table.
#collapse-output
pd.concat(
    [
        berlin_budget_raw.dtypes.rename('dtypes'),
        berlin_budget_raw.isna().sum().rename('missing'),
        berlin_budget_raw.nunique().rename('distinct_values')
    ],
    axis=1
).sort_values('distinct_values')
Simplified Data
Based on the analyses above, we create a simplified data set as follows:
- We only keep categorical columns with a small number of values (<=13)
- We only keep the string version of each categorical column
- We only keep the 2018 budget column
- We drop the missing data
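The cell performing these steps is kept short here, but a rough sketch under the assumptions above (the <=13 threshold, keeping the object-typed columns, and the Ansatz 2018 in € budget column) could look like this; the exact selection in the original cell may differ slightly:

# Categorical candidates: string (object) columns with at most 13 distinct values.
nunique = berlin_budget_raw.nunique()
categorical_cols = [
    col for col in nunique[nunique <= 13].index
    if berlin_budget_raw[col].dtype == object
]

berlin_budget = (
    berlin_budget_raw[categorical_cols + ['Ansatz 2018 in €']]  # keep the 2018 budget column
    .dropna(how='all')                                          # drop the completely empty row
)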
We store the result in berlin_budget and have a look at some samples.
berlin_budget.sample(5)
This already simplified the data a lot and brought it down from 25 to 6 columns! At this point one can more easily get started with more concrete analyses. In this post we restrict ourselves to two simple examples: a consistency check regarding the budget's spending and funding, and a simple overview plot of the spending.
berlin_budget.groupby('Titelart')['Ansatz 2018 in €'].sum()
This works out. With a little bit of domain knowledge about Berlin itself we realize that the column Bereichsbezeichnung mostly contains the different districts of Berlin. So a reasonable follow-up check is to do the same analysis on the district level. This also works out.
berlin_budget.groupby(['Bereichsbezeichnung','Titelart'])['Ansatz 2018 in €'].sum().unstack()
First bar plots
A quantity's dependence on a few categories can be nicely visualized using bar plots. This is another advantage of starting with columns containing few distinct values when analyzing a new data set. As an illustration we pick Hauptfunktionsbezeichnung and plot spending against it. Often, two simple tricks help with the visualization in these cases:
- The use of horizontal bar charts, because it makes category names easier to read.
- Sorting the values before plotting them, which makes everything easier to compare.
In our example we get¹:
(
    berlin_budget
    .query('Titelart == "Ausgabetitel"')
    .groupby('Hauptfunktionsbezeichnung')['Ansatz 2018 in €'].sum()
    .sort_values()
    .plot.barh()
)
Conclusion
This article gave a concrete example of how to start tackling a real-life data set. We primarily showed how to analyze and, most importantly, reduce the available data to get a manageable starting point. This often boils down to identifying a meaningful subset of columns that are suited for initial analyses and plots. This is particularly helpful in the absence of documentation or good domain knowledge, and we hope it will help readers get started with data that interests them.
1. We also used a matplotlib style for prettier output inside the post.↩