Reading data from files into pandas dataframes is largely straightforward: choose the right input function for the file type, pass in the filepath, and add a few parameters to configure how the data should be read from the file.
For a csv file, that might look something like this:
import pandas as pd

df = pd.read_csv("filename.csv")
If this were a tab-delimited file, we could override the default delimiter param (","):
df = pd.read_csv("filename.csv", delimiter="\t")
If there were some non-data values in the footer of the file (say the last 3 rows), we could add a param to handle that as well:
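One way to handle that is the skipfooter parameter (note that skipfooter requires the slower Python parsing engine, so we specify it explicitly):

# skip the last 3 rows of the file; skipfooter only works with the Python engine
df = pd.read_csv("filename.csv", delimiter="\t", skipfooter=3, engine="python")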
Depending on the configuration needs, the list of optional parameters can get quite long, as can be seen from the length of the read_csv function signature.
When this falls apart
I work with a lot of file-based tabular data in my day-to-day work. Most of it consists of research files from various state departments of education as well as NCES. Parsing each file individually this way would result in a massive number of scripts, each customizing the parameters of various pandas input functions. One way to mitigate this burden is to instead create a configuration file for each data file that specifies how it should be parsed, and pass those into a single script that handles interpreting the configuration.
Getting an example file
To start, let's download the 2021-2022 national directory of schools from the NCES Common Core of Data (CCD) website to use as a demonstration of how this might work. You can download the file directly here.
Once downloaded, we can open it in pandas:
import pandas as pd

df = pd.read_csv("ccd_sch_029_2122_w_1a_071722.csv")
df.head()
/var/folders/q8/dgbdr1_566nf3r38kwhrw0zh0000gn/T/ipykernel_25748/4258698622.py:3: DtypeWarning: Columns (14,15,21,22,39,41,42) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(csv_data)
   SCHOOL_YEAR  FIPST  STATENAME  ST  SCH_NAME  LEA_NAME  STATE_AGENCY_NO  UNION  ST_LEAID  LEAID  ...  G_10_OFFERED  G_11_OFFERED  G_12_OFFERED  G_13_OFFERED  G_UG_OFFERED  G_AE_OFFERED  GSLO  GSHI  LEVEL  IGOFFERED
0  2021-2022  1  ALABAMA  AL  Albertville Middle School  Albertville City  1  NaN  AL-101  100005  ...  No  No  No  No  No  No  07  08  Middle  As reported
1  2021-2022  1  ALABAMA  AL  Albertville High School  Albertville City  1  NaN  AL-101  100005  ...  Yes  Yes  Yes  No  No  No  09  12  High  As reported
2  2021-2022  1  ALABAMA  AL  Albertville Intermediate School  Albertville City  1  NaN  AL-101  100005  ...  No  No  No  No  No  No  05  06  Middle  As reported
3  2021-2022  1  ALABAMA  AL  Albertville Elementary School  Albertville City  1  NaN  AL-101  100005  ...  No  No  No  No  No  No  03  04  Elementary  As reported
4  2021-2022  1  ALABAMA  AL  Albertville Kindergarten and PreK  Albertville City  1  NaN  AL-101  100005  ...  No  No  No  No  No  No  PK  KG  Elementary  As reported

5 rows × 65 columns
Two things are apparent on first inspection:
- a few of the columns have mixed data types that we will need to handle
- there are many columns (it's a very wide file) with obscure naming conventions that we will need to interpret
Luckily, there is a companion documentation file available from NCES to help us get our heads around the structure of this data.
We can override the data type inference that pandas is doing and specify either column-specific data types or simply a default across all columns. In this case, let's treat everything as a string. As this is a directory file, much of the data is descriptive, and even values like numeric IDs that could be treated as integers will likely benefit from being handled as strings instead, to preserve leading zeros that may be important for joining to other data later.
df = pd.read_csv("ccd_sch_029_2122_w_1a_071722.csv", dtype=str)
df.head()
   SCHOOL_YEAR  FIPST  STATENAME  ST  SCH_NAME  LEA_NAME  STATE_AGENCY_NO  UNION  ST_LEAID  LEAID  ...  G_10_OFFERED  G_11_OFFERED  G_12_OFFERED  G_13_OFFERED  G_UG_OFFERED  G_AE_OFFERED  GSLO  GSHI  LEVEL  IGOFFERED
0  2021-2022  01  ALABAMA  AL  Albertville Middle School  Albertville City  01  NaN  AL-101  0100005  ...  No  No  No  No  No  No  07  08  Middle  As reported
1  2021-2022  01  ALABAMA  AL  Albertville High School  Albertville City  01  NaN  AL-101  0100005  ...  Yes  Yes  Yes  No  No  No  09  12  High  As reported
2  2021-2022  01  ALABAMA  AL  Albertville Intermediate School  Albertville City  01  NaN  AL-101  0100005  ...  No  No  No  No  No  No  05  06  Middle  As reported
3  2021-2022  01  ALABAMA  AL  Albertville Elementary School  Albertville City  01  NaN  AL-101  0100005  ...  No  No  No  No  No  No  03  04  Elementary  As reported
4  2021-2022  01  ALABAMA  AL  Albertville Kindergarten and PreK  Albertville City  01  NaN  AL-101  0100005  ...  No  No  No  No  No  No  PK  KG  Elementary  As reported

5 rows × 65 columns
That has gotten rid of the DtypeWarning, but we still have these column names to deal with. Let's create a dictionary mapping the original names to what we want them to be instead.
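A truncated sketch of such a mapping might look like this (the renamed values here are just illustrative, not an official naming scheme):

# map a handful of the original NCES column names to friendlier ones
column_map = {
    "SCH_NAME": "school_name",
    "LEA_NAME": "district_name",
    "ST": "state_abbreviation",
    "GSLO": "grade_span_low",
    "GSHI": "grade_span_high",
}

df = df.rename(columns=column_map)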
Well, for a single file like this, nothing more is really necessary. If this were the only file we needed for analysis, then yes, we could stop here. But imagine that instead you had hundreds if not thousands of similar files of different file types (csv, excel, txt, etc.), each with its own unique set of challenges such as footers or headers to skip, tabs to combine, and columns to rename. At that point, writing a separate python script for each file and running them as needed would become unwieldy rather quickly.
Introducing Yaml: Json’s cuter cousin
Our column_map dictionary looks a lot like a json object, and we could likely specify our configurations as a series of json files or even python files containing dictionaries. But this syntax is full of unnecessary punctuation.
Compare this list in python/json with a similar one in yaml:
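As a sketch, here are the first few column names from this file written both ways:

# python / json
{"columns": ["SCHOOL_YEAR", "FIPST", "STATENAME", "ST", "SCH_NAME"]}

# yaml
columns:
  - SCHOOL_YEAR
  - FIPST
  - STATENAME
  - ST
  - SCH_NAME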
Admittedly, not a HUGE difference, but one is clearly a bit more human-readable, and the more complex the configuration gets, the more obvious this becomes. The trade-off is that a native python dictionary or even a json string could be parsed using only the python standard library, while yaml support is currently not baked into python. Thankfully, there is a simple package that can be used: pyyaml.
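Here is a sketch of what a full config for our NCES directory file might look like, saved as nces_school_directory_2022.yml (the renamed column values and the name and file_type keys are our own convention, not pandas parameters):

# nces_school_directory_2022.yml
name: nces_school_directory
file_type: csv
filepath_or_buffer: ccd_sch_029_2122_w_1a_071722.csv
dtype: str
columns:
  - school_year
  - fips_state_code
  - state_name
  - state_abbreviation
  # ... one entry for each of the 65 columns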
To start, we’ll need to load the data from this yaml file into something Python can work with:
from yaml import safe_load

with open("nces_school_directory_2022.yml") as f:
    config = safe_load(f)
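Assuming the sketch config above, a quick check shows what we now have:

print(type(config))         # <class 'dict'>
print(config["file_type"])  # csv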
You can see that this has converted it into a native python dictionary. Less for us to write, easier to read, but just as functional as writing it as a dictionary to begin with.
To use this dictionary as the parameters for the pandas read_csv function, we can take advantage of the unpacking operator ** to convert our dictionary into a set of keyword arguments (often referred to as kwargs).
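Unpacking the whole config directly looks something like this:

# unpack every key/value pair in the config as keyword arguments
df = pd.read_csv(**config)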
TypeError: read_csv() got an unexpected keyword argument 'name'
But wait, there’s more
Wait! What’s this? One of the values in our yaml config isn’t an actual parameter/keyword for the read_csv function. While this adds some additional overhead to our usage of yaml, it provides some useful functionality. We can specify additional variables in the config that can drive behavior in our script beyond just the function parameters.
For instance, the file_type variable could be used to swap between use of read_csv when file_type: csv is specified or read_excel when file_type: excel is provided instead. Or the name variable could be used to determine the table name in our database where this data will be loaded.
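As a rough sketch of that first idea (we will formalize it later in a configuration class):

# pick the pandas reader function based on the file_type value in the config
readers = {"csv": pd.read_csv, "excel": pd.read_excel}
read_func = readers[config["file_type"]]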
In order to distinguish these extra variables from our function parameters, we need to know which keywords are specific to the function. We could specify these explicitly, OR we could infer them from the function signature using the standard library's inspect module.
Now we can compare the key/value pairs in our yaml config with any that match the function signature of pd.read_csv.
import inspect

def get_kwargs(config):
    params = inspect.signature(pd.read_csv).parameters.keys()
    return {k: v for k, v in config.items() if k in params}
This will create a new dictionary with just the key/value pairs (k: v) that match the function signature. At this point, we can now pass those into our input function.
df = pd.read_csv(**get_kwargs(config))
df.head()
   SCHOOL_YEAR  FIPST  STATENAME  ST  SCH_NAME  LEA_NAME  STATE_AGENCY_NO  UNION  ST_LEAID  LEAID  ...  G_10_OFFERED  G_11_OFFERED  G_12_OFFERED  G_13_OFFERED  G_UG_OFFERED  G_AE_OFFERED  GSLO  GSHI  LEVEL  IGOFFERED
0  2021-2022  01  ALABAMA  AL  Albertville Middle School  Albertville City  01  NaN  AL-101  0100005  ...  No  No  No  No  No  No  07  08  Middle  As reported
1  2021-2022  01  ALABAMA  AL  Albertville High School  Albertville City  01  NaN  AL-101  0100005  ...  Yes  Yes  Yes  No  No  No  09  12  High  As reported
2  2021-2022  01  ALABAMA  AL  Albertville Intermediate School  Albertville City  01  NaN  AL-101  0100005  ...  No  No  No  No  No  No  05  06  Middle  As reported
3  2021-2022  01  ALABAMA  AL  Albertville Elementary School  Albertville City  01  NaN  AL-101  0100005  ...  No  No  No  No  No  No  03  04  Elementary  As reported
4  2021-2022  01  ALABAMA  AL  Albertville Kindergarten and PreK  Albertville City  01  NaN  AL-101  0100005  ...  No  No  No  No  No  No  PK  KG  Elementary  As reported

5 rows × 65 columns
Very close. This hasn't handled the column renaming, though. That's because in pd.read_csv the column names are set with the names param. We could just call our columns list names in the config, or we could pass it explicitly instead (if, say, we needed to call it columns for some other functionality in our script).
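Passing it explicitly might look something like this (assuming the columns list in the config contains all 65 renamed columns):

# header=0 tells pandas the file's own header row should be replaced by these
# names rather than read in as data
df = pd.read_csv(**get_kwargs(config), names=config["columns"], header=0)
df.head()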
I like to simplify all of this by creating a custom configuration class. That enables us to use class attributes for accessing the key/values and can wrap in the yaml reading as well.
class Config:
    def __init__(self, yaml_file_path):
        """
        Extract configuration from yaml file and set the values as class
        attributes or kwargs based on the specified function signature.
        """
        with open(yaml_file_path) as f:
            self.yaml = safe_load(f)
        self.set_attrs()

    def set_attrs(self):
        """
        Create attribute called kwargs that returns dictionary of all key/value
        pairs that match the given function signature params. Set any
        non-matching key as a class attribute.
        """
        params = self.get_signature()
        for k, v in self.yaml.items():
            if k not in params:
                setattr(self, k, v)
        self.kwargs = {k: v for k, v in self.yaml.items() if k in params}

    def get_signature(self):
        """
        Select the corresponding function signature based on the specified
        file_type of either csv or excel.

        Note: this could be expanded to handle more file_types, such as:
        - SPSS
        - JSON
        - Parquet
        - XML
        """
        signatures = {
            "csv": inspect.signature(pd.read_csv).parameters.keys(),
            "excel": inspect.signature(pd.read_excel).parameters.keys(),
        }
        return signatures.get(self.yaml["file_type"])
With this class we can now create a Config object from a yaml file and pass that to our pandas functions. It also handles whether the file_type expects read_csv or read_excel.
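Putting it together might look something like this (a sketch, using the same example config as above):

config = Config("nces_school_directory_2022.yml")

# keys that are not pandas parameters become attributes (config.name,
# config.file_type, config.columns); config.kwargs holds only the keys that
# match the chosen reader's signature
read_func = pd.read_csv if config.file_type == "csv" else pd.read_excel
df = read_func(**config.kwargs)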
There is a lot more parsing functionality that could be extended from this basic example using more configuration variables and/or expanding to different file types.
We’ve used a variation of this approach to great effect to parse hundreds of research files from most of the states in the U.S. as well as a variety of other public research data like NCES. This has even been used to parse Google Spreadsheet data directly in memory.
With this data now transformed into a pandas dataframe, it can be easily loaded into a data warehouse or analyzed in memory.