Configuring Pandas with Yaml Files

tutorial
Published

August 19, 2023

Reading data from files into pandas dataframes is largely straightforward: choose the right input function for the file type, pass in the filepath, and add a few parameters to configure how the data is read from the file.

For a csv file, that might look something like this:

import pandas as pd

df = pd.read_csv("filename.csv")

If this were a tab-delimited file, we could override the default delimiter (","):

df = pd.read_csv("filename.csv", delimiter="\t")

If there were some non-data values in the footer of the file (say, the last 3 rows), we could add a param to handle that as well. Note that skipfooter is only supported by the slower Python parsing engine, so we specify it explicitly to avoid the fallback warning:

df = pd.read_csv("filename.csv", delimiter="\t", skipfooter=3, engine="python")

Depending on the number of different configuration needs, the list of optional parameters can get quite long, as the sheer length of the read_csv function signature attests.
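
To get a sense of just how long, we can count the parameters directly (a quick aside using the standard library's inspect module; the exact count varies by pandas version, though the signature we inspect later in this post lists 49):

import inspect

import pandas as pd

# Count the keyword parameters read_csv accepts.
# Recent pandas versions have around fifty.
print(len(inspect.signature(pd.read_csv).parameters))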

When this falls apart

I work with a lot of file-based tabular data in my day-to-day work. Most of it is research files from various state departments of education as well as NCES. Individually parsing each file this way would result in a massive number of scripts, each customizing the parameters of various pandas input functions. Instead, one way to mitigate this burden is to create a configuration file for each data file that specifies how it should be parsed, and pass those into a single script that handles interpreting the configuration.

Getting an example file

To start, let’s download the 2021-2022 national directory of schools from the NCES Common Core of Data (CCD) website to use as a demonstration of how this might work. You can download the file directly here.

Once downloaded, we can open it in pandas:

import pandas as pd

df = pd.read_csv("ccd_sch_029_2122_w_1a_071722.csv")
df.head()
/var/folders/q8/dgbdr1_566nf3r38kwhrw0zh0000gn/T/ipykernel_25748/4258698622.py:3: DtypeWarning: Columns (14,15,21,22,39,41,42) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("ccd_sch_029_2122_w_1a_071722.csv")
SCHOOL_YEAR FIPST STATENAME ST SCH_NAME LEA_NAME STATE_AGENCY_NO UNION ST_LEAID LEAID ... G_10_OFFERED G_11_OFFERED G_12_OFFERED G_13_OFFERED G_UG_OFFERED G_AE_OFFERED GSLO GSHI LEVEL IGOFFERED
0 2021-2022 1 ALABAMA AL Albertville Middle School Albertville City 1 NaN AL-101 100005 ... No No No No No No 07 08 Middle As reported
1 2021-2022 1 ALABAMA AL Albertville High School Albertville City 1 NaN AL-101 100005 ... Yes Yes Yes No No No 09 12 High As reported
2 2021-2022 1 ALABAMA AL Albertville Intermediate School Albertville City 1 NaN AL-101 100005 ... No No No No No No 05 06 Middle As reported
3 2021-2022 1 ALABAMA AL Albertville Elementary School Albertville City 1 NaN AL-101 100005 ... No No No No No No 03 04 Elementary As reported
4 2021-2022 1 ALABAMA AL Albertville Kindergarten and PreK Albertville City 1 NaN AL-101 100005 ... No No No No No No PK KG Elementary As reported

5 rows × 65 columns

Two things are apparent on first inspection:

  1. a few of the columns have mixed data types we will need to handle
  2. there are many columns (it’s a very wide file) with obscure naming conventions we will need to interpret

Luckily, to get our heads around the structure of this data, there is a companion documentation file available from NCES as well.

The full list of column names is:

df.columns
Index(['SCHOOL_YEAR', 'FIPST', 'STATENAME', 'ST', 'SCH_NAME', 'LEA_NAME',
       'STATE_AGENCY_NO', 'UNION', 'ST_LEAID', 'LEAID', 'ST_SCHID', 'NCESSCH',
       'SCHID', 'MSTREET1', 'MSTREET2', 'MSTREET3', 'MCITY', 'MSTATE', 'MZIP',
       'MZIP4', 'LSTREET1', 'LSTREET2', 'LSTREET3', 'LCITY', 'LSTATE', 'LZIP',
       'LZIP4', 'PHONE', 'WEBSITE', 'SY_STATUS', 'SY_STATUS_TEXT',
       'UPDATED_STATUS', 'UPDATED_STATUS_TEXT', 'EFFECTIVE_DATE',
       'SCH_TYPE_TEXT', 'SCH_TYPE', 'RECON_STATUS', 'OUT_OF_STATE_FLAG',
       'CHARTER_TEXT', 'CHARTAUTH1', 'CHARTAUTHN1', 'CHARTAUTH2',
       'CHARTAUTHN2', 'NOGRADES', 'G_PK_OFFERED', 'G_KG_OFFERED',
       'G_1_OFFERED', 'G_2_OFFERED', 'G_3_OFFERED', 'G_4_OFFERED',
       'G_5_OFFERED', 'G_6_OFFERED', 'G_7_OFFERED', 'G_8_OFFERED',
       'G_9_OFFERED', 'G_10_OFFERED', 'G_11_OFFERED', 'G_12_OFFERED',
       'G_13_OFFERED', 'G_UG_OFFERED', 'G_AE_OFFERED', 'GSLO', 'GSHI', 'LEVEL',
       'IGOFFERED'],
      dtype='object')

We can override pandas’ data type inference and specify either column-specific data types or simply a default across all columns. In this case, let’s treat everything as a string. As this is a directory file, much of the data is descriptive, and even values like numeric IDs that could be treated as integers will likely benefit from being handled as strings instead, to preserve leading zeros that may be important for joining to other data later.

df = pd.read_csv("ccd_sch_029_2122_w_1a_071722.csv", dtype=str)
df.head()
SCHOOL_YEAR FIPST STATENAME ST SCH_NAME LEA_NAME STATE_AGENCY_NO UNION ST_LEAID LEAID ... G_10_OFFERED G_11_OFFERED G_12_OFFERED G_13_OFFERED G_UG_OFFERED G_AE_OFFERED GSLO GSHI LEVEL IGOFFERED
0 2021-2022 01 ALABAMA AL Albertville Middle School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 07 08 Middle As reported
1 2021-2022 01 ALABAMA AL Albertville High School Albertville City 01 NaN AL-101 0100005 ... Yes Yes Yes No No No 09 12 High As reported
2 2021-2022 01 ALABAMA AL Albertville Intermediate School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 05 06 Middle As reported
3 2021-2022 01 ALABAMA AL Albertville Elementary School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 03 04 Elementary As reported
4 2021-2022 01 ALABAMA AL Albertville Kindergarten and PreK Albertville City 01 NaN AL-101 0100005 ... No No No No No No PK KG Elementary As reported

5 rows × 65 columns
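
As an aside, if only a handful of columns needed special treatment, dtype also accepts a per-column mapping instead of a single default (a minimal sketch using a few of this file’s ID columns):

# Only the listed columns are overridden; the rest still go
# through pandas' normal type inference.
df = pd.read_csv(
    "ccd_sch_029_2122_w_1a_071722.csv",
    dtype={"FIPST": str, "LEAID": str, "NCESSCH": str},
)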

Treating everything as a string has gotten rid of the DtypeWarning, but we still have these column names to deal with. Let’s create a dictionary mapping the original names to what we want them to be instead.

column_renames = {
    "SCHOOL_YEAR": "school_year", 
    "FIPST": "state_fips_code", 
    "STATENAME": "state_name", 
    "ST": "state_abbrev", 
    "SCH_NAME": "school_name", 
    "LEA_NAME": "lea_name",
    "STATE_AGENCY_NO": "state_agency_id", 
    "UNION": "supervisory_union_id", 
    "ST_LEAID": "state_lea_id", 
    "LEAID": "nces_lea_id", 
    "ST_SCHID": "state_school_id", 
    "NCESSCH": "nces_school_id",
    "SCHID": "school_id", 
    "MSTREET1": "street1_mailing_address", 
    "MSTREET2": "street2_mailing_address", 
    "MSTREET3": "street3_mailing_address", 
    "MCITY": "city_mailing_address", 
    "MSTATE": "state_mailing_address", 
    "MZIP": "zipcode_mailing_address",
    "MZIP4": "secondary_zipcode_mailing_address", 
    "LSTREET1": "street1_location", 
    "LSTREET2": "street2_location", 
    "LSTREET3": "street3_location", 
    "LCITY": "city_location", 
    "LSTATE": "state_location", 
    "LZIP": "zipcode_location",
    "LZIP4": "secondary_zipcode_location", 
    "PHONE": "phone_number", 
    "WEBSITE": "website_url", 
    "SY_STATUS": "start_of_year_status_code", 
    "SY_STATUS_TEXT": "start_of_year_status_description",
    "UPDATED_STATUS": "updated_status_code", 
    "UPDATED_STATUS_TEXT": "updated_status_description", 
    "EFFECTIVE_DATE": "updated_status_effective_date",
    "SCH_TYPE_TEXT": "school_type_description", 
    "SCH_TYPE": "school_type_code",
    "RECON_STATUS": "is_reconstituted", 
    "OUT_OF_STATE_FLAG": "is_out_of_state_location",
    "CHARTER_TEXT": "is_charter_school", 
    "CHARTAUTH1": "charter_authorizer_id1", 
    "CHARTAUTHN1": "charter_authorizer_name1", 
    "CHARTAUTH2": "charter_authorizer_id2",
    "CHARTAUTHN2": "charter_authorizer_name2", 
    "NOGRADES": "no_grades_offered", 
    "G_PK_OFFERED": "grade_pk_offered", 
    "G_KG_OFFERED": "grade_k_offered",
    "G_1_OFFERED": "grade_1_offered", 
    "G_2_OFFERED": "grade_2_offered", 
    "G_3_OFFERED": "grade_3_offered", 
    "G_4_OFFERED": "grade_4_offered",
    "G_5_OFFERED": "grade_5_offered", 
    "G_6_OFFERED": "grade_6_offered", 
    "G_7_OFFERED": "grade_7_offered", 
    "G_8_OFFERED": "grade_8_offered",
    "G_9_OFFERED": "grade_9_offered", 
    "G_10_OFFERED": "grade_10_offered", 
    "G_11_OFFERED": "grade_11_offered", 
    "G_12_OFFERED": "grade_12_offered",
    "G_13_OFFERED": "grade_13_offered", 
    "G_UG_OFFERED": "ungraded_offered", 
    "G_AE_OFFERED": "adult_education_offered", 
    "GSLO": "lowest_grade_offered", 
    "GSHI": "highest_grade_offered", 
    "LEVEL": "school_level",
    "IGOFFERED": "any_grades_offered_field_adjusted",
}

A bit tedious, but at this point we could rename our columns and be done.

df.rename(columns=column_renames, inplace=True)
df.head()
school_year state_fips_code state_name state_abbrev school_name lea_name state_agency_id supervisory_union_id state_lea_id nces_lea_id ... grade_10_offered grade_11_offered grade_12_offered grade_13_offered ungraded_offered adult_education_offered lowest_grade_offered highest_grade_offered school_level any_grades_offered_field_adjusted
0 2021-2022 01 ALABAMA AL Albertville Middle School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 07 08 Middle As reported
1 2021-2022 01 ALABAMA AL Albertville High School Albertville City 01 NaN AL-101 0100005 ... Yes Yes Yes No No No 09 12 High As reported
2 2021-2022 01 ALABAMA AL Albertville Intermediate School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 05 06 Middle As reported
3 2021-2022 01 ALABAMA AL Albertville Elementary School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 03 04 Elementary As reported
4 2021-2022 01 ALABAMA AL Albertville Kindergarten and PreK Albertville City 01 NaN AL-101 0100005 ... No No No No No No PK KG Elementary As reported

5 rows × 65 columns

So why the need for a special yaml config file?

Well, in this singular case it’s not necessary at all. If this were the one file we needed for analysis, then yes, we could stop here. But imagine that instead you had hundreds if not thousands of similar files of different file types (csv, excel, txt, etc.), each with its own unique set of challenges such as footers or headers to skip, tabs to combine, and columns to rename. At that point, writing a separate python script for each file and running them as needed would become unwieldy rather quickly.

Introducing Yaml: Json’s cuter cousin

Our column_renames dictionary looks a lot like a json object, and we could likely specify our configurations as a series of json files or even python files with dictionaries. But this syntax is full of unnecessary punctuation.

Compare this list in python/json with a similar one in yaml:

Python or Json:

columns = [
    "school_year", 
    "state_fips_code", 
    "state_name", 
    "state_abbrev", 
    "school_name", 
    "lea_name",
    "state_agency_id", 
    "supervisory_union_id", 
    "state_lea_id", 
    "nces_lea_id", 
    "state_school_id", 
    "nces_school_id",
    "school_id", 
]

Yaml:

columns:
  - school_year 
  - state_fips_code 
  - state_name 
  - state_abbrev 
  - school_name 
  - lea_name
  - state_agency_id 
  - supervisory_union_id 
  - state_lea_id 
  - nces_lea_id 
  - state_school_id 
  - nces_school_id
  - school_id

Admittedly, not a HUGE difference, but one is clearly a bit more human-readable, and the more complex the configuration gets, the more obvious this becomes. The trade-off is that a native python dictionary or even a json string can be parsed using only the python standard library, while yaml support is not currently baked into python. Thankfully, there is a simple package that can be used: pyyaml.
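
To make the trade-off concrete, here is the same two-item list parsed both ways; json support comes from the standard library, while yaml needs the pyyaml package installed below:

import json
import yaml

# Two syntaxes, identical result: both parse to the same python list.
from_json = json.loads('["school_year", "state_fips_code"]')
from_yaml = yaml.safe_load("- school_year\n- state_fips_code")
assert from_json == from_yaml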

We’ll install that with:

pip install pyyaml

or if you’re like me and prefer pipenv:

pipenv install pyyaml

Then let’s create a yaml config file for our nces directory file.

name: nces_school_directory_2022
filepath_or_buffer: ccd_sch_029_2122_w_1a_071722.csv
file_type: csv
dtype: str
columns:
  - school_year 
  - state_fips_code 
  - state_name 
  - state_abbrev 
  - school_name 
  - lea_name
  - state_agency_id 
  - supervisory_union_id 
  - state_lea_id 
  - nces_lea_id 
  - state_school_id 
  - nces_school_id
  - school_id 
  - street1_mailing_address 
  - street2_mailing_address 
  - street3_mailing_address 
  - city_mailing_address 
  - state_mailing_address 
  - zipcode_mailing_address
  - secondary_zipcode_mailing_address 
  - street1_location 
  - street2_location 
  - street3_location 
  - city_location 
  - state_location 
  - zipcode_location
  - secondary_zipcode_location 
  - phone_number 
  - website_url 
  - start_of_year_status_code 
  - start_of_year_status_description
  - updated_status_code 
  - updated_status_description 
  - updated_status_effective_date
  - school_type_description 
  - school_type_code
  - is_reconstituted 
  - is_out_of_state_location
  - is_charter_school 
  - charter_authorizer_id1 
  - charter_authorizer_name1 
  - charter_authorizer_id2
  - charter_authorizer_name2 
  - no_grades_offered 
  - grade_pk_offered 
  - grade_k_offered
  - grade_1_offered 
  - grade_2_offered 
  - grade_3_offered 
  - grade_4_offered
  - grade_5_offered 
  - grade_6_offered 
  - grade_7_offered 
  - grade_8_offered
  - grade_9_offered 
  - grade_10_offered 
  - grade_11_offered 
  - grade_12_offered
  - grade_13_offered 
  - ungraded_offered 
  - adult_education_offered 
  - lowest_grade_offered 
  - highest_grade_offered 
  - school_level
  - any_grades_offered_field_adjusted

Reading Yaml

To start, we’ll need to load the data from this yaml file into something Python can work with:

from yaml import safe_load

with open("nces_school_directory_2022.yml") as f:
    config = safe_load(f)

You can see that this has converted it to a native python dictionary. Less for us to write, easier to read, but still as functional as writing it as a dictionary to begin with.

config
{'name': 'nces_school_directory_2022',
 'filepath_or_buffer': 'ccd_sch_029_2122_w_1a_071722.csv',
 'file_type': 'csv',
 'dtype': 'str',
 'columns': ['school_year',
  'state_fips_code',
  'state_name',
  'state_abbrev',
  'school_name',
  'lea_name',
  'state_agency_id',
  'supervisory_union_id',
  'state_lea_id',
  'nces_lea_id',
  'state_school_id',
  'nces_school_id',
  'school_id',
  'street1_mailing_address',
  'street2_mailing_address',
  'street3_mailing_address',
  'city_mailing_address',
  'state_mailing_address',
  'zipcode_mailing_address',
  'secondary_zipcode_mailing_address',
  'street1_location',
  'street2_location',
  'street3_location',
  'city_location',
  'state_location',
  'zipcode_location',
  'secondary_zipcode_location',
  'phone_number',
  'website_url',
  'start_of_year_status_code',
  'start_of_year_status_description',
  'updated_status_code',
  'updated_status_description',
  'updated_status_effective_date',
  'school_type_description',
  'school_type_code',
  'is_reconstituted',
  'is_out_of_state_location',
  'is_charter_school',
  'charter_authorizer_id1',
  'charter_authorizer_name1',
  'charter_authorizer_id2',
  'charter_authorizer_name2',
  'no_grades_offered',
  'grade_pk_offered',
  'grade_k_offered',
  'grade_1_offered',
  'grade_2_offered',
  'grade_3_offered',
  'grade_4_offered',
  'grade_5_offered',
  'grade_6_offered',
  'grade_7_offered',
  'grade_8_offered',
  'grade_9_offered',
  'grade_10_offered',
  'grade_11_offered',
  'grade_12_offered',
  'grade_13_offered',
  'ungraded_offered',
  'adult_education_offered',
  'lowest_grade_offered',
  'highest_grade_offered',
  'school_level',
  'any_grades_offered_field_adjusted']}

To use this dictionary as the parameters for the pandas read_csv function, we can take advantage of the unpacking operator ** to convert our dictionary into a set of keyword arguments (often referred to as kwargs).

You can learn more about args and kwargs from this helpful Real Python article.
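
As a minimal illustration of what ** does, these two calls are equivalent:

# A dictionary of keyword arguments...
params = {"delimiter": "\t", "dtype": str}
df = pd.read_csv("filename.csv", **params)

# ...unpacks into the same call written out explicitly:
df = pd.read_csv("filename.csv", delimiter="\t", dtype=str)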

df = pd.read_csv(**config)
df.head()
TypeError: read_csv() got an unexpected keyword argument 'name'

But wait, there’s more

Wait! What’s this? One of the keys in our yaml config isn’t an actual parameter/keyword of the read_csv function. While this adds some additional overhead to our usage of yaml, it provides some useful functionality: we can specify additional variables in the config that drive behavior in our script beyond just the function parameters.

For instance, the file_type variable could be used to swap between read_csv when file_type: csv is specified and read_excel when file_type: excel is provided instead. Or the name variable could be used to determine the table name in our database where this data will be loaded.
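
A sketch of that dispatch might look like this (only the two file_types our configs use are handled):

# Hypothetical reader dispatch keyed on the config's file_type.
readers = {
    "csv": pd.read_csv,
    "excel": pd.read_excel,
}
read_func = readers[config["file_type"]]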

In order to distinguish these extra variables from our function parameters, we need to be aware of which keywords are specific to the function. We could specify these explicitly OR we could infer them from the function signature.

import inspect

params = inspect.signature(pd.read_csv).parameters.keys()
params
odict_keys(['filepath_or_buffer', 'sep', 'delimiter', 'header', 'names', 'index_col', 'usecols', 'dtype', 'engine', 'converters', 'true_values', 'false_values', 'skipinitialspace', 'skiprows', 'skipfooter', 'nrows', 'na_values', 'keep_default_na', 'na_filter', 'verbose', 'skip_blank_lines', 'parse_dates', 'infer_datetime_format', 'keep_date_col', 'date_parser', 'date_format', 'dayfirst', 'cache_dates', 'iterator', 'chunksize', 'compression', 'thousands', 'decimal', 'lineterminator', 'quotechar', 'quoting', 'doublequote', 'escapechar', 'comment', 'encoding', 'encoding_errors', 'dialect', 'on_bad_lines', 'delim_whitespace', 'low_memory', 'memory_map', 'float_precision', 'storage_options', 'dtype_backend'])

Now we can filter the key/value pairs in our yaml config down to just those that match the function signature of pd.read_csv.

def get_kwargs(config):
    params = inspect.signature(pd.read_csv).parameters.keys()
    return {k: v for k, v in config.items() if k in params}

This will create a new dictionary with just the key/value pairs (k: v) that match the function signature.
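
For our config, only filepath_or_buffer and dtype survive the filter; name, file_type, and columns are not read_csv parameters, so they are dropped:

get_kwargs(config)
{'filepath_or_buffer': 'ccd_sch_029_2122_w_1a_071722.csv', 'dtype': 'str'}

We can now pass the filtered dictionary into our input function.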

df = pd.read_csv(**get_kwargs(config))
df.head()
SCHOOL_YEAR FIPST STATENAME ST SCH_NAME LEA_NAME STATE_AGENCY_NO UNION ST_LEAID LEAID ... G_10_OFFERED G_11_OFFERED G_12_OFFERED G_13_OFFERED G_UG_OFFERED G_AE_OFFERED GSLO GSHI LEVEL IGOFFERED
0 2021-2022 01 ALABAMA AL Albertville Middle School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 07 08 Middle As reported
1 2021-2022 01 ALABAMA AL Albertville High School Albertville City 01 NaN AL-101 0100005 ... Yes Yes Yes No No No 09 12 High As reported
2 2021-2022 01 ALABAMA AL Albertville Intermediate School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 05 06 Middle As reported
3 2021-2022 01 ALABAMA AL Albertville Elementary School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 03 04 Elementary As reported
4 2021-2022 01 ALABAMA AL Albertville Kindergarten and PreK Albertville City 01 NaN AL-101 0100005 ... No No No No No No PK KG Elementary As reported

5 rows × 65 columns

Very close. This hasn’t handled the column renaming. That’s because in pd.read_csv the column names are set with the names param. We could just call our columns list names in the config, or we could pass it explicitly instead (if, say, we needed to name it columns for some other functionality in our script). One caveat: when names is supplied, we also need header=0 so the file’s original header row is replaced rather than read in as data.

df = pd.read_csv(**get_kwargs(config), names=config["columns"], header=0)
df.head()
school_year state_fips_code state_name state_abbrev school_name lea_name state_agency_id supervisory_union_id state_lea_id nces_lea_id ... grade_10_offered grade_11_offered grade_12_offered grade_13_offered ungraded_offered adult_education_offered lowest_grade_offered highest_grade_offered school_level any_grades_offered_field_adjusted
0 2021-2022 01 ALABAMA AL Albertville Middle School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 07 08 Middle As reported
1 2021-2022 01 ALABAMA AL Albertville High School Albertville City 01 NaN AL-101 0100005 ... Yes Yes Yes No No No 09 12 High As reported
2 2021-2022 01 ALABAMA AL Albertville Intermediate School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 05 06 Middle As reported
3 2021-2022 01 ALABAMA AL Albertville Elementary School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 03 04 Elementary As reported
4 2021-2022 01 ALABAMA AL Albertville Kindergarten and PreK Albertville City 01 NaN AL-101 0100005 ... No No No No No No PK KG Elementary As reported

5 rows × 65 columns

Icing on the cake

I like to simplify all of this by creating a custom configuration class. That enables us to use class attributes for accessing the key/values and lets us wrap in the yaml reading as well.

class Config:
    def __init__(self, yaml_file_path):
        """
        Extract configuration from yaml file and set the values as 
        class attributes or kwargs based on the specified function
        signature.
        """
        with open(yaml_file_path) as f:
            self.yaml = safe_load(f)
        self.set_attrs()
    
    def set_attrs(self):
        """
        Create attribute called kwargs that returns dictionary of all key/value
        pairs that match the given function signature params.

        Set any non-matching key as a class attribute.
        """
        params = self.get_signature()
        for k, v in self.yaml.items():
            if k not in params:
                setattr(self, k, v)
        self.kwargs = {k: v for k, v in self.yaml.items() if k in params}

    def get_signature(self):
        """
        Select the corresponding function signature based on the specified
        file_type of either csv or excel.

        Note: this could be expanded to handle more file_types, such as:
          - SPSS
          - JSON
          - Parquet
          - XML
        """
        signatures = {
            "csv": inspect.signature(pd.read_csv).parameters.keys(),
            "excel": inspect.signature(pd.read_csv).parameters.keys(),
        }
        return signatures.get(self.yaml["file_type"])

With this class we can now create a Config object from a yaml file and pass that to our pandas functions. It also handles whether the file_type expects read_csv or read_excel.

config = Config("nces_school_directory_2022.yml")
df = pd.read_csv(**config.kwargs, names=config.columns, header=0)
df.head()
school_year state_fips_code state_name state_abbrev school_name lea_name state_agency_id supervisory_union_id state_lea_id nces_lea_id ... grade_10_offered grade_11_offered grade_12_offered grade_13_offered ungraded_offered adult_education_offered lowest_grade_offered highest_grade_offered school_level any_grades_offered_field_adjusted
0 2021-2022 01 ALABAMA AL Albertville Middle School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 07 08 Middle As reported
1 2021-2022 01 ALABAMA AL Albertville High School Albertville City 01 NaN AL-101 0100005 ... Yes Yes Yes No No No 09 12 High As reported
2 2021-2022 01 ALABAMA AL Albertville Intermediate School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 05 06 Middle As reported
3 2021-2022 01 ALABAMA AL Albertville Elementary School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 03 04 Elementary As reported
4 2021-2022 01 ALABAMA AL Albertville Kindergarten and PreK Albertville City 01 NaN AL-101 0100005 ... No No No No No No PK KG Elementary As reported

5 rows × 65 columns
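
As a final convenience, we could wrap the whole load in a small helper that picks the right pandas reader based on the config’s file_type (a sketch; load_dataframe is a hypothetical name, and it mirrors the two file_types the class already knows about):

def load_dataframe(yaml_file_path):
    """Load a file into a dataframe using its yaml configuration."""
    config = Config(yaml_file_path)
    # Dispatch mirrors Config.get_signature's supported file_types.
    readers = {"csv": pd.read_csv, "excel": pd.read_excel}
    read_func = readers[config.file_type]
    return read_func(**config.kwargs, names=config.columns, header=0)

df = load_dataframe("nces_school_directory_2022.yml")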

Conclusion

There is a lot more parsing functionality that could be built on this basic example by adding more configuration variables and/or expanding to different file types.

We’ve used a variation of this approach to great effect to parse hundreds of research files from most of the states in the U.S. as well as a variety of other public research data like NCES. This has even been used to parse Google Spreadsheet data directly in memory.

With this data now transformed into a pandas dataframe, it can be easily loaded into a data warehouse or analyzed in memory.
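
For example, the name variable from the config could determine the destination table (a sketch assuming a SQLAlchemy engine; the connection string here is purely hypothetical):

from sqlalchemy import create_engine

# Hypothetical warehouse connection; substitute your own.
engine = create_engine("postgresql://user:password@localhost/warehouse")
df.to_sql(config.name, engine, if_exists="replace", index=False)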