Data Transformation of the Texas STAAR Aggregate Data for 2020-2021

Data files available here on the TEA website.

The STAAR data files contain results for 60+ student demographic groups and each variable is repeated in a separate column for both demographic and test subject. As you will see below, this results in data files that are several thousand columns wide and are a different schema for every grade.

These files are designed for use in SAS and SPSS, and the website itself indicates that the number of variables in these files is too great to import into Microsoft Access or some versions of Microsoft Excel without significant truncation.

However, my goal was to pivot this data into a standardized format and load into BigQuery for cross analysis with other state assessment data. What follows is my approach in Python, but I am sure there are other potentially simpler methods than mine, I hope this demonstration helps other approach this dataset for analysis and possibly encourages others to improve upon this method.

The original files are linked below for easy access.

Campus level data for Grades 3-8

Grade 3: test data | variable list
Grade 4: test data | variable list
Grade 5: test data | variable list
Grade 6: test data | variable list
Grade 7: test data | variable list
Grade 8: test data | variable list

Getting the Lay of the Land

Before determining a solution and final schema, I wanted to understand the structure of these files. This can be accomplished by looping over the linked dat files and getting their shapes.

import pandas as pd
from pathlib import Path
import xlrd
import os

# Display options for pandas data frames
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', None)
pd.set_option('max_seq_item', None)

base_url = "https://tea.texas.gov/sites/default/files/"
dat_files = ["cfy21e3.dat", "cfy21e4.dat", "cfy21e5.dat", "cfy21e6.dat", "cfy21e7.dat", "cfy21e8.dat"]

for dat_file in dat_files:
    url = f"{base_url}{dat_file}"
    content = Path(url).stem
    df = pd.read_csv(url)
    print(content, df.shape)

cfy21e3 (4643, 2093)
cfy21e4 (4622, 3075)
cfy21e5 (4415, 3197)
cfy21e6 (2700, 2093)
cfy21e7 (2337, 3075)
cfy21e8 (2370, 4301)

Each file is named with this convention (as far as I can discern): - c: Campus level data - fy21: Fiscal year 2021 - e#: English grade # (English, because grades 3-5 also include Spanish results)

The numbers in the accompanying parenthesis represent the count of rows and columns, respectively.

Inconsistent Data Shape

We can see from this quick file inspection that combining these datasets into a single unified structure will require more than just concatenation. The number of columns is jagged/inconsistent. Reviewing the variable files shows that each grade has differences in the variables reported. Some grades have Reading and Mathematics, while other include a Writing test as well. There are additional differences, but that is one of the most common.

They are so wide, because each variable such as # Tested is repeated for each subject and each student group (60+ in total). We can also see that the naming of these columns is encoded/abbreviated, such as r_all_d for # Tested -- Reading -- All Students or r_eth2_d for # Tested -- Reading -- Two or More Races Students.

Finding a Pattern

There is something of a pattern to this naming convention you may have noticed: - r: reading - all: all students - d: # tested

This becomes more obvious with subsequent examples such as r_all_unsatgl_nm for # Did Not Meet Grade Level Performance -- Reading -- All Students.

A First Attempt

My initial thought was to split these variable names using the the _ as a delimiter, but this was met with limited success, again due to inconsistent usage.

Note: You will see the following warning in places WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero. This is due to an encoding issue with the TEA provided files and can be ignored as it does not impact our ability to parse the files.

grade3_vars = 'https://tea.texas.gov/sites/default/files/fy21_varlist_g03.xls'

df = pd.read_excel(grade3_vars)
df.head(20)

WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero

	Variable	Format_Type	Description
0	campus	Character	Campus Number
1	year	Character	Test Administration Year
2	region	Character	Education Service Center (Region) Number
3	district	Character	District Number
4	dname	Character	District Name
5	cname	Character	Campus Name
6	grade	Character	Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)
7	r_all_docs_n	Numeric	# Answer Documents Submitted -- Reading -- All Students
8	r_all_abs_n	Numeric	# Absent - Not Tested -- Reading -- All Students
9	r_all_oth_n	Numeric	# Other - Not Tested -- Reading -- All Students
10	m_all_docs_n	Numeric	# Answer Documents Submitted -- Mathematics -- All Students
11	m_all_abs_n	Numeric	# Absent - Not Tested -- Mathematics -- All Students
12	m_all_oth_n	Numeric	# Other - Not Tested -- Mathematics -- All Students
13	r_all_d	Numeric	# Tested -- Reading -- All Students
14	r_sexm_d	Numeric	# Tested -- Reading -- Male Students
15	r_sexf_d	Numeric	# Tested -- Reading -- Female Students
16	r_sexv_d	Numeric	# Tested -- Reading -- No Sex Info Students
17	r_ethh_d	Numeric	# Tested -- Reading -- Hispanic/Latino Students
18	r_ethi_d	Numeric	# Tested -- Reading -- American Indian or Alaska Native Students
19	r_etha_d	Numeric	# Tested -- Reading -- Asian Students

It appears at first glance, that the pattern is subject, student demographic group, variable type. However, we can see this doesn’t remain consistent.

df.tail(15)

	Variable	Format_Type	Description
2078	m_spen_pct_cat4	Numeric	% Avg Items Correct--Reporting Category 4 -- Mathematics -- Not Special Ed Students
2079	m_spev_avg_cat4	Numeric	# Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info Special Ed Students
2080	m_spev_pct_cat4	Numeric	% Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info Special Ed Students
2081	m_gify_avg_cat4	Numeric	# Avg Items Correct--Reporting Category 4 -- Mathematics -- Gifted/Talented Students
2082	m_gify_pct_cat4	Numeric	% Avg Items Correct--Reporting Category 4 -- Mathematics -- Gifted/Talented Students
2083	m_gifn_avg_cat4	Numeric	# Avg Items Correct--Reporting Category 4 -- Mathematics -- Not Gifted/Talented Students
2084	m_gifn_pct_cat4	Numeric	% Avg Items Correct--Reporting Category 4 -- Mathematics -- Not Gifted/Talented Students
2085	m_gifv_avg_cat4	Numeric	# Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info Gifted/Talented Students
2086	m_gifv_pct_cat4	Numeric	% Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info Gifted/Talented Students
2087	m_atry_avg_cat4	Numeric	# Avg Items Correct--Reporting Category 4 -- Mathematics -- At-Risk Students
2088	m_atry_pct_cat4	Numeric	% Avg Items Correct--Reporting Category 4 -- Mathematics -- At-Risk Students
2089	m_atrn_avg_cat4	Numeric	# Avg Items Correct--Reporting Category 4 -- Mathematics -- Not At-Risk Students
2090	m_atrn_pct_cat4	Numeric	% Avg Items Correct--Reporting Category 4 -- Mathematics -- Not At-Risk Students
2091	m_atrv_avg_cat4	Numeric	# Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info At-Risk Students
2092	m_atrv_pct_cat4	Numeric	% Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info At-Risk Students

In these latter cases, there is a 4th variable reporting category that comes before the subject and student group.

df[195:215]

	Variable	Format_Type	Description
195	r_atrv_unsatgl_nm	Numeric	# Did Not Meet Grade Level Performance -- Reading -- No Info At-Risk Students
196	r_all_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- All Students
197	r_sexm_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Male Students
198	r_sexf_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Female Students
199	r_sexv_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- No Sex Info Students
200	r_ethh_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Hispanic/Latino Students
201	r_ethi_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- American Indian or Alaska Native Students
202	r_etha_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Asian Students
203	r_ethb_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Black or African American Students
204	r_ethp_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Native Hawaiian or Other Pacific Islander Students
205	r_ethw_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- White Students
206	r_eth2_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Two or More Races Students
207	r_ethv_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- No Ethnicity Info Students
208	r_ecoy_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Econ Disadv Students Codes: 1 2 9
209	r_econ_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Not Econ Disadv Students
210	r_eco1_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Free Meals Students
211	r_eco2_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Reduced-Price Meals Students
212	r_eco9_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Other Econ Disadvantaged Students
213	r_ecov_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- No Info Econ Students
214	r_ti1y_approgl_nm	Numeric	# Approaches Grade Level Performance -- Reading -- Title-I Participant Students Codes: 6 7 9

There is another pattern with 4 parts for the proficiency levels that is subject, student group, and then a two part string for the variable type.

These differences would make it challenging to simply split on the _ character. Even distinguishing between the three part and four part variables isn’t sufficient, because there are two types of the four part variables.

A Different Solution

However, we can simply ignore the variables. We have a description column that also has it’s own delimiter: --.

descriptions = df.Description.values.tolist()

for description in descriptions[:20]:
    parts = description.split('--')
    print(parts)

print('\n...\n')

for description in descriptions[-5:]:
    parts = description.split('--')
    print(parts)

['Campus Number']
['Test Administration Year']
['Education Service Center (Region) Number']
['District Number']
['District Name']
['Campus Name']
['Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)']
['# Answer Documents Submitted ', ' Reading ', ' All Students']
['# Absent - Not Tested ', ' Reading ', ' All Students']
['# Other - Not Tested ', ' Reading ', ' All Students']
['# Answer Documents Submitted ', ' Mathematics ', ' All Students']
['# Absent - Not Tested ', ' Mathematics ', ' All Students']
['# Other - Not Tested ', ' Mathematics ', ' All Students']
['# Tested ', ' Reading ', ' All Students']
['# Tested ', ' Reading ', ' Male Students']
['# Tested ', ' Reading ', ' Female Students']
['# Tested ', ' Reading ', ' No Sex Info Students']
['# Tested ', ' Reading ', ' Hispanic/Latino Students']
['# Tested ', ' Reading ', ' American Indian or Alaska Native Students']
['# Tested ', ' Reading ', ' Asian Students']

...

['% Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' At-Risk Students']
['# Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' Not At-Risk Students']
['% Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' Not At-Risk Students']
['# Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' No Info At-Risk Students']
['% Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' No Info At-Risk Students']

This mostly works. We have three different types of descriptions/variables: - 1 part variables like ['Campus Number'] - 3 part variables like ['# Tested ', ' Reading ', ' All Students'] - 4 part variables like ['% Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' At-Risk Students']

That means we can use the list length to determine the order of the variable values: variable name, test subject, and student group. The fourth part reporting category is basically a modifier for the variable name so we can combine those.

By examining a single variable name, we can see that # Tested, for example, isn’t actually 122 different variables but rather a single variable with 122 permutations of test subject and student group: 61 student groups * 2 test subjects

df[df.Description.str.contains('# Tested')]

	Variable	Format_Type	Description
13	r_all_d	Numeric	# Tested -- Reading -- All Students
14	r_sexm_d	Numeric	# Tested -- Reading -- Male Students
15	r_sexf_d	Numeric	# Tested -- Reading -- Female Students
16	r_sexv_d	Numeric	# Tested -- Reading -- No Sex Info Students
17	r_ethh_d	Numeric	# Tested -- Reading -- Hispanic/Latino Students
...	...	...	...
130	m_gifn_d	Numeric	# Tested -- Mathematics -- Not Gifted/Talented Students
131	m_gifv_d	Numeric	# Tested -- Mathematics -- No Info Gifted/Talented Students
132	m_atry_d	Numeric	# Tested -- Mathematics -- At-Risk Students
133	m_atrn_d	Numeric	# Tested -- Mathematics -- Not At-Risk Students
134	m_atrv_d	Numeric	# Tested -- Mathematics -- No Info At-Risk Students

122 rows × 3 columns

Extracting the Variable Descriptors

In addition to pulling out the test subject and student group from the variable description, it will be useful to rename the columns into a common naming convention that is more human readable than the current encodings.

Example:

If we take the initial variable example r_all_d for # Tested -- Reading -- All Students and remove the subject and student group we would be left with simply d as the variable/column name. But a better description would be # Tested or even better (to eliminate special characters that would be problematic in the database) n_tested.

Splitting the Description

To start let’s add a new column to the dataframe that has the array of descriptors. We can also remove any leading or trailing whitespace from both the variable and description to prevent mismatch issues later.

df["Variable"] = df["Variable"].str.strip()
df["Description"] = df["Description"].str.strip()
df["desc_list"] = df["Description"].apply(lambda x: x.split("--"))
df

	Variable	Format_Type	Description	desc_list
0	campus	Character	Campus Number	[Campus Number]
1	year	Character	Test Administration Year	[Test Administration Year]
2	region	Character	Education Service Center (Region) Number	[Education Service Center (Region) Number]
3	district	Character	District Number	[District Number]
4	dname	Character	District Name	[District Name]
...	...	...	...	...
2088	m_atry_pct_cat4	Numeric	% Avg Items Correct--Reporting Category 4 -- Mathematics -- At-Risk Students	[% Avg Items Correct, Reporting Category 4 , Mathematics , At-Risk Students]
2089	m_atrn_avg_cat4	Numeric	# Avg Items Correct--Reporting Category 4 -- Mathematics -- Not At-Risk Students	[# Avg Items Correct, Reporting Category 4 , Mathematics , Not At-Risk Students]
2090	m_atrn_pct_cat4	Numeric	% Avg Items Correct--Reporting Category 4 -- Mathematics -- Not At-Risk Students	[% Avg Items Correct, Reporting Category 4 , Mathematics , Not At-Risk Students]
2091	m_atrv_avg_cat4	Numeric	# Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info At-Risk Students	[# Avg Items Correct, Reporting Category 4 , Mathematics , No Info At-Risk Students]
2092	m_atrv_pct_cat4	Numeric	% Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info At-Risk Students	[% Avg Items Correct, Reporting Category 4 , Mathematics , No Info At-Risk Students]

2093 rows × 4 columns

Extracting Test Subject

In most cases the test subject is the 2nd part of the variable description with two exceptions: if the variable is one of the campus/grade variables that only has a single part or for the four part variables that include reporting category.

def get_subject(desc_list):
    """
    Return the 2nd part of 3 part descriptors, the 3rd part of 4 part descriptors, or None.

    Note: Else None is not necessary here, since functions return None by default in Python
    """
    length = len(desc_list)
    if length == 4:
        return desc_list[2]
    elif length > 1:
        return desc_list[1]

df["subject"] = df["desc_list"].apply(lambda x: get_subject(x))
df.head(20)

	Variable	Format_Type	Description	desc_list	subject
0	campus	Character	Campus Number	[Campus Number]	None
1	year	Character	Test Administration Year	[Test Administration Year]	None
2	region	Character	Education Service Center (Region) Number	[Education Service Center (Region) Number]	None
3	district	Character	District Number	[District Number]	None
4	dname	Character	District Name	[District Name]	None
5	cname	Character	Campus Name	[Campus Name]	None
6	grade	Character	Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)	[Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)]	None
7	r_all_docs_n	Numeric	# Answer Documents Submitted -- Reading -- All Students	[# Answer Documents Submitted , Reading , All Students]	Reading
8	r_all_abs_n	Numeric	# Absent - Not Tested -- Reading -- All Students	[# Absent - Not Tested , Reading , All Students]	Reading
9	r_all_oth_n	Numeric	# Other - Not Tested -- Reading -- All Students	[# Other - Not Tested , Reading , All Students]	Reading
10	m_all_docs_n	Numeric	# Answer Documents Submitted -- Mathematics -- All Students	[# Answer Documents Submitted , Mathematics , All Students]	Mathematics
11	m_all_abs_n	Numeric	# Absent - Not Tested -- Mathematics -- All Students	[# Absent - Not Tested , Mathematics , All Students]	Mathematics
12	m_all_oth_n	Numeric	# Other - Not Tested -- Mathematics -- All Students	[# Other - Not Tested , Mathematics , All Students]	Mathematics
13	r_all_d	Numeric	# Tested -- Reading -- All Students	[# Tested , Reading , All Students]	Reading
14	r_sexm_d	Numeric	# Tested -- Reading -- Male Students	[# Tested , Reading , Male Students]	Reading
15	r_sexf_d	Numeric	# Tested -- Reading -- Female Students	[# Tested , Reading , Female Students]	Reading
16	r_sexv_d	Numeric	# Tested -- Reading -- No Sex Info Students	[# Tested , Reading , No Sex Info Students]	Reading
17	r_ethh_d	Numeric	# Tested -- Reading -- Hispanic/Latino Students	[# Tested , Reading , Hispanic/Latino Students]	Reading
18	r_ethi_d	Numeric	# Tested -- Reading -- American Indian or Alaska Native Students	[# Tested , Reading , American Indian or Alaska Native Students]	Reading
19	r_etha_d	Numeric	# Tested -- Reading -- Asian Students	[# Tested , Reading , Asian Students]	Reading

Extracting Student Group

In all cases (except the single descriptor variables) the student group is included at the end of the description.

def get_student_group(desc_list):
    """
    Unless the description has only one part, return the final part.
    """
    if len(desc_list) > 1:
        return desc_list[-1]

df["student_group"] = df["desc_list"].apply(lambda x: get_student_group(x))
df.head(20)

	Variable	Format_Type	Description	desc_list	subject	student_group
0	campus	Character	Campus Number	[Campus Number]	None	None
1	year	Character	Test Administration Year	[Test Administration Year]	None	None
2	region	Character	Education Service Center (Region) Number	[Education Service Center (Region) Number]	None	None
3	district	Character	District Number	[District Number]	None	None
4	dname	Character	District Name	[District Name]	None	None
5	cname	Character	Campus Name	[Campus Name]	None	None
6	grade	Character	Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)	[Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)]	None	None
7	r_all_docs_n	Numeric	# Answer Documents Submitted -- Reading -- All Students	[# Answer Documents Submitted , Reading , All Students]	Reading	All Students
8	r_all_abs_n	Numeric	# Absent - Not Tested -- Reading -- All Students	[# Absent - Not Tested , Reading , All Students]	Reading	All Students
9	r_all_oth_n	Numeric	# Other - Not Tested -- Reading -- All Students	[# Other - Not Tested , Reading , All Students]	Reading	All Students
10	m_all_docs_n	Numeric	# Answer Documents Submitted -- Mathematics -- All Students	[# Answer Documents Submitted , Mathematics , All Students]	Mathematics	All Students
11	m_all_abs_n	Numeric	# Absent - Not Tested -- Mathematics -- All Students	[# Absent - Not Tested , Mathematics , All Students]	Mathematics	All Students
12	m_all_oth_n	Numeric	# Other - Not Tested -- Mathematics -- All Students	[# Other - Not Tested , Mathematics , All Students]	Mathematics	All Students
13	r_all_d	Numeric	# Tested -- Reading -- All Students	[# Tested , Reading , All Students]	Reading	All Students
14	r_sexm_d	Numeric	# Tested -- Reading -- Male Students	[# Tested , Reading , Male Students]	Reading	Male Students
15	r_sexf_d	Numeric	# Tested -- Reading -- Female Students	[# Tested , Reading , Female Students]	Reading	Female Students
16	r_sexv_d	Numeric	# Tested -- Reading -- No Sex Info Students	[# Tested , Reading , No Sex Info Students]	Reading	No Sex Info Students
17	r_ethh_d	Numeric	# Tested -- Reading -- Hispanic/Latino Students	[# Tested , Reading , Hispanic/Latino Students]	Reading	Hispanic/Latino Students
18	r_ethi_d	Numeric	# Tested -- Reading -- American Indian or Alaska Native Students	[# Tested , Reading , American Indian or Alaska Native Students]	Reading	American Indian or Alaska Native Students
19	r_etha_d	Numeric	# Tested -- Reading -- Asian Students	[# Tested , Reading , Asian Students]	Reading	Asian Students

Creating a New Variable Name

There are a few problematic characters like # and % in the names of the variables that should be replaced as well as replacing spaces and abbreviating some longer words in the names. This also uses some regex parsing to remove any text contained in parenthesis.

import re

def rename_variable(desc_list):
    """
    Replace special characters and append reporting category when present.
    """
    variable_name = desc_list[0].lower().strip()
    if len(desc_list) == 4:
        variable_name += desc_list[1].lower().strip()
    replacements = {
        '-': '',
        '#': 'n',
        '%': 'pct',
        'average': 'avg',
        'reporting category': 'rpt cat',
        ' ': '_',
        '__': '_',
    }
    for old, new in replacements.items():
        variable_name = re.sub(r"\([^()]*\)", "", variable_name).strip()
        variable_name = variable_name.replace(old, new)
    return variable_name

df["name"] = df["desc_list"].apply(lambda x: rename_variable(x))
df.head(20)

	Variable	Format_Type	Description	desc_list	subject	student_group	name
0	campus	Character	Campus Number	[Campus Number]	None	None	campus_number
1	year	Character	Test Administration Year	[Test Administration Year]	None	None	test_administration_year
2	region	Character	Education Service Center (Region) Number	[Education Service Center (Region) Number]	None	None	education_service_center_number
3	district	Character	District Number	[District Number]	None	None	district_number
4	dname	Character	District Name	[District Name]	None	None	district_name
5	cname	Character	Campus Name	[Campus Name]	None	None	campus_name
6	grade	Character	Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)	[Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)]	None	None	tested_grade
7	r_all_docs_n	Numeric	# Answer Documents Submitted -- Reading -- All Students	[# Answer Documents Submitted , Reading , All Students]	Reading	All Students	n_answer_documents_submitted
8	r_all_abs_n	Numeric	# Absent - Not Tested -- Reading -- All Students	[# Absent - Not Tested , Reading , All Students]	Reading	All Students	n_absent_not_tested
9	r_all_oth_n	Numeric	# Other - Not Tested -- Reading -- All Students	[# Other - Not Tested , Reading , All Students]	Reading	All Students	n_other_not_tested
10	m_all_docs_n	Numeric	# Answer Documents Submitted -- Mathematics -- All Students	[# Answer Documents Submitted , Mathematics , All Students]	Mathematics	All Students	n_answer_documents_submitted
11	m_all_abs_n	Numeric	# Absent - Not Tested -- Mathematics -- All Students	[# Absent - Not Tested , Mathematics , All Students]	Mathematics	All Students	n_absent_not_tested
12	m_all_oth_n	Numeric	# Other - Not Tested -- Mathematics -- All Students	[# Other - Not Tested , Mathematics , All Students]	Mathematics	All Students	n_other_not_tested
13	r_all_d	Numeric	# Tested -- Reading -- All Students	[# Tested , Reading , All Students]	Reading	All Students	n_tested
14	r_sexm_d	Numeric	# Tested -- Reading -- Male Students	[# Tested , Reading , Male Students]	Reading	Male Students	n_tested
15	r_sexf_d	Numeric	# Tested -- Reading -- Female Students	[# Tested , Reading , Female Students]	Reading	Female Students	n_tested
16	r_sexv_d	Numeric	# Tested -- Reading -- No Sex Info Students	[# Tested , Reading , No Sex Info Students]	Reading	No Sex Info Students	n_tested
17	r_ethh_d	Numeric	# Tested -- Reading -- Hispanic/Latino Students	[# Tested , Reading , Hispanic/Latino Students]	Reading	Hispanic/Latino Students	n_tested
18	r_ethi_d	Numeric	# Tested -- Reading -- American Indian or Alaska Native Students	[# Tested , Reading , American Indian or Alaska Native Students]	Reading	American Indian or Alaska Native Students	n_tested
19	r_etha_d	Numeric	# Tested -- Reading -- Asian Students	[# Tested , Reading , Asian Students]	Reading	Asian Students	n_tested

Bringing it All Together

Because we’re going to iterate over each variable list (there are six files: one for each grade between 3 and 8), we can simplify this by running all these change operations at once. We also want to remove unneeded columns and rename the original columns to standardize the naming conventions and make for an easy join condition later.

def add_descriptors(df):
    df["Variable"] = df["Variable"].str.strip()
    df["Description"] = df["Description"].str.strip()
    df["desc_list"] = df["Description"].apply(lambda x: x.split('--'))
    df["name"] = df["desc_list"].apply(lambda x: rename_variable(x))
    df["subject"] = df["desc_list"].apply(lambda x: get_subject(x))
    df["student_group"] = df["desc_list"].apply(lambda x: get_student_group(x))
    df.drop(columns=["Description", "desc_list"], inplace=True)
    df.rename(columns={'Variable': 'variable', 'Format_Type': 'format_type'}, inplace=True)
    return df

A Path Forward

That means it would be possible to pivot this data long for each distinct variable after extracting the subject and student group from the variable description. This information, though, is split across two different files: the test data in the .dat files and the variable descriptions in the .xls files.

My plan then is to: 1. Match the corresponding dat file to its variable list 2. Pivot the test data long so that each variable and value is a single row 3. Extract the test subject and student group data points from the variable description 4. Join the two datasets together using the variable name as a unique identifier 5. Once more pivot the combined data such that each distinct variable becomes its own column 6. Combine those standardized data structures for each grade into a single dataset 7. Write that new dataset out to a compressed csv file for importing into BigQuery

base_url = "https://tea.texas.gov/sites/default/files/"
dat_files = ["cfy21e3.dat", "cfy21e4.dat", "cfy21e5.dat", "cfy21e6.dat", "cfy21e7.dat", "cfy21e8.dat"]
var_files = ["fy21_varlist_g03.xls", "fy21_varlist_g04.xls", "fy21_varlist_g05.xls", "fy21_varlist_g06.xls", "fy21_varlist_g07.xls", "fy21_varlist_g08.xls"]

# 1. Match the dat file with its corresponding var file
all_files = zip(dat_files, var_files)

# Create a list to concatenate the resulting grade level data frames later
all_grades = []

for dat_file, var_file in all_files:
    test_data = pd.read_csv(f"{base_url}{dat_file}", dtype="string")
    variables = pd.read_excel(f"{base_url}{var_file}", dtype="string")

    print("Parsing:", dat_file, "rows/cols:", test_data.shape)


    # 2. melt the dataframe (pivot wide to long) but keep all the site and grade identifiers out of the pivot
    identifiers = variables[variables.Format_Type == 'Character'].Variable.tolist()
    id_vars = [identifier.upper() for identifier in identifiers]
    test_data = pd.melt(test_data, id_vars=id_vars)
    print("Pivoted:", dat_file, "rows/cols:", test_data.shape)

    # 3. extract the subject and student groups
    variables = add_descriptors(variables)

    # 4. join the two dataset together using the variable name as the unique identifier
    combined = test_data.merge(variables, how='left', on='variable')
    print("Joined:", dat_file, "+", var_file, "rows/cols:", combined.shape)

    # 5. pivot back wide but with only the distinct variables
    cols = combined.columns.tolist()
    combined[cols] = combined[cols].apply(lambda x: x.str.strip())
    combined.columns = combined.columns.str.lower()
    columns = ['campus','year','region','district','dname','cname','grade','format_type','subject','student_group','name']
    combined = combined.set_index(columns)['value'].unstack().reset_index()
    print("Combined:", dat_file, "+", var_file, "rows/cols:", combined.shape)

    # 6. append data for each grade into a single data structure and concatenate
    all_grades.append(combined)

df = pd.concat(all_grades)
print("Final rows/cols:", df.shape)

# 7. write to compressed csv
df.to_csv("TX_STAAR_3_8_2021.csv.gz", index=False, compression="gzip")

WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e3.dat rows/cols: (4643, 2093)
Pivoted: cfy21e3.dat rows/cols: (9685298, 9)
Joined: cfy21e3.dat + fy21_varlist_g03.xls rows/cols: (9685298, 13)
Combined: cfy21e3.dat + fy21_varlist_g03.xls rows/cols: (571089, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e4.dat rows/cols: (4622, 3075)
Pivoted: cfy21e4.dat rows/cols: (14180296, 9)
Joined: cfy21e4.dat + fy21_varlist_g04.xls rows/cols: (14180296, 13)
Combined: cfy21e4.dat + fy21_varlist_g04.xls rows/cols: (850448, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e5.dat rows/cols: (4415, 3197)
Pivoted: cfy21e5.dat rows/cols: (14083850, 9)
Joined: cfy21e5.dat + fy21_varlist_g05.xls rows/cols: (14083850, 13)
Combined: cfy21e5.dat + fy21_varlist_g05.xls rows/cols: (812360, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e6.dat rows/cols: (2700, 2093)
Pivoted: cfy21e6.dat rows/cols: (5632200, 9)
Joined: cfy21e6.dat + fy21_varlist_g06.xls rows/cols: (5632200, 13)
Combined: cfy21e6.dat + fy21_varlist_g06.xls rows/cols: (332100, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e7.dat rows/cols: (2337, 3075)
Pivoted: cfy21e7.dat rows/cols: (7169916, 9)
Joined: cfy21e7.dat + fy21_varlist_g07.xls rows/cols: (7169916, 13)
Combined: cfy21e7.dat + fy21_varlist_g07.xls rows/cols: (430008, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e8.dat rows/cols: (2370, 4301)
Pivoted: cfy21e8.dat rows/cols: (10176780, 9)
Joined: cfy21e8.dat + fy21_varlist_g08.xls rows/cols: (10176780, 13)
Combined: cfy21e8.dat + fy21_varlist_g08.xls rows/cols: (580650, 35)
Final rows/cols: (3576655, 35)

df

name	campus	year	region	district	dname	cname	grade	format_type	subject	student_group	NaN	avg_scale_score	n_absent_not_tested	n_answer_documents_submitted	n_approaches_grade_level_performance	n_avg_items_correctrpt_cat_1	n_avg_items_correctrpt_cat_2	n_avg_items_correctrpt_cat_3	n_avg_items_correctrpt_cat_4	n_did_not_meet_grade_level_performance	n_masters_grade_level_performance	n_meets_grade_level_performance	n_other_not_tested	n_tested	pct_absent_not_tested	pct_answer_documents_submitted	pct_approaches_grade_level_performance	pct_avg_items_correctrpt_cat_1	pct_avg_items_correctrpt_cat_2	pct_avg_items_correctrpt_cat_3	pct_avg_items_correctrpt_cat_4	pct_did_not_meet_grade_level_performance	pct_masters_grade_level_performance	pct_meets_grade_level_performance	pct_other_not_tested
0	001902103	21	07	001902	CAYUGA ISD	CAYUGA ELEM.	03	<NA>	NaN	NaN		NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	001902103	21	07	001902	CAYUGA ISD	CAYUGA ELEM.	03	Numeric	Mathematics	All Students	NaN	1607	0	26	22	6.3	10.7	5	3.3	4	16	19	0	26	0	100	85	79	83	71	84	15	62	73	0
2	001902103	21	07	001902	CAYUGA ISD	CAYUGA ELEM.	03	Numeric	Mathematics	American Indian or Alaska Native Students	NaN	<NA>	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN	0	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN
3	001902103	21	07	001902	CAYUGA ISD	CAYUGA ELEM.	03	Numeric	Mathematics	Asian Students	NaN	<NA>	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN	0	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN
4	001902103	21	07	001902	CAYUGA ISD	CAYUGA ELEM.	03	Numeric	Mathematics	At-Risk Students	NaN	1465	NaN	NaN	5	5.1	9.3	3.3	2.7	2	2	3	NaN	7	NaN	NaN	71	64	71	47	68	29	29	43	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
580645	254902001	21	20	254902	LA PRYOR ISD	LA PRYOR H.S.	08	Numeric	Social Studies	Title-I Participant Students Codes: 6 7 9	NaN	3425	NaN	NaN	12	6.9	5.4	4.2	3.2	23	0	4	NaN	35	NaN	NaN	34	41	54	42	46	66	0	11	NaN
580646	254902001	21	20	254902	LA PRYOR ISD	LA PRYOR H.S.	08	Numeric	Social Studies	Transitional Bilingual/Early Exit Students	NaN	<NA>	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN	0	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN
580647	254902001	21	20	254902	LA PRYOR ISD	LA PRYOR H.S.	08	Numeric	Social Studies	Transitional Bilingual/Late Exit Students	NaN	<NA>	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN	0	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN
580648	254902001	21	20	254902	LA PRYOR ISD	LA PRYOR H.S.	08	Numeric	Social Studies	Two or More Races Students	NaN	<NA>	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN	0	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN
580649	254902001	21	20	254902	LA PRYOR ISD	LA PRYOR H.S.	08	Numeric	Social Studies	White Students	NaN	<NA>	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN	1	NaN	NaN	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	NaN

3576655 rows × 35 columns

Conclusion

The resulting dataset/file ends up being ~3.5 million rows in length and a consistent 35 columns wide. You can download the reshaped dataset here: TX STAAR 2021 - Grades 3-8

We can now run simpler queries on the reshaped data such as:

SELECT *
FROM TX_STAAR_2021
WHERE subject = 'Reading'
AND student_group = 'All Students'
AND grade IN ('6', '7', '8')

A similar query using the original datasets would’ve required unioning three different datasets and a much more complicated set of column matching to filter down to the right subject and student group.

In our production environment, the approach is similar but we initially load the raw dat and var files into Google Cloud Storage and as well as the resulting csv. That is then converted to avro for better query performance and an external table is generated in BigQuery using their API.