Parsing Texas Assessment Data

Posted on Sat 29 January 2022 in Tutorial

Data Transformation of the Texas STAAR Aggregate Data for 2020-2021

Data files available here on the TEA website.

The STAAR data files contain results for 60+ student demographic groups and each variable is repeated in a separate column for both demographic and test subject. As you will see below, this results in data files that are several thousand columns wide and are a different schema for every grade.

These files are designed for use in SAS and SPSS, and the website itself indicates that the number of variables in these files is too great to import into Microsoft Access or some versions of Microsoft Excel without significant truncation.

However, my goal was to pivot this data into a standardized format and load into BigQuery for cross analysis with other state assessment data. What follows is my approach in Python, but I am sure there are other potentially simpler methods than mine, I hope this demonstration helps other approach this dataset for analysis and possibly encourages others to improve upon this method.

The original files are linked below for easy access.

Campus level data for Grades 3-8

Getting the Lay of the Land

Before determining a solution and final schema, I wanted to understand the structure of these files. This can be accomplished by looping over the linked dat files and getting their shapes.

In [7]:
import pandas as pd
from pathlib import Path
import xlrd
import os

# Display options for pandas data frames
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', None)
pd.set_option('max_seq_item', None)
In [47]:
base_url = "https://tea.texas.gov/sites/default/files/"
dat_files = ["cfy21e3.dat", "cfy21e4.dat", "cfy21e5.dat", "cfy21e6.dat", "cfy21e7.dat", "cfy21e8.dat"]

for dat_file in dat_files:
    url = f"{base_url}{dat_file}"
    content = Path(url).stem
    df = pd.read_csv(url)
    print(content, df.shape)
cfy21e3 (4643, 2093)
cfy21e4 (4622, 3075)
cfy21e5 (4415, 3197)
cfy21e6 (2700, 2093)
cfy21e7 (2337, 3075)
cfy21e8 (2370, 4301)

Each file is named with this convention (as far as I can discern):

  • c: Campus level data
  • fy21: Fiscal year 2021
  • e#: English grade # (English, because grades 3-5 also include Spanish results)

The numbers in the accompanying parenthesis represent the count of rows and columns, respectively.

Inconsistent Data Shape

We can see from this quick file inspection that combining these datasets into a single unified structure will require more than just concatenation. The number of columns is jagged/inconsistent. Reviewing the variable files shows that each grade has differences in the variables reported. Some grades have Reading and Mathematics, while other include a Writing test as well. There are additional differences, but that is one of the most common.

They are so wide, because each variable such as # Tested is repeated for each subject and each student group (60+ in total). We can also see that the naming of these columns is encoded/abbreviated, such as r_all_d for # Tested -- Reading -- All Students or r_eth2_d for # Tested -- Reading -- Two or More Races Students.

Finding a Pattern

There is something of a pattern to this naming convention you may have noticed:

  • r: reading
  • all: all students
  • d: # tested

This becomes more obvious with subsequent examples such as r_all_unsatgl_nm for # Did Not Meet Grade Level Performance -- Reading -- All Students.

A First Attempt

My initial thought was to split these variable names using the the _ as a delimiter, but this was met with limited success, again due to inconsistent usage.

Note: You will see the following warning in places WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero. This is due to an encoding issue with the TEA provided files and can be ignored as it does not impact our ability to parse the files.

In [11]:
grade3_vars = 'https://tea.texas.gov/sites/default/files/fy21_varlist_g03.xls'

df = pd.read_excel(grade3_vars)
df.head(20)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Out[11]:
Variable Format_Type Description
0 campus Character Campus Number
1 year Character Test Administration Year
2 region Character Education Service Center (Region) Number
3 district Character District Number
4 dname Character District Name
5 cname Character Campus Name
6 grade Character Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)
7 r_all_docs_n Numeric # Answer Documents Submitted -- Reading -- All Students
8 r_all_abs_n Numeric # Absent - Not Tested -- Reading -- All Students
9 r_all_oth_n Numeric # Other - Not Tested -- Reading -- All Students
10 m_all_docs_n Numeric # Answer Documents Submitted -- Mathematics -- All Students
11 m_all_abs_n Numeric # Absent - Not Tested -- Mathematics -- All Students
12 m_all_oth_n Numeric # Other - Not Tested -- Mathematics -- All Students
13 r_all_d Numeric # Tested -- Reading -- All Students
14 r_sexm_d Numeric # Tested -- Reading -- Male Students
15 r_sexf_d Numeric # Tested -- Reading -- Female Students
16 r_sexv_d Numeric # Tested -- Reading -- No Sex Info Students
17 r_ethh_d Numeric # Tested -- Reading -- Hispanic/Latino Students
18 r_ethi_d Numeric # Tested -- Reading -- American Indian or Alaska Native Students
19 r_etha_d Numeric # Tested -- Reading -- Asian Students

It appears at first glance, that the pattern is subject, student demographic group, variable type. However, we can see this doesn't remain consistent.

In [12]:
df.tail(15)
Out[12]:
Variable Format_Type Description
2078 m_spen_pct_cat4 Numeric % Avg Items Correct--Reporting Category 4 -- Mathematics -- Not Special Ed Students
2079 m_spev_avg_cat4 Numeric # Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info Special Ed Students
2080 m_spev_pct_cat4 Numeric % Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info Special Ed Students
2081 m_gify_avg_cat4 Numeric # Avg Items Correct--Reporting Category 4 -- Mathematics -- Gifted/Talented Students
2082 m_gify_pct_cat4 Numeric % Avg Items Correct--Reporting Category 4 -- Mathematics -- Gifted/Talented Students
2083 m_gifn_avg_cat4 Numeric # Avg Items Correct--Reporting Category 4 -- Mathematics -- Not Gifted/Talented Students
2084 m_gifn_pct_cat4 Numeric % Avg Items Correct--Reporting Category 4 -- Mathematics -- Not Gifted/Talented Students
2085 m_gifv_avg_cat4 Numeric # Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info Gifted/Talented Students
2086 m_gifv_pct_cat4 Numeric % Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info Gifted/Talented Students
2087 m_atry_avg_cat4 Numeric # Avg Items Correct--Reporting Category 4 -- Mathematics -- At-Risk Students
2088 m_atry_pct_cat4 Numeric % Avg Items Correct--Reporting Category 4 -- Mathematics -- At-Risk Students
2089 m_atrn_avg_cat4 Numeric # Avg Items Correct--Reporting Category 4 -- Mathematics -- Not At-Risk Students
2090 m_atrn_pct_cat4 Numeric % Avg Items Correct--Reporting Category 4 -- Mathematics -- Not At-Risk Students
2091 m_atrv_avg_cat4 Numeric # Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info At-Risk Students
2092 m_atrv_pct_cat4 Numeric % Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info At-Risk Students

In these latter cases, there is a 4th variable reporting category that comes before the subject and student group.

In [13]:
df[195:215]
Out[13]:
Variable Format_Type Description
195 r_atrv_unsatgl_nm Numeric # Did Not Meet Grade Level Performance -- Reading -- No Info At-Risk Students
196 r_all_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- All Students
197 r_sexm_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Male Students
198 r_sexf_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Female Students
199 r_sexv_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- No Sex Info Students
200 r_ethh_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Hispanic/Latino Students
201 r_ethi_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- American Indian or Alaska Native Students
202 r_etha_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Asian Students
203 r_ethb_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Black or African American Students
204 r_ethp_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Native Hawaiian or Other Pacific Islander Students
205 r_ethw_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- White Students
206 r_eth2_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Two or More Races Students
207 r_ethv_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- No Ethnicity Info Students
208 r_ecoy_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Econ Disadv Students Codes: 1 2 9
209 r_econ_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Not Econ Disadv Students
210 r_eco1_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Free Meals Students
211 r_eco2_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Reduced-Price Meals Students
212 r_eco9_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Other Econ Disadvantaged Students
213 r_ecov_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- No Info Econ Students
214 r_ti1y_approgl_nm Numeric # Approaches Grade Level Performance -- Reading -- Title-I Participant Students Codes: 6 7 9

There is another pattern with 4 parts for the proficiency levels that is subject, student group, and then a two part string for the variable type.

These differences would make it challenging to simply split on the _ character. Even distinguishing between the three part and four part variables isn't sufficient, because there are two types of the four part variables.

A Different Solution

However, we can simply ignore the variables. We have a description column that also has it's own delimiter: --.

In [52]:
descriptions = df.Description.values.tolist()

for description in descriptions[:20]:
    parts = description.split('--')
    print(parts)

print('\n...\n')

for description in descriptions[-5:]:
    parts = description.split('--')
    print(parts)
['Campus Number']
['Test Administration Year']
['Education Service Center (Region) Number']
['District Number']
['District Name']
['Campus Name']
['Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)']
['# Answer Documents Submitted ', ' Reading ', ' All Students']
['# Absent - Not Tested ', ' Reading ', ' All Students']
['# Other - Not Tested ', ' Reading ', ' All Students']
['# Answer Documents Submitted ', ' Mathematics ', ' All Students']
['# Absent - Not Tested ', ' Mathematics ', ' All Students']
['# Other - Not Tested ', ' Mathematics ', ' All Students']
['# Tested ', ' Reading ', ' All Students']
['# Tested ', ' Reading ', ' Male Students']
['# Tested ', ' Reading ', ' Female Students']
['# Tested ', ' Reading ', ' No Sex Info Students']
['# Tested ', ' Reading ', ' Hispanic/Latino Students']
['# Tested ', ' Reading ', ' American Indian or Alaska Native Students']
['# Tested ', ' Reading ', ' Asian Students']

...

['% Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' At-Risk Students']
['# Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' Not At-Risk Students']
['% Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' Not At-Risk Students']
['# Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' No Info At-Risk Students']
['% Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' No Info At-Risk Students']

This mostly works. We have three different types of descriptions/variables:

  • 1 part variables like ['Campus Number']
  • 3 part variables like ['# Tested ', ' Reading ', ' All Students']
  • 4 part variables like ['% Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' At-Risk Students']

That means we can use the list length to determine the order of the variable values: variable name, test subject, and student group. The fourth part reporting category is basically a modifier for the variable name so we can combine those.

By examining a single variable name, we can see that # Tested, for example, isn't actually 122 different variables but rather a single variable with 122 permutations of test subject and student group: 61 student groups * 2 test subjects

In [14]:
df[df.Description.str.contains('# Tested')]
Out[14]:
Variable Format_Type Description
13 r_all_d Numeric # Tested -- Reading -- All Students
14 r_sexm_d Numeric # Tested -- Reading -- Male Students
15 r_sexf_d Numeric # Tested -- Reading -- Female Students
16 r_sexv_d Numeric # Tested -- Reading -- No Sex Info Students
17 r_ethh_d Numeric # Tested -- Reading -- Hispanic/Latino Students
... ... ... ...
130 m_gifn_d Numeric # Tested -- Mathematics -- Not Gifted/Talented Students
131 m_gifv_d Numeric # Tested -- Mathematics -- No Info Gifted/Talented Students
132 m_atry_d Numeric # Tested -- Mathematics -- At-Risk Students
133 m_atrn_d Numeric # Tested -- Mathematics -- Not At-Risk Students
134 m_atrv_d Numeric # Tested -- Mathematics -- No Info At-Risk Students

122 rows × 3 columns

Extracting the Variable Descriptors

In addition to pulling out the test subject and student group from the variable description, it will be useful to rename the columns into a common naming convention that is more human readable than the current encodings.

Example:

If we take the initial variable example r_all_d for # Tested -- Reading -- All Students and remove the subject and student group we would be left with simply d as the variable/column name. But a better description would be # Tested or even better (to eliminate special characters that would be problematic in the database) n_tested.

Splitting the Description

To start let's add a new column to the dataframe that has the array of descriptors. We can also remove any leading or trailing whitespace from both the variable and description to prevent mismatch issues later.

In [15]:
df["Variable"] = df["Variable"].str.strip()
df["Description"] = df["Description"].str.strip()
df["desc_list"] = df["Description"].apply(lambda x: x.split("--"))
df
Out[15]:
Variable Format_Type Description desc_list
0 campus Character Campus Number [Campus Number]
1 year Character Test Administration Year [Test Administration Year]
2 region Character Education Service Center (Region) Number [Education Service Center (Region) Number]
3 district Character District Number [District Number]
4 dname Character District Name [District Name]
... ... ... ... ...
2088 m_atry_pct_cat4 Numeric % Avg Items Correct--Reporting Category 4 -- Mathematics -- At-Risk Students [% Avg Items Correct, Reporting Category 4 , Mathematics , At-Risk Students]
2089 m_atrn_avg_cat4 Numeric # Avg Items Correct--Reporting Category 4 -- Mathematics -- Not At-Risk Students [# Avg Items Correct, Reporting Category 4 , Mathematics , Not At-Risk Students]
2090 m_atrn_pct_cat4 Numeric % Avg Items Correct--Reporting Category 4 -- Mathematics -- Not At-Risk Students [% Avg Items Correct, Reporting Category 4 , Mathematics , Not At-Risk Students]
2091 m_atrv_avg_cat4 Numeric # Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info At-Risk Students [# Avg Items Correct, Reporting Category 4 , Mathematics , No Info At-Risk Students]
2092 m_atrv_pct_cat4 Numeric % Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info At-Risk Students [% Avg Items Correct, Reporting Category 4 , Mathematics , No Info At-Risk Students]

2093 rows × 4 columns

Extracting Test Subject

In most cases the test subject is the 2nd part of the variable description with two exceptions: if the variable is one of the campus/grade variables that only has a single part or for the four part variables that include reporting category.

In [16]:
def get_subject(desc_list):
    """
    Return the 2nd part of 3 part descriptors, the 3rd part of 4 part descriptors, or None.

    Note: Else None is not necessary here, since functions return None by default in Python
    """
    length = len(desc_list)
    if length == 4:
        return desc_list[2]
    elif length > 1:
        return desc_list[1]
In [18]:
df["subject"] = df["desc_list"].apply(lambda x: get_subject(x))
df.head(20)
Out[18]:
Variable Format_Type Description desc_list subject
0 campus Character Campus Number [Campus Number] None
1 year Character Test Administration Year [Test Administration Year] None
2 region Character Education Service Center (Region) Number [Education Service Center (Region) Number] None
3 district Character District Number [District Number] None
4 dname Character District Name [District Name] None
5 cname Character Campus Name [Campus Name] None
6 grade Character Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade) [Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)] None
7 r_all_docs_n Numeric # Answer Documents Submitted -- Reading -- All Students [# Answer Documents Submitted , Reading , All Students] Reading
8 r_all_abs_n Numeric # Absent - Not Tested -- Reading -- All Students [# Absent - Not Tested , Reading , All Students] Reading
9 r_all_oth_n Numeric # Other - Not Tested -- Reading -- All Students [# Other - Not Tested , Reading , All Students] Reading
10 m_all_docs_n Numeric # Answer Documents Submitted -- Mathematics -- All Students [# Answer Documents Submitted , Mathematics , All Students] Mathematics
11 m_all_abs_n Numeric # Absent - Not Tested -- Mathematics -- All Students [# Absent - Not Tested , Mathematics , All Students] Mathematics
12 m_all_oth_n Numeric # Other - Not Tested -- Mathematics -- All Students [# Other - Not Tested , Mathematics , All Students] Mathematics
13 r_all_d Numeric # Tested -- Reading -- All Students [# Tested , Reading , All Students] Reading
14 r_sexm_d Numeric # Tested -- Reading -- Male Students [# Tested , Reading , Male Students] Reading
15 r_sexf_d Numeric # Tested -- Reading -- Female Students [# Tested , Reading , Female Students] Reading
16 r_sexv_d Numeric # Tested -- Reading -- No Sex Info Students [# Tested , Reading , No Sex Info Students] Reading
17 r_ethh_d Numeric # Tested -- Reading -- Hispanic/Latino Students [# Tested , Reading , Hispanic/Latino Students] Reading
18 r_ethi_d Numeric # Tested -- Reading -- American Indian or Alaska Native Students [# Tested , Reading , American Indian or Alaska Native Students] Reading
19 r_etha_d Numeric # Tested -- Reading -- Asian Students [# Tested , Reading , Asian Students] Reading

Extracting Student Group

In all cases (except the single descriptor variables) the student group is included at the end of the description.

In [19]:
def get_student_group(desc_list):
    """
    Unless the description has only one part, return the final part.
    """
    if len(desc_list) > 1:
        return desc_list[-1]
In [20]:
df["student_group"] = df["desc_list"].apply(lambda x: get_student_group(x))
df.head(20)
Out[20]:
Variable Format_Type Description desc_list subject student_group
0 campus Character Campus Number [Campus Number] None None
1 year Character Test Administration Year [Test Administration Year] None None
2 region Character Education Service Center (Region) Number [Education Service Center (Region) Number] None None
3 district Character District Number [District Number] None None
4 dname Character District Name [District Name] None None
5 cname Character Campus Name [Campus Name] None None
6 grade Character Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade) [Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)] None None
7 r_all_docs_n Numeric # Answer Documents Submitted -- Reading -- All Students [# Answer Documents Submitted , Reading , All Students] Reading All Students
8 r_all_abs_n Numeric # Absent - Not Tested -- Reading -- All Students [# Absent - Not Tested , Reading , All Students] Reading All Students
9 r_all_oth_n Numeric # Other - Not Tested -- Reading -- All Students [# Other - Not Tested , Reading , All Students] Reading All Students
10 m_all_docs_n Numeric # Answer Documents Submitted -- Mathematics -- All Students [# Answer Documents Submitted , Mathematics , All Students] Mathematics All Students
11 m_all_abs_n Numeric # Absent - Not Tested -- Mathematics -- All Students [# Absent - Not Tested , Mathematics , All Students] Mathematics All Students
12 m_all_oth_n Numeric # Other - Not Tested -- Mathematics -- All Students [# Other - Not Tested , Mathematics , All Students] Mathematics All Students
13 r_all_d Numeric # Tested -- Reading -- All Students [# Tested , Reading , All Students] Reading All Students
14 r_sexm_d Numeric # Tested -- Reading -- Male Students [# Tested , Reading , Male Students] Reading Male Students
15 r_sexf_d Numeric # Tested -- Reading -- Female Students [# Tested , Reading , Female Students] Reading Female Students
16 r_sexv_d Numeric # Tested -- Reading -- No Sex Info Students [# Tested , Reading , No Sex Info Students] Reading No Sex Info Students
17 r_ethh_d Numeric # Tested -- Reading -- Hispanic/Latino Students [# Tested , Reading , Hispanic/Latino Students] Reading Hispanic/Latino Students
18 r_ethi_d Numeric # Tested -- Reading -- American Indian or Alaska Native Students [# Tested , Reading , American Indian or Alaska Native Students] Reading American Indian or Alaska Native Students
19 r_etha_d Numeric # Tested -- Reading -- Asian Students [# Tested , Reading , Asian Students] Reading Asian Students

Creating a New Variable Name

There are a few problematic characters like # and % in the names of the variables that should be replaced as well as replacing spaces and abbreviating some longer words in the names. This also uses some regex parsing to remove any text contained in parenthesis.

In [29]:
import re

def rename_variable(desc_list):
    """
    Replace special characters and append reporting category when present.
    """
    variable_name = desc_list[0].lower().strip()
    if len(desc_list) == 4:
        variable_name += desc_list[1].lower().strip()
    replacements = {
        '-': '',
        '#': 'n',
        '%': 'pct',
        'average': 'avg',
        'reporting category': 'rpt cat',
        ' ': '_',
        '__': '_',
    }
    for old, new in replacements.items():
        variable_name = re.sub(r"\([^()]*\)", "", variable_name).strip()
        variable_name = variable_name.replace(old, new)
    return variable_name
In [30]:
df["name"] = df["desc_list"].apply(lambda x: rename_variable(x))
df.head(20)
Out[30]:
Variable Format_Type Description desc_list subject student_group name
0 campus Character Campus Number [Campus Number] None None campus_number
1 year Character Test Administration Year [Test Administration Year] None None test_administration_year
2 region Character Education Service Center (Region) Number [Education Service Center (Region) Number] None None education_service_center_number
3 district Character District Number [District Number] None None district_number
4 dname Character District Name [District Name] None None district_name
5 cname Character Campus Name [Campus Name] None None campus_name
6 grade Character Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade) [Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)] None None tested_grade
7 r_all_docs_n Numeric # Answer Documents Submitted -- Reading -- All Students [# Answer Documents Submitted , Reading , All Students] Reading All Students n_answer_documents_submitted
8 r_all_abs_n Numeric # Absent - Not Tested -- Reading -- All Students [# Absent - Not Tested , Reading , All Students] Reading All Students n_absent_not_tested
9 r_all_oth_n Numeric # Other - Not Tested -- Reading -- All Students [# Other - Not Tested , Reading , All Students] Reading All Students n_other_not_tested
10 m_all_docs_n Numeric # Answer Documents Submitted -- Mathematics -- All Students [# Answer Documents Submitted , Mathematics , All Students] Mathematics All Students n_answer_documents_submitted
11 m_all_abs_n Numeric # Absent - Not Tested -- Mathematics -- All Students [# Absent - Not Tested , Mathematics , All Students] Mathematics All Students n_absent_not_tested
12 m_all_oth_n Numeric # Other - Not Tested -- Mathematics -- All Students [# Other - Not Tested , Mathematics , All Students] Mathematics All Students n_other_not_tested
13 r_all_d Numeric # Tested -- Reading -- All Students [# Tested , Reading , All Students] Reading All Students n_tested
14 r_sexm_d Numeric # Tested -- Reading -- Male Students [# Tested , Reading , Male Students] Reading Male Students n_tested
15 r_sexf_d Numeric # Tested -- Reading -- Female Students [# Tested , Reading , Female Students] Reading Female Students n_tested
16 r_sexv_d Numeric # Tested -- Reading -- No Sex Info Students [# Tested , Reading , No Sex Info Students] Reading No Sex Info Students n_tested
17 r_ethh_d Numeric # Tested -- Reading -- Hispanic/Latino Students [# Tested , Reading , Hispanic/Latino Students] Reading Hispanic/Latino Students n_tested
18 r_ethi_d Numeric # Tested -- Reading -- American Indian or Alaska Native Students [# Tested , Reading , American Indian or Alaska Native Students] Reading American Indian or Alaska Native Students n_tested
19 r_etha_d Numeric # Tested -- Reading -- Asian Students [# Tested , Reading , Asian Students] Reading Asian Students n_tested

Bringing it All Together

Because we're going to iterate over each variable list (there are six files: one for each grade between 3 and 8), we can simplify this by running all these change operations at once. We also want to remove unneeded columns and rename the original columns to standardize the naming conventions and make for an easy join condition later.

In [31]:
def add_descriptors(df):
    df["Variable"] = df["Variable"].str.strip()
    df["Description"] = df["Description"].str.strip()
    df["desc_list"] = df["Description"].apply(lambda x: x.split('--'))
    df["name"] = df["desc_list"].apply(lambda x: rename_variable(x))
    df["subject"] = df["desc_list"].apply(lambda x: get_subject(x))
    df["student_group"] = df["desc_list"].apply(lambda x: get_student_group(x))
    df.drop(columns=["Description", "desc_list"], inplace=True)
    df.rename(columns={'Variable': 'variable', 'Format_Type': 'format_type'}, inplace=True)
    return df

A Path Forward

That means it would be possible to pivot this data long for each distinct variable after extracting the subject and student group from the variable description. This information, though, is split across two different files: the test data in the .dat files and the variable descriptions in the .xls files.

My plan then is to:

  1. Match the corresponding dat file to its variable list
  2. Pivot the test data long so that each variable and value is a single row
  3. Extract the test subject and student group data points from the variable description
  4. Join the two datasets together using the variable name as a unique identifier
  5. Once more pivot the combined data such that each distinct variable becomes its own column
  6. Combine those standardized data structures for each grade into a single dataset
  7. Write that new dataset out to a compressed csv file for importing into BigQuery
In [32]:
base_url = "https://tea.texas.gov/sites/default/files/"
dat_files = ["cfy21e3.dat", "cfy21e4.dat", "cfy21e5.dat", "cfy21e6.dat", "cfy21e7.dat", "cfy21e8.dat"]
var_files = ["fy21_varlist_g03.xls", "fy21_varlist_g04.xls", "fy21_varlist_g05.xls", "fy21_varlist_g06.xls", "fy21_varlist_g07.xls", "fy21_varlist_g08.xls"]

# 1. Match the dat file with its corresponding var file
all_files = zip(dat_files, var_files)

# Create a list to concatenate the resulting grade level data frames later
all_grades = []

for dat_file, var_file in all_files:
    test_data = pd.read_csv(f"{base_url}{dat_file}", dtype="string")
    variables = pd.read_excel(f"{base_url}{var_file}", dtype="string")

    print("Parsing:", dat_file, "rows/cols:", test_data.shape)


    # 2. melt the dataframe (pivot wide to long) but keep all the site and grade identifiers out of the pivot
    identifiers = variables[variables.Format_Type == 'Character'].Variable.tolist()
    id_vars = [identifier.upper() for identifier in identifiers]
    test_data = pd.melt(test_data, id_vars=id_vars)
    print("Pivoted:", dat_file, "rows/cols:", test_data.shape)

    # 3. extract the subject and student groups
    variables = add_descriptors(variables)

    # 4. join the two dataset together using the variable name as the unique identifier
    combined = test_data.merge(variables, how='left', on='variable')
    print("Joined:", dat_file, "+", var_file, "rows/cols:", combined.shape)

    # 5. pivot back wide but with only the distinct variables
    cols = combined.columns.tolist()
    combined[cols] = combined[cols].apply(lambda x: x.str.strip())
    combined.columns = combined.columns.str.lower()
    columns = ['campus','year','region','district','dname','cname','grade','format_type','subject','student_group','name']
    combined = combined.set_index(columns)['value'].unstack().reset_index()
    print("Combined:", dat_file, "+", var_file, "rows/cols:", combined.shape)

    # 6. append data for each grade into a single data structure and concatenate
    all_grades.append(combined)

df = pd.concat(all_grades)
print("Final rows/cols:", df.shape)

# 7. write to compressed csv
df.to_csv("TX_STAAR_3_8_2021.csv.gz", index=False, compression="gzip")
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e3.dat rows/cols: (4643, 2093)
Pivoted: cfy21e3.dat rows/cols: (9685298, 9)
Joined: cfy21e3.dat + fy21_varlist_g03.xls rows/cols: (9685298, 13)
Combined: cfy21e3.dat + fy21_varlist_g03.xls rows/cols: (571089, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e4.dat rows/cols: (4622, 3075)
Pivoted: cfy21e4.dat rows/cols: (14180296, 9)
Joined: cfy21e4.dat + fy21_varlist_g04.xls rows/cols: (14180296, 13)
Combined: cfy21e4.dat + fy21_varlist_g04.xls rows/cols: (850448, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e5.dat rows/cols: (4415, 3197)
Pivoted: cfy21e5.dat rows/cols: (14083850, 9)
Joined: cfy21e5.dat + fy21_varlist_g05.xls rows/cols: (14083850, 13)
Combined: cfy21e5.dat + fy21_varlist_g05.xls rows/cols: (812360, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e6.dat rows/cols: (2700, 2093)
Pivoted: cfy21e6.dat rows/cols: (5632200, 9)
Joined: cfy21e6.dat + fy21_varlist_g06.xls rows/cols: (5632200, 13)
Combined: cfy21e6.dat + fy21_varlist_g06.xls rows/cols: (332100, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e7.dat rows/cols: (2337, 3075)
Pivoted: cfy21e7.dat rows/cols: (7169916, 9)
Joined: cfy21e7.dat + fy21_varlist_g07.xls rows/cols: (7169916, 13)
Combined: cfy21e7.dat + fy21_varlist_g07.xls rows/cols: (430008, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e8.dat rows/cols: (2370, 4301)
Pivoted: cfy21e8.dat rows/cols: (10176780, 9)
Joined: cfy21e8.dat + fy21_varlist_g08.xls rows/cols: (10176780, 13)
Combined: cfy21e8.dat + fy21_varlist_g08.xls rows/cols: (580650, 35)
Final rows/cols: (3576655, 35)
In [33]:
df
Out[33]:
name campus year region district dname cname grade format_type subject student_group NaN avg_scale_score n_absent_not_tested n_answer_documents_submitted n_approaches_grade_level_performance n_avg_items_correctrpt_cat_1 n_avg_items_correctrpt_cat_2 n_avg_items_correctrpt_cat_3 n_avg_items_correctrpt_cat_4 n_did_not_meet_grade_level_performance n_masters_grade_level_performance n_meets_grade_level_performance n_other_not_tested n_tested pct_absent_not_tested pct_answer_documents_submitted pct_approaches_grade_level_performance pct_avg_items_correctrpt_cat_1 pct_avg_items_correctrpt_cat_2 pct_avg_items_correctrpt_cat_3 pct_avg_items_correctrpt_cat_4 pct_did_not_meet_grade_level_performance pct_masters_grade_level_performance pct_meets_grade_level_performance pct_other_not_tested
0 001902103 21 07 001902 CAYUGA ISD CAYUGA ELEM. 03 <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 001902103 21 07 001902 CAYUGA ISD CAYUGA ELEM. 03 Numeric Mathematics All Students NaN 1607 0 26 22 6.3 10.7 5 3.3 4 16 19 0 26 0 100 85 79 83 71 84 15 62 73 0
2 001902103 21 07 001902 CAYUGA ISD CAYUGA ELEM. 03 Numeric Mathematics American Indian or Alaska Native Students NaN <NA> NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN 0 NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN
3 001902103 21 07 001902 CAYUGA ISD CAYUGA ELEM. 03 Numeric Mathematics Asian Students NaN <NA> NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN 0 NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN
4 001902103 21 07 001902 CAYUGA ISD CAYUGA ELEM. 03 Numeric Mathematics At-Risk Students NaN 1465 NaN NaN 5 5.1 9.3 3.3 2.7 2 2 3 NaN 7 NaN NaN 71 64 71 47 68 29 29 43 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
580645 254902001 21 20 254902 LA PRYOR ISD LA PRYOR H.S. 08 Numeric Social Studies Title-I Participant Students Codes: 6 7 9 NaN 3425 NaN NaN 12 6.9 5.4 4.2 3.2 23 0 4 NaN 35 NaN NaN 34 41 54 42 46 66 0 11 NaN
580646 254902001 21 20 254902 LA PRYOR ISD LA PRYOR H.S. 08 Numeric Social Studies Transitional Bilingual/Early Exit Students NaN <NA> NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN 0 NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN
580647 254902001 21 20 254902 LA PRYOR ISD LA PRYOR H.S. 08 Numeric Social Studies Transitional Bilingual/Late Exit Students NaN <NA> NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN 0 NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN
580648 254902001 21 20 254902 LA PRYOR ISD LA PRYOR H.S. 08 Numeric Social Studies Two or More Races Students NaN <NA> NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN 0 NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN
580649 254902001 21 20 254902 LA PRYOR ISD LA PRYOR H.S. 08 Numeric Social Studies White Students NaN <NA> NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN 1 NaN NaN <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN

3576655 rows × 35 columns

Conclusion

The resulting dataset/file ends up being ~3.5 million rows in length and a consistent 35 columns wide. You can download the reshaped dataset here: TX STAAR 2021 - Grades 3-8

We can now run simpler queries on the reshaped data such as:

SELECT *
FROM TX_STAAR_2021
WHERE subject = 'Reading'
AND student_group = 'All Students'
AND grade IN ('6', '7', '8')

A similar query using the original datasets would've required unioning three different datasets and a much more complicated set of column matching to filter down to the right subject and student group.

In our production environment, the approach is similar but we initially load the raw dat and var files into Google Cloud Storage and as well as the resulting csv. That is then converted to avro for better query performance and an external table is generated in BigQuery using their API.