import pandas as pd
from pathlib import Path
import xlrd
import os
# Display options for pandas data frames
'display.max_columns', None)
pd.set_option('max_colwidth', None)
pd.set_option('max_seq_item', None) pd.set_option(
Data Transformation of the Texas STAAR Aggregate Data for 2020-2021
Data files available here on the TEA website.
The STAAR data files contain results for 60+ student demographic groups and each variable is repeated in a separate column for both demographic and test subject. As you will see below, this results in data files that are several thousand columns wide and are a different schema for every grade.
These files are designed for use in SAS and SPSS, and the website itself indicates that the number of variables in these files is too great to import into Microsoft Access or some versions of Microsoft Excel without significant truncation.
However, my goal was to pivot this data into a standardized format and load into BigQuery for cross analysis with other state assessment data. What follows is my approach in Python, but I am sure there are other potentially simpler methods than mine, I hope this demonstration helps other approach this dataset for analysis and possibly encourages others to improve upon this method.
The original files are linked below for easy access.
Campus level data for Grades 3-8
- Grade 3: test data | variable list
- Grade 4: test data | variable list
- Grade 5: test data | variable list
- Grade 6: test data | variable list
- Grade 7: test data | variable list
- Grade 8: test data | variable list
Getting the Lay of the Land
Before determining a solution and final schema, I wanted to understand the structure of these files. This can be accomplished by looping over the linked dat files and getting their shapes.
= "https://tea.texas.gov/sites/default/files/"
base_url = ["cfy21e3.dat", "cfy21e4.dat", "cfy21e5.dat", "cfy21e6.dat", "cfy21e7.dat", "cfy21e8.dat"]
dat_files
for dat_file in dat_files:
= f"{base_url}{dat_file}"
url = Path(url).stem
content = pd.read_csv(url)
df print(content, df.shape)
cfy21e3 (4643, 2093)
cfy21e4 (4622, 3075)
cfy21e5 (4415, 3197)
cfy21e6 (2700, 2093)
cfy21e7 (2337, 3075)
cfy21e8 (2370, 4301)
Each file is named with this convention (as far as I can discern): - c: Campus level data - fy21: Fiscal year 2021 - e#: English grade # (English, because grades 3-5 also include Spanish results)
The numbers in the accompanying parenthesis represent the count of rows and columns, respectively.
Inconsistent Data Shape
We can see from this quick file inspection that combining these datasets into a single unified structure will require more than just concatenation. The number of columns is jagged/inconsistent. Reviewing the variable files shows that each grade has differences in the variables reported. Some grades have Reading and Mathematics, while other include a Writing test as well. There are additional differences, but that is one of the most common.
They are so wide, because each variable such as # Tested
is repeated for each subject and each student group (60+ in total). We can also see that the naming of these columns is encoded/abbreviated, such as r_all_d
for # Tested -- Reading -- All Students
or r_eth2_d
for # Tested -- Reading -- Two or More Races Students
.
Finding a Pattern
There is something of a pattern to this naming convention you may have noticed: - r: reading - all: all students - d: # tested
This becomes more obvious with subsequent examples such as r_all_unsatgl_nm
for # Did Not Meet Grade Level Performance -- Reading -- All Students
.
A First Attempt
My initial thought was to split these variable names using the the _
as a delimiter, but this was met with limited success, again due to inconsistent usage.
Note: You will see the following warning in places WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
. This is due to an encoding issue with the TEA provided files and can be ignored as it does not impact our ability to parse the files.
= 'https://tea.texas.gov/sites/default/files/fy21_varlist_g03.xls'
grade3_vars
= pd.read_excel(grade3_vars)
df 20) df.head(
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Variable | Format_Type | Description | |
---|---|---|---|
0 | campus | Character | Campus Number |
1 | year | Character | Test Administration Year |
2 | region | Character | Education Service Center (Region) Number |
3 | district | Character | District Number |
4 | dname | Character | District Name |
5 | cname | Character | Campus Name |
6 | grade | Character | Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade) |
7 | r_all_docs_n | Numeric | # Answer Documents Submitted -- Reading -- All Students |
8 | r_all_abs_n | Numeric | # Absent - Not Tested -- Reading -- All Students |
9 | r_all_oth_n | Numeric | # Other - Not Tested -- Reading -- All Students |
10 | m_all_docs_n | Numeric | # Answer Documents Submitted -- Mathematics -- All Students |
11 | m_all_abs_n | Numeric | # Absent - Not Tested -- Mathematics -- All Students |
12 | m_all_oth_n | Numeric | # Other - Not Tested -- Mathematics -- All Students |
13 | r_all_d | Numeric | # Tested -- Reading -- All Students |
14 | r_sexm_d | Numeric | # Tested -- Reading -- Male Students |
15 | r_sexf_d | Numeric | # Tested -- Reading -- Female Students |
16 | r_sexv_d | Numeric | # Tested -- Reading -- No Sex Info Students |
17 | r_ethh_d | Numeric | # Tested -- Reading -- Hispanic/Latino Students |
18 | r_ethi_d | Numeric | # Tested -- Reading -- American Indian or Alaska Native Students |
19 | r_etha_d | Numeric | # Tested -- Reading -- Asian Students |
It appears at first glance, that the pattern is subject, student demographic group, variable type. However, we can see this doesn’t remain consistent.
15) df.tail(
Variable | Format_Type | Description | |
---|---|---|---|
2078 | m_spen_pct_cat4 | Numeric | % Avg Items Correct--Reporting Category 4 -- Mathematics -- Not Special Ed Students |
2079 | m_spev_avg_cat4 | Numeric | # Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info Special Ed Students |
2080 | m_spev_pct_cat4 | Numeric | % Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info Special Ed Students |
2081 | m_gify_avg_cat4 | Numeric | # Avg Items Correct--Reporting Category 4 -- Mathematics -- Gifted/Talented Students |
2082 | m_gify_pct_cat4 | Numeric | % Avg Items Correct--Reporting Category 4 -- Mathematics -- Gifted/Talented Students |
2083 | m_gifn_avg_cat4 | Numeric | # Avg Items Correct--Reporting Category 4 -- Mathematics -- Not Gifted/Talented Students |
2084 | m_gifn_pct_cat4 | Numeric | % Avg Items Correct--Reporting Category 4 -- Mathematics -- Not Gifted/Talented Students |
2085 | m_gifv_avg_cat4 | Numeric | # Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info Gifted/Talented Students |
2086 | m_gifv_pct_cat4 | Numeric | % Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info Gifted/Talented Students |
2087 | m_atry_avg_cat4 | Numeric | # Avg Items Correct--Reporting Category 4 -- Mathematics -- At-Risk Students |
2088 | m_atry_pct_cat4 | Numeric | % Avg Items Correct--Reporting Category 4 -- Mathematics -- At-Risk Students |
2089 | m_atrn_avg_cat4 | Numeric | # Avg Items Correct--Reporting Category 4 -- Mathematics -- Not At-Risk Students |
2090 | m_atrn_pct_cat4 | Numeric | % Avg Items Correct--Reporting Category 4 -- Mathematics -- Not At-Risk Students |
2091 | m_atrv_avg_cat4 | Numeric | # Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info At-Risk Students |
2092 | m_atrv_pct_cat4 | Numeric | % Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info At-Risk Students |
In these latter cases, there is a 4th variable reporting category that comes before the subject and student group.
195:215] df[
Variable | Format_Type | Description | |
---|---|---|---|
195 | r_atrv_unsatgl_nm | Numeric | # Did Not Meet Grade Level Performance -- Reading -- No Info At-Risk Students |
196 | r_all_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- All Students |
197 | r_sexm_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Male Students |
198 | r_sexf_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Female Students |
199 | r_sexv_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- No Sex Info Students |
200 | r_ethh_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Hispanic/Latino Students |
201 | r_ethi_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- American Indian or Alaska Native Students |
202 | r_etha_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Asian Students |
203 | r_ethb_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Black or African American Students |
204 | r_ethp_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Native Hawaiian or Other Pacific Islander Students |
205 | r_ethw_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- White Students |
206 | r_eth2_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Two or More Races Students |
207 | r_ethv_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- No Ethnicity Info Students |
208 | r_ecoy_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Econ Disadv Students Codes: 1 2 9 |
209 | r_econ_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Not Econ Disadv Students |
210 | r_eco1_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Free Meals Students |
211 | r_eco2_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Reduced-Price Meals Students |
212 | r_eco9_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Other Econ Disadvantaged Students |
213 | r_ecov_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- No Info Econ Students |
214 | r_ti1y_approgl_nm | Numeric | # Approaches Grade Level Performance -- Reading -- Title-I Participant Students Codes: 6 7 9 |
There is another pattern with 4 parts for the proficiency levels that is subject, student group, and then a two part string for the variable type.
These differences would make it challenging to simply split on the _
character. Even distinguishing between the three part and four part variables isn’t sufficient, because there are two types of the four part variables.
A Different Solution
However, we can simply ignore the variables. We have a description column that also has it’s own delimiter: --
.
= df.Description.values.tolist()
descriptions
for description in descriptions[:20]:
= description.split('--')
parts print(parts)
print('\n...\n')
for description in descriptions[-5:]:
= description.split('--')
parts print(parts)
['Campus Number']
['Test Administration Year']
['Education Service Center (Region) Number']
['District Number']
['District Name']
['Campus Name']
['Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)']
['# Answer Documents Submitted ', ' Reading ', ' All Students']
['# Absent - Not Tested ', ' Reading ', ' All Students']
['# Other - Not Tested ', ' Reading ', ' All Students']
['# Answer Documents Submitted ', ' Mathematics ', ' All Students']
['# Absent - Not Tested ', ' Mathematics ', ' All Students']
['# Other - Not Tested ', ' Mathematics ', ' All Students']
['# Tested ', ' Reading ', ' All Students']
['# Tested ', ' Reading ', ' Male Students']
['# Tested ', ' Reading ', ' Female Students']
['# Tested ', ' Reading ', ' No Sex Info Students']
['# Tested ', ' Reading ', ' Hispanic/Latino Students']
['# Tested ', ' Reading ', ' American Indian or Alaska Native Students']
['# Tested ', ' Reading ', ' Asian Students']
...
['% Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' At-Risk Students']
['# Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' Not At-Risk Students']
['% Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' Not At-Risk Students']
['# Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' No Info At-Risk Students']
['% Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' No Info At-Risk Students']
This mostly works. We have three different types of descriptions/variables: - 1 part variables like ['Campus Number']
- 3 part variables like ['# Tested ', ' Reading ', ' All Students']
- 4 part variables like ['% Avg Items Correct', 'Reporting Category 4 ', ' Mathematics ', ' At-Risk Students']
That means we can use the list length to determine the order of the variable values: variable name, test subject, and student group. The fourth part reporting category is basically a modifier for the variable name so we can combine those.
By examining a single variable name, we can see that # Tested
, for example, isn’t actually 122 different variables but rather a single variable with 122 permutations of test subject and student group: 61 student groups * 2 test subjects
str.contains('# Tested')] df[df.Description.
Variable | Format_Type | Description | |
---|---|---|---|
13 | r_all_d | Numeric | # Tested -- Reading -- All Students |
14 | r_sexm_d | Numeric | # Tested -- Reading -- Male Students |
15 | r_sexf_d | Numeric | # Tested -- Reading -- Female Students |
16 | r_sexv_d | Numeric | # Tested -- Reading -- No Sex Info Students |
17 | r_ethh_d | Numeric | # Tested -- Reading -- Hispanic/Latino Students |
... | ... | ... | ... |
130 | m_gifn_d | Numeric | # Tested -- Mathematics -- Not Gifted/Talented Students |
131 | m_gifv_d | Numeric | # Tested -- Mathematics -- No Info Gifted/Talented Students |
132 | m_atry_d | Numeric | # Tested -- Mathematics -- At-Risk Students |
133 | m_atrn_d | Numeric | # Tested -- Mathematics -- Not At-Risk Students |
134 | m_atrv_d | Numeric | # Tested -- Mathematics -- No Info At-Risk Students |
122 rows × 3 columns
Extracting the Variable Descriptors
In addition to pulling out the test subject and student group from the variable description, it will be useful to rename the columns into a common naming convention that is more human readable than the current encodings.
Example:
If we take the initial variable example r_all_d
for # Tested -- Reading -- All Students
and remove the subject and student group we would be left with simply d
as the variable/column name. But a better description would be # Tested
or even better (to eliminate special characters that would be problematic in the database) n_tested
.
Splitting the Description
To start let’s add a new column to the dataframe that has the array of descriptors. We can also remove any leading or trailing whitespace from both the variable and description to prevent mismatch issues later.
"Variable"] = df["Variable"].str.strip()
df["Description"] = df["Description"].str.strip()
df["desc_list"] = df["Description"].apply(lambda x: x.split("--"))
df[ df
Variable | Format_Type | Description | desc_list | |
---|---|---|---|---|
0 | campus | Character | Campus Number | [Campus Number] |
1 | year | Character | Test Administration Year | [Test Administration Year] |
2 | region | Character | Education Service Center (Region) Number | [Education Service Center (Region) Number] |
3 | district | Character | District Number | [District Number] |
4 | dname | Character | District Name | [District Name] |
... | ... | ... | ... | ... |
2088 | m_atry_pct_cat4 | Numeric | % Avg Items Correct--Reporting Category 4 -- Mathematics -- At-Risk Students | [% Avg Items Correct, Reporting Category 4 , Mathematics , At-Risk Students] |
2089 | m_atrn_avg_cat4 | Numeric | # Avg Items Correct--Reporting Category 4 -- Mathematics -- Not At-Risk Students | [# Avg Items Correct, Reporting Category 4 , Mathematics , Not At-Risk Students] |
2090 | m_atrn_pct_cat4 | Numeric | % Avg Items Correct--Reporting Category 4 -- Mathematics -- Not At-Risk Students | [% Avg Items Correct, Reporting Category 4 , Mathematics , Not At-Risk Students] |
2091 | m_atrv_avg_cat4 | Numeric | # Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info At-Risk Students | [# Avg Items Correct, Reporting Category 4 , Mathematics , No Info At-Risk Students] |
2092 | m_atrv_pct_cat4 | Numeric | % Avg Items Correct--Reporting Category 4 -- Mathematics -- No Info At-Risk Students | [% Avg Items Correct, Reporting Category 4 , Mathematics , No Info At-Risk Students] |
2093 rows × 4 columns
Extracting Test Subject
In most cases the test subject is the 2nd part of the variable description with two exceptions: if the variable is one of the campus/grade variables that only has a single part or for the four part variables that include reporting category.
def get_subject(desc_list):
"""
Return the 2nd part of 3 part descriptors, the 3rd part of 4 part descriptors, or None.
Note: Else None is not necessary here, since functions return None by default in Python
"""
= len(desc_list)
length if length == 4:
return desc_list[2]
elif length > 1:
return desc_list[1]
"subject"] = df["desc_list"].apply(lambda x: get_subject(x))
df[20) df.head(
Variable | Format_Type | Description | desc_list | subject | |
---|---|---|---|---|---|
0 | campus | Character | Campus Number | [Campus Number] | None |
1 | year | Character | Test Administration Year | [Test Administration Year] | None |
2 | region | Character | Education Service Center (Region) Number | [Education Service Center (Region) Number] | None |
3 | district | Character | District Number | [District Number] | None |
4 | dname | Character | District Name | [District Name] | None |
5 | cname | Character | Campus Name | [Campus Name] | None |
6 | grade | Character | Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade) | [Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)] | None |
7 | r_all_docs_n | Numeric | # Answer Documents Submitted -- Reading -- All Students | [# Answer Documents Submitted , Reading , All Students] | Reading |
8 | r_all_abs_n | Numeric | # Absent - Not Tested -- Reading -- All Students | [# Absent - Not Tested , Reading , All Students] | Reading |
9 | r_all_oth_n | Numeric | # Other - Not Tested -- Reading -- All Students | [# Other - Not Tested , Reading , All Students] | Reading |
10 | m_all_docs_n | Numeric | # Answer Documents Submitted -- Mathematics -- All Students | [# Answer Documents Submitted , Mathematics , All Students] | Mathematics |
11 | m_all_abs_n | Numeric | # Absent - Not Tested -- Mathematics -- All Students | [# Absent - Not Tested , Mathematics , All Students] | Mathematics |
12 | m_all_oth_n | Numeric | # Other - Not Tested -- Mathematics -- All Students | [# Other - Not Tested , Mathematics , All Students] | Mathematics |
13 | r_all_d | Numeric | # Tested -- Reading -- All Students | [# Tested , Reading , All Students] | Reading |
14 | r_sexm_d | Numeric | # Tested -- Reading -- Male Students | [# Tested , Reading , Male Students] | Reading |
15 | r_sexf_d | Numeric | # Tested -- Reading -- Female Students | [# Tested , Reading , Female Students] | Reading |
16 | r_sexv_d | Numeric | # Tested -- Reading -- No Sex Info Students | [# Tested , Reading , No Sex Info Students] | Reading |
17 | r_ethh_d | Numeric | # Tested -- Reading -- Hispanic/Latino Students | [# Tested , Reading , Hispanic/Latino Students] | Reading |
18 | r_ethi_d | Numeric | # Tested -- Reading -- American Indian or Alaska Native Students | [# Tested , Reading , American Indian or Alaska Native Students] | Reading |
19 | r_etha_d | Numeric | # Tested -- Reading -- Asian Students | [# Tested , Reading , Asian Students] | Reading |
Extracting Student Group
In all cases (except the single descriptor variables) the student group is included at the end of the description.
def get_student_group(desc_list):
"""
Unless the description has only one part, return the final part.
"""
if len(desc_list) > 1:
return desc_list[-1]
"student_group"] = df["desc_list"].apply(lambda x: get_student_group(x))
df[20) df.head(
Variable | Format_Type | Description | desc_list | subject | student_group | |
---|---|---|---|---|---|---|
0 | campus | Character | Campus Number | [Campus Number] | None | None |
1 | year | Character | Test Administration Year | [Test Administration Year] | None | None |
2 | region | Character | Education Service Center (Region) Number | [Education Service Center (Region) Number] | None | None |
3 | district | Character | District Number | [District Number] | None | None |
4 | dname | Character | District Name | [District Name] | None | None |
5 | cname | Character | Campus Name | [Campus Name] | None | None |
6 | grade | Character | Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade) | [Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)] | None | None |
7 | r_all_docs_n | Numeric | # Answer Documents Submitted -- Reading -- All Students | [# Answer Documents Submitted , Reading , All Students] | Reading | All Students |
8 | r_all_abs_n | Numeric | # Absent - Not Tested -- Reading -- All Students | [# Absent - Not Tested , Reading , All Students] | Reading | All Students |
9 | r_all_oth_n | Numeric | # Other - Not Tested -- Reading -- All Students | [# Other - Not Tested , Reading , All Students] | Reading | All Students |
10 | m_all_docs_n | Numeric | # Answer Documents Submitted -- Mathematics -- All Students | [# Answer Documents Submitted , Mathematics , All Students] | Mathematics | All Students |
11 | m_all_abs_n | Numeric | # Absent - Not Tested -- Mathematics -- All Students | [# Absent - Not Tested , Mathematics , All Students] | Mathematics | All Students |
12 | m_all_oth_n | Numeric | # Other - Not Tested -- Mathematics -- All Students | [# Other - Not Tested , Mathematics , All Students] | Mathematics | All Students |
13 | r_all_d | Numeric | # Tested -- Reading -- All Students | [# Tested , Reading , All Students] | Reading | All Students |
14 | r_sexm_d | Numeric | # Tested -- Reading -- Male Students | [# Tested , Reading , Male Students] | Reading | Male Students |
15 | r_sexf_d | Numeric | # Tested -- Reading -- Female Students | [# Tested , Reading , Female Students] | Reading | Female Students |
16 | r_sexv_d | Numeric | # Tested -- Reading -- No Sex Info Students | [# Tested , Reading , No Sex Info Students] | Reading | No Sex Info Students |
17 | r_ethh_d | Numeric | # Tested -- Reading -- Hispanic/Latino Students | [# Tested , Reading , Hispanic/Latino Students] | Reading | Hispanic/Latino Students |
18 | r_ethi_d | Numeric | # Tested -- Reading -- American Indian or Alaska Native Students | [# Tested , Reading , American Indian or Alaska Native Students] | Reading | American Indian or Alaska Native Students |
19 | r_etha_d | Numeric | # Tested -- Reading -- Asian Students | [# Tested , Reading , Asian Students] | Reading | Asian Students |
Creating a New Variable Name
There are a few problematic characters like #
and %
in the names of the variables that should be replaced as well as replacing spaces and abbreviating some longer words in the names. This also uses some regex parsing to remove any text contained in parenthesis.
import re
def rename_variable(desc_list):
"""
Replace special characters and append reporting category when present.
"""
= desc_list[0].lower().strip()
variable_name if len(desc_list) == 4:
+= desc_list[1].lower().strip()
variable_name = {
replacements '-': '',
'#': 'n',
'%': 'pct',
'average': 'avg',
'reporting category': 'rpt cat',
' ': '_',
'__': '_',
}for old, new in replacements.items():
= re.sub(r"\([^()]*\)", "", variable_name).strip()
variable_name = variable_name.replace(old, new)
variable_name return variable_name
"name"] = df["desc_list"].apply(lambda x: rename_variable(x))
df[20) df.head(
Variable | Format_Type | Description | desc_list | subject | student_group | name | |
---|---|---|---|---|---|---|---|
0 | campus | Character | Campus Number | [Campus Number] | None | None | campus_number |
1 | year | Character | Test Administration Year | [Test Administration Year] | None | None | test_administration_year |
2 | region | Character | Education Service Center (Region) Number | [Education Service Center (Region) Number] | None | None | education_service_center_number |
3 | district | Character | District Number | [District Number] | None | None | district_number |
4 | dname | Character | District Name | [District Name] | None | None | district_name |
5 | cname | Character | Campus Name | [Campus Name] | None | None | campus_name |
6 | grade | Character | Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade) | [Tested Grade (Usually the Enrolled Grade Unless Student Tested Above Grade)] | None | None | tested_grade |
7 | r_all_docs_n | Numeric | # Answer Documents Submitted -- Reading -- All Students | [# Answer Documents Submitted , Reading , All Students] | Reading | All Students | n_answer_documents_submitted |
8 | r_all_abs_n | Numeric | # Absent - Not Tested -- Reading -- All Students | [# Absent - Not Tested , Reading , All Students] | Reading | All Students | n_absent_not_tested |
9 | r_all_oth_n | Numeric | # Other - Not Tested -- Reading -- All Students | [# Other - Not Tested , Reading , All Students] | Reading | All Students | n_other_not_tested |
10 | m_all_docs_n | Numeric | # Answer Documents Submitted -- Mathematics -- All Students | [# Answer Documents Submitted , Mathematics , All Students] | Mathematics | All Students | n_answer_documents_submitted |
11 | m_all_abs_n | Numeric | # Absent - Not Tested -- Mathematics -- All Students | [# Absent - Not Tested , Mathematics , All Students] | Mathematics | All Students | n_absent_not_tested |
12 | m_all_oth_n | Numeric | # Other - Not Tested -- Mathematics -- All Students | [# Other - Not Tested , Mathematics , All Students] | Mathematics | All Students | n_other_not_tested |
13 | r_all_d | Numeric | # Tested -- Reading -- All Students | [# Tested , Reading , All Students] | Reading | All Students | n_tested |
14 | r_sexm_d | Numeric | # Tested -- Reading -- Male Students | [# Tested , Reading , Male Students] | Reading | Male Students | n_tested |
15 | r_sexf_d | Numeric | # Tested -- Reading -- Female Students | [# Tested , Reading , Female Students] | Reading | Female Students | n_tested |
16 | r_sexv_d | Numeric | # Tested -- Reading -- No Sex Info Students | [# Tested , Reading , No Sex Info Students] | Reading | No Sex Info Students | n_tested |
17 | r_ethh_d | Numeric | # Tested -- Reading -- Hispanic/Latino Students | [# Tested , Reading , Hispanic/Latino Students] | Reading | Hispanic/Latino Students | n_tested |
18 | r_ethi_d | Numeric | # Tested -- Reading -- American Indian or Alaska Native Students | [# Tested , Reading , American Indian or Alaska Native Students] | Reading | American Indian or Alaska Native Students | n_tested |
19 | r_etha_d | Numeric | # Tested -- Reading -- Asian Students | [# Tested , Reading , Asian Students] | Reading | Asian Students | n_tested |
Bringing it All Together
Because we’re going to iterate over each variable list (there are six files: one for each grade between 3 and 8), we can simplify this by running all these change operations at once. We also want to remove unneeded columns and rename the original columns to standardize the naming conventions and make for an easy join condition later.
def add_descriptors(df):
"Variable"] = df["Variable"].str.strip()
df["Description"] = df["Description"].str.strip()
df["desc_list"] = df["Description"].apply(lambda x: x.split('--'))
df["name"] = df["desc_list"].apply(lambda x: rename_variable(x))
df["subject"] = df["desc_list"].apply(lambda x: get_subject(x))
df["student_group"] = df["desc_list"].apply(lambda x: get_student_group(x))
df[=["Description", "desc_list"], inplace=True)
df.drop(columns={'Variable': 'variable', 'Format_Type': 'format_type'}, inplace=True)
df.rename(columnsreturn df
A Path Forward
That means it would be possible to pivot this data long for each distinct variable after extracting the subject and student group from the variable description. This information, though, is split across two different files: the test data in the .dat
files and the variable descriptions in the .xls
files.
My plan then is to: 1. Match the corresponding dat file to its variable list 2. Pivot the test data long so that each variable and value is a single row 3. Extract the test subject and student group data points from the variable description 4. Join the two datasets together using the variable name as a unique identifier 5. Once more pivot the combined data such that each distinct variable becomes its own column 6. Combine those standardized data structures for each grade into a single dataset 7. Write that new dataset out to a compressed csv file for importing into BigQuery
= "https://tea.texas.gov/sites/default/files/"
base_url = ["cfy21e3.dat", "cfy21e4.dat", "cfy21e5.dat", "cfy21e6.dat", "cfy21e7.dat", "cfy21e8.dat"]
dat_files = ["fy21_varlist_g03.xls", "fy21_varlist_g04.xls", "fy21_varlist_g05.xls", "fy21_varlist_g06.xls", "fy21_varlist_g07.xls", "fy21_varlist_g08.xls"]
var_files
# 1. Match the dat file with its corresponding var file
= zip(dat_files, var_files)
all_files
# Create a list to concatenate the resulting grade level data frames later
= []
all_grades
for dat_file, var_file in all_files:
= pd.read_csv(f"{base_url}{dat_file}", dtype="string")
test_data = pd.read_excel(f"{base_url}{var_file}", dtype="string")
variables
print("Parsing:", dat_file, "rows/cols:", test_data.shape)
# 2. melt the dataframe (pivot wide to long) but keep all the site and grade identifiers out of the pivot
= variables[variables.Format_Type == 'Character'].Variable.tolist()
identifiers = [identifier.upper() for identifier in identifiers]
id_vars = pd.melt(test_data, id_vars=id_vars)
test_data print("Pivoted:", dat_file, "rows/cols:", test_data.shape)
# 3. extract the subject and student groups
= add_descriptors(variables)
variables
# 4. join the two dataset together using the variable name as the unique identifier
= test_data.merge(variables, how='left', on='variable')
combined print("Joined:", dat_file, "+", var_file, "rows/cols:", combined.shape)
# 5. pivot back wide but with only the distinct variables
= combined.columns.tolist()
cols = combined[cols].apply(lambda x: x.str.strip())
combined[cols] = combined.columns.str.lower()
combined.columns = ['campus','year','region','district','dname','cname','grade','format_type','subject','student_group','name']
columns = combined.set_index(columns)['value'].unstack().reset_index()
combined print("Combined:", dat_file, "+", var_file, "rows/cols:", combined.shape)
# 6. append data for each grade into a single data structure and concatenate
all_grades.append(combined)
= pd.concat(all_grades)
df print("Final rows/cols:", df.shape)
# 7. write to compressed csv
"TX_STAAR_3_8_2021.csv.gz", index=False, compression="gzip") df.to_csv(
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e3.dat rows/cols: (4643, 2093)
Pivoted: cfy21e3.dat rows/cols: (9685298, 9)
Joined: cfy21e3.dat + fy21_varlist_g03.xls rows/cols: (9685298, 13)
Combined: cfy21e3.dat + fy21_varlist_g03.xls rows/cols: (571089, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e4.dat rows/cols: (4622, 3075)
Pivoted: cfy21e4.dat rows/cols: (14180296, 9)
Joined: cfy21e4.dat + fy21_varlist_g04.xls rows/cols: (14180296, 13)
Combined: cfy21e4.dat + fy21_varlist_g04.xls rows/cols: (850448, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e5.dat rows/cols: (4415, 3197)
Pivoted: cfy21e5.dat rows/cols: (14083850, 9)
Joined: cfy21e5.dat + fy21_varlist_g05.xls rows/cols: (14083850, 13)
Combined: cfy21e5.dat + fy21_varlist_g05.xls rows/cols: (812360, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e6.dat rows/cols: (2700, 2093)
Pivoted: cfy21e6.dat rows/cols: (5632200, 9)
Joined: cfy21e6.dat + fy21_varlist_g06.xls rows/cols: (5632200, 13)
Combined: cfy21e6.dat + fy21_varlist_g06.xls rows/cols: (332100, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e7.dat rows/cols: (2337, 3075)
Pivoted: cfy21e7.dat rows/cols: (7169916, 9)
Joined: cfy21e7.dat + fy21_varlist_g07.xls rows/cols: (7169916, 13)
Combined: cfy21e7.dat + fy21_varlist_g07.xls rows/cols: (430008, 35)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Parsing: cfy21e8.dat rows/cols: (2370, 4301)
Pivoted: cfy21e8.dat rows/cols: (10176780, 9)
Joined: cfy21e8.dat + fy21_varlist_g08.xls rows/cols: (10176780, 13)
Combined: cfy21e8.dat + fy21_varlist_g08.xls rows/cols: (580650, 35)
Final rows/cols: (3576655, 35)
df
name | campus | year | region | district | dname | cname | grade | format_type | subject | student_group | NaN | avg_scale_score | n_absent_not_tested | n_answer_documents_submitted | n_approaches_grade_level_performance | n_avg_items_correctrpt_cat_1 | n_avg_items_correctrpt_cat_2 | n_avg_items_correctrpt_cat_3 | n_avg_items_correctrpt_cat_4 | n_did_not_meet_grade_level_performance | n_masters_grade_level_performance | n_meets_grade_level_performance | n_other_not_tested | n_tested | pct_absent_not_tested | pct_answer_documents_submitted | pct_approaches_grade_level_performance | pct_avg_items_correctrpt_cat_1 | pct_avg_items_correctrpt_cat_2 | pct_avg_items_correctrpt_cat_3 | pct_avg_items_correctrpt_cat_4 | pct_did_not_meet_grade_level_performance | pct_masters_grade_level_performance | pct_meets_grade_level_performance | pct_other_not_tested |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 001902103 | 21 | 07 | 001902 | CAYUGA ISD | CAYUGA ELEM. | 03 | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
1 | 001902103 | 21 | 07 | 001902 | CAYUGA ISD | CAYUGA ELEM. | 03 | Numeric | Mathematics | All Students | NaN | 1607 | 0 | 26 | 22 | 6.3 | 10.7 | 5 | 3.3 | 4 | 16 | 19 | 0 | 26 | 0 | 100 | 85 | 79 | 83 | 71 | 84 | 15 | 62 | 73 | 0 |
2 | 001902103 | 21 | 07 | 001902 | CAYUGA ISD | CAYUGA ELEM. | 03 | Numeric | Mathematics | American Indian or Alaska Native Students | NaN | <NA> | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 0 | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN |
3 | 001902103 | 21 | 07 | 001902 | CAYUGA ISD | CAYUGA ELEM. | 03 | Numeric | Mathematics | Asian Students | NaN | <NA> | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 0 | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN |
4 | 001902103 | 21 | 07 | 001902 | CAYUGA ISD | CAYUGA ELEM. | 03 | Numeric | Mathematics | At-Risk Students | NaN | 1465 | NaN | NaN | 5 | 5.1 | 9.3 | 3.3 | 2.7 | 2 | 2 | 3 | NaN | 7 | NaN | NaN | 71 | 64 | 71 | 47 | 68 | 29 | 29 | 43 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
580645 | 254902001 | 21 | 20 | 254902 | LA PRYOR ISD | LA PRYOR H.S. | 08 | Numeric | Social Studies | Title-I Participant Students Codes: 6 7 9 | NaN | 3425 | NaN | NaN | 12 | 6.9 | 5.4 | 4.2 | 3.2 | 23 | 0 | 4 | NaN | 35 | NaN | NaN | 34 | 41 | 54 | 42 | 46 | 66 | 0 | 11 | NaN |
580646 | 254902001 | 21 | 20 | 254902 | LA PRYOR ISD | LA PRYOR H.S. | 08 | Numeric | Social Studies | Transitional Bilingual/Early Exit Students | NaN | <NA> | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 0 | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN |
580647 | 254902001 | 21 | 20 | 254902 | LA PRYOR ISD | LA PRYOR H.S. | 08 | Numeric | Social Studies | Transitional Bilingual/Late Exit Students | NaN | <NA> | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 0 | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN |
580648 | 254902001 | 21 | 20 | 254902 | LA PRYOR ISD | LA PRYOR H.S. | 08 | Numeric | Social Studies | Two or More Races Students | NaN | <NA> | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 0 | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN |
580649 | 254902001 | 21 | 20 | 254902 | LA PRYOR ISD | LA PRYOR H.S. | 08 | Numeric | Social Studies | White Students | NaN | <NA> | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 1 | NaN | NaN | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN |
3576655 rows × 35 columns
Conclusion
The resulting dataset/file ends up being ~3.5 million rows in length and a consistent 35 columns wide. You can download the reshaped dataset here: TX STAAR 2021 - Grades 3-8
We can now run simpler queries on the reshaped data such as:
SELECT *
FROM TX_STAAR_2021
WHERE subject = 'Reading'
AND student_group = 'All Students'
AND grade IN ('6', '7', '8')
A similar query using the original datasets would’ve required unioning three different datasets and a much more complicated set of column matching to filter down to the right subject and student group.
In our production environment, the approach is similar but we initially load the raw dat and var files into Google Cloud Storage and as well as the resulting csv. That is then converted to avro for better query performance and an external table is generated in BigQuery using their API.