I’ve been wanting to put together a simple resource for learning the basics of querying data with SQL for some time now. However, one of the roadblocks has been the overhead of initial setup instructions for creating a database to query from. Even cloud options have hurdles that prevent just jumping in and writing SQL.
DuckDB: Simple and Effective
One of my favorite new tools is DuckDB, especially when combined with pandas dataframes. It lets me use pandas to quickly read a data source into a dataframe, then set aside the often confusing array of pandas querying syntax and return to the tool I am most comfortable with in my data analysis toolbelt: SQL.
import pandas as pd

directory = pd.read_csv("ccd_sch_029_2122_w_1a_071722.csv", dtype=str)

# convert column names to lowercase,
# because I'm not a fan of my column names screaming at me in ALL CAPS
directory.columns = [c.lower() for c in directory.columns]

directory.head()
| | school_year | fipst | statename | st | sch_name | lea_name | state_agency_no | union | st_leaid | leaid | ... | g_10_offered | g_11_offered | g_12_offered | g_13_offered | g_ug_offered | g_ae_offered | gslo | gshi | level | igoffered |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2021-2022 | 01 | ALABAMA | AL | Albertville Middle School | Albertville City | 01 | NaN | AL-101 | 0100005 | ... | No | No | No | No | No | No | 07 | 08 | Middle | As reported |
| 1 | 2021-2022 | 01 | ALABAMA | AL | Albertville High School | Albertville City | 01 | NaN | AL-101 | 0100005 | ... | Yes | Yes | Yes | No | No | No | 09 | 12 | High | As reported |
| 2 | 2021-2022 | 01 | ALABAMA | AL | Albertville Intermediate School | Albertville City | 01 | NaN | AL-101 | 0100005 | ... | No | No | No | No | No | No | 05 | 06 | Middle | As reported |
| 3 | 2021-2022 | 01 | ALABAMA | AL | Albertville Elementary School | Albertville City | 01 | NaN | AL-101 | 0100005 | ... | No | No | No | No | No | No | 03 | 04 | Elementary | As reported |
| 4 | 2021-2022 | 01 | ALABAMA | AL | Albertville Kindergarten and PreK | Albertville City | 01 | NaN | AL-101 | 0100005 | ... | No | No | No | No | No | No | PK | KG | Elementary | As reported |

5 rows × 65 columns
A quick pause for vocabulary
Before we dive in to writing SQL queries, let’s pause to define a few common terms that will come up.
Table: A table is a fundamental component of a database. It represents a collection of related data organized in rows and columns. Each row in a table is called a record, and each column represents a field.
Field: A field is a single piece of data within a table. It corresponds to the columns in a table and defines the type of data it can hold, such as text, numbers, dates, or binary data.
Record: A record, also known as a row, is a complete set of related data in a table. It contains values for each field, representing a single entity or data entry within the database.
Schema: A schema is a blueprint or structure that defines the organization of a database. It outlines the tables, fields, data types, relationships, and constraints that form the database’s structure.
Query: A query is a request made to the database to retrieve or manipulate data. It uses a structured query language (SQL) to interact with the database.
Understanding table schema
Understanding the schema of a table is essential before writing a SQL query. The schema defines the structure of the table, including the names of columns, data types, and constraints. When exploring unfamiliar datasets, understanding the schema provides insights into the available data and helps you identify relevant tables and columns for analysis. It also ensures that you refer to the correct column names and use the appropriate data types in your SQL query, leading to accurate and valid results. Knowing the data types of each column helps you format your query correctly.
For example, when dealing with dates or numerical values, understanding the data types ensures you use the right functions and operators for calculations. When writing queries, you often need to filter or sort data based on certain criteria. Understanding the schema allows you to apply the correct conditions and sorting instructions to retrieve the desired results effectively.
Now, let’s orient ourselves to the data in this dataset and its schema. To do so, let’s get a list of columns and a sense of the data volume (a quick way to pull this summary is sketched just below). That gives us:
- the column names
- their data types (recall we have set all of them to strings, which pandas lists as object)
- the number of non-null values in each column
- the number of columns: 65
- the number of rows: 102,130
- it even shows the amount of memory used to hold this data: 50.6 MB
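If you're curious how to generate that summary, one option (my addition here, not shown in the original walkthrough) is the dataframe's built-in info() method:

directory.info()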
A note about null
In SQL and databases, null represents the absence of a value or unknown data for a specific field in a table. It’s not the same as zero or an empty string and requires special handling in SQL queries. The null value can be used in any data type, whether it’s a string, numeric, date, or any other data type.
Dealing with null values in SQL queries can be tricky, and can lead to unexpected or incorrect results.
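As a quick illustration of that special handling (a sketch I'm adding, using the union column, which showed NaN values in the preview above and has to be quoted here because union is a reserved SQL keyword):

-- null needs "is null" / "is not null", not a comparison with = null
select sch_name
from directory
where "union" is null      -- matches rows where the value is missing
-- where "union" = null    -- would match nothing: a comparison with null is never true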
Getting Started
Now that we have a basic lay-of-the-land for our data, we can begin querying it. At this point, I will shift away from pandas syntax and use SQL queries.
To start, let’s install duckdb.
pip install duckdb
import duckdb

query = "select * from directory limit 5"
duckdb.sql(query).df()
| | school_year | fipst | statename | st | sch_name | lea_name | state_agency_no | union | st_leaid | leaid | ... | g_10_offered | g_11_offered | g_12_offered | g_13_offered | g_ug_offered | g_ae_offered | gslo | gshi | level | igoffered |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2021-2022 | 01 | ALABAMA | AL | Albertville Middle School | Albertville City | 01 | NaN | AL-101 | 0100005 | ... | No | No | No | No | No | No | 07 | 08 | Middle | As reported |
| 1 | 2021-2022 | 01 | ALABAMA | AL | Albertville High School | Albertville City | 01 | NaN | AL-101 | 0100005 | ... | Yes | Yes | Yes | No | No | No | 09 | 12 | High | As reported |
| 2 | 2021-2022 | 01 | ALABAMA | AL | Albertville Intermediate School | Albertville City | 01 | NaN | AL-101 | 0100005 | ... | No | No | No | No | No | No | 05 | 06 | Middle | As reported |
| 3 | 2021-2022 | 01 | ALABAMA | AL | Albertville Elementary School | Albertville City | 01 | NaN | AL-101 | 0100005 | ... | No | No | No | No | No | No | 03 | 04 | Elementary | As reported |
| 4 | 2021-2022 | 01 | ALABAMA | AL | Albertville Kindergarten and PreK | Albertville City | 01 | NaN | AL-101 | 0100005 | ... | No | No | No | No | No | No | PK | KG | Elementary | As reported |

5 rows × 65 columns
This basic query is equivalent to directory.head().
select * from directory limit 5
What is each keyword doing here?
select: The select keyword is used to specify which columns or expressions to include in the query result. It allows you to choose the data you want to retrieve from the tables in the database. In SQL, the asterisk (*) is used in a select statement to represent all columns in a table.
from: The from keyword is used to specify the table from which the data will be retrieved. It tells the query where to find the data to be selected. With duckdb, we can specify a dataframe as the table to be queried.
limit: The limit keyword is used to restrict the number of rows returned in the query result. It is particularly useful when you only need to see a small subset of the data.
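For instance, instead of the asterisk we could list just a couple of columns (a small sketch of my own, reusing the same dataframe):

query = "select sch_name, statename from directory limit 3"
duckdb.sql(query).df()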
Asking Questions
SQL queries are fundamentally a way of interrogating your data. You usually start with a question you want to answer and then phrase your query in such a way that the database can understand it and return the relevant data.
One of the basic ways to interrogate the data is to constrain it in some way. Filtering the data to show only a subset allows you to begin to identify possible patterns. The where clause is particularly valuable for this purpose.
Let’s translate this question into a SQL query: Which schools in Delaware offer 12th grade enrollment?
query =""" select sch_name from directory where statename = 'DELAWARE' and g_12_offered = 'Yes' order by sch_name"""duckdb.sql(query).df()
| | sch_name |
| --- | --- |
| 0 | Appoquinimink High School |
| 1 | Brandywine High School |
| 2 | Brandywine SITE |
| 3 | Brennen School (The) |
| 4 | Caesar Rodney High School |
| 5 | Calloway (Cab) School of the Arts |
| 6 | Cape Henlopen High School |
| 7 | Charter School of Wilmington |
| 8 | Christiana High School |
| 9 | Concord High School |
| 10 | Conrad Schools of Science |
| 11 | Delaware Military Academy |
| 12 | Delaware School for the Deaf |
| 13 | Delcastle Technical High School |
| 14 | Delmar High School |
| 15 | Dickinson (John) School |
| 16 | Douglass School |
| 17 | Dover High School |
| 18 | Early College High School at Del State |
| 19 | Ennis (Howard T.) School |
| 20 | First State Military Academy |
| 21 | Freire Charter School |
| 22 | Glasgow High School |
| 23 | Great Oaks Charter School |
| 24 | Hodgson (Paul M.) Vocational Technical High School |
| 25 | Howard High School of Technology |
| 26 | Indian River High School |
| 27 | Indian River Intensive Learning Center |
| 28 | John S. Charlton School |
| 29 | Kent County Community School |
| 30 | Kent County Secondary ILC |
| 31 | Lake Forest High School |
| 32 | Laurel Senior High School |
| 33 | Leach (John G.) School |
| 34 | MOT Charter School |
| 35 | McKean (Thomas) High School |
| 36 | Meadowood Program |
| 37 | Middletown High School |
| 38 | Milford Senior High School |
| 39 | Mount Pleasant High School |
| 40 | Newark Charter School |
| 41 | Newark High School |
| 42 | Odyssey Charter School |
| 43 | POLYTECH High School |
| 44 | Penn (William) High School |
| 45 | Positive Outcomes Charter School |
| 46 | Seaford Senior High School |
| 47 | Smyrna High School |
| 48 | St. Georges Technical High School |
| 49 | Sussex Academy |
| 50 | Sussex Central High School |
| 51 | Sussex Consortium |
| 52 | Sussex Orthopedic Program |
| 53 | Sussex Technical High School |
| 54 | The Wallace Wallin School |
| 55 | Woodbridge High School |
| 56 | duPont (Alexis I.) High School |
NOTE: You may have noticed the use of """ in the code above. This allows a string to be broken across multiple lines, which I find easier to read for SQL queries.
What is happening here? Let’s break down this query.
select sch_name               -- return only the school name column
from directory                -- from the data in our dataframe
where statename = 'DELAWARE'  -- where the state name is DELAWARE (state names are all caps in the dataset)
and g_12_offered = 'Yes'      -- and where grade 12 is offered
order by sch_name             -- sort the resulting data alphabetically by the school name
Next, let’s refine this query to identify which of those schools are charter schools or not.
query =""" select sch_name, charter_text, from directory where statename = 'DELAWARE' and g_12_offered = 'Yes' order by sch_name"""duckdb.sql(query).df()
| | sch_name | charter_text |
| --- | --- | --- |
| 0 | Appoquinimink High School | No |
| 1 | Brandywine High School | No |
| 2 | Brandywine SITE | No |
| 3 | Brennen School (The) | No |
| 4 | Caesar Rodney High School | No |
| 5 | Calloway (Cab) School of the Arts | No |
| 6 | Cape Henlopen High School | No |
| 7 | Charter School of Wilmington | Yes |
| 8 | Christiana High School | No |
| 9 | Concord High School | No |
| 10 | Conrad Schools of Science | No |
| 11 | Delaware Military Academy | Yes |
| 12 | Delaware School for the Deaf | No |
| 13 | Delcastle Technical High School | No |
| 14 | Delmar High School | No |
| 15 | Dickinson (John) School | No |
| 16 | Douglass School | No |
| 17 | Dover High School | No |
| 18 | Early College High School at Del State | Yes |
| 19 | Ennis (Howard T.) School | No |
| 20 | First State Military Academy | Yes |
| 21 | Freire Charter School | Yes |
| 22 | Glasgow High School | No |
| 23 | Great Oaks Charter School | Yes |
| 24 | Hodgson (Paul M.) Vocational Technical High School | No |
| 25 | Howard High School of Technology | No |
| 26 | Indian River High School | No |
| 27 | Indian River Intensive Learning Center | No |
| 28 | John S. Charlton School | No |
| 29 | Kent County Community School | No |
| 30 | Kent County Secondary ILC | No |
| 31 | Lake Forest High School | No |
| 32 | Laurel Senior High School | No |
| 33 | Leach (John G.) School | No |
| 34 | MOT Charter School | Yes |
| 35 | McKean (Thomas) High School | No |
| 36 | Meadowood Program | No |
| 37 | Middletown High School | No |
| 38 | Milford Senior High School | No |
| 39 | Mount Pleasant High School | No |
| 40 | Newark Charter School | Yes |
| 41 | Newark High School | No |
| 42 | Odyssey Charter School | Yes |
| 43 | POLYTECH High School | No |
| 44 | Penn (William) High School | No |
| 45 | Positive Outcomes Charter School | Yes |
| 46 | Seaford Senior High School | No |
| 47 | Smyrna High School | No |
| 48 | St. Georges Technical High School | No |
| 49 | Sussex Academy | Yes |
| 50 | Sussex Central High School | No |
| 51 | Sussex Consortium | No |
| 52 | Sussex Orthopedic Program | No |
| 53 | Sussex Technical High School | No |
| 54 | The Wallace Wallin School | No |
| 55 | Woodbridge High School | No |
| 56 | duPont (Alexis I.) High School | No |
And if we want to limit our list to just the charter schools, we can add another filter to the where clause.
query =""" select sch_name, charter_text, from directory where statename = 'DELAWARE' and g_12_offered = 'Yes' and charter_text = 'Yes' order by sch_name"""duckdb.sql(query).df()
| | sch_name | charter_text |
| --- | --- | --- |
| 0 | Charter School of Wilmington | Yes |
| 1 | Delaware Military Academy | Yes |
| 2 | Early College High School at Del State | Yes |
| 3 | First State Military Academy | Yes |
| 4 | Freire Charter School | Yes |
| 5 | Great Oaks Charter School | Yes |
| 6 | MOT Charter School | Yes |
| 7 | Newark Charter School | Yes |
| 8 | Odyssey Charter School | Yes |
| 9 | Positive Outcomes Charter School | Yes |
| 10 | Sussex Academy | Yes |
Counting: A common type of question to ask about data
After this initial investigation, my curiosity has been piqued: I am wondering how many schools are charters vs. traditional public schools in Delaware (regardless of grades offered). Let's translate that question into a SQL query using some basic aggregation.
query =""" select statename, charter_text, count(sch_name) as num_schools, from directory where statename = 'DELAWARE' group by statename, charter_text"""duckdb.sql(query).df()
| | statename | charter_text | num_schools |
| --- | --- | --- | --- |
| 0 | DELAWARE | Yes | 23 |
| 1 | DELAWARE | No | 208 |
How does this work?
The count() function, predictably, will count the number of values in a given column (or all records if the * is passed instead). You can further expand this to count only unique school names by using count(distinct sch_name).
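As a quick aside (my own example, not part of the original analysis), counting only distinct school names in Delaware would look like this:

query = """
    select count(distinct sch_name) as num_unique_school_names
    from directory
    where statename = 'DELAWARE'
"""
duckdb.sql(query).df()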
We are also aliasing the count using the as keyword to give the resulting column a name: num_schools.
The group by clause is necessary for this aggregation because it defines the context in which the count() function operates. Without the group by clause, the count() function would treat the entire result set as a single group.
Consider this example instead (without grouping):
query =""" select count(sch_name) as num_schools, from directory where statename = 'DELAWARE'"""duckdb.sql(query).df()
| | num_schools |
| --- | --- |
| 0 | 231 |
In order to get a count for each value of the Yes/No flag in the charter_text column, we need to group our counts by each of those results. In addition, in order to include the statename in our result set, we need to include it in the grouping as well.
If we wanted to determine the same counts for all states in the dataset, we could simply remove the where clause limiting the data to Delaware.
query =""" select statename, charter_text, count(sch_name) as num_schools, from directory group by statename, charter_text"""duckdb.sql(query).df()
| | statename | charter_text | num_schools |
| --- | --- | --- | --- |
| 0 | ALABAMA | No | 1553 |
| 1 | ALABAMA | Yes | 13 |
| 2 | ALASKA | No | 478 |
| 3 | ALASKA | Yes | 31 |
| 4 | ARIZONA | No | 1909 |
| ... | ... | ... | ... |
| 100 | GUAM | Yes | 3 |
| 101 | NORTHERN MARIANAS | Not applicable | 35 |
| 102 | PUERTO RICO | No | 846 |
| 103 | PUERTO RICO | Yes | 7 |
| 104 | U.S. VIRGIN ISLANDS | Not applicable | 23 |

105 rows × 3 columns
This gives us some sense of the breakdown, but it actually might be useful to rephrase our question as: What number and percent of schools in each state are charter schools?
With that additional clarity, we can refine our query a bit using some conditional logic.
query =""" select statename, count( if(charter_text = 'Yes', sch_name, null) ) as num_charter_schools, count(sch_name) as num_schools, from directory group by statename order by statename"""duckdb.sql(query).df()
| | statename | num_charter_schools | num_schools |
| --- | --- | --- | --- |
| 0 | ALABAMA | 13 | 1566 |
| 1 | ALASKA | 31 | 509 |
| 2 | AMERICAN SAMOA | 0 | 29 |
| 3 | ARIZONA | 606 | 2515 |
| 4 | ARKANSAS | 93 | 1099 |
| 5 | BUREAU OF INDIAN EDUCATION | 0 | 174 |
| 6 | CALIFORNIA | 1337 | 10456 |
| 7 | COLORADO | 268 | 1960 |
| 8 | CONNECTICUT | 21 | 1009 |
| 9 | DELAWARE | 23 | 231 |
| 10 | DISTRICT OF COLUMBIA | 125 | 243 |
| 11 | FLORIDA | 767 | 4306 |
| 12 | GEORGIA | 95 | 2327 |
| 13 | GUAM | 3 | 44 |
| 14 | HAWAII | 37 | 296 |
| 15 | IDAHO | 81 | 801 |
| 16 | ILLINOIS | 138 | 4409 |
| 17 | INDIANA | 121 | 1933 |
| 18 | IOWA | 2 | 1333 |
| 19 | KANSAS | 9 | 1360 |
| 20 | KENTUCKY | 0 | 1548 |
| 21 | LOUISIANA | 150 | 1383 |
| 22 | MAINE | 13 | 602 |
| 23 | MARYLAND | 49 | 1424 |
| 24 | MASSACHUSETTS | 78 | 1859 |
| 25 | MICHIGAN | 386 | 3570 |
| 26 | MINNESOTA | 306 | 2773 |
| 27 | MISSISSIPPI | 7 | 1054 |
| 28 | MISSOURI | 81 | 2469 |
| 29 | MONTANA | 0 | 832 |
| 30 | NEBRASKA | 0 | 1098 |
| 31 | NEVADA | 99 | 769 |
| 32 | NEW HAMPSHIRE | 39 | 498 |
| 33 | NEW JERSEY | 87 | 2582 |
| 34 | NEW MEXICO | 102 | 894 |
| 35 | NEW YORK | 331 | 4842 |
| 36 | NORTH CAROLINA | 223 | 2765 |
| 37 | NORTH DAKOTA | 0 | 536 |
| 38 | NORTHERN MARIANAS | 0 | 35 |
| 39 | OHIO | 325 | 3689 |
| 40 | OKLAHOMA | 72 | 1803 |
| 41 | OREGON | 132 | 1290 |
| 42 | PENNSYLVANIA | 184 | 2963 |
| 43 | PUERTO RICO | 7 | 853 |
| 44 | RHODE ISLAND | 41 | 319 |
| 45 | SOUTH CAROLINA | 93 | 1288 |
| 46 | SOUTH DAKOTA | 0 | 732 |
| 47 | TENNESSEE | 121 | 1931 |
| 48 | TEXAS | 1056 | 9652 |
| 49 | U.S. VIRGIN ISLANDS | 0 | 23 |
| 50 | UTAH | 141 | 1123 |
| 51 | VERMONT | 0 | 305 |
| 52 | VIRGINIA | 7 | 2136 |
| 53 | WASHINGTON | 17 | 2561 |
| 54 | WEST VIRGINIA | 3 | 700 |
| 55 | WISCONSIN | 243 | 2263 |
| 56 | WYOMING | 5 | 366 |
That gets us the counts of charter schools in each state as well as the total number of schools in each state. Let's break down what this is doing:
count(if(charter_text = 'Yes', sch_name, null)) as num_charter_schools,
The if() function checks whether a condition evaluates to true and returns one value if it does and another if it doesn't. Here, we are leveraging that to count the school name when the value of the charter_text column is Yes and return null when it is not. That works because the count() function ignores null values when aggregating.
A similar, but slightly different approach could be done using the sum() function instead:
sum(if(charter_text = 'Yes', 1, 0)) as num_charter_schools,
This would treat each charter school as equivalent to the value 1 and each non-charter school as equivalent to the value 0 and then add them up. The result is the same in this case, but there can be subtle differences between the approaches depending on the data being aggregated. It can be useful, at times, to compare the resulting differences (if any).
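If you want to convince yourself the two approaches agree here, a quick side-by-side check (my own addition, not part of the original analysis) could look like this; both columns should come back as 23 for Delaware:

query = """
    select
        count(if(charter_text = 'Yes', sch_name, null)) as via_count,
        sum(if(charter_text = 'Yes', 1, 0)) as via_sum,
    from directory
    where statename = 'DELAWARE'
"""
duckdb.sql(query).df()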
Calculating the percentage of schools
We have our counts now, but we don’t have the percentage of schools and that is ultimately the question we are asking. To determine that, we need to divide the number of charters in each state by the total number of schools.
We could do that like this:
query =""" select statename, count(if(charter_text = 'Yes', sch_name, null)) as num_charter_schools, count(sch_name) as num_schools, count(if(charter_text = 'Yes', sch_name, null))/count(sch_name) as percent_of_schools, from directory group by statename order by statename"""duckdb.sql(query).df()
statename
num_charter_schools
num_schools
percent_of_schools
0
ALABAMA
13
1566
0.008301
1
ALASKA
31
509
0.060904
2
AMERICAN SAMOA
0
29
0.000000
3
ARIZONA
606
2515
0.240954
4
ARKANSAS
93
1099
0.084622
5
BUREAU OF INDIAN EDUCATION
0
174
0.000000
6
CALIFORNIA
1337
10456
0.127869
7
COLORADO
268
1960
0.136735
8
CONNECTICUT
21
1009
0.020813
9
DELAWARE
23
231
0.099567
10
DISTRICT OF COLUMBIA
125
243
0.514403
11
FLORIDA
767
4306
0.178124
12
GEORGIA
95
2327
0.040825
13
GUAM
3
44
0.068182
14
HAWAII
37
296
0.125000
15
IDAHO
81
801
0.101124
16
ILLINOIS
138
4409
0.031300
17
INDIANA
121
1933
0.062597
18
IOWA
2
1333
0.001500
19
KANSAS
9
1360
0.006618
20
KENTUCKY
0
1548
0.000000
21
LOUISIANA
150
1383
0.108460
22
MAINE
13
602
0.021595
23
MARYLAND
49
1424
0.034410
24
MASSACHUSETTS
78
1859
0.041958
25
MICHIGAN
386
3570
0.108123
26
MINNESOTA
306
2773
0.110350
27
MISSISSIPPI
7
1054
0.006641
28
MISSOURI
81
2469
0.032807
29
MONTANA
0
832
0.000000
30
NEBRASKA
0
1098
0.000000
31
NEVADA
99
769
0.128739
32
NEW HAMPSHIRE
39
498
0.078313
33
NEW JERSEY
87
2582
0.033695
34
NEW MEXICO
102
894
0.114094
35
NEW YORK
331
4842
0.068360
36
NORTH CAROLINA
223
2765
0.080651
37
NORTH DAKOTA
0
536
0.000000
38
NORTHERN MARIANAS
0
35
0.000000
39
OHIO
325
3689
0.088100
40
OKLAHOMA
72
1803
0.039933
41
OREGON
132
1290
0.102326
42
PENNSYLVANIA
184
2963
0.062099
43
PUERTO RICO
7
853
0.008206
44
RHODE ISLAND
41
319
0.128527
45
SOUTH CAROLINA
93
1288
0.072205
46
SOUTH DAKOTA
0
732
0.000000
47
TENNESSEE
121
1931
0.062662
48
TEXAS
1056
9652
0.109407
49
U.S. VIRGIN ISLANDS
0
23
0.000000
50
UTAH
141
1123
0.125557
51
VERMONT
0
305
0.000000
52
VIRGINIA
7
2136
0.003277
53
WASHINGTON
17
2561
0.006638
54
WEST VIRGINIA
3
700
0.004286
55
WISCONSIN
243
2263
0.107380
56
WYOMING
5
366
0.013661
Not bad, but I find that decimal representation of the percentage a bit harder to read. Let's transform it to look like 12.5% by:
- multiplying the decimal by 100 to put it on a 0-100 scale
- rounding to a single decimal digit
- converting it to a string and adding the % sign
(A standalone illustration of these three steps on a single value follows below.)
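For example (a small sketch of my own), applying those steps to Delaware's 0.099567 from the table above:

duckdb.sql("select concat(cast(round(0.099567 * 100, 1) as string), '%') as formatted").df()

which should come back as 10.0%.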
query =""" select statename, count(if(charter_text = 'Yes', sch_name, null)) as num_charter_schools, count(sch_name) as num_schools, concat(cast(round((count(if(charter_text = 'Yes', sch_name, null))/count(sch_name)) * 100, 1) as string), '%') as percent_of_schools, from directory group by statename order by statename"""duckdb.sql(query).df()
statename
num_charter_schools
num_schools
percent_of_schools
0
ALABAMA
13
1566
0.8%
1
ALASKA
31
509
6.1%
2
AMERICAN SAMOA
0
29
0.0%
3
ARIZONA
606
2515
24.1%
4
ARKANSAS
93
1099
8.5%
5
BUREAU OF INDIAN EDUCATION
0
174
0.0%
6
CALIFORNIA
1337
10456
12.8%
7
COLORADO
268
1960
13.7%
8
CONNECTICUT
21
1009
2.1%
9
DELAWARE
23
231
10.0%
10
DISTRICT OF COLUMBIA
125
243
51.4%
11
FLORIDA
767
4306
17.8%
12
GEORGIA
95
2327
4.1%
13
GUAM
3
44
6.8%
14
HAWAII
37
296
12.5%
15
IDAHO
81
801
10.1%
16
ILLINOIS
138
4409
3.1%
17
INDIANA
121
1933
6.3%
18
IOWA
2
1333
0.2%
19
KANSAS
9
1360
0.7%
20
KENTUCKY
0
1548
0.0%
21
LOUISIANA
150
1383
10.8%
22
MAINE
13
602
2.2%
23
MARYLAND
49
1424
3.4%
24
MASSACHUSETTS
78
1859
4.2%
25
MICHIGAN
386
3570
10.8%
26
MINNESOTA
306
2773
11.0%
27
MISSISSIPPI
7
1054
0.7%
28
MISSOURI
81
2469
3.3%
29
MONTANA
0
832
0.0%
30
NEBRASKA
0
1098
0.0%
31
NEVADA
99
769
12.9%
32
NEW HAMPSHIRE
39
498
7.8%
33
NEW JERSEY
87
2582
3.4%
34
NEW MEXICO
102
894
11.4%
35
NEW YORK
331
4842
6.8%
36
NORTH CAROLINA
223
2765
8.1%
37
NORTH DAKOTA
0
536
0.0%
38
NORTHERN MARIANAS
0
35
0.0%
39
OHIO
325
3689
8.8%
40
OKLAHOMA
72
1803
4.0%
41
OREGON
132
1290
10.2%
42
PENNSYLVANIA
184
2963
6.2%
43
PUERTO RICO
7
853
0.8%
44
RHODE ISLAND
41
319
12.9%
45
SOUTH CAROLINA
93
1288
7.2%
46
SOUTH DAKOTA
0
732
0.0%
47
TENNESSEE
121
1931
6.3%
48
TEXAS
1056
9652
10.9%
49
U.S. VIRGIN ISLANDS
0
23
0.0%
50
UTAH
141
1123
12.6%
51
VERMONT
0
305
0.0%
52
VIRGINIA
7
2136
0.3%
53
WASHINGTON
17
2561
0.7%
54
WEST VIRGINIA
3
700
0.4%
55
WISCONSIN
243
2263
10.7%
56
WYOMING
5
366
1.4%
Cleaning this up for readability
At this point we have a working query, but the syntax is getting a bit dense and hard to read. I’d like to break this up a bit to make it more clear what is happening.
Our query has three main steps:
1. calculate the counts of charter schools and all schools for each state
2. divide the counts to calculate a percentage value as a decimal
3. transform the decimal value into a string representation of the percentage
We could use nested subqueries to do this, but I consider that an anti-pattern in SQL and prefer CTEs (common table expressions). They allow us to designate a query as an intermediate result that can then be queried further.
New Vocabulary
A Common Table Expression (CTE) is a temporary result set that is defined within the scope of a single SQL statement. It allows you to create a named query that can be referenced multiple times within the main query. CTEs are primarily used to improve the readability, modularity, and reusability of SQL queries.
Note: Subqueries as an anti-pattern
Using nested subqueries in SQL can become an anti-pattern: placing one query inside another often makes the overall query difficult to read, understand, and optimize. While nested subqueries can solve specific problems, using them excessively leads to unreadable, inefficient, and hard-to-maintain SQL. It's generally better to reach for alternatives like joins, CTEs, and window functions for clearer, more optimized code.
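For contrast, here is roughly what the same logic looks like crammed into a nested subquery (a sketch I'm adding for comparison, not code from the original analysis):

select
    statename,
    num_charter_schools,
    num_schools,
    concat(cast(round((num_charter_schools / num_schools) * 100, 1) as string), '%') as percent_of_schools
from (
    select
        statename,
        count(if(charter_text = 'Yes', sch_name, null)) as num_charter_schools,
        count(sch_name) as num_schools
    from directory
    group by statename
) as school_counts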
query =""" with school_counts as ( select statename, count(if(charter_text = 'Yes', sch_name, null)) as num_charter_schools, count(sch_name) as num_schools, from directory group by statename ), percents as ( select statename, num_charter_schools, num_schools, round((num_charter_schools/num_schools) * 100, 1) as percent_of_schools from school_counts ) select statename, num_charter_schools, num_schools, concat(cast(percent_of_schools as string),'%') as percent_of_schools from percents"""charters = duckdb.sql(query).df()
Another way to do this, taking advantage of duckdb's ability to query dataframes directly, would be to store each query's result in a variable and execute subsequent queries against those.
query =""" select statename, count(if(charter_text = 'Yes', sch_name, null)) as num_charter_schools, count(sch_name) as num_schools, from directory group by statename"""school_counts = duckdb.sql(query).df()query =""" select statename, num_charter_schools, num_schools, round((num_charter_schools/num_schools) * 100, 1) as percent_of_schools from school_counts"""percents = duckdb.sql(query).df()query =""" select statename, num_charter_schools, num_schools, concat(cast(percent_of_schools as string),'%') as percent_of_schools from percents"""charters = duckdb.sql(query).df()
charters
statename
num_charter_schools
num_schools
percent_of_schools
0
ALABAMA
13
1566
0.8%
1
ALASKA
31
509
6.1%
2
ARIZONA
606
2515
24.1%
3
ARKANSAS
93
1099
8.5%
4
CALIFORNIA
1337
10456
12.8%
5
COLORADO
268
1960
13.7%
6
CONNECTICUT
21
1009
2.1%
7
DELAWARE
23
231
10.0%
8
DISTRICT OF COLUMBIA
125
243
51.4%
9
FLORIDA
767
4306
17.8%
10
GEORGIA
95
2327
4.1%
11
HAWAII
37
296
12.5%
12
IDAHO
81
801
10.1%
13
ILLINOIS
138
4409
3.1%
14
INDIANA
121
1933
6.3%
15
IOWA
2
1333
0.2%
16
KANSAS
9
1360
0.7%
17
KENTUCKY
0
1548
0.0%
18
LOUISIANA
150
1383
10.8%
19
MAINE
13
602
2.2%
20
MARYLAND
49
1424
3.4%
21
MASSACHUSETTS
78
1859
4.2%
22
MICHIGAN
386
3570
10.8%
23
MINNESOTA
306
2773
11.0%
24
MISSISSIPPI
7
1054
0.7%
25
MISSOURI
81
2469
3.3%
26
MONTANA
0
832
0.0%
27
NEBRASKA
0
1098
0.0%
28
NEVADA
99
769
12.9%
29
NEW HAMPSHIRE
39
498
7.8%
30
NEW JERSEY
87
2582
3.4%
31
NEW MEXICO
102
894
11.4%
32
NEW YORK
331
4842
6.8%
33
NORTH CAROLINA
223
2765
8.1%
34
NORTH DAKOTA
0
536
0.0%
35
OHIO
325
3689
8.8%
36
OKLAHOMA
72
1803
4.0%
37
OREGON
132
1290
10.2%
38
PENNSYLVANIA
184
2963
6.2%
39
RHODE ISLAND
41
319
12.9%
40
SOUTH CAROLINA
93
1288
7.2%
41
SOUTH DAKOTA
0
732
0.0%
42
TENNESSEE
121
1931
6.3%
43
TEXAS
1056
9652
10.9%
44
UTAH
141
1123
12.6%
45
VERMONT
0
305
0.0%
46
VIRGINIA
7
2136
0.3%
47
WASHINGTON
17
2561
0.7%
48
WEST VIRGINIA
3
700
0.4%
49
WISCONSIN
243
2263
10.7%
50
WYOMING
5
366
1.4%
51
BUREAU OF INDIAN EDUCATION
0
174
0.0%
52
AMERICAN SAMOA
0
29
0.0%
53
GUAM
3
44
6.8%
54
NORTHERN MARIANAS
0
35
0.0%
55
PUERTO RICO
7
853
0.8%
56
U.S. VIRGIN ISLANDS
0
23
0.0%
Either way, we can now see the answer to our initial question. However, at this point a new question has come to mind: Which states have the highest percentage of charter schools?
To answer that, we can simply sort the results:
query =""" select * from charters order by percent_of_schools desc"""duckdb.sql(query).df()
statename
num_charter_schools
num_schools
percent_of_schools
0
OHIO
325
3689
8.8%
1
ARKANSAS
93
1099
8.5%
2
NORTH CAROLINA
223
2765
8.1%
3
NEW HAMPSHIRE
39
498
7.8%
4
SOUTH CAROLINA
93
1288
7.2%
5
NEW YORK
331
4842
6.8%
6
GUAM
3
44
6.8%
7
INDIANA
121
1933
6.3%
8
TENNESSEE
121
1931
6.3%
9
PENNSYLVANIA
184
2963
6.2%
10
ALASKA
31
509
6.1%
11
DISTRICT OF COLUMBIA
125
243
51.4%
12
MASSACHUSETTS
78
1859
4.2%
13
GEORGIA
95
2327
4.1%
14
OKLAHOMA
72
1803
4.0%
15
MARYLAND
49
1424
3.4%
16
NEW JERSEY
87
2582
3.4%
17
MISSOURI
81
2469
3.3%
18
ILLINOIS
138
4409
3.1%
19
ARIZONA
606
2515
24.1%
20
MAINE
13
602
2.2%
21
CONNECTICUT
21
1009
2.1%
22
FLORIDA
767
4306
17.8%
23
COLORADO
268
1960
13.7%
24
NEVADA
99
769
12.9%
25
RHODE ISLAND
41
319
12.9%
26
CALIFORNIA
1337
10456
12.8%
27
UTAH
141
1123
12.6%
28
HAWAII
37
296
12.5%
29
NEW MEXICO
102
894
11.4%
30
MINNESOTA
306
2773
11.0%
31
TEXAS
1056
9652
10.9%
32
LOUISIANA
150
1383
10.8%
33
MICHIGAN
386
3570
10.8%
34
WISCONSIN
243
2263
10.7%
35
OREGON
132
1290
10.2%
36
IDAHO
81
801
10.1%
37
DELAWARE
23
231
10.0%
38
WYOMING
5
366
1.4%
39
ALABAMA
13
1566
0.8%
40
PUERTO RICO
7
853
0.8%
41
KANSAS
9
1360
0.7%
42
WASHINGTON
17
2561
0.7%
43
MISSISSIPPI
7
1054
0.7%
44
WEST VIRGINIA
3
700
0.4%
45
VIRGINIA
7
2136
0.3%
46
IOWA
2
1333
0.2%
47
NORTH DAKOTA
0
536
0.0%
48
SOUTH DAKOTA
0
732
0.0%
49
NEBRASKA
0
1098
0.0%
50
MONTANA
0
832
0.0%
51
VERMONT
0
305
0.0%
52
KENTUCKY
0
1548
0.0%
53
BUREAU OF INDIAN EDUCATION
0
174
0.0%
54
AMERICAN SAMOA
0
29
0.0%
55
NORTHERN MARIANAS
0
35
0.0%
56
U.S. VIRGIN ISLANDS
0
23
0.0%
Oh no!!!
Our conversion of the percentage into a string has ruined our ability to sort the values. This is why it is so vital to understand data types. Strings are sorted alphabetically, but numbers are sorted numerically. What that means in this context is that when sorting descending (reverse), the string "8" is larger than the string "50" because 8 comes after 5. If the values were still numbers, that would not be the case.
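A tiny illustration of the difference (my own aside): sorting a few percentage strings in plain Python shows the same problem.

sorted(["8.8%", "50.0%", "12.5%"], reverse=True)
# ['8.8%', '50.0%', '12.5%'] -- "8" sorts above "5" and "1", so 8.8% lands on top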
To resolve this we can preserve the original decimal column for sorting rather than replace it entirely.
query =""" with school_counts as ( select statename, count(if(charter_text = 'Yes', sch_name, null)) as num_charter_schools, count(sch_name) as num_schools, from directory group by statename ), percents as ( select statename, num_charter_schools, num_schools, round((num_charter_schools/num_schools) * 100, 1) as percent_of_schools_numeric from school_counts ) select statename, num_charter_schools, num_schools, concat(cast(percent_of_schools_numeric as string),'%') as percent_of_schools from percents order by percent_of_schools_numeric desc"""duckdb.sql(query).df()
While DC has the largest percentage of charter schools, that percentage is out of a small number of schools to begin with, given the size of Washington, DC. If we sort by count instead, we can see that California and Texas have the largest total number of charter schools, but because those states have much larger populations, the totals represent a smaller percentage.
Washington DC is not even in the top 10.
query =""" select * from charters order by num_charter_schools desc limit 10"""duckdb.sql(query).df()
| | statename | num_charter_schools | num_schools | percent_of_schools |
| --- | --- | --- | --- | --- |
| 0 | CALIFORNIA | 1337 | 10456 | 12.8% |
| 1 | TEXAS | 1056 | 9652 | 10.9% |
| 2 | FLORIDA | 767 | 4306 | 17.8% |
| 3 | ARIZONA | 606 | 2515 | 24.1% |
| 4 | MICHIGAN | 386 | 3570 | 10.8% |
| 5 | NEW YORK | 331 | 4842 | 6.8% |
| 6 | OHIO | 325 | 3689 | 8.8% |
| 7 | MINNESOTA | 306 | 2773 | 11.0% |
| 8 | COLORADO | 268 | 1960 | 13.7% |
| 9 | WISCONSIN | 243 | 2263 | 10.7% |
This begs the question: Which states have the most students in charter schools?
Not all schools are the same size: a state could have a large number of schools with smaller enrollments or fewer schools with larger enrollments. Unfortunately, our initial dataset doesn't have enrollment information. To determine this, we need new data. Going back to NCES, we can pull the school membership data as well. This is a much larger dataset (198 MB compressed) than the directory info, but it doesn't have as many of the descriptors as our original dataset (such as the charter flag).
To answer this final query, we’ll need to combine the data in these two datasets somehow.
This is where SQL really shines. The sql join is one of the fundamental features of the language that makes it so powerful for data analysis. SQL joins are like puzzle pieces that help you combine information from different tables in a database. They’re useful when you want to see related data together, such as matching customers with their orders or students with their courses. Instead of keeping all the information in one big table, you can split it into smaller ones and then use joins to bring the pieces together, making it easier to understand and work with your data.
enrollment = pd.read_csv("ccd_SCH_052_2122_l_1a_071722.csv", dtype=str)
enrollment.columns = [c.lower() for c in enrollment.columns]
enrollment.head()
| | school_year | fipst | statename | st | sch_name | state_agency_no | union | st_leaid | leaid | st_schid | ncessch | schid | grade | race_ethnicity | sex | student_count | total_indicator | dms_flag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2021-2022 | 01 | ALABAMA | AL | Albertville Middle School | 01 | NaN | AL-101 | 0100005 | AL-101-0010 | 010000500870 | 0100870 | Grade 7 | American Indian or Alaska Native | Female | 1 | Category Set A - By Race/Ethnicity; Sex; Grade | Reported |
| 1 | 2021-2022 | 01 | ALABAMA | AL | Albertville Middle School | 01 | NaN | AL-101 | 0100005 | AL-101-0010 | 010000500870 | 0100870 | Grade 7 | American Indian or Alaska Native | Male | 3 | Category Set A - By Race/Ethnicity; Sex; Grade | Reported |
| 2 | 2021-2022 | 01 | ALABAMA | AL | Albertville Middle School | 01 | NaN | AL-101 | 0100005 | AL-101-0010 | 010000500870 | 0100870 | Grade 7 | Asian | Female | 2 | Category Set A - By Race/Ethnicity; Sex; Grade | Reported |
| 3 | 2021-2022 | 01 | ALABAMA | AL | Albertville Middle School | 01 | NaN | AL-101 | 0100005 | AL-101-0010 | 010000500870 | 0100870 | Grade 7 | Asian | Male | 2 | Category Set A - By Race/Ethnicity; Sex; Grade | Reported |
| 4 | 2021-2022 | 01 | ALABAMA | AL | Albertville Middle School | 01 | NaN | AL-101 | 0100005 | AL-101-0010 | 010000500870 | 0100870 | Grade 7 | Black or African American | Female | 7 | Category Set A - By Race/Ethnicity; Sex; Grade | Reported |
Exploring Enrollment
My goal here is to simply pull out the school-wide total enrollment for each school, but a quick scan of that dataframe preview indicates this data is pretty granular. It is broken out into grade levels, each student group for race/ethnicity, and even sex. But I do see there is a column called total_indicator that might be useful. I’d like to get a sense of what the values of that column are so I can determine if it could be used to filter the data somehow. To do that I will query the distinct values in that column.
pd.set_option('max_colwidth', 0)  # this is to make sure pandas doesn't truncate the text in the column

query = """
    select distinct total_indicator
    from enrollment
"""
duckdb.sql(query).df()
| | total_indicator |
| --- | --- |
| 0 | Category Set A - By Race/Ethnicity; Sex; Grade |
| 1 | Derived - Education Unit Total minus Adult Education Count |
| 2 | Derived - Subtotal by Race/Ethnicity and Sex minus Adult Education Count |
| 3 | Education Unit Total |
| 4 | Subtotal 4 - By Grade |
We can see that there are two values that might work here:
- Education Unit Total
- Derived - Education Unit Total minus Adult Education Count
We can use the first, as it seems more straightforward and includes the full number of students (including adult ed).
Before we do that, let’s pull out the relevant data from the dataset that we want to join to our directory data.
query =""" select school_year, statename, st, sch_name, st_schid, ncessch, student_count, from enrollment where total_indicator = 'Education Unit Total'"""total_enrollment = duckdb.sql(query).df()
total_enrollment
| | school_year | statename | st | sch_name | st_schid | ncessch | student_count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2021-2022 | ALABAMA | AL | Albertville Middle School | AL-101-0010 | 010000500870 | 920 |
| 1 | 2021-2022 | ALABAMA | AL | Albertville High School | AL-101-0020 | 010000500871 | 1665 |
| 2 | 2021-2022 | ALABAMA | AL | Albertville Intermediate School | AL-101-0110 | 010000500879 | 924 |
| 3 | 2021-2022 | ALABAMA | AL | Albertville Elementary School | AL-101-0200 | 010000500889 | 891 |
| 4 | 2021-2022 | ALABAMA | AL | Albertville Kindergarten and PreK | AL-101-0035 | 010000501616 | 579 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 100024 | 2021-2022 | U.S. VIRGIN ISLANDS | VI | LOCKHART ELEMENTARY SCHOOL | VI-001-15 | 780003000024 | 284 |
| 100025 | 2021-2022 | U.S. VIRGIN ISLANDS | VI | ULLA F MULLER ELEMENTARY SCHOOL | VI-001-17 | 780003000026 | 437 |
| 100026 | 2021-2022 | U.S. VIRGIN ISLANDS | VI | YVONNE BOWSKY ELEMENTARY SCHOOL | VI-001-23 | 780003000027 | 381 |
| 100027 | 2021-2022 | U.S. VIRGIN ISLANDS | VI | CANCRYN JUNIOR HIGH SCHOOL | VI-001-25 | 780003000033 | 744 |
| 100028 | 2021-2022 | U.S. VIRGIN ISLANDS | VI | BERTHA BOSCHULTE JUNIOR HIGH | VI-001-9 | 780003000034 | 515 |

100029 rows × 7 columns
I'm going to include a few different fields that may be necessary for joining, such as school_year, statename, and a couple of different identifiers, since we're not yet 100% sure which one is unique enough to join on.
We can do the same to the directory data and just keep the relevant fields for our analysis.
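The snippet that trims the directory data down isn't shown here, but a minimal sketch that would produce the directory_info dataframe used in the join below might look like this (the exact columns kept are my assumption):

query = """
    select
        school_year,
        statename,
        st,
        sch_name,
        ncessch,
        charter_text,
    from directory
"""
directory_info = duckdb.sql(query).df()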
One thing to note is that the directory dataset has 102,130 rows and the enrollment dataset has 100,029 rows. So there is a difference of 2,101 rows between them. When we join them we should expect our returned data to be within those ranges if we join accurately and depending on the type of join used.
query =""" select e.school_year, e.statename, e.st, e.sch_name, e.st_schid, e.ncessch, cast(e.student_count as int) as student_count, d.charter_text from directory_info d inner join total_enrollment e on d.school_year = e.school_year and d.st = e.st and d.ncessch = e.ncessch"""enrollments_with_info = duckdb.sql(query).df()
enrollments_with_info
| | school_year | statename | st | sch_name | st_schid | ncessch | student_count | charter_text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2021-2022 | ALABAMA | AL | Albertville High School | AL-101-0020 | 010000500871 | 1665.0 | No |
| 1 | 2021-2022 | ALABAMA | AL | Albertville Intermediate School | AL-101-0110 | 010000500879 | 924.0 | No |
| 2 | 2021-2022 | ALABAMA | AL | Albertville Elementary School | AL-101-0200 | 010000500889 | 891.0 | No |
| 3 | 2021-2022 | ALABAMA | AL | Albertville Kindergarten and PreK | AL-101-0035 | 010000501616 | 579.0 | No |
| 4 | 2021-2022 | ALABAMA | AL | Albertville Primary School | AL-101-0005 | 010000502150 | 977.0 | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 100024 | 2021-2022 | PUERTO RICO | PR | ALEJANDRO JR CRUZ (GALATEO PARCELAS) | PR-01-71449 | 720003002078 | 129.0 | No |
| 100025 | 2021-2022 | PUERTO RICO | PR | BERWIND SUPERIOR | PR-01-67942 | 720003001333 | 228.0 | No |
| 100026 | 2021-2022 | PUERTO RICO | PR | DR. ANTONIO S. PEDREIRA (ESPECIALIZADA) | PR-01-61333 | 720003001370 | 299.0 | No |
| 100027 | 2021-2022 | PUERTO RICO | PR | TRINA PADILLA DE SANZ | PR-01-61440 | 720003001379 | 425.0 | No |
| 100028 | 2021-2022 | PUERTO RICO | PR | JUDITH A VIVAS | PR-01-16220 | 720003001971 | 247.0 | No |

100029 rows × 8 columns
The inner join is something like the intersection of a Venn diagram: it returns records (based on a matching condition or set of conditions) that appear in both datasets. A left join, by contrast, returns all the records in the first dataset along with null values for any records that don't appear in the set being joined, which can be very useful when trying to determine which records are missing. For our purposes, we only care about schools with actual enrollment (since we can't calculate percentages from non-existent student counts).
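As a side note (my own example, not from the original analysis), a left join makes it easy to spot directory schools with no matching enrollment record at all:

query = """
    select
        d.sch_name,
        e.student_count,
    from directory_info d
    left join total_enrollment e
        on d.ncessch = e.ncessch
    where e.ncessch is null
    limit 5
"""
duckdb.sql(query).df()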
Notice the use of table aliases (d and e, respectively) to help distinguish between columns in either dataset. Another way to write that join without the aliases would be:
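(The original snippet isn't shown here; written out fully, the join condition would read something like this.)

from directory_info
inner join total_enrollment
    on directory_info.school_year = total_enrollment.school_year
    and directory_info.st = total_enrollment.st
    and directory_info.ncessch = total_enrollment.ncessch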
For smaller table names this can be fine, but it does get tedious over time and sensible aliases can make the syntax of queries easier to read.
Table aliases and explicit table references are technically only required when the columns share names across both datasets (since the SQL interpreter can’t distinguish which one you mean) but it is generally considered a best practice when using joins to always specify the table source (preferably with an alias) in both the join conditions as well as in the select statement.
Because both of our original datasets are constrained to a single school year, we could probably remove some of the join conditions.
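For example, dropping the school_year condition (a sketch of my own, since the original intermediate step isn't shown) would leave:

from directory_info d
inner join total_enrollment e
    on d.st = e.st
    and d.ncessch = e.ncessch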
And if we are certain the NCES IDs are unique across states (they are), we could further simplify as:
from directory_info d
inner join total_enrollment e
    on d.ncessch = e.ncessch
Calculating % of students in charters
Now that we've combined our datasets, we can write aggregations similar to what we did for schools earlier. We could do this using CTEs, or, since we've stored the joined data in an output variable (as a dataframe), we can query that directly.
query =""" select school_year, statename, count(if(charter_text = 'Yes', sch_name, null)) as charter_schools, count(sch_name) as total_schools, round((count(if(charter_text = 'Yes', sch_name, null))/count(sch_name)) * 100, 1) as percent_charter_schools, cast(sum(if(charter_text = 'Yes', student_count, 0)) as int) as charter_students, cast(sum(student_count) as int) as total_students, round((sum(if(charter_text = 'Yes', student_count, 0))/sum(student_count)) * 100, 1) as percent_charter_students, from enrollments_with_info group by school_year, statename having sum(student_count) is not null and sum(if(charter_text = 'Yes', student_count, 0)) > 0 order by percent_charter_students desc"""charter_enrollments = duckdb.sql(query).df()
charter_enrollments
school_year
statename
charter_schools
total_schools
percent_charter_schools
charter_students
total_students
percent_charter_students
0
2021-2022
DISTRICT OF COLUMBIA
123
240
51.2
39476
88543
44.6
1
2021-2022
ARIZONA
581
2418
24.0
231195
1131888
20.4
2
2021-2022
COLORADO
265
1941
13.7
130279
880582
14.8
3
2021-2022
NEVADA
93
746
12.5
63944
488251
13.1
4
2021-2022
FLORIDA
707
4191
16.9
361634
2832516
12.8
5
2021-2022
LOUISIANA
146
1366
10.7
86301
680793
12.7
6
2021-2022
DELAWARE
23
229
10.0
17201
139935
12.3
7
2021-2022
CALIFORNIA
1291
10167
12.7
678056
5874619
11.5
8
2021-2022
UTAH
137
1105
12.4
77733
687107
11.3
9
2021-2022
MICHIGAN
374
3538
10.6
150327
1396598
10.8
10
2021-2022
PENNSYLVANIA
179
2941
6.1
163372
1671899
9.8
11
2021-2022
NEW MEXICO
99
890
11.1
30160
316785
9.5
12
2021-2022
IDAHO
77
784
9.8
28051
313909
8.9
13
2021-2022
ARKANSAS
93
1084
8.6
42341
489565
8.6
14
2021-2022
NORTH CAROLINA
204
2719
7.5
131624
1525223
8.6
15
2021-2022
OKLAHOMA
70
1792
3.9
59753
696411
8.6
16
2021-2022
RHODE ISLAND
40
315
12.7
11387
136864
8.3
17
2021-2022
TEXAS
1010
9105
11.1
443548
5428609
8.2
18
2021-2022
OREGON
131
1285
10.2
42668
540687
7.9
19
2021-2022
MINNESOTA
278
2661
10.4
66595
868742
7.7
20
2021-2022
HAWAII
37
294
12.6
12114
173178
7.0
21
2021-2022
NEW YORK
330
4802
6.9
173341
2526204
6.9
22
2021-2022
OHIO
324
3659
8.9
115021
1682397
6.8
23
2021-2022
SOUTH CAROLINA
88
1267
6.9
49344
780403
6.3
24
2021-2022
GUAM
3
44
6.8
1783
28402
6.3
25
2021-2022
WISCONSIN
236
2243
10.5
49715
827393
6.0
26
2021-2022
ALASKA
30
500
6.0
7621
129944
5.9
27
2021-2022
MASSACHUSETTS
78
1842
4.2
48399
911529
5.3
28
2021-2022
INDIANA
119
1915
6.2
50073
1033897
4.8
29
2021-2022
NEW JERSEY
87
2558
3.4
58780
1339937
4.4
30
2021-2022
TENNESSEE
116
1906
6.1
43908
995549
4.4
31
2021-2022
GEORGIA
94
2314
4.1
69242
1740875
4.0
32
2021-2022
ILLINOIS
136
4386
3.1
60496
1867412
3.2
33
2021-2022
NEW HAMPSHIRE
36
494
7.3
4934
168797
2.9
34
2021-2022
MARYLAND
48
1417
3.4
24104
881461
2.7
35
2021-2022
MISSOURI
78
2453
3.2
24146
887077
2.7
36
2021-2022
CONNECTICUT
21
1000
2.1
11047
496795
2.2
37
2021-2022
MAINE
13
571
2.3
2732
168236
1.6
38
2021-2022
MISSISSIPPI
7
1040
0.7
2921
442000
0.7
39
2021-2022
WYOMING
5
359
1.4
642
92848
0.7
40
2021-2022
KANSAS
9
1355
0.7
2877
480069
0.6
41
2021-2022
PUERTO RICO
7
843
0.8
1515
259535
0.6
42
2021-2022
ALABAMA
11
1494
0.7
3566
747846
0.5
43
2021-2022
WASHINGTON
17
2546
0.7
4571
1080143
0.4
44
2021-2022
VIRGINIA
7
1948
0.4
1278
1244624
0.1
45
2021-2022
IOWA
2
1325
0.2
149
503735
0.0
You'll notice syntax similar to our previous aggregate query, but I have omitted the step that converts the percent to a string here, to aid in further analysis.
Wrapping up with some rankings
Now that we have the data we need for our investigation, we can sort and re-sort it and find answers to our inquiries. However, one fundamental issue keeps coming up: the answer to our question is "it sort of depends"…
Which is very common in data analysis. Depending on how you phrase your question you might get different answers. Wouldn’t it be nice to have a definitive answer about which state, broadly, ranks highest for charter presence?
To do this, let’s explore a little more advanced of a topic in SQL queries: window functions.
I won’t delve into them too deeply, but a basic understanding is helpful here:
Imagine you’re looking at a row of data in a table, like a line of people. A window function in SQL helps you calculate things for each person in the line, like finding their position compared to others, without changing the order. It’s like having a magic window that lets you peek at everyone’s info while keeping them in the same order they were in. This helps you figure out things like who’s first, who’s last, and who’s in the middle without messing up the original line.
In this case, we'd like to understand each state's ranking for each calculation we created earlier, i.e., which state is #1 for the most charter schools, which is #1 for the most students in charter schools, etc. For that we will use the rank() window function.
All window functions require an accompanying over() clause. These can get complicated, but for our purposes we are just using it to specify the order to rank by.
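To see rank() in isolation before layering on four of them at once, a small warm-up of my own might rank states by total students:

query = """
    select
        statename,
        total_students,
        rank() over(order by total_students desc) as students_rank,
    from charter_enrollments
    order by students_rank
    limit 5
"""
duckdb.sql(query).df()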
query =""" select *, rank() over(order by percent_charter_schools desc) as percent_schools_rank, rank() over(order by charter_schools desc) as count_schools_rank, rank() over(order by percent_charter_students desc) as percent_students_rank, rank() over(order by charter_students desc) as count_students_rank, from charter_enrollments order by statename"""rankings = duckdb.sql(query).df()
rankings
school_year
statename
charter_schools
total_schools
percent_charter_schools
charter_students
total_students
percent_charter_students
percent_schools_rank
count_schools_rank
percent_students_rank
count_students_rank
0
2021-2022
ALABAMA
11
1494
0.7
3566
747846
0.5
41
39
43
38
1
2021-2022
ALASKA
30
500
6.0
7621
129944
5.9
29
34
27
35
2
2021-2022
ARIZONA
581
2418
24.0
231195
1131888
20.4
2
4
2
4
3
2021-2022
ARKANSAS
93
1084
8.6
42341
489565
8.6
20
22
14
25
4
2021-2022
CALIFORNIA
1291
10167
12.7
678056
5874619
11.5
5
1
8
1
5
2021-2022
COLORADO
265
1941
13.7
130279
880582
14.8
4
9
3
9
6
2021-2022
CONNECTICUT
21
1000
2.1
11047
496795
2.2
38
36
37
34
7
2021-2022
DELAWARE
23
229
10.0
17201
139935
12.3
17
35
7
31
8
2021-2022
DISTRICT OF COLUMBIA
123
240
51.2
39476
88543
44.6
1
17
1
26
9
2021-2022
FLORIDA
707
4191
16.9
361634
2832516
12.8
3
3
5
3
10
2021-2022
GEORGIA
94
2314
4.1
69242
1740875
4.0
31
21
32
13
11
2021-2022
GUAM
3
44
6.8
1783
28402
6.3
25
45
24
42
12
2021-2022
HAWAII
37
294
12.6
12114
173178
7.0
7
32
21
32
13
2021-2022
IDAHO
77
784
9.8
28051
313909
8.9
18
28
13
28
14
2021-2022
ILLINOIS
136
4386
3.1
60496
1867412
3.2
36
15
33
16
15
2021-2022
INDIANA
119
1915
6.2
50073
1033897
4.8
26
18
29
19
16
2021-2022
IOWA
2
1325
0.2
149
503735
0.0
46
46
46
46
17
2021-2022
KANSAS
9
1355
0.7
2877
480069
0.6
41
40
41
40
18
2021-2022
LOUISIANA
146
1366
10.7
86301
680793
12.7
12
13
6
11
19
2021-2022
MAINE
13
571
2.3
2732
168236
1.6
37
38
38
41
20
2021-2022
MARYLAND
48
1417
3.4
24104
881461
2.7
33
30
35
30
21
2021-2022
MASSACHUSETTS
78
1842
4.2
48399
911529
5.3
30
26
28
22
22
2021-2022
MICHIGAN
374
3538
10.6
150327
1396598
10.8
13
5
10
7
23
2021-2022
MINNESOTA
278
2661
10.4
66595
868742
7.7
15
8
20
14
24
2021-2022
MISSISSIPPI
7
1040
0.7
2921
442000
0.7
41
41
39
39
25
2021-2022
MISSOURI
78
2453
3.2
24146
887077
2.7
35
26
35
29
26
2021-2022
NEVADA
93
746
12.5
63944
488251
13.1
8
22
4
15
27
2021-2022
NEW HAMPSHIRE
36
494
7.3
4934
168797
2.9
22
33
34
36
28
2021-2022
NEW JERSEY
87
2558
3.4
58780
1339937
4.4
33
25
30
18
29
2021-2022
NEW MEXICO
99
890
11.1
30160
316785
9.5
10
20
12
27
30
2021-2022
NEW YORK
330
4802
6.9
173341
2526204
6.9
23
6
22
5
31
2021-2022
NORTH CAROLINA
204
2719
7.5
131624
1525223
8.6
21
11
14
8
32
2021-2022
OHIO
324
3659
8.9
115021
1682397
6.8
19
7
23
10
33
2021-2022
OKLAHOMA
70
1792
3.9
59753
696411
8.6
32
29
14
17
34
2021-2022
OREGON
131
1285
10.2
42668
540687
7.9
16
16
19
24
35
2021-2022
PENNSYLVANIA
179
2941
6.1
163372
1671899
9.8
27
12
11
6
36
2021-2022
PUERTO RICO
7
843
0.8
1515
259535
0.6
40
41
41
43
37
2021-2022
RHODE ISLAND
40
315
12.7
11387
136864
8.3
5
31
17
33
38
2021-2022
SOUTH CAROLINA
88
1267
6.9
49344
780403
6.3
23
24
24
21
39
2021-2022
TENNESSEE
116
1906
6.1
43908
995549
4.4
27
19
30
23
40
2021-2022
TEXAS
1010
9105
11.1
443548
5428609
8.2
10
2
18
2
41
2021-2022
UTAH
137
1105
12.4
77733
687107
11.3
9
14
9
12
42
2021-2022
VIRGINIA
7
1948
0.4
1278
1244624
0.1
45
41
45
44
43
2021-2022
WASHINGTON
17
2546
0.7
4571
1080143
0.4
41
37
44
37
44
2021-2022
WISCONSIN
236
2243
10.5
49715
827393
6.0
14
10
26
20
45
2021-2022
WYOMING
5
359
1.4
642
92848
0.7
39
44
39
45
Combining Rankings
These rankings are useful and can certainly make sorting easier, but they haven't given us a more definitive answer. To get one, let's combine the rankings across metrics and find which state has the lowest value (i.e., closest to #1 across all combined metrics).
We'll once again use the rank() function, this time over the sum of the other ranks.
query =""" select *, rank() over(order by (percent_schools_rank + count_schools_rank + percent_students_rank + count_students_rank)) as combined_ranking from rankings order by combined_ranking"""duckdb.sql(query).df()
school_year
statename
charter_schools
total_schools
percent_charter_schools
charter_students
total_students
percent_charter_students
percent_schools_rank
count_schools_rank
percent_students_rank
count_students_rank
combined_ranking
0
2021-2022
ARIZONA
581
2418
24.0
231195
1131888
20.4
2
4
2
4
1
1
2021-2022
FLORIDA
707
4191
16.9
361634
2832516
12.8
3
3
5
3
2
2
2021-2022
CALIFORNIA
1291
10167
12.7
678056
5874619
11.5
5
1
8
1
3
3
2021-2022
COLORADO
265
1941
13.7
130279
880582
14.8
4
9
3
9
4
4
2021-2022
TEXAS
1010
9105
11.1
443548
5428609
8.2
10
2
18
2
5
5
2021-2022
MICHIGAN
374
3538
10.6
150327
1396598
10.8
13
5
10
7
6
6
2021-2022
LOUISIANA
146
1366
10.7
86301
680793
12.7
12
13
6
11
7
7
2021-2022
UTAH
137
1105
12.4
77733
687107
11.3
9
14
9
12
8
8
2021-2022
DISTRICT OF COLUMBIA
123
240
51.2
39476
88543
44.6
1
17
1
26
9
9
2021-2022
NEVADA
93
746
12.5
63944
488251
13.1
8
22
4
15
10
10
2021-2022
NORTH CAROLINA
204
2719
7.5
131624
1525223
8.6
21
11
14
8
11
11
2021-2022
NEW YORK
330
4802
6.9
173341
2526204
6.9
23
6
22
5
12
12
2021-2022
PENNSYLVANIA
179
2941
6.1
163372
1671899
9.8
27
12
11
6
12
13
2021-2022
MINNESOTA
278
2661
10.4
66595
868742
7.7
15
8
20
14
14
14
2021-2022
OHIO
324
3659
8.9
115021
1682397
6.8
19
7
23
10
15
15
2021-2022
NEW MEXICO
99
890
11.1
30160
316785
9.5
10
20
12
27
16
16
2021-2022
WISCONSIN
236
2243
10.5
49715
827393
6.0
14
10
26
20
17
17
2021-2022
OREGON
131
1285
10.2
42668
540687
7.9
16
16
19
24
18
18
2021-2022
ARKANSAS
93
1084
8.6
42341
489565
8.6
20
22
14
25
19
19
2021-2022
RHODE ISLAND
40
315
12.7
11387
136864
8.3
5
31
17
33
20
20
2021-2022
IDAHO
77
784
9.8
28051
313909
8.9
18
28
13
28
21
21
2021-2022
DELAWARE
23
229
10.0
17201
139935
12.3
17
35
7
31
22
22
2021-2022
HAWAII
37
294
12.6
12114
173178
7.0
7
32
21
32
23
23
2021-2022
INDIANA
119
1915
6.2
50073
1033897
4.8
26
18
29
19
23
24
2021-2022
OKLAHOMA
70
1792
3.9
59753
696411
8.6
32
29
14
17
23
25
2021-2022
SOUTH CAROLINA
88
1267
6.9
49344
780403
6.3
23
24
24
21
23
26
2021-2022
GEORGIA
94
2314
4.1
69242
1740875
4.0
31
21
32
13
27
27
2021-2022
TENNESSEE
116
1906
6.1
43908
995549
4.4
27
19
30
23
28
28
2021-2022
ILLINOIS
136
4386
3.1
60496
1867412
3.2
36
15
33
16
29
29
2021-2022
MASSACHUSETTS
78
1842
4.2
48399
911529
5.3
30
26
28
22
30
30
2021-2022
NEW JERSEY
87
2558
3.4
58780
1339937
4.4
33
25
30
18
30
31
2021-2022
ALASKA
30
500
6.0
7621
129944
5.9
29
34
27
35
32
32
2021-2022
MISSOURI
78
2453
3.2
24146
887077
2.7
35
26
35
29
32
33
2021-2022
NEW HAMPSHIRE
36
494
7.3
4934
168797
2.9
22
33
34
36
32
34
2021-2022
MARYLAND
48
1417
3.4
24104
881461
2.7
33
30
35
30
35
35
2021-2022
GUAM
3
44
6.8
1783
28402
6.3
25
45
24
42
36
36
2021-2022
CONNECTICUT
21
1000
2.1
11047
496795
2.2
38
36
37
34
37
37
2021-2022
MAINE
13
571
2.3
2732
168236
1.6
37
38
38
41
38
38
2021-2022
WASHINGTON
17
2546
0.7
4571
1080143
0.4
41
37
44
37
39
39
2021-2022
MISSISSIPPI
7
1040
0.7
2921
442000
0.7
41
41
39
39
40
40
2021-2022
ALABAMA
11
1494
0.7
3566
747846
0.5
41
39
43
38
41
41
2021-2022
KANSAS
9
1355
0.7
2877
480069
0.6
41
40
41
40
42
42
2021-2022
PUERTO RICO
7
843
0.8
1515
259535
0.6
40
41
41
43
43
43
2021-2022
WYOMING
5
359
1.4
642
92848
0.7
39
44
39
45
44
44
2021-2022
VIRGINIA
7
1948
0.4
1278
1244624
0.1
45
41
45
44
45
45
2021-2022
IOWA
2
1325
0.2
149
503735
0.0
46
46
46
46
46
Conclusion
We can see that Arizona ranks highest for charter presence, followed by Florida. A quite different story than we were getting previously with our one-off questions. Fascinating!
There is much more you can do as you expand your knowledge of SQL, including taking this data and visualizing it for presentation to an audience. That's a topic for another day, but let's store our results in a new dataset.
query =""" select *, rank() over(order by (percent_schools_rank + count_schools_rank + percent_students_rank + count_students_rank)) as combined_ranking from rankings order by combined_ranking"""charter_presence_rankings = duckdb.sql(query).df()charter_presence_rankings.to_csv("charter_presence_rankings.csv.gz", compression="gzip", index=False)
In this tutorial we learned:
1. some basic SQL vocabulary
2. how to read data from files into dataframes for querying
3. how to explore the structure of a new dataset
4. how to select data from a dataframe using duckdb and sql syntax
5. how to filter and limit data
6. how to translate questions into queries
7. how to do some basic aggregations like count and sum
8. how to better organize complex queries with CTEs
9. a valuable lesson about data types
10. how to join data from multiple datasets
11. how to leverage a basic window function rank to come to a more definitive answer
12. how to store the output of our analysis for future use
I hope this was helpful and gave you a taste of how to iterate on a set of questions using SQL to find an answer in a structured way (even without access to a database).