SQL Basics using DuckDB

tutorial
Published

August 20, 2023

I’ve been wanting to put together a simple resource for learning the basics of querying data with SQL for some time now. However, one of the roadblocks has been the overhead of initial setup instructions for creating a database to query from. Even cloud options have hurdles that prevent just jumping in and writing SQL.

DuckDB: Simple and Effective

One of my favorite new tools is DuckDB, especially when combined with pandas dataframes. It lets me use pandas to quickly read a data source into a dataframe, then set aside pandas’ often confusing querying syntax in favor of the tool I am most comfortable with in my data analysis toolbelt: SQL.

To start, we’ll need some data to work with. Similar to the last tutorial, let’s download the 2021-2022 NCES national school directory to work with.

import pandas as pd


directory = pd.read_csv("ccd_sch_029_2122_w_1a_071722.csv", dtype=str)


# convert column names to lowercase, 
# because I'm not a fan of my column names screaming at me in ALL CAPS
directory.columns = [c.lower() for c in directory.columns] 

directory.head()
school_year fipst statename st sch_name lea_name state_agency_no union st_leaid leaid ... g_10_offered g_11_offered g_12_offered g_13_offered g_ug_offered g_ae_offered gslo gshi level igoffered
0 2021-2022 01 ALABAMA AL Albertville Middle School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 07 08 Middle As reported
1 2021-2022 01 ALABAMA AL Albertville High School Albertville City 01 NaN AL-101 0100005 ... Yes Yes Yes No No No 09 12 High As reported
2 2021-2022 01 ALABAMA AL Albertville Intermediate School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 05 06 Middle As reported
3 2021-2022 01 ALABAMA AL Albertville Elementary School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 03 04 Elementary As reported
4 2021-2022 01 ALABAMA AL Albertville Kindergarten and PreK Albertville City 01 NaN AL-101 0100005 ... No No No No No No PK KG Elementary As reported

5 rows × 65 columns

A quick pause for vocabulary

Before we dive in to writing SQL queries, let’s pause to define a few common terms that will come up.

  • Table: A table is a fundamental component of a database. It represents a collection of related data organized in rows and columns. Each row in a table is called a record, and each column represents a field.
  • Field: A field is a single piece of data within a table. It corresponds to the columns in a table and defines the type of data it can hold, such as text, numbers, dates, or binary data.
  • Record: A record, also known as a row, is a complete set of related data in a table. It contains values for each field, representing a single entity or data entry within the database.
  • Schema: A schema is a blueprint or structure that defines the organization of a database. It outlines the tables, fields, data types, relationships, and constraints that form the database’s structure.
  • Query: A query is a request made to the database to retrieve or manipulate data. It uses a structured query language (SQL) to interact with the database.

Understanding table schema

Understanding the schema of a table is essential before writing a SQL query. The schema defines the structure of the table, including the names of columns, data types, and constraints. When exploring unfamiliar datasets, understanding the schema provides insights into the available data and helps you identify relevant tables and columns for analysis. It also ensures that you refer to the correct column names and use the appropriate data types in your SQL query, leading to accurate and valid results. Knowing the data types of each column helps you format your query correctly.

For example, when dealing with dates or numerical values, understanding the data types ensures you use the right functions and operators for calculations. When writing queries, you often need to filter or sort data based on certain criteria. Understanding the schema allows you to apply the correct conditions and sorting instructions to retrieve the desired results effectively.

Now, let’s orient ourselves to the data in this dataset and its schema. To do so, let’s get a list of columns and a sense of the data volume.

directory.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102130 entries, 0 to 102129
Data columns (total 65 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   school_year          102130 non-null  object
 1   fipst                102130 non-null  object
 2   statename            102130 non-null  object
 3   st                   102130 non-null  object
 4   sch_name             102130 non-null  object
 5   lea_name             102130 non-null  object
 6   state_agency_no      102130 non-null  object
 7   union                2440 non-null    object
 8   st_leaid             102130 non-null  object
 9   leaid                102130 non-null  object
 10  st_schid             102130 non-null  object
 11  ncessch              102130 non-null  object
 12  schid                102130 non-null  object
 13  mstreet1             102130 non-null  object
 14  mstreet2             1672 non-null    object
 15  mstreet3             2 non-null       object
 16  mcity                102130 non-null  object
 17  mstate               102130 non-null  object
 18  mzip                 102130 non-null  object
 19  mzip4                60951 non-null   object
 20  lstreet1             102127 non-null  object
 21  lstreet2             579 non-null     object
 22  lstreet3             4 non-null       object
 23  lcity                102130 non-null  object
 24  lstate               102130 non-null  object
 25  lzip                 102130 non-null  object
 26  lzip4                59351 non-null   object
 27  phone                102130 non-null  object
 28  website              66812 non-null   object
 29  sy_status            102130 non-null  object
 30  sy_status_text       102130 non-null  object
 31  updated_status       102130 non-null  object
 32  updated_status_text  102130 non-null  object
 33  effective_date       102130 non-null  object
 34  sch_type_text        102130 non-null  object
 35  sch_type             102130 non-null  object
 36  recon_status         102130 non-null  object
 37  out_of_state_flag    102130 non-null  object
 38  charter_text         102130 non-null  object
 39  chartauth1           6549 non-null    object
 40  chartauthn1          6553 non-null    object
 41  chartauth2           97 non-null      object
 42  chartauthn2          97 non-null      object
 43  nogrades             102130 non-null  object
 44  g_pk_offered         102130 non-null  object
 45  g_kg_offered         102130 non-null  object
 46  g_1_offered          102130 non-null  object
 47  g_2_offered          102130 non-null  object
 48  g_3_offered          102130 non-null  object
 49  g_4_offered          102130 non-null  object
 50  g_5_offered          102130 non-null  object
 51  g_6_offered          102130 non-null  object
 52  g_7_offered          102130 non-null  object
 53  g_8_offered          102130 non-null  object
 54  g_9_offered          102130 non-null  object
 55  g_10_offered         102130 non-null  object
 56  g_11_offered         102130 non-null  object
 57  g_12_offered         102130 non-null  object
 58  g_13_offered         102130 non-null  object
 59  g_ug_offered         102130 non-null  object
 60  g_ae_offered         102130 non-null  object
 61  gslo                 102130 non-null  object
 62  gshi                 102130 non-null  object
 63  level                102130 non-null  object
 64  igoffered            102130 non-null  object
dtypes: object(65)
memory usage: 50.6+ MB

This shows us a number of things:

  • all the column names
  • their data types (recall we have set all of them to strings, which pandas lists as object)
  • the number of non-null values in each column
  • the number of columns: 65
  • the number of rows: 102,130
  • it even shows the amount of memory used to hold this data: 50.6+ MB

A note about null

In SQL and databases, null represents the absence of a value or unknown data for a specific field in a table. It’s not the same as zero or an empty string, and it requires special handling in SQL queries. A null can appear in a column of any data type, whether string, numeric, date, or anything else.

Dealing with null values in SQL queries can be tricky, and can lead to unexpected or incorrect results.

Getting Started

Now that we have a basic lay-of-the-land for our data, we can begin querying it. At this point, I will shift away from pandas syntax and use SQL queries.

To start, let’s install duckdb.

pip install duckdb

import duckdb

query = "select * from directory limit 5"

duckdb.sql(query).df()
school_year fipst statename st sch_name lea_name state_agency_no union st_leaid leaid ... g_10_offered g_11_offered g_12_offered g_13_offered g_ug_offered g_ae_offered gslo gshi level igoffered
0 2021-2022 01 ALABAMA AL Albertville Middle School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 07 08 Middle As reported
1 2021-2022 01 ALABAMA AL Albertville High School Albertville City 01 NaN AL-101 0100005 ... Yes Yes Yes No No No 09 12 High As reported
2 2021-2022 01 ALABAMA AL Albertville Intermediate School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 05 06 Middle As reported
3 2021-2022 01 ALABAMA AL Albertville Elementary School Albertville City 01 NaN AL-101 0100005 ... No No No No No No 03 04 Elementary As reported
4 2021-2022 01 ALABAMA AL Albertville Kindergarten and PreK Albertville City 01 NaN AL-101 0100005 ... No No No No No No PK KG Elementary As reported

5 rows × 65 columns

This basic query is equivalent to directory.head().

select * 
from directory
limit 5

What is each keyword doing here?

select: The select keyword is used to specify which columns or expressions to include in the query result. It allows you to choose the data you want to retrieve from the tables in the database. In SQL, the asterisk (*) is used in a select statement to represent all columns in a table.

from: The from keyword is used to specify the table from which the data will be retrieved. It tells the query where to find the data to be selected. With duckdb, we can specify a dataframe as the table to be queried.

limit: The limit keyword is used to restrict the number of rows returned in the query result. It is particularly useful when you only need to see a small subset of the data.

Asking Questions

SQL queries are fundamentally a way of interrogating your data. You usually start with a question you want to answer and then form your query in such a way that the database can understand it and return the relevant data.

One of the basic ways to interrogate the data is to constrain it in some way. Filtering the data to show only a subset allows you to begin to identify possible patterns. The where clause is particularly valuable for this purpose.

Let’s translate this question into a SQL query: Which schools in Delaware offer 12th grade enrollment?

query = """
  select sch_name
  from directory
  where statename = 'DELAWARE'
    and g_12_offered = 'Yes'
  order by sch_name
"""

duckdb.sql(query).df()
sch_name
0 Appoquinimink High School
1 Brandywine High School
2 Brandywine SITE
3 Brennen School (The)
4 Caesar Rodney High School
5 Calloway (Cab) School of the Arts
6 Cape Henlopen High School
7 Charter School of Wilmington
8 Christiana High School
9 Concord High School
10 Conrad Schools of Science
11 Delaware Military Academy
12 Delaware School for the Deaf
13 Delcastle Technical High School
14 Delmar High School
15 Dickinson (John) School
16 Douglass School
17 Dover High School
18 Early College High School at Del State
19 Ennis (Howard T.) School
20 First State Military Academy
21 Freire Charter School
22 Glasgow High School
23 Great Oaks Charter School
24 Hodgson (Paul M.) Vocational Technical High School
25 Howard High School of Technology
26 Indian River High School
27 Indian River Intensive Learning Center
28 John S. Charlton School
29 Kent County Community School
30 Kent County Secondary ILC
31 Lake Forest High School
32 Laurel Senior High School
33 Leach (John G.) School
34 MOT Charter School
35 McKean (Thomas) High School
36 Meadowood Program
37 Middletown High School
38 Milford Senior High School
39 Mount Pleasant High School
40 Newark Charter School
41 Newark High School
42 Odyssey Charter School
43 POLYTECH High School
44 Penn (William) High School
45 Positive Outcomes Charter School
46 Seaford Senior High School
47 Smyrna High School
48 St. Georges Technical High School
49 Sussex Academy
50 Sussex Central High School
51 Sussex Consortium
52 Sussex Orthopedic Program
53 Sussex Technical High School
54 The Wallace Wallin School
55 Woodbridge High School
56 duPont (Alexis I.) High School

NOTE: You may have noticed the use of """ in the code above. Triple quotes allow a string to span multiple lines, which I find easier to read for SQL queries.

What is happening here? Let’s break down this query.

select sch_name -- return only the school name column
from directory -- from the data in our dataframe
where statename = 'DELAWARE' -- where the state name is DELAWARE (state names are all caps in the dataset)
  and g_12_offered = 'Yes' -- and where grade 12 is offered
order by sch_name -- sort the resulting data alphabetically by the school name

Next, let’s refine this query to identify which of those schools are charter schools or not.

query = """
  select 
    sch_name,
    charter_text,
  from directory
  where statename = 'DELAWARE'
    and g_12_offered = 'Yes'
  order by sch_name
"""

duckdb.sql(query).df()
sch_name charter_text
0 Appoquinimink High School No
1 Brandywine High School No
2 Brandywine SITE No
3 Brennen School (The) No
4 Caesar Rodney High School No
5 Calloway (Cab) School of the Arts No
6 Cape Henlopen High School No
7 Charter School of Wilmington Yes
8 Christiana High School No
9 Concord High School No
10 Conrad Schools of Science No
11 Delaware Military Academy Yes
12 Delaware School for the Deaf No
13 Delcastle Technical High School No
14 Delmar High School No
15 Dickinson (John) School No
16 Douglass School No
17 Dover High School No
18 Early College High School at Del State Yes
19 Ennis (Howard T.) School No
20 First State Military Academy Yes
21 Freire Charter School Yes
22 Glasgow High School No
23 Great Oaks Charter School Yes
24 Hodgson (Paul M.) Vocational Technical High School No
25 Howard High School of Technology No
26 Indian River High School No
27 Indian River Intensive Learning Center No
28 John S. Charlton School No
29 Kent County Community School No
30 Kent County Secondary ILC No
31 Lake Forest High School No
32 Laurel Senior High School No
33 Leach (John G.) School No
34 MOT Charter School Yes
35 McKean (Thomas) High School No
36 Meadowood Program No
37 Middletown High School No
38 Milford Senior High School No
39 Mount Pleasant High School No
40 Newark Charter School Yes
41 Newark High School No
42 Odyssey Charter School Yes
43 POLYTECH High School No
44 Penn (William) High School No
45 Positive Outcomes Charter School Yes
46 Seaford Senior High School No
47 Smyrna High School No
48 St. Georges Technical High School No
49 Sussex Academy Yes
50 Sussex Central High School No
51 Sussex Consortium No
52 Sussex Orthopedic Program No
53 Sussex Technical High School No
54 The Wallace Wallin School No
55 Woodbridge High School No
56 duPont (Alexis I.) High School No

And if we want to limit our list to just the charter schools, we can add another filter to the where clause.

query = """
  select 
    sch_name,
    charter_text,
  from directory
  where statename = 'DELAWARE'
    and g_12_offered = 'Yes'
    and charter_text = 'Yes'
  order by sch_name
"""

duckdb.sql(query).df()
sch_name charter_text
0 Charter School of Wilmington Yes
1 Delaware Military Academy Yes
2 Early College High School at Del State Yes
3 First State Military Academy Yes
4 Freire Charter School Yes
5 Great Oaks Charter School Yes
6 MOT Charter School Yes
7 Newark Charter School Yes
8 Odyssey Charter School Yes
9 Positive Outcomes Charter School Yes
10 Sussex Academy Yes

Counting: A common type of question to ask about data

After this initial investigation, my curiosity has been piqued. I am wondering how many schools in Delaware are charters vs. traditional public schools (regardless of grades offered). Let’s translate that question into a SQL query using some basic aggregation.

query = """
  select 
    statename,
    charter_text,
    count(sch_name) as num_schools,
  from directory
  where statename = 'DELAWARE'
  group by 
    statename, 
    charter_text
"""

duckdb.sql(query).df()
statename charter_text num_schools
0 DELAWARE Yes 23
1 DELAWARE No 208

How does this work?

The count() function, predictably, counts the number of non-null values in a given column (or the number of records if the * is passed instead). You can further restrict this to count only unique school names by using count(distinct sch_name).

We are also aliasing the count using the as keyword to give the resulting column a name: num_schools.

The group by clause is necessary for this aggregation because it defines the context in which the count() function operates. Without the group by clause, the count() function would treat the entire result set as a single group.

Consider this example instead (without grouping):

query = """
  select 
    count(sch_name) as num_schools,
  from directory
  where statename = 'DELAWARE'
"""

duckdb.sql(query).df()
num_schools
0 231

In order to get a count for each value of the Yes/No flag in the charter_text column, we need to group our counts by each of those results. In addition, in order to include the statename in our result set, we need to include it in the grouping as well.

If we wanted to determine the same counts for all states in the dataset, we could simply remove the where clause limiting the data to Delaware.

query = """
  select 
    statename,
    charter_text,
    count(sch_name) as num_schools,
  from directory
  group by 
    statename, 
    charter_text
"""

duckdb.sql(query).df()
statename charter_text num_schools
0 ALABAMA No 1553
1 ALABAMA Yes 13
2 ALASKA No 478
3 ALASKA Yes 31
4 ARIZONA No 1909
... ... ... ...
100 GUAM Yes 3
101 NORTHERN MARIANAS Not applicable 35
102 PUERTO RICO No 846
103 PUERTO RICO Yes 7
104 U.S. VIRGIN ISLANDS Not applicable 23

105 rows × 3 columns

This gives us some sense of the breakdown, but it actually might be useful to rephrase our question as: What number and percent of schools in each state are charter schools?

With that additional clarity, we can refine our query a bit using some conditional logic.

query = """
  select 
    statename,
    count(
        if(charter_text = 'Yes', sch_name, null)
    ) as num_charter_schools,
    count(sch_name) as num_schools,
  from directory
  group by statename
  order by statename
"""

duckdb.sql(query).df()
statename num_charter_schools num_schools
0 ALABAMA 13 1566
1 ALASKA 31 509
2 AMERICAN SAMOA 0 29
3 ARIZONA 606 2515
4 ARKANSAS 93 1099
5 BUREAU OF INDIAN EDUCATION 0 174
6 CALIFORNIA 1337 10456
7 COLORADO 268 1960
8 CONNECTICUT 21 1009
9 DELAWARE 23 231
10 DISTRICT OF COLUMBIA 125 243
11 FLORIDA 767 4306
12 GEORGIA 95 2327
13 GUAM 3 44
14 HAWAII 37 296
15 IDAHO 81 801
16 ILLINOIS 138 4409
17 INDIANA 121 1933
18 IOWA 2 1333
19 KANSAS 9 1360
20 KENTUCKY 0 1548
21 LOUISIANA 150 1383
22 MAINE 13 602
23 MARYLAND 49 1424
24 MASSACHUSETTS 78 1859
25 MICHIGAN 386 3570
26 MINNESOTA 306 2773
27 MISSISSIPPI 7 1054
28 MISSOURI 81 2469
29 MONTANA 0 832
30 NEBRASKA 0 1098
31 NEVADA 99 769
32 NEW HAMPSHIRE 39 498
33 NEW JERSEY 87 2582
34 NEW MEXICO 102 894
35 NEW YORK 331 4842
36 NORTH CAROLINA 223 2765
37 NORTH DAKOTA 0 536
38 NORTHERN MARIANAS 0 35
39 OHIO 325 3689
40 OKLAHOMA 72 1803
41 OREGON 132 1290
42 PENNSYLVANIA 184 2963
43 PUERTO RICO 7 853
44 RHODE ISLAND 41 319
45 SOUTH CAROLINA 93 1288
46 SOUTH DAKOTA 0 732
47 TENNESSEE 121 1931
48 TEXAS 1056 9652
49 U.S. VIRGIN ISLANDS 0 23
50 UTAH 141 1123
51 VERMONT 0 305
52 VIRGINIA 7 2136
53 WASHINGTON 17 2561
54 WEST VIRGINIA 3 700
55 WISCONSIN 243 2263
56 WYOMING 5 366

That gets us the counts of charter schools in each state as well as the total number of schools in each state. Let’s break down what this is doing:

count(
  if(charter_text = 'Yes', sch_name, null)
) as num_charter_schools,

The if() function evaluates a condition and returns one value when it is true and another when it is false. Here, we leverage that behavior to count the school name when the value of the charter_text column is Yes and return null when it is not. That works because the count() function ignores null values when aggregating.

A similar, but slightly different approach could be done using the sum() function instead:

sum(
  if(charter_text = 'Yes', 1, 0)
) as num_charter_schools,

This treats each charter school as the value 1 and each non-charter school as the value 0, then adds them up. The result is the same in this case, though there can be subtle differences between the two approaches depending on the data being aggregated. It can be useful, at times, to compare the resulting differences (if any).

Calculating the percentage of schools

We have our counts now, but we don’t have the percentage of schools and that is ultimately the question we are asking. To determine that, we need to divide the number of charters in each state by the total number of schools.

We could do that like this:

query = """
  select 
    statename,
    count(if(charter_text = 'Yes', sch_name, null)) as num_charter_schools,
    count(sch_name) as num_schools,
    count(if(charter_text = 'Yes', sch_name, null))/count(sch_name) as percent_of_schools,
  from directory
  group by statename
  order by statename
"""

duckdb.sql(query).df()
statename num_charter_schools num_schools percent_of_schools
0 ALABAMA 13 1566 0.008301
1 ALASKA 31 509 0.060904
2 AMERICAN SAMOA 0 29 0.000000
3 ARIZONA 606 2515 0.240954
4 ARKANSAS 93 1099 0.084622
5 BUREAU OF INDIAN EDUCATION 0 174 0.000000
6 CALIFORNIA 1337 10456 0.127869
7 COLORADO 268 1960 0.136735
8 CONNECTICUT 21 1009 0.020813
9 DELAWARE 23 231 0.099567
10 DISTRICT OF COLUMBIA 125 243 0.514403
11 FLORIDA 767 4306 0.178124
12 GEORGIA 95 2327 0.040825
13 GUAM 3 44 0.068182
14 HAWAII 37 296 0.125000
15 IDAHO 81 801 0.101124
16 ILLINOIS 138 4409 0.031300
17 INDIANA 121 1933 0.062597
18 IOWA 2 1333 0.001500
19 KANSAS 9 1360 0.006618
20 KENTUCKY 0 1548 0.000000
21 LOUISIANA 150 1383 0.108460
22 MAINE 13 602 0.021595
23 MARYLAND 49 1424 0.034410
24 MASSACHUSETTS 78 1859 0.041958
25 MICHIGAN 386 3570 0.108123
26 MINNESOTA 306 2773 0.110350
27 MISSISSIPPI 7 1054 0.006641
28 MISSOURI 81 2469 0.032807
29 MONTANA 0 832 0.000000
30 NEBRASKA 0 1098 0.000000
31 NEVADA 99 769 0.128739
32 NEW HAMPSHIRE 39 498 0.078313
33 NEW JERSEY 87 2582 0.033695
34 NEW MEXICO 102 894 0.114094
35 NEW YORK 331 4842 0.068360
36 NORTH CAROLINA 223 2765 0.080651
37 NORTH DAKOTA 0 536 0.000000
38 NORTHERN MARIANAS 0 35 0.000000
39 OHIO 325 3689 0.088100
40 OKLAHOMA 72 1803 0.039933
41 OREGON 132 1290 0.102326
42 PENNSYLVANIA 184 2963 0.062099
43 PUERTO RICO 7 853 0.008206
44 RHODE ISLAND 41 319 0.128527
45 SOUTH CAROLINA 93 1288 0.072205
46 SOUTH DAKOTA 0 732 0.000000
47 TENNESSEE 121 1931 0.062662
48 TEXAS 1056 9652 0.109407
49 U.S. VIRGIN ISLANDS 0 23 0.000000
50 UTAH 141 1123 0.125557
51 VERMONT 0 305 0.000000
52 VIRGINIA 7 2136 0.003277
53 WASHINGTON 17 2561 0.006638
54 WEST VIRGINIA 3 700 0.004286
55 WISCONSIN 243 2263 0.107380
56 WYOMING 5 366 0.013661

Not bad, but I find that decimal representation of the percentage a bit harder to read. Let’s transform that to look like 12.5% by:

  • multiplying the decimal by 100 to get a whole number
  • rounding to a single decimal digit
  • converting it to a string and adding the % sign

query = """
  select 
    statename,
    count(if(charter_text = 'Yes', sch_name, null)) as num_charter_schools,
    count(sch_name) as num_schools,
    concat(cast(round((count(if(charter_text = 'Yes', sch_name, null))/count(sch_name)) * 100, 1) as string), '%') as percent_of_schools,
  from directory
  group by statename
  order by statename
"""

duckdb.sql(query).df()
statename num_charter_schools num_schools percent_of_schools
0 ALABAMA 13 1566 0.8%
1 ALASKA 31 509 6.1%
2 AMERICAN SAMOA 0 29 0.0%
3 ARIZONA 606 2515 24.1%
4 ARKANSAS 93 1099 8.5%
5 BUREAU OF INDIAN EDUCATION 0 174 0.0%
6 CALIFORNIA 1337 10456 12.8%
7 COLORADO 268 1960 13.7%
8 CONNECTICUT 21 1009 2.1%
9 DELAWARE 23 231 10.0%
10 DISTRICT OF COLUMBIA 125 243 51.4%
11 FLORIDA 767 4306 17.8%
12 GEORGIA 95 2327 4.1%
13 GUAM 3 44 6.8%
14 HAWAII 37 296 12.5%
15 IDAHO 81 801 10.1%
16 ILLINOIS 138 4409 3.1%
17 INDIANA 121 1933 6.3%
18 IOWA 2 1333 0.2%
19 KANSAS 9 1360 0.7%
20 KENTUCKY 0 1548 0.0%
21 LOUISIANA 150 1383 10.8%
22 MAINE 13 602 2.2%
23 MARYLAND 49 1424 3.4%
24 MASSACHUSETTS 78 1859 4.2%
25 MICHIGAN 386 3570 10.8%
26 MINNESOTA 306 2773 11.0%
27 MISSISSIPPI 7 1054 0.7%
28 MISSOURI 81 2469 3.3%
29 MONTANA 0 832 0.0%
30 NEBRASKA 0 1098 0.0%
31 NEVADA 99 769 12.9%
32 NEW HAMPSHIRE 39 498 7.8%
33 NEW JERSEY 87 2582 3.4%
34 NEW MEXICO 102 894 11.4%
35 NEW YORK 331 4842 6.8%
36 NORTH CAROLINA 223 2765 8.1%
37 NORTH DAKOTA 0 536 0.0%
38 NORTHERN MARIANAS 0 35 0.0%
39 OHIO 325 3689 8.8%
40 OKLAHOMA 72 1803 4.0%
41 OREGON 132 1290 10.2%
42 PENNSYLVANIA 184 2963 6.2%
43 PUERTO RICO 7 853 0.8%
44 RHODE ISLAND 41 319 12.9%
45 SOUTH CAROLINA 93 1288 7.2%
46 SOUTH DAKOTA 0 732 0.0%
47 TENNESSEE 121 1931 6.3%
48 TEXAS 1056 9652 10.9%
49 U.S. VIRGIN ISLANDS 0 23 0.0%
50 UTAH 141 1123 12.6%
51 VERMONT 0 305 0.0%
52 VIRGINIA 7 2136 0.3%
53 WASHINGTON 17 2561 0.7%
54 WEST VIRGINIA 3 700 0.4%
55 WISCONSIN 243 2263 10.7%
56 WYOMING 5 366 1.4%

Cleaning this up for readability

At this point we have a working query, but the syntax is getting dense and hard to read. I’d like to break it up to make each step clearer.

Our code has three main steps:

  1. calculate the counts of charter schools and all schools for each state
  2. divide the counts to calculate a percentage value as decimal
  3. transform the decimal value to a string representation of the percentage

We can use nested subqueries to do this, but I consider that an anti-pattern in SQL and prefer the use of CTEs (common table expressions). They allow us to designate a query to be an intermediate result that can be further queried.

New Vocabulary

  • Common Table Expression (CTE): a temporary result set defined within the scope of a single SQL statement. It allows you to create a named query that can be referenced multiple times within the main query. CTEs are primarily used to improve the readability, modularity, and reusability of SQL queries.

Note: Subqueries as an anti-pattern

Using nested subqueries in SQL can become an anti-pattern because it can lead to complex, hard-to-maintain queries and potential performance issues. Nested subqueries involve placing one query inside another, often making the overall query difficult to read, understand, and optimize. While nested subqueries can solve specific problems, using them excessively can lead to unreadable, inefficient, and hard-to-maintain SQL queries. It’s generally better to explore alternative techniques like joins, CTEs, and window functions for clearer, more optimized code.

query = """
  with school_counts as (
    select 
      statename,
      count(if(charter_text = 'Yes', sch_name, null)) as num_charter_schools,
      count(sch_name) as num_schools,
    from directory
    group by statename
  ),

  percents as (
      select
        statename,
        num_charter_schools,
        num_schools,
        round((num_charter_schools/num_schools) * 100, 1) as percent_of_schools
      from school_counts
  )

  select
    statename,
    num_charter_schools,
    num_schools,
    concat(cast(percent_of_schools as string),'%') as percent_of_schools
  from percents
"""

charters = duckdb.sql(query).df()

Another way to do this, unique to duckdb, is to store each intermediate result in a dataframe variable and run the next query directly against that dataframe.

query = """
  select 
      statename,
      count(if(charter_text = 'Yes', sch_name, null)) as num_charter_schools,
      count(sch_name) as num_schools,
  from directory
  group by statename
"""
school_counts = duckdb.sql(query).df()

query = """
  select
    statename,
    num_charter_schools,
    num_schools,
    round((num_charter_schools/num_schools) * 100, 1) as percent_of_schools
  from school_counts
"""
percents = duckdb.sql(query).df()

query = """
  select
    statename,
    num_charter_schools,
    num_schools,
    concat(cast(percent_of_schools as string),'%') as percent_of_schools
  from percents
"""
charters = duckdb.sql(query).df()
charters
statename num_charter_schools num_schools percent_of_schools
0 ALABAMA 13 1566 0.8%
1 ALASKA 31 509 6.1%
2 ARIZONA 606 2515 24.1%
3 ARKANSAS 93 1099 8.5%
4 CALIFORNIA 1337 10456 12.8%
5 COLORADO 268 1960 13.7%
6 CONNECTICUT 21 1009 2.1%
7 DELAWARE 23 231 10.0%
8 DISTRICT OF COLUMBIA 125 243 51.4%
9 FLORIDA 767 4306 17.8%
10 GEORGIA 95 2327 4.1%
11 HAWAII 37 296 12.5%
12 IDAHO 81 801 10.1%
13 ILLINOIS 138 4409 3.1%
14 INDIANA 121 1933 6.3%
15 IOWA 2 1333 0.2%
16 KANSAS 9 1360 0.7%
17 KENTUCKY 0 1548 0.0%
18 LOUISIANA 150 1383 10.8%
19 MAINE 13 602 2.2%
20 MARYLAND 49 1424 3.4%
21 MASSACHUSETTS 78 1859 4.2%
22 MICHIGAN 386 3570 10.8%
23 MINNESOTA 306 2773 11.0%
24 MISSISSIPPI 7 1054 0.7%
25 MISSOURI 81 2469 3.3%
26 MONTANA 0 832 0.0%
27 NEBRASKA 0 1098 0.0%
28 NEVADA 99 769 12.9%
29 NEW HAMPSHIRE 39 498 7.8%
30 NEW JERSEY 87 2582 3.4%
31 NEW MEXICO 102 894 11.4%
32 NEW YORK 331 4842 6.8%
33 NORTH CAROLINA 223 2765 8.1%
34 NORTH DAKOTA 0 536 0.0%
35 OHIO 325 3689 8.8%
36 OKLAHOMA 72 1803 4.0%
37 OREGON 132 1290 10.2%
38 PENNSYLVANIA 184 2963 6.2%
39 RHODE ISLAND 41 319 12.9%
40 SOUTH CAROLINA 93 1288 7.2%
41 SOUTH DAKOTA 0 732 0.0%
42 TENNESSEE 121 1931 6.3%
43 TEXAS 1056 9652 10.9%
44 UTAH 141 1123 12.6%
45 VERMONT 0 305 0.0%
46 VIRGINIA 7 2136 0.3%
47 WASHINGTON 17 2561 0.7%
48 WEST VIRGINIA 3 700 0.4%
49 WISCONSIN 243 2263 10.7%
50 WYOMING 5 366 1.4%
51 BUREAU OF INDIAN EDUCATION 0 174 0.0%
52 AMERICAN SAMOA 0 29 0.0%
53 GUAM 3 44 6.8%
54 NORTHERN MARIANAS 0 35 0.0%
55 PUERTO RICO 7 853 0.8%
56 U.S. VIRGIN ISLANDS 0 23 0.0%

Either way, we can now see the answer to our initial question. However, at this point a new question has come to mind: Which states have the highest percentage of charter schools?

To answer that, we can simply sort the results:

query = """
  select * 
  from charters 
  order by percent_of_schools desc
"""
duckdb.sql(query).df()
statename num_charter_schools num_schools percent_of_schools
0 OHIO 325 3689 8.8%
1 ARKANSAS 93 1099 8.5%
2 NORTH CAROLINA 223 2765 8.1%
3 NEW HAMPSHIRE 39 498 7.8%
4 SOUTH CAROLINA 93 1288 7.2%
5 NEW YORK 331 4842 6.8%
6 GUAM 3 44 6.8%
7 INDIANA 121 1933 6.3%
8 TENNESSEE 121 1931 6.3%
9 PENNSYLVANIA 184 2963 6.2%
10 ALASKA 31 509 6.1%
11 DISTRICT OF COLUMBIA 125 243 51.4%
12 MASSACHUSETTS 78 1859 4.2%
13 GEORGIA 95 2327 4.1%
14 OKLAHOMA 72 1803 4.0%
15 MARYLAND 49 1424 3.4%
16 NEW JERSEY 87 2582 3.4%
17 MISSOURI 81 2469 3.3%
18 ILLINOIS 138 4409 3.1%
19 ARIZONA 606 2515 24.1%
20 MAINE 13 602 2.2%
21 CONNECTICUT 21 1009 2.1%
22 FLORIDA 767 4306 17.8%
23 COLORADO 268 1960 13.7%
24 NEVADA 99 769 12.9%
25 RHODE ISLAND 41 319 12.9%
26 CALIFORNIA 1337 10456 12.8%
27 UTAH 141 1123 12.6%
28 HAWAII 37 296 12.5%
29 NEW MEXICO 102 894 11.4%
30 MINNESOTA 306 2773 11.0%
31 TEXAS 1056 9652 10.9%
32 LOUISIANA 150 1383 10.8%
33 MICHIGAN 386 3570 10.8%
34 WISCONSIN 243 2263 10.7%
35 OREGON 132 1290 10.2%
36 IDAHO 81 801 10.1%
37 DELAWARE 23 231 10.0%
38 WYOMING 5 366 1.4%
39 ALABAMA 13 1566 0.8%
40 PUERTO RICO 7 853 0.8%
41 KANSAS 9 1360 0.7%
42 WASHINGTON 17 2561 0.7%
43 MISSISSIPPI 7 1054 0.7%
44 WEST VIRGINIA 3 700 0.4%
45 VIRGINIA 7 2136 0.3%
46 IOWA 2 1333 0.2%
47 NORTH DAKOTA 0 536 0.0%
48 SOUTH DAKOTA 0 732 0.0%
49 NEBRASKA 0 1098 0.0%
50 MONTANA 0 832 0.0%
51 VERMONT 0 305 0.0%
52 KENTUCKY 0 1548 0.0%
53 BUREAU OF INDIAN EDUCATION 0 174 0.0%
54 AMERICAN SAMOA 0 29 0.0%
55 NORTHERN MARIANAS 0 35 0.0%
56 U.S. VIRGIN ISLANDS 0 23 0.0%

Oh no!!!

Our conversion of the percentage into a string has broken our ability to sort the values. This is why it is so vital to understand data types: strings are sorted alphabetically (character by character) while numbers are sorted numerically. In this context, when sorting in descending order, the string "8.8%" comes before the string "51.4%" because the character "8" comes after "5". If the values were still numbers, 51.4 would sort first as expected.
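This behavior is easy to reproduce with plain Python, which also sorts strings character by character:

```python
# Percentages formatted as strings sort character by character,
# so "8.8%" outranks "51.4%" when sorting in descending order.
as_strings = ["51.4%", "8.8%", "12.9%"]
print(sorted(as_strings, reverse=True))  # ['8.8%', '51.4%', '12.9%']

# Sorting the underlying numbers gives the order we actually want.
as_numbers = [float(s.rstrip("%")) for s in as_strings]
print(sorted(as_numbers, reverse=True))  # [51.4, 12.9, 8.8]
```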

To resolve this we can preserve the original decimal column for sorting rather than replace it entirely.

query = """
  with school_counts as (
    select 
      statename,
      count(if(charter_text = 'Yes', sch_name, null)) as num_charter_schools,
      count(sch_name) as num_schools,
    from directory
    group by statename
  ),

  percents as (
      select
        statename,
        num_charter_schools,
        num_schools,
        round((num_charter_schools/num_schools) * 100, 1) as percent_of_schools_numeric
      from school_counts
  )

  select
    statename,
    num_charter_schools,
    num_schools,
    concat(cast(percent_of_schools_numeric as string),'%') as percent_of_schools
  from percents
  order by percent_of_schools_numeric desc
"""

duckdb.sql(query).df()
statename num_charter_schools num_schools percent_of_schools
0 DISTRICT OF COLUMBIA 125 243 51.4%
1 ARIZONA 606 2515 24.1%
2 FLORIDA 767 4306 17.8%
3 COLORADO 268 1960 13.7%
4 NEVADA 99 769 12.9%
5 RHODE ISLAND 41 319 12.9%
6 CALIFORNIA 1337 10456 12.8%
7 UTAH 141 1123 12.6%
8 HAWAII 37 296 12.5%
9 NEW MEXICO 102 894 11.4%
10 MINNESOTA 306 2773 11.0%
11 TEXAS 1056 9652 10.9%
12 LOUISIANA 150 1383 10.8%
13 MICHIGAN 386 3570 10.8%
14 WISCONSIN 243 2263 10.7%
15 OREGON 132 1290 10.2%
16 IDAHO 81 801 10.1%
17 DELAWARE 23 231 10.0%
18 OHIO 325 3689 8.8%
19 ARKANSAS 93 1099 8.5%
20 NORTH CAROLINA 223 2765 8.1%
21 NEW HAMPSHIRE 39 498 7.8%
22 SOUTH CAROLINA 93 1288 7.2%
23 NEW YORK 331 4842 6.8%
24 GUAM 3 44 6.8%
25 INDIANA 121 1933 6.3%
26 TENNESSEE 121 1931 6.3%
27 PENNSYLVANIA 184 2963 6.2%
28 ALASKA 31 509 6.1%
29 MASSACHUSETTS 78 1859 4.2%
30 GEORGIA 95 2327 4.1%
31 OKLAHOMA 72 1803 4.0%
32 MARYLAND 49 1424 3.4%
33 NEW JERSEY 87 2582 3.4%
34 MISSOURI 81 2469 3.3%
35 ILLINOIS 138 4409 3.1%
36 MAINE 13 602 2.2%
37 CONNECTICUT 21 1009 2.1%
38 WYOMING 5 366 1.4%
39 ALABAMA 13 1566 0.8%
40 PUERTO RICO 7 853 0.8%
41 KANSAS 9 1360 0.7%
42 MISSISSIPPI 7 1054 0.7%
43 WASHINGTON 17 2561 0.7%
44 WEST VIRGINIA 3 700 0.4%
45 VIRGINIA 7 2136 0.3%
46 IOWA 2 1333 0.2%
47 KENTUCKY 0 1548 0.0%
48 MONTANA 0 832 0.0%
49 NEBRASKA 0 1098 0.0%
50 NORTH DAKOTA 0 536 0.0%
51 SOUTH DAKOTA 0 732 0.0%
52 VERMONT 0 305 0.0%
53 BUREAU OF INDIAN EDUCATION 0 174 0.0%
54 AMERICAN SAMOA 0 29 0.0%
55 NORTHERN MARIANAS 0 35 0.0%
56 U.S. VIRGIN ISLANDS 0 23 0.0%

Now we can see that Washington DC has the highest percent of charter schools by a large margin. That is the legacy of Michelle Rhee’s impact on DC schools after passage of the D.C. Public Education Reform Amendment Act in 2007.

The Devil is in the details

While DC might have the largest percent of charter schools, that percentage is out of a small number of schools to begin with, given the size of Washington DC. If we sort by count instead, we can see that California and Texas have the largest total numbers of charter schools, but because those states have so many schools overall, those totals represent a smaller percentage.

Washington DC is not even in the top 10.

query = """
  select * 
  from charters 
  order by num_charter_schools desc 
  limit 10
"""
duckdb.sql(query).df()
statename num_charter_schools num_schools percent_of_schools
0 CALIFORNIA 1337 10456 12.8%
1 TEXAS 1056 9652 10.9%
2 FLORIDA 767 4306 17.8%
3 ARIZONA 606 2515 24.1%
4 MICHIGAN 386 3570 10.8%
5 NEW YORK 331 4842 6.8%
6 OHIO 325 3689 8.8%
7 MINNESOTA 306 2773 11.0%
8 COLORADO 268 1960 13.7%
9 WISCONSIN 243 2263 10.7%

This raises the question: Which states have the most students in charter schools?

Not all school sizes are equal: there could be a large number of schools with smaller enrollments or fewer schools with larger enrollments. Unfortunately, our initial dataset doesn’t have enrollment information, so to determine this we need new data. Going back to NCES, we can pull the school membership data as well. This is a much larger dataset (198 MB compressed) than the directory info, but it doesn’t have as many of the descriptors as our original dataset (such as the charter flag).

To answer this final query, we’ll need to combine the data in these two datasets somehow.

This is where SQL really shines. The sql join is one of the fundamental features of the language that makes it so powerful for data analysis. SQL joins are like puzzle pieces that help you combine information from different tables in a database. They’re useful when you want to see related data together, such as matching customers with their orders or students with their courses. Instead of keeping all the information in one big table, you can split it into smaller ones and then use joins to bring the pieces together, making it easier to understand and work with your data.

enrollment = pd.read_csv("ccd_SCH_052_2122_l_1a_071722.csv", dtype=str)
enrollment.columns = [c.lower() for c in enrollment.columns]
enrollment.head()
school_year fipst statename st sch_name state_agency_no union st_leaid leaid st_schid ncessch schid grade race_ethnicity sex student_count total_indicator dms_flag
0 2021-2022 01 ALABAMA AL Albertville Middle School 01 NaN AL-101 0100005 AL-101-0010 010000500870 0100870 Grade 7 American Indian or Alaska Native Female 1 Category Set A - By Race/Ethnicity; Sex; Grade Reported
1 2021-2022 01 ALABAMA AL Albertville Middle School 01 NaN AL-101 0100005 AL-101-0010 010000500870 0100870 Grade 7 American Indian or Alaska Native Male 3 Category Set A - By Race/Ethnicity; Sex; Grade Reported
2 2021-2022 01 ALABAMA AL Albertville Middle School 01 NaN AL-101 0100005 AL-101-0010 010000500870 0100870 Grade 7 Asian Female 2 Category Set A - By Race/Ethnicity; Sex; Grade Reported
3 2021-2022 01 ALABAMA AL Albertville Middle School 01 NaN AL-101 0100005 AL-101-0010 010000500870 0100870 Grade 7 Asian Male 2 Category Set A - By Race/Ethnicity; Sex; Grade Reported
4 2021-2022 01 ALABAMA AL Albertville Middle School 01 NaN AL-101 0100005 AL-101-0010 010000500870 0100870 Grade 7 Black or African American Female 7 Category Set A - By Race/Ethnicity; Sex; Grade Reported

Exploring Enrollment

My goal here is to simply pull out the school-wide total enrollment for each school, but a quick scan of that dataframe preview indicates this data is pretty granular. It is broken out into grade levels, each student group for race/ethnicity, and even sex. But I do see there is a column called total_indicator that might be useful. I’d like to get a sense of what the values of that column are so I can determine if it could be used to filter the data somehow. To do that I will query the distinct values in that column.

pd.set_option('display.max_colwidth', None) # make sure pandas doesn't truncate the text in the column

query = """
select distinct
  total_indicator
from enrollment
"""

duckdb.sql(query).df()
total_indicator
0 Category Set A - By Race/Ethnicity; Sex; Grade
1 Derived - Education Unit Total minus Adult Education Count
2 Derived - Subtotal by Race/Ethnicity and Sex minus Adult Education Count
3 Education Unit Total
4 Subtotal 4 - By Grade

We can see that there are two values that might work here:

- Education Unit Total
- Derived - Education Unit Total minus Adult Education Count

We can use the first since it is more straightforward and includes the full number of students (including adult ed).

Before we do that, let’s pull out the relevant data from the dataset that we want to join to our directory data.

query = """
  select 
    school_year,
    statename,
    st,
    sch_name,
    st_schid,
    ncessch,
    student_count,
  from enrollment
  where total_indicator = 'Education Unit Total'
"""

total_enrollment = duckdb.sql(query).df()
total_enrollment
school_year statename st sch_name st_schid ncessch student_count
0 2021-2022 ALABAMA AL Albertville Middle School AL-101-0010 010000500870 920
1 2021-2022 ALABAMA AL Albertville High School AL-101-0020 010000500871 1665
2 2021-2022 ALABAMA AL Albertville Intermediate School AL-101-0110 010000500879 924
3 2021-2022 ALABAMA AL Albertville Elementary School AL-101-0200 010000500889 891
4 2021-2022 ALABAMA AL Albertville Kindergarten and PreK AL-101-0035 010000501616 579
... ... ... ... ... ... ... ...
100024 2021-2022 U.S. VIRGIN ISLANDS VI LOCKHART ELEMENTARY SCHOOL VI-001-15 780003000024 284
100025 2021-2022 U.S. VIRGIN ISLANDS VI ULLA F MULLER ELEMENTARY SCHOOL VI-001-17 780003000026 437
100026 2021-2022 U.S. VIRGIN ISLANDS VI YVONNE BOWSKY ELEMENTARY SCHOOL VI-001-23 780003000027 381
100027 2021-2022 U.S. VIRGIN ISLANDS VI CANCRYN JUNIOR HIGH SCHOOL VI-001-25 780003000033 744
100028 2021-2022 U.S. VIRGIN ISLANDS VI BERTHA BOSCHULTE JUNIOR HIGH VI-001-9 780003000034 515

100029 rows × 7 columns

I’m going to include a few different fields that may be necessary for joining, such as the school_year, statename, and a couple of different identifiers, since we’re not yet 100% sure which one is unique enough to join on.

We can do the same to the directory data and just keep the relevant fields for our analysis.

query = """
  select
    school_year,
    statename,
    st,
    sch_name,
    st_schid,
    ncessch,
    charter_text
  from directory
"""
directory_info = duckdb.sql(query).df()
directory_info
school_year statename st sch_name st_schid ncessch charter_text
0 2021-2022 ALABAMA AL Albertville Middle School AL-101-0010 010000500870 No
1 2021-2022 ALABAMA AL Albertville High School AL-101-0020 010000500871 No
2 2021-2022 ALABAMA AL Albertville Intermediate School AL-101-0110 010000500879 No
3 2021-2022 ALABAMA AL Albertville Elementary School AL-101-0200 010000500889 No
4 2021-2022 ALABAMA AL Albertville Kindergarten and PreK AL-101-0035 010000501616 No
... ... ... ... ... ... ... ...
102125 2021-2022 U.S. VIRGIN ISLANDS VI LOCKHART ELEMENTARY SCHOOL VI-001-15 780003000024 Not applicable
102126 2021-2022 U.S. VIRGIN ISLANDS VI ULLA F MULLER ELEMENTARY SCHOOL VI-001-17 780003000026 Not applicable
102127 2021-2022 U.S. VIRGIN ISLANDS VI YVONNE BOWSKY ELEMENTARY SCHOOL VI-001-23 780003000027 Not applicable
102128 2021-2022 U.S. VIRGIN ISLANDS VI CANCRYN JUNIOR HIGH SCHOOL VI-001-25 780003000033 Not applicable
102129 2021-2022 U.S. VIRGIN ISLANDS VI BERTHA BOSCHULTE JUNIOR HIGH VI-001-9 780003000034 Not applicable

102130 rows × 7 columns

One thing to note is that the directory dataset has 102,130 rows and the enrollment dataset has 100,029 rows, a difference of 2,101. When we join them, the size of the result is a useful sanity check: an inner join should return at most 100,029 rows (the size of the smaller dataset), while a left join from the directory should return at least its full 102,130 rows.

For more info on the types of joins check out this great tutorial from dataschool.

query = """
  select
    e.school_year,
    e.statename,
    e.st,
    e.sch_name,
    e.st_schid,
    e.ncessch,
    cast(e.student_count as int) as student_count,
    d.charter_text
  from directory_info d
  inner join total_enrollment e
    on d.school_year = e.school_year
    and d.st = e.st
    and d.ncessch = e.ncessch
"""
enrollments_with_info = duckdb.sql(query).df()
enrollments_with_info
school_year statename st sch_name st_schid ncessch student_count charter_text
0 2021-2022 ALABAMA AL Albertville High School AL-101-0020 010000500871 1665.0 No
1 2021-2022 ALABAMA AL Albertville Intermediate School AL-101-0110 010000500879 924.0 No
2 2021-2022 ALABAMA AL Albertville Elementary School AL-101-0200 010000500889 891.0 No
3 2021-2022 ALABAMA AL Albertville Kindergarten and PreK AL-101-0035 010000501616 579.0 No
4 2021-2022 ALABAMA AL Albertville Primary School AL-101-0005 010000502150 977.0 No
... ... ... ... ... ... ... ... ...
100024 2021-2022 PUERTO RICO PR ALEJANDRO JR CRUZ (GALATEO PARCELAS) PR-01-71449 720003002078 129.0 No
100025 2021-2022 PUERTO RICO PR BERWIND SUPERIOR PR-01-67942 720003001333 228.0 No
100026 2021-2022 PUERTO RICO PR DR. ANTONIO S. PEDREIRA (ESPECIALIZADA) PR-01-61333 720003001370 299.0 No
100027 2021-2022 PUERTO RICO PR TRINA PADILLA DE SANZ PR-01-61440 720003001379 425.0 No
100028 2021-2022 PUERTO RICO PR JUDITH A VIVAS PR-01-16220 720003001971 247.0 No

100029 rows × 8 columns

The inner join is something like the intersection of a Venn diagram: it returns only the records (based on a matching condition or set of conditions) that appear in both datasets. A left join, by contrast, returns all the records in the first dataset, with null values for any records that don’t have a match in the joined set; this can be very useful when trying to determine which records are missing. For our purposes, we only care about schools with actual enrollment (since we can’t calculate percentages from non-existent student counts).

  from directory_info d
  inner join total_enrollment e
    on d.school_year = e.school_year
    and d.st = e.st
    and d.ncessch = e.ncessch

Notice the use of table aliases (d and e respectively) to help distinguish between columns in the two datasets. Another way to write that join without the aliases would be:

  from directory_info
  inner join total_enrollment
    on directory_info.school_year = total_enrollment.school_year
    and directory_info.st = total_enrollment.st
    and directory_info.ncessch = total_enrollment.ncessch

For smaller table names this can be fine, but it does get tedious over time and sensible aliases can make the syntax of queries easier to read.

Table aliases and explicit table references are technically only required when the columns share names across both datasets (since the SQL interpreter can’t distinguish which one you mean) but it is generally considered a best practice when using joins to always specify the table source (preferably with an alias) in both the join conditions as well as in the select statement.

Because both of our original datasets are constrained to a single year, we could probably remove some of the join conditions.

  from directory_info d
  inner join total_enrollment e
    on d.st = e.st
    and d.ncessch = e.ncessch

And if we are certain the NCES IDs are unique across states (they are), we could further simplify as:

  from directory_info d
  inner join total_enrollment e
    on d.ncessch = e.ncessch

Calculating % of students in charters

Now that we’ve combined our datasets, we can write our aggregations similar to what we did for schools earlier. We could do this using CTEs, or, since we’ve stored the joined data in an output variable (as a dataframe), we can query it directly.

query = """
  select
    school_year,
    statename,
    count(if(charter_text = 'Yes', sch_name, null)) as charter_schools,
    count(sch_name) as total_schools,
    round((count(if(charter_text = 'Yes', sch_name, null))/count(sch_name)) * 100, 1) as percent_charter_schools,
    cast(sum(if(charter_text = 'Yes', student_count, 0)) as int) as charter_students,
    cast(sum(student_count) as int) as total_students,
    round((sum(if(charter_text = 'Yes', student_count, 0))/sum(student_count)) * 100, 1) as percent_charter_students,
  from enrollments_with_info 
  group by 
    school_year,
    statename
  having sum(student_count) is not null
    and sum(if(charter_text = 'Yes', student_count, 0)) > 0
  order by percent_charter_students desc
"""
charter_enrollments = duckdb.sql(query).df()
charter_enrollments
school_year statename charter_schools total_schools percent_charter_schools charter_students total_students percent_charter_students
0 2021-2022 DISTRICT OF COLUMBIA 123 240 51.2 39476 88543 44.6
1 2021-2022 ARIZONA 581 2418 24.0 231195 1131888 20.4
2 2021-2022 COLORADO 265 1941 13.7 130279 880582 14.8
3 2021-2022 NEVADA 93 746 12.5 63944 488251 13.1
4 2021-2022 FLORIDA 707 4191 16.9 361634 2832516 12.8
5 2021-2022 LOUISIANA 146 1366 10.7 86301 680793 12.7
6 2021-2022 DELAWARE 23 229 10.0 17201 139935 12.3
7 2021-2022 CALIFORNIA 1291 10167 12.7 678056 5874619 11.5
8 2021-2022 UTAH 137 1105 12.4 77733 687107 11.3
9 2021-2022 MICHIGAN 374 3538 10.6 150327 1396598 10.8
10 2021-2022 PENNSYLVANIA 179 2941 6.1 163372 1671899 9.8
11 2021-2022 NEW MEXICO 99 890 11.1 30160 316785 9.5
12 2021-2022 IDAHO 77 784 9.8 28051 313909 8.9
13 2021-2022 ARKANSAS 93 1084 8.6 42341 489565 8.6
14 2021-2022 NORTH CAROLINA 204 2719 7.5 131624 1525223 8.6
15 2021-2022 OKLAHOMA 70 1792 3.9 59753 696411 8.6
16 2021-2022 RHODE ISLAND 40 315 12.7 11387 136864 8.3
17 2021-2022 TEXAS 1010 9105 11.1 443548 5428609 8.2
18 2021-2022 OREGON 131 1285 10.2 42668 540687 7.9
19 2021-2022 MINNESOTA 278 2661 10.4 66595 868742 7.7
20 2021-2022 HAWAII 37 294 12.6 12114 173178 7.0
21 2021-2022 NEW YORK 330 4802 6.9 173341 2526204 6.9
22 2021-2022 OHIO 324 3659 8.9 115021 1682397 6.8
23 2021-2022 SOUTH CAROLINA 88 1267 6.9 49344 780403 6.3
24 2021-2022 GUAM 3 44 6.8 1783 28402 6.3
25 2021-2022 WISCONSIN 236 2243 10.5 49715 827393 6.0
26 2021-2022 ALASKA 30 500 6.0 7621 129944 5.9
27 2021-2022 MASSACHUSETTS 78 1842 4.2 48399 911529 5.3
28 2021-2022 INDIANA 119 1915 6.2 50073 1033897 4.8
29 2021-2022 NEW JERSEY 87 2558 3.4 58780 1339937 4.4
30 2021-2022 TENNESSEE 116 1906 6.1 43908 995549 4.4
31 2021-2022 GEORGIA 94 2314 4.1 69242 1740875 4.0
32 2021-2022 ILLINOIS 136 4386 3.1 60496 1867412 3.2
33 2021-2022 NEW HAMPSHIRE 36 494 7.3 4934 168797 2.9
34 2021-2022 MARYLAND 48 1417 3.4 24104 881461 2.7
35 2021-2022 MISSOURI 78 2453 3.2 24146 887077 2.7
36 2021-2022 CONNECTICUT 21 1000 2.1 11047 496795 2.2
37 2021-2022 MAINE 13 571 2.3 2732 168236 1.6
38 2021-2022 MISSISSIPPI 7 1040 0.7 2921 442000 0.7
39 2021-2022 WYOMING 5 359 1.4 642 92848 0.7
40 2021-2022 KANSAS 9 1355 0.7 2877 480069 0.6
41 2021-2022 PUERTO RICO 7 843 0.8 1515 259535 0.6
42 2021-2022 ALABAMA 11 1494 0.7 3566 747846 0.5
43 2021-2022 WASHINGTON 17 2546 0.7 4571 1080143 0.4
44 2021-2022 VIRGINIA 7 1948 0.4 1278 1244624 0.1
45 2021-2022 IOWA 2 1325 0.2 149 503735 0.0

You’ll notice syntax similar to our previous aggregate query, but here I have omitted the step that converts the percent to a string, to make further analysis easier.

Wrapping up with some rankings

Now that we have the data we need for our investigation, we can sort and re-sort the data to find answers to our inquiries. However, one fundamental issue keeps coming up: the answer to our question is "it sort of depends" …

Which is very common in data analysis. Depending on how you phrase your question you might get different answers. Wouldn’t it be nice to have a definitive answer about which state, broadly, ranks highest for charter presence?

To do this, let’s explore a slightly more advanced topic in SQL queries: window functions.

I won’t delve into them too deeply, but a basic understanding is helpful here:

Imagine you’re looking at a row of data in a table, like a line of people. A window function in SQL helps you calculate things for each person in the line, like finding their position compared to others, without changing the order. It’s like having a magic window that lets you peek at everyone’s info while keeping them in the same order they were in. This helps you figure out things like who’s first, who’s last, and who’s in the middle without messing up the original line.

In this case, we’d like to understand each state’s ranking for each calculation we created earlier, i.e., which state is #1 for the most charter schools, which is #1 for the most students in charter schools, and so on. For that we will use the rank() window function.

All window functions require an accompanying over() clause. These clauses can get complicated, but for our purposes we are just using them to specify the order to rank by.

query = """
  select *,
    rank() over(order by percent_charter_schools desc) as percent_schools_rank,
    rank() over(order by charter_schools desc) as count_schools_rank,
    rank() over(order by percent_charter_students desc) as percent_students_rank,
    rank() over(order by charter_students desc) as count_students_rank,
  from charter_enrollments
  order by statename
"""

rankings = duckdb.sql(query).df()
rankings
school_year statename charter_schools total_schools percent_charter_schools charter_students total_students percent_charter_students percent_schools_rank count_schools_rank percent_students_rank count_students_rank
0 2021-2022 ALABAMA 11 1494 0.7 3566 747846 0.5 41 39 43 38
1 2021-2022 ALASKA 30 500 6.0 7621 129944 5.9 29 34 27 35
2 2021-2022 ARIZONA 581 2418 24.0 231195 1131888 20.4 2 4 2 4
3 2021-2022 ARKANSAS 93 1084 8.6 42341 489565 8.6 20 22 14 25
4 2021-2022 CALIFORNIA 1291 10167 12.7 678056 5874619 11.5 5 1 8 1
5 2021-2022 COLORADO 265 1941 13.7 130279 880582 14.8 4 9 3 9
6 2021-2022 CONNECTICUT 21 1000 2.1 11047 496795 2.2 38 36 37 34
7 2021-2022 DELAWARE 23 229 10.0 17201 139935 12.3 17 35 7 31
8 2021-2022 DISTRICT OF COLUMBIA 123 240 51.2 39476 88543 44.6 1 17 1 26
9 2021-2022 FLORIDA 707 4191 16.9 361634 2832516 12.8 3 3 5 3
10 2021-2022 GEORGIA 94 2314 4.1 69242 1740875 4.0 31 21 32 13
11 2021-2022 GUAM 3 44 6.8 1783 28402 6.3 25 45 24 42
12 2021-2022 HAWAII 37 294 12.6 12114 173178 7.0 7 32 21 32
13 2021-2022 IDAHO 77 784 9.8 28051 313909 8.9 18 28 13 28
14 2021-2022 ILLINOIS 136 4386 3.1 60496 1867412 3.2 36 15 33 16
15 2021-2022 INDIANA 119 1915 6.2 50073 1033897 4.8 26 18 29 19
16 2021-2022 IOWA 2 1325 0.2 149 503735 0.0 46 46 46 46
17 2021-2022 KANSAS 9 1355 0.7 2877 480069 0.6 41 40 41 40
18 2021-2022 LOUISIANA 146 1366 10.7 86301 680793 12.7 12 13 6 11
19 2021-2022 MAINE 13 571 2.3 2732 168236 1.6 37 38 38 41
20 2021-2022 MARYLAND 48 1417 3.4 24104 881461 2.7 33 30 35 30
21 2021-2022 MASSACHUSETTS 78 1842 4.2 48399 911529 5.3 30 26 28 22
22 2021-2022 MICHIGAN 374 3538 10.6 150327 1396598 10.8 13 5 10 7
23 2021-2022 MINNESOTA 278 2661 10.4 66595 868742 7.7 15 8 20 14
24 2021-2022 MISSISSIPPI 7 1040 0.7 2921 442000 0.7 41 41 39 39
25 2021-2022 MISSOURI 78 2453 3.2 24146 887077 2.7 35 26 35 29
26 2021-2022 NEVADA 93 746 12.5 63944 488251 13.1 8 22 4 15
27 2021-2022 NEW HAMPSHIRE 36 494 7.3 4934 168797 2.9 22 33 34 36
28 2021-2022 NEW JERSEY 87 2558 3.4 58780 1339937 4.4 33 25 30 18
29 2021-2022 NEW MEXICO 99 890 11.1 30160 316785 9.5 10 20 12 27
30 2021-2022 NEW YORK 330 4802 6.9 173341 2526204 6.9 23 6 22 5
31 2021-2022 NORTH CAROLINA 204 2719 7.5 131624 1525223 8.6 21 11 14 8
32 2021-2022 OHIO 324 3659 8.9 115021 1682397 6.8 19 7 23 10
33 2021-2022 OKLAHOMA 70 1792 3.9 59753 696411 8.6 32 29 14 17
34 2021-2022 OREGON 131 1285 10.2 42668 540687 7.9 16 16 19 24
35 2021-2022 PENNSYLVANIA 179 2941 6.1 163372 1671899 9.8 27 12 11 6
36 2021-2022 PUERTO RICO 7 843 0.8 1515 259535 0.6 40 41 41 43
37 2021-2022 RHODE ISLAND 40 315 12.7 11387 136864 8.3 5 31 17 33
38 2021-2022 SOUTH CAROLINA 88 1267 6.9 49344 780403 6.3 23 24 24 21
39 2021-2022 TENNESSEE 116 1906 6.1 43908 995549 4.4 27 19 30 23
40 2021-2022 TEXAS 1010 9105 11.1 443548 5428609 8.2 10 2 18 2
41 2021-2022 UTAH 137 1105 12.4 77733 687107 11.3 9 14 9 12
42 2021-2022 VIRGINIA 7 1948 0.4 1278 1244624 0.1 45 41 45 44
43 2021-2022 WASHINGTON 17 2546 0.7 4571 1080143 0.4 41 37 44 37
44 2021-2022 WISCONSIN 236 2243 10.5 49715 827393 6.0 14 10 26 20
45 2021-2022 WYOMING 5 359 1.4 642 92848 0.7 39 44 39 45

Combining Rankings

These rankings are useful and can certainly make it easier to sort, but they haven’t given us a more definitive answer. To get one, let’s combine the rankings across metrics and find which state has the lowest combined value (i.e., closest to #1 across all metrics).

We’ll once again use the rank function, this time over the sum of the other ranks.

query = """
  select *,
    rank() over(order by (percent_schools_rank + count_schools_rank + percent_students_rank + count_students_rank)) as combined_ranking
  from rankings
  order by combined_ranking
"""
duckdb.sql(query).df()
school_year statename charter_schools total_schools percent_charter_schools charter_students total_students percent_charter_students percent_schools_rank count_schools_rank percent_students_rank count_students_rank combined_ranking
0 2021-2022 ARIZONA 581 2418 24.0 231195 1131888 20.4 2 4 2 4 1
1 2021-2022 FLORIDA 707 4191 16.9 361634 2832516 12.8 3 3 5 3 2
2 2021-2022 CALIFORNIA 1291 10167 12.7 678056 5874619 11.5 5 1 8 1 3
3 2021-2022 COLORADO 265 1941 13.7 130279 880582 14.8 4 9 3 9 4
4 2021-2022 TEXAS 1010 9105 11.1 443548 5428609 8.2 10 2 18 2 5
5 2021-2022 MICHIGAN 374 3538 10.6 150327 1396598 10.8 13 5 10 7 6
6 2021-2022 LOUISIANA 146 1366 10.7 86301 680793 12.7 12 13 6 11 7
7 2021-2022 UTAH 137 1105 12.4 77733 687107 11.3 9 14 9 12 8
8 2021-2022 DISTRICT OF COLUMBIA 123 240 51.2 39476 88543 44.6 1 17 1 26 9
9 2021-2022 NEVADA 93 746 12.5 63944 488251 13.1 8 22 4 15 10
10 2021-2022 NORTH CAROLINA 204 2719 7.5 131624 1525223 8.6 21 11 14 8 11
11 2021-2022 NEW YORK 330 4802 6.9 173341 2526204 6.9 23 6 22 5 12
12 2021-2022 PENNSYLVANIA 179 2941 6.1 163372 1671899 9.8 27 12 11 6 12
13 2021-2022 MINNESOTA 278 2661 10.4 66595 868742 7.7 15 8 20 14 14
14 2021-2022 OHIO 324 3659 8.9 115021 1682397 6.8 19 7 23 10 15
15 2021-2022 NEW MEXICO 99 890 11.1 30160 316785 9.5 10 20 12 27 16
16 2021-2022 WISCONSIN 236 2243 10.5 49715 827393 6.0 14 10 26 20 17
17 2021-2022 OREGON 131 1285 10.2 42668 540687 7.9 16 16 19 24 18
18 2021-2022 ARKANSAS 93 1084 8.6 42341 489565 8.6 20 22 14 25 19
19 2021-2022 RHODE ISLAND 40 315 12.7 11387 136864 8.3 5 31 17 33 20
20 2021-2022 IDAHO 77 784 9.8 28051 313909 8.9 18 28 13 28 21
21 2021-2022 DELAWARE 23 229 10.0 17201 139935 12.3 17 35 7 31 22
22 2021-2022 HAWAII 37 294 12.6 12114 173178 7.0 7 32 21 32 23
23 2021-2022 INDIANA 119 1915 6.2 50073 1033897 4.8 26 18 29 19 23
24 2021-2022 OKLAHOMA 70 1792 3.9 59753 696411 8.6 32 29 14 17 23
25 2021-2022 SOUTH CAROLINA 88 1267 6.9 49344 780403 6.3 23 24 24 21 23
26 2021-2022 GEORGIA 94 2314 4.1 69242 1740875 4.0 31 21 32 13 27
27 2021-2022 TENNESSEE 116 1906 6.1 43908 995549 4.4 27 19 30 23 28
28 2021-2022 ILLINOIS 136 4386 3.1 60496 1867412 3.2 36 15 33 16 29
29 2021-2022 MASSACHUSETTS 78 1842 4.2 48399 911529 5.3 30 26 28 22 30
30 2021-2022 NEW JERSEY 87 2558 3.4 58780 1339937 4.4 33 25 30 18 30
31 2021-2022 ALASKA 30 500 6.0 7621 129944 5.9 29 34 27 35 32
32 2021-2022 MISSOURI 78 2453 3.2 24146 887077 2.7 35 26 35 29 32
33 2021-2022 NEW HAMPSHIRE 36 494 7.3 4934 168797 2.9 22 33 34 36 32
34 2021-2022 MARYLAND 48 1417 3.4 24104 881461 2.7 33 30 35 30 35
35 2021-2022 GUAM 3 44 6.8 1783 28402 6.3 25 45 24 42 36
36 2021-2022 CONNECTICUT 21 1000 2.1 11047 496795 2.2 38 36 37 34 37
37 2021-2022 MAINE 13 571 2.3 2732 168236 1.6 37 38 38 41 38
38 2021-2022 WASHINGTON 17 2546 0.7 4571 1080143 0.4 41 37 44 37 39
39 2021-2022 MISSISSIPPI 7 1040 0.7 2921 442000 0.7 41 41 39 39 40
40 2021-2022 ALABAMA 11 1494 0.7 3566 747846 0.5 41 39 43 38 41
41 2021-2022 KANSAS 9 1355 0.7 2877 480069 0.6 41 40 41 40 42
42 2021-2022 PUERTO RICO 7 843 0.8 1515 259535 0.6 40 41 41 43 43
43 2021-2022 WYOMING 5 359 1.4 642 92848 0.7 39 44 39 45 44
44 2021-2022 VIRGINIA 7 1948 0.4 1278 1244624 0.1 45 41 45 44 45
45 2021-2022 IOWA 2 1325 0.2 149 503735 0.0 46 46 46 46 46

Conclusion

We can see that Arizona ranks highest for charter presence, followed by Florida. A quite different story than we were getting previously with our one-off questions. Fascinating!

There is much more you can do as you expand your knowledge of SQL, including taking this data and visualizing it for presentation to an audience. That’s a topic for another day, but let’s store our results in a new dataset.

query = """
  select *,
    rank() over(order by (percent_schools_rank + count_schools_rank + percent_students_rank + count_students_rank)) as combined_ranking
  from rankings
  order by combined_ranking
"""
charter_presence_rankings = duckdb.sql(query).df()
charter_presence_rankings.to_csv("charter_presence_rankings.csv.gz", compression="gzip", index=False)

You can download this data here if you like.

Summary

In this tutorial we learned:

1. some basic SQL vocabulary
2. how to read data from files into dataframes for querying
3. how to explore the structure of a new dataset
4. how to select data from a dataframe using duckdb and sql syntax
5. how to filter and limit data
6. how to translate questions into queries
7. how to do some basic aggregations like count and sum
8. how to better organize complex queries with CTEs
9. a valuable lesson about data types
10. how to join data from multiple datasets
11. how to leverage a basic window function rank to come to a more definitive answer
12. how to store the output of our analysis for future use

I hope this was helpful and gave you a taste of how to iterate on a set of questions using SQL to find an answer in a structured way (even without access to a database).