Normalize JSON with Pandas
Posted on Tue 12 January 2021 in Tutorial
When processing nested JSON data into a flat structure for importing into a relational database, it can be tricky to get the data into the right shape. Pandas has a great tool for this called pandas.json_normalize(), but the documentation doesn't make it obvious how to leverage its capabilities for handling nested data structures.
I thought I could provide a brief example using some randomly generated survey response data (courtesy of the Faker library) to illustrate its advantages.
Setup¶
To start, I'm going to be using pandas and Faker, so we'll import those. I also need to easily display the parsed JSON as well as the returned dataframes, so I'm importing the json module from the standard library along with some IPython notebook helpers for displaying dataframes as HTML tables.
import json
from IPython.display import display, HTML
from faker import Faker
from faker.providers import BaseProvider
import pandas as pd
To make it easier to call the IPython display helper, we'll include this simple function as syntactic sugar.
def print_df(df):
    # render a dataframe as an HTML table in the notebook
    display(HTML(df.to_html()))
Fake Data¶
Faker doesn't have a built-in provider for survey questions, so let's go ahead and add a simple one that creates nonsensical questions with a quick hack on top of the sentence provider.
class MyProvider(BaseProvider):
    def question(self):
        stems = ('Does', 'How does', 'Which', 'Why does')
        stem = self.random_element(elements=stems)
        sentence = self.generator.sentence()
        # lowercase the first letter and swap the period for a question mark
        sentence = sentence[0].lower() + sentence[1:]
        question = sentence.replace(".", "?")
        question = f"{stem} {question}"
        return question
Using this new question provider, we'll construct a few records of fake survey response data with some respondent-level fields like respondent_id, survey_date, and email. Within each record we'll nest a list of responses, each of which has its own dictionary of question-level data: id, question text, and choices. The choices list will only ever hold a single item here, but assume it has that structure because the API this comes from also has to account for multi-select options, so we'll need to parse it as a list regardless. Depending on our analysis needs, this might also be a place where we'd want to collapse the choices into a comma-separated string, but for our purposes here we'll ignore that use case.
fake = Faker()
fake.add_provider(MyProvider)

choices = ('Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree')

sample_data = [
    {
        "respondent_id": fake.bothify(text="#?##??###?#"),
        "survey_date": fake.date(),
        "email": fake.email(domain="example.com"),
        "responses": [
            {
                "question_id": fake.bothify(text="#??#??###?#"),
                "question_text": fake.question(),
                "choices": [
                    {
                        "choice": fake.random_choices(elements=choices, length=1)[0],
                        "number": fake.random_digit(),
                    }
                ],
            } for _ in range(5)
        ],
    } for _ in range(2)
]
Let's print out a single record to see the randomly generated data structure.
print(json.dumps(sample_data[0], indent=2))
Reading with Pandas¶
As you can see below, simply reading this directly into a dataframe only parses the top-level respondent data and leaves the responses data as a raw JSON array, which isn't great for simple analysis. Could you load that "as is" into a jsonb field in PostgreSQL? Sure. If you like parsing JSON with SQL. Yuck!
df = pd.DataFrame(sample_data)
print_df(df)
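To confirm, each cell in that responses column is still a plain Python list of dictionaries rather than anything pandas can operate on column-wise:

# the nested responses survive as raw Python objects, not columns
print(type(df.loc[0, "responses"]))  # <class 'list'>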
A messy custom parser¶
We could attempt to reshape this by writing some custom functions to extract the responses and merge that data with the top-level metadata about the respondent using some dictionary unpacking, but this gets messy and would fall apart quickly as the structure changed. The approach below works, but it's far from ideal. Sure, the code could be refactored further to simplify the logic, but it's not worth the effort since pandas.json_normalize() can do this for us easily.
def record_format(responses):
    # flatten each response's nested choices list into a single record
    data = []
    for response in responses:
        record = {
            "question_id": response.get("question_id"),
            "question_text": response.get("question_text"),
            "choice": response.get("choices")[0].get("choice"),
            "number": response.get("choices")[0].get("number"),
        }
        data.append(record)
    return data
def parse_json(records):
    # combine respondent-level metadata with each flattened response
    data = []
    for record in records:
        meta = {
            "respondent_id": record.get("respondent_id"),
            "survey_date": record.get("survey_date"),
            "email": record.get("email"),
        }
        responses = record.get("responses")
        formatted_responses = record_format(responses)
        for response in formatted_responses:
            combined = {**meta, **response}
            data.append(combined)
    return data
clean_data = parse_json(sample_data)
df = pd.DataFrame(clean_data)
print_df(df)
JSON Normalize¶
Thankfully there is the json_normalize() function, but it requires a little understanding to get it to flatten the data satisfactorily. Simply passing it the sample data without any parameters produces a familiar result that gets us no further than our first attempt.
df = pd.json_normalize(sample_data)
print_df(df)
A few optional parameters can be used here to parse the first nested array, responses. We can direct the pandas JSON parser to a specific key as the source of records. The record_path parameter takes either a string or a list of strings to construct that path. The name of this parameter is a hint about how to think of it when passed as a list, as we'll see later.
df = pd.json_normalize(
    sample_data,
    record_path="responses",
)
print_df(df)
But when we direct the parser to just unpack the responses array, we lose the data from the level above. Pandas can be instructed to keep it by giving it a list of metadata fields to repeat for each record it unpacks. We use the meta parameter and pass it a list of the fields to include.
df = pd.json_normalize(
    sample_data,
    record_path="responses",
    meta=[
        "respondent_id",
        "survey_date",
        "email",
    ],
)
print_df(df)
We must go deeper!¶
That works for the most part, but we still have that annoying choices JSON array that we'd like to split out into columns.
df = pd.json_normalize(
    sample_data,
    record_path="choices",
    meta=[
        "respondent_id",
        "survey_date",
        "email",
    ],
)
Simply passing the choices field to the record_path param results in a KeyError, though. That's because the choices field is actually nested inside the responses field, so pandas needs us to construct a path to reach it. We can get there by passing each key as an element of a list to build that path. Here that looks like ["responses", "choices"].
df = pd.json_normalize(
    sample_data,
    record_path=["responses", "choices"],
    meta=[
        "respondent_id",
        "survey_date",
        "email",
    ],
)
print_df(df)
But when we do that, we lose our question_id and question_text fields. That's because we need to add them to the meta list, passing their paths the same way we did for the record_path param. See below.
df = pd.json_normalize(
    sample_data,
    record_path=["responses", "choices"],
    meta=[
        "respondent_id",
        "survey_date",
        "email",
        ["responses", "question_id"],
        ["responses", "question_text"],
    ],
)
print_df(df)
One last tweak: some databases (like MS SQL) don't play nicely with periods in column names. As a workaround, you can give the json_normalize() function a custom separator such as an underscore.
df = pd.json_normalize(
    sample_data,
    sep="_",
    record_path=["responses", "choices"],
    meta=[
        "respondent_id",
        "survey_date",
        "email",
        ["responses", "question_id"],
        ["responses", "question_text"],
    ],
)
print_df(df)
Finally, we'll order the columns from the least nested level on the left all the way to the most nested on the right for easier readability.
column_order = [
    "respondent_id",
    "survey_date",
    "email",
    "responses_question_id",
    "responses_question_text",
    "choice",
    "number",
]
df = df[column_order]
print_df(df)
Parsed and ready to import¶
At this point our data is in a simple tabular format and ready to import into a database table with something like the pandas DataFrame.to_sql() method, but we'll save that for another post.
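As a teaser, a minimal sketch of that last step might look something like the following, assuming a local SQLite database via SQLAlchemy (the connection string and table name here are placeholders, not fixed conventions):

from sqlalchemy import create_engine

# assumed connection string; swap in your own database URL
engine = create_engine("sqlite:///surveys.db")

# write the flattened responses to a hypothetical survey_responses table
df.to_sql("survey_responses", con=engine, if_exists="replace", index=False)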