import json
from IPython.display import display, HTML
from faker import Faker
from faker.providers import BaseProvider, date_time, internet
import pandas as pd
When processing nested JSON
data into a flat structure for importing into a relational database, it can be tricky to structure the data into the right shape. Pandas has a great tool for doing this called pandas.json_normalize()
but the documentation doesn’t make it obvious how to leverage its capabilities for handling nested data structures.
I thought I could provide a brief example using some randomly generated survey response data (using the Faker library) to illustrate it’s advantages.
Setup
To start, I’m going to be using pandas and Faker so we’ll import those. I’m also going to need to easily display the parsed json as well as the returned dataframes, so I’m importing the json module from the standard lib as well as some IPython notebook helpers for displaying dataframes as HTML tables.
Fake Data
Faker doesn’t have a built in provider for survey questions, so let’s go ahead and add a simple one that creates non-sensical questions with a simple hack to the sentence provider.
class MyProvider(BaseProvider):
def question(self):
= ('Does', 'How does', 'Which', 'Why does')
stems = Faker().random_choices(elements=stems, length=1)[0]
stem = Faker().sentence()
sentence = sentence[0].lower() + sentence[1:]
sentence = sentence.replace(".", "?")
question = f"{stem} {question}"
question return question
Using this new question provider, we’ll construct a few records of fake survey response data with some respondent level data like a respondent_id, survey_date, and respondent email. Within that we’ll nest a list of responses which will in turn have it’s own dictionary of data at the question level: id, question text, and choices. The choices list will be singular here, but assume it has that structure because the API this comes from has to also account for multi-select options and we’ll need to parse it as a list regardless. Depending on our analysis needs, this might also be a place where we’d want to keep these in a comma separated string, but for our purposes here we’ll ignore that use case.
= Faker()
fake
fake.add_provider(MyProvider)= ('Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree')
choices
= [
sample_data
{"respondent_id": fake.bothify(text="#?##??###?#"),
"survey_date": fake.date(),
"email": fake.email(domain="example.com"),
"responses": [
{"question_id": fake.bothify(text="#??#??###?#"),
"question_text": fake.question(),
"choices": [
{"choice": fake.random_choices(elements=choices, length=1)[0],
"number": fake.random_digit(),
}
]for _ in range(5)
}
]for _ in range(2)
} ]
Let’s print out a single record to see the resulting data structure that has been generated randomly.
print(json.dumps(sample_data[0], indent=2))
{
"respondent_id": "1W38Sn628N9",
"survey_date": "1985-09-30",
"email": "ofrench@example.com",
"responses": [
{
"question_id": "9lQ8OH810H2",
"question_text": "How does mean action onto south usually prepare that?",
"choices": [
{
"choice": "Strongly Disagree",
"number": 9
}
]
},
{
"question_id": "8YZ3Ri208p1",
"question_text": "Does million mean tax foot statement?",
"choices": [
{
"choice": "Strongly Disagree",
"number": 4
}
]
},
{
"question_id": "2Ys2yn819r0",
"question_text": "How does decade what air scientist defense allow entire?",
"choices": [
{
"choice": "Disagree",
"number": 2
}
]
},
{
"question_id": "3ii1tu719B3",
"question_text": "Does focus statement peace forward do relate?",
"choices": [
{
"choice": "Neutral",
"number": 6
}
]
},
{
"question_id": "9xv2JX456C7",
"question_text": "Does explain ability plant?",
"choices": [
{
"choice": "Strongly Disagree",
"number": 5
}
]
}
]
}
Reading with Pandas
As you can see below, simply reading this directly into a dataframe only parses the top level respondent data, but then keeps the responses data as a json array. Which isn’t great for simple analysis. Could you load that “as is” into a jsonb field in PostgreSQL? Sure. If you like parsing json with SQL. Yuck!
= pd.DataFrame(sample_data)
df df
respondent_id | survey_date | responses | ||
---|---|---|---|---|
0 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | [{'question_id': '9lQ8OH810H2', 'question_text... |
1 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | [{'question_id': '3sG5am762C6', 'question_text... |
A messy custom parser
We could attempt to reshape this by writing some custom functions to handle extracting the responses and merging that data with the top-level meta data about the respondent with some dictionary unpacking, but this gets messy and would fall apart quickly as the structure changed. While this approach works, it’s not ideal.
Sure this code could be further refactored to simplify the logic, but it’s not worth it since the pandas.json_normalize()
can do this for us easily.
def record_format(responses):
= []
data for response in responses:
= {
record "question_id": response.get("question_id"),
"question_text": response.get("question_text"),
"choice": response.get("choices")[0].get("choice"),
"number": response.get("choices")[0].get("number"),
}
data.append(record)return data
def parse_json(records):
= []
data for record in sample_data:
= {
meta "respondent_id": record.get("respondent_id"),
"survey_date": record.get("survey_date"),
"email": record.get("email"),
}= record.get("responses")
responses = record_format(responses)
formatted_responses for response in formatted_responses:
= {**meta, **response}
combined
data.append(combined)return data
= parse_json(sample_data)
clean_data = pd.DataFrame(clean_data)
df df
respondent_id | survey_date | question_id | question_text | choice | number | ||
---|---|---|---|---|---|---|---|
0 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 9lQ8OH810H2 | How does mean action onto south usually prepar... | Strongly Disagree | 9 |
1 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 8YZ3Ri208p1 | Does million mean tax foot statement? | Strongly Disagree | 4 |
2 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 2Ys2yn819r0 | How does decade what air scientist defense all... | Disagree | 2 |
3 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 3ii1tu719B3 | Does focus statement peace forward do relate? | Neutral | 6 |
4 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 9xv2JX456C7 | Does explain ability plant? | Strongly Disagree | 5 |
5 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 3sG5am762C6 | Why does know writer ball bad whole? | Agree | 3 |
6 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 5uM6LP013p7 | Why does break chance boy enjoy call paper yet? | Strongly Disagree | 9 |
7 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 3NE9pu846z1 | Does note spring newspaper and that thing? | Neutral | 7 |
8 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 6ge7Zt058d3 | Why does message next eat the stay? | Neutral | 0 |
9 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 2EB3vo658l2 | How does no later then inside fill discover? | Agree | 4 |
JSON Normalize
Thankfully there is the json_normalize()
function, but it requires a little understanding to get it to satisfactorily parse flat. Simply passing it the sample data without any parameters results in a very familiar result that gets us no further than we started in the first attempt.
= pd.json_normalize(sample_data)
df df
respondent_id | survey_date | responses | ||
---|---|---|---|---|
0 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | [{'question_id': '9lQ8OH810H2', 'question_text... |
1 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | [{'question_id': '3sG5am762C6', 'question_text... |
A few optional parameters can be used here to parse the first nested array called responses
. We can direct the pandas json parser to a specific key as the source of records. The record_path
parameter takes either a string or list of strings to construct that path. The name of this parameter is a hint about how to think of this when passed as a list as we’ll see later.
= pd.json_normalize(
df
sample_data, ="responses",
record_path
) df
question_id | question_text | choices | |
---|---|---|---|
0 | 9lQ8OH810H2 | How does mean action onto south usually prepar... | [{'choice': 'Strongly Disagree', 'number': 9}] |
1 | 8YZ3Ri208p1 | Does million mean tax foot statement? | [{'choice': 'Strongly Disagree', 'number': 4}] |
2 | 2Ys2yn819r0 | How does decade what air scientist defense all... | [{'choice': 'Disagree', 'number': 2}] |
3 | 3ii1tu719B3 | Does focus statement peace forward do relate? | [{'choice': 'Neutral', 'number': 6}] |
4 | 9xv2JX456C7 | Does explain ability plant? | [{'choice': 'Strongly Disagree', 'number': 5}] |
5 | 3sG5am762C6 | Why does know writer ball bad whole? | [{'choice': 'Agree', 'number': 3}] |
6 | 5uM6LP013p7 | Why does break chance boy enjoy call paper yet? | [{'choice': 'Strongly Disagree', 'number': 9}] |
7 | 3NE9pu846z1 | Does note spring newspaper and that thing? | [{'choice': 'Neutral', 'number': 7}] |
8 | 6ge7Zt058d3 | Why does message next eat the stay? | [{'choice': 'Neutral', 'number': 0}] |
9 | 2EB3vo658l2 | How does no later then inside fill discover? | [{'choice': 'Agree', 'number': 4}] |
But when we direct the parser to just unpack the reponses
array, we lose our data from the level above. Pandas can be instructed to keep this by giving it a list of metadata to repeat for each record it unpacks from the level above. We use the meta
parameter and pass it a list of the fields to include.
= pd.json_normalize(
df
sample_data, ="responses",
record_path=[
meta"respondent_id",
"survey_date",
"email",
],
)
df
question_id | question_text | choices | respondent_id | survey_date | ||
---|---|---|---|---|---|---|
0 | 9lQ8OH810H2 | How does mean action onto south usually prepar... | [{'choice': 'Strongly Disagree', 'number': 9}] | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com |
1 | 8YZ3Ri208p1 | Does million mean tax foot statement? | [{'choice': 'Strongly Disagree', 'number': 4}] | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com |
2 | 2Ys2yn819r0 | How does decade what air scientist defense all... | [{'choice': 'Disagree', 'number': 2}] | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com |
3 | 3ii1tu719B3 | Does focus statement peace forward do relate? | [{'choice': 'Neutral', 'number': 6}] | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com |
4 | 9xv2JX456C7 | Does explain ability plant? | [{'choice': 'Strongly Disagree', 'number': 5}] | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com |
5 | 3sG5am762C6 | Why does know writer ball bad whole? | [{'choice': 'Agree', 'number': 3}] | 8J14yp561A1 | 1970-07-08 | halvarado@example.com |
6 | 5uM6LP013p7 | Why does break chance boy enjoy call paper yet? | [{'choice': 'Strongly Disagree', 'number': 9}] | 8J14yp561A1 | 1970-07-08 | halvarado@example.com |
7 | 3NE9pu846z1 | Does note spring newspaper and that thing? | [{'choice': 'Neutral', 'number': 7}] | 8J14yp561A1 | 1970-07-08 | halvarado@example.com |
8 | 6ge7Zt058d3 | Why does message next eat the stay? | [{'choice': 'Neutral', 'number': 0}] | 8J14yp561A1 | 1970-07-08 | halvarado@example.com |
9 | 2EB3vo658l2 | How does no later then inside fill discover? | [{'choice': 'Agree', 'number': 4}] | 8J14yp561A1 | 1970-07-08 | halvarado@example.com |
We must go deeper!
That works for the most part, but we still have that annoying choices json array that would be nice to split out into columns.
= pd.json_normalize(
df5
sample_data, ="choices",
record_path=[
meta"respondent_id",
"survey_date",
"email",
], )
Simply passing the choices field to the record_path
param results in a KeyError
though. This is because the choices field is actually nested in the responses field. So pandas need us to construct a path to reach it. We can get to it by passing each key as a record in the list to construct a path. Here that looks like ["responses", "choices"]
.
= pd.json_normalize(
df
sample_data, =["responses", "choices"],
record_path=[
meta"respondent_id",
"survey_date",
"email",
],
) df
choice | number | respondent_id | survey_date | ||
---|---|---|---|---|---|
0 | Strongly Disagree | 9 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com |
1 | Strongly Disagree | 4 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com |
2 | Disagree | 2 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com |
3 | Neutral | 6 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com |
4 | Strongly Disagree | 5 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com |
5 | Agree | 3 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com |
6 | Strongly Disagree | 9 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com |
7 | Neutral | 7 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com |
8 | Neutral | 0 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com |
9 | Agree | 4 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com |
But when we do that, we lose our question_id and question_text fields. That’s because we need to add them in the meta list and pass their paths like the record path param. See below.
= pd.json_normalize(
df
sample_data, =["responses", "choices"],
record_path=[
meta"respondent_id",
"survey_date",
"email",
"responses", "question_id"],
["responses", "question_text"],
[
],
) df
choice | number | respondent_id | survey_date | responses.question_id | responses.question_text | ||
---|---|---|---|---|---|---|---|
0 | Strongly Disagree | 9 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 9lQ8OH810H2 | How does mean action onto south usually prepar... |
1 | Strongly Disagree | 4 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 8YZ3Ri208p1 | Does million mean tax foot statement? |
2 | Disagree | 2 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 2Ys2yn819r0 | How does decade what air scientist defense all... |
3 | Neutral | 6 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 3ii1tu719B3 | Does focus statement peace forward do relate? |
4 | Strongly Disagree | 5 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 9xv2JX456C7 | Does explain ability plant? |
5 | Agree | 3 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 3sG5am762C6 | Why does know writer ball bad whole? |
6 | Strongly Disagree | 9 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 5uM6LP013p7 | Why does break chance boy enjoy call paper yet? |
7 | Neutral | 7 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 3NE9pu846z1 | Does note spring newspaper and that thing? |
8 | Neutral | 0 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 6ge7Zt058d3 | Why does message next eat the stay? |
9 | Agree | 4 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 2EB3vo658l2 | How does no later then inside fill discover? |
One last tweak: some databases (like MS SQL) don’t like naming columns with that period in the name. As a work around you can give the json_normalize
function a custom separator such as an underscore instead.
= pd.json_normalize(
df
sample_data, ="_",
sep=["responses", "choices"],
record_path=[
meta"respondent_id",
"survey_date",
"email",
"responses", "question_id"],
["responses", "question_text"]
[
],
) df
choice | number | respondent_id | survey_date | responses_question_id | responses_question_text | ||
---|---|---|---|---|---|---|---|
0 | Strongly Disagree | 9 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 9lQ8OH810H2 | How does mean action onto south usually prepar... |
1 | Strongly Disagree | 4 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 8YZ3Ri208p1 | Does million mean tax foot statement? |
2 | Disagree | 2 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 2Ys2yn819r0 | How does decade what air scientist defense all... |
3 | Neutral | 6 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 3ii1tu719B3 | Does focus statement peace forward do relate? |
4 | Strongly Disagree | 5 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 9xv2JX456C7 | Does explain ability plant? |
5 | Agree | 3 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 3sG5am762C6 | Why does know writer ball bad whole? |
6 | Strongly Disagree | 9 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 5uM6LP013p7 | Why does break chance boy enjoy call paper yet? |
7 | Neutral | 7 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 3NE9pu846z1 | Does note spring newspaper and that thing? |
8 | Neutral | 0 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 6ge7Zt058d3 | Why does message next eat the stay? |
9 | Agree | 4 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 2EB3vo658l2 | How does no later then inside fill discover? |
Finally, we’ll order the columns from the least nested level on the left all the way to the most nested on the right for easier readability.
= ["respondent_id", "survey_date", "email", "responses_question_id", "responses_question_text", "choice", "number"]
column_order = df[column_order]
df df
respondent_id | survey_date | responses_question_id | responses_question_text | choice | number | ||
---|---|---|---|---|---|---|---|
0 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 9lQ8OH810H2 | How does mean action onto south usually prepar... | Strongly Disagree | 9 |
1 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 8YZ3Ri208p1 | Does million mean tax foot statement? | Strongly Disagree | 4 |
2 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 2Ys2yn819r0 | How does decade what air scientist defense all... | Disagree | 2 |
3 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 3ii1tu719B3 | Does focus statement peace forward do relate? | Neutral | 6 |
4 | 1W38Sn628N9 | 1985-09-30 | ofrench@example.com | 9xv2JX456C7 | Does explain ability plant? | Strongly Disagree | 5 |
5 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 3sG5am762C6 | Why does know writer ball bad whole? | Agree | 3 |
6 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 5uM6LP013p7 | Why does break chance boy enjoy call paper yet? | Strongly Disagree | 9 |
7 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 3NE9pu846z1 | Does note spring newspaper and that thing? | Neutral | 7 |
8 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 6ge7Zt058d3 | Why does message next eat the stay? | Neutral | 0 |
9 | 8J14yp561A1 | 1970-07-08 | halvarado@example.com | 2EB3vo658l2 | How does no later then inside fill discover? | Agree | 4 |
Parsed and ready to import
At this point our data is in a simple tabular format and ready to import into a database table with something like pandas.to_sql() function, but we’ll save that for another post.