dchess.org - Normalize JSON with Pandas

When processing nested JSON data into a flat structure for importing into a relational database, it can be tricky to structure the data into the right shape. Pandas has a great tool for doing this called pandas.json_normalize() but the documentation doesn’t make it obvious how to leverage its capabilities for handling nested data structures.

I thought I could provide a brief example using some randomly generated survey response data (using the Faker library) to illustrate it’s advantages.

Setup

To start, I’m going to be using pandas and Faker so we’ll import those. I’m also going to need to easily display the parsed json as well as the returned dataframes, so I’m importing the json module from the standard lib as well as some IPython notebook helpers for displaying dataframes as HTML tables.

import json
from IPython.display import display, HTML
from faker import Faker
from faker.providers import BaseProvider, date_time, internet
import pandas as pd

Fake Data

Faker doesn’t have a built in provider for survey questions, so let’s go ahead and add a simple one that creates non-sensical questions with a simple hack to the sentence provider.

class MyProvider(BaseProvider):
    def question(self):
        stems = ('Does', 'How does', 'Which', 'Why does')
        stem = Faker().random_choices(elements=stems, length=1)[0]
        sentence = Faker().sentence()
        sentence = sentence[0].lower() + sentence[1:]
        question = sentence.replace(".", "?")
        question = f"{stem} {question}"
        return question

Using this new question provider, we’ll construct a few records of fake survey response data with some respondent level data like a respondent_id, survey_date, and respondent email. Within that we’ll nest a list of responses which will in turn have it’s own dictionary of data at the question level: id, question text, and choices. The choices list will be singular here, but assume it has that structure because the API this comes from has to also account for multi-select options and we’ll need to parse it as a list regardless. Depending on our analysis needs, this might also be a place where we’d want to keep these in a comma separated string, but for our purposes here we’ll ignore that use case.

fake = Faker()
fake.add_provider(MyProvider)
choices = ('Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree')


sample_data = [
    {
        "respondent_id": fake.bothify(text="#?##??###?#"),
        "survey_date": fake.date(),
        "email": fake.email(domain="example.com"),
        "responses": [
            {
                "question_id": fake.bothify(text="#??#??###?#"),
                "question_text": fake.question(),
                "choices": [
                    {
                        "choice": fake.random_choices(elements=choices, length=1)[0],
                        "number": fake.random_digit(),
                    }
                ]
            } for _ in range(5)
        ]
    } for _ in range(2) 
]

Let’s print out a single record to see the resulting data structure that has been generated randomly.

print(json.dumps(sample_data[0], indent=2))

{
  "respondent_id": "1W38Sn628N9",
  "survey_date": "1985-09-30",
  "email": "ofrench@example.com",
  "responses": [
    {
      "question_id": "9lQ8OH810H2",
      "question_text": "How does mean action onto south usually prepare that?",
      "choices": [
        {
          "choice": "Strongly Disagree",
          "number": 9
        }
      ]
    },
    {
      "question_id": "8YZ3Ri208p1",
      "question_text": "Does million mean tax foot statement?",
      "choices": [
        {
          "choice": "Strongly Disagree",
          "number": 4
        }
      ]
    },
    {
      "question_id": "2Ys2yn819r0",
      "question_text": "How does decade what air scientist defense allow entire?",
      "choices": [
        {
          "choice": "Disagree",
          "number": 2
        }
      ]
    },
    {
      "question_id": "3ii1tu719B3",
      "question_text": "Does focus statement peace forward do relate?",
      "choices": [
        {
          "choice": "Neutral",
          "number": 6
        }
      ]
    },
    {
      "question_id": "9xv2JX456C7",
      "question_text": "Does explain ability plant?",
      "choices": [
        {
          "choice": "Strongly Disagree",
          "number": 5
        }
      ]
    }
  ]
}

Reading with Pandas

As you can see below, simply reading this directly into a dataframe only parses the top level respondent data, but then keeps the responses data as a json array. Which isn’t great for simple analysis. Could you load that “as is” into a jsonb field in PostgreSQL? Sure. If you like parsing json with SQL. Yuck!

df = pd.DataFrame(sample_data)
df

	respondent_id	survey_date	email	responses
0	1W38Sn628N9	1985-09-30	ofrench@example.com	[{'question_id': '9lQ8OH810H2', 'question_text...
1	8J14yp561A1	1970-07-08	halvarado@example.com	[{'question_id': '3sG5am762C6', 'question_text...

A messy custom parser

We could attempt to reshape this by writing some custom functions to handle extracting the responses and merging that data with the top-level meta data about the respondent with some dictionary unpacking, but this gets messy and would fall apart quickly as the structure changed. While this approach works, it’s not ideal.

Sure this code could be further refactored to simplify the logic, but it’s not worth it since the pandas.json_normalize() can do this for us easily.

def record_format(responses):
    data = []
    for response in responses:
        record = {
            "question_id": response.get("question_id"),
            "question_text": response.get("question_text"),
            "choice": response.get("choices")[0].get("choice"),
            "number": response.get("choices")[0].get("number"),
        }
        data.append(record)
    return data

def parse_json(records):
    data = []
    for record in sample_data:
        meta = {
            "respondent_id": record.get("respondent_id"),
            "survey_date": record.get("survey_date"),
            "email": record.get("email"),
        }
        responses = record.get("responses")
        formatted_responses = record_format(responses)
        for response in formatted_responses:
            combined = {**meta, **response}
            data.append(combined)
    return data

clean_data = parse_json(sample_data)
df = pd.DataFrame(clean_data)
df

	respondent_id	survey_date	email	question_id	question_text	choice	number
0	1W38Sn628N9	1985-09-30	ofrench@example.com	9lQ8OH810H2	How does mean action onto south usually prepar...	Strongly Disagree	9
1	1W38Sn628N9	1985-09-30	ofrench@example.com	8YZ3Ri208p1	Does million mean tax foot statement?	Strongly Disagree	4
2	1W38Sn628N9	1985-09-30	ofrench@example.com	2Ys2yn819r0	How does decade what air scientist defense all...	Disagree	2
3	1W38Sn628N9	1985-09-30	ofrench@example.com	3ii1tu719B3	Does focus statement peace forward do relate?	Neutral	6
4	1W38Sn628N9	1985-09-30	ofrench@example.com	9xv2JX456C7	Does explain ability plant?	Strongly Disagree	5
5	8J14yp561A1	1970-07-08	halvarado@example.com	3sG5am762C6	Why does know writer ball bad whole?	Agree	3
6	8J14yp561A1	1970-07-08	halvarado@example.com	5uM6LP013p7	Why does break chance boy enjoy call paper yet?	Strongly Disagree	9
7	8J14yp561A1	1970-07-08	halvarado@example.com	3NE9pu846z1	Does note spring newspaper and that thing?	Neutral	7
8	8J14yp561A1	1970-07-08	halvarado@example.com	6ge7Zt058d3	Why does message next eat the stay?	Neutral	0
9	8J14yp561A1	1970-07-08	halvarado@example.com	2EB3vo658l2	How does no later then inside fill discover?	Agree	4

JSON Normalize

Thankfully there is the json_normalize() function, but it requires a little understanding to get it to satisfactorily parse flat. Simply passing it the sample data without any parameters results in a very familiar result that gets us no further than we started in the first attempt.

df = pd.json_normalize(sample_data)
df

	respondent_id	survey_date	email	responses
0	1W38Sn628N9	1985-09-30	ofrench@example.com	[{'question_id': '9lQ8OH810H2', 'question_text...
1	8J14yp561A1	1970-07-08	halvarado@example.com	[{'question_id': '3sG5am762C6', 'question_text...

A few optional parameters can be used here to parse the first nested array called responses. We can direct the pandas json parser to a specific key as the source of records. The record_path parameter takes either a string or list of strings to construct that path. The name of this parameter is a hint about how to think of this when passed as a list as we’ll see later.

df = pd.json_normalize(
    sample_data, 
    record_path="responses", 
)
df

	question_id	question_text	choices
0	9lQ8OH810H2	How does mean action onto south usually prepar...	[{'choice': 'Strongly Disagree', 'number': 9}]
1	8YZ3Ri208p1	Does million mean tax foot statement?	[{'choice': 'Strongly Disagree', 'number': 4}]
2	2Ys2yn819r0	How does decade what air scientist defense all...	[{'choice': 'Disagree', 'number': 2}]
3	3ii1tu719B3	Does focus statement peace forward do relate?	[{'choice': 'Neutral', 'number': 6}]
4	9xv2JX456C7	Does explain ability plant?	[{'choice': 'Strongly Disagree', 'number': 5}]
5	3sG5am762C6	Why does know writer ball bad whole?	[{'choice': 'Agree', 'number': 3}]
6	5uM6LP013p7	Why does break chance boy enjoy call paper yet?	[{'choice': 'Strongly Disagree', 'number': 9}]
7	3NE9pu846z1	Does note spring newspaper and that thing?	[{'choice': 'Neutral', 'number': 7}]
8	6ge7Zt058d3	Why does message next eat the stay?	[{'choice': 'Neutral', 'number': 0}]
9	2EB3vo658l2	How does no later then inside fill discover?	[{'choice': 'Agree', 'number': 4}]

But when we direct the parser to just unpack the reponses array, we lose our data from the level above. Pandas can be instructed to keep this by giving it a list of metadata to repeat for each record it unpacks from the level above. We use the meta parameter and pass it a list of the fields to include.

df = pd.json_normalize(
    sample_data, 
    record_path="responses", 
     meta=[
        "respondent_id", 
        "survey_date",
        "email", 
    ],
)
df

	question_id	question_text	choices	respondent_id	survey_date	email
0	9lQ8OH810H2	How does mean action onto south usually prepar...	[{'choice': 'Strongly Disagree', 'number': 9}]	1W38Sn628N9	1985-09-30	ofrench@example.com
1	8YZ3Ri208p1	Does million mean tax foot statement?	[{'choice': 'Strongly Disagree', 'number': 4}]	1W38Sn628N9	1985-09-30	ofrench@example.com
2	2Ys2yn819r0	How does decade what air scientist defense all...	[{'choice': 'Disagree', 'number': 2}]	1W38Sn628N9	1985-09-30	ofrench@example.com
3	3ii1tu719B3	Does focus statement peace forward do relate?	[{'choice': 'Neutral', 'number': 6}]	1W38Sn628N9	1985-09-30	ofrench@example.com
4	9xv2JX456C7	Does explain ability plant?	[{'choice': 'Strongly Disagree', 'number': 5}]	1W38Sn628N9	1985-09-30	ofrench@example.com
5	3sG5am762C6	Why does know writer ball bad whole?	[{'choice': 'Agree', 'number': 3}]	8J14yp561A1	1970-07-08	halvarado@example.com
6	5uM6LP013p7	Why does break chance boy enjoy call paper yet?	[{'choice': 'Strongly Disagree', 'number': 9}]	8J14yp561A1	1970-07-08	halvarado@example.com
7	3NE9pu846z1	Does note spring newspaper and that thing?	[{'choice': 'Neutral', 'number': 7}]	8J14yp561A1	1970-07-08	halvarado@example.com
8	6ge7Zt058d3	Why does message next eat the stay?	[{'choice': 'Neutral', 'number': 0}]	8J14yp561A1	1970-07-08	halvarado@example.com
9	2EB3vo658l2	How does no later then inside fill discover?	[{'choice': 'Agree', 'number': 4}]	8J14yp561A1	1970-07-08	halvarado@example.com

We must go deeper!

That works for the most part, but we still have that annoying choices json array that would be nice to split out into columns.

df5 = pd.json_normalize(
    sample_data, 
    record_path="choices", 
    meta=[
        "respondent_id", 
        "survey_date",
        "email", 
    ],
)

Simply passing the choices field to the record_path param results in a KeyError though. This is because the choices field is actually nested in the responses field. So pandas need us to construct a path to reach it. We can get to it by passing each key as a record in the list to construct a path. Here that looks like ["responses", "choices"].

df = pd.json_normalize(
    sample_data, 
    record_path=["responses", "choices"], 
    meta=[
        "respondent_id", 
        "survey_date",
        "email", 
    ],
)
df

	choice	number	respondent_id	survey_date	email
0	Strongly Disagree	9	1W38Sn628N9	1985-09-30	ofrench@example.com
1	Strongly Disagree	4	1W38Sn628N9	1985-09-30	ofrench@example.com
2	Disagree	2	1W38Sn628N9	1985-09-30	ofrench@example.com
3	Neutral	6	1W38Sn628N9	1985-09-30	ofrench@example.com
4	Strongly Disagree	5	1W38Sn628N9	1985-09-30	ofrench@example.com
5	Agree	3	8J14yp561A1	1970-07-08	halvarado@example.com
6	Strongly Disagree	9	8J14yp561A1	1970-07-08	halvarado@example.com
7	Neutral	7	8J14yp561A1	1970-07-08	halvarado@example.com
8	Neutral	0	8J14yp561A1	1970-07-08	halvarado@example.com
9	Agree	4	8J14yp561A1	1970-07-08	halvarado@example.com

But when we do that, we lose our question_id and question_text fields. That’s because we need to add them in the meta list and pass their paths like the record path param. See below.

df = pd.json_normalize(
    sample_data, 
    record_path=["responses", "choices"], 
    meta=[
        "respondent_id", 
        "survey_date",
        "email", 
        ["responses", "question_id"],
        ["responses", "question_text"],
    ],
)
df

	choice	number	respondent_id	survey_date	email	responses.question_id	responses.question_text
0	Strongly Disagree	9	1W38Sn628N9	1985-09-30	ofrench@example.com	9lQ8OH810H2	How does mean action onto south usually prepar...
1	Strongly Disagree	4	1W38Sn628N9	1985-09-30	ofrench@example.com	8YZ3Ri208p1	Does million mean tax foot statement?
2	Disagree	2	1W38Sn628N9	1985-09-30	ofrench@example.com	2Ys2yn819r0	How does decade what air scientist defense all...
3	Neutral	6	1W38Sn628N9	1985-09-30	ofrench@example.com	3ii1tu719B3	Does focus statement peace forward do relate?
4	Strongly Disagree	5	1W38Sn628N9	1985-09-30	ofrench@example.com	9xv2JX456C7	Does explain ability plant?
5	Agree	3	8J14yp561A1	1970-07-08	halvarado@example.com	3sG5am762C6	Why does know writer ball bad whole?
6	Strongly Disagree	9	8J14yp561A1	1970-07-08	halvarado@example.com	5uM6LP013p7	Why does break chance boy enjoy call paper yet?
7	Neutral	7	8J14yp561A1	1970-07-08	halvarado@example.com	3NE9pu846z1	Does note spring newspaper and that thing?
8	Neutral	0	8J14yp561A1	1970-07-08	halvarado@example.com	6ge7Zt058d3	Why does message next eat the stay?
9	Agree	4	8J14yp561A1	1970-07-08	halvarado@example.com	2EB3vo658l2	How does no later then inside fill discover?

One last tweak: some databases (like MS SQL) don’t like naming columns with that period in the name. As a work around you can give the json_normalize function a custom separator such as an underscore instead.

df = pd.json_normalize(
    sample_data, 
    sep="_",
    record_path=["responses", "choices"], 
    meta=[
        "respondent_id", 
        "survey_date",
        "email", 
        ["responses", "question_id"],
        ["responses", "question_text"]
    ],
)
df

	choice	number	respondent_id	survey_date	email	responses_question_id	responses_question_text
0	Strongly Disagree	9	1W38Sn628N9	1985-09-30	ofrench@example.com	9lQ8OH810H2	How does mean action onto south usually prepar...
1	Strongly Disagree	4	1W38Sn628N9	1985-09-30	ofrench@example.com	8YZ3Ri208p1	Does million mean tax foot statement?
2	Disagree	2	1W38Sn628N9	1985-09-30	ofrench@example.com	2Ys2yn819r0	How does decade what air scientist defense all...
3	Neutral	6	1W38Sn628N9	1985-09-30	ofrench@example.com	3ii1tu719B3	Does focus statement peace forward do relate?
4	Strongly Disagree	5	1W38Sn628N9	1985-09-30	ofrench@example.com	9xv2JX456C7	Does explain ability plant?
5	Agree	3	8J14yp561A1	1970-07-08	halvarado@example.com	3sG5am762C6	Why does know writer ball bad whole?
6	Strongly Disagree	9	8J14yp561A1	1970-07-08	halvarado@example.com	5uM6LP013p7	Why does break chance boy enjoy call paper yet?
7	Neutral	7	8J14yp561A1	1970-07-08	halvarado@example.com	3NE9pu846z1	Does note spring newspaper and that thing?
8	Neutral	0	8J14yp561A1	1970-07-08	halvarado@example.com	6ge7Zt058d3	Why does message next eat the stay?
9	Agree	4	8J14yp561A1	1970-07-08	halvarado@example.com	2EB3vo658l2	How does no later then inside fill discover?

Finally, we’ll order the columns from the least nested level on the left all the way to the most nested on the right for easier readability.

column_order = ["respondent_id", "survey_date", "email", "responses_question_id", "responses_question_text", "choice", "number"]
df = df[column_order]
df

	respondent_id	survey_date	email	responses_question_id	responses_question_text	choice	number
0	1W38Sn628N9	1985-09-30	ofrench@example.com	9lQ8OH810H2	How does mean action onto south usually prepar...	Strongly Disagree	9
1	1W38Sn628N9	1985-09-30	ofrench@example.com	8YZ3Ri208p1	Does million mean tax foot statement?	Strongly Disagree	4
2	1W38Sn628N9	1985-09-30	ofrench@example.com	2Ys2yn819r0	How does decade what air scientist defense all...	Disagree	2
3	1W38Sn628N9	1985-09-30	ofrench@example.com	3ii1tu719B3	Does focus statement peace forward do relate?	Neutral	6
4	1W38Sn628N9	1985-09-30	ofrench@example.com	9xv2JX456C7	Does explain ability plant?	Strongly Disagree	5
5	8J14yp561A1	1970-07-08	halvarado@example.com	3sG5am762C6	Why does know writer ball bad whole?	Agree	3
6	8J14yp561A1	1970-07-08	halvarado@example.com	5uM6LP013p7	Why does break chance boy enjoy call paper yet?	Strongly Disagree	9
7	8J14yp561A1	1970-07-08	halvarado@example.com	3NE9pu846z1	Does note spring newspaper and that thing?	Neutral	7
8	8J14yp561A1	1970-07-08	halvarado@example.com	6ge7Zt058d3	Why does message next eat the stay?	Neutral	0
9	8J14yp561A1	1970-07-08	halvarado@example.com	2EB3vo658l2	How does no later then inside fill discover?	Agree	4

Parsed and ready to import

At this point our data is in a simple tabular format and ready to import into a database table with something like pandas.to_sql() function, but we’ll save that for another post.