Normalize JSON with Pandas

tutorial
Published

January 12, 2021

When processing nested JSON data into a flat structure for importing into a relational database, it can be tricky to structure the data into the right shape. Pandas has a great tool for doing this called pandas.json_normalize() but the documentation doesn’t make it obvious how to leverage its capabilities for handling nested data structures.

I thought I could provide a brief example using some randomly generated survey response data (using the Faker library) to illustrate it’s advantages.

Setup

To start, I’m going to be using pandas and Faker so we’ll import those. I’m also going to need to easily display the parsed json as well as the returned dataframes, so I’m importing the json module from the standard lib as well as some IPython notebook helpers for displaying dataframes as HTML tables.

import json
from IPython.display import display, HTML
from faker import Faker
from faker.providers import BaseProvider, date_time, internet
import pandas as pd

Fake Data

Faker doesn’t have a built in provider for survey questions, so let’s go ahead and add a simple one that creates non-sensical questions with a simple hack to the sentence provider.

class MyProvider(BaseProvider):
    def question(self):
        stems = ('Does', 'How does', 'Which', 'Why does')
        stem = Faker().random_choices(elements=stems, length=1)[0]
        sentence = Faker().sentence()
        sentence = sentence[0].lower() + sentence[1:]
        question = sentence.replace(".", "?")
        question = f"{stem} {question}"
        return question

Using this new question provider, we’ll construct a few records of fake survey response data with some respondent level data like a respondent_id, survey_date, and respondent email. Within that we’ll nest a list of responses which will in turn have it’s own dictionary of data at the question level: id, question text, and choices. The choices list will be singular here, but assume it has that structure because the API this comes from has to also account for multi-select options and we’ll need to parse it as a list regardless. Depending on our analysis needs, this might also be a place where we’d want to keep these in a comma separated string, but for our purposes here we’ll ignore that use case.

fake = Faker()
fake.add_provider(MyProvider)
choices = ('Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree')


sample_data = [
    {
        "respondent_id": fake.bothify(text="#?##??###?#"),
        "survey_date": fake.date(),
        "email": fake.email(domain="example.com"),
        "responses": [
            {
                "question_id": fake.bothify(text="#??#??###?#"),
                "question_text": fake.question(),
                "choices": [
                    {
                        "choice": fake.random_choices(elements=choices, length=1)[0],
                        "number": fake.random_digit(),
                    }
                ]
            } for _ in range(5)
        ]
    } for _ in range(2) 
]

Let’s print out a single record to see the resulting data structure that has been generated randomly.

print(json.dumps(sample_data[0], indent=2))
{
  "respondent_id": "1W38Sn628N9",
  "survey_date": "1985-09-30",
  "email": "ofrench@example.com",
  "responses": [
    {
      "question_id": "9lQ8OH810H2",
      "question_text": "How does mean action onto south usually prepare that?",
      "choices": [
        {
          "choice": "Strongly Disagree",
          "number": 9
        }
      ]
    },
    {
      "question_id": "8YZ3Ri208p1",
      "question_text": "Does million mean tax foot statement?",
      "choices": [
        {
          "choice": "Strongly Disagree",
          "number": 4
        }
      ]
    },
    {
      "question_id": "2Ys2yn819r0",
      "question_text": "How does decade what air scientist defense allow entire?",
      "choices": [
        {
          "choice": "Disagree",
          "number": 2
        }
      ]
    },
    {
      "question_id": "3ii1tu719B3",
      "question_text": "Does focus statement peace forward do relate?",
      "choices": [
        {
          "choice": "Neutral",
          "number": 6
        }
      ]
    },
    {
      "question_id": "9xv2JX456C7",
      "question_text": "Does explain ability plant?",
      "choices": [
        {
          "choice": "Strongly Disagree",
          "number": 5
        }
      ]
    }
  ]
}

Reading with Pandas

As you can see below, simply reading this directly into a dataframe only parses the top level respondent data, but then keeps the responses data as a json array. Which isn’t great for simple analysis. Could you load that “as is” into a jsonb field in PostgreSQL? Sure. If you like parsing json with SQL. Yuck!

df = pd.DataFrame(sample_data)
df
respondent_id survey_date email responses
0 1W38Sn628N9 1985-09-30 ofrench@example.com [{'question_id': '9lQ8OH810H2', 'question_text...
1 8J14yp561A1 1970-07-08 halvarado@example.com [{'question_id': '3sG5am762C6', 'question_text...

A messy custom parser

We could attempt to reshape this by writing some custom functions to handle extracting the responses and merging that data with the top-level meta data about the respondent with some dictionary unpacking, but this gets messy and would fall apart quickly as the structure changed. While this approach works, it’s not ideal.

Sure this code could be further refactored to simplify the logic, but it’s not worth it since the pandas.json_normalize() can do this for us easily.

def record_format(responses):
    data = []
    for response in responses:
        record = {
            "question_id": response.get("question_id"),
            "question_text": response.get("question_text"),
            "choice": response.get("choices")[0].get("choice"),
            "number": response.get("choices")[0].get("number"),
        }
        data.append(record)
    return data

def parse_json(records):
    data = []
    for record in sample_data:
        meta = {
            "respondent_id": record.get("respondent_id"),
            "survey_date": record.get("survey_date"),
            "email": record.get("email"),
        }
        responses = record.get("responses")
        formatted_responses = record_format(responses)
        for response in formatted_responses:
            combined = {**meta, **response}
            data.append(combined)
    return data

clean_data = parse_json(sample_data)
df = pd.DataFrame(clean_data)
df
respondent_id survey_date email question_id question_text choice number
0 1W38Sn628N9 1985-09-30 ofrench@example.com 9lQ8OH810H2 How does mean action onto south usually prepar... Strongly Disagree 9
1 1W38Sn628N9 1985-09-30 ofrench@example.com 8YZ3Ri208p1 Does million mean tax foot statement? Strongly Disagree 4
2 1W38Sn628N9 1985-09-30 ofrench@example.com 2Ys2yn819r0 How does decade what air scientist defense all... Disagree 2
3 1W38Sn628N9 1985-09-30 ofrench@example.com 3ii1tu719B3 Does focus statement peace forward do relate? Neutral 6
4 1W38Sn628N9 1985-09-30 ofrench@example.com 9xv2JX456C7 Does explain ability plant? Strongly Disagree 5
5 8J14yp561A1 1970-07-08 halvarado@example.com 3sG5am762C6 Why does know writer ball bad whole? Agree 3
6 8J14yp561A1 1970-07-08 halvarado@example.com 5uM6LP013p7 Why does break chance boy enjoy call paper yet? Strongly Disagree 9
7 8J14yp561A1 1970-07-08 halvarado@example.com 3NE9pu846z1 Does note spring newspaper and that thing? Neutral 7
8 8J14yp561A1 1970-07-08 halvarado@example.com 6ge7Zt058d3 Why does message next eat the stay? Neutral 0
9 8J14yp561A1 1970-07-08 halvarado@example.com 2EB3vo658l2 How does no later then inside fill discover? Agree 4

JSON Normalize

Thankfully there is the json_normalize() function, but it requires a little understanding to get it to satisfactorily parse flat. Simply passing it the sample data without any parameters results in a very familiar result that gets us no further than we started in the first attempt.

df = pd.json_normalize(sample_data)
df
respondent_id survey_date email responses
0 1W38Sn628N9 1985-09-30 ofrench@example.com [{'question_id': '9lQ8OH810H2', 'question_text...
1 8J14yp561A1 1970-07-08 halvarado@example.com [{'question_id': '3sG5am762C6', 'question_text...

A few optional parameters can be used here to parse the first nested array called responses. We can direct the pandas json parser to a specific key as the source of records. The record_path parameter takes either a string or list of strings to construct that path. The name of this parameter is a hint about how to think of this when passed as a list as we’ll see later.

df = pd.json_normalize(
    sample_data, 
    record_path="responses", 
)
df
question_id question_text choices
0 9lQ8OH810H2 How does mean action onto south usually prepar... [{'choice': 'Strongly Disagree', 'number': 9}]
1 8YZ3Ri208p1 Does million mean tax foot statement? [{'choice': 'Strongly Disagree', 'number': 4}]
2 2Ys2yn819r0 How does decade what air scientist defense all... [{'choice': 'Disagree', 'number': 2}]
3 3ii1tu719B3 Does focus statement peace forward do relate? [{'choice': 'Neutral', 'number': 6}]
4 9xv2JX456C7 Does explain ability plant? [{'choice': 'Strongly Disagree', 'number': 5}]
5 3sG5am762C6 Why does know writer ball bad whole? [{'choice': 'Agree', 'number': 3}]
6 5uM6LP013p7 Why does break chance boy enjoy call paper yet? [{'choice': 'Strongly Disagree', 'number': 9}]
7 3NE9pu846z1 Does note spring newspaper and that thing? [{'choice': 'Neutral', 'number': 7}]
8 6ge7Zt058d3 Why does message next eat the stay? [{'choice': 'Neutral', 'number': 0}]
9 2EB3vo658l2 How does no later then inside fill discover? [{'choice': 'Agree', 'number': 4}]

But when we direct the parser to just unpack the reponses array, we lose our data from the level above. Pandas can be instructed to keep this by giving it a list of metadata to repeat for each record it unpacks from the level above. We use the meta parameter and pass it a list of the fields to include.

df = pd.json_normalize(
    sample_data, 
    record_path="responses", 
     meta=[
        "respondent_id", 
        "survey_date",
        "email", 
    ],
)
df
   
question_id question_text choices respondent_id survey_date email
0 9lQ8OH810H2 How does mean action onto south usually prepar... [{'choice': 'Strongly Disagree', 'number': 9}] 1W38Sn628N9 1985-09-30 ofrench@example.com
1 8YZ3Ri208p1 Does million mean tax foot statement? [{'choice': 'Strongly Disagree', 'number': 4}] 1W38Sn628N9 1985-09-30 ofrench@example.com
2 2Ys2yn819r0 How does decade what air scientist defense all... [{'choice': 'Disagree', 'number': 2}] 1W38Sn628N9 1985-09-30 ofrench@example.com
3 3ii1tu719B3 Does focus statement peace forward do relate? [{'choice': 'Neutral', 'number': 6}] 1W38Sn628N9 1985-09-30 ofrench@example.com
4 9xv2JX456C7 Does explain ability plant? [{'choice': 'Strongly Disagree', 'number': 5}] 1W38Sn628N9 1985-09-30 ofrench@example.com
5 3sG5am762C6 Why does know writer ball bad whole? [{'choice': 'Agree', 'number': 3}] 8J14yp561A1 1970-07-08 halvarado@example.com
6 5uM6LP013p7 Why does break chance boy enjoy call paper yet? [{'choice': 'Strongly Disagree', 'number': 9}] 8J14yp561A1 1970-07-08 halvarado@example.com
7 3NE9pu846z1 Does note spring newspaper and that thing? [{'choice': 'Neutral', 'number': 7}] 8J14yp561A1 1970-07-08 halvarado@example.com
8 6ge7Zt058d3 Why does message next eat the stay? [{'choice': 'Neutral', 'number': 0}] 8J14yp561A1 1970-07-08 halvarado@example.com
9 2EB3vo658l2 How does no later then inside fill discover? [{'choice': 'Agree', 'number': 4}] 8J14yp561A1 1970-07-08 halvarado@example.com

We must go deeper!

That works for the most part, but we still have that annoying choices json array that would be nice to split out into columns.

df5 = pd.json_normalize(
    sample_data, 
    record_path="choices", 
    meta=[
        "respondent_id", 
        "survey_date",
        "email", 
    ],
)

Simply passing the choices field to the record_path param results in a KeyError though. This is because the choices field is actually nested in the responses field. So pandas need us to construct a path to reach it. We can get to it by passing each key as a record in the list to construct a path. Here that looks like ["responses", "choices"].

df = pd.json_normalize(
    sample_data, 
    record_path=["responses", "choices"], 
    meta=[
        "respondent_id", 
        "survey_date",
        "email", 
    ],
)
df
choice number respondent_id survey_date email
0 Strongly Disagree 9 1W38Sn628N9 1985-09-30 ofrench@example.com
1 Strongly Disagree 4 1W38Sn628N9 1985-09-30 ofrench@example.com
2 Disagree 2 1W38Sn628N9 1985-09-30 ofrench@example.com
3 Neutral 6 1W38Sn628N9 1985-09-30 ofrench@example.com
4 Strongly Disagree 5 1W38Sn628N9 1985-09-30 ofrench@example.com
5 Agree 3 8J14yp561A1 1970-07-08 halvarado@example.com
6 Strongly Disagree 9 8J14yp561A1 1970-07-08 halvarado@example.com
7 Neutral 7 8J14yp561A1 1970-07-08 halvarado@example.com
8 Neutral 0 8J14yp561A1 1970-07-08 halvarado@example.com
9 Agree 4 8J14yp561A1 1970-07-08 halvarado@example.com

But when we do that, we lose our question_id and question_text fields. That’s because we need to add them in the meta list and pass their paths like the record path param. See below.

df = pd.json_normalize(
    sample_data, 
    record_path=["responses", "choices"], 
    meta=[
        "respondent_id", 
        "survey_date",
        "email", 
        ["responses", "question_id"],
        ["responses", "question_text"],
    ],
)
df
choice number respondent_id survey_date email responses.question_id responses.question_text
0 Strongly Disagree 9 1W38Sn628N9 1985-09-30 ofrench@example.com 9lQ8OH810H2 How does mean action onto south usually prepar...
1 Strongly Disagree 4 1W38Sn628N9 1985-09-30 ofrench@example.com 8YZ3Ri208p1 Does million mean tax foot statement?
2 Disagree 2 1W38Sn628N9 1985-09-30 ofrench@example.com 2Ys2yn819r0 How does decade what air scientist defense all...
3 Neutral 6 1W38Sn628N9 1985-09-30 ofrench@example.com 3ii1tu719B3 Does focus statement peace forward do relate?
4 Strongly Disagree 5 1W38Sn628N9 1985-09-30 ofrench@example.com 9xv2JX456C7 Does explain ability plant?
5 Agree 3 8J14yp561A1 1970-07-08 halvarado@example.com 3sG5am762C6 Why does know writer ball bad whole?
6 Strongly Disagree 9 8J14yp561A1 1970-07-08 halvarado@example.com 5uM6LP013p7 Why does break chance boy enjoy call paper yet?
7 Neutral 7 8J14yp561A1 1970-07-08 halvarado@example.com 3NE9pu846z1 Does note spring newspaper and that thing?
8 Neutral 0 8J14yp561A1 1970-07-08 halvarado@example.com 6ge7Zt058d3 Why does message next eat the stay?
9 Agree 4 8J14yp561A1 1970-07-08 halvarado@example.com 2EB3vo658l2 How does no later then inside fill discover?

One last tweak: some databases (like MS SQL) don’t like naming columns with that period in the name. As a work around you can give the json_normalize function a custom separator such as an underscore instead.

df = pd.json_normalize(
    sample_data, 
    sep="_",
    record_path=["responses", "choices"], 
    meta=[
        "respondent_id", 
        "survey_date",
        "email", 
        ["responses", "question_id"],
        ["responses", "question_text"]
    ],
)
df
choice number respondent_id survey_date email responses_question_id responses_question_text
0 Strongly Disagree 9 1W38Sn628N9 1985-09-30 ofrench@example.com 9lQ8OH810H2 How does mean action onto south usually prepar...
1 Strongly Disagree 4 1W38Sn628N9 1985-09-30 ofrench@example.com 8YZ3Ri208p1 Does million mean tax foot statement?
2 Disagree 2 1W38Sn628N9 1985-09-30 ofrench@example.com 2Ys2yn819r0 How does decade what air scientist defense all...
3 Neutral 6 1W38Sn628N9 1985-09-30 ofrench@example.com 3ii1tu719B3 Does focus statement peace forward do relate?
4 Strongly Disagree 5 1W38Sn628N9 1985-09-30 ofrench@example.com 9xv2JX456C7 Does explain ability plant?
5 Agree 3 8J14yp561A1 1970-07-08 halvarado@example.com 3sG5am762C6 Why does know writer ball bad whole?
6 Strongly Disagree 9 8J14yp561A1 1970-07-08 halvarado@example.com 5uM6LP013p7 Why does break chance boy enjoy call paper yet?
7 Neutral 7 8J14yp561A1 1970-07-08 halvarado@example.com 3NE9pu846z1 Does note spring newspaper and that thing?
8 Neutral 0 8J14yp561A1 1970-07-08 halvarado@example.com 6ge7Zt058d3 Why does message next eat the stay?
9 Agree 4 8J14yp561A1 1970-07-08 halvarado@example.com 2EB3vo658l2 How does no later then inside fill discover?

Finally, we’ll order the columns from the least nested level on the left all the way to the most nested on the right for easier readability.

column_order = ["respondent_id", "survey_date", "email", "responses_question_id", "responses_question_text", "choice", "number"]
df = df[column_order]
df
respondent_id survey_date email responses_question_id responses_question_text choice number
0 1W38Sn628N9 1985-09-30 ofrench@example.com 9lQ8OH810H2 How does mean action onto south usually prepar... Strongly Disagree 9
1 1W38Sn628N9 1985-09-30 ofrench@example.com 8YZ3Ri208p1 Does million mean tax foot statement? Strongly Disagree 4
2 1W38Sn628N9 1985-09-30 ofrench@example.com 2Ys2yn819r0 How does decade what air scientist defense all... Disagree 2
3 1W38Sn628N9 1985-09-30 ofrench@example.com 3ii1tu719B3 Does focus statement peace forward do relate? Neutral 6
4 1W38Sn628N9 1985-09-30 ofrench@example.com 9xv2JX456C7 Does explain ability plant? Strongly Disagree 5
5 8J14yp561A1 1970-07-08 halvarado@example.com 3sG5am762C6 Why does know writer ball bad whole? Agree 3
6 8J14yp561A1 1970-07-08 halvarado@example.com 5uM6LP013p7 Why does break chance boy enjoy call paper yet? Strongly Disagree 9
7 8J14yp561A1 1970-07-08 halvarado@example.com 3NE9pu846z1 Does note spring newspaper and that thing? Neutral 7
8 8J14yp561A1 1970-07-08 halvarado@example.com 6ge7Zt058d3 Why does message next eat the stay? Neutral 0
9 8J14yp561A1 1970-07-08 halvarado@example.com 2EB3vo658l2 How does no later then inside fill discover? Agree 4

Parsed and ready to import

At this point our data is in a simple tabular format and ready to import into a database table with something like pandas.to_sql() function, but we’ll save that for another post.