Normalize JSON with Pandas

Posted on Tue 12 January 2021 in Tutorial

When processing nested JSON data into a flat structure for importing into a relational database, it can be tricky to structure the data into the right shape. Pandas has a great tool for doing this called pandas.json_normalize() but the documentation doesn't make it obvious how to leverage its capabilities for handling nested data structures.

I thought I could provide a brief example using some randomly generated survey response data (using the Faker library) to illustrate it's advantages.

Setup

To start, I'm going to be using pandas and Faker so we'll import those. I'm also going to need to easily display the parsed json as well as the returned dataframes, so I'm importing the json module from the standard lib as well as some IPython notebook helpers for displaying dataframes as HTML tables.

In [63]:
import json
from IPython.display import display, HTML
from faker import Faker
from faker.providers import BaseProvider, date_time, internet
import pandas as pd

To make it easier to call the IPython display helper, we'll include this simple function as syntactic sugar.

In [64]:
def print_df(df):
    display(HTML(df.to_html()))

Fake Data

Faker doesn't have a built in provider for survey questions, so let's go ahead and add a simple one that creates non-sensical questions with a simple hack to the sentence provider.

In [65]:
class MyProvider(BaseProvider):
    def question(self):
        stems = ('Does', 'How does', 'Which', 'Why does')
        stem = Faker().random_choices(elements=stems, length=1)[0]
        sentence = Faker().sentence()
        sentence = sentence[0].lower() + sentence[1:]
        question = sentence.replace(".", "?")
        question = f"{stem} {question}"
        return question

Using this new question provider, we'll construct a few records of fake survey response data with some respondent level data like a respondent_id, survey_date, and respondent email. Within that we'll nest a list of responses which will in turn have it's own dictionary of data at the question level: id, question text, and choices. The choices list will be singular here, but assume it has that structure because the API this comes from has to also account for multi-select options and we'll need to parse it as a list regardless. Depending on our analysis needs, this might also be a place where we'd want to keep these in a comma separated string, but for our purposes here we'll ignore that use case.

In [66]:
fake = Faker()
fake.add_provider(MyProvider)
choices = ('Strongly Agree', 'Agree', 'Neutral', 'Disagree', 'Strongly Disagree')


sample_data = [
    {
        "respondent_id": fake.bothify(text="#?##??###?#"),
        "survey_date": fake.date(),
        "email": fake.email(domain="example.com"),
        "responses": [
            {
                "question_id": fake.bothify(text="#??#??###?#"),
                "question_text": fake.question(),
                "choices": [
                    {
                        "choice": fake.random_choices(elements=choices, length=1)[0],
                        "number": fake.random_digit(),
                    }
                ]
            } for _ in range(5)
        ]
    } for _ in range(2) 
]

Let's print out a single record to see the resulting data structure that has been generated randomly.

In [67]:
print(json.dumps(sample_data[0], indent=2))
{
  "respondent_id": "9U69hZ538l1",
  "survey_date": "1994-03-20",
  "email": "lindsey24@example.com",
  "responses": [
    {
      "question_id": "7YB4ZO864I7",
      "question_text": "How does speak beat ever tree?",
      "choices": [
        {
          "choice": "Strongly Agree",
          "number": 4
        }
      ]
    },
    {
      "question_id": "8uD0WY155Q9",
      "question_text": "How does write popular Mr know develop?",
      "choices": [
        {
          "choice": "Agree",
          "number": 9
        }
      ]
    },
    {
      "question_id": "8bN0OP829A8",
      "question_text": "Why does democratic represent far my pressure series third?",
      "choices": [
        {
          "choice": "Strongly Agree",
          "number": 9
        }
      ]
    },
    {
      "question_id": "2Rt9mk140E8",
      "question_text": "How does understand network old?",
      "choices": [
        {
          "choice": "Agree",
          "number": 2
        }
      ]
    },
    {
      "question_id": "8tf2SU934m8",
      "question_text": "Why does western next put?",
      "choices": [
        {
          "choice": "Neutral",
          "number": 8
        }
      ]
    }
  ]
}

Reading with Pandas

As you can see below, simply reading this directly into a dataframe only parses the top level respondent data, but then keeps the responses data as a json array. Which isn't great for simple analysis. Could you load that "as is" into a jsonb field in PostgreSQL? Sure. If you like parsing json with SQL. Yuck!

In [68]:
df = pd.DataFrame(sample_data)
print_df(df)
respondent_id survey_date email responses
0 9U69hZ538l1 1994-03-20 lindsey24@example.com [{'question_id': '7YB4ZO864I7', 'question_text': 'How does speak beat ever tree?', 'choices': [{'choice': 'Strongly Agree', 'number': 4}]}, {'question_id': '8uD0WY155Q9', 'question_text': 'How does write popular Mr know develop?', 'choices': [{'choice': 'Agree', 'number': 9}]}, {'question_id': '8bN0OP829A8', 'question_text': 'Why does democratic represent far my pressure series third?', 'choices': [{'choice': 'Strongly Agree', 'number': 9}]}, {'question_id': '2Rt9mk140E8', 'question_text': 'How does understand network old?', 'choices': [{'choice': 'Agree', 'number': 2}]}, {'question_id': '8tf2SU934m8', 'question_text': 'Why does western next put?', 'choices': [{'choice': 'Neutral', 'number': 8}]}]
1 3A45gW246W1 1991-07-13 robert21@example.com [{'question_id': '2CP2AW568v2', 'question_text': 'Why does whatever play office move alone class part wish?', 'choices': [{'choice': 'Neutral', 'number': 0}]}, {'question_id': '2YV7SR528f0', 'question_text': 'Why does which phone return much door?', 'choices': [{'choice': 'Disagree', 'number': 7}]}, {'question_id': '3PJ8eR903v4', 'question_text': 'Why does despite be class various great stay Republican person?', 'choices': [{'choice': 'Neutral', 'number': 4}]}, {'question_id': '2GV2kH553o3', 'question_text': 'Does become create store store?', 'choices': [{'choice': 'Disagree', 'number': 8}]}, {'question_id': '9aX3NJ910f8', 'question_text': 'Which team life although rate science I pattern?', 'choices': [{'choice': 'Strongly Disagree', 'number': 7}]}]

A messy custom parser

We could attempt to reshape this by writing some custom functions to handle extracting the responses and merging that data with the top-level meta data about the respondent with some dictionary unpacking, but this gets messy and would fall apart quickly as the structure changed. While this approach works, it's not ideal.

Sure this code could be further refactored to simplify the logic, but it's not worth it since the pandas.json_normalize() can do this for us easily.

In [69]:
def record_format(responses):
    data = []
    for response in responses:
        record = {
            "question_id": response.get("question_id"),
            "question_text": response.get("question_text"),
            "choice": response.get("choices")[0].get("choice"),
            "number": response.get("choices")[0].get("number"),
        }
        data.append(record)
    return data

def parse_json(records):
    data = []
    for record in sample_data:
        meta = {
            "respondent_id": record.get("respondent_id"),
            "survey_date": record.get("survey_date"),
            "email": record.get("email"),
        }
        responses = record.get("responses")
        formatted_responses = record_format(responses)
        for response in formatted_responses:
            combined = {**meta, **response}
            data.append(combined)
    return data

clean_data = parse_json(sample_data)
df = pd.DataFrame(clean_data)
print_df(df)
respondent_id survey_date email question_id question_text choice number
0 9U69hZ538l1 1994-03-20 lindsey24@example.com 7YB4ZO864I7 How does speak beat ever tree? Strongly Agree 4
1 9U69hZ538l1 1994-03-20 lindsey24@example.com 8uD0WY155Q9 How does write popular Mr know develop? Agree 9
2 9U69hZ538l1 1994-03-20 lindsey24@example.com 8bN0OP829A8 Why does democratic represent far my pressure series third? Strongly Agree 9
3 9U69hZ538l1 1994-03-20 lindsey24@example.com 2Rt9mk140E8 How does understand network old? Agree 2
4 9U69hZ538l1 1994-03-20 lindsey24@example.com 8tf2SU934m8 Why does western next put? Neutral 8
5 3A45gW246W1 1991-07-13 robert21@example.com 2CP2AW568v2 Why does whatever play office move alone class part wish? Neutral 0
6 3A45gW246W1 1991-07-13 robert21@example.com 2YV7SR528f0 Why does which phone return much door? Disagree 7
7 3A45gW246W1 1991-07-13 robert21@example.com 3PJ8eR903v4 Why does despite be class various great stay Republican person? Neutral 4
8 3A45gW246W1 1991-07-13 robert21@example.com 2GV2kH553o3 Does become create store store? Disagree 8
9 3A45gW246W1 1991-07-13 robert21@example.com 9aX3NJ910f8 Which team life although rate science I pattern? Strongly Disagree 7

JSON Normalize

Thankfully there is the json_normalize() function, but it requires a little understanding to get it to satisfactorily parse flat. Simply passing it the sample data without any parameters results in a very familiar result that gets us no further than we started in the first attempt.

In [70]:
df = pd.json_normalize(sample_data)
print_df(df)
respondent_id survey_date email responses
0 9U69hZ538l1 1994-03-20 lindsey24@example.com [{'question_id': '7YB4ZO864I7', 'question_text': 'How does speak beat ever tree?', 'choices': [{'choice': 'Strongly Agree', 'number': 4}]}, {'question_id': '8uD0WY155Q9', 'question_text': 'How does write popular Mr know develop?', 'choices': [{'choice': 'Agree', 'number': 9}]}, {'question_id': '8bN0OP829A8', 'question_text': 'Why does democratic represent far my pressure series third?', 'choices': [{'choice': 'Strongly Agree', 'number': 9}]}, {'question_id': '2Rt9mk140E8', 'question_text': 'How does understand network old?', 'choices': [{'choice': 'Agree', 'number': 2}]}, {'question_id': '8tf2SU934m8', 'question_text': 'Why does western next put?', 'choices': [{'choice': 'Neutral', 'number': 8}]}]
1 3A45gW246W1 1991-07-13 robert21@example.com [{'question_id': '2CP2AW568v2', 'question_text': 'Why does whatever play office move alone class part wish?', 'choices': [{'choice': 'Neutral', 'number': 0}]}, {'question_id': '2YV7SR528f0', 'question_text': 'Why does which phone return much door?', 'choices': [{'choice': 'Disagree', 'number': 7}]}, {'question_id': '3PJ8eR903v4', 'question_text': 'Why does despite be class various great stay Republican person?', 'choices': [{'choice': 'Neutral', 'number': 4}]}, {'question_id': '2GV2kH553o3', 'question_text': 'Does become create store store?', 'choices': [{'choice': 'Disagree', 'number': 8}]}, {'question_id': '9aX3NJ910f8', 'question_text': 'Which team life although rate science I pattern?', 'choices': [{'choice': 'Strongly Disagree', 'number': 7}]}]

A few optional parameters can be used here to parse the first nested array called responses. We can direct the pandas json parser to a specific key as the source of records. The record_path parameter takes either a string or list of strings to construct that path. The name of this parameter is a hint about how to think of this when passed as a list as we'll see later.

In [71]:
df = pd.json_normalize(
    sample_data, 
    record_path="responses", 
)
print_df(df)
question_id question_text choices
0 7YB4ZO864I7 How does speak beat ever tree? [{'choice': 'Strongly Agree', 'number': 4}]
1 8uD0WY155Q9 How does write popular Mr know develop? [{'choice': 'Agree', 'number': 9}]
2 8bN0OP829A8 Why does democratic represent far my pressure series third? [{'choice': 'Strongly Agree', 'number': 9}]
3 2Rt9mk140E8 How does understand network old? [{'choice': 'Agree', 'number': 2}]
4 8tf2SU934m8 Why does western next put? [{'choice': 'Neutral', 'number': 8}]
5 2CP2AW568v2 Why does whatever play office move alone class part wish? [{'choice': 'Neutral', 'number': 0}]
6 2YV7SR528f0 Why does which phone return much door? [{'choice': 'Disagree', 'number': 7}]
7 3PJ8eR903v4 Why does despite be class various great stay Republican person? [{'choice': 'Neutral', 'number': 4}]
8 2GV2kH553o3 Does become create store store? [{'choice': 'Disagree', 'number': 8}]
9 9aX3NJ910f8 Which team life although rate science I pattern? [{'choice': 'Strongly Disagree', 'number': 7}]

But when we direct the parser to just unpack the reponses array, we lose our data from the level above. Pandas can be instructed to keep this by giving it a list of metadata to repeat for each record it unpacks from the level above. We use the meta parameter and pass it a list of the fields to include.

In [72]:
df = pd.json_normalize(
    sample_data, 
    record_path="responses", 
     meta=[
        "respondent_id", 
        "survey_date",
        "email", 
    ],
)
print_df(df)
   
question_id question_text choices respondent_id survey_date email
0 7YB4ZO864I7 How does speak beat ever tree? [{'choice': 'Strongly Agree', 'number': 4}] 9U69hZ538l1 1994-03-20 lindsey24@example.com
1 8uD0WY155Q9 How does write popular Mr know develop? [{'choice': 'Agree', 'number': 9}] 9U69hZ538l1 1994-03-20 lindsey24@example.com
2 8bN0OP829A8 Why does democratic represent far my pressure series third? [{'choice': 'Strongly Agree', 'number': 9}] 9U69hZ538l1 1994-03-20 lindsey24@example.com
3 2Rt9mk140E8 How does understand network old? [{'choice': 'Agree', 'number': 2}] 9U69hZ538l1 1994-03-20 lindsey24@example.com
4 8tf2SU934m8 Why does western next put? [{'choice': 'Neutral', 'number': 8}] 9U69hZ538l1 1994-03-20 lindsey24@example.com
5 2CP2AW568v2 Why does whatever play office move alone class part wish? [{'choice': 'Neutral', 'number': 0}] 3A45gW246W1 1991-07-13 robert21@example.com
6 2YV7SR528f0 Why does which phone return much door? [{'choice': 'Disagree', 'number': 7}] 3A45gW246W1 1991-07-13 robert21@example.com
7 3PJ8eR903v4 Why does despite be class various great stay Republican person? [{'choice': 'Neutral', 'number': 4}] 3A45gW246W1 1991-07-13 robert21@example.com
8 2GV2kH553o3 Does become create store store? [{'choice': 'Disagree', 'number': 8}] 3A45gW246W1 1991-07-13 robert21@example.com
9 9aX3NJ910f8 Which team life although rate science I pattern? [{'choice': 'Strongly Disagree', 'number': 7}] 3A45gW246W1 1991-07-13 robert21@example.com

We must go deeper!

That works for the most part, but we still have that annoying choices json array that would be nice to split out into columns.

df5 = pd.json_normalize(
    sample_data, 
    record_path="choices", 
    meta=[
        "respondent_id", 
        "survey_date",
        "email", 
    ],
)

Simply passing the choices field to the record_path param results in a KeyError though. This is because the choices field is actually nested in the responses field. So pandas need us to construct a path to reach it. We can get to it by passing each key as a record in the list to construct a path. Here that looks like ["responses", "choices"].

In [73]:
df = pd.json_normalize(
    sample_data, 
    record_path=["responses", "choices"], 
    meta=[
        "respondent_id", 
        "survey_date",
        "email", 
    ],
)
print_df(df)
choice number respondent_id survey_date email
0 Strongly Agree 4 9U69hZ538l1 1994-03-20 lindsey24@example.com
1 Agree 9 9U69hZ538l1 1994-03-20 lindsey24@example.com
2 Strongly Agree 9 9U69hZ538l1 1994-03-20 lindsey24@example.com
3 Agree 2 9U69hZ538l1 1994-03-20 lindsey24@example.com
4 Neutral 8 9U69hZ538l1 1994-03-20 lindsey24@example.com
5 Neutral 0 3A45gW246W1 1991-07-13 robert21@example.com
6 Disagree 7 3A45gW246W1 1991-07-13 robert21@example.com
7 Neutral 4 3A45gW246W1 1991-07-13 robert21@example.com
8 Disagree 8 3A45gW246W1 1991-07-13 robert21@example.com
9 Strongly Disagree 7 3A45gW246W1 1991-07-13 robert21@example.com

But when we do that, we lose our question_id and question_text fields. That's because we need to add them in the meta list and pass their paths like the record path param. See below.

In [74]:
df = pd.json_normalize(
    sample_data, 
    record_path=["responses", "choices"], 
    meta=[
        "respondent_id", 
        "survey_date",
        "email", 
        ["responses", "question_id"],
        ["responses", "question_text"],
    ],
)
print_df(df)
choice number respondent_id survey_date email responses.question_id responses.question_text
0 Strongly Agree 4 9U69hZ538l1 1994-03-20 lindsey24@example.com 7YB4ZO864I7 How does speak beat ever tree?
1 Agree 9 9U69hZ538l1 1994-03-20 lindsey24@example.com 8uD0WY155Q9 How does write popular Mr know develop?
2 Strongly Agree 9 9U69hZ538l1 1994-03-20 lindsey24@example.com 8bN0OP829A8 Why does democratic represent far my pressure series third?
3 Agree 2 9U69hZ538l1 1994-03-20 lindsey24@example.com 2Rt9mk140E8 How does understand network old?
4 Neutral 8 9U69hZ538l1 1994-03-20 lindsey24@example.com 8tf2SU934m8 Why does western next put?
5 Neutral 0 3A45gW246W1 1991-07-13 robert21@example.com 2CP2AW568v2 Why does whatever play office move alone class part wish?
6 Disagree 7 3A45gW246W1 1991-07-13 robert21@example.com 2YV7SR528f0 Why does which phone return much door?
7 Neutral 4 3A45gW246W1 1991-07-13 robert21@example.com 3PJ8eR903v4 Why does despite be class various great stay Republican person?
8 Disagree 8 3A45gW246W1 1991-07-13 robert21@example.com 2GV2kH553o3 Does become create store store?
9 Strongly Disagree 7 3A45gW246W1 1991-07-13 robert21@example.com 9aX3NJ910f8 Which team life although rate science I pattern?

One last tweak: some databases (like MS SQL) don't like naming columns with that period in the name. As a work around you can give the json_normalize function a custom separator such as an underscore instead.

In [75]:
df = pd.json_normalize(
    sample_data, 
    sep="_",
    record_path=["responses", "choices"], 
    meta=[
        "respondent_id", 
        "survey_date",
        "email", 
        ["responses", "question_id"],
        ["responses", "question_text"]
    ],
)
print_df(df)
choice number respondent_id survey_date email responses_question_id responses_question_text
0 Strongly Agree 4 9U69hZ538l1 1994-03-20 lindsey24@example.com 7YB4ZO864I7 How does speak beat ever tree?
1 Agree 9 9U69hZ538l1 1994-03-20 lindsey24@example.com 8uD0WY155Q9 How does write popular Mr know develop?
2 Strongly Agree 9 9U69hZ538l1 1994-03-20 lindsey24@example.com 8bN0OP829A8 Why does democratic represent far my pressure series third?
3 Agree 2 9U69hZ538l1 1994-03-20 lindsey24@example.com 2Rt9mk140E8 How does understand network old?
4 Neutral 8 9U69hZ538l1 1994-03-20 lindsey24@example.com 8tf2SU934m8 Why does western next put?
5 Neutral 0 3A45gW246W1 1991-07-13 robert21@example.com 2CP2AW568v2 Why does whatever play office move alone class part wish?
6 Disagree 7 3A45gW246W1 1991-07-13 robert21@example.com 2YV7SR528f0 Why does which phone return much door?
7 Neutral 4 3A45gW246W1 1991-07-13 robert21@example.com 3PJ8eR903v4 Why does despite be class various great stay Republican person?
8 Disagree 8 3A45gW246W1 1991-07-13 robert21@example.com 2GV2kH553o3 Does become create store store?
9 Strongly Disagree 7 3A45gW246W1 1991-07-13 robert21@example.com 9aX3NJ910f8 Which team life although rate science I pattern?

Finally, we'll order the columns from the least nested level on the left all the way to the most nested on the right for easier readability.

In [76]:
column_order = ["respondent_id", "survey_date", "email", "responses_question_id", "responses_question_text", "choice", "number"]
df = df[column_order]
print_df(df)
respondent_id survey_date email responses_question_id responses_question_text choice number
0 9U69hZ538l1 1994-03-20 lindsey24@example.com 7YB4ZO864I7 How does speak beat ever tree? Strongly Agree 4
1 9U69hZ538l1 1994-03-20 lindsey24@example.com 8uD0WY155Q9 How does write popular Mr know develop? Agree 9
2 9U69hZ538l1 1994-03-20 lindsey24@example.com 8bN0OP829A8 Why does democratic represent far my pressure series third? Strongly Agree 9
3 9U69hZ538l1 1994-03-20 lindsey24@example.com 2Rt9mk140E8 How does understand network old? Agree 2
4 9U69hZ538l1 1994-03-20 lindsey24@example.com 8tf2SU934m8 Why does western next put? Neutral 8
5 3A45gW246W1 1991-07-13 robert21@example.com 2CP2AW568v2 Why does whatever play office move alone class part wish? Neutral 0
6 3A45gW246W1 1991-07-13 robert21@example.com 2YV7SR528f0 Why does which phone return much door? Disagree 7
7 3A45gW246W1 1991-07-13 robert21@example.com 3PJ8eR903v4 Why does despite be class various great stay Republican person? Neutral 4
8 3A45gW246W1 1991-07-13 robert21@example.com 2GV2kH553o3 Does become create store store? Disagree 8
9 3A45gW246W1 1991-07-13 robert21@example.com 9aX3NJ910f8 Which team life although rate science I pattern? Strongly Disagree 7

Parsed and ready to import

At this point our data is in a simple tabular format and ready to import into a database table with something like pandas.to_sql() function, but we'll save that for another post.