Python & JSON: Working with large datasets using Pandas

Working with large JSON datasets can be a pain, particularly when they are too large to fit into memory. In cases like this, a combination of command line tools and Python can make for an efficient way to explore and analyze the data. In this post, we’ll look at how to leverage tools like Pandas to explore and map out police activity in Montgomery County, Maryland. We’ll start with a look at the JSON data, then segue into exploration and analysis of the JSON with Python.

When data is stored in SQL databases, it tends to follow a rigid structure that looks like a table. Here’s an example from a SQLite database:

id|code|name|area|area_land|area_water|population|population_growth|birth_rate|death_rate|migration_rate|created_at|updated_at
1|af|Afghanistan|652230|652230|0|32564342|2.32|38.57|13.89|1.51|2015-11-01 13:19:49.461734|2015-11-01 13:19:49.461734
2|al|Albania|28748|27398|1350|3029278|0.3|12.92|6.58|3.3|2015-11-01 13:19:54.431082|2015-11-01 13:19:54.431082
3|ag|Algeria|2381741|2381741|0|39542166|1.84|23.67|4.31|0.92|2015-11-01 13:19:59.961286|2015-11-01 13:19:59.961286

As you can see, the data consists of rows and columns, where each column maps to a defined property, like id, or code. In the dataset above, each row represents a country, and each column represents some fact about that country.

But as the amount of data we capture increases, we often don’t know the exact structure of the data at the time we store it. This is called unstructured data. A good example is a list of events from visitors on a website. Here’s an example of a list of events sent to a server:

  {'event_type': 'started-mission',
   'keen': {'created_at': '2015-06-12T23:09:03.966Z',
    'id': '557b668fd2eaaa2e7c5e916b',
    'timestamp': '2015-06-12T23:09:07.971Z'},
   'sequence': 1}
   
  {'event_type': 'started-screen',
   'keen': {'created_at': '2015-06-12T23:09:03.979Z',
    'id': '557b668f90e4bd26c10b6ed6',
    'timestamp': '2015-06-12T23:09:07.987Z'},
   'mission': 1,
   'sequence': 4,
   'type': 'code'}
   
  {'event_type': 'started-screen',
   'keen': {'created_at': '2015-06-12T23:09:22.517Z',
    'id': '557b66a246f9a7239038b1e0',
    'timestamp': '2015-06-12T23:09:24.246Z'},
   'mission': 1,
   'sequence': 3,
   'type': 'code'}

As you can see, three separate events are listed above. Each event has different fields, and some of the fields are nested within other fields. This type of data is very hard to store in a regular SQL database. Unstructured data like this is often stored in a format called JavaScript Object Notation (JSON). JSON is a way to encode data structures like lists and dictionaries as strings, in a way that ensures they are easily readable by machines. Even though JSON starts with the word JavaScript, it’s actually just a format, and can be read by any language.

Python has great JSON support through the built-in json library. We can convert lists and dictionaries to JSON strings, and convert JSON strings back into lists and dictionaries. JSON data looks much like a dictionary would in Python, with keys mapped to values.
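
For example, here’s a minimal sketch of round-tripping a small record with the json library (the event dictionary is made up for illustration):

import json

# Encode a Python dictionary to a JSON string
event = {'event_type': 'started-mission', 'sequence': 1}
encoded = json.dumps(event)
print(encoded)  # {"event_type": "started-mission", "sequence": 1}

# Decode the JSON string back into a dictionary
decoded = json.loads(encoded)
print(decoded['event_type'])  # started-mission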

In this post, we’ll explore a JSON file on the command line, then import it into Python and work with it using Pandas.

The dataset

We’ll be looking at a dataset that contains information on traffic violations in Montgomery County, Maryland. You can download the data here. The data contains information about where the violation happened, the type of car, demographics on the person receiving the violation, and some other interesting information. There are quite a few questions we could answer using this dataset, including:

  • What types of cars are most likely to be pulled over for speeding?
  • What times of day are police most active?
  • How common are “speed traps”? Or are tickets spread pretty evenly in terms of geography?
  • What are the most common things people are pulled over for?

Unfortunately, we don’t know the structure of the JSON file upfront, so we’ll need to do some exploration to figure it out. We’ll use Jupyter Notebook for this exploration.

Exploring the JSON data

Even though the JSON file is only 600MB, we’ll treat it like it’s much larger so we can explore how analyzing a JSON file that doesn’t fit into memory might work. The first thing we’ll do is take a look at the first few lines of the md_traffic.json file. A JSON file is just an ordinary text file, so we can use all the standard command line tools to interact with it:

In [1]:
%%bash

head md_traffic.json
{
  "meta" : {
    "view" : {
      "id" : "4mse-ku6q",
      "name" : "Traffic Violations",
      "averageRating" : 0,
      "category" : "Public Safety",
      "createdAt" : 1403103517,
      "description" : "This dataset contains traffic violation information from all electronic traffic violations issued in the County.  Any information that can be used to uniquely identify the vehicle, the vehicle owner or the officer issuing the violation will not be published.\r\n\r\nUpdate Frequency:  Daily",
      "displayType" : "table",

From this, we can tell that the JSON data is a dictionary, and is well formatted. meta is a top level key, and is indented two spaces. We can get all of the top level keys by using the grep command to print any lines that have two leading spaces:

In [2]:
%%bash

grep -E '^ {2}"' md_traffic.json
  "meta" : {
  "data" : [ [ 1889194, "92AD0076-5308-45D0-BDE3-6A3A55AD9A04", 1889194, 1455876689, "498050", 1455876689, "498050", null, "2016-02-18T00:00:00", "09:05:00", "MCP", "2nd district, Bethesda", "DRIVER USING HANDS TO USE HANDHELD TELEPHONE WHILEMOTOR VEHICLE IS IN MOTION", "355/TUCKERMAN LN", "-77.105925", "39.03223", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "MD", "02 - Automobile", "2010", "JEEP", "CRISWELL", "BLUE", "Citation", "21-1124.2(d2)", "Transportation Article", "No", "WHITE", "F", "GERMANTOWN", "MD", "MD", "A - Marked Patrol", [ "{\"address\":\"\",\"city\":\"\",\"state\":\"\",\"zip\":\"\"}", "-77.105925", "39.03223", null, false ] ]

This shows us that meta and data are top level keys in the md_traffic.json data. A list of lists appears to be associated with data, and this likely contains each record in our traffic violations dataset. Each inner list is a record, and the first record appears in the output from the grep command. This is very similar to the kind of structured data we’re used to working with when we operate on CSV files or SQL tables. Here’s a truncated view of how the data might look:

[
    [1889194, "92AD0076-5308-45D0-BDE3-6A3A55AD9A04", 1889194, 1455876689, "498050"],
    [1889194, "92AD0076-5308-45D0-BDE3-6A3A55AD9A04", 1889194, 1455876689, "498050"],
    ...
]

This looks a lot like the rows and columns that we’re used to working with. We’re just missing the headers that tell us what each column means. We may be able to find this information under the meta key.

meta usually refers to information about the data itself. Let’s dig a little more into meta and see what information is contained there. From the head command, we know that there are at least 3 levels of keys, with meta containing a key view, which contains the keys id, name, averageRating and others. We can print out the full key structure of the JSON file by using grep to print out any lines with 2-6 leading spaces:

In [6]:
%%bash

grep -E '^ {2,6}"' md_traffic.json
  "meta" : {
    "view" : {
      "id" : "4mse-ku6q",
      "name" : "Traffic Violations",
      "averageRating" : 0,
      "category" : "Public Safety",
      "createdAt" : 1403103517,
      "description" : "This dataset contains traffic violation information from all electronic traffic violations issued in the County.  Any information that can be used to uniquely identify the vehicle, the vehicle owner or the officer issuing the violation will not be published.\r\n\r\nUpdate Frequency:  Daily",
      "displayType" : "table",
      "downloadCount" : 2851,
      "iconUrl" : "fileId:r41tDc239M1FL75LFwXFKzFCWqr8mzMeMTYXiA24USM",
      "indexUpdatedAt" : 1455885508,
      "newBackend" : false,
      "numberOfComments" : 0,
      "oid" : 8890705,
      "publicationAppendEnabled" : false,
      "publicationDate" : 1411040702,
      "publicationGroup" : 1620779,
      "publicationStage" : "published",
      "rowClass" : "",
      "rowsUpdatedAt" : 1455876727,
      "rowsUpdatedBy" : "ajn4-zy65",
      "state" : "normal",
      "tableId" : 1722160,
      "totalTimesRated" : 0,
      "viewCount" : 6910,
      "viewLastModified" : 1434558076,
      "viewType" : "tabular",
      "columns" : [ {
      "disabledFeatureFlags" : [ "allow_comments" ],
      "grants" : [ {
      "metadata" : {
      "owner" : {
      "query" : {
      "rights" : [ "read" ],
      "sortBys" : [ {
      "tableAuthor" : {
      "tags" : [ "traffic", "stop", "violations", "electronic issued." ],
      "flags" : [ "default" ]
  "data" : [ [ 1889194, "92AD0076-5308-45D0-BDE3-6A3A55AD9A04", 1889194, 1455876689, "498050", 1455876689, "498050", null, "2016-02-18T00:00:00", "09:05:00", "MCP", "2nd district, Bethesda", "DRIVER USING HANDS TO USE HANDHELD TELEPHONE WHILEMOTOR VEHICLE IS IN MOTION", "355/TUCKERMAN LN", "-77.105925", "39.03223", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "MD", "02 - Automobile", "2010", "JEEP", "CRISWELL", "BLUE", "Citation", "21-1124.2(d2)", "Transportation Article", "No", "WHITE", "F", "GERMANTOWN", "MD", "MD", "A - Marked Patrol", [ "{\"address\":\"\",\"city\":\"\",\"state\":\"\",\"zip\":\"\"}", "-77.105925", "39.03223", null, false ] ]

This shows us the full key structure associated with md_traffic.json, and tells us which parts of the JSON file are relevant for us. In this case, the columns key looks interesting, as it potentially contains information on the columns in the list of lists in the data key.

Extracting information on the columns

Now that we know which key contains information on the columns, we need to read that information in. Because we’re assuming that the JSON file won’t fit in memory, we can’t just read it in directly using the json library. Instead, we’ll need to read it in iteratively, in a memory-efficient way.

We can accomplish this using the ijson package. ijson parses the JSON file iteratively instead of reading it all in at once. This is slower than reading the whole file in directly, but it enables us to work with large files that can’t fit in memory. To use ijson, we specify a file we want to extract data from, then we specify a key path to extract:

In [7]:
import ijson

filename = "md_traffic.json"
with open(filename, 'r') as f:
    objects = ijson.items(f, 'meta.view.columns.item')
    columns = list(objects)

In the above code, we open the md_traffic.json file, then we use the items function in ijson to extract a list from the file. We specify the path to the list using the meta.view.columns notation. Recall that meta is a top level key, which contains view inside, which in turn contains columns. We then specify meta.view.columns.item to indicate that we should extract each individual item in the meta.view.columns list. The items function returns a generator, so we use the list function to turn the generator into a Python list. We can print out the first item in the list:

In [36]:
print(columns[0])
{'renderTypeName': 'meta_data', 'name': 'sid', 'fieldName': ':sid', 'position': 0, 'id': -1, 'format': {}, 'dataTypeName': 'meta_data'}

From the above output, it looks like each item in columns is a dictionary that contains information about each column. In order to get our header, it looks like fieldName is the relevant key to extract. To get our column names, we just have to extract the fieldName key from each item in columns:

In [33]:
column_names = [col["fieldName"] for col in columns]
column_names
Out[33]:
[':sid',
 ':id',
 ':position',
 ':created_at',
 ':created_meta',
 ':updated_at',
 ':updated_meta',
 ':meta',
 'date_of_stop',
 'time_of_stop',
 'agency',
 'subagency',
 'description',
 'location',
 'latitude',
 'longitude',
 'accident',
 'belts',
 'personal_injury',
 'property_damage',
 'fatal',
 'commercial_license',
 'hazmat',
 'commercial_vehicle',
 'alcohol',
 'work_zone',
 'state',
 'vehicle_type',
 'year',
 'make',
 'model',
 'color',
 'violation_type',
 'charge',
 'article',
 'contributed_to_accident',
 'race',
 'gender',
 'driver_city',
 'driver_state',
 'dl_state',
 'arrest_type',
 'geolocation']

Great! Now that we have our column names, we can move on to extracting the data itself.

Extracting the data

You may recall that the data is locked away in a list of lists inside the data key. We’ll need to read this data into memory to manipulate it. Fortunately, we can use the column names we just extracted to grab only the columns that are relevant, which will save a ton of space. If the dataset were larger, you could iteratively process batches of rows: read in the first 10,000,000 rows, do some processing, then the next 10,000,000, and so on (a sketch of that batching approach follows the extraction code below). In this case, we can define the columns we care about, and again use ijson to iteratively process the JSON file:

In [9]:
good_columns = [
    "date_of_stop", 
    "time_of_stop", 
    "agency", 
    "subagency",
    "description",
    "location", 
    "latitude", 
    "longitude", 
    "vehicle_type", 
    "year", 
    "make", 
    "model", 
    "color", 
    "violation_type",
    "race", 
    "gender", 
    "driver_state", 
    "driver_city", 
    "dl_state",
    "arrest_type"
]

data = []
with open(filename, 'r') as f:
    objects = ijson.items(f, 'data.item')
    for row in objects:
        selected_row = []
        for item in good_columns:
            selected_row.append(row[column_names.index(item)])
        data.append(selected_row)
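
For reference, here’s a minimal sketch of the batching approach mentioned above, in case the rows themselves were too numerous to hold in memory. It reuses filename, column_names, and good_columns from the cells above; the process_batch function and batch_size value are hypothetical placeholders, not part of the original analysis:

def process_batch(batch):
    # Placeholder: run whatever per-chunk processing you need (aggregation, writing to disk, etc.)
    pass

batch_size = 100000  # hypothetical chunk size
batch = []
with open(filename, 'r') as f:
    for row in ijson.items(f, 'data.item'):
        batch.append([row[column_names.index(col)] for col in good_columns])
        if len(batch) >= batch_size:
            process_batch(batch)
            batch = []
    if batch:
        process_batch(batch)  # handle the final partial batch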

Now that we’ve read the data in, we can print out the first item in data:

In [10]:
data[0]
Out[10]:
['2016-02-18T00:00:00',
 '09:05:00',
 'MCP',
 '2nd district, Bethesda',
 'DRIVER USING HANDS TO USE HANDHELD TELEPHONE WHILEMOTOR VEHICLE IS IN MOTION',
 '355/TUCKERMAN LN',
 '-77.105925',
 '39.03223',
 '02 - Automobile',
 '2010',
 'JEEP',
 'CRISWELL',
 'BLUE',
 'Citation',
 'WHITE',
 'F',
 'MD',
 'GERMANTOWN',
 'MD',
 'A - Marked Patrol']

Reading the data into Pandas

Now that we have the data as a list of lists, and the column headers as a list, we can create a Pandas DataFrame to analyze the data. If you’re unfamiliar with Pandas, it’s a data analysis library that uses an efficient, tabular data structure called a DataFrame to represent your data. Pandas allows you to convert a list of lists into a DataFrame and specify the column names separately.

In [11]:
import pandas as pd

stops = pd.DataFrame(data, columns=good_columns)

Now that we have our data in a Dataframe, we can do some interesting analysis. Here’s a table of how many stops are made by car color:

In [12]:
stops["color"].value_counts()
Out[12]:
BLACK          161319
SILVER         150650
WHITE          122887
GRAY            86322
RED             66282
BLUE            61867
GREEN           35802
GOLD            27808
TAN             18869
BLUE, DARK      17397
MAROON          15134
BLUE, LIGHT     11600
BEIGE           10522
GREEN, DK       10354
N/A              9594
GREEN, LGT       5214
BROWN            4041
YELLOW           3215
ORANGE           2967
BRONZE           1977
PURPLE           1725
MULTICOLOR        723
CREAM             608
COPPER            269
PINK              137
CHROME             21
CAMOUFLAGE         17
dtype: int64

Black, silver, and white are by far the most common car colors among stops, while camouflage makes a rare appearance at the bottom of the list. Here’s a table of what kind of police unit created the citation:

In [13]:
stops["arrest_type"].value_counts()
Out[13]:
A - Marked Patrol                         671165
Q - Marked Laser                           87929
B - Unmarked Patrol                        25478
S - License Plate Recognition              11452
O - Foot Patrol                             9604
L - Motorcycle                              8559
E - Marked Stationary Radar                 4854
R - Unmarked Laser                          4331
G - Marked Moving Radar (Stationary)        2164
M - Marked (Off-Duty)                       1301
I - Marked Moving Radar (Moving)             842
F - Unmarked Stationary Radar                420
C - Marked VASCAR                            323
D - Unmarked VASCAR                          210
P - Mounted Patrol                           171
N - Unmarked (Off-Duty)                      116
H - Unmarked Moving Radar (Stationary)        72
K - Aircraft Assist                           41
J - Unmarked Moving Radar (Moving)            35
dtype: int64

With the rise of red light cameras and speed lasers, it’s interesting that patrol cars are still by far the dominant source of citations.

Converting columns

We’re now almost ready to do some time and location based analysis, but first we need to convert the longitude and latitude columns from strings to floats, and combine the date and time columns into a single datetime. We can use the below code to convert latitude and longitude:

In [40]:
import numpy as np

def parse_float(x):
    try:
        x = float(x)
    except Exception:
        x = 0
    return x
stops["longitude"] = stops["longitude"].apply(parse_float)
stops["latitude"] = stops["latitude"].apply(parse_float)

Oddly enough, the time of day and the date of the stop are stored in two separate columns, time_of_stop and date_of_stop. We’ll parse both, and turn them into a single datetime column:

In [42]:
import datetime
def parse_full_date(row):
    date = datetime.datetime.strptime(row["date_of_stop"], "%Y-%m-%dT%H:%M:%S")
    time = row["time_of_stop"].split(":")
    date = date.replace(hour=int(time[0]), minute = int(time[1]), second = int(time[2]))
    return date

stops["date"] = stops.apply(parse_full_date, axis=1)

We can now make a plot of which days result in the most traffic stops:

In [50]:
import matplotlib.pyplot as plt
%matplotlib inline 

plt.hist(stops["date"].dt.weekday, bins=6)
Out[50]:
(array([ 112816.,  142048.,  133363.,  127629.,  131735.,  181476.]),
 array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.]),
 <a list of 6 Patch objects>)

In this plot, Monday is 0 and Sunday is 6. One caveat: we passed bins=6 for seven weekday values, so the last (tallest) bar combines Saturday and Sunday; split across two days, the weekend may actually have fewer stops per day than the weekdays, while Monday has the fewest of the weekdays. A per-day count (see the sketch below) gives an unambiguous picture. Any surprising pattern could also be a data quality issue where invalid dates landed on a particular day, and you’ll have to dig more deeply into the date_of_stop column to figure it out definitively (this is beyond the scope of this post).
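
A minimal sketch of that per-day count, using the date column created above:

# Count stops per weekday (0 = Monday ... 6 = Sunday), one value per day, no histogram binning
stops["date"].dt.weekday.value_counts().sort_index()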

We can also plot out the most common traffic stop times:

In [51]:
plt.hist(stops["date"].dt.hour, bins=24)
Out[51]:
(array([ 44125.,  35866.,  27274.,  18048.,  11271.,   7796.,  11959.,
         29853.,  46306.,  42799.,  43775.,  37101.,  34424.,  34493.,
         36006.,  29634.,  39024.,  40988.,  32511.,  28715.,  31983.,
         43957.,  60734.,  60425.]),
 array([  0.        ,   0.95833333,   1.91666667,   2.875     ,
          3.83333333,   4.79166667,   5.75      ,   6.70833333,
          7.66666667,   8.625     ,   9.58333333,  10.54166667,
         11.5       ,  12.45833333,  13.41666667,  14.375     ,
         15.33333333,  16.29166667,  17.25      ,  18.20833333,
         19.16666667,  20.125     ,  21.08333333,  22.04166667,  23.        ]),
 <a list of 24 Patch objects>)

It looks like the most stops happen around midnight, and the fewest happen around 5am. This might make sense, as people are driving home from bars and dinners late at night, and may be impaired. This may also be a data quality issue, and poking through the time_of_stop column will be necessary to get a full answer.
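
A quick way to start that poking is to look at the most common raw time values (a sketch; a large spike at exactly '00:00:00' would hint that missing times default to midnight):

# Inspect the most frequent raw time strings for suspicious defaults
stops["time_of_stop"].value_counts().head(10)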

Subsetting the stops

Now that we’ve converted the location and date columns, we can map out the traffic stops. Because mapping is very intensive in terms of CPU resources and memory, we’ll need to filter down the rows we use from stops first:

In [43]:
last_year = stops[stops["date"] > datetime.datetime(year=2015, month=2, day=18)]

In the above code, we selected all of the rows from the most recent year of data. We can further narrow this down, and only select rows that occurred during rush hour – the morning period when everyone is going to work:

In [44]:
morning_rush = last_year[(last_year["date"].dt.weekday < 5) & (last_year["date"].dt.hour > 5) & (last_year["date"].dt.hour < 10)]
print(morning_rush.shape)
last_year.shape
(29824, 21)
Out[44]:
(234582, 21)

Using the excellent folium package, we can now visualize where all the stops occurred. Folium allows you to easily create interactive maps in Python by leveraging Leaflet. In order to preserve performance, we’ll only visualize the first 1000 rows of morning_rush:

In [45]:
import folium
from folium import plugins

stops_map = folium.Map(location=[39.0836, -77.1483], zoom_start=11)
marker_cluster = plugins.MarkerCluster().add_to(stops_map)
for name, row in morning_rush.iloc[:1000].iterrows():
    # folium expects coordinates as [latitude, longitude]
    folium.Marker([row["latitude"], row["longitude"]], popup=row["description"]).add_to(marker_cluster)
stops_map.save('stops.html')  # older folium releases used create_map instead of save
stops_map
Out[45]:
(The output is the interactive Leaflet map, with the morning rush-hour stops clustered by location.)