The data science life cycle is generally comprised of the following components:
- data retrieval
- data cleaning
- data exploration and visualization
- statistical or predictive modeling
While these components are helpful for understanding the different phases, they don't help us think about our programming workflow.
Often, the entire data science life cycle ends up as an arbitrary mess of notebook cells in either a Jupyter Notebook or a single messy script. In addition, most data science problems require us to switch between data retrieval, data cleaning, data exploration, data visualization, and statistical / predictive modeling.
But there's a better way! In this post, I'll go over the two mindsets most people switch between when doing programming work specifically for data science: the prototype mindset and the production mindset.
|Prototype mindset prioritizes:||Production mindset prioritizes:|
|iteration speed on small pieces of code||iteration speed on the full pipeline|
|less abstraction (directly modifying code and data objects)||more abstraction (modifying parameter values instead)|
|less structure to the code (less modularity)||more structure to the code (more modularity)|
|helping you and others understand the code and data||helping a computer run code automatically|
I personally use JupyterLab for the entire process (both prototyping and productionizing). I recommend using JupyterLab at least for prototyping.
Lending Club data
To help more concretely understand the difference between the prototyping and the production mindset, let's work with some real data. We'll work with lending data from the peer-to-peer lending site, Lending Club. Unlike a bank, Lending Club doesn't lend money itself. Lending Club is instead a marketplace for lenders to lend money to individuals who are seeking loans for a variety of reasons (home repairs, wedding costs, etc.). We can use this data to build models that will predict if a given loan application will be successful or not. We won't dive into building a machine learning pipeline for making predictions in this post, but we cover it in our Machine Learning Project Walkthrough Course.
Lending Club offers detailed, historical data on both completed loans (loan applications that were approved by Lending Club and they found lenders) and declined loans (loan application was declined by Lending Club and money never changed hands). Navigate to their data download page and select 2007-2011 under DOWLNOAD LOAN DATA.
In the prototype mindset, we're interested in quickly iterating and trying to understand some properties and truths about the data. Create a new Jupyter notebook and add a Markdown cell that explains:
- Any research you did on Lending Club to better understand the platform
- Any information on the data set you downloaded
First things first, let's read the CSV file into pandas.
import pandas as pd loans_2007 = pd.read_csv('LoanStats3a.csv') loans_2007.head(2)
We get two pieces of output, first a warning.
/home/srinify/anaconda3/envs/dq2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (0,1,2,3,4,7,13,18,24,25,27,28,29,30,31,32,34,36,37,38,39,40,41,42,43,44,46,47,49,50,51,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,123,124,125,126,127,128,129,130,131,132,133,134,135,136,142,143,144) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
Then the first 5 rows of the dataframe, which we'll avoid showing here (because it's quite long).
We also got the following dataframe output:
Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
The warning lets us know that the type inferencing of pandas for each column would be improved if we set the
low_memory parameter to
False when calling
The second output is more problematic because the way the DataFrame is storing the data has issues. JupyterLab has a terminal environment built in, so we can open it and use the bash command
head to observe the first two lines of the raw file:
head -2 LoanStats3a.csv
While the second line contains the column names as we expect in a CSV file, it looks like the first line is throwing off the formatting of the DataFrame when pandas tries to parse the file:
Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
Add a Markdown cell that details your observations and add a code cell that factors in the observations.
import pandas as pd loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1, low_memory=False)
Read the data dictionary from the Lending Club download page to understand which columns don't contain useful information for features. The
url columns seem to immediately fit this criteria.
loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)
The next step is to drop any columns with more than 50% missing rows. Use one cell to explore which columns meet that criteria, and another to actually drop the columns.
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
Because we're using a Jupyter notebook to track our thoughts and our code, we're relying on the environment (via the IPython kernel) to keep track of changes of the state. This frees us up to be freeform, move cells around, run the same code multiple times, etc.
In general, code in the prototyping mindset should focus on:
- Markdown cells to describe our observations and assumptions
- Small pieces of code for the actual logic
- Lots of visualizations and counts
- Minimal abstractions
- Most code shouldn't be in functions (should feel more object-oriented)
Let's say we spent another hour exploring the data and writing markdown cells that describe the data cleaning we did. We can then switch over to the production mindset and make the code more robust.
In the production mindset, we want to focus on writing code that will generalize to more situations. In our case, we want our data cleaning code to work for any of the data sets from Lending Club (from other time periods). The best way to generalize our code is to turn it into a data pipeline. A data pipeline is designed using principles from functional programming, where data is modified within functions and then passed between functions.
Here's a first iteration of this pipeline using a single function to encapsulate data cleaning code:
import pandas as pd def import_clean(file_list): frames =  for file in file_list: loans = pd.read_csv(file, skiprows=1, low_memory=False) loans = loans.drop(['desc', 'url'], axis=1) half_count = len(loans)/2 loans = loans.dropna(thresh=half_count, axis=1) loans = loans.drop_duplicates() # Drop first group of features loans = loans.drop(["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1) # Drop second group of features loans = loans.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1) # Drop third group of features loans = loans.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1) frames.append(loans) return frames frames = import_clean(['LoanStats3a.csv'])
In the code above, we abstracted the code from earlier into a single function. The input to this function is a list of filenames and the output is a list of DataFrame objects.
In general, the production mindset should focus on:
- Healthy abstractions
- Code should generalize to be compatible with similar data sources
- Code shouldn't be so general that it becomes cumbersome to understand
- pipeline stability
- Reliability should match how frequently its run (daily? weekly? monthly?)
Switching between mindsets
Let's say we tried to run the function for all of the data sets from Lending Club and Python returned errors. Some potential sources for errors:
- Variance in column names in some of the files
- Variance in columns being dropped because of the 50% missing value threshold
- Different column types based on pandas type inference for that file
In those cases, we should actually switch back to our prototype notebook and investigate further. When we've determined that we want our pipeline to be more flexible and account for specific variations in the data, we can re-incorporate that back into the pipeline logic.
Here's an example where we adapted the function to accommodate for a different drop threshold value:
import pandas as pd def import_clean(file_list, threshold=0.5): frames =  for file in file_list: loans = pd.read_csv(file, skiprows=1, low_memory=False) loans = loans.drop(['desc', 'url'], axis=1) threshold_count = len(loans)*threshold loans = loans.dropna(thresh=half_count, axis=1) loans = loans.drop_duplicates() # Drop first group of features loans = loans.drop(["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1) # Drop second group of features loans = loans.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1) # Drop third group of features loans = loans.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1) frames.append(loans) return frames frames = import_clean(['LoanStats3a.csv'], threshold=0.7)
The default value is still
0.5, but we can over-ride it to
0.7 if we want.
Here are a few ways to make the pipeline more flexible, in decreasing priority:
- Use optional, positional, and required arguments
- Use if / then statements along with Boolean input values within the functions
- Use new data structures (dictionaries, lists, etc.) to represent custom actions for specific datasets
This pipeline can scale to all phases of the data science workflow. Here's some skeleton code that previews how this looks.
import pandas as pd def import_clean(file_list, threshold=0.5): ## Code def visualize(df_list): # Find the most important features and generate pairwise scatter plots # Display visualizations and write to file. plt.savefig("scatter_plots.png") def combine(df_list): # Combine dataframes and generate train and test sets # Drop features all dataframes don't share # Return both train and test dataframes return train,test def train(train_df): # Train model return model def validate(train_df, test-df): # K-fold cross validation # Return metrics dictionary return metrics_dict frames = import_clean(['LoanStats3a.csv', 'LoanStats2012.csv'], threshold=0.7) visualize(frames) train_df, test_df = combine(frames) model = train(train_df) metrics = test(train_df, test_df) print(metrics)
If you're interested in deepening your understanding and practicing further, I recommend the following next steps:
- Learn how to turn your pipeline into a standalone script that can be run as a module or from the command line: https://docs.python.org/3/library/main.html
- Learn how to use Luigi to build more complex pipelines that can be run in the cloud: Building Data Pipelines in Python and Luigi
- Learn more about data engineering: Data Engineering Posts on Dataquest