August 29, 2022

20 Data Science Projects with Source Code for Beginners

The demand for data scientists is incredibly high. Employers are desperate for data scientists, and recruiters have a hard time filling vacancies. Despite the fears of a looming recession, it appears “data scientists can still name their price.”

Have you ever thought of a career as a data scientist? Now is the perfect time to make a change. To land a rewarding data science job, you’ll need a portfolio of data science projects that will demonstrate your skills to recruiters. 

In this article, we’ll share 20 must-have projects for beginners, along with their source code. These cover the skill spectrum required of a data scientist at every phase of the data science workflow. Completing these projects will help you stand out from the crowd in your job search.

Here are the phases of the data science workflow we’ll discuss:

  • Data collection
  • Feature extraction and exploratory data analysis
  • Model selection and validation
  • Model deployment, continuous monitoring, and improvement

Data Collection

Data collection is one of the most important stages of the entire data analysis process; if mishandled, it can cause your data science project to fail. And as you can guess, gathering data isn’t always as easy as you would like it to be. Here are some suggested data science projects to help you develop your data collection skills:

1. Data Collection and Annotation

Data scientists have multiple ways to source their data, but at times, you might need to collect your own data.

Imagine that you want to start a wine business in the center of Athens, and you need to know which wines you need to stock. You may have to look at how demographics affect the choice of wine in your locality. You can design and send out an online survey to collect your data with tools like Google Forms and Qualtrics, for example. You can read How to Gather Your Own Data by Conducting a Great Survey to learn how to do it effectively and avoid common mistakes. 

In supervised machine learning, datasets need to be annotated so that machines can understand them. Sometimes, these annotations aren’t available during the data collection step. Take a wildlife photographer’s birds collage, for example. If you’re interested in a bird species identification project, you have to first get the bird pictures annotated. Labeling the data by yourself can be slow and laborious. Crowdsourcing platforms like Amazon Mechanical Turk and Lionbridge AI help fill the gaps.

Let your imagination run wild with your data science project ideas. Here are some links that will get you started with data collection and annotation:

2. Scraping a Single Webpage

You’re interested in predicting the weather in your city. You know that the weather data for your city is available on the National Weather Service website, but it isn’t available in a downloadable format.

In this mini project on data science, you’ll learn how to scrape a single webpage using the requests and BeautifulSoup libraries. You will learn how to make a GET request and pass the response to BeautifulSoup for parsing. You will master how to explore the structure of an HTML page and find tags using Google Chrome’s Developer Tools.

That’s not all. You’ll scrape a single webpage on the National Weather Service website extracting text with these tags and put the data in a pandas DataFrame. This data science mini project ends by introducing data preprocessing using regular expressions to extract relevant information from text. 
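The parse-and-extract steps described above can be sketched as follows. This is a minimal illustration, not the tutorial's code: it parses a small inline HTML snippet instead of the live site (in practice the HTML would come from `requests.get(url).text`), and the class names `forecast-item`, `period`, and `desc` are hypothetical placeholders rather than the real page's structure.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
# The tag layout and class names are assumptions made for this sketch.
SAMPLE_HTML = """
<div id="forecast">
  <div class="forecast-item">
    <p class="period">Tonight</p><p class="desc">Low: 49 F</p>
  </div>
  <div class="forecast-item">
    <p class="period">Thursday</p><p class="desc">High: 63 F</p>
  </div>
</div>
"""


def parse_forecast(html: str) -> pd.DataFrame:
    """Extract period names and temperatures from forecast-style HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="forecast-item"):
        period = item.find("p", class_="period").get_text(strip=True)
        desc = item.find("p", class_="desc").get_text(strip=True)
        # A regular expression pulls the first integer out of the text.
        match = re.search(r"(-?\d+)", desc)
        temp = int(match.group(1)) if match else None
        rows.append({"period": period, "description": desc, "temp_f": temp})
    return pd.DataFrame(rows)


forecast = parse_forecast(SAMPLE_HTML)
```

Keeping the parsing in a function that takes an HTML string makes it easy to test the extraction logic without hitting the website on every run.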

Web scraping is an invaluable data science skill, and we recommend our API and Web Scraping in Python course to help you get started.

Here is the link to the blog post about this mini project:

3. Scraping Multiple Webpages

You learned how to scrape data from a single webpage in the previous project. However, the data you need may be available on multiple webpages. A website is a collection of web pages linked together. To scrape multiple web pages, you will need to know how to find the tags that link to the web pages that you’re interested in. 

In this data science project, you’ll expand upon the previous web scraping project. You will scrape the English Premier League matches data from FBref.com. You will start by scraping the standings webpage to get the tags that connect the teams to web pages containing their data. You’ll continue to use the Python requests and BeautifulSoup libraries. 

You’ll master how to make multiple GET requests and pass their responses to BeautifulSoup for parsing inside a `for` loop. As with the previous project, you’ll put the scraped data in a pandas DataFrame. Lastly, you will learn how to write a pandas DataFrame object to a comma-separated values (CSV) file that you can reuse later.
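A rough sketch of that loop is shown below. To stay runnable offline, it replaces the GET requests with a hypothetical `fetch` helper that reads from an in-memory dictionary; the URLs, the page contents, and the `points` class are all invented for illustration and do not reflect the real site's markup.

```python
import io
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/standings/"  # placeholder, not the real site

# Hypothetical pages keyed by URL. A real scraper would call
# requests.get(url).text instead of looking pages up in this dictionary.
PAGES = {
    BASE_URL: (
        '<table><tr><td><a href="/squads/a">Team A</a></td></tr>'
        '<tr><td><a href="/squads/b">Team B</a></td></tr></table>'
    ),
    "https://example.com/squads/a": '<span class="points">70</span>',
    "https://example.com/squads/b": '<span class="points">65</span>',
}


def fetch(url: str) -> str:
    """Stand-in for an HTTP GET request (assumption for this sketch)."""
    return PAGES[url]


def team_links(html: str, base_url: str) -> dict:
    """Map each link's text to its absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        a.get_text(strip=True): urljoin(base_url, a["href"])
        for a in soup.find_all("a", href=True)
    }


rows = []
for team, url in team_links(fetch(BASE_URL), BASE_URL).items():
    soup = BeautifulSoup(fetch(url), "html.parser")
    points = int(soup.find("span", class_="points").get_text())
    rows.append({"team": team, "url": url, "points": points})

matches = pd.DataFrame(rows)

# Write the DataFrame as CSV; a file path would normally replace the buffer.
buffer = io.StringIO()
matches.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

When scraping a real site, it is worth pausing between requests (for example with `time.sleep`) so the loop doesn't overload the server.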

Here are the links to the video tutorial for this project and the GitHub repository housing its source code:

Feature Extraction and Exploratory Data Analysis

Real-world data aren’t usually in formats that machine learning algorithms can understand. Thus, there is a need to preprocess and transform the data. Feature extraction reduces the number of features in the data by creating new ones. Principal Component Analysis (PCA) is the most popular feature extraction algorithm. 

Exploratory Data Analysis (EDA) seeks to understand the relationships between features using statistical and visualization techniques. Some of the most popular graphical techniques used for EDA include box plot, histogram, pair plot, scatter plot, heat map, and vertical and horizontal bar charts. 

Here are some cool data science projects to improve your feature extraction and EDA skills:

4. Dimensionality Reduction with PCA

Working with high-dimensional datasets is common practice for a data scientist. A medical record or an image of a single person is an example of such high-dimensional data. A dataset is often considered high-dimensional if its number of rows, `r`, is less than or equal to its number of features or columns, `c`: $r \le c$.

Imagine that you have a 100 by 100 colored image of yourself. A colored image has three channels: red, green, and blue. When the image is flattened, the number of features it contains is $100 \times 100 \times 3 = 30{,}000$. With PCA, the dimensions of this data can be reduced without losing too much information. Both the dimensions of the image and its size in storage are reduced. When you have many high-resolution images and want to save storage space, or you want to speed up the training of your machine learning algorithm, you can compress the images using PCA.

This is exactly what we’ll do in this data science project. You’ll learn how to process an image with the OpenCV library and how to use the Scikit-Learn implementation of the PCA algorithm to get its principal components. You will also see how to reconstruct the original image from its principal components.
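The compress-and-reconstruct idea can be sketched with Scikit-Learn alone. The example below is an illustration under stated assumptions: it uses a synthetic 100 by 100 array in place of one color channel of a real image (which would typically be loaded with OpenCV's `cv2.imread`), treats each row of the array as a sample, and picks 10 components arbitrarily.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for one 100x100 color channel: a low-rank pattern
# plus a little noise, so that a few components capture most of it.
channel = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 100))
channel += 0.01 * rng.normal(size=(100, 100))

# Keep 10 principal components instead of the original 100 columns.
pca = PCA(n_components=10)
compressed = pca.fit_transform(channel)            # shape (100, 10)
reconstructed = pca.inverse_transform(compressed)  # shape (100, 100)

# Relative reconstruction error: how much information was lost.
relative_error = np.linalg.norm(channel - reconstructed) / np.linalg.norm(channel)
retained_variance = pca.explained_variance_ratio_.sum()
```

For a real color image, the same steps would be applied to each of the three channels separately and the reconstructed channels stacked back together.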

Here are the links to the tutorial, source code, and data for this project:

5. EDA with Seaborn

Going to college is very expensive, and you aren’t guaranteed economic success. To get a good return on your investment, you must be careful in selecting your major. Many college-bound students face a challenge selecting a major that improves their odds of financial success.

In this data science project, you’ll perform an extensive exploratory data analysis (EDA) on data containing the job outcomes of students who graduated from college between 2010 and 2012 using the Seaborn library. You’ll learn how to frame and answer questions by manipulating pandas DataFrames and visualizing the results. You will be able to answer questions like these: 

  • What college degrees have the highest average salary? 
  • Which ones have the highest and lowest employment rate? 
  • Does one’s gender affect salary in the same discipline? 
  • What majors have the highest percentage of men? Of women?
  • Are men- or women-dominated majors the most lucrative?

Here are the links to the source code and data for this project:

You can find other cool projects, such as Finding the Heavy Traffic Indicators on I-94, in our Data Visualization Fundamentals course.

6. EDA with Plotly

While Seaborn allows you to make beautiful graphical plots, it isn’t sufficient when you need highly customizable and interactive plots. Plotly-Dash allows you to build interactive and customizable dashboards and applications that you can deploy. An application creates a layer of abstraction that hides the complexity of your code from your users. The application users interact with the dashboard. This makes sharing your data science projects easier.

This tutorial is a gentle introduction to the Plotly library. You will perform exploratory data analysis on the Netflix Dataset. At the end of the project, you will be able to answer questions like these: 

  • What are the most preferred genres?
  • Which Netflix shows have the highest ratings?
  • What’s the best time of the year to release a show on Netflix?
  • What are the most watched movies? 

This project answers some of these questions on a per-country level. 

Here are the links to the tutorial containing the source code and data for this project:

7. EDA the Fun Way with Matplotlib

We have explored how to use both the Plotly and Seaborn libraries in the preceding projects. Plots from these libraries are very business-like. Sometimes, we just want to make fun data science projects. 

In this data science project, you’ll learn how to perform EDA the fun way with the **xkcd** function in the Matplotlib library. You will continue working on the Netflix dataset using comical plots to investigate questions like these: 

  • What percentage of Netflix content is movies? TV shows?
  • How much Netflix content comes from each country?
  • Who are the most popular actors and directors on Netflix?
  • What kind of content is Netflix focusing on?
  • What are the top genres per country? 

The project ends by introducing you to Word Cloud. You will investigate the most-used words in the descriptions and titles of contents on Netflix. 
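As a small illustration of the comic-style plotting mentioned above, Matplotlib exposes this mode through `plt.xkcd()`, which can be used as a context manager. The counts below are made-up placeholders, not figures from the Netflix dataset, and the non-interactive `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for this sketch
import matplotlib.pyplot as plt

# Placeholder counts invented for illustration only.
content_counts = {"Movie": 7, "TV Show": 3}

# Everything drawn inside this block uses the hand-drawn xkcd style.
with plt.xkcd():
    fig, ax = plt.subplots()
    ax.bar(list(content_counts), list(content_counts.values()))
    ax.set_title("Content by type (made-up numbers)")
    ax.set_ylabel("Number of titles")
```

Matplotlib may warn about missing comic fonts when the figure is rendered; the plot still draws with a fallback font.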

Here are the links to the tutorial containing the source code and data for this project:

Model Selection and Validation

In the data science workflow, the model selection and validation phase is when evaluation metrics are selected and models are trained and validated. Hyperparameter tuning optimizes a model’s performance, and evaluation metrics quantify it. The selected machine learning model is the one that performs best against the evaluation metrics.

Our Machine Learning Fundamentals course will introduce you to the basics of machine learning. You’ll learn how to optimize machine learning models’ hyperparameters, evaluate their performance, and select the best model.

Rather than begin with a project that asks you to implement machine learning algorithms immediately, data science enthusiasts should first understand the mathematics behind these algorithms. Ian Goodfellow, one of the pioneers of modern deep learning and the co-author of one of the first books on deep learning, once said in an interview that to master the field of machine learning, it is important to understand the math happening under the hood. 

So, this section will start with data science projects that involve creating machine learning algorithms from scratch. After this, the discussion shifts to projects where you implement machine learning and deep learning algorithms with standard libraries like Scikit-Learn, Keras, and TensorFlow.

8. Linear Regression: the Normal Equation

You have used the linear regression algorithm for years without even realizing it. Can you recall when you were given a linear equation like $y = 2x + 3$ and a value of $x=2$ and were asked to find the value of $y$? You get a similar linear equation when you train a linear regression algorithm. When you need to find the value of $y$, given some values of $x$, that’s the linear regression algorithm making predictions.

There are several ways to implement the linear regression algorithm from scratch. In this data science project, you’ll implement this algorithm using its normal equation. The video tutorial first takes you through the mathematics before you implement the algorithm in Python using the NumPy library. Using `QR decomposition` and `gradient descent` are more stable ways to implement this algorithm; however, using the normal equation is the simplest way to understand the math behind it. 
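For reference, the normal equation is $\theta = (X^T X)^{-1} X^T y$, where $X$ includes a column of ones for the intercept. A minimal NumPy sketch follows; it reuses the $y = 2x + 3$ example from above as noise-free toy data, and the function name `fit_normal_equation` is our own choice for illustration.

```python
import numpy as np


def fit_normal_equation(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return [intercept, slope, ...] solving the normal equation."""
    # Prepend a column of ones so the first coefficient is the intercept.
    X_b = np.column_stack([np.ones(len(X)), X])
    # Solve (X^T X) theta = X^T y instead of forming the inverse explicitly.
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)


# Toy data generated from y = 2x + 3.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 3

theta = fit_normal_equation(x.reshape(-1, 1), y)

# Prediction for x = 2 using the fitted intercept and slope.
prediction = theta[0] + theta[1] * 2.0
```

Using `np.linalg.solve` avoids computing a matrix inverse directly, which is usually a little more accurate than calling `np.linalg.inv`.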

Here are the links to the video tutorial, source code, and data for this project:

9. Linear Regression with Gradient Descent from Scratch

The gradient descent algorithm is an iterative optimization algorithm for finding a local minimum of a differentiable function. It’s an important algorithm used to train linear regression models, logistic regression models, and neural networks. In machine learning, the function being minimized is called the “cost function.” For the linear regression algorithm, this cost function is the **mean squared error**.

In this data science project, you’ll learn how to implement the batch gradient descent using the NumPy library with data generated inside your program. You’ll visualize how the model’s performance improves with each iteration as it is being trained with gradient descent.
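A compact sketch of batch gradient descent for linear regression is shown below. The true coefficients (4 and 3), the learning rate, and the number of iterations are arbitrary choices for this illustration, not values taken from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y = 4 + 3x plus a little Gaussian noise.
x = rng.uniform(0, 2, size=200)
y = 4 + 3 * x + rng.normal(scale=0.1, size=200)
X_b = np.column_stack([np.ones_like(x), x])  # add the intercept column

theta = np.zeros(2)   # [intercept, slope], starting from zero
learning_rate = 0.1
losses = []           # mean squared error recorded at each iteration

for _ in range(2000):
    residuals = X_b @ theta - y
    losses.append(np.mean(residuals ** 2))
    # Gradient of the mean squared error over the whole batch.
    gradient = (2 / len(y)) * X_b.T @ residuals
    theta -= learning_rate * gradient
```

Plotting `losses` against the iteration number gives the kind of training curve described above: the error drops quickly at first and then flattens as the parameters converge.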

Here are the links to the tutorial containing the source code for this project:

10. Logistic Regression with Gradient Descent from Scratch

The linear regression algorithm doesn’t perform well on classification problems. Therefore, we would need another machine learning algorithm that handles such problems — for example, logistic regression. 

In this data science project, you’ll learn how to implement the logistic regression algorithm with batch gradient descent and the **log-loss** function. This project introduces you to the concept of convexity: because the log-loss cost function is convex, it has a single global minimum, which gradient descent approaches when the learning rate is suitably small. You’ll train and test your algorithm with synthetic data generated inside your program. Finally, you’ll compare the performance of your algorithm with Scikit-Learn’s implementation of logistic regression.
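The same gradient descent loop carries over to logistic regression once the prediction passes through a sigmoid and the cost becomes the log-loss. The sketch below uses two synthetic Gaussian clusters; the cluster centers, learning rate, and iteration count are assumptions chosen only to keep the example small.

```python
import numpy as np


def sigmoid(z):
    """Map real values to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y, p, eps=1e-12):
    """Binary cross-entropy, clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


rng = np.random.default_rng(0)

# Two synthetic clusters: class 0 around (-2, -2), class 1 around (2, 2).
X = np.vstack([rng.normal(-2, 1, size=(100, 2)),
               rng.normal(2, 1, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])
X_b = np.column_stack([np.ones(len(X)), X])  # add the intercept column

weights = np.zeros(X_b.shape[1])
losses = []

for _ in range(500):
    p = sigmoid(X_b @ weights)
    losses.append(log_loss(y, p))
    # Gradient of the log-loss with respect to the weights (full batch).
    weights -= 0.1 * X_b.T @ (p - y) / len(y)

# Classify with a 0.5 probability threshold and measure training accuracy.
accuracy = np.mean((sigmoid(X_b @ weights) >= 0.5) == y)
```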

Understanding how gradient descent and logistic regression work is a prerequisite to understanding how a standard neural network works. A standard neural network can be thought of as a stack of logistic-regression-like units that are trained using gradient descent.

Here are the links to the tutorial containing the source code and a short read on the math behind gradient descent:

11. Linear Regression Algorithm with Scikit-Learn

You’ve seen the math behind the linear regression and logistic regression algorithms. You’ve implemented these algorithms from scratch. While you build a solid mathematical and theoretical foundation when you implement these algorithms from scratch, you don’t have to do everything over again every time you work on a data science project. Libraries and frameworks like Scikit-Learn, TensorFlow, and PyTorch provide rigorously tested implementations of these algorithms.

In this project, you’ll learn how to use the Scikit-Learn implementation of the Linear Regression algorithm. You’ll predict house prices from several categorical and numerical features in the Ames dataset, collected in Ames, Iowa between 2006 and 2010. This dataset isn’t clean. So, you’ll use data wrangling techniques to clean the data and impute missing values. You’ll learn how to engineer new features out of existing ones, and the different data transformation techniques you can apply to numerical and categorical features. Finally, you’ll train the model, make predictions, and measure the accuracy of your predictions against the test set using the root mean squared error metric.

Learning how the linear regression algorithm works is an important first step in mastering machine learning. In our Linear Regression for Machine Learning course, you’ll learn how to preprocess and transform your data, select appropriate features, and implement the linear regression algorithm.

Here are the links to the source code and data for this project:

12. Extending the Logistic Regression Algorithm


By default, the Logistic Regression algorithm is a binary classifier. So, it’s incapable of handling multiclass classification problems except when we extend it in some ways. There are several ways to do this. You’ll learn one of the simplest ways of extending the logistic regression algorithm by changing some of its default parameters. You’ll learn how setting the `class_weight` and `multi_class` parameters in the Scikit-Learn implementation of the Logistic Regression algorithm enables it to handle imbalanced data and multiclass classification problems.

That’s not all. You’ll preprocess your dataset to handle missing values. You’ll work with the two kinds of categorical features, nominal and ordinal, and learn their different transformation techniques. You’ll perform extensive univariate and bivariate EDA and feature engineering. Besides the logistic regression algorithm, you’ll also learn the Scikit-Learn implementation of multiclass classification with the following algorithms: KNeighborsClassifier, Multinomial Naive Bayes, Random Forest, and GradientBoosting. You’ll learn how to optimize these algorithms’ hyperparameters using grid search with cross-validation.

This project covers all the data science workflow phases we have discussed so far. It highlights the fact that finding the solution to a data science problem is an iterative process involving extending, training, and optimizing several machine learning algorithms. One way you could improve this project is to create a classifier that combines all the other trained algorithms using the majority rule. This is an ensemble technique in machine learning, and we can implement it with Scikit-Learn’s `VotingClassifier` class.
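The majority-rule extension suggested above could look roughly like the following sketch. It runs on a synthetic three-class dataset from `make_classification` rather than the project's data, and the choice of base models and their settings is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class problem standing in for the project's dataset.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# voting="hard" means each model casts one vote and the majority class wins.
voter = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ("knn", KNeighborsClassifier()),
        ("forest", RandomForestClassifier(random_state=0)),
    ],
    voting="hard",
)
voter.fit(X_train, y_train)
test_accuracy = voter.score(X_test, y_test)
```

Switching to `voting="soft"` would average the predicted class probabilities instead, which only works when every base model implements `predict_proba`.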

Here are the links to the source code and data for this project:

You can find other cool projects, such as predicting the stock market, in our Intermediate Machine Learning in Python course. 

13. Classification with Ensemble Learning

You used the RandomForest and GradientBoosting ensemble models in the last project. But what is ensemble learning? It’s the machine learning technique where you seek to improve predictive performance by combining the predictions of many machine learning models. In this project, we’ll use the Scikit-Learn implementation of the RandomForestClassifier to predict the direction of stock price movements.

Using its default settings, the RandomForestClassifier is an ensemble of 100 DecisionTreeClassifier models. Conceptually, it makes its prediction by majority rule over the predictions of these DecisionTreeClassifiers; Scikit-Learn’s implementation averages the trees’ predicted class probabilities rather than counting hard votes. For a regression problem, the corresponding RandomForestRegressor takes the average of all the trees’ predictions.
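A quick way to see the ensemble at work is to inspect a fitted forest. The sketch below uses synthetic data and checks two things: the default number of trees, and that the forest's class probabilities are the average of its trees' probabilities, which is how Scikit-Learn implements a soft form of the majority rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data for illustration.
X, y = make_classification(n_samples=200, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X, y)

# The fitted trees live in the estimators_ attribute.
n_trees = len(forest.estimators_)

# Average the class probabilities predicted by each individual tree.
tree_average = np.mean(
    [tree.predict_proba(X) for tree in forest.estimators_], axis=0
)
matches_forest = np.allclose(tree_average, forest.predict_proba(X))
```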

In this project, you’ll learn how to predict the direction of price movement of a financial security. Although you’ll use the Microsoft stock price for this project, you can extend it to any other financial security that interests you. All that you need to do is change the ticker from Microsoft’s, MSFT, to the ticker of your choice when calling the YahooFinance API, from which the data is downloaded. You’ll learn how to process time series data using the pandas library. Stock prices are continuous variables, so predicting them is naturally a regression problem. You’ll learn how to reframe a regression task into a classification task by transforming the target variable.
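One common way to reframe the task, sketched below, is to label each day by whether the next day's closing price is higher. The prices are made-up numbers, and the column names `Close`, `Tomorrow`, and `Target` are assumptions for this illustration rather than names confirmed by the tutorial.

```python
import pandas as pd

# Made-up closing prices on five consecutive business days.
prices = pd.DataFrame(
    {"Close": [100.0, 101.5, 101.0, 103.2, 102.8]},
    index=pd.date_range("2022-01-03", periods=5, freq="B"),
)

# Shift the series up by one row so each row sees the next day's close.
prices["Tomorrow"] = prices["Close"].shift(-1)

# Binary target: 1 if the price goes up the next day, otherwise 0.
prices["Target"] = (prices["Tomorrow"] > prices["Close"]).astype(int)

# The last row has no "tomorrow", so drop it before training.
labeled = prices.dropna(subset=["Tomorrow"])
```

The continuous price is now a two-class label, so any classifier, including a random forest, can be trained on it.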

There are many metrics to validate your classification algorithm. This project discusses what you should consider when selecting a metric for your data science project. Furthermore, it goes into the details of creating a backtest to validate model performance. This project can be extended by training a GradientBoostingClassifier and comparing how it performs against the RandomForestClassifier. Advanced learners can train a Long Short-Term Memory (LSTM) model and compare its performance against the RandomForest and GradientBoosting classifiers.

Here are the links to the tutorial and source code for this project:

14. Classification with R

Python is a great programming language for completing projects on data science, but it isn’t the only language out there. The R programming language has a long history of use in statistical and scientific computing. Our Data Analyst in R path can help you get started with the R programming language.

This project will introduce you to using R for data science projects. You’ll learn how to train several machine learning algorithms to predict the outcome of UFC Fights using the UFC data on Kaggle. This data was scraped from the UFC Stats website. We suggest several web scraping projects in the data collection phase of the data science workflow. This is because web scraping is an important data science skill. Feel free to scrape the UFC Stats website if you feel the dataset is a little outdated.

In this project, you’ll train the following machine learning algorithms in R: K-Nearest Neighbors, Logistic Regression, DecisionTree, RandomForest, and Extreme Gradient Boosting (XGBoost). You’ll validate the models and compare their performance against experts’ predictions. In the end, you’ll have several prediction models that you can use to predict the outcomes of upcoming UFC fights.

Here are the links to the tutorial, source code, and data for this project:

15. Predicting Customer Attrition with Classification Algorithms

Companies would like to find out when customers will stop doing business with them before they actually do. This will help the companies design promotional offers to retain their customers. In this project on data science, you’ll learn how companies can predict churn using machine learning. You’ll work with Telco Customer Churn data available on Kaggle.

You’ll start by preprocessing the data and performing EDA to identify patterns. Next, you’ll learn how to transform numerical and categorical features into formats that can be used for training machine learning algorithms. There are different metrics for evaluating the performance of classification algorithms, and no single metric fits every problem.

You’ll learn about the arguments the author puts forward for choosing the recall metric. You’ll train and compare the performance of several machine learning algorithms (Logistic Regression, Decision Tree, Support Vector Machine, and XGBoost) with balanced and imbalanced datasets. You’ll learn how to tune these models to optimize their performance using grid search with cross-validation.
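Tuning for recall with a grid search can be sketched as follows. The example uses a synthetic imbalanced dataset and a logistic regression model; the parameter grid is an arbitrary illustration, not the grid used in the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data: roughly 80% negatives and 20% positives.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

# Candidate hyperparameters (illustrative values).
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "class_weight": [None, "balanced"],
}

# scoring="recall" selects the combination that finds the most positives.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="recall", cv=5
)
search.fit(X, y)

best_params = search.best_params_
best_recall = search.best_score_
```

Recall is a natural fit for churn problems when missing a customer who is about to leave costs more than contacting one who would have stayed.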

Here are the links to the tutorial, source code, and data for this project:

16. Advanced Regression Technique with Ensemble Learning

We have seen quite a number of classification problems that use the advanced ensemble technique. In this project, we’ll see how we can improve regression models’ performance using ensembling. You’ll work with Kaggle’s Housing Price Data. The data isn’t clean, so you’ll start this data science project with preprocessing the data.

You’ll learn how to visualize and remove outliers from your data. You’ll see how visualizing the number of missing values per feature helps you decide on an appropriate cutoff for the percentage of missing values in a feature. Features with missing values above the cutoff are dropped, and an appropriate imputation technique is used to fill the missing values for the other features.
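The cutoff-then-impute step can be sketched with pandas. The tiny DataFrame, its column names and values, and the 50% cutoff below are placeholders chosen for illustration; the median and mode imputations are one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Tiny placeholder dataset with missing values in every column.
df = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 9550, 14260],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd", np.nan],
    "Neighborhood": ["A", "B", None, "A", "A"],
})

# Share of missing values per feature, between 0 and 1.
missing_share = df.isna().mean()

# Drop features whose missing share exceeds the cutoff.
cutoff = 0.5
kept = df.loc[:, missing_share <= cutoff].copy()

# Impute the rest: median for numeric columns, most frequent value otherwise.
for col in kept.columns:
    if pd.api.types.is_numeric_dtype(kept[col]):
        kept[col] = kept[col].fillna(kept[col].median())
    else:
        kept[col] = kept[col].fillna(kept[col].mode().iloc[0])
```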

You’ll perform an extensive EDA with discrete and continuous features using bar charts and histograms. Then you’ll engineer features based on domain knowledge and transform numerical and categorical features with the appropriate techniques. You’ll train and optimize the hyperparameters for the following models: XGBRegressor, Ridge, Lasso, Support Vector Regressor, LightGBM Regressor, and GradientBoostingRegressor.

Lastly, you will learn how to stack these regression models into a single ensemble model that you can use to make predictions. At the end of this project, you will have used state-of-the-art regression models and learned techniques that will enable you to become a competitive data scientist.
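Stacking can be sketched with Scikit-Learn's `StackingRegressor`. To keep the example dependent on Scikit-Learn only, it leaves out the XGBoost and LightGBM models and runs on synthetic data; the base models and the Ridge final estimator are assumptions for illustration, not the tutorial's configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing dataset.
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models make predictions; the final estimator learns how to combine
# them, using out-of-fold predictions produced by internal cross-validation.
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("lasso", Lasso()),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=Ridge(),
)
stack.fit(X_train, y_train)

r2 = stack.score(X_test, y_test)  # coefficient of determination on test data
```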

Here are the links to the tutorial with source code, and data for this project:

17. Bayesian Machine Learning

Spam messages are a menace. They clutter your inbox, distract you from noticing important messages, and take up storage space. This data science project introduces you to the field of natural language processing (NLP). It’s the aspect of artificial intelligence that handles how computers can process and analyze large amounts of natural language data. A spam classifier is one of the most basic applications of NLP. A more advanced application is in the automatic speech recognition systems of Alexa and Google Assistant.

In this data science project, you’ll learn how to process text data and build a probabilistic naive Bayes spam filter that can help you differentiate spam from non-spam messages using the SMS Spam Collection dataset on Kaggle. First, you’ll preprocess the dataset and transform it into a format from which you can create a bag-of-words model. Next, you’ll learn how to classify a message as spam or not-spam by calculating and comparing their probabilities. Finally, you’ll test your spam filter on your test set and calculate its accuracy.
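The probability comparison at the heart of such a filter can be sketched in plain Python. The four training messages below are invented for illustration, and Laplace smoothing with `alpha=1.0` is an assumption about how unseen words are handled; log probabilities are used to avoid multiplying many small numbers.

```python
import math
from collections import Counter

# Tiny invented training set of (message, label) pairs.
train = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("are we meeting for lunch", "ham"),
    ("see you at lunch tomorrow", "ham"),
]

# Bag of words: count how often each word appears under each label.
word_counts = {"spam": Counter(), "ham": Counter()}
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.lower().split())

vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])


def log_posterior(message, label, alpha=1.0):
    """Unnormalized log P(label | message) with Laplace smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(label_counts[label] / sum(label_counts.values()))
    for word in message.lower().split():
        if word in vocabulary:  # words never seen in training are skipped
            count = word_counts[label][word]
            score += math.log(
                (count + alpha) / (total_words + alpha * len(vocabulary))
            )
    return score


def classify(message):
    """Return the label with the higher posterior score."""
    return max(("spam", "ham"), key=lambda label: log_posterior(message, label))
```

For example, `classify("claim your free prize")` scores higher under the spam label because "claim", "free", and "prize" appear only in the spam messages.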

In this data science project, the spam filter is built from scratch without the use of packages from a machine learning library. You can extend this project by using NLTK, spaCy, TfidfVectorizer, and MultinomialNB to reduce the heavy work involved in building from scratch.

Take our Conditional Probability course and the other courses in our Probability and Statistics module to gain the foundational knowledge required to complete this project.

Here are the links to the source code and data for this project:

18. Introduction to Deep Learning with Keras

We have mostly worked with tabular datasets up to this point. Classical machine learning algorithms perform well on tabular data. This isn’t the case for unstructured data like images and text. This is where deep learning algorithms shine. In this project, you’ll learn how to create a digit classifier with the popular MNIST dataset.

You’ll use the Keras API to import the data and preprocess the images and their labels. You’ll learn how to build your own standard neural network architecture using densely connected layers, activation functions, loss functions, optimizers, and metrics. Then you’ll train, evaluate, and make predictions with the trained neural network.

By the end of this project, you’ll have a standard neural network model that can accurately predict digits. You can improve this data science project by tuning the neural network’s hyperparameters, such as the batch size, the number of hidden layers and units, and the choice of optimizer, and by adding regularization and dropout.

Here are the links to the tutorial with source code, and data for this project:

19. Convolutional Neural Network with Keras and TensorFlow

Let’s admit it. We don’t have unlimited computing resources to train very large models. In addition, large models may take several days or even weeks to train. Training deep learning models with very little data is a very important skill for a data scientist to have. In this project, you’ll train a convolutional neural network (convnet) that can differentiate cats from dogs with reasonable accuracy using little data. To put what we mean by little data into context, the Dogs vs. Cats dataset on Kaggle contains 25,000 images of cats and dogs. But with only 2,000 images, you’ll train a convnet with an accuracy of about eighty percent.

You’ll learn how to use Keras for data augmentation, the technique where synthetic data is generated from your original dataset to augment it. Next, you’ll learn how to build a convolutional neural network architecture containing convolution, activation, and pooling layers. You’ll learn how to connect your convnet architecture to fully connected layers that end with an output layer. Finally, you’ll learn how to train this neural network to classify cats and dogs accurately.

At the end of the tutorial, the author introduces the concept of transfer learning. Researchers have trained very deep neural networks on millions of images and have optimized the model parameters. With transfer learning, you don’t have to train your neural network from scratch. You can select a pre-trained model, freeze its layers, add your own fully connected layers, train with your data, and then make predictions. You’ll typically get a more accurate model than you would by training from scratch.


Here are the links to the tutorial, source code, and data for this project:

Model Deployment, Continuous Monitoring, and Improvement

You have trained a machine learning model that works, but it’s only available to you. The goal of machine learning is to solve a problem, and the model should be available for others to use. To make the model available to a wider audience, you have to put the model in production or deploy it as a web application or embedded in another system. 

After putting a machine learning model into production, its performance degrades over time. Thus, it’s very important to monitor the performance of your deployed model, re-train to improve its performance, and re-deploy. A simple way to put a model into production is to use interactive web applications like Shiny for Python and Streamlit. Extensive knowledge of web development isn’t necessary to build and deploy web applications using Shiny and Streamlit. This article discusses in depth how to continuously monitor your machine learning models post-deployment.

20. Deploy Machine Learning Model Using Streamlit in Python

In this project, you’ll learn how to develop a simple machine learning application using Streamlit. First, you’ll train and validate a RandomForestClassifier. Next, you’ll save the model as a pickle file ready for deployment. Afterward, you’ll learn how to use Streamlit to deploy the model as an interactive web application that makes predictions using your saved model. You’ll use the Kaggle Banknote Authentication Data to create an interactive Banknote Authenticator web application that takes four inputs and predicts whether or not the banknote is authentic. By the end of this project, you’ll know how to deploy your machine learning models as interactive web applications available for others to use.
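The save-and-reload step can be sketched as below. The example trains on synthetic four-feature data rather than the banknote dataset and pickles the model to an in-memory buffer instead of a file; the helper name `predict_banknote` is hypothetical. In a Streamlit app, the four values would typically come from input widgets such as `st.number_input` and be passed to a function like this one.

```python
import io
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with four features, standing in for the banknote dataset.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Serialize the trained model; a real app would write to a .pkl file.
buffer = io.BytesIO()
pickle.dump(model, buffer)

# Later, typically inside the web app, load the model back.
buffer.seek(0)
loaded_model = pickle.load(buffer)


def predict_banknote(features):
    """Return the predicted class (0 or 1) for one set of four inputs."""
    row = np.asarray(features, dtype=float).reshape(1, -1)
    return int(loaded_model.predict(row)[0])
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.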

Here are the links to the video tutorial, source code, and data for this project:

Takeaway

In this article, we discussed 20 cool data science projects that cover the skill spectrum required of a data scientist. These projects cover the essential technical skills you would require to build end-to-end data science projects. Having a portfolio of data science projects helps showcase your data science skills to potential recruiters, which helps you stand out in your job search.

Here is a list of our projects that you can complete for free when you sign up with Dataquest. We also have curated 55 beginner-friendly Python projects that will enrich your portfolio in this blog post.

If you’re new to programming and haven’t learned the basics yet, we recommend the Machine Learning Introduction with Python skill path. You can also explore other courses in our skill paths and sign up for those that pique your interest. If you know the fundamentals, we recommend that you sign up for our Data Scientist in Python career path.

In this article, we’ve shared some personal projects from our alumni. Most started out simply as data science enthusiasts. Our hands-on approach to learning and our interactive platform helped them launch their careers as data scientists. Why not join them?



Aghogho Monorien

About the author

Aghogho is an engineer and aspiring Quant working on the applications of artificial intelligence in finance.
