Halloween is arguably the best holiday for those of us who like candy (which is everyone). I always loved Halloween because no matter how old you are, you probably want some type of candy. But, with the season only a few days away, you may be wondering two things: what costume will you wear, and (more importantly) how can you make sure you get the candy trick-or-treaters want?
Depending on where you live, you may be able to guess which candies do better than others in your area. But a guess isn't always enough to make a confident decision. Luckily, data and surveys give us a better chance of making the right choice.
In the case of Halloween in particular, there's one dataset that's fun to explore and holds the secret to the best candy option. In 2017, FiveThirtyEight created a dataset comparing 85 popular fun-sized candies, based on more than 265,000 randomly generated head-to-head matchups voted on by site visitors, to decide once and for all the true king of Halloween candy.
A First Look at the Dataset
Now, as a sugar-loving data scientist preparing for Halloween, I couldn't miss the chance to explore this dataset and see what candy I should add to my shopping list.
The Ultimate Halloween Candy Power Ranking dataset describes 85 candies using 12 variables, plus a column holding each candy's name. Nine of them are binary variables describing the nature and content of the candy, such as whether it contains chocolate, caramel, or nuts, and whether it is a hard candy or bar-shaped. The remaining three describe the candy's sugar percentile (or how I like to think of it, sweetness level), its price percentile, and its winning percentage, which is the share of head-to-head matchups the candy won rather than something computed from the other 11 variables.
A Deeper Dive (Getting to Know the Data)
Before we can use this dataset to decide on the best candy to buy this Halloween, or just to find out which Halloween candy is the best, we need to take a look at the data. So, let's start with a simple data exploration using Python and Pandas.
1. Load the data and explore basic information
After we clone the dataset from GitHub, we need to load it into a Pandas DataFrame and explore different aspects, such as the size of the dataset and the types of variables within it. We can even set the display precision to 2 decimal places for easier handling of the floating-point data.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Load the dataset into a DataFrame
candy = pd.read_csv("candy-data.csv")
# Display the number of rows in the dataset
len(candy)
# Examine the shape (rows, columns) of the DataFrame
candy.shape
# Set display precision to 2 decimal places
pd.set_option("display.precision", 2)
# Get the first 5 rows of the DataFrame
candy.head()
We can also determine the variable types within the dataset using candy.info().
We can get basic stats using candy.describe().
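As a rough sketch of what info() and describe() report, the snippet below runs both on a tiny stand-in DataFrame. The three rows and all of their values are invented for illustration and are not taken from the real file. With the real data, you would run the same calls on candy.

```python
import pandas as pd

# Toy stand-in for candy-data.csv: rows and values are made up for illustration.
toy = pd.DataFrame({
    "competitorname": ["Candy A", "Candy B", "Candy C"],
    "chocolate": [1, 0, 1],
    "fruity": [0, 1, 0],
    "sugarpercent": [0.50, 0.25, 0.75],
    "winpercent": [70.0, 40.0, 55.0],
})

# Column names, non-null counts, and dtypes (printed to stdout)
toy.info()

# Count, mean, std, min, quartiles, and max for every numeric column
stats = toy.describe()
print(stats)

# Sanity check: a binary column should only contain 0 and 1
is_binary = toy["chocolate"].isin([0, 1]).all()
print(is_binary)
```

Note that describe() skips the text column by default, so competitorname does not appear in the summary.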
2. Access and query the dataset
Knowing the variable types and the construction of the dataset will allow us to explore the data using simple queries and groupings. For example, we can show only the candies with chocolate and caramel in them or those with a winning percentage above 70%. We can also put the two conditions together to see which chocolate-caramel candies have the highest winning percentages.
candy[(candy["chocolate"] == 1) & (candy["caramel"] == 1) & (candy["winpercent"] > 70)]
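The individual filters mentioned above can be written the same way. Here is a minimal sketch on a toy DataFrame whose rows and values are invented for illustration. It also shows DataFrame.query as an alternative spelling of the combined filter.

```python
import pandas as pd

# Toy stand-in for the candy DataFrame; names and values are made up.
toy = pd.DataFrame({
    "competitorname": ["Candy A", "Candy B", "Candy C", "Candy D"],
    "chocolate":  [1, 1, 0, 1],
    "caramel":    [1, 0, 0, 1],
    "winpercent": [75.0, 72.0, 45.0, 60.0],
})

# Candies that contain both chocolate and caramel
choc_caramel = toy[(toy["chocolate"] == 1) & (toy["caramel"] == 1)]

# Candies with a winning percentage above 70
high_win = toy[toy["winpercent"] > 70]

# Both conditions at once, written as a query string
both = toy.query("chocolate == 1 and caramel == 1 and winpercent > 70")

print(choc_caramel["competitorname"].tolist())
print(high_win["competitorname"].tolist())
print(both["competitorname"].tolist())
```

The boolean-mask form and the query form select the same rows. Which one to use is mostly a matter of taste.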
We can also determine the average winning percentage for any candy with chocolate versus those with both chocolate and caramel, and how that percentage changes if the candy has chocolate, caramel, and nuts (which will be 60.92%).
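One way to compute these averages is to filter on the binary columns and take the mean of winpercent. The sketch below uses a hypothetical helper, mean_win, and a toy DataFrame with invented values, so its numbers will not match the real dataset.

```python
import pandas as pd

# Toy stand-in for the candy DataFrame; values are invented for illustration only.
toy = pd.DataFrame({
    "chocolate":      [1, 1, 1, 1, 0],
    "caramel":        [0, 1, 1, 0, 0],
    "peanutyalmondy": [0, 0, 1, 1, 0],
    "winpercent":     [60.0, 70.0, 80.0, 50.0, 30.0],
})

def mean_win(df, *flags):
    """Average winpercent over rows where every listed binary column equals 1."""
    mask = pd.Series(True, index=df.index)
    for flag in flags:
        mask &= df[flag] == 1
    return df.loc[mask, "winpercent"].mean()

choc = mean_win(toy, "chocolate")
choc_caramel = mean_win(toy, "chocolate", "caramel")
choc_caramel_nuts = mean_win(toy, "chocolate", "caramel", "peanutyalmondy")

print(choc, choc_caramel, choc_caramel_nuts)
```

Calling mean_win(candy, "chocolate", "caramel", "peanutyalmondy") on the real DataFrame would give the three-way average discussed above.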
3. Visualizing the data
My favorite part about exploring any new dataset has to be visualizing the data. While you can get far just by querying the data, grouping some columns and rows, and analyzing the content of the data in a written form, because we’re visual creatures, we perceive information better when it's visualized. Moreover, sometimes the data can contain hidden patterns or trends that you can only discover through visualization.
Assume we want to know if the sweetness level or price affects the candy's winning percentage. We can use a scatter plot to show the sweetness level and the price against the winning percentage. If we do that, we will see that neither factor has a clear relationship with the winning percentage: some candies with high sweetness levels have a low chance of winning, and price is not a decisive factor either.
# Plot sweetness level vs. winning percentage in red
ax1 = candy.plot(kind='scatter', x='sugarpercent', y='winpercent', color='r', label='sweetness level')
# Overlay price vs. winning percentage in green on the same axes
ax2 = candy.plot(kind='scatter', x='pricepercent', y='winpercent', color='g', label='price', ax=ax1)
ax2.set_xlabel("Sugar / Price Percentage")
ax2.set_ylabel("Win Percentage")
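To put a number on what the scatter plot suggests, one option (not part of the original analysis) is the Pearson correlation between each factor and the winning percentage. A minimal sketch on a toy DataFrame with invented values:

```python
import pandas as pd

# Toy stand-in for the candy DataFrame; values are made up for illustration.
toy = pd.DataFrame({
    "sugarpercent": [0.1, 0.4, 0.6, 0.9],
    "pricepercent": [0.2, 0.8, 0.3, 0.7],
    "winpercent":   [50.0, 60.0, 40.0, 55.0],
})

# Pearson correlation of each factor with winpercent.
# Values near 0 indicate a weak linear relationship; values near +1 or -1, a strong one.
correlations = toy[["sugarpercent", "pricepercent"]].corrwith(toy["winpercent"])
print(correlations)
```

On the real data, you would replace toy with candy. A correlation close to zero would be consistent with the scattered cloud of points in the plot.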
So, that leads us to wonder which factors do affect a candy's winning percentage. We can focus on the 9 binary variables and see how each one affects the winning percentage. We can start by plotting the average winning percentage of the candies that have each of those 9 characteristics.
cat = ["chocolate", "fruity", "caramel", "peanutyalmondy", "nougat", "crispedricewafer", "hard", "bar", "pluribus"]
# Average winning percentage of the candies that have each characteristic
ave_win = [candy.loc[candy[item] == 1, 'winpercent'].mean() for item in cat]
plt.figure(figsize=(15,5))
plt.bar(cat, ave_win)
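An average over the candies that have a characteristic says little unless we compare it with the candies that lack it. As a complementary sketch, the snippet below computes that difference for each binary column. The toy DataFrame and its values are invented, and the name lift is just a label chosen here.

```python
import pandas as pd

# Toy stand-in for the candy DataFrame; values are made up for illustration.
toy = pd.DataFrame({
    "chocolate":  [1, 1, 0, 0],
    "fruity":     [0, 0, 1, 1],
    "hard":       [0, 0, 0, 1],
    "winpercent": [80.0, 60.0, 40.0, 30.0],
})
features = ["chocolate", "fruity", "hard"]

# Mean winpercent with the characteristic minus mean winpercent without it
lift = pd.Series({
    col: toy.loc[toy[col] == 1, "winpercent"].mean()
         - toy.loc[toy[col] == 0, "winpercent"].mean()
    for col in features
}).sort_values(ascending=False)

print(lift)
```

A positive value means candies with that characteristic win more often on average than candies without it. A negative value means the opposite.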
From the bar chart, we can see that candies with chocolate, nuts (peanuts or almonds), and a crispy element (like wafers) have the highest chances of winning. From the averages, we can also see that if a candy has a fruity element or is a hard candy, its chances of winning are much lower than those of its nutty chocolate competitors.
If we sort our data by winning percentage in descending order and then plot the top 10 candies against their winning percentage, we will see that chocolate dominates the list, often paired with a nutty element. The top two candies in this dataset are Reese's Peanut Butter Cups and Reese's Miniatures.
# Sort candies from highest to lowest winning percentage
df = candy.sort_values("winpercent", ascending=False)
# Bar chart of the top 10 candies
df.head(10).plot(x="competitorname", y="winpercent", kind="bar")
From the simple data exploration techniques we used on this dataset, I can tell you with a high degree of confidence that if you want to be a hit this Halloween, go with candies that have chocolate, nuts, and a crispy element to them. That will surely make your house a trick-or-treater’s favorite this year.
Aside from winning the Halloween trick-or-treat battle, you can also use this dataset to determine which candy to bring to your next party, or to train a machine learning model that predicts the winning percentage of sweets not included in the dataset. Finally, you can combine this dataset with some clustering algorithms to create a formula for the perfect candy based on more than a quarter of a million votes.
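As a very rough sketch of the prediction idea, the snippet below fits an ordinary least-squares line with NumPy on a toy DataFrame. The data values, the choice of two features, and the predict helper are all assumptions made for illustration. A real model would use the full dataset and a proper train/test split.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the candy DataFrame; values are made up for illustration.
toy = pd.DataFrame({
    "chocolate":    [1, 1, 0, 0, 1, 0],
    "sugarpercent": [0.7, 0.5, 0.9, 0.3, 0.6, 0.4],
    "winpercent":   [80.0, 70.0, 45.0, 35.0, 75.0, 40.0],
})
features = ["chocolate", "sugarpercent"]

# Design matrix with an intercept column, followed by the chosen features
X = np.column_stack([np.ones(len(toy)), toy[features].to_numpy(dtype=float)])
y = toy["winpercent"].to_numpy()

# Least-squares fit: coef = [intercept, chocolate weight, sugarpercent weight]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(chocolate, sugarpercent):
    """Predicted winpercent for a hypothetical candy (illustrative helper)."""
    return float(coef @ np.array([1.0, chocolate, sugarpercent]))

print(predict(1, 0.6), predict(0, 0.6))
```

With only 85 candies in the real dataset, any such model should be treated as a fun experiment rather than a reliable predictor.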
So, next time you're stuck choosing candy to satisfy your sweet tooth, know that the data can always help you make a more educated decision. After all, nothing is more fun than analyzing candies before Halloween.