# Math in Data Science

Math is like an octopus: it has tentacles that can reach out and touch just about every subject. And while some subjects only get a light brush, others get wrapped up like a clam in the tentacles’ vice-like grip. Data science falls into the latter category. If you want to do data science, you’re going to have to deal with math.

If you’ve completed a math degree or some other degree that provides an emphasis on quantitative skills, you’re probably wondering if everything you learned to get your degree was necessary. I know I did. And if you *don’t* have that background, you’re probably wondering: how much math is *really* needed to do data science? In this post, we’re going to explore what it means to do data science and talk about just how much math you need to know to get started.

Let’s start with what “data science” actually means. You probably could ask a dozen people and get a dozen different answers! Here at Dataquest, we define data science as the discipline of using data and advanced statistics to make predictions. It’s a professional discipline that’s focused on creating understanding from sometimes-messy and disparate data (although precisely what a data scientist is tackling will vary by employer).

Statistics is the only mathematical discipline we mentioned in that definition, but data science also regularly involves other fields within math. Learning statistics is a great start, but data science also uses algorithms to make predictions. These algorithms are called machine learning algorithms and there are literally hundreds of them. Covering how much math is needed for every type of algorithm in depth is not within the scope of this post, I will discuss how much math you need to know for each of the following commonly-used algorithms:

- Naive Bayes
- Linear Regression
- Logistic Regression
- Neural Networks
- K-Means clustering
- Decision Trees

Now let’s take a look how much math is actually required for each of these!

## Naïve Bayes’ Classifiers

**What they are**: Naïve Bayes’ classifiers are a family of algorithms based on the common principle that the value of a specific feature is independent of the value of any other feature. They allow us to predict the probability of an event happening based on conditions we know about events in question. The name comes from **Bayes’ theorem**, which can be written mathematically as follows:

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$$

where \(A\) and \(B\) are events and \(P(B)\) is not equal to 0.

That looks complicated, but we can break it down into pretty manageable pieces:

- \(P(A|B)\) is a conditional probability. Specifically, the likelihood of event A occurring given that \(B\) is true.
- \(P(B|A)\) is also a conditional probability. Specifically, the likelihood of event \(B\) occurring given the \(A\) is true.
- \(P(A)\) and \(P(B)\) are the probabilities of observing \(A\) and \(B\) independently of each other.

**Math we need**: If you want to scrape the surface of Naïve Bayes’ classifier algorithms and all the uses for Bayes’ theorem, a course in probability would be sufficient. To get an introduction to probability, you can check out our course on probability and statistics.

## Linear Regression

**What it is**: Linear regression is the most basic type of regression. It allows us to understand the relationships between two continuous variables. In the case of simple linear regression, this means taking a set of data points and plotting a trend line that can be used to make predictions about the future.

Linear regression is an example of parametric machine learning. In parametric machine learning, the training process ultimately enables the machine learning algorithms is a mathematical function that best approximates the patterns it found in the training set. That mathematical function can then be used to make predictions about expected future results. In machine learning, mathematical functions are referred to as models.

In the case of linear regression, the model can be expressed as:

$$ y = a_0 + a_1 x_1 + a_2 x_2 + \ldots + a_i x_i $$

where \(a_1\), \(a_2\), \(\ldots\), \(a_n\) represent the parameter values specific to the data set, \(x_1\), \(x_2\), \(\ldots\), \(x_n\) represent the feature columns we choose to use in out model, and \(y\) represents the target column.

The goal of linear regression is to find the optimal parameter values that best describe the relationship between the feature column and the target column. In other words: to find the line of best fit for the data so that a trend line can be extrapolated to predict future results.

To find the optimal parameters for a linear regression model, we want to minimize the model’s residual sum of squares. Residual is often referred to as the error and it describes the difference between the predicted values and the true values.

The formula for the residual sum of squares can be expressed as:

$$ RSS = (y_1 – \hat{y_1})^{2} + (y_2 – \hat{y_2})^{2} + \ldots + (y_n – \hat{y_n})^{2} $$

(where \(\hat{y}\) are the predicted values for the target column and y are the true values.)

**Math we need**: If you want to scrape the surface, a course in elementary statistics would be fine. If you want a deep conceptual understanding, you’ll probably want to know how the formula for residual sum of squares in derived, which you can learn in most courses on advanced statistics.

## Logistic Regression

**What it is**: Logistic regression focuses on estimating the probability of an event occurring in cases where the dependent variable is binary (i.e., only two values, 0 and 1, represent outcomes).

Like linear regression, logistic regression is an example of parametric machine learning. Thus, the result of the training process for these machine learning algorithms is a mathematical function that best approximates the patterns in the training set. But where a linear regression model outputs a real number, a logistic regression model outputs a probability value.

Just as a linear regression algorithm produces a model that is a linear function, a logistic regression algorithm produces a models that is a logistic function. You might also hear it referred to as a sigmoid function, which squashes all values to produced a probabilty result between 0 and 1.

The sigmoid function can be represented as follows:

$$ y = \frac{1}{1+e^{-x}} $$

So why does the sigmoid function always return a value between 0 and 1? Remember from algebra that raising any number to a negative exponent is the same as raising that number’s reciprocal to the corresponding positive exponent.

**Math we need**: We’ve discussed exponents and probability here, and you’ll want to have a solid understanding of both Algebra and probability to get a working knowledge of what is happening in logistic algorithms. If you want to get a deep conceptual understanding, I would recommend learning about Probability Theory as well as Discrete Mathematics or Real Analysis.

## Neural Networks

**What they are**: Neural networks are machine learning models that are very loosely inspired by the structure of neurons in the human brain. These models are built by using a series of activation units, known as neurons, to make predictions of some outcome. Neurons take some input, apply a transformation function, and return an output.

Neural networks excel at capturing nonlinear relationships in data and aid us in tasks such as audio and image processing. While there are many different kinds of neural networks (recurrent neural networks, feed forward neural networks, recurrent neural networks etc.), they all rely on that fundamental concept of transforming an input to generate an output.

When looking at a neural network of any kind, we’ll notice lines going everywhere, connecting every circle to another circle. In mathematics, this is what is known as a graph, a data structure that consists of nodes (represented as circles) that are connected by edges (represented as lines.)

Keep in mind, the graph we’re referring to here is not the same as the graph of a linear model or some other equation. If you’re familiar with the Traveling Salesman Problem, you’re probably familiar with the concepts of graphs.

At its core, neural networks is a system that takes in some data, performs some linear algebra and then outputs some answers. Linear algebra is the key to understanding what is going on behind the scenes in neural networks.

Linear Algebra is the branch of mathematics concerning linear equations such as \(y=mx+b\) and their representations through matrices and vector spaces. Because linear Algebra concerns the representation of linear equations through matrices, a matrix is the fundamental idea that you’ll need to know to even begin understanding the core part of neural networks.

A matrix is a rectangular array consisting of numbers, symbols, or expressions, arranged in rows or columns. Matrices are described in the following fashion: rows by columns. For example, the following matrix:

is called a 3 by 3 matrix because it has three rows and three columns.

With dealing with neural networks, each feature is represented as an input neuron. Each numerical value of the feature column multiples to a weight vector that represents your output neuron. Mathematically, the process is written like this:

$$ \hat{y} = Xa^{T} + b$$

where \(X\) is an \(m x n\) matrix where \(m\) is the number of input neurons there are and \(n\) is the number of neurons in the next layer. Our weights vector is denoted as \(a\), and \(a^{T}\) is the transpose of \(a\). Our bias unit is represented as \(b\). Bias units are units that influence the output of neural networls by shifting the sigmoid function to the left or right to give better predictions on some data set.

Transpose is a Linear Algebra term and all it means is the rows become columns and columns become rows. We need to take the transpose of \(a\) because the number columns of the first matrix must equal the number of rows in the second matrix when multiplying matrices. For example, if we have a \(3×3\) matrix and a weights vector that is a \(1×3\) vector, we can’t multiply outright because three does not equal one. However, if we take the transpose of the \(1×3\) vector, we get a \(3×1\) vector and we can successfully multiply our matrix with the vector.

After all of the feature columns and weights are multiplied, an activation function is called that determines whether the neuron is activated. There are three main types of activation functions: the RELU function, the sigmoid function, and the hyperbolic tangent function. We’re already familiar with the sigmoid function. The RELU function is a neat function that takes an input x and outputs the same number if it is greater than 0; however, it’s equal to 0 if the input is less than 0. The hyperbolic tangent function is essentially the same as the sigmoid function except that it constrains any value between -1 and 1.

**Math we need**: We’ve discussed quite a bit here in terms of concepts! If you want to have a basic understanding of the math presented here, a discrete mathematics course and a course in Linear Algebra are good starting points. For a deep conceptual understanding, I would recommend courses in Graph Theory, Matrix Theory, Multivariate Calculus, and Real Analysis. If you’re interested learning linear algebra fundamentals, you can get started with our Linear Algebra for Machine Learning course.

## K-Means Clustering

**What it is**: The K Means Clustering algorithm is a type of unsupervised machine learning, which is used to categorize unlabeled data, i.e. data without defined categories or groups. The algorithm works by finding groups within the data, with the number of groups represented by the variable k. It then iterates through the data to assign each data point to one of k groups based on the features provided.

K-means clustering relies on the notion of distance throughout the algorithm to “assign” data points to a cluster. If you’re not familiar with the notion of distance, it refers to the amount of space between two given items. Within mathematics, any function that describes the distance between any two elements of a set is called a distance function or metric. There are two types of metrics: the Euclidean metric and the taxicab metric.

The Euclidean metric is defined at the following:

$$ d( (x_1, y_1), (x_2, y_2) ) = \sqrt{(x_2 – x_1)^{2} + (y_2 – y_1)^{2}} $$

where \((x_1, y_1)\) and \((x_2, y_2)\) are coordinate points on a Cartesian plane.

While the Euclidean metric is sufficient, there are some cases where it does not work. Suppose you’re on a walk in a huge city; it does not make sense to say “I’m 6.5 units away from my destination” if there’s a gigantic building blocking your path.

To combat this, we can use the taxicab metric. The taxicab metric is as follows:

$$ d( (x_1, y_1),(x_2, y_2) ) = |x_1 – x_2| + |y_1 – y_2| $$

where \((x_1, y_1)\) and \((x_2, y_2)\) are coordinate points on a Cartesian plane.

**Math we need**: This one’s a bit less involved; really you just need to know to add and subtract, and understand the basics of algebra so you can grasp the distance formulas. But to get a firm understanding of the basic kinds of geometries each of those metrics exist in, I would recommend a Geometry class that covers both Euclidean and Non-Euclidean geometries. For an in-depth understanding of both metrics and what it means to be a metric space, I would read up on Mathematical Analysis and take a Real Analysis class.

## Decision Tree

**What it is**: A decision tree is a flow-chart-like tree structure that uses a branching method to illustrate every possible outcome of a decision. Each node within the tree represents a test on a specific variable - and each branch is the outcome of that test.

Decision trees rely on a theory called information theory to determine how they’re constructed. In information theory, the more one knows about a topic, the less new information one can know. One of the key measures in information theory is known as entropy. Entropy is a measure which quantifies the amount of uncertainty for a given variable.

Entropy can be written like this:

$$ \hbox{Entropy} = -\sum_{i=1}^{n}P(x_i)\log_b P(x_i) $$

In the above equation, \(P(x)\) is the probability of the feature occurring in the data set. It should be noted that any base b can be used for the logarithm; however, common values are 2, \(e\) (2.71), and 10.

You might have noted the fancy symbol that looks like an “S”. That is the summation symbol and it means continuously add whatever function is outside the summation for as many times as you can. How many times you add is dictated by the lower and upper limits of the summation.

After computing entropy, we can start constructing out decision tree by using information gain, which tells which split will reduce entropy the most.

The formula for information gain is written as:

$$ IG(T,A) = \hbox{Entropy}(T) – \sum_{v \in A} \frac{|T_v|}{|T|} \cdot \hbox{Entropy}(T_v) $$

Information gain measures how many “bits” of information someone can gain. In the case of decision trees, we can calculate the information gain on each of our columns in the data set in order to find what column will give us maximum information gain and then split on that column.

**Math we need**: Basic Algebra and probability are all you really need to scrape the surface of decision trees. If you want a deep conceptual understanding of probability and the logarithm, I would recommend courses in Probability Theory and Algebra.

# Final Thoughts

If you’re still in school, I highly recommend taking some pure and applied mathematics courses. They can definitely feel daunting at times, but you can take solace in the fact that you will be better equipped when you encounter these algorithms and know how to best apply them.

If you’re not currently in school, I recommend hopping over to your closest book store and reading up on the topics highlighted in this post. If you can find books that deal with Probability, Statistics, and Linear Algebra, I would strongly recommend picking up books that cover each of those topics in depth to truly get a feel for what is going on behind the scenes of both the machine algorithms covered in this post and ones that aren’t covered in this post.

Math is everywhere in data science. And while some data science algorithms feel like magic at times, we can understand the ins and outs of many algorithms without needing much more than algebra and elementary probability and statistics.

Don’t want to learn any math? Technically, you *can* rely on machine learning libraries such as scikit-learn to do all of this for you. But it’s very helpful for a data scientist to have a solid understanding of the math and statistics behind those algorithms so they can choose the best algorithm for their problems and datasets and thus make more accurate predictions.

So embrace the pain, and dive into the math! It’s not as tough as you think, and we’ve even got courses on a few of these topics to get you started:

Marketing Analyst at Dataquest.io. Randall is also a self-described math and data geek.