Apply Functions in R with Examples [apply(), sapply(), lapply(), tapply()]
In this tutorial, we'll learn about the apply() function in R, including when to use it and why it's more efficient than loops.
The apply() function is the basic model of the family of apply functions in R, which includes more specific functions like lapply(), sapply(), tapply(), eapply(), and others. All of these functions allow us to iterate over a data structure, such as a list, a matrix, an array, a DataFrame, or a selected slice of one of these, and perform the same operation on each element.
Such operations can involve aggregation (e.g., calculating summary statistics like mean, max, min, or sum), transformation, or any other function, either simple or complex, built-in or custom. The functions of the apply family differ in the types of input they accept, the types of output they return, and the way they apply the function.
In comparison to the more conventional approach of using loop constructs for the same purpose, the apply() function and its variations offer compact, one-line syntax instead of a code block that spans multiple lines, and they can run faster than naively written loops (for example, loops that grow the result object on every iteration). This becomes particularly important when working with large datasets.
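To make the comparison concrete, here is a minimal sketch that computes the maximum of each row twice, once with an explicit loop and once with apply(). The matrix `m` and the variable names are illustrative assumptions chosen for this sketch.

```r
# Illustrative matrix (hypothetical example data)
m <- matrix(1:12, nrow = 3)

# Loop version: preallocate the result, then fill it row by row
row_max_loop <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  row_max_loop[i] <- max(m[i, ])
}

# apply() version: the same computation in one line
row_max_apply <- apply(m, 1, max)

print(row_max_loop)
print(row_max_apply)
```

Both versions return the same values; the apply() call is simply shorter to write and to read.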
How to Use the apply() Function (and Its Varieties) in R
Let's explore some of the most useful varieties of the apply functions in R.
We'll start with the main function of the apply group: apply(). It takes a DataFrame, a matrix, or a multi-dimensional array as input and, depending on the input object type and the function passed in, outputs a vector, a list, a matrix, or an array.
The syntax of the apply() function is very simple and has three main parameters:
apply(X, MARGIN, FUN)
X is an input object (a DataFrame, a matrix, or an array),
MARGIN is the parameter that determines how the function is applied (it can take the values 1, 2, or c(1,2), meaning that the function is applied row-wise, column-wise, or both row- and column-wise, respectively), and
FUN is the function (built-in or custom) to apply to the input data.
Let's look at some examples. We'll use a matrix as an input data structure, but the same principle works for the other possible data structures:
my_matrix <- matrix(1:12, nrow=3)
print(my_matrix)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
For example, we may want to find the maximum value for each row of our matrix. For this purpose, we'll set the MARGIN parameter to 1 and pass in the max function:
print(apply(my_matrix, 1, max))
[1] 10 11 12
In the code above, we effectively performed an aggregation on the input matrix (which is a two-dimensional data structure). As a result, the output is a vector (which is a one-dimensional data structure) containing the maximum value of each row.
Now, let's calculate the sum of values of the matrix by column (MARGIN = 2):
print(apply(my_matrix, 2, sum))
[1]  6 15 24 33
The output data structure is again a vector. The sum function is another example of an aggregation function that reduces the dimensionality of an input object by 1.
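For the common cases of sums and means, base R also provides the dedicated functions rowSums(), colSums(), rowMeans(), and colMeans(). The following sketch redefines the matrix so that it runs on its own and checks that colSums() agrees with the apply() call above.

```r
my_matrix <- matrix(1:12, nrow = 3)

# Column sums computed two ways
col_sums_apply <- apply(my_matrix, 2, sum)
col_sums_builtin <- colSums(my_matrix)

# TRUE when both approaches give the same values
print(all(col_sums_apply == col_sums_builtin))
```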
In some cases, we may need to calculate a cumulative sum by column:
print(apply(my_matrix, 2, cumsum))
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    3    9   15   21
[3,]    6   15   24   33
This time, we obtained a matrix of the same size as the input one since the cumsum function computes a value for each value of the input matrix.
Note that a non-aggregation function doesn't always return an output object of the same size as the input object. For example, we might want the range of values (the minimum and the maximum) in each column:
print(apply(my_matrix, 2, range))
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    3    6    9   12
Here, we also got a matrix as an output object but of a different size than the input one (2x4 rather than 3x4).
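The same call also works on a DataFrame whose columns are all numeric, because apply() converts such an input to a matrix before applying the function. The sketch below uses a small hypothetical DataFrame, `scores_df`, invented purely for illustration.

```r
# Hypothetical all-numeric DataFrame (names are illustrative)
scores_df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Minimum and maximum of each column; the result has 2 rows and 4 columns
col_ranges <- apply(scores_df, 2, range)
print(col_ranges)
```

The column names of the DataFrame are carried over to the columns of the resulting matrix.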
It is possible to provide any custom function to apply(). Let's define a function that calculates the mean of squared values for each input:
mean_squared_vals <- function(x) mean(x^2)
Just as we did earlier, we can apply this function by row (MARGIN = 1):
print(apply(my_matrix, 1, mean_squared_vals))
[1] 41.5 53.5 67.5
We can also apply the function by column (MARGIN = 2):
print(apply(my_matrix, 2, mean_squared_vals))
[1]   4.666667  25.666667  64.666667 121.666667
Finally, and this is something we haven't tried yet, we can apply it by both rows and columns (MARGIN = c(1, 2)):
print(apply(my_matrix, c(1,2), mean_squared_vals))
     [,1] [,2] [,3] [,4]
[1,]    1   16   49  100
[2,]    4   25   64  121
[3,]    9   36   81  144
In the last case, we got a matrix where each value is the square of the corresponding value of the input matrix. Since the mean component of our user-defined function was applied to only one value at each iteration, the value itself was returned. So, in this particular case, the mean operation is redundant.
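Because arithmetic operators in R are already vectorized, an element-wise operation like this one does not need apply() at all. The short check below is a sketch that redefines the objects so it is self-contained, and it confirms that squaring the matrix directly gives the same values.

```r
my_matrix <- matrix(1:12, nrow = 3)
mean_squared_vals <- function(x) mean(x^2)

# Element-wise application through apply()
elementwise_apply <- apply(my_matrix, c(1, 2), mean_squared_vals)

# The same result using the vectorized ^ operator
elementwise_direct <- my_matrix^2

# TRUE when every element matches
print(all(elementwise_apply == elementwise_direct))
```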
The lapply() function is a variety of apply() that takes in a vector, a list, or a DataFrame as input and always outputs a list ("l" in the function name stands for "list"). The specified function is applied to each element of the input object, hence the length of the resulting list is always equal to the input object's length.
The syntax of this function is similar to the syntax of apply(), only here there is no need for the MARGIN parameter since the function applies element-wise for lists and vectors and column-wise for DataFrames:
lapply(X, FUN)
Let's see how it works on vectors, lists, and DataFrames. First, we'll create a simple function that adds 1 to an input value:
add_one <- function(x) x+1
Let's test it on a vector:
my_vector <- c(1, 2, 3)
print(lapply(my_vector, add_one))
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] 4
We added 1 to each value of the vector.
Now, we will create a list:
my_list <- list(TRUE, c(1, 2, 3), 10)
print(my_list)
[[1]]
[1] TRUE

[[2]]
[1] 1 2 3

[[3]]
[1] 10

Now we will apply our function to it:
print(lapply(my_list, add_one))
[[1]]
[1] 2

[[2]]
[1] 2 3 4

[[3]]
[1] 11
Since TRUE is coerced to 1 in arithmetic, adding 1 to it gives 2 for the first item of the resulting list. In the case of the vector item, 1 was added to each of its values.
Finally, let's use lapply() on a DataFrame:
my_df <- data.frame(a=1:3, b=4:6, c=7:9, d=10:12)
print(my_df)
  a b c  d
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
print(lapply(my_df, add_one))
$a
[1] 2 3 4

$b
[1] 5 6 7

$c
[1]  8  9 10

$d
[1] 11 12 13
As we mentioned earlier, the lapply() function applies column-wise for DataFrames.
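Since the result is a named list with one element per column, individual results can be retrieved by column name. Here is a small sketch using an anonymous function; the DataFrame is redefined so that the block runs on its own, and `col_means` is a name chosen for illustration.

```r
my_df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Mean of each column, returned as a named list
col_means <- lapply(my_df, function(col) mean(col))

# The list keeps the column names of the DataFrame
print(names(col_means))

# Access a single result by name
print(col_means$b)

# One list element per column
print(length(col_means) == ncol(my_df))
```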
The sapply() function is a simplified form of lapply() ("s" in the function name stands for "simplified"). It has the same syntax as lapply() (i.e., sapply(X, FUN)); takes in a vector, a list, or a DataFrame as input, just as lapply() does; and tries to reduce the output object to the simplest possible data structure. For the example objects used here, that means that, by default, the sapply() function outputs a vector for our vector, a list for our list (because its results have different lengths), and a matrix for our DataFrame.
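Let's try it on our variables my_vector, my_list, and my_df using the same custom function add_one as earlier:
print(sapply(my_vector, add_one))
print(sapply(my_list, add_one))
print(sapply(my_df, add_one))
[1] 2 3 4
[[1]]
[1] 2

[[2]]
[1] 2 3 4

[[3]]
[1] 11

     a b  c  d
[1,] 2 5  8 11
[2,] 3 6  9 12
[3,] 4 7 10 13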
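The shape of the simplified result depends on the lengths of the values returned by the function rather than on the input type alone. The following sketch, which uses only base R and illustrative inputs, shows the three typical outcomes.

```r
# Each result has length 1, so sapply() returns an atomic vector
lengths_one <- sapply(list(1:3, 4:6), sum)

# Each result has the same length (2), so sapply() returns a matrix
# with one column per input element
same_length <- sapply(list(1:3, 4:6), range)

# Results have different lengths, so sapply() falls back to a list
mixed_length <- sapply(list(1:2, 1:5), function(x) x + 1)

print(is.vector(lengths_one) && !is.list(lengths_one))
print(is.matrix(same_length))
print(is.list(mixed_length))
```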
We can change the default behavior of the sapply() function by passing in the optional parameter simplify=FALSE (by default, it is TRUE). In this case, the sapply() function behaves like lapply() and always outputs a list for any valid input data structure:
print(typeof(sapply(my_vector, add_one, simplify=FALSE)))
print(typeof(sapply(my_list, add_one, simplify=FALSE)))
print(typeof(sapply(my_df, add_one, simplify=FALSE)))
[1] "list"
[1] "list"
[1] "list"
We use the tapply() function for calculating summary statistics (such as mean, median, min, max, sum, etc.) for different factors (i.e., categories). It has the following syntax:
tapply(X, INDEX, FUN)
X is an R object, typically a vector, containing numeric data;
INDEX is an R object, typically a vector or a list, containing factors; and
FUN is the function to be applied on X.
To see how it works, let's imagine we have information about the salaries of a group of people with data-related jobs: Data Scientist (DS), Data Analyst (DA), and Data Engineer (DE). Using the tapply() function, we can calculate the mean salary by job title.
(Side note: as a rough guide, here we used the information from Indeed to estimate the mean salary by role in the USA, February 2022.)
salaries <- c(80000, 62000, 113000, 68000, 75000, 79000, 112000, 118000, 65000, 117000)
jobs <- c('DS', 'DA', 'DE', 'DA', 'DS', 'DS', 'DE', 'DE', 'DA', 'DE')
print(tapply(salaries, jobs, mean))
    DA     DE     DS 
 65000 115000  78000 
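Any other summary function can be swapped in for mean. As a sketch, the block below reuses the same example data (repeated so that it runs on its own) to find the highest salary per job title; `max_salary_by_job` is a name chosen for illustration.

```r
salaries <- c(80000, 62000, 113000, 68000, 75000, 79000, 112000, 118000, 65000, 117000)
jobs <- c('DS', 'DA', 'DE', 'DA', 'DS', 'DS', 'DE', 'DE', 'DA', 'DE')

# Highest salary within each job title
max_salary_by_job <- tapply(salaries, jobs, max)
print(max_salary_by_job)

# Individual groups can be looked up by name
print(max_salary_by_job["DE"])
```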
To sum up, we learned many things about using the apply functions in R. Now we know the following:
- How to define the apply() function in R
- The varieties of the function and which are the most common
- Why the functions of the apply family are more efficient than loops
- When each of the common varieties of the apply function is applicable
- The syntax of each variety
- The types of input each variety takes and how to use it on different types of input