May 31, 2022

# Apply Functions in R with Examples [apply(), sapply(), lapply(), tapply()]

## In this tutorial, we’ll learn about the apply() function in R, including when to use it and why it’s more efficient than loops.

The apply() function is the basic model of the family of apply functions in R, which includes specific functions like lapply(), sapply(), tapply(), mapply(), vapply(), rapply(), eapply(), and others. All of these functions allow us to iterate over a data structure (such as a list, a matrix, an array, a DataFrame, or a selected slice of a given data structure) and perform the same operation on each element.

Such operations can involve aggregation (i.e., calculating summary statistics like mean, max, min, sum, etc.), transformation, or any other function, either simple or complex, built-in or custom. The functions of the apply family differ in the types of input they accept, the types of output they return, and the way they apply the function.

In comparison to the more conventional approach of using loop constructs for the same purpose, the apply() function and its variations offer compact, one-line syntax instead of a code block that spans multiple lines, and they frequently run as fast as, or faster than, an equivalent loop (especially a loop that grows its result step by step). This becomes particularly convenient when working with large datasets.
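To make the comparison concrete, here is a minimal sketch that computes the same row maxima with a for loop and with apply(). The matrix `m` and the variable names are illustrative assumptions, not part of the examples that follow, and the sketch makes no claim about timing.

```r
# Hypothetical example matrix: 3 rows, 4 columns, filled column by column
m <- matrix(1:12, nrow = 3)

# Loop version: pre-allocate the result, then fill it row by row
row_max_loop <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  row_max_loop[i] <- max(m[i, ])
}

# apply() version: the same computation in a single call
row_max_apply <- apply(m, 1, max)

print(row_max_loop)   # 10 11 12
print(row_max_apply)  # 10 11 12
```

Both versions produce the same values; the apply() call simply removes the bookkeeping (pre-allocation and indexing) that the loop needs.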

## How to Use the apply() Function (and Its Varieties) in R

Let’s explore some of the most useful varieties of the apply functions in R.

### apply()

We’ll start with the main function of the apply group: apply(). It takes a DataFrame, a matrix, or a multi-dimensional array as input and, depending on the input object type and the function passed in, outputs a vector, a list, a matrix, or an array.

The basic syntax of the apply() function is very simple and has three main parameters:

apply(X, MARGIN, FUN)

Here X is an input object (a DataFrame, a matrix, or an array); MARGIN determines how the function is applied (1 for row-wise, 2 for column-wise, and c(1,2) for each individual cell, i.e., every combination of row and column); and FUN is the function (built-in or custom) to apply to the input data.
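Beyond these three parameters, apply() also accepts further arguments after FUN and forwards them to the function. The sketch below illustrates this with `na.rm = TRUE` passed through to mean(); the matrix `m` with a missing value is a made-up example.

```r
# Hypothetical 2 x 3 matrix with one missing value (filled column by column)
m <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]   NA    4    6

# Extra arguments after FUN are forwarded to FUN (here: mean(x, na.rm = TRUE))
row_means <- apply(m, 1, mean, na.rm = TRUE)  # MARGIN = 1: one value per row
col_means <- apply(m, 2, mean, na.rm = TRUE)  # MARGIN = 2: one value per column

print(row_means)  # 3 5
print(col_means)  # 1.0 3.5 5.5
```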

Let’s look at some examples. We’ll use a matrix as an input data structure, but the same principle works for the other possible data structures:

my_matrix <- matrix((1:12), nrow=3)
print(my_matrix)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

For example, we may want to find the maximum value for each row of our matrix. For this purpose, we’ll set the MARGIN parameter to 1 and pass in the max function:

print(apply(my_matrix, 1, max))
[1] 10 11 12

In the code above, we effectively performed an aggregation on the input matrix (which is a two-dimensional data structure). As a result, the output is a vector (which is a one-dimensional data structure) containing the corresponding maximum values for each row.

Now, let’s calculate the sum of values of the matrix by column (MARGIN=2):

print(apply(my_matrix, 2, sum))
[1]  6 15 24 33

The output data structure is again a vector. The sum function is another example of an aggregation function that reduces the dimensionality of an input object by 1.

In some cases, we may need to calculate a cumulative sum by column:

print(apply(my_matrix, 2, cumsum))
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    3    9   15   21
[3,]    6   15   24   33

This time, we obtained a matrix of the same size as the input one since the cumsum function computes a value for each value of the input matrix.

Note that a non-aggregation function doesn’t always return an output object of the same size as the input object. For example, we might want the range (the minimum and maximum) of values by column:

print(apply(my_matrix, 2, range))
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    3    6    9   12

Here, we also got a matrix as an output object but of a different size than the input one (2×4 rather than 3×4).
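The shape of such results can be checked directly with dim(). When FUN returns a vector of length n greater than 1, apply() stores each result as a column, so a row-wise call also yields an n-row matrix. The sketch below rebuilds the same matrix so that it runs on its own; the names `by_col` and `by_row` are illustrative.

```r
my_matrix <- matrix(1:12, nrow = 3)

by_col <- apply(my_matrix, 2, range)  # one (min, max) pair per column
by_row <- apply(my_matrix, 1, range)  # one (min, max) pair per row

print(dim(by_col))  # 2 4
print(dim(by_row))  # 2 3: each column holds the result for one row

# Transpose to get one output row per input row
print(t(by_row))
```

The transpose step is a common follow-up when the results of a row-wise call should line up with the rows of the input.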

It is possible to provide any custom function to apply(). Let’s define a function that calculates the mean of squared values for each input:

mean_squared_vals <- function(x) mean(x^2)

Just as we did earlier, we can apply this function by row (MARGIN=1):

print(apply(my_matrix, 1, mean_squared_vals))
[1] 41.5 53.5 67.5

We can also apply the function by column (MARGIN=2):

print(apply(my_matrix, 2, mean_squared_vals))
[1]   4.666667  25.666667  64.666667 121.666667

Finally — and this is something we haven’t tried yet — we can apply it by both rows and columns (MARGIN=c(1,2)):

print(apply(my_matrix, c(1,2), mean_squared_vals))
     [,1] [,2] [,3] [,4]
[1,]    1   16   49  100
[2,]    4   25   64  121
[3,]    9   36   81  144

In the last case, we got a matrix where each value is the square of the corresponding value of the input matrix. Since the mean component of our user-defined function was applied to only one value at each iteration, the squared value itself was returned. So, in this particular case, the mean operation has no effect.
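For a cell-by-cell operation like this one, plain vectorized arithmetic on the matrix gives the same result without apply(). The following sketch compares the two approaches; the variable names are illustrative.

```r
my_matrix <- matrix(1:12, nrow = 3)

# Cell-wise application: FUN receives one value at a time
squared_apply <- apply(my_matrix, c(1, 2), function(x) x^2)

# Vectorized arithmetic: the operator works on the whole matrix at once
squared_vec <- my_matrix^2

print(all.equal(squared_apply, squared_vec))  # TRUE
```

MARGIN=c(1,2) is mainly useful when the function cannot operate on a whole vector or matrix at once.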

### lapply()

The lapply() function is a variety of apply() that takes in a vector, a list, or a DataFrame as input and always outputs a list ("l" in the function name stands for "list"). The specified function applies to each element of the input object, hence the length of the resulting list is always equal to the input object’s length.

The syntax of this function is similar to the syntax of apply(), only here there is no need for the MARGIN parameter since the function applies element-wise for lists and vectors and column-wise for DataFrames:

lapply(X, FUN)

Let’s see how it works on vectors, lists, and DataFrames. First, we’ll create a simple function that adds 1 to an input value:

add_one <- function(x) x+1

Let’s test it on a vector:

my_vector <- c(1, 2, 3)
print(lapply(my_vector, add_one))
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] 4

We added 1 to each value of the vector.

Now, we will create a list:

my_list <- list(TRUE, c(1, 2, 3), 10)
print(my_list)
[[1]]
[1] TRUE

[[2]]
[1] 1 2 3

[[3]]
[1] 10

Now we will apply our function on it:

print(lapply(my_list, add_one))
[[1]]
[1] 2

[[2]]
[1] 2 3 4

[[3]]
[1] 11

Since TRUE is coerced to 1 in arithmetic, adding 1 to it gives 2 for the first item of the resulting list. For the vector item, 1 was added to each of its values.

Finally, let’s use lapply() on a DataFrame:

my_df <- data.frame(a=1:3, b=4:6, c=7:9, d=10:12)
print(my_df)
  a b c  d
1 1 4 7 10
2 2 5 8 11
3 3 6 9 12
print(lapply(my_df, add_one))
$a
[1] 2 3 4

$b
[1] 5 6 7

$c
[1]  8  9 10

$d
[1] 11 12 13

As we mentioned earlier, the lapply() function applies column-wise for DataFrames.
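Two details are worth seeing in code: lapply() keeps the names of its input, and, like apply(), it forwards extra arguments to the function. The list `scores` below is a hypothetical example.

```r
# Hypothetical named list of numeric vectors, one with a missing value
scores <- list(math = c(90, 85, NA), art = c(70, 95, 80))

# na.rm = TRUE is forwarded to mean() for every element
means <- lapply(scores, mean, na.rm = TRUE)

print(names(means))   # "math" "art"
print(means$math)     # 87.5

# unlist() flattens the resulting list into a named vector
print(unlist(means))
```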

### sapply()

The sapply() function is a simplified form of lapply() ("s" in the function name stands for "simplified"). It has the same syntax as lapply() (i.e., sapply(X, FUN)); takes in a vector, a list, or a DataFrame as input, just as lapply() does; and tries to reduce the output object to the simplest possible data structure. By default, it returns a vector when every result has length 1, a matrix when every result has the same length greater than 1, and a list otherwise. With our add_one function, this gives a vector for a vector, a list for a list whose items have different lengths, and a matrix for a DataFrame.

Let’s try it on our variables my_vector, my_list, and my_df using the same custom function add_one as earlier:

print(sapply(my_vector, add_one))
[1] 2 3 4
print(sapply(my_list, add_one))
[[1]]
[1] 2

[[2]]
[1] 2 3 4

[[3]]
[1] 11
print(sapply(my_df, add_one))
     a b  c  d
[1,] 2 5  8 11
[2,] 3 6  9 12
[3,] 4 7 10 13
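More precisely, the shape of the sapply() result depends on the length of what the function returns for each element. The sketch below shows the three cases on small inputs; the variable names are illustrative.

```r
my_df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Every result has length 1 -> named vector
col_sums <- sapply(my_df, sum)

# Every result has length 2 -> matrix with 2 rows and one column per input column
col_ranges <- sapply(my_df, range)

# Results of different lengths -> no simplification, a list is returned
mixed <- sapply(list(1:2, 1:3), function(x) x * 2)

print(col_sums)
print(dim(col_ranges))  # 2 4
print(class(mixed))     # "list"
```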

We can change the default behavior of the sapply() function by passing in the optional parameter simplify=FALSE (by default, it is TRUE). In this case, the sapply() function behaves like lapply() and always outputs a list for any valid input data structure:

print(typeof(sapply(my_vector, add_one, simplify=FALSE)))
print(typeof(sapply(my_list, add_one, simplify=FALSE)))
print(typeof(sapply(my_df, add_one, simplify=FALSE)))
[1] "list"
[1] "list"
[1] "list"
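As a quick check, the sketch below compares sapply(..., simplify=FALSE) with lapply() on a numeric vector. It also shows vapply(), a member of the apply family that requires the expected type and length of each result to be declared up front and raises an error if a result does not match. The variable names are illustrative.

```r
add_one <- function(x) x + 1
my_vector <- c(1, 2, 3)

# Without simplification, sapply() returns the same list as lapply()
as_list <- sapply(my_vector, add_one, simplify = FALSE)
print(identical(as_list, lapply(my_vector, add_one)))  # TRUE

# vapply(): each result must be a single numeric value
checked <- vapply(my_vector, add_one, FUN.VALUE = numeric(1))
print(checked)  # 2 3 4
```

Declaring FUN.VALUE makes the output type predictable, which can be helpful inside functions and scripts where a silently changed return shape would cause problems later.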

### tapply()

We use the tapply() function for calculating summary statistics (such as mean, median, min, max, sum, etc.) for different factors (i.e., categories). It has the following syntax:

tapply(X, INDEX, FUN)

Here X is an R object, typically a vector, containing numeric data; INDEX is an R object, typically a vector or a list of vectors, containing the grouping factor(s), each of the same length as X; and FUN is the function to be applied to each group of values in X.

To see how it works, let’s imagine we have information about the salaries of a group of people with data-related jobs: Data Scientist (DS), Data Analyst (DA), and Data Engineer (DE). Using the tapply() function, we can calculate the mean salary by job title.

(Side note: as a rough guide, here we used the information from Indeed to estimate the mean salary by role in the USA, February 2022.)

salaries <- c(80000, 62000, 113000, 68000, 75000, 79000, 112000, 118000, 65000, 117000)
jobs <- c('DS', 'DA', 'DE', 'DA', 'DS', 'DS', 'DE', 'DE', 'DA', 'DE')
print(tapply(salaries, jobs, mean))
    DA     DE     DS 
 65000 115000  78000 
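One way to see what tapply() does is to reproduce it in two steps: split() divides the values into groups, and sapply() applies the function to each group. The sketch below uses the same data with max instead of mean; the names `by_job` and `via_split` are illustrative.

```r
salaries <- c(80000, 62000, 113000, 68000, 75000, 79000, 112000, 118000, 65000, 117000)
jobs <- c('DS', 'DA', 'DE', 'DA', 'DS', 'DS', 'DE', 'DE', 'DA', 'DE')

# Highest salary per job title with tapply()
by_job <- tapply(salaries, jobs, max)

# The same grouping done explicitly: split into groups, then apply to each group
via_split <- sapply(split(salaries, jobs), max)

print(by_job)
print(via_split)
```

Both calls report 68000 for DA, 118000 for DE, and 80000 for DS, with the groups in alphabetical order.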

## Conclusion

To sum up, we learned many things about using the apply functions in R. Now we know the following:

• What the apply() function in R is and how to call it
• The varieties of the function and which are the most common
• Why the functions of the apply family are more efficient than loops
• When each of the common varieties of the apply function is applicable
• The syntax of each variety
• The types of input each variety takes and how to use it on different types of input