How to Add a Column to a DataFrame in R (with 18 Code Examples)
In this tutorial, we'll consider one of the most common operations used for manipulating DataFrames in R: how to add a column to a DataFrame in base R.
A DataFrame is one of the basic data structures of the R programming language. It is also a very versatile data structure since it can store multiple data types and be easily modified and updated.
What Is a DataFrame in R?
Technically speaking, a DataFrame in R is a specific case of a list of vectors of the same length, where different vectors can be (and usually are) of different data types. Since a DataFrame has a tabular, 2-dimensional form, it has columns (variables) and rows (data entries).
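The "list of vectors of the same length" description can be checked directly. The snippet below is a minimal sketch using a small, hypothetical DataFrame called df (not part of the tutorial's data); it only uses base R functions.

```r
# A small illustrative DataFrame (the name `df` and its columns are hypothetical)
df <- data.frame(id = 1:3,
                 name = c('a', 'b', 'c'),
                 flag = c(TRUE, FALSE, TRUE))

print(is.list(df))        # a DataFrame is a list under the hood
print(sapply(df, class))  # each column (vector) has its own data type
print(lengths(df))        # every column has the same length
print(dim(df))            # number of rows and columns of the table
```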
Adding a Column to a DataFrame in R
We may want to add a new column to an R DataFrame for various reasons: to calculate a new variable based on the existing ones, to add a new column based on an available one but in a different format (keeping both columns), to append an empty or placeholder column to fill in later, or to add a column containing completely new information.
Let's explore different ways of adding a new column to a DataFrame in R. For our experiments, we'll mostly be using the same DataFrame, called super_sleepers, which we'll reconstruct each time from the following initial DataFrame:
super_sleepers_initial <- data.frame(rating=1:4,
animal=c('koala', 'hedgehog', 'sloth', 'panda'),
country=c('Australia', 'Italy', 'Peru', 'China'))
print(super_sleepers_initial)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
Our task will be to add to this DataFrame a new column called avg_sleep_hours representing the average time in hours that each of the above animals sleeps per day, according to the following scheme:
Animal | Avg hrs of sleep per day |
---|---|
koala | 21 |
hedgehog | 18 |
sloth | 17 |
panda | 10 |
For some examples, we'll experiment with adding two other columns: avg_sleep_hours_per_year and has_tail.
Now, let's dive in.
Adding a Column to a DataFrame in R Using the $ Symbol
Since a DataFrame in R is a list of vectors where each vector represents an individual column of that DataFrame, we can add a column to a DataFrame just by adding the corresponding new vector to this "list". The syntax is as follows:
dataframe_name$new_column_name <- vector
Let's reconstruct our super_sleepers DataFrame from the initial super_sleepers_initial DataFrame (we'll do so for each subsequent experiment) and add to it a column called avg_sleep_hours represented by the vector c(21, 18, 17, 10):
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n') # printing an empty line
# Adding a new column `avg_sleep_hours` to the `super_sleepers` DataFrame
super_sleepers$avg_sleep_hours <- c(21, 18, 17, 10)
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
rating animal country avg_sleep_hours
1 1 koala Australia 21
2 2 hedgehog Italy 18
3 3 sloth Peru 17
4 4 panda China 10
Note that the number of items in the vector must equal the current number of rows in the DataFrame (or divide it evenly, in which case R recycles the values); otherwise, R throws an error:
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n')
# Attempting to add a new column `avg_sleep_hours` to the `super_sleepers` DataFrame
# with the number of items in the vector NOT EQUAL to the number of rows in the DataFrame
super_sleepers$avg_sleep_hours <- c(21, 18, 17)
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
Error in `$<-.data.frame`(`*tmp*`, avg_sleep_hours, value = c(21, 18, 17)): replacement has 3 rows, data has 4
Traceback:
1. `$<-`(`*tmp*`, avg_sleep_hours, value = c(21, 18, 17))
2. `$<-.data.frame`(`*tmp*`, avg_sleep_hours, value = c(21, 18, 17))
3. stop(sprintf(ngettext(N, "replacement has
. "replacement has
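One way to avoid this error is to compare the vector's length with the number of rows before assigning. The sketch below assumes that the vector is too short and that the missing values belong at the end, so it pads the vector with NA; the variable name sleep_hours is introduced here only for illustration.

```r
super_sleepers <- data.frame(rating = 1:4,
                             animal = c('koala', 'hedgehog', 'sloth', 'panda'),
                             country = c('Australia', 'Italy', 'Peru', 'China'))

sleep_hours <- c(21, 18, 17)  # one value short

# Check the length before assigning to avoid the mismatch error
if (length(sleep_hours) < nrow(super_sleepers)) {
  # Extending a vector's length pads it with NA at the end
  length(sleep_hours) <- nrow(super_sleepers)
}

super_sleepers$avg_sleep_hours <- sleep_hours
print(super_sleepers)
```

Whether padding with NA is appropriate depends on the data; in many cases a length mismatch signals a real problem that is better fixed at its source.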
Instead of assigning a vector, we can assign a single value, whether numeric or character, for all the rows of a new column:
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n')
# Adding a new column `avg_sleep_hours` to the `super_sleepers` DataFrame and setting it to 0
super_sleepers$avg_sleep_hours <- 0
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
rating animal country avg_sleep_hours
1 1 koala Australia 0
2 2 hedgehog Italy 0
3 3 sloth Peru 0
4 4 panda China 0
In this case, the new column plays the role of a placeholder for the real values of the specified data type (in the above case, numeric) that we can insert later.
Alternatively, we can calculate a new column based on the existing ones. Let's first add the avg_sleep_hours column to our DataFrame and then calculate a new column avg_sleep_hours_per_year from it. We want to know how many hours these animals sleep on average per year:
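If a dummy value such as 0 could later be mistaken for real data, a typed missing value is a possible alternative placeholder. The following is a minimal sketch, assuming we want a numeric and a character placeholder column; NA_real_ and NA_character_ are base R constants.

```r
super_sleepers <- data.frame(rating = 1:4,
                             animal = c('koala', 'hedgehog', 'sloth', 'panda'))

# Typed missing values keep the intended data type of each placeholder column
super_sleepers$avg_sleep_hours <- NA_real_   # numeric placeholder
super_sleepers$has_tail <- NA_character_     # character placeholder
print(sapply(super_sleepers, class))

# Filling in a real value later, one animal at a time
super_sleepers$avg_sleep_hours[super_sleepers$animal == 'koala'] <- 21
print(super_sleepers)
```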
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n')
# Adding a new column `avg_sleep_hours` to the `super_sleepers` DataFrame
super_sleepers$avg_sleep_hours <- c(21, 18, 17, 10)
print(super_sleepers)
cat('\n\n')
# Adding a new column `avg_sleep_hours_per_year` calculated from `avg_sleep_hours`
super_sleepers$avg_sleep_hours_per_year <- super_sleepers$avg_sleep_hours * 365
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
rating animal country avg_sleep_hours
1 1 koala Australia 21
2 2 hedgehog Italy 18
3 3 sloth Peru 17
4 4 panda China 10
rating animal country avg_sleep_hours avg_sleep_hours_per_year
1 1 koala Australia 21 7665
2 2 hedgehog Italy 18 6570
3 3 sloth Peru 17 6205
4 4 panda China 10 3650
Also, it's possible to copy a column from one DataFrame to another using the following syntax:
df1$new_col <- df2$existing_col
Let's replicate such a situation:
# Creating the `super_sleepers_1` DataFrame with the only column `rating`
super_sleepers_1 <- data.frame(rating=1:4)
print(super_sleepers_1)
cat('\n\n')
# Copying the `animal` column from `super_sleepers_initial` to `super_sleepers_1`
# Note that in the new DataFrame, the column is called `ANIMAL` instead of `animal`
super_sleepers_1$ANIMAL <- super_sleepers_initial$animal
print(super_sleepers_1)
Output:
rating
1 1
2 2
3 3
4 4
rating ANIMAL
1 1 koala
2 2 hedgehog
3 3 sloth
4 4 panda
The drawback of this approach (i.e., using the $ operator to append a column to a DataFrame) is that it doesn't directly support a column name containing white spaces or special symbols: an unquoted name can consist only of letters (upper- or lowercase), numbers, dots, and underscores, so any other name would have to be wrapped in backticks. Also, this approach adds only one column per assignment.
Adding a Column to a DataFrame in R Using Square Brackets
Another way of adding a new column to an R DataFrame is more "DataFrame-style" rather than "list-style": by using bracket notation. Let's see how it works:
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n')
# Adding a new column `avg_sleep_hours` to the `super_sleepers` DataFrame:
super_sleepers['avg_sleep_hours'] <- c(21, 18, 17, 10)
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
rating animal country avg_sleep_hours
1 1 koala Australia 21
2 2 hedgehog Italy 18
3 3 sloth Peru 17
4 4 panda China 10
In the piece of code above, we can substitute this line:
super_sleepers['avg_sleep_hours'] <- c(21, 18, 17, 10)
with this line:
super_sleepers[['avg_sleep_hours']] <- c(21, 18, 17, 10)
or with this one:
super_sleepers[, 'avg_sleep_hours'] <- c(21, 18, 17, 10)
The result will be identical; these are just 3 different versions of the syntax.
As with the previous method, we can assign a single value instead of a vector to the new column:
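To check that the bracket forms behave the same way, we can apply each one to its own copy of a DataFrame and compare the results. This is only a sketch: the names base_df, df_a, df_b, and df_c are hypothetical, and all.equal() is used for the comparison.

```r
base_df <- data.frame(rating = 1:4)
values <- c(21, 18, 17, 10)

df_a <- base_df; df_a['avg_sleep_hours'] <- values     # single brackets
df_b <- base_df; df_b[['avg_sleep_hours']] <- values   # double brackets
df_c <- base_df; df_c[, 'avg_sleep_hours'] <- values   # row-and-column form

print(all.equal(df_a, df_b))
print(all.equal(df_a, df_c))
```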
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n')
# Adding a new column `avg_sleep_hours` to the `super_sleepers` DataFrame and setting it to 'Unknown'
super_sleepers['avg_sleep_hours'] <- 'Unknown'
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
rating animal country avg_sleep_hours
1 1 koala Australia Unknown
2 2 hedgehog Italy Unknown
3 3 sloth Peru Unknown
4 4 panda China Unknown
As an alternative, we can calculate a new column based on the existing ones:
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n')
# Adding a new column `avg_sleep_hours` to the `super_sleepers` DataFrame
super_sleepers['avg_sleep_hours'] <- c(21, 18, 17, 10)
print(super_sleepers)
cat('\n\n')
# Adding a new column `avg_sleep_hours_per_year` calculated from `avg_sleep_hours`
super_sleepers['avg_sleep_hours_per_year'] <- super_sleepers['avg_sleep_hours'] * 365
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
rating animal country avg_sleep_hours
1 1 koala Australia 21
2 2 hedgehog Italy 18
3 3 sloth Peru 17
4 4 panda China 10
rating animal country avg_sleep_hours avg_sleep_hours_per_year
1 1 koala Australia 21 7665
2 2 hedgehog Italy 18 6570
3 3 sloth Peru 17 6205
4 4 panda China 10 3650
We can also copy a column from another DataFrame:
# Creating the `super_sleepers_1` dataframe with the only column `rating`
super_sleepers_1 <- data.frame(rating=1:4)
print(super_sleepers_1)
cat('\n\n')
# Copying the `animal` column from `super_sleepers_initial` to `super_sleepers_1`
# Note that in the new DataFrame, the column is called `ANIMAL` instead of `animal`
super_sleepers_1['ANIMAL'] <- super_sleepers_initial['animal']
print(super_sleepers_1)
Output:
rating
1 1
2 2
3 3
4 4
rating ANIMAL
1 1 koala
2 2 hedgehog
3 3 sloth
4 4 panda
The advantage of using square brackets over the $ operator to append a column to a DataFrame is that we can add a column whose name contains white spaces or any special symbols.
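As a quick illustration of this advantage, the sketch below adds a column whose name contains spaces and parentheses. The column name itself is made up for this example.

```r
super_sleepers <- data.frame(rating = 1:4,
                             animal = c('koala', 'hedgehog', 'sloth', 'panda'))

# The column name is an ordinary string, so spaces and symbols are allowed
super_sleepers['avg sleep hours (per day)'] <- c(21, 18, 17, 10)
print(names(super_sleepers))

# The same string is used to read the column back
print(super_sleepers[['avg sleep hours (per day)']])
```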
Adding a Column to a DataFrame in R Using the cbind() Function
The third way of adding a new column to an R DataFrame is by applying the cbind() function, which stands for "column-bind" and can also be used for combining two or more DataFrames. Using this function is a more universal approach than the previous two since it allows adding several columns at once. Its basic syntax is as follows:
df <- cbind(df, new_col_1, new_col_2, ..., new_col_N)
The piece of code below adds the avg_sleep_hours column to the super_sleepers DataFrame:
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n')
# Adding a new column `avg_sleep_hours` to the `super_sleepers` DataFrame
super_sleepers <- cbind(super_sleepers,
avg_sleep_hours=c(21, 18, 17, 10))
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
rating animal country avg_sleep_hours
1 1 koala Australia 21
2 2 hedgehog Italy 18
3 3 sloth Peru 17
4 4 panda China 10
The next piece of code adds two new columns, avg_sleep_hours and has_tail, to the super_sleepers DataFrame at once:
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n')
# Adding two new columns `avg_sleep_hours` and `has_tail` to the `super_sleepers` DataFrame
super_sleepers <- cbind(super_sleepers,
avg_sleep_hours=c(21, 18, 17, 10),
has_tail=c('no', 'yes', 'yes', 'yes'))
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
rating animal country avg_sleep_hours has_tail
1 1 koala Australia 21 no
2 2 hedgehog Italy 18 yes
3 3 sloth Peru 17 yes
4 4 panda China 10 yes
Apart from adding multiple columns at once, another advantage of using the cbind() function is that it allows assigning the result of this operation (i.e., adding one or more columns to an R DataFrame) to a new DataFrame, leaving the initial one unchanged:
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n')
# Creating a new DataFrame `super_sleepers_new` based on `super_sleepers` with two new columns `avg_sleep_hours` and `has_tail`
super_sleepers_new <- cbind(super_sleepers,
avg_sleep_hours=c(21, 18, 17, 10),
has_tail=c('no', 'yes', 'yes', 'yes'))
print(super_sleepers_new)
cat('\n\n')
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
rating animal country avg_sleep_hours has_tail
1 1 koala Australia 21 no
2 2 hedgehog Italy 18 yes
3 3 sloth Peru 17 yes
4 4 panda China 10 yes
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
As with the previous two approaches, inside the cbind() function, we can assign a single value to the whole new column:
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n')
# Adding a new column `avg_sleep_hours` to the `super_sleepers` DataFrame and setting it to 0.999
super_sleepers <- cbind(super_sleepers,
avg_sleep_hours=0.999)
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
rating animal country avg_sleep_hours
1 1 koala Australia 0.999
2 2 hedgehog Italy 0.999
3 3 sloth Peru 0.999
4 4 panda China 0.999
We can also calculate the new column based on the existing ones:
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n')
# Adding a new column `avg_sleep_hours` to the `super_sleepers` DataFrame
super_sleepers <- cbind(super_sleepers,
avg_sleep_hours=c(21, 18, 17, 10))
print(super_sleepers)
cat('\n\n')
# Adding a new column `avg_sleep_hours_per_year` calculated from `avg_sleep_hours`
# (the column is passed as a vector via `$` so that cbind() keeps the new name)
super_sleepers <- cbind(super_sleepers,
avg_sleep_hours_per_year=super_sleepers$avg_sleep_hours * 365)
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
rating animal country avg_sleep_hours
1 1 koala Australia 21
2 2 hedgehog Italy 18
3 3 sloth Peru 17
4 4 panda China 10
rating animal country avg_sleep_hours avg_sleep_hours_per_year
1 1 koala Australia 21 7665
2 2 hedgehog Italy 18 6570
3 3 sloth Peru 17 6205
4 4 panda China 10 3650
We can also copy a column from another DataFrame:
# Creating the `super_sleepers_1` DataFrame with the only column `rating`
super_sleepers_1 <- data.frame(rating=1:4)
print(super_sleepers_1)
cat('\n\n')
# Copying the `animal` column from `super_sleepers_initial` to `super_sleepers_1`
# Note that in the new DataFrame, the column is still called `animal` despite setting the new name `ANIMAL`
super_sleepers_1 <- cbind(super_sleepers_1,
ANIMAL=super_sleepers_initial['animal'])
print(super_sleepers_1)
Output:
rating
1 1
2 2
3 3
4 4
rating animal
1 1 koala
2 2 hedgehog
3 3 sloth
4 4 panda
However, unlike with the $ operator and square-bracket approaches, we need to pay attention to two nuances here:
- We can't create a new column and calculate one more column based on the new one inside the same cbind() function, because the first new column doesn't yet exist in the DataFrame when the second one is evaluated. For example, the piece of code below will throw an error.
# Reconstructing the `super_sleepers` DataFrame
super_sleepers <- super_sleepers_initial
print(super_sleepers)
cat('\n\n')
# Attempting to add a new column `avg_sleep_hours` to the `super_sleepers` DataFrame
# AND another new column `avg_sleep_hours_per_year` based on it
super_sleepers <- cbind(super_sleepers,
avg_sleep_hours=c(21, 18, 17, 10),
avg_sleep_hours_per_year=super_sleepers['avg_sleep_hours'] * 365)
print(super_sleepers)
Output:
rating animal country
1 1 koala Australia
2 2 hedgehog Italy
3 3 sloth Peru
4 4 panda China
Error in `[.data.frame`(super_sleepers, "avg_sleep_hours"): undefined columns selected
Traceback:
1. cbind(super_sleepers, avg_sleep_hours = c(21, 18, 17, 10), avg_sleep_hours_per_year = super_sleepers["avg_sleep_hours"] *
. 365)
2. super_sleepers["avg_sleep_hours"]
3. `[.data.frame`(super_sleepers, "avg_sleep_hours")
4. stop("undefined columns selected")
- When we copy a column from another DataFrame using single square brackets (i.e., as a one-column DataFrame) and try to give it a new name inside the cbind() function, this new name will be ignored, and the new column will be called exactly as it was called in the original DataFrame. For example, in the piece of code below, the new name ANIMAL was ignored, and the new column was called animal, just as in the DataFrame from which it was copied:
# Creating the `super_sleepers_1` DataFrame with the only column `rating`
super_sleepers_1 <- data.frame(rating=1:4)
print(super_sleepers_1)
cat('\n\n')
# Copying the `animal` column from `super_sleepers_initial` to `super_sleepers_1`
# Note that in the new DataFrame, the column is still called `animal` despite setting the new name `ANIMAL`
super_sleepers_1 <- cbind(super_sleepers_1,
ANIMAL=super_sleepers_initial['animal'])
print(super_sleepers_1)
Output:
rating
1 1
2 2
3 3
4 4
rating animal
1 1 koala
2 2 hedgehog
3 3 sloth
4 4 panda
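Both nuances can be worked around by passing plain vectors to cbind() instead of referring to a not-yet-existing column or to a one-column DataFrame. The following is a minimal sketch of that idea, not the only possible fix; the standalone vector avg_sleep_hours is introduced here for illustration.

```r
super_sleepers_initial <- data.frame(rating = 1:4,
                                     animal = c('koala', 'hedgehog', 'sloth', 'panda'),
                                     country = c('Australia', 'Italy', 'Peru', 'China'))

# Nuance 1: store the first new column as a vector,
# then reuse that vector inside the same cbind() call
avg_sleep_hours <- c(21, 18, 17, 10)
super_sleepers <- cbind(super_sleepers_initial,
                        avg_sleep_hours = avg_sleep_hours,
                        avg_sleep_hours_per_year = avg_sleep_hours * 365)
print(super_sleepers)

# Nuance 2: extract the column as a vector with [[ ]] (or $),
# so that cbind() uses the name given in the call
super_sleepers_1 <- data.frame(rating = 1:4)
super_sleepers_1 <- cbind(super_sleepers_1,
                          ANIMAL = super_sleepers_initial[['animal']])
print(names(super_sleepers_1))
```

The difference comes from how the column is extracted: single brackets return a one-column DataFrame that carries its own column name, while double brackets and $ return a bare vector that takes the argument name.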
Conclusion
In this tutorial, we discussed the various reasons why we may need to add a new column to an R DataFrame and what kind of information it can store. Then, we explored three different ways of doing so: using the $ symbol, square brackets, and the cbind() function. We considered the syntax of each of those approaches and its possible variations, the pros and cons of each method, possible additional functionalities, the most common pitfalls and errors, and how to avoid them. Also, we learned how to add multiple columns to an R DataFrame at once.
It's worth noting that the discussed approaches are not the only ways to add a column to a DataFrame in R. For example, for the same purpose, we can use the mutate() or add_column() functions. However, to apply these functions, we need to install and load specific R packages (dplyr and tibble, respectively), and they don't add any functionality to the operation of interest beyond what we discussed in this tutorial. By contrast, the $ symbol, square brackets, and the cbind() function are available in base R and don't require any installation.