Tutorial: Advanced For Loops in Python
In a previous tutorial, we covered the basics of Python for loops, looking at how to iterate through lists and lists of lists. But there's a lot more to for loops than looping through lists, and in real-world data science work, you may want to use for loops with other data structures, including numpy arrays and pandas DataFrames.
This tutorial begins with how to use for loops to iterate through common Python data structures other than lists (like tuples and dictionaries). Then we'll dig into using for loops in tandem with common Python data science libraries like numpy
, pandas
, and matplotlib
. We'll also take a closer look at the range()
function and how it's useful when writing for loops.
A Quick Review: The Python For Loop
A for loop is a programming statement that tells Python to iterate over a collection of objects, performing the same operation on each object in sequence. The basic syntax is:
for object in collection_of_objects:
# code you want to execute on each object
Each time Python iterates through the loop, the variable object
takes on the value of the next object in our sequence collection_of_objects
, and Python will execute the code we have written on each object from collection_of_objects
in sequence.
Now, let's dive into how to use for loops with different sorts of data structures. We'll skip lists since those have been covered in the previous tutorial; if you need further review, check out the introductory tutorial or Dataquest's interactive lesson on lists and for loops.
Data Structures
Tuples
Tuples are sequences, just like lists. The difference between tuples and lists is that tuples are immutable; that is, they cannot be changed (learn more about mutable and immutable objects in Python). Tuples also use parentheses instead of square brackets.
Regardless of these differences, looping over tuples is very similar to lists.
x = (10,20,30,40,50)
for var in x:
print("index "+ str(x.index(var)) + ":",var)
index 0: 10
index 1: 20
index 2: 30
index 3: 40
index 4: 50
If we have a list of tuples, we can access the individual elements in each tuple in our list by including them both as variables in the for loop, like so:
x = [(1,2), (3,4), (5,6)]
for a, b in x:
print(a, "plus", b, "equals", a+b)
1 plus 2 equals 3
3 plus 4 equals 7
5 plus 6 equals 11
Dictionaries
In addition to lists and tuples, dictionaries are another common Python data type you're likely to encounter when working with data, and for loops can iterate through dictionaries, too.
Python dictionaries are composed of key-value pairs, so in each loop, there are two elements we need to access (the key and the value). Instead of using enumerate()
like we would with lists, to loop over both keys and the corresponding values for each key-value pair we need to call the .items()
method.
For example, imagine we have a dictionary called stocks
that contains both stock tickers and the corresponding stock prices. We'll use the .items()
method on our dictionary to generate a key and value for each iteration:
stocks = {
'AAPL': 187.31,
'MSFT': 124.06,
'FB': 183.50
}
for key, value in stocks.items() :
print(key + " : " + str(value))
AAPL : 187.31
MSFT : 124.06
FB : 183.5
Note that the names key and value are completely arbitrary; we could also label these as k and v or x and y.
Strings
As mentioned in the introductory tutorial, for loops can also iterate through each character in a string. As a quick review, here's how that works:
print("data science")
for c in "data science":
print(c)
data science
d
a
t
a
s
c
i
e
n
c
e
Numpy Arrays
Now, let's take a look at how for loops can be used with common Python data science packages and their data types.
We'll start by looking at how to use for loops with numpy
arrays, so let's start by creating some arrays of random numbers.
import numpy as np
np.random.seed(0) # seed for reproducibility
x = np.random.randint(10, size=6)
y = np.random.randint(10, size=6)
Iterating over a one-dimensional numpy array is very similar to iterating over a list:
for val in x:
print(val)
5
0
3
3
7
9
Now, what if we want to iterate through a two-dimensional array? If we use the same syntax to iterate a two-dimensional array as we did above, we can only iterate entire arrays on each iteration.
# creating our 2-dimensional array
z = np.array([x, y])
for val in z:
print(val)
[5 0 3 3 7 9]
[3 5 2 4 7 6]
A two-dimensional array is built up from a pair of one-dimensional arrays. To visit every element rather than every array, we can use the numpy function nditer()
, a multi-dimensional iterator object which takes an array as its argument.
In the code below, we'll write a for loop that iterates through each element by passing z
, our two-dimensional array, as the argument for nditer()
:
for val in np.nditer(z):
print(val)
5
0
3
3
7
9
3
5
2
4
7
6
As we can see, this first lists all of the elements in x, then all elements of y.
Remember! When looping through these different data structures, dictionaries require a method, numpy arrays require a function.
Pandas DataFrames
When we're working with data in Python, we're often using pandas
DataFrames. And thankfully, we can use for loops to iterate through those, too.
Let's practice doing this while working with a small CSV file that records the GDP, capital city, and population for six different countries. We will read this into a pandas DataFrame below.
Pandas works a bit differently from numpy, so we won't be able to simply repeat the numpy process we've already learned. If we try to iterate over a pandas DataFrame as we would a numpy array, this would just print out the column names:
import pandas as pd
df = pd.read_csv('gdp.csv', index_col=0)
for val in df:
print(val)
Capital
GDP ($US Trillion)
Population
Instead, we need to mention explicitly that we want to iterate over the rows of the DataFrame. We do this by calling the iterrows()
method on the DataFrame, and print row labels and row data, where a row is the entire pandas series.
for label, row in df.iterrows():
print(label)
print(row)
Ireland
Capital Dublin
GDP ($US Trillion) 0.3337
Population 4784000
Name: Ireland, dtype: object
United Kingdom
Capital London
GDP ($US Trillion) 2.622
Population 66040000
Name: United Kingdom, dtype: object
United States
Capital Washington, D.C.
GDP ($US Trillion) 19.39
Population 327200000
Name: United States, dtype: object
China
Capital Beijing
GDP ($US Trillion) 12.24
Population 1386000000
Name: China, dtype: object
India
Capital New Delhi
GDP ($US Trillion) 2.597
Population 1339000000
Name: India, dtype: object
Germany
Capital Berlin
GDP ($US Trillion) 3.677
Population 82790000
Name: Germany, dtype: object
We can also access specific values from a pandas series. Suppose we just want to print out the capital of each country. We can specify that we want only output from the "Capital" column like so:
for label, row in df.iterrows():
print(label + " : " + row["Capital"])
Ireland : Dublin
United Kingdom : London
United States : Washington, D.C.
China : Beijing
India : New Delhi
Germany : Berlin
To take things further than simple printouts, let's add a column using a for loop. Let's add a GDP per capita column. Remember that .loc[]
is label-based. In the code below, we'll add the column and compute its contents for each country by dividing its total GDP from its population and multiplying the result by one trillion (since the GDP numbers are listed in trillions).
for label, row in df.iterrows():
df.loc[label,'gdp_per_cap'] = row['GDP ($US Trillion)']/row['Population '] * 1000000000000
print(df)
Capital GDP ($US Trillion) Population \
Country
Ireland Dublin 0.3337 4784000
United Kingdom London 2.6220 66040000
United States Washington, D.C. 19.3900 327200000
China Beijing 12.2400 1386000000
India New Delhi 2.5970 1339000000
Germany Berlin 3.6770 82790000
gdp_per_cap
Country
Ireland 69753.344482
United Kingdom 39703.210176
United States 59260.391198
China 8831.168831
India 1939.507095
Germany 44413.576519
For each row in our dataframe, we are creating a new label, and setting the row data equal to the total GDP divided by the country’s population, and multiplying by $1T for thousands of dollars.
The range()
function
We have seen how we can use for loops to iterate over any sequence or data structure. But what if we would like to iterate over these sequences in a specific order, or for a specific number of times?
This can be accomplished with Python’s built-in range()
function. Depending on how many arguments you pass to the function, you can decide where that series of numbers will begin and end as well as how big the difference will be between one number and the next. Note that, similar to lists, the range()
function’s count starts from 0 and not from 1.
There are three ways we can call range()
:
- range(stop)
- range(start, stop)
- range(start, stop, step)
range(stop)
range(stop) takes one argument, used when we want to iterate over a series of numbers thats starts at 0 and includes every number up to, but not including, the number we set as the stop.
for i in range(3):
print(i)
0
1
2
range(start, stop)
range(start, stop) takes two arguments, where we can not only set the end of the series but also the beginning. You can use range() to generate a series of numbers from A to B using a range(A, B).
for i in range(1, 8):
print(i)
1
2
3
4
5
6
7
range(start, stop, step)
range(start, stop, step) takes three arguments. In addition to the minimum and maximum values, we can set the difference between one number in the sequence and the next. The default step value is 1 if none is provided.
for i in range(3, 16, 3):
print(i)
3
6
9
12
15
Note that this works the same for non-numerical sequences.
We can also use the index of elements in a sequence to iterate. The key idea is to first calculate the length of the list and then iterate over the sequence within the range of this length. Let's take a look at an example:
languages = ['Spanish', 'English', 'French', 'German', 'Irish', 'Chinese']
for index in range(len(languages)):
print('Language:', languages[index])
Language: Spanish
Language: English
Language: French
Language: German
Language: Irish
Language: Chinese
In our for loop above we are looking at a variable’s index and language, the in keyword, and the range()
function to create a sequence of numbers. Note that we also use the len()
function in this case, as the list is not numerical.
For each iteration, we are executing our print statement. So for every index in the range len(languages), we want to print a language. Because the length of our languages sequence is 6 (that is the value that len(langauges)
evaluates to), we can rewrite the statement as follows:
for index in range(6):
print('Language:', languages[index])
Language: Spanish
Language: English
Language: French
Language: German
Language: Irish
Language: Chinese
Plotting with For Loops
Suppose we want to iterate through a collection, and use each element to produce a subplot, or even for each trace in a single plot. For example, let’s take the popular iris data set (learn more about this data) and do some plotting with for loops. Consider the graph below.
(If you are unfamiliar with Matplotlib or Seaborn, check out these beginner guides fro Kyso: Matplotlib, Seaborn. Dataquest also offers interactive courses on Python data visualization).
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('iris.csv')
# create a figure and axis
fig, ax = plt.subplots()
# scatter the sepal_length against the sepal_width
ax.scatter(df['sepal_length'], df['sepal_width'])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
Text(0,0.5,'sepal_width')
Above, we’ve plotted each sepal length vs sepal width, but we can give the graph more meaning by coloring in each data point by each flower's species class. One way to do this is by scattering each point on its own using a for loop and passing in the respective color.
# create color dictionary
colors = {'setosa':'r', 'versicolor':'g', 'virginica':'b'}
# create a figure and axis
fig, ax = plt.subplots()
# plot each data-point
for i in range(len(df['sepal_length'])):
ax.scatter(df['sepal_length'][i], df['sepal_width'][i], color=colors[df['species'][i]])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
Text(0,0.5,'sepal_width')
What if we want to visualize the univariate distribution of certain features of our iris dataset? We can do this with plt.subplot()
, which creates a single subplot within a grid, the numbers of columns and rows of which we can set.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 6))
fig.subplots_adjust(hspace=0.8)
fig.suptitle('Distributions of Iris Features')
for ax, feature, name in zip(axes.flatten(), df.drop('species',axis=1).values.T, df.columns.values):
ax.hist(feature, bins=len(np.unique(df.drop('species',axis=1).values.T[0])//2))
ax.set(title=name[:-4].upper(), xlabel='cm')
Without diving too deep into the matplotlib syntax for now, below is a brief description of each main component of our graph:
- plt.subplot( ) - used to create our 2-by-2 grid and set the overall size.
- zip( ) - this is a built-in python function that makes it super simple to loop through multiple iterables of the same length in simultaneously.
- axes.flatten( ), where flatten( ) is a numpy array method - this returns a flattened version of our arrays (columns).
- ax.set( ) - allows us to set all of the attributes of our
axes
object with a single method.
Additional Operations
Nested Loops
Python allows us to use one loop inside another loop. This involves an outer loop that has, inside its commands, an inner loop.
Consider the following structure:
for iterator_var in sequence:
for iterator_var in sequence:
# statements(s)
# statements(s)
Nested for loops can be useful for iterating through items within lists composed of lists. In a list composed of lists, if we employ just one for loop, the program will output each internal list as an item:
languages = [['Spanish', 'English', 'French', 'German'], ['Python', 'Java', 'Javascript', 'C++']]
for lang in languages:
print(lang)
['Spanish', 'English', 'French', 'German']
['Python', 'Java', 'Javascript', 'C++']
In order to access each individual item of the internal lists, we define a nested for loop:
for x in languages:
print("------")
for lang in x:
print(lang)
------
Spanish
English
French
German
------
Python
Java
Javascript
C++
Above, the outer for loop is looping through the main list-of-lists (which contains two lists in this example) and the inner for loop is looping through the individual lists themselves. The outer loop executes 2 iterations (for each sub-list) and at each iteration we execute our inner loop, printing all elements of the respective sub-lists.
This tells us that the control travels from the outermost loop, traverses the inner loop and then back again to the outer for loop, continuing until the control has covered the entire range, which is 2 times in this case.
Continuing and Breaking For Loops
Loop control statements change the execution of a for loop from its normal sequence.
What if we want to filter out a specific language within our inner loop? We can use a continue statement to do this, which allows us to skip over a specific part of our loop when an external condition is triggered.
for x in languages:
print("------")
for lang in x:
if lang == "German":
continue
print(lang)
------
Spanish
English
French
------
Python
Java
Javascript
C++
In our loop above, within the inner loop, if the langauge equals “German”, we skip that iteration only and continue with the rest of the loop. The loop is not terminated.
Let’s look at a numerical example below:
from math import sqrt
number = 0
for i in range(10):
number = i ** 2
if i
continue # continue here
print(str(round(sqrt(number))) + ' squared is equal to ' + str(number))
1 squared is equal to 1
3 squared is equal to 9
5 squared is equal to 25
7 squared is equal to 49
9 squared is equal to 81
So here, we have defined a loop that iterates over all numbers 0 through 9, and squares each number. Within our loop, at each iteration, we are checking if the number is divisible by 2, at which point the loop will continue to execute, skipping the iteration when i evaluates to an even number.
What about a break statement? This allows us to exit a loop entirely when an external condition is met. Let’s see a simple demonstration of how this works using the same example as above:
number = 0
for i in range(10):
number = i ** 2
if i == 7:
break
print(str(round(sqrt(number))) + ' squared is equal to ' + str(number))
0 squared is equal to 0
1 squared is equal to 1
2 squared is equal to 4
3 squared is equal to 9
4 squared is equal to 16
5 squared is equal to 25
6 squared is equal to 36
In the above example, our if statement presents the condition that if our variable i evaluates to 7, our loop will break, so our loop iterates over integers 0 through 6 before dropping out of the loop entirely.
Looking for more? Here are some additional resources that might be useful:
- Python Tutorials — Our ever-expanding list of Python tutorials for data science.
- Data Science Courses — Take your studies to the next level with fully interactive programming, data science, and stats courses, right in your browser.
Conclusion
In this tutorial, we learned about some more advanced applications of for loops, and how they might be used in typical Python data science workflows.
We learned how to iterate over different types of data structures, and how loops can be used with pandas DataFrames and matplotlib to create multiple traces or sub-plots programmatically.
Finally, we looked at some more advanced techniques that give us more control over the operation and execution of our for loops.
If you’d like to learn more about this topic, check out Dataquest's Data Scientist in Python path that will help you become job-ready in around 6 months.