Tutorial: Data Cleaning MoMA’s Art Collection with Python
The museum has put out a disclaimer, however, that the dataset is still a work in progress – an evolving artwork in its own right perhaps. Because it’s still in progress, the dataset has data quality issues and needs some cleanup before we can analyze it. In this post, we’ll introduce a few tools for cleaning data (or munging data, as it’s sometimes called) in Python. and discuss how we can use them to diagnose data quality issues and correct them.
You can download the dataset for yourself on Github.
Why use Python for Data Cleaning?
Python is a programming language that institutions ranging from government agencies, like the SEC, to internet companies, like Dropbox, use to create powerful software. Python has become the dominant language used for data analysis within organizations that work heavily with data. While we won’t dive into too much detail about how Python works in this post, we do have a Learn Python series we highly recommend if you’re interested in learning more.
Exploring the Data with Pandas
For this post, we’ll be working with just the first 100 rows of the dataset. We will first need to import the Pandas library into our environment and then read in the dataset into a Pandas DataFrame. A DataFrame is a speed-optimized representation of our dataset, built into Pandas, which we can then use to quickly explore and analyze our data. Once we’ve read the dataset into a DataFrame object,
artworks_100, we will then:
import pandas artworks_100 = pandas.read_csv("MOMA_Artworks.csv") artworks_100.head(10)
|0||Ferdinandsbrücke Project, Vienna, Austria , El…||Otto Wagner||(Austrian, 1841–1918)||1896||Ink and cut-and-pasted painted pages on paper||19 1/8 x 66 1/2″ (48.6 x 168.9 cm)||Fractional and promised gift of Jo Carole and …||885.1996||A&D Architectural Drawing||Architecture & Design||1996-04-09||Y||2||http://www.moma.org/collection/works/2|
|1||City of Music, National Superior Conservatory …||Christian de Portzamparc||(French, born 1944)||1987||Paint and colored pencil on print||16 x 11 3/4″ (40.6 x 29.8 cm)||Gift of the architect in honor of Lily Auchinc…||1.1995||A&D Architectural Drawing||Architecture & Design||1995-01-17||Y||3||http://www.moma.org/collection/works/3|
|2||Villa near Vienna Project, Outside Vienna, Aus…||Emil Hoppe||(Austrian, 1876–1957)||1903||Graphite, pen, color pencil, ink, and gouache …||13 1/2 x 12 1/2″ (34.3 x 31.8 cm)||Gift of Jo Carole and Ronald S. Lauder||1.1997||A&D Architectural Drawing||Architecture & Design||1997-01-15||Y||4||http://www.moma.org/collection/works/4|
|3||The Manhattan Transcripts Project, New York, N…||Bernard Tschumi||(French and Swiss, born Switzerland 1944)||1980||Photographic reproduction with colored synthet…||20 x 20″ (50.8 x 50.8 cm)||Purchase and partial gift of the architect in …||2.1995||A&D Architectural Drawing||Architecture & Design||1995-01-17||Y||5||http://www.moma.org/collection/works/5|
|4||Villa, project, outside Vienna, Austria, Exter…||Emil Hoppe||(Austrian, 1876–1957)||1903||Graphite, color pencil, ink, and gouache on tr…||15 1/8 x 7 1/2″ (38.4 x 19.1 cm)||Gift of Jo Carole and Ronald S. Lauder||2.1997||A&D Architectural Drawing||Architecture & Design||1997-01-15||Y||6||http://www.moma.org/collection/works/6|
|5||The Manhattan Transcripts Project, New York, N…||Bernard Tschumi||(French and Swiss, born Switzerland 1944)||1976-77||Gelatin silver photograph||14 x 18″ (35.6 x 45.7 cm)||Purchase and partial gift of the architect in …||3.1995.1||A&D Architectural Drawing||Architecture & Design||1995-01-17||Y||7||http://www.moma.org/collection/works/7|
|6||The Manhattan Transcripts Project, New York, N…||Bernard Tschumi||(French and Swiss, born Switzerland 1944)||1976-77||Gelatin silver photographs||Each: 14 x 18″ (35.6 x 45.7 cm)||Purchase and partial gift of the architect in …||3.1995.1-24||A&D Architectural Drawing||Architecture & Design||1995-01-17||Y||8||http://www.moma.org/collection/works/8|
|7||The Manhattan Transcripts Project, New York, N…||Bernard Tschumi||(French and Swiss, born Switzerland 1944)||1976-77||Gelatin silver photograph||14 x 18″ (35.6 x 45.7 cm)||Purchase and partial gift of the architect in …||3.1995.10||A&D Architectural Drawing||Architecture & Design||1995-01-17||Y||9||http://www.moma.org/collection/works/9|
|8||The Manhattan Transcripts Project, New York, N…||Bernard Tschumi||(French and Swiss, born Switzerland 1944)||1976-77||Gelatin silver photograph||14 x 18″ (35.6 x 45.7 cm)||Purchase and partial gift of the architect in …||3.1995.11||A&D Architectural Drawing||Architecture & Design||1995-01-17||Y||10||http://www.moma.org/collection/works/10|
|9||The Manhattan Transcripts Project, New York, N…||Bernard Tschumi||(French and Swiss, born Switzerland 1944)||1976-77||Gelatin silver photograph||14 x 18″ (35.6 x 45.7 cm)||Purchase and partial gift of the architect in …||3.1995.12||A&D Architectural Drawing||Architecture & Design||1995-01-17||Y||11||http://www.moma.org/collection/works/11|
Working with Dates in Pandas
If you look carefully at the
Date column, you may notice that some of the values are year ranges (
1976-77) instead of individual years (
1976). Year ranges are difficult to work with and we can’t easily plot them like individual years. Let’s utilize the
value_counts() function from Pandas to look for any other oddities. First, to select a column, use the bracket notation and specify the column name in quotes
artworks_100['Date'] and then attach
.value_counts() to get the distribution of values using
1976-77 25 1980-81 15 1979 12 Unknown 7 1980 5 1917 5 1978 5 1923 4 1935 3 1987 2 1903 2 1970 1 1896 1 1975 1 1984 1 1918 1 1986 1 n.d. 1 1906 1 1905 1 1930 1 1974 1 1936 1 1968 1 1900 1 c. 1917 1 dtype: int64
Cleaning the Data – Patterns
In addition to year ranges, we have three other patterns we need to watch out for. Here’s a quick summary of the types of irregular values in the
Pattern 1: “1976-77” (year ranges)
Pattern 2: “c. 1917”
Pattern 3: “Unknown”
Pattern 4: “n.d.”
If we come up with rules to deal with each pattern, we can write out the logic in Python and transform all of the irregular values into the appropriate format. Once we write out our logic, we can iterate over the DataFrame row by row and change the value for the
Date column if necessary. While rows with pattern 1 or 2 have date values and are just poorly formatted for us, rows with pattern 3 and 4 actually don’t have information on when the artworks were made. Therefore, our code for handling patterns 1 and 2 should focus on reformatting the values into clean dates, while our code for patterns for 3 and 4 should just identify those columns as missing date information. To keep things simple, we can leave the rows with pattern 3 as is (as
"Unknown") and transform rows with pattern 4 to match pattern 3.
Since all of the rows with pattern 1 are year ranges spanning only two years (e.g.
1980-81), we can select a year and have it replace the range. To keep things simple, let’s select the first year in the range since it contains all four digits of the year (
1980) while the second year in the range has only the last two digits (
We also need a reliable way to identify which rows actually exhibit pattern 1 so we only update those rows and leave the others intact. We need to leave the others intact either because they are already in the proper date format or because they will need to be modified later using the logic we write for handling the other patterns.
Since year ranges contain a hyphen
- separating the two years, we can look for the
- in each row’s
Date value and split it into two separate years. The core Python library contains a function named
.split() which in this situation will return a list of the two years if a hyphen is found or the original value if it isn’t. Since we are looking for just the first year, we can call
.split("-") on every row’s
Date, check to see if the resulting list contains two elements, and if it does, return the first element. Let’s write a function
clean_split_dates(row) which will do exactly that:
def clean_split_dates(row): # Initial date contains the current value for the Date column initial_date = str(row['Date']) # Split initial_date into two elements if "-" is found split_date = initial_date.split('-') # If a "-" is found, split_date will contain a list with at least two items if len(split_date) > 1: final_date = split_date # If no "-" is found, split_date will just contain 1 item, the initial_date else: final_date = initial_date return final_date # Assign the results of "clean_split_dates" to the 'Date' column. # We want Pandas to go row-wise so we set "axis=1". We would use "axis=0" for column-wise. artworks_100['Date'] = artworks.apply(lambda row: clean_split_dates(row), axis=1) artworks_100['Date'].value_counts()
1976 25 1980 20 1979 12 Unknown 7 1917 5 1978 5 1923 4 1935 3 1987 2 1903 2 c. 1917 1 1918 1 1975 1 1968 1 n.d. 1 1905 1 1896 1 1984 1 1930 1 1970 1 1974 1 1986 1 1936 1 1900 1 1906 1 dtype: int64
Data Cleaning with Python: Next Steps
.value_counts() on the
Date column at the end to verify that all year ranges were removed from the DataFrame. If you’re interested in learning more about data cleaning, check out our interactive Data Cleaning Course at Dataquest. This six-part course uses Python and the pandas library to teach you how to clean and work with data. The course includes two guided projects which help you synthesize your skills as well as start a data science portfolio you can use to demonstrate your skills to employers. Best of all — no installation required! The course is taught entirely in an interactive environment where you can write your own code and see the result immediately in the browser.
Srini is currently the Director of Product at Dataquest