/ Datacleaning

Data Cleaning with Python — MoMAs Artwork Collection

Art is a messy business. Over centuries, artists have created everything from simple paintings to complex sculptures, and art historians have been cataloging everything they can along the way. The Museum of Modern Art, or MoMA for short, is considered one of the most influential museums in the world and recently released a dataset of all the artworks they’ve cataloged in their collection. This dataset contains basic information on metadata for each artwork and is part of MoMA's push to make art more accessible to everyone.

The museum has put out a disclaimer however that the dataset is still a work in progress - an evolving artwork in its own right perhaps. Because it's still in progress, the dataset has data quality issues and needs some cleanup before we can analyze it. In this post, we’ll introduce a few tools for cleaning data (or munging data, as it's sometimes called) in Python. and discuss how we can use them to diagnose data quality issues and correct them. You can download the dataset for yourself on Github.

Why use Python for Data Cleaning?

Python is a programming language that institutions ranging from government agencies, like the SEC, to internet companies, like Dropbox, use to create powerful software. Python has become the dominant language used for data analysis within organizations that work heavily with data. While we won’t dive into too much detail about how Python works in this post, we do have a Learn Python series we highly recommend if you're interested in learning more.

Exploring the Data with Pandas

For this post, we'll be working with just the first 100 rows of the dataset. We will first need to import the Pandas library into our environment and then read in the dataset into a Pandas DataFrame. A DataFrame is a speed-optimized representation of our dataset, built into Pandas, which we can then use to quickly explore and analyze our data. Once we've read the dataset into a DataFrame object, artworks_100, we will then


import pandas
artworks_100 = pandas.read_csv("MOMA_Artworks.csv")

artworks_100.head(10)

Title Artist ArtistBio Date Medium Dimensions CreditLine MoMANumber Classification Department DateAcquired CuratorApproved ObjectID URL
0 Ferdinandsbrücke Project, Vienna, Austria , El... Otto Wagner (Austrian, 1841–1918) 1896 Ink and cut-and-pasted painted pages on paper 19 1/8 x 66 1/2" (48.6 x 168.9 cm) Fractional and promised gift of Jo Carole and ... 885.1996 A&D Architectural Drawing Architecture & Design 1996-04-09 Y 2 http://www.moma.org/collection/works/2
1 City of Music, National Superior Conservatory ... Christian de Portzamparc (French, born 1944) 1987 Paint and colored pencil on print 16 x 11 3/4" (40.6 x 29.8 cm) Gift of the architect in honor of Lily Auchinc... 1.1995 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 3 http://www.moma.org/collection/works/3
2 Villa near Vienna Project, Outside Vienna, Aus... Emil Hoppe (Austrian, 1876–1957) 1903 Graphite, pen, color pencil, ink, and gouache ... 13 1/2 x 12 1/2" (34.3 x 31.8 cm) Gift of Jo Carole and Ronald S. Lauder 1.1997 A&D Architectural Drawing Architecture & Design 1997-01-15 Y 4 http://www.moma.org/collection/works/4
3 The Manhattan Transcripts Project, New York, N... Bernard Tschumi (French and Swiss, born Switzerland 1944) 1980 Photographic reproduction with colored synthet... 20 x 20" (50.8 x 50.8 cm) Purchase and partial gift of the architect in ... 2.1995 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 5 http://www.moma.org/collection/works/5
4 Villa, project, outside Vienna, Austria, Exter... Emil Hoppe (Austrian, 1876–1957) 1903 Graphite, color pencil, ink, and gouache on tr... 15 1/8 x 7 1/2" (38.4 x 19.1 cm) Gift of Jo Carole and Ronald S. Lauder 2.1997 A&D Architectural Drawing Architecture & Design 1997-01-15 Y 6 http://www.moma.org/collection/works/6
5 The Manhattan Transcripts Project, New York, N... Bernard Tschumi (French and Swiss, born Switzerland 1944) 1976-77 Gelatin silver photograph 14 x 18" (35.6 x 45.7 cm) Purchase and partial gift of the architect in ... 3.1995.1 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 7 http://www.moma.org/collection/works/7
6 The Manhattan Transcripts Project, New York, N... Bernard Tschumi (French and Swiss, born Switzerland 1944) 1976-77 Gelatin silver photographs Each: 14 x 18" (35.6 x 45.7 cm) Purchase and partial gift of the architect in ... 3.1995.1-24 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 8 http://www.moma.org/collection/works/8
7 The Manhattan Transcripts Project, New York, N... Bernard Tschumi (French and Swiss, born Switzerland 1944) 1976-77 Gelatin silver photograph 14 x 18" (35.6 x 45.7 cm) Purchase and partial gift of the architect in ... 3.1995.10 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 9 http://www.moma.org/collection/works/9
8 The Manhattan Transcripts Project, New York, N... Bernard Tschumi (French and Swiss, born Switzerland 1944) 1976-77 Gelatin silver photograph 14 x 18" (35.6 x 45.7 cm) Purchase and partial gift of the architect in ... 3.1995.11 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 10 http://www.moma.org/collection/works/10
9 The Manhattan Transcripts Project, New York, N... Bernard Tschumi (French and Swiss, born Switzerland 1944) 1976-77 Gelatin silver photograph 14 x 18" (35.6 x 45.7 cm) Purchase and partial gift of the architect in ... 3.1995.12 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 11 http://www.moma.org/collection/works/11

Working with Dates in Pandas

If you look carefully at the Date column, you may notice that some of the values are year ranges (1976-77) instead of individual years (1976). Year ranges are difficult to work with and we can't easily plot them like individual years. Let's utilize the value_counts() function from Pandas to look for any other oddities.

First, to select a column, use the bracket notation and specify the column name in quotes artworks_100['Date'] and then attach .value_counts() to get the distribution of values artworks_100['Date'].value_counts().


artworks_100['Date'].value_counts()

1976-77    25
1980-81    15
1979       12
Unknown     7
1980        5
1917        5
1978        5
1923        4
1935        3
1987        2
1903        2
1970        1
1896        1
1975        1
1984        1
1918        1
1986        1
n.d.        1
1906        1
1905        1
1930        1
1974        1
1936        1
1968        1
1900        1
c. 1917     1
dtype: int64

Cleaning the Data - Patterns

In addition to year ranges, we have three other patterns we need to watch out for. Here's a quick summary of the types of irregular values in the Date column:

  • Pattern 1: "1976-77" (year ranges)
  • Pattern 2: "c. 1917"
  • Pattern 3: "Unknown"
  • Pattern 4: "n.d."

If we come up with rules to deal with each pattern, we can write out the logic in Python and transform all of the irregular values into the appropriate format. Once we write out our logic, we can iterate over the DataFrame row by row and change the value for the Date column if necessary.

While rows with pattern 1 or 2 have date values and are just poorly formatted for us, rows with pattern 3 and 4 actually don't have information on when the artworks were made. Therefore, our code for handling patterns 1 and 2 should focus on reformatting the values into clean dates, while our code for patterns for 3 and 4 should just identify those columns as missing date information. To keep things simple, we can leave the rows with pattern 3 as is (as "Unknown") and transform rows with pattern 4 to match pattern 3.

Pattern 1

Since all of the rows with pattern 1 are year ranges spanning only two years (e.g. 1980-81), we can select a year and have it replace the range. To keep things simple, let's select the first year in the range since it contains all four digits of the year (1980) while the second year in the range has only the last two digits (81).

We also need a reliable way to identify which rows actually exhibit pattern 1 so we only update those rows and leave the others intact. We need to leave the others intact either because they are already in the proper date format or because they will need to be modified later using the logic we write for handling the other patterns.

Since year ranges contain a hyphen - separating the two years, we can look for the - in each row's Date value and split it into two separate years. The core Python library contains a function named .split() which in this situation will return a list of the two years if a hyphen is found or the original value if it isn't. Since we are looking for just the first year, we can call .split("-") on every row's Date, check to see if the resulting list contains two elements, and if it does, return the first element. Let's write a function clean_split_dates(row) which will do exactly that:


def clean_split_dates(row):
    # Initial date contains the current value for the Date column
    initial_date = str(row['Date'])

    # Split initial_date into two elements if "-" is found
    split_date = initial_date.split('-') 

    # If a "-"  is found, split_date will contain a list with at least two items
    if len(split_date) > 1:
        final_date = split_date[0]
    # If no "-" is found, split_date will just contain 1 item, the initial_date
    else:
        final_date = initial_date
    
    return final_date

# Assign the results of "clean_split_dates" to the 'Date' column. 
# We want Pandas to go row-wise so we set "axis=1". We would use "axis=0" for column-wise.
artworks_100['Date'] = artworks.apply(lambda row: clean_split_dates(row), axis=1)
artworks_100['Date'].value_counts()

1976       25
1980       20
1979       12
Unknown     7
1917        5
1978        5
1923        4
1935        3
1987        2
1903        2
c. 1917     1
1918        1
1975        1
1968        1
n.d.        1
1905        1
1896        1
1984        1
1930        1
1970        1
1974        1
1986        1
1936        1
1900        1
1906        1
dtype: int64

Data Cleaning with Python: Next Steps

We ran .value_counts() on the Date column at the end to verify that all year ranges were removed from the DataFrame.

If you're interested in learning more about data cleaning, checkout our interactive Data Cleaning Course at Dataquest. This six part course uses Python and the pandas library to teach you how to clean and work with data. The course includes two guided projects which help you synthesize your skills as well as start a data science portfolio you can use to demonstrate your skills to employers.

Best of all — no installation required! The course is taught entirely in an interactive environment where you can write your own code and see the result immediately in the browser.

Srini Kadamati

Srini Kadamati

Data scientist at Dataquest.io. Loves tacos and BBQ. Texan living in California.

Read More