/ Data Cleaning

Cleaning CSV Data Using the Command Line and csvkit

The Museum of Modern Art is one of the most influential museums in the world and they have released a dataset on the artworks in their collection. The dataset has some data quality issues, however, and requires cleanup.

In a previous post, we discussed how we used Python and Pandas to clean the dataset. In this post, we'll learn about how to use the csvkit library to acquire and explore tabular data.

Why the command line?

Great question! When working in cloud data science environments, you sometimes only have access to a server's shell. In these situations, proficiency with command line data science is a true superpower. As you become more proficient, using the command line for some data science tasks is much quicker than writing a Python script or a Hadoop job. Lastly, the command line has a rich ecosystem of tools and integration into the file system. This makes certain kinds of tasks, especially those involving multiple files, incredibly easy.

Some experience working in the command line is expected for this post. If you're new to the command line, I recommend checking out our interactive command line course

csvkit

csvkit is a library optimized for working with CSV files. It's written in Python but the primary interface is the command line. You can install csvkit using pip:

pip install csvkit

You'll need this library to follow along with this post.

Data acquisition

The MOMA artworks dataset is available on the museum's Github repo. Let's use curl to download Artworks.csv from Github. curl is a tool built into most shell environments that allows you to transfer data between servers. Github provides direct URLs for every file, which you can find by clicking on the Raw button. Lastly, we'll use the > operator to redirect the output from curl to a file named artworks.csv.

curl https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv > artworks.csv

Data exploration

We can use the head command to display the first n lines of a file (10 by default). To display the first 3 lines, either of these commands will work:

head -n 3 artworks.csv
head -3 artworks.csv
Title,Artist,ArtistBio,Date,Medium,Dimensions,CreditLine,MoMANumber,Classification,Department,DateAcquired,CuratorApproved,ObjectID,URL
"Ferdinandsbrücke Project, Vienna, Austria , Elevation, preliminary version",Otto Wagner,"(Austrian, 1841–1918)",1896,Ink and cut-and-pasted painted pages on paper,"19 1/8 x 66 1/2"" (48.6 x 168.9 cm)",Fractional and promised gift of Jo Carole and Ronald S. Lauder,885.1996,A&D Architectural Drawing,Architecture & Design,1996-04-09,Y,2,http://www.moma.org/collection/works/2
"City of Music, National Superior Conservatory of Music and Dance, Paris, France, View from interior courtyard",Christian de Portzamparc,"(French, born 1944)",1987,Paint and colored pencil on print,"16 x 11 3/4"" (40.6 x 29.8 cm)",Gift of the architect in honor of Lily Auchincloss,1.1995,A&D Architectural Drawing,Architecture & Design,1995-01-17,Y,3,http://www.moma.org/collection/works/3

As with most datasets, it looks like the first line contains the column headers. Other than that, the output is messy and it's hard to glean anything else from the default output. We need a tool that knows how to display CSV files in a readable way.

csvlook

csvlook is a tool within csvkit that allows you to display and preview a CSV file as a table. csvlook artworks.csv will display the entire dataset, which is cumbersome to explore. Let's instead pipe the stdout of head -5 artworks.csv to csvlook to explore the first 5 lines:

head -5 artworks.csv | csvlook
|----------------------------------------------------------------------------------------------------------------+--------------------------+-------------------------------------------+------+----------------------------------------------------------------+------------------------------------+-------------------------------------------------------------------------+------------+---------------------------+-----------------------+--------------+-----------------+----------+-----------------------------------------|
|  Title                                                                                                        | Artist                   | ArtistBio                                 | Date | Medium                                                         | Dimensions                         | CreditLine                                                              | MoMANumber | Classification            | Department            | DateAcquired | CuratorApproved | ObjectID | URL                                     |
|----------------------------------------------------------------------------------------------------------------+--------------------------+-------------------------------------------+------+----------------------------------------------------------------+------------------------------------+-------------------------------------------------------------------------+------------+---------------------------+-----------------------+--------------+-----------------+----------+-----------------------------------------|
|  Ferdinandsbrücke Project, Vienna, Austria , Elevation, preliminary version                                    | Otto Wagner              | (Austrian, 1841–1918)                     | 1896 | Ink and cut-and-pasted painted pages on paper                  | 19 1/8 x 66 1/2" (48.6 x 168.9 cm) | Fractional and promised gift of Jo Carole and Ronald S. Lauder          | 885.1996   | A&D Architectural Drawing | Architecture & Design | 1996-04-09   | Y               | 2        | http://www.moma.org/collection/works/2  |
|  City of Music, National Superior Conservatory of Music and Dance, Paris, France, View from interior courtyard | Christian de Portzamparc | (French, born 1944)                       | 1987 | Paint and colored pencil on print                              | 16 x 11 3/4" (40.6 x 29.8 cm)      | Gift of the architect in honor of Lily Auchincloss                      | 1.1995     | A&D Architectural Drawing | Architecture & Design | 1995-01-17   | Y               | 3        | http://www.moma.org/collection/works/3  |
|  Villa near Vienna Project, Outside Vienna, Austria, Elevation                                                 | Emil Hoppe               | (Austrian, 1876–1957)                     | 1903 | Graphite, pen, color pencil, ink, and gouache on tracing paper | 13 1/2 x 12 1/2" (34.3 x 31.8 cm)  | Gift of Jo Carole and Ronald S. Lauder                                  | 1.1997     | A&D Architectural Drawing | Architecture & Design | 1997-01-15   | Y               | 4        | http://www.moma.org/collection/works/4  |
|  The Manhattan Transcripts Project, New York, New York , Introductory panel to Episode 1: The Park             | Bernard Tschumi          | (French and Swiss, born Switzerland 1944) | 1980 | Photographic reproduction with colored synthetic laminate      | 20 x 20" (50.8 x 50.8 cm)          | Purchase and partial gift of the architect in honor of Lily Auchincloss | 2.1995     | A&D Architectural Drawing | Architecture & Design | 1995-01-17   | Y               | 5        | http://www.moma.org/collection/works/5  |
|----------------------------------------------------------------------------------------------------------------+--------------------------+-------------------------------------------+------+----------------------------------------------------------------+------------------------------------+-------------------------------------------------------------------------+------------+---------------------------+-----------------------+--------------+-----------------+----------+-----------------------------------------|

Here's a diagram representing the pipeline:

ditaa_diagram_1

While the output is now easier to read, it's still a little difficult to explore. Let's now learn how to use csvcut to select just a few columns and display those.

csvcut

csvcut is a tool within csvkit that is referred to as the data scalpel since it allows you to slice and modify columns in a CSV. Let's first list all the columns using the -n flag:

csvcut -n artworks.csv
1: Title
2: Artist
3: ArtistBio
4: Date
5: Medium
6: Dimensions
7: CreditLine
8: MoMANumber
9: Classification
10: Department
11: DateAcquired
12: CuratorApproved
13: ObjectID
14: URL

We can then use the -c flag to specify the columns we want. csvcut -c 1,2,3 artworks.csv | csvlook will return the first 3 columns. You can also use the column names themselves: csvcut -c Artist,ArtistBio,Date.

Running either command will display the 3 columns for the entire dataset so we'll need to take advantage of piping to view just a few lines.

We can consult the csvkit documentation to read about piping between csvkit utilities:

All csvkit utilities accept an input file as “standard in”, in addition to as a filename. This means that we can make the output of one csvkit utility become the input of the next.

This means we can pipe the stdout of csvcut to the stdin of csvlook! We can build the following pipeline:

  • extract just the first 10 lines using head
  • filter to just the first 3 columns using csvcut
  • display in a clean way using csvlook
head -10 artworks.csv | csvcut -c 1,2,3 | csvlook
|----------------------------------------------------------------------------------------------------------------+--------------------------+--------------------------------------------|
|  Title                                                                                                        | Artist                   | ArtistBio                                  |
|----------------------------------------------------------------------------------------------------------------+--------------------------+--------------------------------------------|
|  Ferdinandsbrücke Project, Vienna, Austria , Elevation, preliminary version                                    | Otto Wagner              | (Austrian, 1841–1918)                      |
|  City of Music, National Superior Conservatory of Music and Dance, Paris, France, View from interior courtyard | Christian de Portzamparc | (French, born 1944)                        |
|  Villa near Vienna Project, Outside Vienna, Austria, Elevation                                                 | Emil Hoppe               | (Austrian, 1876–1957)                      |
|  The Manhattan Transcripts Project, New York, New York , Introductory panel to Episode 1: The Park             | Bernard Tschumi          | (French and Swiss, born Switzerland 1944)  |
|  Villa, project, outside Vienna, Austria, Exterior perspective                                                 | Emil Hoppe               | (Austrian, 1876–1957)                      |
|  The Manhattan Transcripts Project, New York, New York , Episode 1: The Park                                   | Bernard Tschumi          | (French and Swiss, born Switzerland 1944)  |
|  The Manhattan Transcripts Project, New York, New York , Episode 1: The Park                                   | Bernard Tschumi          | (French and Swiss, born Switzerland 1944)  |
|  The Manhattan Transcripts Project, New York, New York , Episode 1: The Park                                   | Bernard Tschumi          | (French and Swiss, born Switzerland 1944)  |
|  The Manhattan Transcripts Project, New York, New York, Episode 1: The Park                                    | Bernard Tschumi          | (French and Swiss, born Switzerland 1944)  |
|----------------------------------------------------------------------------------------------------------------+--------------------------+--------------------------------------------|

csvgrep

When working with historical datasets, we need to make sure the date and time columns are formatted correctly (or even basic time series plots won't work). Let's explore the Date and DateAcquired columns:

head -20 artworks.csv | csvcut -c Date,DateAcquired | csvlook
|----------+---------------|
|  Date    | DateAcquired  |
|----------+---------------|
|  1896    | 1996-04-09    |
|  1987    | 1995-01-17    |
|  1903    | 1997-01-15    |
|  1980    | 1995-01-17    |
|  1903    | 1997-01-15    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|  1976-77 | 1995-01-17    |
|----------+---------------|

While the first 20 values in DateAcquired look fine, the Date column has some values that won't play nicely with most data visualization tools like 1976-77. We can easily deal with this by just selecting the first year in the range (e.g. 1976 from the range 1976-77). Before we do that, let's figure out how many lines match this pattern.

We can use the csvgrep tool to extract all values in a column (or columns) that match a regular expression. We specify the columns we want csvgrep to match on using the -c flag. We specify the regular expression we want csvgrep to use using the -regex flag.

The regex ^([0-9]*-[0-9]*) matches pairs of numeric values that are separated by a hyphen (-). Since we're searching for instances of the pattern on the Date column, we write the following:

csvgrep --c Date --regex "^([0-9]*-[0-9]*)"

Let's modify and run the pipeline we've built to incorporate csvgrep:

head -10 artworks.csv | csvcut -c Date | csvgrep --c Date --regex "^([0-9]*-[0-9]*)" | csvlook
|-----------|
|  Date     |
|-----------|
|  1976-77  |
|  1976-77  |
|  1976-77  |
|  1976-77  |
|-----------|

csvstat

Wonderful! Now that we know it works, let's apply the regular expression over the entire Date column (instead of just the first 10 lines) and determine how many lines match this pattern. The csvstat tool takes a CSV as an input (stdin) and computes summary statistics. We can use the --count flag to specify that we only want the line count. We can also remove csvcut, head, and csvlook since we don't need to display the output.

csvgrep --c Date --regex "^([0-9]*-[0-9]*)" artworks.csv | csvstat --count
Row count: 18073

It looks like 18,073 lines match this pattern. Let's now calculate:

  • how many lines match the 4 digit year pattern
  • how many total lines the dataset contains

We can use the regex (^[0-9]{4}$) to find all 4 digit year values and pipe the results to csvstat:

csvgrep --c Date --regex "[0-9]{4}$" artworks.csv | csvstat --count
Row count: 76263

Finally, to get the total number of lines of a dataset, we can use the wc command with the -l flag (to display just the number of lines):

wc -l artworks.csv
137382 artworks.csv

If we combine the number of lines that match the 4 digit year regex (76263) with the number of lines that match the year range regex (18073), we get (94336) lines. Given that there are 137,382 lines total, this is a great starting point for our analysis!

Next Steps

If you're interesting in learning more about using csvkit, you can check out our interactive csvkit mission.

Srini Kadamati

Srini Kadamati

Director of Content at Dataquest.io. Based in Austin, TX.

Read More