In a previous post, we discussed how we used Python and Pandas to clean the dataset. In this post, we’ll learn about how to use the csvkit
library to acquire and explore tabular data.
Why the command line?
Great question! When working in cloud data science environments, you sometimes only have access to a server’s shell. In these situations, proficiency with command line data science is a true superpower. As you become more proficient, using the command line for some data science tasks is much quicker than writing a Python script or a Hadoop job.
Lastly, the command line has a rich ecosystem of tools and integration into the file system. This makes certain kinds of tasks, especially those involving multiple files, incredibly easy. Some experience working in the command line is expected for this post. If you’re new to the command line, I recommend checking out our interactive command line course
csvkit
csvkit
is a library optimized for working with CSV files. It’s written in Python but the primary interface is the command line. You can install csvkit
using pip
:
pip install csvkit
You’ll need this library to follow along with this post.
Data acquisition
The MOMA artworks dataset is available on the museum’s Github repo. Let’s use curl
to download Artworks.csv
from Github. curl
is a tool built into most shell environments that allows you to transfer data between servers. Github provides direct URLs for every file, which you can find by clicking on the Raw
button.
Lastly, we’ll use the >
operator to redirect the output from curl
to a file named artworks.csv
.
curl https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv > artworks.csv
Data exploration
We can use the head
command to display the first n
lines of a file (10 by default). To display the first 3 lines, either of these commands will work:
head -n 3 artworks.csv
head -3 artworks.csv
Title,Artist,ArtistBio,Date,Medium,Dimensions,CreditLine,MoMANumber,Classification,Department,DateAcquired,CuratorApproved,ObjectID,URL
"Ferdinandsbrücke Project, Vienna, Austria , Elevation, preliminary version",Otto Wagner,"(Austrian, 1841–1918)",1896,Ink and cut-and-pasted painted pages on paper,"19 1/8 x 66 1/2"" (48.6 x 168.9 cm)",Fractional and promised gift of Jo Carole and Ronald S. Lauder,885.1996,A&D Architectural Drawing,Architecture & Design,1996-04-09,Y,2,http://www.moma.org/collection/works/2
"City of Music, National Superior Conservatory of Music and Dance, Paris, France, View from interior courtyard",Christian de Portzamparc,"(French, born 1944)",1987,Paint and colored pencil on print,"16 x 11 3/4"" (40.6 x 29.8 cm)",Gift of the architect in honor of Lily Auchincloss,1.1995,A&D Architectural Drawing,Architecture & Design,1995-01-17,Y,3,http://www.moma.org/collection/works/3
As with most datasets, it looks like the first line contains the column headers. Other than that, the output is messy and it’s hard to glean anything else from the default output. We need a tool that knows how to display CSV files in a readable way.
csvlook
csvlook
is a tool within csvkit
that allows you to display and preview a CSV file as a table. csvlook artworks.csv
will display the entire dataset, which is cumbersome to explore. Let’s instead pipe the stdout
of head -5 artworks.csv
to csvlook
to explore the first 5 lines:
head -5 artworks.csv | csvlook
| Title | Artist | ArtistBio | Date | Medium | Dimensions | CreditLine | MoMANumber | Classification | Department | DateAcquired | CuratorApproved | ObjectID | URL
| Ferdinandsbrücke Project, Vienna, Austria , Elevation, preliminary version | Otto Wagner | (Austrian, 1841–1918) | 1896 | Ink and cut-and-pasted painted pages on paper | 19 1/8 x 66 1/2" (48.6 x 168.9 cm) | Fractional and promised gift of Jo Carole and Ronald S. Lauder | 885.1996 | A&D Architectural Drawing | Architecture & Design | 1996-04-09 | Y | 2 | http://www.moma.org/collection/works/2 |
| City of Music, National Superior Conservatory of Music and Dance, Paris, France, View from interior courtyard | Christian de Portzamparc | (French, born 1944) | 1987 | Paint and colored pencil on print | 16 x 11 3/4" (40.6 x 29.8 cm) | Gift of the architect in honor of Lily Auchincloss | 1.1995 | A&D Architectural Drawing | Architecture & Design | 1995-01-17 | Y | 3 | http://www.moma.org/collection/works/3 |
| Villa near Vienna Project, Outside Vienna, Austria, Elevation | Emil Hoppe | (Austrian, 1876–1957) | 1903 | Graphite, pen, color pencil, ink, and gouache on tracing paper | 13 1/2 x 12 1/2" (34.3 x 31.8 cm) | Gift of Jo Carole and Ronald S. Lauder | 1.1997 | A&D Architectural Drawing | Architecture & Design | 1997-01-15 | Y | 4 | http://www.moma.org/collection/works/4 |
| The Manhattan Transcripts Project, New York, New York , Introductory panel to Episode 1: The Park | Bernard Tschumi | (French and Swiss, born Switzerland 1944) | 1980 | Photographic reproduction with colored synthetic laminate | 20 x 20" (50.8 x 50.8 cm) | Purchase and partial gift of the architect in honor of Lily Auchincloss | 2.1995 | A&D Architectural Drawing | Architecture & Design | 1995-01-17 | Y | 5 | http://www.moma.org/collection/works/5
Here’s a diagram representing the pipeline:
While the output is now easier to read, it’s still a little difficult to explore. Let’s now learn how to use
csvcut
to select just a few columns and display those.
csvcut
csvcut
is a tool within csvkit
that is referred to as the data scalpel since it allows you to slice and modify columns in a CSV. Let’s first list all the columns using the -n
flag:
csvcut -n artworks.csv
1: Title
2: Artist
3: ArtistBio
4: Date
5: Medium
6: Dimensions
7: CreditLine
8: MoMANumber
9: Classification
10: Department
11: DateAcquired
12: CuratorApproved
13: ObjectID
14: URL
We can then use the -c
flag to specify the columns we want. csvcut -c 1,2,3 artworks.csv | csvlook
will return the first 3 columns. You can also use the column names themselves: csvcut -c Artist,ArtistBio,Date
. Running either command will display the 3 columns for the entire dataset so we’ll need to take advantage of piping to view just a few lines.
We can consult the csvkit
documentation to read about piping between csvkit
utilities:
All csvkit utilities accept an input file as “standard in”, in addition to as a filename. This means that we can make the output of one csvkit utility become the input of the next.
This means we can pipe the stdout
of csvcut
to the stdin
of csvlook
! We can build the following pipeline:
-
extract just the first
10
lines usinghead
-
filter to just the first
3
columns usingcsvcut
-
display in a clean way using
csvlook
head -10 artworks.csv | csvcut -c 1,2,3 | csvlook
| Title | Artist | ArtistBio |
| Ferdinandsbrücke Project, Vienna, Austria , Elevation, preliminary version | Otto Wagner | (Austrian, 1841–1918) |
| City of Music, National Superior Conservatory of Music and Dance, Paris, France, View from interior courtyard | Christian de Portzamparc | (French, born 1944) |
| Villa near Vienna Project, Outside Vienna, Austria, Elevation | Emil Hoppe | (Austrian, 1876–1957) |
| The Manhattan Transcripts Project, New York, New York , Introductory panel to Episode 1: The Park | Bernard Tschumi | (French and Swiss, born Switzerland 1944) |
| Villa, project, outside Vienna, Austria, Exterior perspective | Emil Hoppe | (Austrian, 1876–1957) |
| The Manhattan Transcripts Project, New York, New York , Episode 1: The Park | Bernard Tschumi | (French and Swiss, born Switzerland 1944) |
| The Manhattan Transcripts Project, New York, New York , Episode 1: The Park | Bernard Tschumi | (French and Swiss, born Switzerland 1944) |
| The Manhattan Transcripts Project, New York, New York , Episode 1: The Park | Bernard Tschumi | (French and Swiss, born Switzerland 1944) |
| The Manhattan Transcripts Project, New York, New York, Episode 1: The Park | Bernard Tschumi | (French and Swiss, born Switzerland 1944)
csvgrep
When working with historical datasets, we need to make sure the date and time columns are formatted correctly (or even basic time series plots won’t work). Let’s explore the Date
and DateAcquired
columns:
head -20 artworks.csv | csvcut -c Date,DateAcquired | csvlook
|----------+---------------|
| Date | DateAcquired |
|----------+---------------|
| 1896 | 1996-04-09 |
| 1987 | 1995-01-17 |
| 1903 | 1997-01-15 |
| 1980 | 1995-01-17 |
| 1903 | 1997-01-15 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
| 1976-77 | 1995-01-17 |
|----------+---------------|
While the first 20 values in DateAcquired
look fine, the Date
column has some values that won’t play nicely with most data visualization tools like 1976-77
. We can easily deal with this by just selecting the first year in the range (e.g. 1976
from the range 1976-77
). Before we do that, let’s figure out how many lines match this pattern.
We can use the csvgrep
tool to extract all values in a column (or columns) that match a regular expression. We specify the columns we want csvgrep
to match on using the -c
flag. We specify the regular expression we want csvgrep
to use using the -regex
flag. The regex ^([0-9]*-[0-9]*)
matches pairs of numeric values that are separated by a hyphen (-
).
Since we’re searching for instances of the pattern on the Date
column, we write the following:
csvgrep --c Date --regex "^([0-9]*-[0-9]*)"
Let’s modify and run the pipeline we’ve built to incorporate csvgrep
:
head -10 artworks.csv | csvcut -c Date | csvgrep --c Date --regex "^([0-9]*-[0-9]*)" | csvlook
|-----------|
| Date |
|-----------|
| 1976-77 |
| 1976-77 |
| 1976-77 |
| 1976-77 |
|-----------|
csvstat
Wonderful! Now that we know it works, let’s apply the regular expression over the entire Date
column (instead of just the first 10 lines) and determine how many lines match this pattern. The csvstat
tool takes a CSV as an input (stdin
) and computes summary statistics. We can use the --count
flag to specify that we only want the line count.
We can also remove csvcut
, head
, and csvlook
since we don’t need to display the output.
csvgrep --c Date --regex "^([0-9]*-[0-9]*)" artworks.csv | csvstat --count
Row count: 18073
It looks like 18,073 lines match this pattern. Let’s now calculate:
- how many lines match the 4 digit year pattern
- how many total lines the dataset contains
We can use the regex (^[0-9]{4}$
) to find all 4 digit year values and pipe the results to csvstat
:
csvgrep --c Date --regex "[0-9]{4}
Row count: 76263
Finally, to get the total number of lines of a dataset, we can use the wc
command with the -l
flag (to display just the number of lines):
wc -l artworks.csv
137382 artworks.csv
If we combine the number of lines that match the 4 digit year regex (76263) with the number of lines that match the year range regex (18073), we get (94336) lines. Given that there are 137,382 lines total, this is a great starting point for our analysis!
Next Steps
If you’re interesting in learning more about using csvkit, you can check out our interactive csvkit mission.