Data Cleaning and Exploration Using Csvkit
So far, we’ve been using the default command-line tools to clean, munge, and explore data. Tools like `wc` and `head` are useful, but they weren’t designed for working with datasets. They lack features specific to tabular data, like parsing the header row or understanding the row and column layout. Because of this, in the Data Munging Using the Command Line challenge, you had to explicitly count the lines in each CSV file using the `wc` tool, then use that number to select just the non-header rows using the `tail` tool. You then had to repeat this for each CSV file you were merging into the single, combined file!
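As a reminder, that manual approach might look roughly like the sketch below. The file names (`part_a.csv`, `part_b.csv`, `combined.csv`) and their contents are hypothetical stand-ins created just for this illustration, not the files from the challenge.

```shell
# Hypothetical sample files, created here so the sketch is self-contained.
printf 'year,value\n2005,10\n2005,20\n' > part_a.csv
printf 'year,value\n2007,30\n' > part_b.csv

# Start the combined file with a single copy of the header row.
head -n 1 part_a.csv > combined.csv

for f in part_a.csv part_b.csv; do
    # Count the lines in the file (reading from stdin keeps the
    # file name out of wc's output).
    total=$(wc -l < "$f")
    # Keep everything except the first (header) line.
    tail -n "$((total - 1))" "$f" >> combined.csv
done

cat combined.csv
```

Every file needs its own line count before `tail` can skip the header, which is exactly the bookkeeping a CSV-aware tool can do for you.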
In this lesson, we’ll learn about the csvkit library, which supercharges your workflow by adding 13 new command-line tools specifically for working with CSV files. We’ll focus on these five tools from csvkit:
- csvstack: for stacking rows from multiple CSV files.
- csvlook: for rendering CSV data as a readable table.
- csvcut: for selecting specific columns from a CSV file.
- csvstat: for calculating descriptive statistics for some or all columns.
- csvgrep: for filtering tabular data using specific criteria.
As you work through each concept, you’ll get to apply what you’ve learned from within your browser; there’s no need to use your own machine to do the exercises. The Python environment inside of this course includes answer-checking to ensure you’ve fully mastered each concept before learning the next.
- Learn how to use csvkit to clean and explore datasets
- csvcut | csvstat
- Filtering out problematic rows
- Next steps