The data science life cycle generally comprises the following components: data retrieval, data cleaning, data exploration and visualization, and statistical or predictive modeling. While these components are helpful for understanding the different phases, they don’t help us think about our programming workflow. Often, the entire data science life cycle ends up as an arbitrary […]
This post is the second in a series on visualizing the Women’s Marches from January 2017. In the first post, we explored the intensive data collection and data cleaning process necessary to produce clean pandas dataframes. Data Enrichment: Because we eventually want to be able to build maps visualizing the marches, we need latitude and […]
In celebration of Women’s History Month, I wanted to better understand the scale of the Women’s Marches that occurred in January 2017. Shortly after the marches, Vox published a map visualizing the estimated turnout across the entire country. This map is excellent at displaying: locations with the highest relative turnouts; hubs and clusters of where […]
Whether you’re running out of memory on your local machine or simply want your code to run faster on a more powerful machine, there are many benefits to doing data science on a cloud server. A cloud server is really just a computer, like the one you’re using now, that’s located elsewhere. In this post, […]
The pandas workflow is a favorite among data analysts and data scientists. The workflow looks something like this: The pandas workflow works well when: the data fits in memory (a few gigabytes but not terabytes); the data is relatively static (doesn’t need to be loaded into memory every minute because the data has changed) […]
Overview: Brad Klingenberg is the Director of Styling Algorithms at Stitch Fix in San Francisco. His team uses data and algorithms to improve the selection of merchandise sent to clients. Prior to joining Stitch Fix, Brad worked with data and predictive analytics at financial and technology companies. He studied applied mathematics at the University of […]
Learn how to clean data on the command line, a key skill for doing data analysis and data science, using Python and csvkit.
Here’s how to install PySpark on your computer and get started working with large data sets using Python and PySpark in a Jupyter Notebook.
Learn how Dataquest user Patrick Nelli improved his understanding of data in his role as VP of Corporate Analytics.
Tracks: Since we launched Dataquest a year ago, tens of thousands of people a month have learned data science and have started to experience the benefits in their respective careers. Dataquest continues to grow because of people like you who give us feedback and push us to improve the experience of learning data science. We […]
A step-by-step tutorial on data cleaning (also called data munging, a core data science skill), applied to a dataset from the MoMA with Python and the pandas module.