July 8, 2025

Project Tutorial: Analyzing Startup Fundraising Deals from Crunchbase

In this project walkthrough, we'll explore how to work with large datasets efficiently by analyzing startup investment data from Crunchbase. By optimizing memory usage and leveraging SQLite, we'll uncover insights about fundraising trends while learning techniques that scale to real-world data challenges.

Working with large datasets is a common challenge in data analysis. When your dataset exceeds your computer's memory capacity, you need smart strategies to process and analyze the data effectively. This project demonstrates practical techniques for handling medium-to-large datasets that every data professional should know.

Throughout this tutorial, we'll work within a self-imposed constraint of 10MB of memory—a teaching device that simulates the challenges you'll face with truly massive datasets. By the end, you'll have transformed a 57MB dataset into an optimized database ready for lightning-fast analysis.

What You'll Learn

By the end of this tutorial, you'll know how to:

  • Process large CSV files in chunks to avoid memory errors
  • Identify and remove unnecessary columns to reduce dataset size
  • Convert data types to more memory-efficient alternatives
  • Load optimized data into SQLite for efficient querying
  • Combine pandas and SQL for powerful data analysis workflows
  • Diagnose encoding issues and handle non-UTF-8 files
  • Create visualizations from aggregated database queries

Before You Start: Pre-Instruction

To make the most of this project walkthrough, follow these preparatory steps:

  1. Review the Project

    Access the project and familiarize yourself with the goals and structure: Analyzing Startup Fundraising Deals Project.

  2. Access the Solution Notebook

    You can view and download it here to see what we'll be covering: Solution Notebook

  3. Prepare Your Environment

    • If you're using the Dataquest platform, everything is already set up for you
    • If working locally, ensure you have Python with pandas and sqlite3 installed
    • Download the dataset from GitHub (October 2013 Crunchbase data)
  4. Prerequisites

    • Comfortable with Python basics (loops, dictionaries, data types)
    • Basic familiarity with pandas DataFrames
    • Some SQL knowledge is helpful but not required

New to Markdown? We recommend learning the basics to format headers and add context to your Jupyter notebook: Markdown Guide.

Setting Up Your Environment

Let's begin by importing pandas and understanding the challenge ahead:

import pandas as pd

Before we begin working with our dataset, let's understand our constraint. We're simulating a scenario where we only have 10MB of memory available for our analysis. While this specific limit is artificial for teaching purposes, the techniques we'll learn apply directly to real-world situations with datasets containing millions of rows.

Learning Insight: When I first started as a data analyst, I tried to load a 100,000-row dataset on my local machine and it crashed. I was blindsided because I didn't know about optimization techniques. Understanding what causes large memory usage and how to handle it are valuable skills that will save you countless hours of frustration.

Loading the Dataset with Chunking

Our first challenge: the CSV file isn't UTF-8 encoded. Let's handle this properly:

# This won't work - let's see the error
df = pd.read_csv('crunchbase-investments.csv')

When you try this, you'll get a UnicodeDecodeError. This is our first lesson in working with real-world data: it's often messy!

# Let's use chunking with proper encoding
chunk_iter = pd.read_csv('crunchbase-investments.csv',
                         chunksize=5000,
                         encoding='Latin-1')

Learning Insight: Latin-1 is a good fallback encoding when UTF-8 fails. Think of encoding like languages—if your computer expects English (UTF-8) but gets Spanish (Latin-1), it needs to know how to translate. The encoding parameter is just telling pandas which "language" to expect.
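
If you want to diagnose the encoding yourself rather than jumping straight to a fallback, one approach is to read a sample of raw bytes and try decoding it. This is a minimal sketch using only the standard library (the sample size is arbitrary):

# Try decoding a sample of raw bytes to confirm the file isn't UTF-8
with open('crunchbase-investments.csv', 'rb') as f:
    sample = f.read(1_000_000)

try:
    sample.decode('utf-8')
    print("Sample decodes cleanly as UTF-8")
except UnicodeDecodeError as err:
    # Latin-1 maps every byte value to a character, so it never raises,
    # which is why it works as a fallback here.
    print(f"Not valid UTF-8: {err}")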

Now let's find out how big our dataset is:

cb_length = []

# Count rows across all chunks
for chunk in chunk_iter:
    cb_length.append(len(chunk))

total_rows = sum(cb_length)
print(f"Total rows in dataset: {total_rows}")
Total rows in dataset: 52870

Identifying Optimization Opportunities

Finding Missing Values

Columns with many missing values are prime candidates for removal:

# Re-initialize the chunk iterator (it's exhausted after one pass)
chunk_iter = pd.read_csv('crunchbase-investments.csv',
                         chunksize=5000,
                         encoding='Latin-1')

mv_list = []
for chunk in chunk_iter:
    if not mv_list:  # First chunk - get column names
        columns = chunk.columns
        print("Columns in dataset:")
        print(columns.sort_values())
    mv_list.append(chunk.isnull().sum())

# Combine missing value counts across all chunks
combined_mv_vc = pd.concat(mv_list)
unique_combined_mv_vc = combined_mv_vc.groupby(combined_mv_vc.index).sum()
print("\nMissing values by column:")
print(unique_combined_mv_vc.sort_values(ascending=False))
Columns in dataset:
Index(['company_category_code', 'company_city', 'company_country_code',
       'company_name', 'company_permalink', 'company_region',
       'company_state_code', 'funded_at', 'funded_month', 'funded_quarter',
       'funded_year', 'funding_round_type', 'investor_category_code',
       'investor_city', 'investor_country_code', 'investor_name',
       'investor_permalink', 'investor_region', 'investor_state_code',
       'raised_amount_usd'],
      dtype='object')

Missing values by column:
investor_category_code    50427
investor_state_code       16809
investor_city             12480
investor_country_code     12001
raised_amount_usd          3599
...

Learning Insight: The investor_category_code column has 50,427 missing values out of 52,870 total rows. That's over 95% missing! This column provides virtually no value and is a perfect candidate for removal. When a column is missing more than 75% of its values, it's usually safe to drop it entirely.
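
You can turn that rule of thumb into a quick check against the counts we just computed. A minimal sketch, using unique_combined_mv_vc and total_rows from above:

# Flag columns that are missing more than 75% of their values
threshold = 0.75 * total_rows
print(unique_combined_mv_vc[unique_combined_mv_vc > threshold])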

Analyzing Memory Usage

Let's see which columns are consuming the most memory:

chunk_iter = pd.read_csv('crunchbase-investments.csv',
                         chunksize=5000,
                         encoding='Latin-1')

counter = 0
series_memory_fp = pd.Series(dtype='float64')

for chunk in chunk_iter:
    if counter == 0:
        series_memory_fp = chunk.memory_usage(deep=True)
    else:
        series_memory_fp += chunk.memory_usage(deep=True)
    counter += 1

# Drop the index memory calculation
series_memory_fp_before = series_memory_fp.drop('Index').sort_values()
print("Memory usage by column (bytes):")
print(series_memory_fp_before)

# Total memory in megabytes
total_memory_mb = series_memory_fp_before.sum() / (1024 * 1024)
print(f"\nTotal memory usage: {total_memory_mb:.2f} MB")
Memory usage by column (bytes):
funded_year                  422960
raised_amount_usd            422960
investor_category_code       622424
...
investor_permalink          4980548
dtype: int64

Total memory usage: 56.99 MB

Our dataset uses nearly 57MB, far exceeding our 10MB constraint! The permalink columns (URLs) are the biggest memory hogs, making them great candidates to remove from our dataframe.

Implementing Optimization Strategies

Dropping Unnecessary Columns

Based on our analysis, let's remove columns that won't help our analysis:

# Columns to drop
drop_cols = ['investor_permalink',      # URL - not needed for analysis
             'company_permalink',        # URL - not needed for analysis
             'investor_category_code',   # 95% missing values
             'funded_month',            # Redundant - we have funded_at
             'funded_quarter',          # Redundant - we have funded_at
             'funded_year']             # Redundant - we have funded_at

# Get columns to keep
chunk = pd.read_csv('crunchbase-investments.csv', nrows=1, encoding='Latin-1')
keep_cols = chunk.columns.drop(drop_cols)
print("Columns we're keeping:")
print(keep_cols.tolist())

Learning Insight: Notice how we're only dropping investor_category_code among the columns with missing data. Although investor_city, investor_state_code, and investor_country_code also contain missing values, they didn't exceed the 75% threshold and were retained for potential future insights.

Also, we’re dropping the month, quarter, and year columns even though they have data. This follows the programming principle "Don't Repeat Yourself" (DRY). Since we have the full date in funded_at, we can extract month, quarter, or year later if needed. This saves us significant memory without losing information.
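
If you ever need those pieces later, you can derive them from funded_at with pandas' .dt accessor. A quick sketch, assuming funded_at has already been parsed as a datetime (as we do below with parse_dates):

# Derive the dropped date parts back from the full date on demand
chunk['funded_year'] = chunk['funded_at'].dt.year
chunk['funded_quarter'] = chunk['funded_at'].dt.to_period('Q').astype(str)
chunk['funded_month'] = chunk['funded_at'].dt.month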

Identifying Data Type Optimization Opportunities

Let's check which columns can be converted to more efficient data types:

# Analyze data types across chunks
col_types = {}
chunk_iter = pd.read_csv('crunchbase-investments.csv',
                         chunksize=5000,
                         encoding='Latin-1',
                         usecols=keep_cols)

for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in col_types:
            col_types[col] = [str(chunk.dtypes[col])]
        else:
            col_types[col].append(str(chunk.dtypes[col]))

# Get unique types per column
uniq_col_types = {col: set(types) for col, types in col_types.items()}

print("Data types by column:")
for col, types in uniq_col_types.items():
    print(f"{col}: {types}")

Now let's check unique value counts to identify categorical candidates:

unique_values = {}

chunk_iter = pd.read_csv('crunchbase-investments.csv',
                         chunksize=5000,
                         encoding='Latin-1',
                         usecols=keep_cols)

for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in unique_values:
            unique_values[col] = set()
        unique_values[col].update(chunk[col].unique())

print("Unique value counts:")
for col, unique_vals in unique_values.items():
    print(f"{col}: {len(unique_vals)} unique values")
Unique value counts:
company_name: 11574 unique values
company_category_code: 44 unique values
company_country_code: 3 unique values
company_state_code: 51 unique values
...
funding_round_type: 10 unique values

Learning Insight: Any column with fewer than 100 unique values is a good candidate for the category data type. Categories are incredibly memory-efficient for repeated string values. Think of it like a dictionary—instead of storing "USA" 10,000 times, pandas stores it once and uses a reference number.
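
To see the difference concretely, you can compare how much memory a single low-cardinality column uses as object versus category. A small sketch, reusing the last chunk from the loop above:

# Compare object vs. category storage for one repeated-string column
as_object = chunk['company_category_code'].astype('object')
as_category = chunk['company_category_code'].astype('category')
print(f"object:   {as_object.memory_usage(deep=True):,} bytes")
print(f"category: {as_category.memory_usage(deep=True):,} bytes")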

Applying Optimizations

Let's load our data with all optimizations applied:

Note: we convert funded_at to a datetime data type using pd.read_csv's optional parse_dates parameter.

# Define categorical columns
col_types = {
    'company_category_code': 'category',
    'funding_round_type': 'category',
    'investor_state_code': 'category',
    'investor_country_code': 'category'
}

# Load with optimizations
chunk_iter = pd.read_csv('crunchbase-investments.csv',
                         chunksize=5000,
                         encoding='Latin-1',
                         usecols=keep_cols,
                         dtype=col_types,
                         parse_dates=['funded_at'])

# Calculate new memory usage
counter = 0
series_memory_fp_after = pd.Series(dtype='float64')

for chunk in chunk_iter:
    if counter == 0:
        series_memory_fp_after = chunk.memory_usage(deep=True)
    else:
        series_memory_fp_after += chunk.memory_usage(deep=True)
    counter += 1

series_memory_fp_after = series_memory_fp_after.drop('Index')
total_memory_after = series_memory_fp_after.sum() / (1024 * 1024)

print(f"Memory usage before optimization: 56.99 MB")
print(f"Memory usage after optimization: {total_memory_after:.2f} MB")
print(f"Reduction: {((56.99 - total_memory_after) / 56.99 * 100):.1f}%")
Memory usage before optimization: 56.99 MB
Memory usage after optimization: 26.84 MB
Reduction: 52.9%

We've cut our memory usage by more than half!
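
If you're curious where those savings came from, you can compare the two per-column memory Series. A minimal sketch, using series_memory_fp_before and series_memory_fp_after from the code above:

# Per-column savings; columns we dropped are absent from the "after" Series,
# so fill_value=0 credits their entire original footprint as a saving.
savings = (series_memory_fp_before
           .subtract(series_memory_fp_after, fill_value=0)
           .sort_values(ascending=False))
print(savings.head(10))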

Loading into SQLite

Now let's load our optimized data into a SQLite database for efficient querying:

import sqlite3

# Create database connection
conn = sqlite3.connect('crunchbase.db')

# Start fresh so re-running this cell doesn't append duplicate rows
conn.execute("DROP TABLE IF EXISTS investments;")

# Load data in chunks and insert into database
chunk_iter = pd.read_csv('crunchbase-investments.csv',
                         chunksize=5000,
                         encoding='Latin-1',
                         usecols=keep_cols,
                         dtype=col_types,
                         parse_dates=['funded_at'])

for chunk in chunk_iter:
    chunk.to_sql("investments", conn, if_exists='append', index=False)

# Verify the table was created
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
print(f"Tables in database: {[t[0] for t in tables]}")

# Preview the data
cursor.execute("SELECT * FROM investments LIMIT 5;")
print("\nFirst 5 rows:")
for row in cursor.fetchall():
    print(row)

Learning Insight: SQLite is perfect for this use case because once our data is in the database, we don't need to worry about memory constraints anymore. SQL queries are incredibly fast and efficient, especially for aggregation operations. What would require complex pandas operations with chunking can be done in a single SQL query.
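
For example, a group-by that would otherwise mean accumulating partial results across chunks becomes a single query. A quick sketch (the query is illustrative; the column names match the table we just created):

# Deal counts and total raised by funding round, computed entirely inside SQLite
round_query = """
    SELECT funding_round_type,
           COUNT(*) AS deals,
           SUM(raised_amount_usd) AS total_raised
    FROM investments
    WHERE raised_amount_usd IS NOT NULL
    GROUP BY funding_round_type
    ORDER BY total_raised DESC;
"""
print(pd.read_sql(round_query, conn))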

Analyzing the Data

Let’s answer a business question: Which categories of companies attract the most investment on average?


import matplotlib.pyplot as plt  # needed for the layout and display calls below

# Use pandas to read SQL results directly into a DataFrame
query = """
    SELECT company_category_code,
           AVG(raised_amount_usd) as avg_raised
    FROM investments
    WHERE raised_amount_usd IS NOT NULL  -- Exclude nulls for accurate average
    GROUP BY company_category_code
    ORDER BY avg_raised DESC;
"""

df_results = pd.read_sql(query, conn)

# Visualize top 15 categories
top_n = 15
ax = df_results.head(top_n).set_index('company_category_code').plot(
    kind='bar',
    figsize=(10, 6),
    title=f'Top {top_n} Categories by Average Investment Amount',
    ylabel='Average Investment (USD)',
    legend=False
)

# Format y-axis to show actual numbers instead of scientific notation
ax.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

[Bar chart: Top 15 Categories by Average Investment Amount, with company_category_code on the x-axis]

Key Takeaways and Next Steps

Through this project, we've learned essential techniques for handling large datasets:

  1. Chunking prevents memory errors: Process data in manageable pieces when the full dataset exceeds memory
  2. Data type optimization matters: Converting to categories and appropriate numeric types can reduce memory by 50% or more
  3. Remove redundant columns: Don't store the same information multiple ways
  4. SQLite enables fast analysis: Once optimized and loaded, complex queries run in seconds
  5. Encoding issues are common: Real-world data often isn't UTF-8; know your alternatives

More Projects to Try

We have some other project walkthrough tutorials you may also enjoy:

Homework Challenges

To deepen your understanding, try extending this project with:

  1. Further optimization: Can you convert raised_amount_usd from float64 to float32 or smaller? Test whether this maintains adequate precision (see the sketch after this list for one way to check).
  2. Advanced analysis questions:
    • What proportion of total funding did the top 10% of companies receive?
    • Which investors contributed the most money overall?
    • Which funding rounds (seed, Series A, etc.) are most common?
    • Are there seasonal patterns in funding? (Hint: use the funded_at column)
  3. Data quality: Investigate why the original CSV isn't UTF-8 encoded. Can you find and fix the problematic characters?
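
For the first challenge, one way to test precision is to downcast the column and measure how far the values drift. A minimal sketch, assuming the optimized data has been loaded into a DataFrame named df (a hypothetical name):

# df is a hypothetical DataFrame holding the optimized data
original = df['raised_amount_usd']
downcast = original.astype('float32')
max_drift = (original - downcast.astype('float64')).abs().max()
print(f"Largest absolute difference after downcasting: {max_drift:,.2f}")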

Personal Reflection: As someone who transitioned from teaching math to data analysis, I remember being intimidated by memory issues and encoding errors. But these challenges taught me that real-world data is messy, and learning to handle that messiness is what separates good analysts from great ones. Every optimization technique you learn makes you more capable of tackling bigger, more interesting datasets.

This project demonstrates that with the right techniques, you can analyze datasets much larger than your available memory. Whether you're working with startup data, customer records, or sensor readings, these optimization strategies will serve you well throughout your data career.

If you're new to pandas and found the chunking concepts challenging, start with our Processing Large Datasets in Pandas course. For those ready to dive deeper into data engineering, check out our Data Engineering path.

Happy analyzing!

Anna Strahl

About the author

Anna Strahl

A former math teacher of 8 years, Anna always had a passion for learning and exploring new things. On weekends, you'll often find her performing improv or playing chess.