October 28, 2025

Setting Up Your Data Engineering Lab with Docker

This guide helps you set up a clean, isolated environment for running Dataquest tutorials. While many tutorials work fine directly on your computer, some (particularly those involving data processing tools like PySpark) can run into issues depending on your operating system or existing software setup. The lab environment we'll create ensures everything runs consistently, with the right versions of Python and other tools, without affecting your main system.

What's a Lab Environment?

You can think of this "lab" as a separate workspace just for your Dataquest tutorials. It's a controlled space where you can experiment and test code without affecting your main computer. Just like scientists use labs for experiments, we'll use this development lab to work through tutorials safely.

Benefits for everyone:

  • Windows/Mac users: Avoid errors from system differences. No more "command not found" or PySpark failing to find files
  • Linux users: Get the exact versions of Python and Java needed for tutorials, without conflicting with your system's packages
  • Everyone: Keep your tutorial work separate from personal projects. Your code and files are saved normally, but any packages you install or system changes you make stay contained in the lab

We'll use a tool called Docker to create this isolated workspace. Think of it as having a dedicated computer just for tutorials inside your regular computer. Your files and code are saved just like normal (you can edit them with your favorite editor), but the tutorial environment itself stays clean and separate from everything else on your system.

The lab command you'll use creates this environment, and it mirrors real data engineering workflows (most companies use isolated environments like this to ensure consistency across their teams).

Installing Docker

Docker creates isolated Linux environments on any operating system. This means you'll get a consistent Linux workspace whether you're on Windows, Mac, or even Linux itself. We're using it as a simple tool, so no container orchestration or cloud deployment knowledge is needed.

On Windows:
Download Docker Desktop from docker.com/products/docker-desktop. Run the installer, restart your computer when prompted, and open Docker Desktop. You'll see a whale icon in your system tray when it's running.

Note: Docker Desktop will automatically enable required Windows features like WSL 2. If you see an error about virtualization, you may need to enable it in your computer's BIOS settings. Search online for your computer model + "enable virtualization" for specific steps.

On Mac:
Download Docker Desktop for your chip type (Intel or Apple Silicon) from the same link. Drag Docker to your Applications folder and launch it. You'll see the whale in your menu bar.

On Linux:
You may already have Docker installed; if not, Docker's official convenience script sets it up:

curl -fsSL https://get.docker.com -o get-docker.sh && sudo sh get-docker.sh

Verify it works:
Open your terminal (PowerShell, Terminal, or bash) and run:

docker --version
docker compose version

You should see version numbers for both commands. If you see "command not found," restart your terminal or computer and try again.
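
For reference, the output looks roughly like this (version numbers are illustrative; yours will differ):

Docker version 27.3.1, build ce12230
Docker Compose version v2.29.7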

Getting the Lab Environment

The lab is already configured in the Dataquest tutorials repository. Clone or download it:

git clone https://github.com/dataquestio/tutorials.git
cd tutorials

If you don't have git, download the repository as a ZIP file from GitHub and extract it.
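
If you prefer the command line, this also works (assuming the repository's default branch is main):

curl -L https://github.com/dataquestio/tutorials/archive/refs/heads/main.zip -o tutorials.zip
unzip tutorials.zip
cd tutorials-main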

The repository includes everything you need:

  • Dockerfile - Configures the Linux environment with Python 3.11 and Java (for Spark)
  • docker-compose.yml - Defines the lab setup (a rough sketch follows below)
  • Tutorial folders with all the code and data
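
You won't need to edit these files to follow along, but it helps to see roughly what they do. Here's a minimal sketch of a docker-compose.yml for a setup like this; the actual file in the repository may differ:

services:
  lab:
    build: .              # build the image from the Dockerfile (Python 3.11 + Java)
    working_dir: /tutorials
    volumes:
      - .:/tutorials      # sync this folder on your computer into the container
    stdin_open: true      # keep an interactive shell open
    tty: true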

Starting Your Lab

In your terminal (your IDE's built-in terminal works too), make sure you're in the tutorials folder and start the lab:

docker compose run --rm lab

Note that the first time you run this command, Docker downloads and builds the environment, so setup may take 2-5 minutes. Later starts take only seconds because the image is cached.

You're now in Linux! Your prompt will change to something like root@abc123:/tutorials#, which is your Linux command line where everything will work as expected.

The --rm flag means the lab container is removed when you exit, keeping your system tidy. Your files in the tutorials folder are untouched, but anything installed inside the lab (pip packages, for example) is discarded; we'll deal with that below.

Using Your Lab

Once you’re in the lab environment, here's your typical workflow:

1. Navigate to the tutorial you're working on

# See all available tutorials
ls

# Enter a specific tutorial
cd pyspark-etl

2. Install packages as needed
Each tutorial might need different packages:

# For PySpark tutorials
pip install pyspark

# For data manipulation tutorials
pip install pandas numpy

# For database connections
pip install sqlalchemy psycopg2-binary

3. Run the tutorial code

python <script-name>.py

Because the code runs inside a standardized Linux environment, you shouldn't hit the OS-specific setup errors this lab is designed to avoid.

4. Edit files normally
The beauty of this setup: you can still use your favorite editor! The tutorials folder on your computer is synchronized with the lab. Edit files in VS Code, PyCharm, or any editor, and the lab sees changes immediately.
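
For example, you can create a file on your computer and run it in the lab with no copying step (hello.py is just a throwaway example):

# On your computer, inside the tutorials folder
echo 'print("Hello from the lab")' > hello.py

# Inside the lab
python hello.py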

5. Exit when done
Type exit or press Ctrl+D to leave the lab. The environment cleans itself up automatically.

Common Workflow Examples

Running a PySpark tutorial:

docker compose run --rm lab
cd pyspark-etl
pip install pyspark pandas
python main.py

Working with Jupyter notebooks:

docker compose run --rm -p 8888:8888 lab
pip install jupyterlab
jupyter lab --ip=0.0.0.0 --allow-root --no-browser
# Open the URL it shows in your browser
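
The startup output includes a tokenized URL that looks something like this (address and token here are illustrative):

http://127.0.0.1:8888/lab?token=abc123...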

Keeping packages installed between sessions:
Because the lab cleans itself up when you exit, installed packages don't persist between sessions. Instead of reinstalling them by hand, save them to a requirements file; it lives in the synchronized tutorials folder, so it survives:

# After installing packages, save them
pip freeze > requirements.txt

# Next session, restore them
pip install -r requirements.txt

Quick Reference

The one command you need:

# From the tutorials folder
docker compose run --rm lab

Exit the lab:

exit # Or press Ctrl+D

Where things are:

  • Tutorial code: Each folder in /tutorials
  • Your edits: Automatically synchronized
  • Data files: In each tutorial's data/ folder
  • Output files: Save to the tutorial folder to see them on your computer

Adding services (databases, etc.):
For tutorials needing PostgreSQL, MongoDB, or other services, we can extend the docker-compose.yml. For now, the base setup handles all Python and PySpark tutorials.
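
As a rough sketch, adding a PostgreSQL service could look like this (the service name, image tag, and password are placeholders, not something the repository ships):

services:
  lab:
    # ... the existing lab service ...
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example   # placeholder password for local use only
    ports:
      - "5432:5432"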

Troubleshooting

  • "Cannot connect to Docker daemon"

    • Docker Desktop needs to be running. Start it from your applications.
  • "docker compose" not recognized

    • Older Docker versions use docker-compose (with a hyphen). Try:

      docker-compose run --rm lab
  • Slow performance on Windows

    • Docker on Windows can be slow with large datasets. For better performance, store data files in the container rather than the mounted folder.
  • "Permission denied" on Linux

    • Add your user to the docker group:

      sudo usermod -aG docker $USER

      Then log out and back in.

You're Ready

You now have a Linux lab environment that matches production systems. Happy experimenting!


About the author

Anna Strahl

A former math teacher of 8 years, Anna always had a passion for learning and exploring new things. On weekends, you'll often find her performing improv or playing chess.