Practical Application of Prompt Engineering for Data Professionals
In Part 1 of this tutorial, you learned about prompt engineering fundamentals and strategies to communicate effectively with AI models. Now, we'll put those skills into practice with a common data task: analyzing survey data.
As a data professional, you've likely worked with survey responses before, whether it was customer feedback, employee satisfaction surveys, or user experience questionnaires. Survey analysis often involves both quantitative measures (ratings, scales) and qualitative feedback (open-ended responses), making it a perfect use case for applying prompt engineering techniques.
In this practical application of prompt engineering, you'll learn how to:
- Generate synthetic survey data using structured prompts
- Categorize qualitative feedback into meaningful themes
- Extract structured JSON outputs ready for downstream analysis
What makes this approach particularly valuable is that you'll not only learn to analyze survey data more efficiently but also gain a reusable framework for creating practice datasets. This means you can practice your data analysis techniques on "real-fake data" without risking privacy concerns or waiting for appropriate datasets to become available.
Let's get started!
Understanding Our Survey Structure
For this project, we'll work with a fictional Dataquest course feedback survey that includes both quantitative ratings and qualitative feedback. Here's the structure we’ll be using:
| Question Type | Description | Data Format |
|---|---|---|
| Quantitative | How confident are you in applying what you learned? | Scale: 1-7 (1 = Not confident, 7 = Very confident) |
| Quantitative | How would you rate the course overall? | Scale: 1-7 (1 = Poor, 7 = Excellent) |
| Freeform | What aspects of the course did you find most helpful, and were there any areas where you think the course could be improved? | Open-ended text response |
| Categorical | Technology | One of: Python, SQL, R, Excel, Power BI, Tableau |
| Binary | Completed | True/False |
| Unique ID | User_ID | Unique identifier per learner |
This mix of structured ratings and open-ended feedback is common in many survey scenarios, making the techniques we'll explore widely applicable.
Why This Matters
Before we get into the technical aspects, let's understand why generating and analyzing synthetic survey data is a valuable skill for data professionals:
- Privacy and compliance: Using synthetic data lets you practice analysis techniques without risking exposure of real respondent information.
- Control and variation: You can generate exactly the distributions and patterns you want to test your analytical approaches.
- Rapid prototyping: Rather than sinking a lot of time into finding an appropriate dataset, you can immediately start developing your analysis pipeline.
- Reproducible examples: You can share examples and methods without sharing sensitive data.
- Testing edge cases: You can generate uncommon patterns in your data to ensure your analysis handles outliers properly.
For data teams, having the ability to quickly generate realistic test data can significantly accelerate development and validation of analytics workflows.
Step 1: Generating Realistic Synthetic Survey Data
Our first task is to generate synthetic survey responses that feel authentic. This is where the prompt engineering techniques from Part 1 will help us a lot!
Basic Approach
Let's start with a simple prompt to generate a synthetic survey response to see how the AI handles creating a single response:
Generate a single realistic response to a course feedback survey with these fields:
- Confidence rating (1-7 scale)
- Overall course rating (1-7 scale)
- Open-ended feedback (about 2-3 sentences)
- Technology focus (one of: Python, SQL, R, Excel, Power BI, Tableau)
- Completed (True/False)
- User_ID (format: UID followed by 5 digits)
While this might produce a basic response, it lacks the nuance and realism we need. Let's improve it by applying the prompt engineering techniques we learned in Part 1.
Improved Approach with Structured Output
Using a structured output prompt, we can request more precise formatting:
Generate a realistic response to a Dataquest course feedback survey. Format the response as a
JSON object with the following fields:
{
"confidence_rating": [1-7 scale, where 1 is not confident and 7 is very confident],
"overall_rating": [1-7 scale, where 1 is poor and 7 is excellent],
"feedback": [2-3 sentences of realistic course feedback, including both positive aspects and suggestions for improvement],
"technology": [one of: "Python", "SQL", "R", "Excel", "Power BI", "Tableau"],
"completed": [boolean: true or false],
"user_id": ["UID" followed by 5 random digits]
}
Make the feedback reflect the ratings given, and create a realistic response that might come
from an actual learner.
This improved prompt:
- Identifies Dataquest as the learning platform
- Specifies the exact output format (JSON)
- Defines each field with clear expectations
- Requests internal consistency (feedback should reflect ratings)
- Asks for realism in the responses
This prompt offers several key advantages over the basic version. By specifying the exact JSON structure and detailing the expected format for each field, we've significantly increased the likelihood of receiving consistent, well-formatted responses. The prompt also establishes a clear connection between the quantitative ratings and qualitative feedback, ensuring internal consistency in the synthetic data.
While this represents a significant improvement, it still lacks specific context about the course content itself, which could lead to generic feedback that doesn't reference actual learning materials or concepts. In the next iteration, we'll address this limitation by providing more specific course context to generate even more authentic-sounding responses.
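If you want to run prompts like this programmatically rather than in a chat interface, the sketch below shows one way to do it with the OpenAI Python SDK and parse the reply as JSON. The client setup, model name, and error handling are assumptions; adapt them to whichever provider and model you actually use.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

prompt = """Generate a realistic response to a Dataquest course feedback survey.
Format the response as a JSON object with the fields described above.
Return only the JSON object, with no surrounding text."""

reply = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; substitute whichever model you use
    messages=[{"role": "user", "content": prompt}],
)

raw = reply.choices[0].message.content
try:
    survey_response = json.loads(raw)  # fails loudly if the model didn't return clean JSON
    print(survey_response["feedback"])
except json.JSONDecodeError:
    print("Model reply was not valid JSON; consider re-prompting:\n", raw)
```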
Adding Context for Even Better Results
We can further enhance our prompt by providing context about the course, which helps the AI generate more authentic-sounding feedback:
You are generating a synthetic response to a feedback survey for a Dataquest data science course
on [Python Data Cleaning]. The course covered techniques for handling missing values, dealing
with outliers, string manipulation, and data validation.
Generate a realistic survey response as a JSON object with these fields:
{
"confidence_rating": [1-7 scale, where 1 is not confident and 7 is very confident],
"overall_rating": [1-7 scale, where 1 is poor and 7 is excellent],
"feedback": [2-3 sentences of realistic course feedback that specifically mentions course content],
"technology": "Python",
"completed": [boolean: true or false],
"user_id": ["UID" followed by 5 random digits]
}
If the confidence_rating and overall_rating are high (5-7), make the feedback predominantly
positive with minor suggestions. If the ratings are medium (3-4), include a balance of positive
points and constructive criticism. If the ratings are low (1-2), focus on specific issues while
still mentioning at least one positive aspect.
This enhanced prompt:
- Provides specific context about the course content
- Guides the model to create feedback that references actual course topics
- Creates realistic correlation between ratings and feedback sentiment
- Fixes the technology field to match the course topic
This prompt represents another significant improvement by providing specific course context. By mentioning that it's a "Python Data Cleaning" course and detailing specific topics like "handling missing values" and "string manipulation," we're giving the AI concrete elements to reference in the feedback. The prompt also includes explicit guidance on how sentiment should correlate with numerical ratings, creating more realistic psychological patterns in the responses. The technology field is now fixed to match the course topic, ensuring internal consistency.
While this approach generates highly authentic individual responses, creating a complete survey dataset would require submitting similar prompts multiple times, once for each course technology (Python, SQL, R, etc.) you want to include.
This strategy offers several advantages:
- Each batch of responses can be tailored to specific course content
- You can control the distribution of technologies in your dataset
- You can vary the context details to generate more diverse feedback
However, there are also some limitations to consider:
- Generating large datasets requires multiple prompt submissions
- Maintaining consistent distributions across different technology batches can be challenging
- Each submission may have slightly different "styles" of feedback
- It's more time-consuming than generating all responses in a single prompt
For smaller datasets where quality and specificity matter more than quantity, this approach works well. For larger datasets, you might consider using the next prompt strategy, which generates multiple responses in a single query while still maintaining distribution control.
Generating Multiple Responses with Distribution Control
When building a synthetic dataset, we typically want multiple responses with a realistic distribution. We can guide this using our prompt:
Generate 10 synthetic responses to a Dataquest course feedback survey on Data Visualization
with Tableau. Format each response as a JSON object.
Distribution requirements:
- Overall ratings should follow a somewhat positively skewed distribution:
- Mostly 5-7
- Some 3-4
- Few 1-2
- Include at least one incomplete course response
- Ensure technology is set to "Tableau" for all responses
- Create a mix of confident and less confident learners
For each response, provide this structure:
{
"confidence_rating": [1-7 scale],
"overall_rating": [1-7 scale],
"feedback": [2-3 sentences of specific, realistic feedback mentioning visualization techniques],
"technology": "Tableau",
"completed": [boolean],
"user_id": ["UID" followed by 5 random digits]
}
Make each response unique and realistic, with feedback that references specific course content
but also includes occasional tangential comments about platform issues, requests for unrelated
features, or personal circumstances affecting their learning experience, just as real students
often do. For instance, some responses might mention dashboard design principles but then
digress into comments about the code editor timing out, requests for content on completely
different technologies, or notes about their work schedule making it difficult to complete
exercises.
This prompt:
- Requests multiple responses in one go
- Specifies the desired distribution of ratings
- Ensures variety in completion status
- Maintains consistency in the technology field
- Asks for domain-specific feedback
Try experimenting with different distribution patterns to see how AI models respond. For instance, you might request a bimodal distribution (e.g., ratings clustered around 2-3 and 6-7) to simulate polarized opinions or a more uniform distribution to test how your analysis handles diverse feedback.
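Once the model returns a batch, it's worth quickly checking that the ratings actually follow the distribution you asked for. A minimal sketch is shown below; the `tableau_batch.json` filename is just an assumption about where you saved the model's output.

```python
import json
from collections import Counter

# Hypothetical file containing the 10 generated Tableau responses (adjust the path)
with open("tableau_batch.json", "r") as f:
    responses = json.load(f)

# Tally overall ratings and print a simple text histogram
rating_counts = Counter(r["overall_rating"] for r in responses)
for rating in range(1, 8):
    count = rating_counts.get(rating, 0)
    print(f"{rating}: {'#' * count} ({count})")

# Check the completion mix as well
completed = sum(r["completed"] for r in responses)
print(f"Completed: {completed}/{len(responses)}")
```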
Prompt Debugging for Synthetic Data
Sometimes, our initial prompts don't produce the desired results. Here are common issues and fixes for synthetic data generation:
| Issue | Symptoms | Solution |
|---|---|---|
| Unrealistic distribution | AI generates mostly positive responses or a perfectly balanced distribution; missing natural variability in ratings; too symmetrical to be realistic | Explicitly specify the distribution pattern (e.g., "70% positive ratings (5-7), 20% neutral (3-4), 10% negative (1-2)"); request some outliers and unexpected combinations |
| Repetitive patterns in data | Similar phrasing across multiple responses; same examples or concepts repeatedly mentioned; identical sentence structures with only minor word changes; predictable positive/negative patterns | Explicitly request linguistic diversity in the prompt; break generation into smaller batches; provide examples of varied writing styles; request specific personality types for different respondents (e.g., "detailed technical feedback," "big-picture comments," "time-constrained learner") |
| Format inconsistencies | JSON format errors or inconsistent field names; missing brackets or commas; inconsistent data types | Provide an exact template with field names; use explicit instructions about the format; request validation of JSON syntax |
| Unrealistic correlations | Disconnected ratings and feedback; perfect correlation between metrics; contradictory data points | Explicitly instruct alignment between quantitative and qualitative data; request some noise in the correlations; specify expected relationships |
Building Your Complete Synthetic Survey Dataset
Now that we've explored different prompting strategies for generating synthetic survey data, let's bring all these techniques together to create a complete dataset that we'll use throughout the remainder of this tutorial.
Follow these steps to build a robust synthetic survey dataset:
- Define your dataset parameters:
- Decide how many responses you need (aim for 100-300 for meaningful analysis)
- Determine the distribution of technologies (e.g., 40% Python, 30% SQL, 20% R, etc.)
- Choose a realistic rating distribution (typically slightly positively skewed)
- Plan for completion rate (usually 70-80% complete, 20-30% incomplete)
- Create a master prompt template:
Generate {number} synthetic responses to a Dataquest course feedback survey on {course_topic}.
The course covered {specific_concepts}.
Distribution requirements:
- Overall ratings should follow this pattern: {distribution_pattern}
- Confidence ratings should generally correlate with overall ratings
- Include approximately {percent}% incomplete course responses
- Set technology to "{technology}" for all responses
For each response, provide this structure:
{
"confidence_rating": [1-7 scale],
"overall_rating": [1-7 scale],
"feedback": [2-3 sentences of specific, realistic feedback mentioning course concepts],
"technology": "{technology}",
"completed": [boolean],
"user_id": ["UID" followed by 5 random digits]
}
Make each response unique and realistic, with feedback that specifically references
content from the course. Ensure that feedback sentiment aligns with the ratings.
- Generate data in batches by technology:
- For each technology (Python, SQL, R, etc.), fill in the template with appropriate details
- Request 10-20 responses per batch to ensure quality and specificity
- Adjust distribution parameters slightly between batches for natural variation
- Validate and combine the data:
- Review each batch for quality and authenticity
- Ensure JSON formatting is correct
- Combine all batches into a single dataset
- Check for any duplicate user_id values and fix if necessary
- Save the combined dataset:
- Store the final dataset as a JSON file
- This file will be our reference dataset for all subsequent analysis steps
Using this structured approach ensures we create synthetic data that maintains a realistic distribution, contains course-specific feedback, and provides enough variation for meaningful analysis. The resulting dataset mimics what you might receive from an actual course survey while giving you complete control over its characteristics.
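If you'd rather script this batch workflow than paste prompts by hand, the sketch below strings the steps together: fill the master template per technology, call a model, de-duplicate user_id values, and save everything to survey_data.json. The OpenAI client, model name, and the example batch parameters are assumptions, and in practice you'd add the JSON validation covered later in Step 4.

```python
import json
from openai import OpenAI  # assumes the OpenAI SDK; swap in your own provider

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_responses(prompt: str) -> list:
    """Send one batch prompt to the model and parse its reply as a JSON array."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; use whichever you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(reply.choices[0].message.content)

# NOTE: if you paste the full master template here, double any literal JSON braces
# ({{ and }}) so str.format only substitutes the intended placeholders.
MASTER_TEMPLATE = (
    "Generate {number} synthetic responses to a Dataquest course feedback survey on {course_topic}. "
    "The course covered {specific_concepts}. "
    'Set technology to "{technology}" for all responses and return a JSON array.'
)

# Illustrative batch parameters; adjust counts and topics to your plan
batches = [
    {"number": 20, "course_topic": "Python Data Cleaning",
     "specific_concepts": "missing values, outliers, string manipulation", "technology": "Python"},
    {"number": 15, "course_topic": "SQL Fundamentals",
     "specific_concepts": "joins, aggregation, subqueries", "technology": "SQL"},
]

all_responses, seen_ids = [], set()
for params in batches:
    for response in generate_responses(MASTER_TEMPLATE.format(**params)):
        if response["user_id"] in seen_ids:  # skip duplicate user_ids across batches
            continue
        seen_ids.add(response["user_id"])
        all_responses.append(response)

with open("survey_data.json", "w") as f:
    json.dump(all_responses, f, indent=2)
```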
Give it a try! Modify the prompts provided to generate synthetic survey data for a course topic you're interested in. Experiment with different distribution patterns and see how the results change.
Step 2: Categorizing Open-Ended Feedback
Once we have our synthetic survey data, one of the most challenging aspects is making sense of the open-ended feedback. Let's use prompt engineering to categorize these responses into meaningful themes.
Setting Up the Categorization Task
Here's a basic prompt to categorize a single feedback response:
Categorize this course feedback into one or more relevant themes:
"I really enjoyed the practical exercises on SQL joins, but I wish there were more
real-world examples. The videos explaining the concepts were clear, but sometimes
moved too quickly. Overall, a good introduction to databases."
This prompt might work for a single response, but it lacks structure and guidance for consistent categorization. Let's improve it using few-shot prompting.
Few-Shot Prompting for Consistent Categorization
Categorize the following course feedback excerpts into these themes:
- Content Quality
- Exercise/Hands-on Practice
- Pace and Difficulty
- Technical Issues
- Instructional Clarity
- Career Relevance
For each theme identified in the feedback, include a brief explanation of why it fits that category.
Example 1:
Feedback: "The Python exercises were challenging but helpful. However, the platform kept
crashing when I tried to submit my solutions."
Categorization:
- Exercise/Hands-on Practice: Mentions Python exercises being challenging but helpful
- Technical Issues: Reports platform crashes during submission
Example 2:
Feedback: "The explanations were clear and I loved how the course related the SQL concepts
to real job scenarios. Made me feel more prepared for interviews."
Categorization:
- Instructional Clarity: Praises clear explanations
- Career Relevance: Appreciates connection to job scenarios and interview preparation
Now categorize this new feedback:
"I found the R visualizations section fascinating, but the pace was too fast for a beginner like
me. The exercises helped reinforce the concepts, though I wish there were more examples
showing how these skills apply in the healthcare industry where I work."
This improved prompt:
- Defines specific themes for categorization
- Provides clear examples of how to categorize feedback
- Demonstrates the expected output format
- Requests explanations for why each theme applies
Handling Ambiguous Feedback
Sometimes feedback doesn't clearly fall into predefined categories or might span multiple themes. We can account for this:
Categorize the following course feedback into the provided themes. If feedback doesn't fit
cleanly into any theme, you may use "Other" with an explanation. If feedback spans multiple
themes, include all relevant ones.
Themes:
- Content Quality (accuracy, relevance, depth of material)
- Exercise/Hands-on Practice (quality and quantity of exercises)
- Pace and Difficulty (speed, complexity, learning curve)
- Technical Issues (platform problems, bugs, accessibility)
- Instructional Clarity (how well concepts were explained)
- Career Relevance (job applicability, real-world value)
- Other (specify)
Example 1: [previous example]
Example 2: [previous example]
Now categorize this feedback:
"The SQL course had some inaccurate information about indexing performance. Also, the
platform logged me out several times during the final assessment, which was frustrating.
On the positive side, the instructor's explanations were very clear."
This approach handles edge cases better by:
- Allowing an "Other" category for unexpected feedback
- Explicitly permitting multiple theme assignments
- Providing clearer definitions of what each theme encompasses
Batch Processing with Structured Output
When dealing with many feedback entries, structured output becomes essential:
I have multiple course feedback responses that need categorization into themes. For each
response, identify all applicable themes and return the results in JSON format with
explanations for why each theme applies.
Themes:
- Content Quality
- Exercise/Hands-on Practice
- Pace and Difficulty
- Technical Issues
- Instructional Clarity
- Career Relevance
Example output format:
{
"feedback": "The example feedback text here",
"themes": [
{
"theme": "Theme Name",
"explanation": "Why this theme applies to the feedback"
}
]
}
Please categorize each of these feedback responses:
1. "The R programming exercises were well-designed, but I struggled to keep up with the
pace of the course. Some more foundational explanations would have helped."
2. "Great Python content with real-world examples that I could immediately apply at work.
The only issue was occasional lag on the exercise platform."
3. "The Power BI course had outdated screenshots that didn't match the current interface.
Otherwise, the instructions were clear and I appreciated the career-focused project at the end."
This format:
- Processes multiple responses efficiently
- Maintains consistent structure through JSON formatting
- Preserves the original feedback for reference
- Includes explanations for each theme assignment
Try this with your synthetic survey data and observe how different feedback patterns emerge.
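Once the model replies, you'll usually want to aggregate the theme assignments rather than read them one by one. Here's a small sketch that assumes the reply is a JSON array matching the example output format above; the placeholder strings stand in for real feedback and explanations.

```python
import json
from collections import Counter

# The model's reply to the batch-categorization prompt, assumed to be a JSON array
llm_output = """[
  {"feedback": "...", "themes": [{"theme": "Exercise/Hands-on Practice", "explanation": "..."},
                                  {"theme": "Pace and Difficulty", "explanation": "..."}]},
  {"feedback": "...", "themes": [{"theme": "Content Quality", "explanation": "..."},
                                  {"theme": "Technical Issues", "explanation": "..."}]}
]"""

categorized = json.loads(llm_output)

# Count how often each theme appears across all categorized responses
theme_counts = Counter(t["theme"] for entry in categorized for t in entry["themes"])
for theme, count in theme_counts.most_common():
    print(f"{theme}: {count}")
```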
Step 3: Sentiment Analysis and Feature Extraction
Beyond categorization, we often want to understand the sentiment of feedback and extract specific features or suggestions. Prompt engineering can help here too.
Basic Sentiment Analysis
Let's start with a simple sentiment prompt:
Analyze the sentiment of this course feedback on a scale of negative (-1) to positive (+1),
with 0 being neutral. Provide a brief explanation for your rating.
Feedback: "The Excel course covered useful functions, but moved too quickly and didn't
provide enough practice examples. The instructor was knowledgeable but sometimes
unclear in their explanations."
This works for basic sentiment, but we can enhance it for more nuanced analysis.
Multi-dimensional Sentiment Analysis
Perform a multi-dimensional sentiment analysis of this course feedback. For each aspect,
rate the sentiment from -2 (very negative) to +2 (very positive), with 0 being neutral.
Aspects to analyze:
- Overall sentiment
- Content quality sentiment
- Instructional clarity sentiment
- Exercise/practice sentiment
- Pace/difficulty sentiment
Feedback: "The SQL course contained comprehensive content and the exercises were
challenging in a good way. However, the instruction sometimes lacked clarity, especially
in the joins section. The pace was a bit too fast for someone new to databases like me."
Provide your analysis as a JSON object with each aspect's score and a brief explanation for
each rating.
This approach:
- Breaks sentiment into specific dimensions
- Uses a more granular scale (-2 to +2)
- Requests explanations for each rating
- Structures the output for easier processing
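To compare aspects across many responses, you can collect the numeric scores from each analysis into a DataFrame. The field names below are assumptions about how you've keyed the model's JSON output; only the scores are kept here, not the explanations.

```python
import pandas as pd

# Assumed shape of each multi-dimensional sentiment result (scores on a -2 to +2 scale)
analyses = [
    {"overall": 1, "content_quality": 2, "instructional_clarity": -1,
     "exercise_practice": 1, "pace_difficulty": -1},
    {"overall": 2, "content_quality": 2, "instructional_clarity": 1,
     "exercise_practice": 2, "pace_difficulty": 0},
]

sentiment_df = pd.DataFrame(analyses)
# Average score per aspect, highest first
print(sentiment_df.mean().sort_values(ascending=False))
```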
Feature Extraction for Actionable Insights
Beyond sentiment, we often want to extract specific suggestions or notable features:
Extract actionable insights and suggestions from this course feedback. Identify:
1. Specific strengths mentioned
2. Specific weaknesses or areas for improvement
3. Concrete suggestions made by the student
4. Any unique observations or unexpected points
Format the results as a structured JSON object.
Feedback: "The Python data visualization module was excellent, especially the Matplotlib
section. The seaborn examples were too basic though, and didn't cover complex multivariate
plots. It would be helpful if you added more advanced examples with real datasets from fields
like finance or healthcare. Also, the exercises kept resetting when switching between notebook
cells, which was frustrating."
This prompt targets specific types of information that would be valuable for course improvement.
Combined Analysis with Focused Extraction
For a comprehensive approach, we can combine sentiment, categorization, and feature extraction:
Perform a comprehensive analysis of this course feedback, including:
1. Overall sentiment (scale of -2 to +2)
2. Primary themes (select from: Content Quality, Exercise/Practice, Pace, Technical Issues, Instructional Clarity, Career Relevance)
3. Key strengths (list up to 3)
4. Key areas for improvement (list up to 3)
5. Specific actionable suggestions
Format your analysis as a structured JSON object.
Feedback: "The Tableau course provided a solid introduction to visualization principles, but
the instructions for connecting to different data sources were confusing. The exercises
helped reinforce concepts, though more complex scenarios would better prepare students
for real-world applications. I really appreciated the dashboard design section, which I've
already applied at work. It would be better if the course included more examples from
different industries and had a troubleshooting guide for common data connection issues."
This comprehensive approach gives you a rich, structured analysis of each response that can drive data-informed decisions.
Try applying these techniques to your synthetic data to extract patterns and insights that would be useful in a real course improvement scenario.
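When you run the combined analysis across a batch of responses, a short script can surface the feedback that most needs attention. The structure and key names below are illustrative; adjust them to match the JSON your prompts actually return.

```python
# Assumed structure for each combined-analysis result; key names are illustrative
combined_results = [
    {"feedback": "The Tableau course provided a solid introduction...",
     "overall_sentiment": 1,
     "primary_themes": ["Instructional Clarity", "Career Relevance"],
     "key_areas_for_improvement": ["Data source connection instructions"]},
    {"feedback": "The SQL course had some inaccurate information...",
     "overall_sentiment": -1,
     "primary_themes": ["Content Quality", "Technical Issues"],
     "key_areas_for_improvement": ["Fix indexing section", "Platform stability"]},
]

# Surface the lowest-sentiment responses and their requested improvements for follow-up
for result in sorted(combined_results, key=lambda r: r["overall_sentiment"])[:5]:
    print(result["overall_sentiment"], "|", "; ".join(result["key_areas_for_improvement"]))
```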
Step 4: Structured JSON Output Extraction
The final step in our workflow is to transform all our analyses into a structured JSON format that's ready for downstream processing, visualization, or reporting.
Defining the JSON Schema
First, let's define what we want in our output:
Convert the following course feedback analysis into a standardized JSON format with this schema:
{
"response_id": "UID12345",
"ratings": {
"confidence": 5,
"overall": 6
},
"course_metadata": {
"technology": "Python",
"completed": true
},
"content_analysis": {
"overall_sentiment": 0.75,
"theme_categorization": [
{"theme": "ThemeName", "confidence": 0.9}
],
"key_strengths": ["Strength 1", "Strength 2"],
"key_weaknesses": ["Weakness 1", "Weakness 2"],
"actionable_suggestions": ["Suggestion 1", "Suggestion 2"]
},
"original_feedback": "The original feedback text goes here."
}
Use this schema to format the analysis of this feedback:
"""
[Your feedback and preliminary analysis here]
"""
This schema:
- Preserves the original quantitative ratings
- Includes course metadata
- Structures the qualitative analysis
- Maintains the original feedback for reference
- Uses nested objects to organize related information
Handling JSON Consistency Challenges
While LLMs are powerful tools for generating and analyzing content, they sometimes struggle with maintaining perfect consistency in structured outputs, especially across multiple entries in a dataset. When values get mixed up or formats drift between entries, this can create challenges for downstream analysis.
A practical approach to address this limitation is to combine prompt engineering with light validation code. For example, you might:
from pydantic import BaseModel
from typing import Dict, List, Union

# Define your schema as a Pydantic model
class ThemeCategorization(BaseModel):
    theme: str
    confidence: float

class ContentAnalysis(BaseModel):
    overall_sentiment: float
    theme_categorization: List[ThemeCategorization]
    key_strengths: List[str]
    key_weaknesses: List[str]
    actionable_suggestions: List[str]

class SurveyResponse(BaseModel):
    response_id: str
    ratings: Dict[str, int]
    course_metadata: Dict[str, Union[str, bool]]
    content_analysis: ContentAnalysis
    original_feedback: str

# Validate and correct JSON output from the LLM
try:
    # Parse the LLM output (in Pydantic v2, use SurveyResponse.model_validate instead)
    validated_response = SurveyResponse.parse_obj(llm_generated_json)
    # Now you have a validated object with the correct types and structure
except Exception as e:
    print(f"Validation error: {e}")
    # Handle the error - could retry with a refined prompt
This validation step ensures that your JSON follows the expected schema, with appropriate data types and required fields. For multiple-choice responses or predefined categories, you can add additional logic to normalize values.
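For instance, if the model occasionally returns "powerbi" or " python " instead of the canonical labels, a small helper (purely illustrative, not part of Pydantic) can normalize the technology field before validation:

```python
# Canonical technology labels used in the survey schema
CANONICAL_TECHNOLOGIES = ["Python", "SQL", "R", "Excel", "Power BI", "Tableau"]

def normalize_technology(value: str) -> str:
    """Map loosely formatted technology names onto the canonical labels."""
    cleaned = value.strip().lower().replace(" ", "")
    for canonical in CANONICAL_TECHNOLOGIES:
        if cleaned == canonical.lower().replace(" ", ""):
            return canonical
    return "Not specified"  # fall back when the value doesn't match any known label

print(normalize_technology("powerbi"))   # -> "Power BI"
print(normalize_technology(" python "))  # -> "Python"
```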
For our tutorial purposes, we'll continue focusing on the prompt engineering aspects, but keep in mind that in production environments, this type of validation layer significantly improves the reliability of LLM-generated structured outputs.
Batch Processing Multiple Responses
When working with multiple survey responses, we can process them as a batch:
I have analyzed 3 course feedback responses and need them converted to a standardized
JSON format. Use this schema for each:
{
"response_id": "",
"ratings": {
"confidence": 0,
"overall": 0
},
"course_metadata": {
"technology": "",
"completed": true/false
},
"content_analysis": {
"overall_sentiment": 0.0,
"theme_categorization": [
{"theme": "", "confidence": 0.0}
],
"key_strengths": [],
"key_weaknesses": [],
"actionable_suggestions": []
},
"original_feedback": ""
}
Return an array of JSON objects, one for each of these feedback responses:
Response 1:
Response ID: UID12345
Confidence Rating: 6
Overall Rating: 7
Technology: Python
Completed: True
Feedback: "The Python data cleaning course was excellent. I particularly enjoyed the regex
section and the real-world examples. The exercises were challenging but doable, and I
appreciate how the content directly applies to my work in data analysis."
Sentiment: Very positive (0.9)
Themes: Content Quality, Exercise Quality, Career Relevance
Strengths: Regex explanation, Real-world examples, Appropriate challenge level
Weaknesses: None explicitly mentioned
[Continue with Responses 2 and 3...]
This approach:
- Processes multiple responses in one go
- Maintains consistent structure across all entries
- Incorporates all prior analysis into a cohesive format
Handling Edge Cases
Sometimes we encounter unusual responses or missing data. We can tell the AI how to handle these:
Convert the following feedback responses to our standard JSON format. For any missing or
ambiguous data, use these rules:
- If sentiment is unclear, set to 0 (neutral)
- If no strengths or weaknesses are explicitly mentioned, use an empty array
- If the technology is not specified, set to "Not specified"
- For incomplete responses (e.g., missing ratings), include what's available and set missing
values to null
"""
[Provide your edge case response data here]
"""
This ensures consistency even with imperfect data.
Validating Output Format
To ensure the JSON is valid and matches your schema, add a validation step:
After generating the JSON output, verify that:
1. All JSON syntax is valid (proper quotes, commas, brackets)
2. All required fields are present
3. Arrays and nested objects have the correct structure
4. Numeric values are actual numbers, not strings
5. Boolean values are true/false, not strings
If any issues are found, correct them and provide the fixed JSON.
"""
[Your JSON generation prompt here]
"""
This extra validation step helps prevent downstream processing errors.
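In code, the same idea can be expressed as a validate-and-retry loop: parse the reply, and if it isn't valid JSON, feed the parser error back into a follow-up prompt. The call_llm function below is a placeholder for whichever client you use (see the earlier OpenAI sketch).

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to your LLM provider and return its raw text reply."""
    raise NotImplementedError

def generate_valid_json(prompt: str, max_attempts: int = 3) -> dict:
    """Ask for JSON, re-prompting with the parser error whenever the reply doesn't parse."""
    last_error = None
    for attempt in range(max_attempts):
        if attempt == 0:
            raw = call_llm(prompt)
        else:
            raw = call_llm(
                f"{prompt}\n\nYour previous reply was not valid JSON ({last_error}). "
                "Return only corrected, valid JSON."
            )
        try:
            return json.loads(raw)  # succeeds only when the reply parses cleanly
        except json.JSONDecodeError as error:
            last_error = error      # feed the error message back into the retry prompt
    raise ValueError(f"No valid JSON after {max_attempts} attempts: {last_error}")
```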
Step 5: Practical Real-World Analysis Tasks
Now that we have structured data, let's explore how to use both prompt engineering and programming tools to analyze it. We'll demonstrate a complete workflow that combines AI-assisted analysis with code-based implementation.
Using AI to Plan Your Analysis Approach
Before getting into any code, we can use prompt engineering to help plan our analytical approach:
I have JSON-formatted survey data with feedback from different technology courses (Python,
SQL, R, etc.). Help me identify significant differences in sentiment, strengths, and
weaknesses across these course types.
Focus your analysis on:
1. Which technology has the highest overall satisfaction and why?
2. Are there common weaknesses that appear across multiple technologies?
3. Do completion rates correlate with overall ratings?
4. What are the unique strengths of each technology course?
Provide your analysis in a structured format with headings for each question, and include
specific evidence from the data to support your findings.
"""
[Your sample of JSON-formatted survey data here]
"""
This prompt:
- Defines specific analytical questions
- Requests cross-segment comparisons
- Asks for evidence-based conclusions
- Specifies a structured output format
Implementing the Analysis in Code
Once you have an analysis plan, you can implement it using Python. Here's how you might load and analyze your structured JSON data:
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the JSON data
with open('survey_data.json', 'r') as file:
    survey_data = json.load(file)

# Convert to a pandas DataFrame; json_normalize flattens nested objects into
# dot-separated column names (e.g., "ratings.overall", "course_metadata.technology")
df = pd.json_normalize(survey_data)

# Basic analysis: Ratings by technology
tech_ratings = df.groupby('course_metadata.technology')['ratings.overall'].agg(['mean', 'count', 'std'])
print("Average ratings by technology:")
print(tech_ratings.sort_values('mean', ascending=False))

# Correlation between completion and ratings
completion_corr = df.groupby('course_metadata.completed')['ratings.overall'].mean()
print("\nAverage rating by completion status:")
print(completion_corr)

# Sentiment analysis by technology
sentiment_by_tech = df.groupby('course_metadata.technology')['content_analysis.overall_sentiment'].mean()
print("\nAverage sentiment by technology:")
print(sentiment_by_tech.sort_values(ascending=False))
This code:
- Loads the JSON data into a pandas DataFrame
- Normalizes nested structures for easier analysis
- Performs basic segmentation by technology
- Analyzes correlations between completion status and ratings
- Compares sentiment scores across different technologies
Visualizing the Insights
Visualization makes patterns more apparent. Here's how you might visualize key findings:
# Set up the visualization style
plt.style.use('seaborn-v0_8-whitegrid')
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Rating distribution by technology
sns.boxplot(
    x='course_metadata.technology',
    y='ratings.overall',
    data=df,
    palette='viridis',
    ax=axes[0, 0]
)
axes[0, 0].set_title('Overall Ratings by Technology')
axes[0, 0].set_xlabel('Technology')
axes[0, 0].set_ylabel('Overall Rating (1-7)')

# Plot 2: Sentiment by technology
sns.barplot(
    x=sentiment_by_tech.index,
    y=sentiment_by_tech.values,
    palette='viridis',
    ax=axes[0, 1]
)
axes[0, 1].set_title('Average Sentiment by Technology')
axes[0, 1].set_xlabel('Technology')
axes[0, 1].set_ylabel('Sentiment Score (-1 to 1)')

# Plot 3: Completion correlation with ratings
sns.barplot(
    x=completion_corr.index.map({True: 'Completed', False: 'Not Completed'}),
    y=completion_corr.values,
    palette='Blues_d',
    ax=axes[1, 0]
)
axes[1, 0].set_title('Average Rating by Completion Status')
axes[1, 0].set_xlabel('Course Completion')
axes[1, 0].set_ylabel('Average Rating (1-7)')

# Plot 4: Theme frequency across all responses
# First, extract themes from the nested structure
all_themes = []
for response in survey_data:
    themes = [item['theme'] for item in response['content_analysis']['theme_categorization']]
    all_themes.extend(themes)

theme_counts = pd.Series(all_themes).value_counts()
sns.barplot(
    x=theme_counts.values,
    y=theme_counts.index,
    palette='viridis',
    ax=axes[1, 1]
)
axes[1, 1].set_title('Most Common Feedback Themes')
axes[1, 1].set_xlabel('Frequency')
axes[1, 1].set_ylabel('Theme')

plt.tight_layout()
plt.show()
This visualization code creates a 2x2 grid of plots that shows:
- Box plots of ratings distribution by technology
- Average sentiment scores across technologies
- Correlation between course completion and ratings
- Frequency of different feedback themes
Using R for Statistical Analysis
If you prefer R for statistical analysis, you can use similar approaches:
library(jsonlite)
library(dplyr)
library(ggplot2)
library(tidyr)

# Load JSON data
survey_data <- fromJSON("survey_data.json", flatten = TRUE)

# Convert to data frame
survey_df <- as.data.frame(survey_data)

# Analyze ratings by technology
tech_stats <- survey_df %>%
  group_by(course_metadata.technology) %>%
  summarise(
    mean_rating = mean(ratings.overall),
    count = n(),
    sd_rating = sd(ratings.overall)
  ) %>%
  arrange(desc(mean_rating))

print("Average ratings by technology:")
print(tech_stats)

# Create visualization
ggplot(survey_df, aes(x = course_metadata.technology, y = ratings.overall, fill = course_metadata.technology)) +
  geom_boxplot() +
  labs(
    title = "Overall Course Ratings by Technology",
    x = "Technology",
    y = "Rating (1-7 scale)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
SQL for Analyzing Structured Survey Data
For teams that store survey results in databases, SQL can be a powerful tool for analysis:
-- Example SQL queries for analyzing survey data in a database
-- Average ratings by technology
SELECT
technology,
AVG(overall_rating) as avg_rating,
COUNT(*) as response_count
FROM survey_responses
GROUP BY technology
ORDER BY avg_rating DESC;
-- Correlation between completion and ratings
SELECT
completed,
AVG(overall_rating) as avg_rating,
COUNT(*) as response_count
FROM survey_responses
GROUP BY completed;
-- Most common themes in feedback
SELECT
theme_name,
COUNT(*) as frequency
FROM survey_response_themes
GROUP BY theme_name
ORDER BY frequency DESC
LIMIT 10;
-- Strengths mentioned in high-rated courses (6-7)
SELECT
strength,
COUNT(*) as mentions
FROM survey_responses sr
JOIN survey_strengths ss ON sr.response_id = ss.response_id
WHERE sr.overall_rating >= 6
GROUP BY strength
ORDER BY mentions DESC;
Combining AI Analysis with Code
For a truly powerful workflow, you can use AI to help interpret the results from your code analysis:
I've analyzed my survey data and found these patterns:
1. Python courses have the highest average rating (6.2/7) followed by SQL (5.8/7)
2. Completed courses show a 1.3 point higher average rating than incomplete courses
3. The most common themes are "Content Quality" (68 mentions), "Exercise Quality" (52), and "Pace" (43)
4. Python courses have more mentions of "Career Relevance" than other technologies
- What insights can I derive from these patterns?
- What business recommendations would you suggest based on this analysis?
- Are there any additional analyses you would recommend to better understand
our course effectiveness?
This prompt combines your concrete data findings with a request for interpretation and next steps, leveraging both code-based analysis and AI-assisted insight generation.
Advanced Visualization Planning
You can also use AI to help plan more sophisticated visualizations:
Based on my survey analysis, I want to create an interactive dashboard for our course team.
The data includes ratings (1-7), completion status, technology types, and themes.
What visualization components would be most effective for this dashboard? For each chart type,
explain what data preparation would be needed and what insights it would reveal.
Also suggest how to visualize the relationship between themes and ratings; I'm looking for
something more insightful than basic bar charts.
This could lead to recommendations for visualizations like:
- Heat maps showing theme co-occurrence (see the sketch after this list)
- Radar charts comparing technologies across multiple dimensions
- Network graphs showing relationships between themes
- Sentiment flow diagrams tracking feedback across course modules
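To make the first of these ideas concrete, here's a minimal sketch of a theme co-occurrence heat map. It reuses the survey_data list loaded earlier in this step and assumes the structured JSON schema from Step 4; treat it as a starting point rather than a finished dashboard component.

```python
import itertools
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Count how often each pair of themes appears in the same response
pair_counts = {}
for response in survey_data:
    themes = sorted({item["theme"] for item in response["content_analysis"]["theme_categorization"]})
    for a, b in itertools.combinations(themes, 2):
        pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1

# Build a symmetric co-occurrence matrix
all_theme_names = sorted({t for pair in pair_counts for t in pair})
matrix = pd.DataFrame(0, index=all_theme_names, columns=all_theme_names)
for (a, b), count in pair_counts.items():
    matrix.loc[a, b] = count
    matrix.loc[b, a] = count

sns.heatmap(matrix, annot=True, fmt="d", cmap="viridis")
plt.title("Theme Co-occurrence Across Survey Responses")
plt.tight_layout()
plt.show()
```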
Try combining these analytical approaches with your synthetic data set. The structured JSON format makes integration with code-based analysis seamless, while prompt engineering helps with planning, interpretation, and insight generation.
Troubleshooting and Quick Fixes
As you work through this project, you may encounter some common challenges. Here's how to address them:
| Challenge | Symptoms | Quick Fix |
|---|---|---|
| JSON syntax errors | Missing commas or brackets; inconsistent quote usage; invalid nesting | Provide an exact template with sample values; ask for explicit validation of JSON syntax; for larger structures, break into smaller chunks |
| Repetitive or generic analysis | Similar feedback categorization across different responses; vague strengths/weaknesses; missing nuance in sentiment analysis | Request specific examples for each categorization; explicitly ask for unique insights per response; provide more context about what constitutes meaningful analysis |
| Unrealistic synthetic data | Too uniform or too random; lack of correlation between ratings and comments; generic feedback without specific course references | Specify distribution parameters and correlations; provide more context about course content; ask for feedback that references specific concepts from the course |
| Inconsistent categorization | Different terms used for similar concepts; overlapping categories; missing categories | Use few-shot examples to demonstrate desired categorization; provide explicit category definitions; use structured output with predefined category options |
Remember, troubleshooting often involves some iterative refinement. Start with a basic prompt, identify issues in the response, and then refine your prompt to address those specific issues.
Project Wrap-Up
Throughout this project, you've learned how to apply prompt engineering techniques to a complete survey data workflow:
- Generating synthetic data with realistic distributions and controlled variations
- Categorizing qualitative feedback into meaningful themes
- Analyzing sentiment and extracting features from text responses
- Creating structured JSON outputs ready for further analysis
- Performing analytical tasks on the processed data
The value of this approach extends far beyond just survey analysis. You now have a framework for:
- Creating practice datasets for any domain or problem you're exploring
- Automating routine analysis tasks that previously required manual review
- Extracting structured insights from unstructured feedback
- Standardizing outputs for integration with visualization tools or dashboards
Next Steps and Challenges
Ready to take your prompting skills even further? Try these advanced challenges:
Challenge 1: Multi-survey Comparison
Generate synthetic data for two different types of courses (e.g., programming vs. data visualization), and then create prompts to compare and contrast the feedback patterns. Look for differences in:
- Common strengths and weaknesses
- Sentiment distributions
- Completion rates
- Suggested improvements
Challenge 2: Custom Category Creation
Instead of providing predefined categories, create a prompt that asks the AI to:
- Analyze a set of feedback responses
- Identify natural groupings or themes that emerge
- Name and define these emergent categories
- Categorize all responses using this custom taxonomy
Compare the AI-generated categories with your predefined ones.
Challenge 3: Alternative Datasets
Apply the techniques from this tutorial to create synthetic data for different scenarios:
- Customer product reviews
- Employee satisfaction surveys
- User experience feedback
- Event evaluation forms
Adapt your prompts to account for the unique characteristics of each type of feedback.
Final Thoughts
The ability to generate, analyze, and extract insights from survey data is just one application of effective prompt engineering. The techniques you've learned here—structured outputs, few-shot prompting, context enrichment, and iterative refinement—can be applied to countless data analysis scenarios.
As you continue to explore prompt engineering, remember the goal isn’t to craft a perfect prompt on the first try. Instead, focus on communicating your needs clearly to AI, iterating based on results, and building increasingly sophisticated workflows that leverage AI as a powerful analysis partner.
With these skills, you can accelerate your data analysis work, explore new datasets more efficiently, and extract deeper insights from qualitative information that might otherwise be challenging to process systematically.
What dataset will you create next?