How to Gather Your Own Data by Conducting a Great Survey
- Selecting a population
- Sampling methods
- Making a data analysis plan
- Writing good questions
- Distribution options
Data Scientists know that even the slickest code, the best data analysis, and the most beautiful visualizations are worthless if the data they're based on is unsound. So how can we ensure that the data from our survey is accurate and meaningful? It's a process, but let's start with a quick test: can you spot the problems with these real survey questions?
Do you think most politicians are dishonest or are they like other people?
❍ Yes ❍ No
When was the last time you upgraded your computer and printer?
❍ Years ago ❍ Months ago ❍ A few weeks ago ❍ A few days ago
These survey questions are hard to interpret, and annoying to answer. What if readers don’t think politicians are dishonest or like other people? And what if readers upgraded their computer a few days ago, but don’t have a printer? We’re going to design a survey that avoids making mistakes like those.
Online surveys and the survey process
Data scientists are interested in understanding what people think and do, and surveys are one of the most straightforward ways to collect that kind of data. Traditional methods for administering surveys include telephone interviews, mail surveys, and personal interviews. But today, online surveys are one of the most popular ways to collect information about people, because they offer several advantages:
- Easy to create and administer: There are many tools available that make it easy to design and distribute your survey.
- Reach a large number of people: There are no limitations such as printing costs and time spent on distribution.
- Wide geographical reach: We can send our survey to people all over the world who have access to the internet.
- High response speed: It's easy for respondents to complete — there's no need to print the survey, fill it out, and return it by mail.
- Direct data entry: This reduces time spent importing data and data errors, which is a major benefit for data scientists.
Nothing is perfect, of course, and online surveys have disadvantages as well:
- Low response rates: Only a fraction of the people exposed to a survey online are likely to click through and fill in the whole thing (although this can be mitigated somewhat with good survey design and interesting questions).
- Samples do not always represent the population: Online surveys only reach people with access to the internet.
This post focuses on online surveys as a method, but it's important to confirm that an online survey is the best method for your research objectives. If you work with a small population, face-to-face interviews may be better. With a very large population, look for methods that let you draw a probability sample, since probability samples support the widest range of data analysis options. All survey types, including online surveys, follow the same steps:
1. Determine your objectives.
2. Select respondents.
3. Create a data analysis plan.
4. Develop the survey.
5. Pre-test the survey.
6. Distribute and conduct the survey.
7. Analyze the data.
8. Report the results.
Let’s walk through the survey process step-by-step.
Step one: Determine your objectives
Each survey starts with a purpose or topic, which needs to be broken down into objectives. Our objectives should be clearly defined, because they inform our questions and data analysis. Make sure objectives are specific, measurable, and tied to actions (such as a pricing strategy or a marketing campaign). Let’s look at an example:
Purpose of the survey: To assess employees’ opinions about allowing dogs in the office.
- Objective 1: Assess opinions about having dogs in the office.
- Objective 2: Find out if employees are allergic to dogs.
- Objective 3: Assess opinions about spending approximately $100 per month on dog treats.
- Objective 4: Determine if people own dogs and how many.
To define clear objectives, we can research literature, ask others for input, and do qualitative research. Qualitative research, such as face-to-face interviews, helps us to learn more about the problem we’re researching and its context. And asking others for input and researching literature may help us quickly spot issues, target topics for further research, or discover existing datasets that might inform our analysis.
Step two: Population and sampling
Defining the population
The group of individuals we’re trying to research by conducting a survey is called the population. In our dogs-in-office example, the population is all employees of the company. We’ll start off by generating a list of all population members, which we call a sampling frame. In our dog example, we could get a list of employee names and email addresses from HR. Survey companies like SurveyMonkey and Google Surveys can also help you find respondents.
Respondents or participants are the people who’ll actually take part in our research project. If we send all employees (the entire population) an invitation to participate in the survey, we are conducting a census. But it is often not possible to collect data on the entire population — for example, when the research is about consumer habits of teenagers in the United States.
To solve this, we can select a small group: a sample. This is not ideal, but if we draw a sample that is very similar to our population (i.e., statistically representative of the population), we can draw conclusions about the population based on the sample we surveyed. Let's assume the company in our example is very big. We’ll probably need to draw a sample since surveying the entire company would be too difficult. That means we’ll need to make sure the sample is not biased (we don’t want to pick only people who like cats, for example), so let's look at how to draw a good sample in the next section.
Sampling methods for online surveys
Discussing all sampling methods is outside the scope of this article, so we’ll focus on a selection of sampling methods for online surveys, as recommended by Sue and Ritter in their excellent book about online surveys. Data scientists usually work with probability samples. To draw a probability sample, we need a clearly defined sampling frame, such as a list of employee names. Probability samples depend on random selection, which means that every individual in the population has the same chance of being selected. Random selection allows us to draw statistical conclusions about the population. Some probability sampling methods include:
- Simple random sampling and systematic random sampling, in which everyone in the population has the same chance of being selected.
- Stratified sampling and cluster sampling, which allow us to select people from different sub-groups, such as age groups or genders.
(How to draw probability samples in practice is beyond the scope of this article, but this course can help you learn more about that topic.) There are also non-probability sampling techniques commonly used for online surveys, but they’re not ideal, because they can’t be used to generalize about the population. For example, convenience sampling is a method that places no restrictions on who can participate in the research.
A common way to use this method is to post a survey link on a website. This may produce a large number of responses, but with no control over who responds, it’ll be difficult for us to draw meaningful conclusions from that data. Another technique, intercept sampling, invites website visitors to take a survey through a pop-up window — for example, to give feedback about their experience. Again, this isn’t ideal: we won’t be able to generalize about the population, and it strongly limits the statistical methods we can use when analyzing our data.
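To make this concrete, here's a minimal sketch in Python (using pandas, with a hypothetical employee frame — the emails and departments are made up) of drawing a simple random sample and a stratified sample:

```python
import pandas as pd

# Hypothetical sampling frame: the employee list we got from HR
frame = pd.DataFrame({
    "email": [f"employee{i}@example.com" for i in range(1000)],
    "department": ["engineering"] * 500 + ["sales"] * 300 + ["support"] * 200,
})

# Simple random sample: every employee has the same chance of selection
simple_sample = frame.sample(n=100, random_state=42)

# Stratified sample: draw 10% from each department, so every
# sub-group is represented in proportion to its size
stratified_sample = frame.groupby("department").sample(frac=0.10, random_state=42)

print(stratified_sample["department"].value_counts())
# engineering 50, sales 30, support 20
```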
Sources of error and sample size
We want our sample to resemble the population. However, there is always some difference between the data we would gather from the full population and from our sample, which we call sampling error. Sampling error is low when the sample closely resembles the population we're surveying. The margin of error and the confidence level reflect how well a sample represents a population: the larger the margin of error, the less confidence we should have that the survey’s results represent the population well. Researchers usually decide on a confidence level up front, reflecting how much the sample results are allowed to differ from what the full population would report.
One way to decrease the margin of error is to increase the sample size. That same Dataquest course can help you measure how well your sample represents your population and what sample size you should use; the sketch below shows the standard calculation. Keep in mind that non-probability samples need different sample-size guidelines than probability samples.
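Here's a rough sketch of the standard sample-size calculation for estimating a proportion, with a finite population correction. The function name and numbers are illustrative, not a substitute for a fuller treatment:

```python
import math

def required_sample_size(population_size, margin_of_error=0.05, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion at ~95% confidence (z=1.96).

    p=0.5 is the most conservative assumption (it maximizes the sample size).
    """
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2       # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population_size))  # finite population correction

# For a 2,000-employee company and a 5% margin of error:
print(required_sample_size(2000))  # 323
```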
Selecting a sample doesn't necessarily mean we know who will actually participate. Nonresponse error is the error introduced when some of the people we selected never fill out the survey. Measuring nonresponse is important, as it affects whether the sample represents the population well. There will also be people who start filling out the survey but don't complete it or who skip questions — we call them dropouts. Analyzing dropouts and skipped questions will be important later, when we're testing our survey, because they can indicate problems with our questions — such as the content, length, question phrasing, or a software error — that will impact our results.
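As a quick illustration of measuring these effects (the counts below are made up), response and dropout rates are simple ratios we can track for every survey:

```python
invited = 400     # sample members who received an invitation
started = 180     # opened the survey and answered at least one question
completed = 150   # reached the final page

response_rate = completed / invited     # 0.375
dropout_rate = 1 - completed / started  # ~0.17 of starters dropped out

print(f"Response rate: {response_rate:.0%}, dropout rate: {dropout_rate:.0%}")
# Response rate: 38%, dropout rate: 17%
```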
Step three: Create a data analysis plan
Before designing our survey, we need an analysis plan. This ensures we think through everything we want to analyze, and how we can get statistically representative results that support that analysis. The easiest way to create an analysis plan is to write out each survey objective and how we plan to analyze it. For example (see the sketch after this list for the analysis in code):
Objective 4: Determine if people own dogs and how many
- Survey question: How many dogs does your household have?
- Answer options: 0, 1, 2, 3, 4, 5, 6+.
- Level of measurement: Ratio data.
- Analysis plan: Calculate minimum, maximum, mean.
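Once responses are in, following that plan takes a few lines of pandas. A minimal sketch with made-up answers — note that we have to decide how to code the open-ended top category "6+" (here, as 6, which slightly understates the mean):

```python
import pandas as pd

# Hypothetical answers to "How many dogs does your household have?"
answers = pd.Series(["0", "0", "1", "2", "0", "1", "3", "6+", "0", "1"])

# Code "6+" as 6 so we can compute numeric summaries
dogs = answers.str.replace("+", "", regex=False).astype(int)

print(dogs.min(), dogs.max(), round(dogs.mean(), 1))  # 0 6 1.4
```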
We’ll want to keep referring back to our analysis plan as we write survey questions, to make sure we’re collecting the right data for the desired analysis. Let's take a moment to look at level of measurement, which determines the nature of the information we’re collecting. There are four levels of data we can gather: nominal, ordinal, interval, and ratio.
These levels are hierarchical, with ratio as the highest level of data we can collect. Nominal data is information that is not associated with numerical values and is not ordered in any way.
For example, asking about gender, favorite types of dogs or someone’s ethnicity:
What is your ethnicity?
□ Asian/Pacific Islander □ African American □ White □ Hispanic □ Native American/Native Alaskan □ Other
Ordinal data can be ordered, but the distances between the ranks aren't equal. Examples include level of education, or experience buying a product:
Are you satisfied with your experience purchasing a product on our website?
❍ Very Unsatisfied ❍ Somewhat Unsatisfied ❍ Neutral ❍ Somewhat Satisfied ❍ Very Satisfied
Ratio data are numerical values, with interpretable distances and an absolute zero. An example is age: we know that a 50-year-old is five times as old as a 10-year-old.
What is your age?
..... years old.
Interval data are the same as ratio, but without the absolute zero. For example, temperature measured in Celsius, where the zero point is arbitrary, and doesn’t actually mean the absence of anything.
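Level of measurement maps directly onto how we store and analyze the data. A small pandas sketch (the values are invented): nominal data becomes an unordered categorical, ordinal data an ordered categorical, and ratio data a plain numeric column:

```python
import pandas as pd

# Nominal: categories with no inherent order — count them, don't average them
ethnicity = pd.Categorical(["Asian/Pacific Islander", "White", "Other"])

# Ordinal: ordered categories, but the distances between ranks aren't equal
satisfaction = pd.Categorical(
    ["Neutral", "Very Satisfied", "Somewhat Unsatisfied"],
    categories=["Very Unsatisfied", "Somewhat Unsatisfied", "Neutral",
                "Somewhat Satisfied", "Very Satisfied"],
    ordered=True,
)

# Ratio: numeric values with a true zero — means and ratios are meaningful
age = pd.Series([24, 37, 51])

print(satisfaction.max())  # Very Satisfied — ordering comparisons work
print(age.mean())          # 37.33... — averaging is valid at this level
```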
Step four: Designing the survey
By now, we have determined our objectives, population, sampling strategy, survey method, and analysis plan. These will all help with the next step: writing questions and designing our survey. Most surveys collect three types of information:
- Demographics of respondents, including age, gender, income, and level of education, which can be used to describe the respondents and compare groups of respondents.
- Quantifiable information we can analyze statistically.
- Open-ended questions that allow respondents to add qualitative data.
Questions can be asked in an open-ended or closed-ended format. Open-ended questions let respondents write their own answers rather than choose from predefined response options. They produce responses that are difficult to analyze, and they can be a hurdle for respondents that increases dropout — for those reasons, experts recommend using them sparingly. For example:
What do you like best about your job? …………………………………………………………………
Closed-ended questions provide response options from which respondents choose. In contrast to open-ended questions, closed-ended questions are easy to answer and allow us to conduct statistical analysis on the results. Formats of closed-ended questions include:
- Multiple-choice questions.
- Dichotomous questions: Provide two options, such as “yes” and “no”, sometimes with a third option such as “don’t know”.
- Rankings: Allow you to rank how important you think something is, compared to other options.
- Rating scales: Allow you to indicate how strongly you agree with something or rate something.
Let’s look at some examples:
Multiple choice question:
What is your employment status at our company?
□ Full-time □ Part-time □ Contractor □ Other
Dichotomous question:
Do you like dogs?
❍ Yes ❍ No
Ranking question:
Please rank the following in order of importance, from 1 to 4, where 1 is most important.
Clean office ☐ Lively office ☐ Quiet office with no distractions ☐ Fun office ☐
Rating scale question:
Indicate to what extent you agree with this statement: Having dogs around improves my feelings of happiness.
❍ Strongly disagree ❍ Disagree ❍ Neither agree nor disagree ❍ Agree ❍ Strongly Agree
With all types of questions, it is important that our answer options are exhaustive, which is why survey questions often include “Other” and “I don’t know”. These options are worth including, but use them with caution: try to cover all realistic answers explicitly, so that respondents don’t use “Other” or “I don’t know” to avoid a question.
Answer options also need to be mutually exclusive, which means they do not overlap with one another. Most surveys also use contingency questions — follow-up questions that apply only to respondents who gave a particular answer — which allow everyone else to skip ahead. Online survey tools can skip these questions automatically based on a previous answer.
For example:
Contingency questions:
Do you own a dog?
❍ Yes ❍ No
Would you like to occasionally bring your dog to work?
❍ Yes ❍ No
In a well-designed online survey, users who answered "No" to the first question would not be asked to answer the second question. Don’t forget, our questions should be informed by our objectives and analysis plan.
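Once responses come back, we can sanity-check the skip logic in code. A minimal sketch, with hypothetical column names:

```python
import pandas as pd

responses = pd.DataFrame({
    "owns_dog":      ["Yes", "No", "Yes", "No"],
    "bring_to_work": ["Yes", None, "No",  "Yes"],  # last row violates skip logic
})

# Respondents who answered "No" should never see the follow-up question
violations = responses[(responses["owns_dog"] == "No")
                       & responses["bring_to_work"].notna()]
print(len(violations))  # 1 — a sign the survey's skip logic isn't working
```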
Writing good questions
It takes effort to write good survey questions — questions that are easy to understand and meaningful to your respondents. Questions that are hard to understand lead to annoyance and dropouts, and uninteresting or irrelevant questions cause “non-attitudes,” where participants guess or pick the “don’t know” option over and over. The introduction to this post showed examples of bad survey questions: questions that are hard to interpret and answer. Here is the first of them again:
Do you think most politicians are dishonest or are they like other people?
❍ Yes ❍ No
This question is:
- Double-barreled: It really consists of two questions ("Do you think politicians are honest?" and "Do you think politicians are like other people?"), which makes it difficult to answer fully.
- Loaded and leading: It implies that "other people" are typically honest, suggesting that if we think politicians aren't like other people then we must also believe they're dishonest.
- Ambiguous: It is open to interpretation.
Overall, this question is annoying to answer, and annoying questions can cause dropouts and non-responses. We don't want that. This was the second “bad” survey question from the introduction:
When was the last time you upgraded your computer and printer?
❍ Years ago ❍ Months ago ❍ A few weeks ago ❍ A few days ago
This question is hard to answer, because it assumes we own both a computer and printer and that we upgrade both at similar times. If we do not own one of these, or we do own them but have only upgraded one, it's not clear which answer we should choose. Good survey questions are:
- Self-explanatory: They should be easy to understand and answer.
- Unambiguous: There should be only one way to interpret or understand the question.
- Meaningful to our respondents.
- Exhaustive.
- Mutually exclusive.
- Short.
- Free of jargon.
- Visually appealing.
- Without absolutes: We should provide a variety of answer options and avoid words like “always” or “ever” in our questions.
- Matching our respondents' language: We should use language our respondents will understand, and explain any technical terminology that's needed.
Let’s rewrite the two “bad” survey questions from the introduction:
Bad survey question:
Do you think most politicians are dishonest or are they like other people?
❍ Yes ❍ No
Good survey question:
Indicate to what extent you agree with this statement: “I think politicians who are currently in office in the United States are dishonest”.
❍ Strongly disagree ❍ Disagree ❍ Neither agree nor disagree ❍ Agree ❍ Strongly Agree
Bad survey question:
When was the last time you upgraded your computer and printer?
❍ Years ago ❍ Months ago ❍ A few weeks ago ❍ A few days ago
Good survey questions (split into two questions):
When was the last time you upgraded your computer?
❍ Never ❍ Years ago ❍ Months ago ❍ A few weeks ago ❍ A few days ago ❍ I don’t own a computer
When was the last time you upgraded your printer?
❍ Never ❍ Years ago ❍ Months ago ❍ A few weeks ago ❍ A few days ago ❍ I don’t own a printer
Asking valid questions
We want our survey questions to measure what they are supposed to measure. If we're trying to measure whether people upgrade their computers, and people answer the “bad” survey question with “Years ago” because they do not own a printer and never upgrade their computer, we haven't actually measured anything.
Also, we need to make sure we're collecting accurate information. There are several reasons why respondents do not answer questions accurately. As Schaeffer and Presser point out, people may not always have accurate memories of the events relevant to your survey. To minimize this effect, we should try to ask respondents to recall only events that happened recently, and we should try to ask about specific events (rather than a range of events). Participants might also answer questions based on what they believe is expected from them, or according to social norms. We call this social desirability bias.
But there's good news here: according to Kreuter et al., in contrast to other survey methods, online surveys actually increase the reporting of sensitive information. So this bias won't impact our survey as strongly as it might a phone survey, for example. Even so, all survey types risk social desirability bias.
We can reduce this bias by emphasizing anonymity and confidentiality, and phrasing the questions in a way that makes the answers seem acceptable whenever there might be a social reason to lie.
For example:
Bad survey question:
Do you ever illegally exceed the speed limit?
Better survey question:
Have you driven faster than the speed limit in the last week? (Reminder: responses to this survey are anonymous)
Participants may also respond inaccurately because they don't know how to answer, or because they answer questions they don't actually have opinions on. It's important to look for this while testing your survey questions and adjust them accordingly: ask respondents how they interpreted a question, or why they selected a certain option. Their answers may reveal these sorts of problems.
Formatting the survey
Online survey tools make it easy to design a survey and make it appealing. However, there are a few things to pay attention to. As Mick P. Couper points out in his book Designing Effective Web Surveys, the design of online surveys is very important because there is no interviewer to compensate for poor design and explain what is meant.
If we decide to use a start page for our survey, we'll need to make sure we introduce the survey in an interesting way. We should also mention how long it takes to complete, and thank participants for taking the time to fill it out. Emphasize anonymity and confidentiality. Keep the introduction short. Question order should be based on what is interesting to our respondents (and not just to us), because we want them to fill out the entire survey.
Have you ever done a survey that made you feel like you were answering endless demographic questions? This is a known annoyance. Keep surveys short and end with demographic questions instead of starting with them. We can use any number of different formats and styles, but whatever we choose needs to be conventional and readable (in terms of colors, fonts and size). It also needs to include clear instructions. If we're using ratings and scales, they should also be clear and conventional. These are some well-known response formats:
- Small circles (radio buttons): These usually indicate you can select one option in a multiple choice question.
- Check boxes: These usually mean you can select all options that apply.
- Drop-down menus: These usually show the text “select one” so that respondents know they can scroll down for options. Note that drop-downs are less user-friendly, since a user needs to click/tap and scroll to see all the possible answers. It's best to use drop-downs only when we must include a long list of answers (for example, asking for country of residence).
- Text boxes: These are used for answering open-ended questions.
- Rank options: These allow respondents to add 1, 2, 3, 4, 5 etc. to rank preferences.
It is important that our survey is accessible to everyone in our sample, including people with disabilities, slow internet connections, and different kinds of devices. The extent to which this affects us will depend on our sample. Different language options and sound options are common features we may want to consider adding.
Step five: Pre-testing
Just because we've written our survey questions doesn't mean the survey is ready to go! The next step is testing to be sure that we'll get the quality of responses we're looking for. Pre-testing can reveal numerous problems, such as badly-phrased questions and missing response categories. First, we can do some tests by ourselves:
- Check different computers, tablets, phones and browsers to make sure everything works well. For example: Do rating scales work on a mobile phone?
- Check whether any contingency questions are skipping automatically.
- Be sure images, graphs, sounds, links and charts work and load.
- Look for style issues. Can you scroll down? Is the font easy to read?
Second, we're going to assemble a small group of people who haven’t seen the survey before and let them take it, without offering any help or clarification if they have questions. Afterward, we can talk to them and look at their responses to see if there are any problem areas in our survey that need correcting.
Finally, test the survey with a larger group of people and track how they complete it. We'll want to measure the time it takes to complete individual questions and the entire survey, and we'll be looking for where people become "dropouts" and which questions they skip. There are analytics tools that can help with that, and the sketch below shows the idea. All of these testing phases should offer useful insights that we can use to improve the quality of our survey. After making changes, we'll continue the testing cycle until all of the issues testing uncovered have been resolved. Then it's finally time for...
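If the survey tool exports per-respondent data (the column names below are hypothetical), a quick sketch like this can surface slow questions and frequent skips:

```python
import pandas as pd

pretest = pd.DataFrame({
    "respondent": [1, 2, 3, 4],
    "seconds_total": [310, 95, 540, 280],
    "q3_answer": ["Yes", None, None, "No"],  # None means the question was skipped
})

print("Median completion time:", pretest["seconds_total"].median(), "seconds")
print("Q3 skip rate:", pretest["q3_answer"].isna().mean())  # 0.5 — worth a closer look
```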
Step six: Conducting the survey
There are different ways to distribute a survey:
- Send it by email.
- Embed or advertise it on a website or app, using a link.
- Use pop-up windows on a website or app.
- Post it on social media.
Each of these methods is tied to different sampling strategies, and each will affect the data analysis options we have once we get the results. Online tools we can use to help design and distribute our survey include SurveyMonkey, Google Surveys, and Typeform. These tools can help with most steps in the survey process, including building the sampling frame, designing the survey, advertising it, and doing simple data analysis. We can improve response rates to our survey in several ways:
- Use an appealing invitation to (or advertisement of) the survey.
- Keep the survey short.
- Offer respondents something in exchange for filling out the survey, such as a discount on a product (but note that incentives can bias who responds).
- Send follow-up invitations to remind people to fill out your survey.
Step seven: Data analysis
Most survey tools include basic statistics and simple data analysis options. But this is Dataquest and we're Data Scientists! We'll definitely want to download the data and conduct our own analysis, as this provides us with more possibilities. One thing we don't want to forget as we're crunching our numbers: we should also analyze the way respondents took the survey.
We should track where people dropped out, how long it took them to complete the survey, etc., and use these insights to augment our data analysis. We should also analyze why data is missing (if any is missing), and look at outliers to see how these impact the results. Specific ways to do this are outside of the scope of this article, but Dataquest courses can help you with those skills.
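A minimal sketch of those checks in pandas, assuming we've exported responses to a CSV (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # hypothetical export from our survey tool

# How much data is missing, per question?
print(df.isna().mean().sort_values(ascending=False))

# Flag completion-time outliers with the 1.5 * IQR rule
q1, q3 = df["seconds_total"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["seconds_total"] < q1 - 1.5 * iqr)
              | (df["seconds_total"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} respondents finished suspiciously fast or slow")
```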
Step eight: Reporting results
The last step is to report survey results. Depending on the purpose of our survey, we might do things like:
- Write a report.
- Present the results in a meeting.
- Use the results as part of a larger research project.
If we write a report, we'll want to define its audience beforehand and write the report with them in mind. For example, a busy executive will probably want a quick bird's-eye-view summary of the most pertinent and actionable information, while other audiences might want a much more in-depth, question-by-question analysis. In any kind of reporting, we'll want to keep in mind that numbers can seem abstract and difficult to grasp for some people, so we should use visual aids (like graphs and charts) to present our results.
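For instance, here's a minimal matplotlib sketch of one question's results (the counts are made up):

```python
import matplotlib.pyplot as plt

options = ["0", "1", "2", "3+"]
counts = [48, 27, 11, 4]  # hypothetical response counts

plt.bar(options, counts)
plt.title("How many dogs does your household have?")
plt.xlabel("Number of dogs")
plt.ylabel("Respondents")
plt.tight_layout()
plt.show()
```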
Conclusion
Online surveys are an effective way to collect data for a data science project. However, careful attention needs to be paid to each step in the survey process to ensure that the survey will produce useful, valid, representative response data. Here's a quick review of the steps for reference:
1. Determine your objectives.
2. Select respondents.
3. Create a data analysis plan.
4. Develop the survey.
5. Pre-test the survey.
6. Distribute and conduct the survey.
7. Analyze the data.
8. Report the results.
Resources
- Chris Albon, Jonathon Morgan & Vidya Spandana (2017), “Your Surveys Suck”, Partially Derivative.
- Mick P. Couper (2008), Designing Effective Web Surveys, Cambridge University Press.
- Stanley Presser, Mick P. Couper, Judith T. Lessler, Elizabeth Martin, Jean Martin, Jennifer M. Rothgeb & Eleanor Singer (2004), “Methods for Testing and Evaluating Survey Questions”, Public Opinion Quarterly.
- Frauke Kreuter, Stanley Presser & Roger Tourangeau (2008), “Social Desirability Bias in CATI, IVR, and Web Surveys: The Effects of Mode and Question Sensitivity”, Public Opinion Quarterly.
- Nora Cate Schaeffer & Stanley Presser (2003), “The Science of Asking Questions”, Annual Review of Sociology.
- Norman M. Bradburn, Seymour Sudman & Brian Wansink (2004), Asking Questions: The Definitive Guide to Questionnaire Design, Jossey-Bass.
- Valerie M. Sue & Lois A. Ritter (2007), Conducting Online Surveys, Sage Publications.