
Data Cleaning: How to Avoid Bad Data & What to Do After Data Has Been Collected

 

In survey research, the quality of data collected directly impacts the validity and usefulness of insights. Collecting bad or biased data leads to faulty conclusions, which can skew business decisions, research findings, or public policies. One of the most critical steps in ensuring high-quality data is implementing a thorough data cleaning process. This begins with the survey design itself and continues through data collection and post-survey processing.

Part 1: Survey Design to Avoid Bad Data

Effective survey design is the foundation for generating clean and reliable data. By carefully structuring your survey and crafting your questions, you can avoid common pitfalls that lead to messy or untrustworthy data. Below are some key strategies to help ensure you generate clean data from the start:

1. Use Screening Questions

Screening questions are vital for ensuring that only respondents who meet the necessary criteria participate in your survey. Whether you're targeting specific demographics, behaviors, or opinions, asking effective qualifying questions early in the survey will help to filter out those who don’t meet your sample criteria, improving the relevance and quality of your data. This also prevents out-of-scope or noisy data from respondents who aren’t part of your target audience.

2. Incorporate Attention Checks

Attention checks are essential in modern survey design. They ensure respondents are engaging with the survey and answering questions thoughtfully. A well-placed attention check, such as “Please select ‘Agree’ for this statement,” helps to screen out careless respondents. Including at least two attention-check questions can help flag participants who might be randomly clicking through without paying attention.
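As a rough sketch in Python (the question keys, expected answers, and failure threshold below are illustrative assumptions, not a fixed standard), flagging respondents who miss one or more attention checks might look like this:

```python
# Sketch: flag respondents who miss attention checks. The question keys,
# expected answers, and failure threshold are illustrative assumptions.
ATTENTION_KEYS = {"ac1": "Agree", "ac2": "Somewhat disagree"}

def failed_checks(response: dict) -> int:
    """Count how many attention checks a respondent missed."""
    return sum(response.get(q) != expected
               for q, expected in ATTENTION_KEYS.items())

respondents = [
    {"id": 1, "ac1": "Agree", "ac2": "Somewhat disagree"},     # passes both
    {"id": 2, "ac1": "Disagree", "ac2": "Somewhat disagree"},  # misses one
]
flagged = [r["id"] for r in respondents if failed_checks(r) >= 1]
```

Some teams remove respondents on a single miss, while others require two or more; the right threshold depends on how strict and how numerous your checks are.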

3. Balanced Scale Questions

Balanced scales (e.g., strongly disagree to strongly agree) ensure that both positive and negative options are equally represented, which helps avoid bias in responses. Using scales that are appropriate to the question—such as satisfaction or interest—ensures that the questions resonate logically with respondents, leading to more reliable answers.

It’s also important to decide whether to include a midpoint (e.g., a neutral option) in your scale. Midpoints can be helpful when respondents genuinely feel neutral, but including one also provides an easy "escape route" for indecisive participants. Consider your research objectives when deciding whether or not to include this option.

4. Avoid Long Batteries and Repetitive Sections

Survey fatigue is real, and long batteries of questions or repetitive sections increase the chances of respondent disengagement. Keep your survey concise and avoid redundant or overly long questions. Respondents tend to lose focus when faced with long, repetitive question sets, which can lead to straight-lining (choosing the same response for every question) and diminish the quality of your data. For more on how question types influence respondent engagement, see our article on survey length perspectives.

5. Ensure Mobile Friendliness

Many respondents will take your survey on a mobile device, so it's essential to design with mobile compatibility in mind. Long questions and intricate layouts can be difficult to navigate on smaller screens, leading to dropout or poorly-considered responses. Make sure your survey platform is responsive, and test the survey on multiple devices to ensure a seamless experience for all participants.

6. Set Min/Max Responses for Numeric Entry

For questions requiring numeric entry, establish reasonable minimum and maximum values. This prevents respondents from entering extreme, nonsensical, or accidental numbers (such as typing "999" as a filler for an income question). Also compare related numeric questions for consistency: if a respondent purchased multiple brands, the amount they report spending on a single brand should be no higher than what they report spending across all brands. Setting realistic response ranges will help reduce noise in your dataset and make it easier to spot genuine outliers during the cleaning process.
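Both checks can be sketched in a few lines of Python. The range bounds and the spend comparison below are illustrative assumptions; your survey platform may enforce these rules natively at entry time.

```python
# Sketch: validate a numeric entry against a plausible range, and check
# consistency between related questions. Bounds are illustrative.
def validate_numeric(answer, lo, hi):
    """Return True if the entry parses as a number within [lo, hi]."""
    try:
        value = float(answer)
    except (TypeError, ValueError):
        return False
    return lo <= value <= hi

def consistent_spend(brand_spend, total_spend):
    """Spend on a single brand should not exceed total spend."""
    return brand_spend <= total_spend
```

Enforcing these rules at entry time (rather than only during cleaning) is usually preferable, since it gives the respondent a chance to correct a typo on the spot.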

Part 2: Post-Collection Data Cleaning

After survey data has been collected, the cleaning process becomes critical to removing bad data and ensuring that your dataset is ready for analysis. Here are essential steps to follow during this stage:

1. Remove Straight-Liners

One of the first tasks in cleaning data is identifying and removing straight-liners—respondents who give the same answer to every question. This could indicate that they weren’t paying attention or didn’t engage thoughtfully with the survey. A good way to catch straight-liners is by checking for identical responses across all scale questions or calculating the variance in grid questions.
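A minimal sketch of this check in Python, where a straight-liner is defined as a respondent whose answers to a grid battery contain only one distinct value (respondent IDs and answers below are illustrative):

```python
# Sketch: flag straight-liners across a grid battery. A respondent whose
# answers contain only one distinct value (equivalently, zero variance)
# gave the same response to every item. Data is illustrative.
grid = {
    "r1": [5, 5, 5, 5],   # identical answers across the battery
    "r2": [3, 2, 4, 3],
    "r3": [4, 4, 1, 2],
}
straight_liners = [rid for rid, answers in grid.items()
                   if len(set(answers)) == 1]
```

In practice you may want a softer rule as well, such as flagging respondents who straight-line most (but not all) of several batteries, since sophisticated low-effort respondents occasionally vary a single answer.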

2. Remove Speeders

Check the time respondents took to complete the survey. If a participant finished far faster than the average time (or in an unreasonably short time), it’s likely they didn’t engage meaningfully with the survey. Filtering out these responses will improve the quality of your dataset.
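A sketch of a speeder check, assuming completion times in seconds; the cutoff of one-third of the median is a common heuristic, not a fixed rule, and should be tuned to your survey's length:

```python
import statistics

# Sketch: flag respondents whose completion time falls below a fraction
# of the median. The 1/3 cutoff is a common heuristic, not a fixed rule.
times = {"r1": 610, "r2": 180, "r3": 540, "r4": 705}  # seconds, illustrative

median_time = statistics.median(times.values())
cutoff = median_time / 3
speeders = [rid for rid, t in times.items() if t < cutoff]
```

Using the median rather than the mean makes the cutoff robust to a few respondents who left the survey open for hours.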

3. Check IP Addresses and Duplicates

Sometimes, respondents may attempt to take the survey multiple times to game the system or skew results. By tracking IP addresses, you can flag and remove duplicate responses. Be cautious when dealing with public or shared networks (like a workplace), as multiple respondents might share the same IP. Still, combining IP tracking with other checks like response patterns and completion time can help you identify malicious duplicates.
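A sketch of the duplicate-IP flag (record IDs and addresses below are illustrative); note that this only flags candidates for review, since shared networks can legitimately produce repeated IPs:

```python
from collections import Counter

# Sketch: flag respondents whose IP appears more than once. Flagged
# records should be reviewed alongside other signals (response patterns,
# completion time) before removal, since shared networks are legitimate.
records = [
    {"id": "r1", "ip": "198.51.100.7"},
    {"id": "r2", "ip": "203.0.113.44"},
    {"id": "r3", "ip": "198.51.100.7"},  # same IP as r1
]
ip_counts = Counter(r["ip"] for r in records)
duplicates = [r["id"] for r in records if ip_counts[r["ip"]] > 1]
```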

4. Flag Outliers in Numeric Responses

For questions that require numeric input, identifying and addressing outliers is critical. One common technique is to flag responses that fall more than two or three standard deviations from the mean, or outside 1.5 times the interquartile range (IQR). You can also look at the 1st and 99th percentiles to identify extreme values. Once flagged, these outliers can either be removed or examined for further context.

Outliers aren't necessarily bad data—they might represent genuine responses from a particular subset of your sample. But it’s essential to check them closely to determine whether they’re plausible and if the respondent has other bad responses before deciding to exclude them.
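The standard-deviation version of this check can be sketched as follows; the data and the multiplier of two are illustrative, and the threshold should be tuned to the question at hand:

```python
import statistics

# Sketch: flag values more than k standard deviations from the mean
# (k = 2 here). The data and threshold are illustrative; flagged values
# should be reviewed in context, not automatically deleted.
values = [12, 15, 14, 13, 16, 14, 95]

mean = statistics.mean(values)
sd = statistics.stdev(values)
flagged = [v for v in values if abs(v - mean) > 2 * sd]
```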

5. Analyze Open-Ended Responses

Open-ended questions provide valuable qualitative insights but can also be a source of bad data. It's important to review these responses for nonsensical or low-effort answers (like single words, irrelevant phrases, or random characters). If a respondent has filled in open-ended questions with spam or inappropriate content, this is often a red flag that they weren’t engaging with the survey meaningfully, and you might consider removing their entire response.
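Simple heuristics can surface candidates for review; the rules below (a minimum length and a no-vowels check for keyboard mashing) are illustrative sketches, and a human should confirm the flag before a respondent is removed:

```python
import re

# Sketch heuristics for low-effort open ends: very short answers, or
# keyboard-mash strings with no vowels. Rules are illustrative; human
# review should confirm flagged answers before removal.
def looks_low_effort(text: str) -> bool:
    cleaned = text.strip()
    if len(cleaned) < 5:                          # single words / empty
        return True
    if not re.search(r"[aeiouAEIOU]", cleaned):   # e.g. "sdfgh jkl"
        return True
    return False

answers = ["The checkout flow was confusing on mobile.", "asdf", "n/a"]
flagged = [a for a in answers if looks_low_effort(a)]
```

More elaborate approaches (duplicate-text detection across respondents, language models scoring relevance) exist, but even these simple rules catch a surprising share of spam.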

6. Review Bad or Questionable Data in Context

Sometimes, bad data is a result of poor survey design rather than respondent behavior. If a large number of participants provide questionable responses to a particular question, it might indicate that the question was poorly phrased, confusing, or biased. If this is the case, you’ll need to carefully review whether the data from that question can be used. In extreme cases, you may have to discard the data from the entire question or section.

Be mindful of the possibility that some questions could unintentionally lead respondents toward a particular answer, introduce confusion, or cause frustration. In these cases, understanding the root cause can help inform future survey design to avoid repeating the same mistakes.

7. Advanced Data Cleaning

While the above steps rely on algorithms and manual review to flag bad respondents, IntelliSurvey’s in-house tool CheatSweep™ goes beyond conventional checks by using digital fingerprinting, which applies rules based on respondent IP and device data. It also scans behavioral patterns on each survey page (clicking, typing, etc.).

CheatSweep can automatically remove respondents with a high fraud probability, return their fraud status to sample vendors without manual intervention, and keep them from counting toward quotas, resulting in faster and smoother fielding.

Conclusion

Clean data is the backbone of reliable research, and data cleaning is a critical step in ensuring the integrity of your insights. By prioritizing survey design with attention checks, balanced scales, and mobile compatibility, you can reduce the chances of bad data creeping in. On the back end, thorough data cleaning—removing straight-liners, flagging outliers, analyzing open-ended responses, and reviewing your survey design—will help you eliminate poor-quality data before moving forward with analysis.

Ultimately, a well-structured and carefully cleaned dataset enables you to draw valid and actionable conclusions from your survey research, setting the foundation for confident decision-making based on real insights. If you need support, IntelliSurvey’s in-house research team is well-versed in ensuring the quality and reliability of our clients’ survey data. For more information on how we can assist with your next study, please contact us today.
