In a lot of our work, we use social media data as a feature to predict something else, such as internet access gender gaps, poverty rates, or migration rates. As such, we are often OK with “just” observing correlations, as long as those correlations add predictive power over baselines such as econometric data, satellite imagery, or … Because the predictive power might change over time, predictive models that use social media data, in combination with other data sources, require regular re-validation.
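The "does it add predictive power over a baseline?" test can be sketched roughly as follows. This is a minimal illustration with synthetic data, not our actual pipeline: the feature names, coefficients, and model choice (plain linear regression with cross-validation) are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500

# Hypothetical baseline covariates (stand-ins for e.g. satellite-derived features).
baseline = rng.normal(size=(n, 3))
# Hypothetical social-media-derived feature (e.g. an ad-audience ratio).
social = rng.normal(size=(n, 1))

# Synthetic target that depends on both feature sets plus noise.
y = (baseline @ np.array([0.5, -0.3, 0.2])
     + 0.8 * social[:, 0]
     + rng.normal(scale=0.5, size=n))

# Cross-validated R^2 with and without the social media feature.
r2_baseline = cross_val_score(LinearRegression(), baseline, y,
                              cv=5, scoring="r2").mean()
r2_combined = cross_val_score(LinearRegression(),
                              np.hstack([baseline, social]), y,
                              cv=5, scoring="r2").mean()
print(f"baseline R^2: {r2_baseline:.2f}, combined R^2: {r2_combined:.2f}")
```

Re-validation then amounts to re-running this comparison on fresh ground truth at regular intervals, since the gap between the two scores can shrink or vanish as platform usage shifts.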
Note that in our work, noise such as users lying about their characteristics or fake accounts is not, per se, an issue, as long as its prevalence is reasonably consistent. More of a challenge are patterns that cannot be modeled, such as a large share of fake accounts in one country and almost none in another. Those create biases that are hard to correct for.
Researchers in other domains validate data at the individual level but, typically, we do not have individual-level ground truth and so cannot apply that methodology.