This post is a response to an article that a friend, Stacey Burkett, posted on her Facebook Wall: http://timesofindia.indiatimes.com/Life/Relationships/Man-Woman/Women-are-most-attractive-at-31/articleshow/6187549.cms. When my Wall comment grew longer than even some of my longer blog posts, I decided to make it an actual blog post.

Figure: The F-test (Fisher-Snedecor distribution) compares elements of variation in data; it is the basis of ANOVA, one of the most common statistical tests.
This post is also relevant to Statistics Canada’s recent move to eliminate the long-form census, which is issued to 20% of the population every five years and must be completed under penalty of law, and to replace it with an entirely voluntary one. The same principles apply, although I will dedicate another blog post to the Statistics Canada census issue, both to illustrate the principles in another application and to respond more precisely to specific issues.
Surveys and statistics are used to describe all of us, all the time. Marketing researchers use them to define and characterize target markets, and psychologists use them to determine the impact of certain personality traits on job performance. A survey can characterize a target population of interest, with known precision, without requiring a census. (“Census” is a formal survey design term that refers to measuring every member of a population. A “sample,” by contrast, is a smaller subset of the population, selected such that its results can be generalized to represent the entire population in cases where it is impractical or impossible to conduct a census. Population size is not the only thing that limits our ability to conduct a meaningful census; the simple fact that not every individual relevant to a survey will be alive at the same time can make a true census impossible.)
LIES, DAMN LIES, AND STATISTICS
“Lies, damn lies, and statistics” is a reference to the abuse of statistics to support a position. I feel that this cliché has itself been abused, resulting in the unnecessary maligning of statistics, which is actually an extremely powerful tool not only for characterizing populations or phenomena but also for predicting events with known confidence. (Probit models and logistic regression, for example, take categorical or numeric predictor variables that describe a customer segment, such as age, income level, and preferences, and calculate the probability that customers will purchase a new product.) I am even more dismayed at the cynicism that has come to surround statistics. The cliché describes the abuse of statistics, intentional or unintentional, out of context or inappropriately, with intent to influence rather than inform. Yet most people, even those who have taken an introductory statistics course in university (perhaps especially those, since most people are thoroughly confused and intimidated by the subject after an introductory course), do not understand the theoretical bases of statistical techniques well enough to see the power in them. Our world is not a deterministic place; even the most reliable process will occasionally yield unexpected results. Thus, it is vitally important that we can quantify the likelihood that an observation is truly a characteristic result of a phenomenon and state how confident we are that a given observation was not the result of random chance.
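To make the logistic regression and probit idea a little more concrete, here is a minimal Python sketch. The coefficients and the function names are hypothetical, invented purely for illustration; a real model would estimate its coefficients from actual purchase data, and nothing here comes from the QVC survey.

```python
import math

# Hypothetical coefficients, invented for illustration only.
# A real model would estimate these from observed purchase data.
INTERCEPT = -4.0
COEF_AGE = 0.05          # per year of age
COEF_INCOME = 0.02       # per $1,000 of annual income
COEF_LIKES_BRAND = 1.2   # 1 if the customer says they like the brand, else 0


def linear_score(age, income_thousands, likes_brand):
    """Combine numeric and categorical predictors into a single score z."""
    return (INTERCEPT
            + COEF_AGE * age
            + COEF_INCOME * income_thousands
            + COEF_LIKES_BRAND * (1 if likes_brand else 0))


def logistic_probability(z):
    """Logistic regression link: P = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def probit_probability(z):
    """Probit link: P = standard normal CDF evaluated at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    z = linear_score(age=40, income_thousands=50, likes_brand=True)
    print("logistic:", logistic_probability(z))
    print("probit:  ", probit_probability(z))
```

Both links turn the same score into a probability between 0 and 1; they differ only in the curve used to do so. In practice the two models would be fitted separately, so sharing one set of coefficients here is a simplification.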

Figure: Binary outcomes can be modeled with the probit model.
The rest of this post pertains to the article that my friend, Stacey, posted to Facebook: http://timesofindia.indiatimes.com/Life/Relationships/Man-Woman/Women-are-most-attractive-at-31/articleshow/6187549.cms. The article claims that a survey of 2,000 men and women, administered by QVC, an American shopping channel, established that “females in their early thirties are seen as more attractive than younger girls as they are more confident and stylish.” Although the article is clearly intended as a lighthearted attempt to console QVC’s aging customers by presenting findings that run contrary to what most people would expect, published statistics have a way of turning up in support of opinions to which they do not legitimately apply. I believe that the results of the QVC survey should always be accompanied by a disclaimer outlining the specific population for which the findings can be considered valid; otherwise, such a finding would surely, eventually, be used to justify discrimination or to disadvantage individuals unfairly.
Having just completed RSCH 8200 in my third straight quarter of research design coursework, I find survey design (specifically, internal and external validity, and reliability) quite fresh in my mind. Especially now that the “Long-form Census” is so prominent in the news, I thought a quick run-down on sampling and survey design would help us all put the discussions we hear into perspective. I have found that many of the arguments presented by “experts” in the news are not compelling to someone trained in statistics, because they often take an extreme position that is not necessarily relevant, such as implying to the public that there are no methods to quantify how relevant certain findings are to the general public or how consistently people will give a response.
Critical thinkers who think ahead but lack discipline in their problem solving would, by now, want to ask, “How do you quantify attractiveness? What is considered ‘attractive’ versus ‘unattractive’? Is a sample of 2,000 adequate to establish this?”
Statistics that lack meaning are the origin of the saying, “There are lies, damn lies, and statistics,” which much maligns what is actually the best quantitative tool we have to evaluate results and estimate our confidence in them. For survey results to have any meaning at all, it is important to keep the following considerations in mind (assuming all other aspects of the design are done “by the book”):
EXTERNAL VALIDITY / GENERALIZABILITY:
External validity describes how generalizable the findings are. Do they apply only to native residents of Toronto? To all of Ontario? To Eastern Canada? To evaluate this in a social sciences setting, we use our understanding of the underlying theory to inform our design: how attractive a person is judged to be may be affected by any or all of physiology, culture, how cosmopolitan someone’s hometown was, any specific traumatic or pleasurable experiences a person has had, and so on.
SAMPLING STRATEGY
The choice of sampling strategy (random sampling, stratified sampling, purposeful non-probability sampling, etc.) must suit the population that the sample is meant to represent. Most people intuitively know that a sample must be random in order to be representative. But randomness is not the only important consideration. If the population QVC wishes to characterize comprises 80% women and 20% men, then the 2,000-subject sample should comprise 1,600 women and 400 men; a 50/50 mix would not be representative of its customer base if gender contributes to aesthetic preference. Similarly, if men and women of different ages tend to have different preferences, the sample should also contain age groups in proportions similar to those of the population of interest.
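The 80/20 arithmetic above is what survey designers call proportional allocation. Here is a small Python sketch of it; the function name and the way leftover seats are assigned (largest fractional part first) are my own illustrative choices, not something taken from the QVC survey.

```python
import math


def proportional_allocation(population_shares, sample_size):
    """Split a sample across strata in proportion to their population shares.

    population_shares maps a stratum name to its share of the population
    (shares are assumed to sum to 1). Returns a dict of whole-number counts
    that add up to sample_size.
    """
    # Ideal (possibly fractional) number of respondents per stratum.
    raw = {name: share * sample_size
           for name, share in population_shares.items()}
    # Start from the whole-number part of each ideal count.
    counts = {name: math.floor(value) for name, value in raw.items()}
    # Hand any remaining seats to the strata with the largest fractional parts.
    leftover = sample_size - sum(counts.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    print(proportional_allocation({"women": 0.8, "men": 0.2}, 2000))
```

With shares of 80% women and 20% men and a sample of 2,000, this returns 1,600 women and 400 men, matching the example above. The same function could be applied to age groups within each gender if age also affects preference.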
SAMPLE SIZE
Whenever poll results are presented in the news, particularly during governmental elections, we are used to hearing “Poll is accurate to within 3.1% 19 times out of 20.” This means that the sample was large enough for us to be 95% confident that the results from the poll’s sample are within 3.1 percentage points of the true results for the entire population. To choose a sample size that achieves this, we need to consider what kind of statistical analyses will be done (this determines what measures will be relevant), the size of the effect being studied (stronger effects generally need smaller samples to be detected with confidence), and the desired confidence (in most cases, 95%). Software such as G*Power 3 (http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/) can be used to calculate the sample size required for a given effect size, statistical test, significance level, and statistical power. In doctoral research, dissertation committees and Institutional Review Boards will generally require justification of the proposed sample size in order to assess and minimize the burden on test subjects.
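For a simple proportion (for example, the share of respondents who pick a given answer), the “3.1%, 19 times out of 20” figure can be reproduced with the standard normal-approximation formula, margin = z * sqrt(p(1 - p) / n). The sketch below assumes a simple random sample, the worst-case proportion p = 0.5, and z = 1.96 for 95% confidence; the function names are my own, and this is not a substitute for a proper power analysis in a tool such as G*Power.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Margin of error for a proportion from a simple random sample.

    proportion=0.5 is the worst case (widest margin);
    z=1.96 corresponds to 95% confidence ("19 times out of 20").
    """
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)


def required_sample_size(margin, proportion=0.5, z=1.96):
    """Smallest sample size whose margin of error does not exceed `margin`."""
    return math.ceil(z ** 2 * proportion * (1.0 - proportion) / margin ** 2)


if __name__ == "__main__":
    print("n needed for a 3.1% margin:", required_sample_size(0.031))
    print("margin for n = 2,000:", round(margin_of_error(2000), 4))
```

Under these assumptions, a sample of about 1,000 gives the familiar 3.1%, and a sample of 2,000 gives roughly 2.2%. Note that this says nothing about whether the 2,000 respondents were representative; that is a matter of sampling strategy, not sample size.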
INTERNAL VALIDITY:
There are several facets to internal validity, but in general, they all pertain to the question, “How well does this survey actually measure the underlying construct that I intend to measure?” For example, face validity assesses how good a question is as a proxy for what you actually want to know. “What was your last grade completed?” is less valid than “How many years of school did you complete?” if you want to compare amount of education to job performance without regard to the level of the schooling, because people who have skipped a grade would have completed a higher grade level in the same number of years of schooling. Content validity describes how completely the survey covers the factors that contribute to a phenomenon. In the above example, a survey that records only the number of years of schooling, without regard to level, would lack content validity in describing the relationship between job performance and education completed, because 8 years of education between Grade 1 and Grade 8 would not have the same results as 8 years of education between Grade 9 and the completion of a bachelor’s degree.
RECOMMENDATION: ALWAYS ACCOMPANY RESULTS WITH EXPLICIT DISCLAIMER
While it is clear that QVC intended this survey as a lighthearted consolation to its aging customers, presenting results that seem to run contrary to what most people might guess, statistics have a way of turning up in support of controversial opinions for which the sample used is not valid. In order to minimize harm to individuals and to mitigate the maligning of statistics, QVC should accompany its results with an explicit explanation of the population to which its findings can legitimately be applied.