Statistics for Dummies

Authors: Deborah Jean Rumsey

Chapter 3: Tools of the Trade

In today's numbers explosion, the buzzword is data, as in, "Do you have any data to support your claim?" "What data do you have on this?" "The data supported the original hypothesis that …", "Statistical data show that …", and "The data bear this out …." But the field of statistics is not just about data. Statistics is the entire process involved in gathering evidence to answer questions about the world, in cases where that evidence happens to be numerical data.

In this chapter, you see firsthand how statistics works as a process and where the numbers play their part. You also get in on the most commonly used forms of statistical jargon, and you find out how these definitions and concepts all fit together as part of that process. So, the next time you hear someone say, "This survey had a margin of error of plus or minus 3 percentage points", you'll have a basic idea of what that means.

Statistics: More than Just Numbers

Most statisticians don't want statistics to be thought of as "just statistics." While the rest of the world views them as such, statisticians don't think of themselves as number crunchers; more often, they think of themselves as the keepers of the scientific method. (Of course, statisticians depend on experts in other fields to supply the interesting questions, because man cannot live by statistics alone.) The scientific method (asking questions, doing studies, collecting evidence, analyzing that evidence, and making conclusions) is something you may have come across before, but you may also be wondering what this method has to do with statistics.

All research starts with a question, such as:

Is it possible to drink too much water?
What's the cost of living in San Francisco?
Who will win the next presidential election?
Do herbs really help maintain good health?
Will my favorite TV show get renewed for next year?

None of these questions asks anything directly about numbers. Yet each question requires the use of data and statistical processes to come up with the answer.

Suppose a researcher wants to determine who will win the next U.S. presidential election. To answer this question with confidence, the researcher has to follow several steps:

Determine the group of people to be studied. In this case, the researcher would use registered voters who plan to vote in the next election.
Collect the data. This step is a challenge, because you can't go out and ask every person in the United States whether they plan to vote, and if so, for whom they plan to vote. Beyond that, suppose someone says, "Yes, I plan to vote." Will that person really vote come Election Day? And will that same person tell you for whom he or she actually plans to vote? And what if that person changes his or her mind later on and votes for a different candidate?
Organize, summarize, and analyze the data. After the researcher has gone out and gotten the data that she needs, getting it organized, summarized, and analyzed helps the researcher answer her question. This is what most people recognize as the business of statistics.
Take all the data summaries, the charts and graphs, and the analyses, and draw conclusions from them to try to answer the researcher's original question. Of course, the researcher will not be able to have 100% confidence that her answer is correct, because not every person in the United States was asked. But she can get an answer that she can be nearly 100% sure is the correct answer. In fact, with a sample of about 2,500 people who are selected in a fair and unbiased way (so that each member of the population has an equal chance of being selected), the researcher can get accurate results, within plus or minus 2.5% (that is, if all of the steps in the research process are done correctly).
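
To get a feel for how sample size connects to "plus or minus" figures like the one above, here is a minimal Python sketch. It assumes the common textbook approximation for the margin of error of a sample proportion at roughly 95% confidence. That formula is not given in this chapter, and the function name and its parameters (n, p, z) are my own choices for illustration.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    n: sample size
    p: assumed proportion (0.5 is the most conservative choice)
    z: critical value (1.96 corresponds to roughly 95% confidence)
    """
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(2500)
print(f"n = 2500: plus or minus {moe * 100:.2f} percentage points")
```

Under these assumptions, 2,500 people gives roughly plus or minus 2 percentage points, which is in the same ballpark as the figure quoted above. I can't say which formula or confidence level that quoted figure is based on, so treat the comparison as approximate.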

HEADS UP

In making conclusions, the researcher has to be aware that every study has limits and that, because there is always a chance for error, the results could be wrong. A numerical value can be reported that tells others how confident the researcher is about the results, and how accurate these results are expected to be. (See Chapter 10 for more information on margin of error.)

REMEMBER

After the research is done and the question has been answered, the results typically lead to even more questions and even more research. For example, if men appear to favor Miss Calculation but women favor her opponent, the next questions could be, "Who goes to the polls more often on Election Day — men or women — and what factors determine whether they will vote?"

The field of statistics is really the business of using the scientific method to answer research questions about the world. Statistical methods are involved in every step of a good study, from designing the research to collecting the data to organizing and summarizing the information to doing an analysis, drawing conclusions, discussing limitations and, finally, to designing the next study in order to answer new questions that arise. Statistics is more than just a number; it's a process!

Grabbing Some Basic Statistical Jargon

Every trade has a basic set of tools, and statistics is no different. If you think about the statistical process as a series of stages that one goes through to get from a question to an answer, you may guess that at each stage, you'll find a group of tools and a set of terms (or statistical jargon) to go along with it. Now if the hair is beginning to stand up on the back of your neck, don't worry. No one is asking you to become a statistics expert and plunge into the heavy-duty stuff, and no one is asking you to become a statistics nerd and use this jargon all the time. And you don't have to carry a calculator and pocket protector in your front left pocket like statisticians do, either.

But as the world becomes more numbers-conscious, statistical terms are thrown around more in the media and in the workplace, so knowing what the language really means can give you a leg up. Also, if you're reading this book because you want to find out more about how to calculate some simple statistics, understanding some of the basic jargon is your first step. So, in this section, you get a taste of statistical jargon; I send you to the appropriate chapters elsewhere in the book to get details.

Population

For virtually any question that you may want to investigate about the world, you have to center your attention on a particular group of individuals (for example, a particular group of people, cities, animals, rock specimens, exam scores, and so on). For example:

What do Americans think about the president's foreign policy?
What percentage of the planted crops in Wisconsin were destroyed by deer last year?
What's the prognosis for breast cancer patients taking a new experimental drug?
What percentage of all toothpaste tubes get filled according to their specifications?

In each of these examples, a question is posed. And in each case, you can identify a specific group of individuals who are being studied: the American people, all planted crops in Wisconsin, all breast cancer patients, and all toothpaste tubes that are being filled, respectively. The group of individuals that you wish to study in order to answer your research question is called a population. Populations, however, can be hard to define. In a good study, researchers define the population very clearly, while in a bad study, the population is poorly defined.

A question such as whether babies sleep better with music is a good example of how difficult defining the population can be. Exactly how would you define a baby? Under 3 months old? Under a year old? And do you want to study babies only in the United States, or do you want to study all babies worldwide? The results may be different for older and younger babies, for American versus European versus African babies, and so on.

HEADS UP

Many times, researchers want to study and make conclusions about a broad population, but in the end, to save time or money, or just because they don't know any better, they study only a narrowly defined population. That can lead to big trouble when conclusions are drawn. For example, suppose a college professor wants to study how TV ads persuade consumers to buy products. Her study is based on a group of her own students who participated in order to get five points extra credit (you know you're one of them!). This may be a convenient sample, but her results can't be generalized to any population beyond her own students, because no other population was represented in her study.

Sample

When you sample some soup, what do you do? You stir the pot, reach in with a spoon, take out a little bit of the soup, and taste it. Then you draw a conclusion about the whole pot of soup, without actually having tasted all of it. If your sample is taken in a fair way (for example, you didn't just grab all the good stuff) you will get a good idea how the soup tastes without having to eat it all. This is what's done in statistics. Researchers want to find out something about a population, but they don't have time or money to study every single individual in the population. So what do they do? They select a small number of individuals from the population, study those individuals, and use that information to draw conclusions about the whole population. This is called a sample.

Sounds nice and neat, right? Unfortunately, it's not. Notice that I said select a sample. That sounds like a simple process, but in fact, it isn't. The way a sample is selected from the population can mean the difference between results that are correct and fair and results that are garbage. As an example, suppose you want to get a sample of teenagers' opinions on whether they're spending too much time on the Internet. If you send out a survey over e-mail, your results won't represent the opinions of all teenagers, which is your intended population. They will represent only those teenagers who have Internet access. Does this sort of statistical mismatch happen often? You bet.

HEADS UP

One of the biggest culprits of statistical misrepresentation caused by bad sampling is surveys done on the Internet. You can find thousands of examples of surveys on the Internet that are done by having people log on to a particular Web site and give their opinions. Even if 50,000 people in the United States complete a survey on the Internet, it doesn't represent the population of all Americans. It represents only those folks who have Internet access, who logged on to that particular Web site, and who were interested enough to participate in the survey (which typically means that they have strong opinions about the topic in question).

REMEMBER

The next time you're hit with the results of a study, find out the makeup of the sample of participants and ask yourself whether this sample represents the intended population. Be wary of any conclusions being made about a broader population than what was actually studied. (More in Chapter 16.)

Random

A random sample is a good thing; it gives every member of the population an equal chance of being selected, and it uses some mechanism of chance to choose them. What this really means is that people don't select themselves to participate, and no one in the population is favored over another individual in the selection process.
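
If you want to see what "some mechanism of chance" can look like in practice, here is a minimal Python sketch. The population of 1,000 ID numbers, the sample size of 25, and the seed are all made up for illustration.

```python
import random

# Hypothetical population: ID numbers for 1,000 individuals.
population = list(range(1, 1001))

# A fixed seed makes the sketch repeatable; a real study would not
# choose a seed in order to get a particular sample.
rng = random.Random(42)

# Draw 25 individuals without replacement. Every ID has the same
# chance of being picked, and nobody selects themselves.
sample = rng.sample(population, k=25)

print(sorted(sample))
```

The point of the sketch is that the computer, not the researcher and not the participants, decides who ends up in the sample.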

As an example of how the experts do it, here is the way The Gallup Organization does its random sampling process. It starts with a computerized list of all telephone exchanges in America, along with estimates of the number of residential households that have those exchanges. The computer uses a procedure called random digit dialing (RDD) to randomly create phone numbers from those exchanges, and then selects samples of telephone numbers from those. So what really happens is that the computer creates a list of all possible household phone numbers in America, and then selects a subset of numbers from that list for Gallup to call. (Note that some of these phone numbers may not yet be assigned to a household, creating some logistical issues to deal with.)
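
The general idea of random digit dialing can be sketched in a few lines of Python. This is only an illustration and not Gallup's actual procedure: the exchanges, the household estimates, the function name random_digit_dial, and the choice to pick exchanges in proportion to their estimated households are all assumptions of mine.

```python
import random

# Hypothetical exchanges (area code, prefix) with made-up household estimates.
exchanges = {
    ("608", "555"): 4200,
    ("414", "555"): 9100,
    ("715", "555"): 1300,
}

def random_digit_dial(rng, exchanges, k):
    """Generate k phone numbers: pick an exchange in proportion to its
    estimated households, then fill in the last four digits at random."""
    keys = list(exchanges)
    weights = [exchanges[key] for key in keys]
    numbers = []
    for area, prefix in rng.choices(keys, weights=weights, k=k):
        line = rng.randrange(10000)  # 0000 through 9999
        numbers.append(f"{area}-{prefix}-{line:04d}")
    return numbers

rng = random.Random(7)  # fixed seed so the sketch is repeatable
calls = random_digit_dial(rng, exchanges, k=5)
print(calls)
```

Because the last four digits are generated rather than looked up, some of the numbers produced this way may not belong to any household, which is the logistical issue mentioned above.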

Another example of random sampling involves the manufacturing sector and the concept of quality control. Most manufacturers have strict specifications for their products being produced, and errors in the process can cost them money, time, and credibility. Many companies try to head off problems before they get too large by monitoring their processes and using statistics to make decisions as to whether the process is operating correctly or needs to be stopped. For more on quality control and statistics, see Chapter 19.

Examples of non-random (in other words, bad) sampling include samples from polls for which you phone in your opinion. This is not truly a random sample because it doesn't give everyone in the population an equal opportunity to participate in the survey. (If you have to buy the newspaper or watch that TV show, and then agree to write in or call in, that gives you a big clue that the sampling process is not random.) For more on sampling and polls, see Chapter 16.

REMEMBER

Any time you look at results of a study that were based on a sample of individuals, read the fine print, and look for the term "random sample." If you see that term, dig further into the fine print to see how the sample was actually selected and use the preceding definition to verify that the sample was, in fact, selected randomly.

Bias

Bias is a word you hear all the time, and you probably know that it means something bad. But what really constitutes bias? Bias is systematic favoritism that is present in the data collection process, resulting in lopsided, misleading results.

Bias can occur in any of a number of ways.

In the way the sample is selected: For example, if you want to get an estimate of how much Christmas shopping people in your community plan to do this year, and you take your clipboard and head out to the mall on the day after Thanksgiving to ask customers about their shopping plans, you have bias in your sampling process. Your sample tends to favor those die-hard shoppers at that particular mall who were braving the massive crowds that day.
In the way data are collected: Poll questions are a major source of bias. Because researchers are often looking for a particular result, the questions they ask can often reflect that expected result. For example, the issue of a tax levy to help support local schools is something every voter faces at one time or another. A poll question asking, "Don't you think it would be a great investment in our future to support the local schools?" does have a bit of bias. On the other hand, so does the question, "Aren't you tired of paying money out of your pocket to educate other people's children besides your own?" Question wording can have a huge impact on the results. See Chapter 16 for more on designing polls and surveys.
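
To see how selection bias can distort a result, here is a minimal Python sketch of the mall example. Every number in it is made up: the community of 1,000 people, the planned spending amounts, and the assumption about who shows up at the mall are purely for illustration.

```python
import random
from statistics import mean

# Hypothetical community: planned holiday spending in dollars.
typical = [300] * 900     # 900 typical shoppers
die_hard = [1500] * 100   # 100 die-hard shoppers
community = typical + die_hard

# Biased sample: whoever is at the mall the day after Thanksgiving.
# Assumption: every die-hard shopper is there, plus only 50 typical ones.
mall_sample = die_hard + typical[:50]

# Random sample of the same size, drawn from the whole community.
rng = random.Random(1)  # fixed seed so the sketch is repeatable
random_sample = rng.sample(community, k=len(mall_sample))

print("True community mean:", mean(community))
print("Mall sample mean:   ", mean(mall_sample))
print("Random sample mean: ", mean(random_sample))
```

In this made-up community the true average is $420, but the mall sample reports $1,100, because it systematically favors the big spenders. The random sample has no such built-in tilt, so its average tends to land much closer to the truth.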

TIP

When examining polling results that are important to you or that you're particularly interested in, find out what questions were asked and exactly how the questions were worded before drawing your conclusions about the results.

Data

Data are the actual measurements that you get through your study. (Remember that "data" is plural; the singular is datum. Sentences that use the word always sound a little funny, but they are grammatically correct.) Most data fall into one of two groups: numerical data or categorical data (see Chapter 5 for additional information).

Numerical data are data that have meaning as a measurement, such as a person's height, weight, IQ, or blood pressure; the number of stocks a person owns; the number of teeth a person's dog has; or anything else that can be counted. (Statisticians also refer to numerical data as quantitative data or measurement data.)
Categorical data represent characteristics, such as a person's gender, opinion, race, or even bellybutton orientation (innie versus outie; is nothing sacred anymore?). While these characteristics can take on numerical values (such as a "1" indicating male and "2" indicating female), those numbers don't have any specific meaning. You couldn't add them together, for example. (Note that statisticians also call this qualitative data.)
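
The difference between the two kinds of data shows up as soon as you try to summarize them. Here is a minimal Python sketch; the heights, the bellybutton codes, and the coding scheme (1 for innie, 2 for outie) are made up for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical numerical data: heights in inches. Arithmetic is meaningful.
heights = [64, 70, 68, 62]
print("Mean height:", mean(heights))

# Hypothetical categorical data, coded 1 = innie, 2 = outie.
bellybutton_codes = [1, 2, 1, 1]
labels = {1: "innie", 2: "outie"}

# Counting how many fall in each category is meaningful;
# adding or averaging the codes themselves is not.
counts = Counter(labels[code] for code in bellybutton_codes)
print("Counts:", dict(counts))
```

Averaging the heights gives a sensible answer (66 inches). Averaging the codes would give 1.25, a number that doesn't describe anyone's bellybutton.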

HEADS UP

Not all data are created equal. Finding out how the data were collected can go a long way toward determining how you weigh the results and what conclusions you draw from them.

Data set

A data set is the collection of all the data taken from your sample. For example, if you measured the weights of five packages, and those weights were 12 lbs, 15 lbs, 22 lbs, 68 lbs, and 3 lbs, those five numbers (12, 15, 22, 68, 3) constitute your data set. Most data sets are quite a bit larger than this one, however.

Statistic

A statistic is a number that summarizes the data collected from a sample. People use many different statistics to summarize data. For example, data can be summarized as a percentage (60% of the households sampled from the United States own more than two cars), an average (the average price of a home in this sample is …), a median (the median salary for the 1,000 computer scientists in this sample was …), or a percentile (your baby's weight is at the 90th percentile this month, based on data collected from over 10,000 babies …).
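
Here is a minimal Python sketch showing three of these statistics computed from one small sample. The data (the number of cars owned by each of 10 households) are made up for illustration, and percentiles are left out to keep the sketch short.

```python
from statistics import mean, median

# Hypothetical sample: number of cars owned by each of 10 households.
cars = [0, 1, 1, 2, 2, 2, 3, 3, 4, 1]

# Percentage of sampled households that own more than two cars.
pct_more_than_two = 100 * sum(c > 2 for c in cars) / len(cars)

print("Percent owning more than two cars:", pct_more_than_two)
print("Mean cars per household:", mean(cars))
print("Median cars per household:", median(cars))
```

Each line of output is a single number that summarizes the same ten data values in a different way, which is all a statistic really is.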

HEADS UP

Not all statistics are correct or fair, of course. Just because someone gives you a statistic, nothing guarantees that the statistic is scientific or legitimate! You may have heard the saying, "Figures don't lie, but liars figure."

TECHNICAL STUFF

Statistics are based on sample data, not on population data. If you collect data from the entire population, this process is called a census. If you then summarize all of the census information into one number, that number is a parameter, not a statistic. Most of the time, researchers are trying to estimate the parameters using statistics. In the case of the U.S. Census Bureau, that agency wants to report the total number of people in the United States, so it conducts a census. However, due to logistical problems in doing such an arduous task (such as being able to contact homeless folks), the census numbers can only be called estimates in the end, and they're adjusted upward to account for those people that the census missed. The long form for the census is filled out by a random sample of households; the U.S. Census Bureau uses this information to draw conclusions about the entire population (without asking every person to fill out the long form).
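
The distinction between a parameter and a statistic can be sketched in Python. The population below is entirely made up (10,000 households, exactly 60% of which own more than two cars), and in real life you usually can't compute the parameter at all, which is the whole reason for sampling.

```python
import random

# Hypothetical population of 10,000 households; 1 = owns more than two cars.
population = [1] * 6000 + [0] * 4000

# Parameter: computed from every member of the population (a census).
parameter = sum(population) / len(population)

# Statistic: computed from a random sample, used to estimate the parameter.
rng = random.Random(3)  # fixed seed so the sketch is repeatable
sample = rng.sample(population, k=1000)
statistic = sum(sample) / len(sample)

print("Parameter (population proportion):", parameter)
print("Statistic (sample proportion):   ", statistic)
```

The statistic will usually land close to the parameter without matching it exactly, and a different random sample would give a slightly different statistic.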

Mean (average)

The mean, also referred to by statisticians as the average, is the most common statistic used to measure the center, or middle, of a numerical data set. The mean is the sum of all the numbers divided by the total number of numbers. See Chapter 5 for more on the mean.
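
Written as a formula, and worked through for the five package weights from the data set section, the definition looks like this. The symbols are my own notation rather than anything from this chapter: n is the number of values, x_i is the i-th value, and x-bar is the mean.

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
        = \frac{12 + 15 + 22 + 68 + 3}{5}
        = \frac{120}{5}
        = 24
\]
```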

HEADS UP

The mean may not be a fair representation of the data, because the average is easily influenced by outliers (very large or very small values in the data set that are not typical).
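
A minimal Python sketch makes the effect visible, using the five package weights from the data set section. Treating the 68 lb package as the outlier is my own assumption for illustration, and the median is shown only as a point of comparison.

```python
from statistics import mean, median

weights = [12, 15, 22, 68, 3]                 # package weights in lbs
without_68 = [w for w in weights if w != 68]  # drop the unusually heavy one

print("With 68:    mean", mean(weights), "median", median(weights))
print("Without 68: mean", mean(without_68), "median", median(without_68))
```

One heavy package pulls the mean from 13 lbs up to 24 lbs, while the median moves only from 13.5 to 15.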
