This document contains detailed instructions of how to use SAS to do
the kinds of analyses documented in the text Applied Statistics
for Engineers and Scientists. Each of the first fifteen sections is
keyed to one chapter of the text, and references material in that
chapter. For example, Section 1 begins by describing where to find the
electric usage data from Chapter 1 of the book and how to use those
data and SAS to generate Figures~1.1 and 1.3 in the text.
If you have not used SAS at WPI before, you should visit the
Unix SAS Tutorial, which will
tell you how to set up your Unix account to run SAS at WPI.
Many of the analyses described in the text use a graphically-oriented
component of SAS called SAS/INSIGHT.
(SAS and SAS/INSIGHT are registered trademarks of
SAS Institute, Inc.) An introduction to SAS/INSIGHT is found in
An Introduction to SAS/INSIGHT.
We have written a collection of SAS macros to facilitate the use of
SAS in statistical applications and in labs. Instructions on how to
access these macros are found in
An Introduction to SAS/EIS.
The uses of these macros are
described where appropriate in this document.
Throughout this document we will give only one-part names for all
SAS data sets. This is the same as assuming the data sets are in your
work, or temporary storage, area. At WPI, all data sets
are in a SAS data library named SASDATA. To access the data sets from
SAS/INSIGHT, go to the SASDATA library in the open data sets dialog
window. If inputing the data set name to a macro, prefix the name with
SASDATA. For example to use the ELECE data set in a macro, input the
name SASDATA.ELECE.
Data Sets
Important data sets for Chapter 1 are:
- ELECE : Electric usage data from Professor P.'s household.
- WASHER5: Washer thickness data, Example 1.2.
- WASHER7: Washer thickness data, Example 1.3.
- EG1_5: Bearing play data, Example 1.5.
- EG1_5A: Bearing play data, Example 1.5.
Stationarity of Processes
Figures 1.1 and 1.3 of the Text
Figures 1.1 and 1.3 were produced with
SAS/INSIGHT.
Figure 1.1 was created by choosing Analyze:Histogram/Bar Chart ( Y
) and then selecting DKWH from the
resulting dialog window.
Figure 1.3 was produced by first selecting
Analyze:Scatterplot( Y X ) and then choosing DKWH as the Y variable
and DATE as the X variable in the dialog window. This produced the
scatterplot. Producing the corresponding histogram was a little
trickier. First we created a rectangle to the right of the scatterplot
by clicking there with the left mouse button and dragging. It doesn't
matter how large the rectangle is. We next put a vertical bar chart
there by choosing Analyze:Histogram/Bar Chart( Y ) and then
selecting KWH from the resulting dialog window. To make the bar chart
horizontal (this is the neat part), we clicked on the upper left
corner and dragged that corner down past the lower right. (It's the
click-and-drag version of turning a sleeve inside out.) We then moved
the rectangle next to the scatterplot and resized it as desired. To
align the KWH axes on both plots, we chose Edit:Windows:Align.
Figures 1.4, 1.5 and 1.6 of the Text
Figure 1.4 was produced by
the macro TSPLOT, and Figures 1.5 and
1.6 were produced by the macro TSMAPRED. Input
to TSMAPRED will include (in order)
- The name of the SAS data set containing the data
(ELECE).
- The name of a SAS data set to contain the output
(original data plus the smoothed series and residuals). Any name of
eight characters or less will do.
- The name of the variable to be smoothed (here it is
DKWH).
- The name of the variable to contain the smoothed
series. This will be put in the output data set; use whatever name of
eight characters or less you like.
- The name of the variable to contain the residuals (which
are the original data minus the smoothed values). Again, use whatever
name of eight characters or less you like.
- The name of the time variable (DATE).
- The number of terms in the moving average. We used 7 and
28 for the two plots.
The equivalent of Figure 1.4 can be produced in
SAS/INSIGHT by choosing Analyze:Line Plot~(~Y~X~) and selecting
KWH as the Y variable and DATE as the X variable.
Causes of Variation
All the plots in this section were created using SAS/INSIGHT .
Example 1.2: A Stationary Process that Wasn't
The data
for Figure 1.8 of the text are in the SAS data set WASHER5. To create
Figure~1.8 select Scatterplot ( Y X ). In the resulting dialog
box, choose THICK as the Y variable, ORDER as the X variable and
MACHINE as the Group variable. The result is the three plots you see,
but aligned horizontally rather than vertically. In addition, the
vertical axes of the plots differ. To get the vertical axes to line
up, on the graph window select Edit:Windows:Align. Use clicking
(on the bounding box of the plots) and dragging to place the graphs in
the vertical configuration shown. Figure 1.9 was done in exactly the
same way using the data in WASHER7.
Identifying Possible Causes of Variation
Creating a New Ishikawa Diagram
While you can draw effective Ishikawa diagrams by hand,
presentation-quality diagrams are easily drawn using SAS as follows:
- \item[1.] From the menu bar on the PROGRAM EDITOR, LOG or
OUTPUT windows, choose Globals:Analyze:Quality Improvement.
- From the Statistical Quality Control Menu, select
``ISHIKAWA''.
- From the resulting pop up window select ``Create a New
Ishikawa Diagram''.
- A graphics window containing a template for an
Ishikawa diagram will appear. By entering text from the keyboard and
using the mouse to position arrows, it is easy to create your own
diagram. For example, to begin creating the diagram for High Strength
Mesh, click on the graphics window to make it active. Then just type
``High Strength Mesh'' on the keyboard and hit return twice. The words
``High Strength Mesh'' will now appear in the box describing the main
arrow. Now enter ``Methods'' in the same way, and the box for the
upper left arrow is set. The other boxes and arrows will now
disappear and you are free to create the diagram as you want. To
learn how to do this, click on ``Help'' from the menu bar of the
graphics window, and then on ``Extended help'' from the pop up window.
- Read the first help screen, then click on the
highlighted word ``Introduction''. To find out how to perform a given
task, click on the highlighted word (e.g. adding, moving, etc.)
describing that task. If you see anything about using PROC ISHIKAWA,
ignore it; SAS has already called that SAS procedure for you-so you
don't have to.
- This all sounds more complicated than it is. In
fact, in a few minutes, you should be adept at drawing the diagram.
As always, if you have questions, just ask.
Saving a Diagram
You may want to save a diagram for later use. To do this, click on
``File'' on the action bar at the top left of the Ishikawa window,
then on ``Save as'' and ``File''. A ``File Requestor'' window (for
selecting where to save the diagram) will appear. You must first
select a library in which to save the diagram. If you want to save it
temporarily (it will disappear after you exit SAS), select the library
``WORK''. If you want it to be there for future SAS sessions, select
the library ``SASUSER''. Next select a name for the data set (your
choice, 8 or fewer characters), and click on ``OK''.
Retrieving a Diagram
To retrieve a saved Ishikawa diagram from the Ishikawa window, click
on ``File'' on the action bar at the top left of the Ishikawa window,
then on ``Open''. A ``File Requestor'' window will appear. This window
is identical to the one you used to save the diagram. When you select
the file, a new Ishikawa window with the selected diagram will appear.
To retrieve a saved Ishikawa diagram from the Statistical Quality
Control window in SAS, click on the ISHIKAWA icon, then from the
resulting window select ``Edit an Existing Ishikawa Diagram''. A
``File Requestor'' window will appear. Choose the saved diagram you
desire, and a window will appear with the saved diagram in it. Some of
the finer detail may be missing, however. To restore it, click
anywhere on the diagram with the right mouse button and select ``>
Detail''.
Printing a Diagram
You may want to print your Ishikawa diagram. To do this, you must
first save the diagram to a graphics catalog. To do this, click on
``File'' on the action bar at the top left of the Ishikawa window,
then on ``Save as'' and ``Graph''. An ``Output Manager'' window (for
selecting where to save the diagram) will appear. This window will
have a default library and file name already showing (probably
WORK.GSEG.ISHIKAWA). If you want to go with this name (and we'll
assume here that you do), click on the ``OK'' button. The Ishikawa
diagram will appear in a regular graphics window. It can be printed
from there in the usual way of printing all graphics output.
Data Sets
- TECHSAL: Salaries of technical support workers, Example 2.4.
- PLANETS: Planet diameters.
- BEARINGS: Weights and diameters of 100 ball bearings.
- ULTRACAL: Ultrasound calibration data, Example 2.6.
Histograms and Bar Charts
Frequency histograms and bar charts are obtained in SAS/INSIGHT using
the command Analyze:Histogram/Bar Chart ( Y ).
Boxplots
You can easily generate boxplots in SAS/INSIGHT by choosing
Analyze:Box Plot/Mosaic Plot ( Y ) . For example, the
side-by-side boxplots shown in Figure 2.13 of the text compare the
salaries of men and women in the TECHSAL data set. They were produced
by selecting SALARY as the Y variable and GENDER as the X variable.
You can add information to the boxplots. Choosing
:means will add a diamond-shaped figure with the mean indicated
by a horizontal line and a span of +- two standard
deviations. Choosing
:Serifs will add
serifs: little cross lines at the ends of the whiskers. Choosing
:Values will put the values of the medians,
quartiles and ends of whiskers on the graph. If the mean diamonds are
chosen, the values of the means will also be displayed. Try these
features yourself with the TECHSAL data.
Numerical Summaries
The command Analyze:Distribution ( Y ) will produce numerical
summaries such as the mean, median and standard deviation. It will also
produce two plots: a boxplot and a density histogram. Density
histograms are like frequency histograms, except that the height of each
bar equals the density, rather than the frequency, of data in that bar's
subinterval. The density in a subinterval is the frequency in the
subinterval divided by the product of the number of observations in the
data set and the subinterval width. You will learn more about density
histograms in Chapter 4.
Resistant Measures
SAS/INSIGHT allows you to select from among a resistant estimator of
the standard deviation (Gini's mean difference), and the two resistant
estimators of location discussed above: the trimmed mean and the
Winsorized mean. For the latter two you can choose the number of
observations or the percentage of observations to be trimmed or
Winsorized at each end. To compute these estimators, you must first
generate the distribution window by choosing Analyze:Distribution
( Y ). From the menu bar on this window, click on Tables, and
then select the resistant estimator of your choice.
Doing Lab 2-1 with SAS
The instructions below are numbered to correspond to the step numbers
in the Experimental Procedure section of Lab 2-1. There are two
versions of the instructions: the first for SAS/INSIGHT users and the
second for for input of instructions from the command line.
SAS/INSIGHT
- 1
- Access SAS/INSIGHT. The data are found in
CRIME.
- 2,3
- Select Analyze:Distribution ( Y ) from the
data window. From the dialog box select AUTO as the Y variable and
STATEN (the state name) as the label. A Distribution Window will
appear with a density histogram, a boxplot, and tables of summary
measures for AUTO.
- 2
- Compute the k-times trimmed mean for k=3 by choosing
Tables:Trimmed Mean:(1/2)N:3.
- 4
- To identify the outlier on the boxplot, click on it.
It will become highlighted, and its name will appear.
- 5
- To change the Massachusetts auto theft rate of
1140.1 to a value of, say, 2140.1, click on the data window cell
containing the value 1140.1, type 2140.1, and hit ``Enter'' (or
``Return'') on the keyboard. When you do this the plots and summary
measures in the Distribution Window will be updated to reflect the
change in the data.
- 6
- To remove the Massachusetts data value, select
Massachusetts then choose Edit:Observations:Exclude in
Calculations from the Distribution Window. The summary measures will
be updated to reflect the change, and the plots will be modified to
show the observation is not included in the calculation (The square
denoting the value in the boxplot will change to an x, for example.)
To also remove the observation from the plots, choose
Edit:Observations: Hide in Graphs.
An alternative to all this is to select Massachusetts and then choose
Edit:Delete. This removes Massachusetts from the copy of the
data set you are working on (Don't worry, it won't remove it from the
original data set; you have to save the modified copy to the same data
set name to do that.)
From the SAS Command Line
The commands
proc univariate data=crime plot;
var auto;
run;
will get all the output you need, except for the trimmed mean, which
is unavailable from the SAS command line. The histogram produced will
be a stem-and-leaf plot, in which data values serve as the histogram
bars.
Data Set
- WATCHES: Watch assembly times, Example 3.9.
Randomly Assigning Treatments to Experimental Units
From SAS/INSIGHT
To see how to use SAS/INSIGHT to randomly assign treatments to
experimental units, consider again the example of watch assemblers and
assembly methods from Example 3.9.
- Begin with a data set consisting of the numbers
1 through 15 (one for each assembler).
- Next generate a column of 15 numbers randomly selected
from the interval (0,1) by choosing Edit: Variables: Other...
and then choosing ranuni(a) from the resulting dialog box.
- Now sort the column of random numbers by choosing
:Sort. When the random numbers are sorted, the
assembler numbers become randomly ordered.
- Assign the first 5 of the assembler numbers to assembly
method 1, the next 5 to assembly method 2 and the last 5 to assembly
method 3.
From the SAS Command Line
The following commands will produce two columns of numbers in the
output window:
data assign;
do assemblr=1 to 15;
rannum=ranuni(-1);
output;
end;
run;
proc sort data=assign out=assign;
by rannum;
proc print;
var assemblr;
run;
Assign the first 5 of the assemblr numbers to assembly
method 1, the next 5 to assembly method 2 and the last 5 to assembly
method 3.
Doing Lab 3-2 with SAS
In SAS/INSIGHT, you can label the observations in the scatterplot of
PRESS versus STUDENT by selecting HAND as the label variable in the
SAS Scatterplot ( Y X ) dialog window. Then, clicking on each point on
the resulting plot will label the point. You can also label the points
for right and left with different colors or symbols. To do this,
select Edit: Windows: Tools. The SAS:Tools window will appear.
To give the two hands different colors, click on the long color button at
the bottom of the color pallette. A ``SAS: Color Observations'' window
will appear. Click on HAND, and then on OK. To get different plotting
symbols for the two hands, do the same steps, beginning with a click
on the long button with all the symbols on it.
Data Sets
TECHSAL: Salaries of technical support workers, Example 4.1.
GASKET: Gasket thicknesses, Example 4.23.
Computing Probabilities
You can use the macro NPROBS to compute the probability
P(a < Y <= b) where the random variable Y has any of the following
distributions studied in this section: binomial, Poisson, normal
or Weibull. A data entry window prompts you for the name of the
distribution and its parameters. You are also prompted for the values
of a and b.
To obtain P(Y <= b for the normal distribution, select a=-9999999.
To obtain P(Y <= b) for the binomial, Poisson or Weibull
distributions, select a=-1 (or any other negative value). To obtain
P(Y > a) for the b(n,p) distribution, select b=n. To obtain
P(Y > a) for the Poisson, normal
or Weibull distributions, select b=99999999.
Fitting a Normal Distribution
Here is a sequence of steps a data analyst might use in analyzing the
gasket data in Example 4.23.
- First, produce a line plot of thickness versus
production order (which, if done at the outset, would have saved the
quality personnel in Example 4.23 a great deal of trouble). To do so,
enter SAS/INSIGHT and choose Analyze:Line Plot ( Y X ). Select
THICK as the Y variable and ORDER as the X variable. The plot should
reveal the outliers as the first two values.
- To look at the distribution of thickness, choose
Analyze:Distribution ( Y ) and select THICK as the variable to be
analyzed. Look at the distribution window that appears. What are the
salient features of the data as displayed on the boxplot and
histogram? How well do you think a normal curve will fit the data?
- Fit the normal curve
N(
,S^2) to the data
by choosing Curves:Parametric Density from the distribution
window. In the resulting dialog box, make sure Normal is selected as
Distribution: and Sample Estimates/MLE is selected as Method: before
clicking on the ``OK'' button.
- To produce a normal quantile plot, choose
Graphs:Q-Q Plot. In the resulting dialog box, make sure ``Normal'' is
selected as the distribution. The normal quantile plot will appear
with the values of the data quantiles on the vertical axis and those
of the normal distribution quantiles on the horizontal axis: just the
opposite of the graphs in the text. To assess normality, it really
doesn't matter which quantities are plotted on which axes. However, in
the text we have plotted the data quantiles on the horizontal axis in
order to match up these values with the boxplot and histogram. To
reverse the axes, move the cursor to the upper left corner of the box
surrounding the normal quantile plot, press down on the left mouse
button, and drag the left corner diagonally down through and past the
lower right corner of the box. (Once again, the click-and-drag
version of turning a sleeve inside out.) You can then resize and move
the box as you want. By moving some other graphics boxes, you can line
up the normal quantile plot below the histogram. Choosing
Edit:Windows:Align will align the values of the horizontal variable
(THICK, for the gasket data) in all three plots.
To add a reference line to the normal quantile plot, choose
Curves:QQ Ref Line.
- Look at the normal curve fit to the histogram and at the
normal quantile plot. How do they look? Those two outliers are really
causing problems, aren't they? Remove the most extreme one as
follows. Select the extreme outlier by clicking on it in the boxplot.
Choose Edit:Observations:Exclude in Calculations. Notice that
the normal curve and normal quantile plot are recalculated without the
extreme outlier.
Do you like this fit any better? Perhaps you should remove the other
outlier now. Proceed as in the last paragraph. With the two outliers
removed, the normal density fits the histogram well, and the normal
quantile plot is nearly a straight line. (Note: to include the
outliers in the calculations again, make sure they are selected and
choose Edit:Observations:Include in Calculations).
Transformations
A selection of transformations is available in SAS/INSIGHT
by choosing Edit:Variables.
Doing Lab 4-1 with SAS
To do lab 4-1, merely run the macro LAB4_1. Both the required density
histogram and the plot of the cumulative proportion of values Y=1
versus trial will be automatically produced.
Doing Lab 4-2 with SAS
The macro LAB4_2 will produce the necessary histogram. You will be
prompted for your values of N and n: choose n=5. Output from
the macro LAB4_2 consists of a density histogram just like you
produced for the 10 trials you conducted by hand, only for
10,000 trials. The relative frequency of each of 0---5 successes for
the 10000 measurements will appear at the top of the corresponding
bar.
Doing Lab 4-3 with SAS
First a word about the macros you will use in the simulations.
When running the macro, don't worry if graphs pop up on the screen and
disappear. They will reappear on a one-page template containing all
four graphs that you called for. CAUTION: If you wish to
print the template you must do it \underline BEFORE moving on to
the next macro. Submitting a new macro will overwrite the previous
template and you'll have to run the first macro again.
The Central Limit Theorem for Rolls of a Die
- You are going to call the macro MAKEDATA. This
macro will generate random data from the discrete uniform distribution
having an equal probability of producing any of the integers 1,2,3,4,5
or 6, just like a fair die. The data will be put in the data set
ROLLS. The macro will simulate the trial of rolling a fair die 50
times. It replicates this trial 250 times, producing a total of 250
times 50 or 12,500 simulated die rolls.
MAKEDATA simulates the 50 rolls of the fair die, putting the result of
the i-th roll in variable Ci. Thus each row in C1- C50 represents
one replication of the trial. There are 250 such rows corresponding
to the 250 replications of the trial. The macro also computes the
means of the first 2, 10, 30 and 50 rolls from each trial, calling
them MEAN2, MEAN10, MEAN30 and MEAN50, respectively.
Run the macro now. A window will pop up informing you when the data
set has been created. Click on the window as directed and hit
return. The window will go away. Note that if the window fails to
appear, something has gone wrong and you should ask for help.
Use SAS or SAS/INSIGHT to look at ROLLS, which is a data set
containing a portion of the data. Specifically, ROLLS has in each row
the first 5 of the 50 original observations (die rolls) under
variable names C1-C5. The mean of C1 and C2 is in MEAN2, the mean of
C1-C10 is in MEAN10, the mean of C1-C30 is in MEAN30, and the mean of
C1-C50 is in MEAN50.
- You may use SAS or SAS/INSIGHT make a frequency and
a density histogram of C1. Recall that in SAS/INSIGHT,
Analyze:Histogram/Bar Chart ( Y ) will produce a frequency histogram,
and Analyze:Distribution ( Y ) will produce a density histogram.
To obtain density histograms of C1, MEAN2, MEAN10, and MEAN50 all
plotted on the same scale, simply use the macro HISTREP. When you
call HISTREP an input window will appear. Click on the green cursor
and enter 'rolls' (without the quotes) as the data set name. Hit
return and enter 'u', then successively the names C1, MEAN2 MEAN10 and
MEAN50. Density histograms of these variables will appear in the SAS
GRAPH window. You should print these now.
- Make normal quantile plots of C1, MEAN2, MEAN10,
and MEAN50. To do this, call the macro `NORMREP' and proceed as you
did for HISTREP. Print these graphs now.
- \item[4.] In SAS/INSIGHT, Analyze:Distribution ( Y )
will produce the means and standard deviations of C1, MEAN2, MEAN10,
and MEAN50.
- The macro SMEAN will compute the standardized means
of C1, MEAN2, MEAN10, and MEAN50. These standardized means will be
found in the data set ROLLS under the variable names SC1, SMEAN2,
SMEAN10, and SMEAN50. Use SAS/INSIGHT to check that the means of each
of these variables are nearly 0 and the standard deviations are nearly
1.
Use the macro HISTREP to generate density histograms and
NORMREP to generate normal plots of the standardized means (don't
forget to enter an 's' to to denote the fact that the data are
standardized).
An Example Where the Central Limit Theorem Fails
The macro MAKECAU will
generate 250 data sets each of 50 observations from a Cauchy
distribution model. The data will be placed in CAU. C1 again
denotes the first column of data, and MEAN2, MEAN10 and MEAN50 have
the same meaning here as they did in ROLLS. Now do steps 2. and 3. on
these data; don't forget to enter a 'c' to denote the fact that the
data are Cauchy.
Data Sets
- SOL: One hundred measurements of the speed of light.
- BEARINGS: Weights and diameters of 100 ball bearings.
Estimation Using SAS/INSIGHT
Before any inference procedure for measurement data, you should
investigate the data for outliers and non-normality. SAS/INSIGHT
is the easiest way to do this.
SAS/INSIGHT will compute one sample t confidence intervals
(equation (5.8)). To do this, first do a distribution
analysis of the variable in question. From the distribution analysis
window choose Tables: C.I. for Mean and then select the desired
confidence level.
SAS Macros
Estimation
- The macro TINT will compute one sample t
confidence intervals (equation (5.8)). It will also compute
the prediction interval given by equation (5.11).
- The macro TWINT will compute two sample
approximate t (equation (5.24)) confidence intervals for
the difference of means in two independent C+E models.
- The macro BIEXACT will compute exact one
sample confidence intervals for population proportions.
- The macro BINORM will compute one and two sample,
large sample confidence intervals for population proportions.
Utilities
- The macro INVPROBS will compute quantiles for the
normal and t distributions.
- The macro NPROBS will compute probabilities
P(a < Y <= b) for the normal and t distributions.
\end{itemize}
Doing Lab 5-1 with SAS
- 1
- The macro LAB5_1A will generate as many sets of
data from the C+E model as you tell it to. A window will ask you for
the number of data sets, the name of the SAS data file where you want
the data sets written, the number of observations per data set, and
the values of
and
that define the C+E model.
- 3
- Use the macro LAB5_1B to generate the 500 data sets
each of size 20 from the same C+E model that you chose above, and then
to draw histograms of the parameter estimates.
Doing Lab 5-2 with SAS
- 1
- The macro LAB5_2 will create 100 samples from the
C+E model and calculate level L confidence intervals for
. A
window will prompt you to input the number of data sets (choose 100),
the number of observations in data set (choose 20), the values of the
parameters
and
(choose what you like) and the
confidence level L (choose .95). The window will also prompt you for
``Contamination level'' (choose 0).
The input window will display the mean width of the 100 intervals. A
graph will display the true value of
and the computed confidence
intervals. The intervals that contain the true parameter value are
displayed in green and the intervals that do not contain the true
parameter value are displayed in red.
- 3
- Run the macro using the same parameters as
previously, but first with a 0.1 proportion of contamination and then
with a 0.5 proportion of contamination.
One Sample Tests for the Mean in the C+E Model
A two-sided test can be obtained from SAS/INSIGHT. Choose
Analyze: Distribution ( Y ) : Tables: Location Tests.
From the resulting pop-up window, choose Student's T Test
and for Parameter
input the value of
_0. Output consists of the value of the
test statistic and the two-sided p-value. From this information,
the p-value for either one-sided test can be computed.
As an example, the t* for the one sample test of
H_0:
= 275,
H_a:
> < 275. (where > < stands for "not equal".)
for the artificial pancreas data (see Section 6.3)
is given in SAS/INSIGHT as -2.79 with p-value 0.068. Since
t* < 0, we know that the area under the t_3 curve below t*
is 0.068/2=0.034. This is the p-value for testing the one-sided
alternative H_a:
< 275. The p-value for testing the
opposite one-sided alternative, H_a:
> 275, is the area above
t*, which is 1-0.034=0.966.
Comparing Two Means
The macro TWTEST will perform both the pooled and approximate
one and two-sided t tests. It accepts as input either (1) data for
the two samples as separate columns in a SAS data set, or (2) summary
data consisting of the sample mean and standard deviation for each
sample.
Tests for Proportions
The test statistics are easy enough to compute using pencil and paper.
The macro NPROBS will compute the appropriate tail areas for the
binomial (exact test) or normal (large sample approximation)
distributions.
Doing Lab 6-1 with SAS
The instructions below are keyed to the instructions in the text.
The Meaning of Statistical Significance and p-values
- 1
- Use the macro LAB6_1 to generate 1 set of 10
observations from a N(25,1) distribution. The macro will also
compute the t statistic and the p-value for testing H_0:
=25
versus H_a:
> < 25 for this data set.
- 3
- Now use the same macro to generate 1000 sets of 10
observations each from a N(25,1) distribution, and to compute the
t statistic and p-values for each. The 1000 sets of t statistics
and p-values will be saved in the SAS data set TEST25.
- 4
- You can use SAS/INSIGHT with the data set TEST25 to
obtain the proportion of the 1000 test statistics that provide as much
evidence against the null and in favor of the alternative hypothesis
as does t*, and the proportion of p-values as small as or
smaller than the p-value associated with t*.
How Nonnormality Affects the Results
- 1
- Generate a histogram of 1000 observations from the
exponential distribution using the macro LAB6_1.
- 3
- Use the macro LAB6_1 to generate 1 set of 10
observations from
an exponential distribution with mean
=25.
The macro will also compute the t statistic and the p-value for
testing
H_0:
=25 versus H_a:
> < 25.
- 5
- Now use the macro to generate 1000 sets of 10
observations each from an exponential distribution with mean 25, and
to compute the t statistic and p-values for each. The 1000 sets of
t statistics and p-values are saved in the SAS data set EXPO25.
These 1000 data sets represent 1000 experiments identical to the
original one.
- 6
- Use SAS/INSIGHT to obtain the proportion of the 1000
test statistics that provide as much evidence against the null and in
favor of the alternative hypothesis as does t*. Obtain the
proportion of p-values as small as or smaller than the p-value
associated with t*.
Data Sets
- TWEAR: Tool wear data.
- TWEAR8: Tool wear data for VELOCITY=800.
- FUEL: Fuel consumption versus equivalence ratio.
- DRAFTLOT: 1970 draft lottery data.
- TRAPDATA: Bacterial trap data.
- DONNER: Donner party data.
- DERBY: Kentucky Derby data.
The Median Trace
The macro MTRACE will compute a median trace. An input window will
appear; click on the cursor location. To do a median trace for the
draft lottery data, the data set, Y variable, X variable and
number of slices you should enter are DRAFTLOT, NUMBER, BDATE and 12
respectively. Next another input window window will appear asking for
the upper boundary of the first slice. Tell it 31 for the 31 days in
January (don't forget to click on the cursor first). The red window
will reappear asking each time for the upper boundary of the next
slice. Give it (let's see, thirty days hath September...) the values
60, 91, 121, 152, 182, 213, 244, 274, 305, 335 and 366 successively.
You can experiment if you like with different boundaries for the
slices and different numbers of slices.
The Tool Wear Data
To generate Figure 7.1,
choose Analyze:Scatter Plot (Y X). From the resulting dialog
window, select
WEAR as the Y and TIME as the X variable.
A scatterplot window will appear. Enlarge and renew this window for
better viewing.
To generate Figure 7.5, use the markers in SAS/INSIGHT
(just as you did in Chapter 1) to give a different plot symbol to each
value of VELOCITY on the WEAR versus TIME scatterplot. For viewing at
the computer you may prefer to use the palettes to give different
colors instead of different plotting symbols. Or you can do both.
You can obtain the scatterplot in Figure 7.6 from the
data set TWEAR8.
Correlation
It's easy to standardize variables in SAS/INSIGHT. To do it, from the
data window choose Edit:Variables:Other.... From the resulting
dialog window choose the transformation ``(Y-mean(Y))/std(Y)'' and
whichever variable you want transformed. Try this now for the two
variables WEAR and TIME in the data set with VELOCITY=800. Plot the
standardized variables against each other.
To find the correlation of the tool wear data for VELOCITY=800, access
TWEAR8 and choose Analyze:Multivariate ( Y's ). From the
resulting dialog window
select TIME and WEAR and ORDER as the Y variables.
A window will appear containing a number of descriptive
statistics. The Correlation Matrix in that window contains
Pearson correlations for all pairs of variables. On the diagonal are
the correlations of each variable with itself (What are these? Does
this surprise you?). The off-diagonals are the correlations between
pairs of different variables. Which other variable is most correlated
with WEAR? The correlation matrix is symmetric (i.e. the entries
below the upper left to lower right diagonal are mirror image of those
above the diagonal). Why do you think this is?
Regression
Least Squares Fit
It is very easy to compute the least squares estimators using
SAS/INSIGHT: just choose Analyze:Fit ( Y X ), and select the X
and Y variable from the dialog window.
When you choose Analyze:Fit ( Y X ), SAS/INSIGHT automatically
computes the fitted values and residuals and places them in the data
set under the names P_Y and R_Y, respectively, where Y is the name
of the Y variable. So, for example in the regression of WEAR on TIME,
the fitted values are called P_WEAR and the residuals are called
R_WEAR. A plot of residuals versus fitted values is also produced
automatically. You can now plot the residuals versus any variables of
interest.
Studentized Residuals and Normal Quantile Plots
Generate Studentized residuals by choosing Vars: Studentized
Residual. The Studentized residuals will be placed in a variable
named with the prefix RT_ followed by something resembling the name
of the response variable in the regression.
It is a good idea to look at the Studentized residuals. Choosing
Analyze: Distribution ( Y ) will do a distribution
analysis of the Studentized residuals
The SAS macro TQPLOT will produce a plot of Studentized residuals
versus t quantiles. It will also write the original
data, the Studentized residuals and the t quantiles to a data set
of your choice.
Confidence and Prediction Bands and Intervals
The confidence and prediction bands in Figure 7.21 were
generated by choosing Curves: Confidence Curves: Mean: and
Curves: Confidence Curves: Prediction:, respectively. You are allowed
to choose the confidence level of the bands.
The SAS macro REGPRED computes level .95 confidence intervals for the
mean of the response and level .95 prediction intervals for a new
observation at each data value in the input data set and at additional
user-specified predictor values. The predicted values are stored
under the name PRED. The endpoints of the confidence intervals for the
mean are stored under names L95MPRED and U95MPRED and those for
prediction intervals for a future observation are stored under the
names L95PRED and U95PRED in the SAS data set REGPRED. Standard SAS
regression output is written to the SAS/OUTPUT window.
Categorical Data
In SAS/INSIGHT you can analyze data for a single categorical variable
using bar charts. You can obtain information on the relation between
two categorical variables using mosaic plots. For example,
Figure 7.23 was produced by choosing
Box Plot/Mosaic Plot ( Y ) and then selecting GENDER as the Y
variable and FATE as the X variable. The frequencies and percentages
were added by choosing
:Values.
The SAS macro CAT2WAY will create two-way tables. Since it was
designed with additional sophisticated analyses in
mind, the input to and output from CAT2WAY contains some terms you
will not be familiar with. Still, it is very easy to use, as the
following example, based on the Donner data, shows.
The following will produce one and two-way frequency tables for FATE
and GENDER for the Donner data:
\begin{enumerate}
\item Invoke the macro CAT2WAY.
\item Enter the names of the data set (DONNER), row
variable (FATE) and column variable (GENDER) where indicated.
\item You are next asked if there is a count variable. For the
Donner data, there is not, so answer 'N'. Were the data set to
have a variable giving cell counts, you would answer 'Y', and
then be prompted to give the name of the count variable.
\item You are next asked if you want to conduct Fisher's exact
test. As you don't know what this is, just answer 'n'.
\item When the computations are finished, you will be prompted
to hit return to exit the macro. The table will be output to
the SAS Output Window. Each cell of the table will contain the
cell count or frequency, overall percent, row percent, column
percent, expected frequency and the cell chi-square. The cell
chi-square is just the square of the Pearson residual. A number
of test statistics are also output, including Pearson's
chi-square, which will appear thus:
Statistic DF Value Prob
------------------------------------------------------
Chi-Square 1 4.811 0.028
In this output, the quantities shown are the degrees of freedom, the
observed value of the chi-square test statistic, 4.811, and the
p-value, 0.028.
Computing p-Values for the Chi-Square Distribution
The p-value for a chi square test is easily computed using the SAS
macro NPROBS, remembering that a chi-square distribution with m
degrees of freedom is a gamma distribution with parameters ALPHA=m/2
and BETA=2.
For Example 7.11 about the categories of defective computers, we have
an observed value 13.36 of the test statistic, and we want to
compute its p-value using a chi-square distribution with 4 degrees of
freedom as the
reference. To do this, invoke the macro NPROBS and select the gamma
distribution. Enter 2 (=4/2) for ALPHA and 2 for BETA. Enter 13.36 for
A and some very large number (we used 10000) for B.
Proc FREQ
Proc FREQ can conduct Pearson's chi-square test, and other associated
quantities. To illustrate its use, we consider data relating
consumption of ascorbic acid (vitamin C) to the incidence of colds in
a group of French skiers. In a controlled experiment, 279 French
skiers were divided into a treatment and a control group. The
treatment group received ascorbic acid and the control group a
placebo. Whether or not the skier had a cold during the trial period
was recorded.
To enter the data, submit the following program from the SAS PROGRAM
EDITOR window:
title 'Analysis of data on French skiers';
options linesize=70;
data skiers;
input treat \ cond \ count @@;
cards;
plac cold 31 plac ncold 109
asco cold 17 asco ncold 122
;
run;
The data are now in the SAS data set SKIERS.
The following commands, submitted from the SAS PROGRAM
EDITOR window, will, among other things,
- Create a table with
- Cell counts
- Overall, row and column percentages
- Cell chi-squares (the square of the Pearson
residuals) (cellchi2)
- Calculate the Pearson chi-square statistic and its
p-value (chisq)
proc freq data=skiers order=data;
weight count;
tables treat*cond / chisq cellchi2;
run;
The output is the following:
Analysis of data on French skiers
TABLE OF TREAT BY COND
TREAT COND
Frequency |
Cell Chi-Square|
Percent |
Row Pct |
Col Pct |cold |ncold | Total
---------------+--------+--------+
asco | 17 | 122 | 139
| 1.999 | 0.4154 |
| 6.09 | 43.73 | 49.82
| 12.23 | 87.77 |
| 35.42 | 52.81 |
---------------+--------+--------+
plac | 31 | 109 | 140
| 1.9847 | 0.4124 |
| 11.11 | 39.07 | 50.18
| 22.14 | 77.86 |
| 64.58 | 47.19 |
---------------+--------+--------+
Total 48 231 279
17.20 82.80 100.00
Analysis of data on French skiers
STATISTICS FOR TABLE OF TREAT BY COND
Statistic DF Value Prob
------------------------------------------------------
Chi-Square 1 4.811 0.028
Likelihood Ratio Chi-Square 1 4.872 0.027
Continuity Adj. Chi-Square 1 4.141 0.042
Mantel-Haenszel Chi-Square 1 4.794 0.029
Fisher's Exact Test (Left) 0.021
(Right) 0.991
(2-Tail) 0.038
Phi Coefficient -0.131
Contingency Coefficient 0.130
Cramer's V -0.131
Sample Size = 279
Doing Lab 7-1 with SAS
The instructions below are keyed to the instructions in the text.
- Access the macro LAB7_1. This macro will generate a
data bivariate set, and will display a plot
of the response versus regressor variable in the SAS graph window.
- A window called 'GUESS' will appear
asking you to give an intercept and slope for a line
you think best fits the data. Take a few minutes to formulate
an educated guess before answering. Following instructions in the
window, enter your guesses.
- The program will then plot your line superimposed on a
plot of the data. How did you do? From the plot you should see how
your fitted line can be improved. When you are done looking at the
plot, click on the lower part of the vertical scroll bar on the right
side of the graphics window. This causes a plot of the residuals from
your guessed line versus X. The program also displays the SSE for
the line you fit and gives you a residual plot. Mark the SSE down.
- Using the feedback from the data plots,
try to improve your fit. A message will appear in the guess window
asking you if you want to try again. Type 'G' to guess again or 'Q'
to quit. Make another guess, submitting numbers as you did before.
In terms of SSE and of the data plots how did you do? Be sure to look
at both plots (you won't be able to continue until you do).
Keep track of your best fit and keep trying until you think you've done
as well as you can.
- When you want to see how close your guess is to the
least-squares line type 'S' in the macro window when you are prompted.
This will get you the least squares fits of slope and intercept and
the minimum SSE. How does your best fit compare? If you run the
least squares slope and intercept through the macro, you can compare
the resulting plots with those from your best fit, and you can find
the SSE for the least squares fit.
Doing Lab 7-2 with SAS
The instructions below are keyed to the instructions in the text.
- 2
- Access the macro LAB7_2. In response to the prompts, input FUEL
(or SASDATA.FUEL) as
the data set, FADJ as the response and
E_RATIO as the regressor. Hitting return will generate a scatterplot
of the response versus the regressor with the least squares line
superimposed. Clicking on the scroll bar in the graph window will
show a plot of the Studentized residuals versus the regressor.
The SAS Output Window contains the regression parameters, SSE, fits,
residuals, etc. Note
the value of the coefficient of determination (R-square). Describe the
pattern of the residual plot.
- 3-4
- In the data entry window, you will be prompted
for the value p for the power transformation you want to apply. The
macro will regress FADJ^p on E_RATIO and output the same plots and
regression output as in 2. Try the two suggested values of p, and
then keep trying more values until you find a nearly linear
relationship.
Joe Petruccelli < jdp@wpi.edu>
Last modified: Mon Jan 11 14:44:21 EST 1999