The following tool will allow you to generate bivariate data sets and compute linear regressions.
To generate your data set (or skip these steps and just press "Generate!"):
Choose N, the number of data points. The tool was designed with N ≤ 42 in mind; values above N = 100 may cause a noticeable delay.
Choose a method of generating the x coordinates:
"evenly spaced" will divide the specified interval up equally.
"uniform distribution" will generate random values in a range, with each value equally likely to be picked.
The default, "normal distribution", will tend to generate values close to the mean.
Choose coefficients for generating the y coordinates.
By default, the coefficient of x² is 0: data is generated along a line and then displaced by a random amount.
Set the St.Dev. of the random displacement to 0 to generate data perfectly along a line or parabola.
Set the coefficients of x and x² to 0 to have y generated purely randomly (not based on the x value).
Press "Generate!"
Press "Generate!" again to rerandomize.
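The generation steps above can be sketched in Python. This is only a rough sketch, not the tool's actual code: the function name `generate_data`, its parameter names and default values (other than the x² coefficient of 0 and the normal-distribution default), the choice to include both endpoints of the interval for evenly spaced points, and the use of a normal distribution for the random displacement are all assumptions made for illustration.

```python
import random


def generate_data(n=20, method="normal", lo=0.0, hi=10.0, mean=0.0, sd=1.0,
                  a=0.0, b=1.0, c=0.0, noise_sd=1.0, rng=None):
    """Return (xs, ys) with y = a*x^2 + b*x + c + random displacement.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    rng = rng or random.Random()

    # Step 1: generate the x coordinates.
    if method == "even":
        # Assumption: both endpoints of [lo, hi] are included.
        if n == 1:
            xs = [lo]
        else:
            xs = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    elif method == "uniform":
        xs = [rng.uniform(lo, hi) for _ in range(n)]
    elif method == "normal":
        xs = [rng.gauss(mean, sd) for _ in range(n)]
    else:
        raise ValueError(f"unknown method: {method!r}")

    # Step 2: generate y along the curve, then displace it.
    # Assumption: the displacement is normal with mean 0 and St.Dev. noise_sd.
    ys = []
    for x in xs:
        noise = rng.gauss(0.0, noise_sd) if noise_sd > 0 else 0.0
        ys.append(a * x * x + b * x + c + noise)
    return xs, ys


# With the displacement St.Dev. set to 0, the points lie exactly on the curve.
xs, ys = generate_data(n=5, method="even", lo=0, hi=4, a=1, b=2, c=3, noise_sd=0)
print(list(zip(xs, ys)))
```

Calling `generate_data` again with a nonzero `noise_sd` plays the role of pressing "Generate!" again: the x method and coefficients stay the same, but the random values change.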
Was this helpful? Confusing? Didn't work correctly in your browser? Feedback to thomas.kern@oswego.edu is highly appreciated.
N:
Choose how you would like to generate your data:
Generate x:
Evenly Spaced
Random (Uniform Distribution)
Random (Normal Distribution)
Mean:
St.Dev.:
Generate y:
As a function of x:
x² + x +
Plus a random displacement:
St.Dev.:
Hint: Leave entries as 0 to not include them.
μx
μy
σx
σy
ρ
ρ²
Linear:
Automatically fit data in window
Maintain aspect ratio
Draw axes
Draw regression line
Draw residual lines
Min x:
Max x:
Min y:
Max y:
Residuals:
Questions to think about:
Set the coefficient of x² under "generate y as a function of x" to 0, and the standard deviation under "plus a random displacement" to 1.
How is the linear regression found related to the "generate y as a function of x" line? Try generating new data several times and adjusting the coefficients under "generate y as a function of x".
Set the coefficients of x² and x under "generate y as a function of x" to 0, and the standard deviation under "plus a random displacement" to a nonzero value.
Press "Generate!" a bunch of times. How frequently does the tool detect a linear relationship despite x and y being generated completely independently?
Use the tool below to streamline the process:
Number of iterations:
Linear relations found:
These misclassifications are called "false positives" or "type I errors" and occur naturally and unavoidably in the course of studying statistics. We will talk about them later on in class. For now you can read more about them below.
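The repeated experiment can also be sketched in Python. The page does not say what criterion the tool uses to decide that a linear relation was "found", so this sketch assumes one: a two-sided t-test on the correlation coefficient at the 5% significance level, with N fixed at 20 so that the critical value 2.101 (Student's t, 18 degrees of freedom) applies. The function names and the choice of standard normal x and y are likewise illustrative assumptions.

```python
import math
import random

N = 20                 # hypothetical sample size
T_CRIT_DF18 = 2.101    # two-sided 5% critical value of Student's t, df = N - 2 = 18


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def count_false_positives(iterations=1000, seed=None):
    """Count how often independent x and y pass the assumed significance test."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(iterations):
        # x and y are generated independently, so any "relation" is spurious.
        xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(N)]
        r = correlation(xs, ys)
        t = r * math.sqrt((N - 2) / (1.0 - r * r))
        if abs(t) > T_CRIT_DF18:
            hits += 1
    return hits


iterations = 2000
hits = count_false_positives(iterations, seed=1)
print(f"Linear relations found: {hits} of {iterations}")
```

Under this assumed test, the long-run share of false positives should be near the chosen significance level; a different criterion in the actual tool would give a different rate.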
What happens to the graph as you change the standard deviation under "plus a random displacement"?
Set the method for generating x to be "evenly spaced" and the standard deviation under "generate y: plus a random displacement" to 0. Set the coefficient of x² under "generate y as a function of x" to be 1. By playing around with the coefficient of x, the constant term, and the upper and lower bounds for x, can you get ρ to be close to 0? Close to 1? Close to -1?
Set the coefficient of x² under "generate y as a function of x" to 0. Play around with the coefficient of x and the standard deviation of the random displacement to see if you can find patterns about when you are likely to get linear relationships.
Set the "generate y as a function of x" to 1 x² + 20 x + 1, and the standard deviation of the "plus a random displacement" to 0. How good is the linear regression model? You may wish to uncheck "maintain aspect ratio" to get a better plot. Compare the linear regression model with the quadratic model y = x² + 20 x + 1, which would be perfect. What do the residuals look like?
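For readers who want to inspect the residuals numerically as well as on the plot, here is a small Python sketch of the last question. It fits a least-squares line to points lying exactly on y = x² + 20x + 1 and prints each residual (observed y minus the y predicted by the line). The helper name `fit_line` and the evenly spaced x values 0, 1, ..., 10 are assumptions made for illustration; the tool's own interval may differ.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ≈ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical interval: evenly spaced x from 0 to 10, no random displacement.
xs = [float(i) for i in range(11)]
ys = [x * x + 20 * x + 1 for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(f"fitted line: y = {slope:.2f} x + {intercept:.2f}")
for x, r in zip(xs, residuals):
    print(f"x = {x:4.1f}   residual = {r:7.2f}")
```

Reading down the printed column, look at how the sign of the residual changes with x and compare that with what the "Residuals" plot in the tool shows.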
The Spurious Correlations webpage looks through a large database of time-series data to find datasets that are correlated by simple chance. (Content warning: includes data on various causes of death.)
A fictional account of finding a false positive can be found on xkcd.