What on earth do people mean when they tell you they’re going to “fit a model”?
It looks like you could draw a nice straight line through this cloud of points. That means that a linear model might be a good choice for this data. We’ve just done the first step in the model-fitting process: we’ve decided to use a line – a simple linear model.
The process of picking the best line for this model is called “fitting”. There are different ways to do this – least squares is possibly the most familiar one. You could also use the “wiggle a ruler around on paper” method, or the “draw lines in PowerPoint” method. We’ll skip the details of that step because the internet describes least squares fairly well.
I’m going to use an equation here, but it’s simple and it’s the only one I’ll use in this post!
That fitted line can be described with the equation y=mx+b. When we fit the model what we’re really doing is choosing the values for m and b – the slope and the intercept. The point of fitting the model is to find this equation – to find the values of m and b such that y=mx+b describes a line that fits our observed data well. In the case of the best fit model above, m is close to 1, and b is just a bit larger than 0.
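Here is a minimal sketch of what choosing m and b by least squares can look like in plain Python. The data points are made up for illustration (they are not the points from the plot), and `fit_line` is a hypothetical helper name, not a function from any library.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the least-squares line passes through (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Made-up points that roughly follow y = x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.3, 1.1, 2.4, 2.9, 4.2, 5.1]

m, b = fit_line(xs, ys)
print(f"m = {m:.3f}, b = {b:.3f}")
```

For these points the slope comes out close to 1 and the intercept a little above 0, which is the same kind of answer described above.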
And why do we care about this? Well, that value of m can be really informative. If m is very large and positive, then a small change in the value of x predicts a tremendous positive change in the value of y. If m is small, then changes in x predict small changes in y. A large m implies that x may have a large effect on y, hence m is also sometimes called the effect size. It’s also sometimes called a coefficient.
The principles of this model-fitting can be applied to linear models created on data with many more parameters – many dimensions – and those models are fit in similar ways. We may not be able to draw multidimensional datasets in neat graphs, but we can still apply least squares to them!
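To make the many-dimensions case concrete, here is a rough sketch of least squares with two predictors, solved through the normal equations in plain Python. The helper `fit_linear` and the data are hypothetical; in practice you would probably reach for a library routine such as `numpy.linalg.lstsq` rather than hand-rolled elimination.

```python
def fit_linear(rows, ys):
    """Least-squares fit of y = b + m1*x1 + m2*x2 + ... via the normal equations.

    rows: one list of predictor values per observation.
    Returns [b, m1, m2, ...].
    """
    # Prepend a 1 to each row so the intercept is fit like any other coefficient.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Build the augmented system [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * y for row, y in zip(X, ys))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Made-up data generated exactly from y = 1 + 2*x1 - 3*x2,
# so the fit should recover those three numbers.
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

b, m1, m2 = fit_linear(rows, ys)
print(f"b = {b:.3f}, m1 = {m1:.3f}, m2 = {m2:.3f}")
```

There is now one slope per predictor instead of a single m, but the idea is unchanged: pick the coefficients that make the squared prediction errors as small as possible.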
Testing for significance
Now that we’ve fit a model and found values for m and b, we’d like to know something: does m really matter?
Take these two sets of data:
You got me; the first example is just a copy of the one above. But the second is another dataset that we could imagine fits the *same* linear model – that is, the best-fit linear model could have the exact same values for m and b. But really, does the value of x predict the value of y well here? We can tell by looking that it doesn’t.
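If you want to convince yourself that two very different clouds of points can share one best-fit line, here is a small sketch. The data are invented: the scatter pattern is chosen to sum to zero and to be uncorrelated with x, so scaling it up adds noise without moving the fitted line. `fit_line` and `rss` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def rss(xs, ys, m, b):
    """Residual sum of squares: total squared distance from points to the line."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


xs = [0, 1, 2, 3, 4, 5]
# Made-up scatter that sums to zero and is uncorrelated with xs, so adding
# any multiple of it leaves the least-squares line unchanged.
pattern = [1, -2, 1, 1, -2, 1]
tight = [x + 0.25 + 0.1 * e for x, e in zip(xs, pattern)]
noisy = [x + 0.25 + 2.0 * e for x, e in zip(xs, pattern)]

for name, ys in [("tight", tight), ("noisy", noisy)]:
    m, b = fit_line(xs, ys)
    print(f"{name}: m = {m:.3f}, b = {b:.3f}, RSS = {rss(xs, ys, m, b):.2f}")
```

Both datasets produce the same m and b, but the noisy one sits much farther from its line, which is exactly the difference your eye picks up in the plots.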
That’s why many model-fitting tools return not only a slope for each parameter, but also a p-value. This p-value is an indicator of whether that predictor (x) is actually useful in informing you about the state of the response variable (y).
To assess whether a parameter is predictive, we remove the variable (x) and its coefficient (m) from the model. And then we see how good we are at predicting y with a model that doesn’t include them.
In this case, a model with no x means that we guess: the prediction for y is always the same, namely the mean of all of the observed values of y.
We compare the predictions between these two models. If our model that includes x is much better at prediction, we assign a low p-value to that coefficient.
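Here is a rough sketch of that comparison in plain Python. It measures each model's prediction error as a sum of squared residuals, then turns the comparison into a p-value with a permutation test: shuffle the y values against x in every possible order and count how often a line fits the shuffled data at least as well as it fits the real data. The permutation test is my stand-in for illustration; real tools usually get the p-value from a formula-based test, and I'm not claiming this is what any particular tool does. The data and helper names are made up.

```python
from itertools import permutations


def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def rss_line(xs, ys):
    """Squared prediction error of the model that includes x."""
    m, b = fit_line(xs, ys)
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


def rss_mean(ys):
    """Squared prediction error of the no-x model: always guess the mean of y."""
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def permutation_p_value(xs, ys):
    """Fraction of all orderings of ys that a line fits at least as well as the real one."""
    observed = rss_line(xs, ys)
    count = total = 0
    for shuffled in permutations(ys):
        total += 1
        # Small tolerance so orderings with an equal fit are not lost to rounding.
        if rss_line(xs, shuffled) <= observed + 1e-9:
            count += 1
    return count / total


# Made-up datasets that share the same best-fit line (m = 1, b = 0.25).
xs = [0, 1, 2, 3, 4, 5]
tight = [0.35, 1.05, 2.35, 3.35, 4.05, 5.35]
noisy = [2.25, -2.75, 4.25, 5.25, 0.25, 7.25]

p_tight = permutation_p_value(xs, tight)
p_noisy = permutation_p_value(xs, noisy)

print(f"tight: line RSS = {rss_line(xs, tight):.2f}, mean-only RSS = {rss_mean(tight):.2f}, p = {p_tight:.4f}")
print(f"noisy: line RSS = {rss_line(xs, noisy):.2f}, mean-only RSS = {rss_mean(noisy):.2f}, p = {p_noisy:.4f}")
```

With only six points there are just 720 orderings, so trying them all is cheap; with real data you would sample random shuffles instead. The tight dataset gets a very small p-value because almost no shuffling fits a line as well as the real ordering does, while the noisy dataset does not, even though both have the same m and b.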
We do this kind of testing for significance in many statistical settings, including one of my favorites: testing for differential expression of genes in RNA-Seq experiments. If a linear model that includes the expression level of gene A is better at predicting which group a sample comes from than a model without A, we decide that gene A is significantly differentially expressed.