Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Linear regression with R: How do I get labels on data points in the Q-Q plot, Scale-Location plot, Residuals vs Leverage plot, etc.?

I have a small dataset with EU member states that contains values on their degree of negotiation success and the activity level the member states showed in the negotiations.

I am doing a linear regression with R.

In short the hypothesis is: The more activity a member state shows, the more success it will have in negotiations.

I played around a lot with the data, transformed it etc.

What I have done so far:

# Stored the dataset from a csv file in object linData
linData = read.csv(file.choose(), sep = ";", encoding = "de_DE.UTF-8")

# As I like to switch variables and test different models, I send the relevant ones to objects x and y.
# So it is easier for me to change it in the future.
x = linData$ALL_Non_Paper_Art.Ann.Recit.Nennung
y = linData$Success_high

# I put the label for each observation in a factor lab
lab = linData$MS_short

# After this I run the linear model
linModel = lm(y~x, data = linData)
summary(linModel)

# I create a simple scatterplot. Here the labels from the factor lab work fine
plot(x, y)
text(x, y, labels=lab, cex= 0.5, pos = 4)

So far so good. Now I want to check the model quality. For visual inspection I found out I can use the command

plot(linModel)

This produces 4 plots in a row. In every plot R marks problematic observations with a number. It would be very convenient if R could use the column "MS_short" from the dataset to label the marked observations instead. I am sure this is possible... but how?

I have been working with R for 2 months now. I found some stuff here and via Google, but nothing helped me solve the problem. I have no one I can ask. This is my first post here on Stack Overflow.

Thank you in advance, Rainer



1 Answer


With the help of G. Grothendieck I solved the problem.

I opened the R help for plot, more specifically the help page for plotting linear models (plot.lm), with the command

?plot.lm

I read the "Usage" and "Arguments" sections and identified the labels.id argument and the id.n argument.

id.n is "number of points to be labelled in each plot, starting with the most extreme."

I needed that. I was interested in identifying these extreme points. R already marked the 3 most extreme points in all plots (see initial post) but used the observation numbers and not any useful labels. Labelling more points than that would clutter the plots. So, to keep in mind: in my case I want the 3 most extreme values to be labelled.
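To make "most extreme" concrete, here is a minimal sketch of the idea for the Residuals vs Fitted plot: rank the observations by absolute residual and keep the top 3. The data frame toy and its columns activity and success are made up for illustration and only stand in for the real CSV; the other diagnostic plots rank by related quantities (standardized residuals, Cook's distance), so the flagged points can differ between panels.

```r
# Toy stand-in for the real data; every value below is made up
set.seed(42)
toy <- data.frame(
  MS_short = c("AT", "BE", "DE", "DK", "ES", "FI", "FR", "IT", "NL", "SE"),
  activity = c(3, 7, 12, 5, 9, 4, 11, 8, 6, 10)
)
toy$success <- 1 + 0.4 * toy$activity + rnorm(nrow(toy))

fit <- lm(success ~ activity, data = toy)

# Rank observations by absolute residual and keep the 3 largest,
# which is the idea behind id.n = 3 in the Residuals vs Fitted plot
res  <- residuals(fit)
top3 <- order(abs(res), decreasing = TRUE)[1:3]

# Row number, country code and residual of the 3 most extreme points
print(data.frame(row = top3, label = toy$MS_short[top3], residual = res[top3]))
```

The row column corresponds to the numbers R prints next to the points by default; the label column is what one would rather see there.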

Now let's add this to the command. I started as above, with a plot of my already computed linear model: plot(linModel). Then I added "id.n =" and set the value to 3. That looked like this:

plot(linModel, id.n = 3)

So far so good. Now R knows which points to label, but still not what to use as the label. For this we have to add labels.id to the command.

labels.id is the "vector of labels, from which the labels for extreme points will be chosen."
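As an aside, the help page lists names(residuals(x)) as the default for labels.id, and those names come from the row names of the data. So a possible alternative, sketched here on made-up data (toy, activity and success are illustrative names, not from the original dataset), is to set the row names before fitting:

```r
# Toy stand-in for the real data; every value below is made up
set.seed(7)
toy <- data.frame(
  MS_short = c("AT", "BE", "DE", "DK", "ES", "FI"),
  activity = c(2, 5, 9, 4, 7, 3)
)
toy$success <- 1 + 0.4 * toy$activity + rnorm(nrow(toy))

# Use the country codes as row names (they must be unique and non-missing)
rownames(toy) <- toy$MS_short

fit <- lm(success ~ activity, data = toy)

# The residuals now carry the country codes as names, and these names
# are the documented default for labels.id
print(names(residuals(fit)))

# So the diagnostic plot is labelled without passing labels.id
plot(fit, which = 1, id.n = 3)
```

This only works when the variables in the formula are columns of the data frame passed via data, and when the label column has no duplicates.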

I assumed that a column of my dataset (NOT the linear model!) can be used as a vector, so I added a comma and then "labels.id =" to the command, followed by the name of my dataset and the column. In my case that is "linData$MS_short", where linData is the dataset and MS_short is the column with the 2-letter code for each member state. The final command looked like this:

plot(linModel, id.n = 3, labels.id = linData$MS_short)
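For anyone who wants to try this without the original CSV, here is a self-contained sketch of the same call on made-up data. The data frame toy and the columns activity and success are hypothetical stand-ins for linData and its variables; only id.n and labels.id are taken from ?plot.lm.

```r
# Toy stand-in for the real data; every value below is made up
set.seed(1)
toy <- data.frame(
  MS_short = c("AT", "BE", "DE", "DK", "ES", "FI", "FR", "IT", "NL", "SE"),
  activity = 1:10
)
toy$success <- 2 + 0.5 * toy$activity + rnorm(nrow(toy))

fit <- lm(success ~ activity, data = toy)

# Assumption: no missing values, so toy$MS_short lines up row by row
# with the cases used in the fit
op <- par(mfrow = c(2, 2))  # show all four diagnostic plots on one page
plot(fit, id.n = 3, labels.id = toy$MS_short)
par(op)                     # restore the previous plot layout
```

If lm() drops rows because of missing values, the label vector probably needs the same rows removed, because the labels appear to be matched to the fitted cases by position.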

And then it worked. End of story.

Hope this can help some other newbies. Greetings.

