--- title: "Introduction to Biostatistics
The 2nd Lecture" author: "岩田洋佳 hiroiwata@g.ecc.u-tokyo.ac.jp" date: "4/15/2022" output: html_document --- ## パッケージ このコードを実行するのに必要なパッケージ ```{r, message = F} require(here) require(plotly) ``` ## Simple regression analysis Changes in one variable may affect another, such as the relationship between breeding and cultivation conditions and the growth of animals and plants. One of the statistical methods to model the relationship between these variables is regression analysis. By statistically modeling the relationship between variables, it becomes possible to understand the causal relationship that exists between variables, and to predict one variable from another. Here, first, we will discuss simple regression analysis that models the relationship between two variables as a “linear relationship”. In this case, the mechanism of single regression analysis will be explained using the analysis of rice data (Zhao et al. 2011, Nature Communications 2: 467) as an example. First, read the rice data in the same way as before. Before entering the following command, change your R working directory to the directory (folder) where the two input files (RiceDiversityPheno.csv, RiceDiversityLine.csv) are located. ```{r} # this data set was analyzed in Zhao 2011 (Nature Communications 2:467) pheno <- read.csv("RiceDiversityPheno.csv") line <- read.csv("RiceDiversityLine.csv") line.pheno <- merge(line, pheno, by.x = "NSFTV.ID", by.y = "NSFTVID") head(line.pheno)[,1:12] ``` Prepare analysis data by extracting only the data used for simple regression analysis from the read data. Here we analyze the relationship between plant height (Plant.height) and flowering timing (Flowering.time.at.Arkansas). In addition, principal component scores (PC1 to PC4) representing the genetic background to be used later are also extracted. Also, remove samples with missing values in advance. ```{r} # extract variables for regression analysis data <- data.frame( height = line.pheno$Plant.height, flower = line.pheno$Flowering.time.at.Arkansas, PC1 = line.pheno$PC1, PC2 = line.pheno$PC2, PC3 = line.pheno$PC3, PC4 = line.pheno$PC4) data <- na.omit(data) head(data) ``` First, visualize the relationship between two variables. ```{r} # look at the relationship between plant height and flowering time plot(data$height ~ data$flower) ``` As shown in the above figure, the earlier the flowering time, the shorter the plant height, while the later the flowering time, the taller the plant height. Let’s create a simple regression model that explains the variation in plant height by the difference in flowering timing. ```{r} # perform single linear regression model <- lm(height ~ flower, data = data) ``` The result of regression analysis (estimated model) is assigned to “model”. Use the function “summary” to display the result of regression analysis. ```{r} # show the result summary(model) ``` I will explain the results displayed by executing the above command in order. >Call:
> Call:
> lm(formula = height ~ flower, data = data)

This is a repeat of the command you entered earlier. If you read this output right after typing the command, it does not seem to be useful information. However, if you build several regression models and compare them, as described later, it is useful because you can reconfirm the model employed in each analysis.

Here, denoting the plant height by $y_i$ and the flowering timing by $x_i$, the regression analysis is performed with the model
$$
y_i = \mu + \beta x_i + \epsilon_i
$$
As mentioned earlier, $x_i$ is called the independent variable or explanatory variable, and $y_i$ is called the dependent variable or response variable. $\mu$ and $\beta$ are called the parameters of the regression model, and $\epsilon_i$ is called the error. Also, $\mu$ is called the population intercept and $\beta$ the population regression coefficient.

Since it is not possible to know the true values of the parameters $\mu$ and $\beta$ directly, they are estimated from samples. The estimates of $\mu$ and $\beta$ obtained from a sample are called the sample intercept and sample regression coefficient, respectively, and are denoted by $m$ and $b$. Since $m$ and $b$ are values estimated from samples, they are random variables that vary depending on the samples selected by chance. Therefore, they follow probability distributions. Details will be described later.
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -43.846 -13.718   0.295  13.409  61.594 

This output gives an overview of the distribution of the residuals. You can use this information to check the regression model. For example, the model assumes that the expected value (mean) of the error is 0, so you can check whether the median is close to 0. You can also check whether the distribution is symmetrical around 0, i.e., whether the maximum and minimum, or the first and third quartiles, have almost the same absolute value. In this example, the maximum is slightly larger in absolute value than the minimum, but otherwise no major issues are found.
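As a quick check of these points, the same summary of the residuals can be reproduced directly from the fitted model. This is a minimal sketch (not part of the original analysis), using the fitted object `model` from above.

```{r}
# quick residual check (illustrative): the median should be near 0 and
# the quartiles roughly symmetric around it
summary(resid(model))
hist(resid(model), main = "Distribution of residuals", xlab = "residual")
```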
> Coefficients:
>              Estimate Std. Error t value Pr(>|t|)    
> (Intercept)  58.05464    6.92496   8.383 1.08e-15 ***
> flower        0.67287    0.07797   8.630  < 2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimates of the parameters $\mu$ and $\beta$, i.e., $m$ and $b$, together with their standard errors, $t$ values, and $p$ values, are shown. Asterisks at the end of each line represent significance levels: one star represents 5%, two stars 1%, and three stars 0.1%.
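The same coefficient table can also be extracted programmatically, which is convenient when the estimates are needed for further calculations. A minimal sketch using the fitted object `model`:

```{r}
# extract the coefficient table (estimates, standard errors, t and p values)
coef(summary(model))
```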
> Residual standard error: 19 on 371 degrees of freedom
> Multiple R-squared:  0.1672, Adjusted R-squared:  0.1649 
> F-statistic: 74.48 on 1 and 371 DF,  p-value: < 2.2e-16

The first line shows the standard deviation of the residuals. This is the value represented by $s$, where $s^2$ is the estimated value of the error variance $\sigma^2$. The second line is the coefficient of determination $R^2$. This index and the adjusted $R^2$ represent how well the regression explains the variation of $y$. The third line is the result of the $F$ test that represents the significance of the regression model. It is a test under the hypothesis (null hypothesis) that all regression coefficients are 0; if this $p$ value is very small, the null hypothesis is rejected and the alternative hypothesis (the regression coefficient is not 0) is adopted.

Let's look at the results of the regression analysis graphically. First, draw a scatter plot and add the regression line.

```{r}
# again, plot the two variables
plot(data$height ~ data$flower)
abline(model, col = "red")
```

Next, calculate and plot the values of $y$ obtained when the regression model is fitted to the data.

```{r}
# calculate fitted values
height.fit <- fitted(model)
plot(data$height ~ data$flower)
abline(model, col = "red")
points(data$flower, height.fit, pch = 3, col = "green")
```

The values of $y$ calculated by fitting the model all lie on a straight line. An observed value of $y$ is expressed as the sum of the variation explained by the regression model and the error which is not explained by the regression. Let's visualize the error in the figure and check this relationship.

```{r}
# plot residuals
plot(data$height ~ data$flower)
abline(model, col = "red")
points(data$flower, height.fit, pch = 3, col = "green")
segments(data$flower, height.fit,
         data$flower, height.fit + resid(model), col = "gray")
```

The value of $y$ is expressed as the sum of the value of $y$ calculated by fitting the model (green points) and the residual of the model (gray line segments).

Let's use the regression model to predict $y$ for $x = (60, 80, \dots, 140)$, which are not actually observed.

```{r}
# predict unknown data
height.pred <- predict(model, data.frame(flower = seq(60, 140, 20)))
plot(data$height ~ data$flower)
abline(model, col = "red")
points(data$flower, height.fit, pch = 3, col = "green")
segments(data$flower, height.fit,
         data$flower, height.fit + resid(model), col = "gray")
points(seq(60, 140, 20), height.pred, pch = 2, col = "blue")
```

All the predicted values again lie on the regression line.

## Method for calculating the parameters of a regression model

Here we will explain how the parameters of a regression model are calculated, and actually compute the regression coefficients with R commands. As mentioned earlier, the simple regression model is
$$
y_i = m + b x_i + e_i
$$
This equation means that the observed value $y_i$ consists of the part $m + b x_i$ explained by the regression equation and the error part $e_i$ not explained by the regression line. In the above equation, if $m$ or $b$ is changed, the error $e_i$ changes accordingly. So, how can we find the "optimal" parameters? There are various criteria for what is considered "optimal", but here we consider minimizing the errors $e_i$ across the data. Since errors can take both positive and negative values, they cancel each other in a simple sum. So, we consider minimizing the sum of squared errors (SSE). That is, consider $m$ and $b$ that minimize the following equation:
$$
SSE=\sum_{i=1}^n e_i^2 =\sum_{i=1}^n (y_i-(m+b x_i))^2
$$
The following figure shows the change in SSE for various values of $m$ and $b$.
The commands to draw the figure are a little complicated, but they are as follows:

```{r}
# visualize the plane for optimization
x <- data$flower
y <- data$height
m <- seq(0, 100, 1)
b <- seq(0, 2, 0.02)
sse <- matrix(NA, length(m), length(b))
for(i in 1:length(m)) {
  for(j in 1:length(b)) {
    sse[i, j] <- sum((y - m[i] - b[j] * x)^2)
  }
}
plotly::plot_ly(x = m, y = b, z = sse, type = "surface")
```

It should be noted that at the point where SSE reaches its minimum in the above figure, SSE should not change (the slope of the tangent is zero) even when $m$ or $b$ changes slightly. Therefore, the coordinates of the minimum point can be determined by partially differentiating the above equation with respect to $m$ and $b$ and setting the derivatives to zero. That is,
$$
\frac{\partial SSE}{\partial m}=0 \\
\frac{\partial SSE}{\partial b}=0
$$
We should obtain the values of $m$ and $b$ that satisfy these equations. This method of calculating the parameters of a regression model by minimizing the sum of squared errors is called the least squares method.

Note that $m$ minimizing SSE satisfies
$$
\frac {\partial SSE}{\partial m}=-2 \sum_{i=1}^n (y_i-m-b x_i) = 0 \\
\Leftrightarrow \sum_{i=1}^n y_i -nm -b \sum_{i=1}^n x_i =0 \\
\Leftrightarrow m = \frac {\sum_{i=1}^n y_i} n - b \frac {\sum_{i=1}^n x_i} n =\bar y-b\bar x
$$
Also, $b$ minimizing SSE satisfies
$$
\frac {\partial SSE}{\partial b}=-2 \sum_{i=1}^n x_i(y_i-m-b x_i) = 0 \\
\Leftrightarrow \sum_{i=1}^n x_iy_i-m \sum_{i=1}^n x_i - b \sum_{i=1}^n x_i^2 =0\\
\Leftrightarrow \sum_{i=1}^n x_iy_i- n(\bar y- b \bar x)\bar x- b \sum_{i=1}^n x_i^2 =0\\
\Leftrightarrow \sum_{i=1}^n x_iy_i - n\bar x \bar y- b \left(\sum_{i=1}^n x_i^2 -n\bar x ^2\right)=0\\
\Leftrightarrow b = \frac {\sum_{i=1}^n x_i y_i - n\bar x \bar y}{\sum_{i=1}^n x_i^2 - n\bar x^2}= \frac {SSXY}{SSX}
$$
Here, $SSXY$ is the sum of products of the deviations of $x$ and $y$, and $SSX$ is the sum of squared deviations of $x$:
$$
SSXY = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) \\
= \sum_{i=1}^n x_iy_i - \bar x \sum_{i=1}^n y_i - \bar y \sum_{i=1}^n x_i + n\bar x \bar y\\
= \sum_{i=1}^n x_iy_i - n\bar x \bar y - n\bar y\bar x + n\bar x\bar y\\
= \sum_{i=1}^n x_iy_i - n\bar x \bar y
$$
$$
SSX = \sum_{i=1}^n (x_i - \bar x)^2 \\
= \sum_{i=1}^n x_i^2 - 2 \bar x \sum_{i=1}^n x_i + n\bar x^2 \\
= \sum_{i=1}^n x_i^2 - 2n\bar x^2 + n\bar x^2 \\
= \sum_{i=1}^n x_i^2 - n\bar x^2
$$
The values of $m$ and $b$ that minimize SSE are the least-squares estimates of the parameters. That is,
$$
b=\frac {SSXY}{SSX} \\
m=\bar y- b \bar x
$$
Now let's calculate the regression coefficients based on the above equations. First, calculate the sum of products of deviations and the sum of squared deviations.

```{r}
# calculate sum of squares (ss) of x and ss of xy
n <- length(x)
ssx <- sum(x^2) - n * mean(x)^2
ssxy <- sum(x * y) - n * mean(x) * mean(y)
```

Then, we calculate the slope $b$.

```{r}
# calculate b
b <- ssxy / ssx
b
```

Next, calculate the intercept $m$.

```{r}
# calculate m
m <- mean(y) - b * mean(x)
m
```

Let's draw the regression line based on the calculated estimates.

```{r}
# draw scatter plot and regression line
plot(y ~ x)
abline(m, b)
```

Make sure that the same regression line is obtained as with the function lm used earlier. Note that once the regression parameters $\mu$ and $\beta$ are estimated, it is possible to calculate $\hat y_i$, the value of $y$ corresponding to a given $x_i$.
That is,
$$
\hat y_i=m+b x_i
$$
This makes it possible to calculate the value of $y$ when the model is fitted to the observed $x$, or to predict $y$ when only the value of $x$ is known. Here, let's calculate the values of $y$ when the model is fitted to the observed $x$ and plot them against the observed values.

```{r}
# calculate fitted values
y.hat <- m + b * x
lim <- range(c(y, y.hat))
plot(y, y.hat, xlab = "Observed", ylab = "Fitted", xlim = lim, ylim = lim)
abline(0, 1)
```

Let's calculate the correlation coefficient between the two to find the degree of agreement between the observed and fitted values.

```{r}
# calculate correlation between observed and fitted values
cor(y, y.hat)
```

In fact, the square of this correlation coefficient is the proportion of the variation of $y$ explained by the regression (the coefficient of determination, $R^2$). Let's compare these two statistics.

```{r}
# compare the square of the correlation and R2
cor(y, y.hat)^2
summary(model)
```

## Significance test of a regression model

When the linear relationship between variables is strong, a regression line fits the observations well, and the relationship between the two variables can be well modeled by a regression line. However, when the linear relationship between variables is not clear, modeling with a regression line does not work well. Here, as a method to objectively confirm the goodness of fit of the estimated regression model, we will explain a test using analysis of variance. First, let's go back to the simple regression.

```{r}
model <- lm(height ~ flower, data = data)
```

The significance of the obtained regression model can be tested using the function anova.

```{r}
# analysis of variance of regression
anova(model)
```

As a result of the analysis of variance, the term of flowering time is highly significant ($p<0.001$), confirming the goodness of fit of the regression model in which flowering time explains plant height.

Analysis of variance for regression models involves the following calculations. First of all, the "sum of squares explained by the regression" can be calculated as follows (using $\hat y_i = m + b x_i$ and $m = \bar y - b \bar x$):
$$
SSR=\sum_{i=1}^n (\hat y_i-\bar y)^2 \\
=\sum_{i=1}^n \left[(m+b x_i)-(m+b \bar x)\right]^2 \\
=b^2 \sum_{i=1}^n (x_i-\bar x)^2 \\
=b^2\cdot SSX=b \cdot SSXY
$$
Also, the sum of squared deviations of the observed values $y$ from their mean is expressed as the sum of the sum of squares explained by the regression (SSR) and the residual sum of squares (SSE); the cross term $2\sum_{i=1}^n (y_i-\hat y_i)(\hat y_i-\bar y)$ vanishes for the least-squares fit. That is,
$$
SSY=\sum_{i=1}^n (y_i-\bar y)^2 \\
=\sum_{i=1}^n (y_i-\hat y_i+\hat y_i-\bar y)^2 \\
=\sum_{i=1}^n (y_i-\hat y_i )^2 +\sum_{i=1}^n (\hat y_i-\bar y)^2 \\
=SSE+SSR
$$
Let's actually calculate them using the above equations. First, calculate SSR and SSE.

```{r}
# calculate sum of squares of regression and error
ssr <- b * ssxy
ssr
ssy <- sum(y^2) - n * mean(y)^2
sse <- ssy - ssr
sse
```

Next, calculate the mean squares, which are the sums of squares divided by their degrees of freedom.

```{r}
# calculate mean squares of regression and error
msr <- ssr / 1
msr
mse <- sse / (n - 2)
mse
```

Finally, the mean square of the regression is divided by the mean square of the error to calculate the $F$ value. Furthermore, calculate the $p$ value corresponding to the calculated $F$ value.

```{r}
(f.value <- msr / mse)
1 - pf(f.value, 1, n - 2)
```

The results obtained are in agreement with the results calculated earlier using the function anova.
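As a quick cross-check (an addition to the original analysis, assuming the objects `y`, `model`, and `ssy` computed above), the same sums of squares can also be obtained directly from the fitted model:

```{r}
# cross-check: sums of squares computed directly from the fitted model
sse.direct <- sum(resid(model)^2)
ssr.direct <- sum((fitted(model) - mean(y))^2)
c(SSE = sse.direct, SSR = ssr.direct)
all.equal(ssy, sse.direct + ssr.direct)   # SSY = SSE + SSR
```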
These quantities are also included in the results of the regression analysis displayed by the function "summary".

```{r}
# check the summary of the result of regression analysis
summary(model)
```

"Residual standard error" is the square root of the mean square of the residuals.

```{r}
# square root of mse
sqrt(mse)
```

"Multiple R-squared" ($R^2$) is a value called the coefficient of determination, which is the ratio of SSR to SSY.

```{r}
# R squared
ssr / ssy
```

"Adjusted R-squared" ($R_{adj}^2$) is a value called the adjusted coefficient of determination, which can be calculated as follows.

```{r}
# adjusted R squared
(ssy / (n - 1) - mse) / (ssy / (n - 1))
```

Also, "F-statistic" matches the $F$ value and its $p$ value that are shown as the effect of flowering time in the analysis of variance. In addition, squaring the $t$ value calculated for the regression coefficient of the flowering time term gives the $F$ value ($8.630^2 \approx 74.48$).

$R^2$ and $R_{adj}^2$ can also be expressed using SSR, SSY, and SSE as follows:
$$
R^2=\frac {SSR}{SSY}=1-\frac {SSE}{SSY} \\
R_{adj}^2=1-\frac {n-1}{n-p}\cdot\frac {SSE}{SSY}
$$
Here, $p$ is the number of parameters included in the model, and $p=2$ for a simple regression model. It can be seen that $R_{adj}^2$ applies a larger adjustment as the number of parameters in the model increases (because adding parameters makes the residual sum of squares an underestimate of the true error variation).

## Distributions that the estimated regression coefficients follow

As mentioned earlier, the estimates $m$ and $b$ of the parameters $\mu$ and $\beta$ are values estimated from samples and are random variables that depend on the samples chosen by chance. Thus, the estimates $m$ and $b$ have probability distributions. Here we consider the distributions that these estimates follow.

The estimate $b$ follows the normal distribution
$$
b \sim N\left(\beta,\frac {\sigma^2}{SSX}\right)
$$
Here, $\sigma^2$ is the error variance, $\sigma^2=Var(y_i)=Var(e_i)$. The estimate $m$ follows the normal distribution
$$
m \sim N\left(\mu ,\sigma^2 \left[\frac 1 n + \frac {\bar x^2}{SSX}\right]\right)
$$
Although the true value of the error variance $\sigma^2$ is unknown, it can be replaced by the residual variance $s^2$. That is,
$$
s^2 = \frac {SSE}{n-2}
$$
This value is the mean square of the residuals calculated during the analysis of variance. Then the statistic for $b$,
$$
t = \frac {b-b_0}{s/\sqrt{SSX}},
$$
follows the $t$ distribution with $n-2$ degrees of freedom under the null hypothesis $H_0: \beta = b_0$. The statistic for $m$,
$$
t = \frac {m-m_0}{s\sqrt{\frac 1 n +\frac {\bar x^2} {SSX}}},
$$
follows the $t$ distribution with $n-2$ degrees of freedom under the null hypothesis $H_0: \mu=m_0$.

First, let's test the null hypothesis $H_0:\beta=0$ for $b$. Note that $b_0=0$ in this case.

```{r}
t.value <- (b - 0) / (sqrt(mse / ssx))
t.value
```

This statistic follows the $t$ distribution with 371 degrees of freedom under the null hypothesis. Let's draw a graph of the distribution.

```{r}
# visualize the t distribution under H0
s <- seq(-10, 10, 0.2)
plot(s, dt(s, n - 2), type = "l")
abline(v = t.value, col = "green")
abline(v = - t.value, col = "gray")
```

Compared with the distribution under the null hypothesis, the obtained value of $t$ appears to be very large. Let's perform a $t$ test. Note that it is a two-tailed test.
```{r}
# perform t test
2 * (1 - pt(abs(t.value), n - 2)) # two-sided test
```

The result of this test was already displayed in the regression analysis results.

```{r}
# check the summary of the model
summary(model)
```

The hypothesis test performed above can be performed for any value of $b_0$. For example, let's test the null hypothesis $H_0:\beta=0.5$.

```{r}
# test beta = 0.5
t.value <- (b - 0.5) / sqrt(mse/ssx)
t.value
2 * (1 - pt(abs(t.value), n - 2))
```

The result is significant at the 5% level. Now let us test for $m$. First, test the null hypothesis $H_0:\mu=0$.

```{r}
# test mu = 0
t.value <- (m - 0) / sqrt(mse * (1/n + mean(x)^2 / ssx))
t.value
2 * (1 - pt(abs(t.value), n - 2))
```

This result was also already calculated.

```{r}
# check the summary of the model again
summary(model)
```

Finally, let's test the null hypothesis $H_0:\mu =70$.

```{r}
# test mu = 70
t.value <- (m - 70) / sqrt(mse * (1/n + mean(x)^2 / ssx))
t.value
2 * (1 - pt(abs(t.value), n - 2))
```

The result was not significant even at the 5% level. Draw a graph of the distribution under the null hypothesis.

```{r}
# visualize the t distribution under H0
s <- seq(-5, 5, 0.1)
plot(s, dt(s, n - 2), type = "l")
abline(v = t.value, col = "green")
abline(v = -t.value, col = "gray")
```

## Confidence intervals for regression coefficients and fitted values

The function "predict" has various uses. First, let's call it with only the estimated regression model as its argument. Then, the values of $y$ when the model is fitted to the observed data are calculated. These values $\hat y$ are exactly the same as those calculated by the function "fitted".

```{r}
# fitted values
pred <- predict(model)
head(pred)
head(fitted(model))
```

By setting the options "interval" and "level", you can calculate the confidence interval of $y$ at the specified confidence level when fitting the model (the 95% confidence interval by default).

```{r}
# calculate confidence interval
pred <- predict(model, interval = "confidence", level = 0.95)
head(pred)
```

Let's visualize the 95% confidence interval of $y$ using the function "predict".

```{r}
# draw confidence bands
pred <- data.frame(flower = 50:160)
pc <- predict(model, interval = "confidence", newdata = pred)
plot(data$height ~ data$flower)
matlines(pred$flower, pc, lty = c(1, 2, 2), col = "red")
```

Next, we will consider the case of predicting unknown data. Let us use the predict function to draw the 95% prediction interval, i.e., the interval for the value of $y$ that would be observed for new data. Unlike the fitted value $\hat y$, such a predicted value $\tilde y$ is subject to additional variation due to error.

```{r}
pc <- predict(model, interval= "prediction", newdata = pred)
plot(data$height ~ data$flower)
matlines(pred$flower, pc, lty = c(1, 2, 2), col = "green")
```

With the added error, the variability of the predicted $\tilde y$ is greater than that of the estimated $\hat y$. Note that for a particular $x$, the confidence intervals for the estimated $y$, i.e., $\hat y$, and the predicted $y$, i.e., $\tilde y$, can be found as follows. Here, we find the 99% confidence intervals when $x=120$.
```{r}
# estimate the confidence intervals for the estimate and prediction of y
pred <- data.frame(flower = 120)
predict(model, interval = "confidence", newdata = pred, level = 0.99)
predict(model, interval = "prediction", newdata = pred, level = 0.99)
```

## Polynomial regression model and multiple regression model

So far, we have applied to the data a regression model that represents the relationship between the two variables with a straight line. Let's extend the regression model a bit. First, let's perform regression by a method called polynomial regression. In polynomial regression,
$$
y_i=\mu+\beta_1 x_i+\beta_2 x_i^2+\cdots+\beta_p x_i^p+\epsilon_i
$$
In this way, regression is performed using second- or higher-order terms of $x$. First, let's perform regression using the first- and second-order terms of $x$.

```{r}
# polynomial regression
model.quad <- lm(height ~ flower + I(flower^2), data = data)
summary(model.quad)
```

It can be seen that the proportion of the variation of $y$ explained by the polynomial regression model (the coefficient of determination $R^2$) is larger than that of the simple regression model. Although this will be discussed later, you should not judge that the polynomial regression model is superior based on this value alone. This is because a polynomial regression model has more parameters than a simple regression model and therefore more flexibility when fitting the model to data. It is easy to improve the fit of the model to the data by increasing the flexibility. In extreme cases, the model can be fitted to the data perfectly with as many parameters as the number of data points (in that case, the coefficient of determination $R^2$ is exactly 1). Therefore, an appropriate statistical criterion is required when selecting the best model. This will be discussed later.

Let's visualize the result of the polynomial regression with its confidence interval.

```{r}
# plot(data$height ~ data$flower)
pred <- data.frame(flower = 50:160)
pc <- predict(model.quad, interval = "confidence", newdata = pred)
plot(data$height ~ data$flower)
matlines(pred$flower, pc, lty = c(1, 2, 2), col = "red")
```

When flowering occurs more than 120 days after sowing, it can be seen that the reliability of the model is low.

Next, let's compare the fitted values of the simple regression model and the polynomial regression model with the observed values.

```{r}
# compare predicted and observed values
lim <- range(c(data$height, fitted(model), fitted(model.quad)))
plot(data$height, fitted(model), xlab = "Observed", ylab = "Expected",
     xlim = lim, ylim = lim)
points(data$height, fitted(model.quad), col = "red")
abline(0, 1)
```

The above figure shows the relationship between the fitted and observed values for the simple regression model (black) and the second-order polynomial model (red).

Let's test whether the improvement in the explanatory power of the second-order polynomial model is statistically significant. The significance is tested with an F test of whether the reduction in the residual sum of squares between the two models is sufficiently large relative to the residual sum of squares of the model that contains the other (here, Model 2 contains Model 1).

```{r}
# compare error variance between two models
anova(model, model.quad)
```

The results show that the difference in residual variance between the two models is highly significant ($p<0.001$). In other words, Model 2 has significantly more explanatory power than Model 1.
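The earlier caution about flexibility can be illustrated with a tiny sketch (a toy example of my own, not part of the original analysis): when a model has as many parameters as there are data points, it reproduces the data exactly and $R^2$ equals 1, even though nothing general has been learned.

```{r}
# toy illustration of overfitting: 5 parameters fitted to 5 data points
set.seed(1)                                  # hypothetical toy data
toy <- data.frame(x = 1:5, y = rnorm(5))
model.toy <- lm(y ~ poly(x, 4), data = toy)  # 4th-order polynomial + intercept
summary(model.toy)$r.squared                 # equals 1 (up to rounding error)
```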
Now let's fit a third-order polynomial regression model and test whether it has significantly more explanatory power than the second-order model.

```{r}
# extend the polynomial regression model to a higher-order one
model.cube <- lm(height ~ flower + I(flower^2) + I(flower^3), data = data)
summary(model.cube)
# compare error variance between two models
anova(model.quad, model.cube)
```

The third-order model has slightly better explanatory power than the second-order model. However, the difference is not statistically significant. In other words, extending the second-order model to a third-order model is not a good idea.

Finally, let's apply the multiple linear regression model:
$$
y_i=\mu+\beta_1 x_{1i}+\beta_2 x_{2i}+\cdots+\beta_p x_{pi}+\epsilon_i
$$
In this way, regression is performed using multiple explanatory variables $(x_{1i},x_{2i},\dots,x_{pi})$. In the first lecture, we confirmed graphically that plant height varies depending on the genetic background. Here, we will create a multiple regression model that explains plant height using the genetic background expressed as the scores of four principal components (PC1 to PC4).

```{r}
# multi-linear regression with genetic background
model.wgb <- lm(height ~ PC1 + PC2 + PC3 + PC4, data = data)
summary(model.wgb)
anova(model.wgb)
```

You can see that the coefficient of determination of this regression model is higher than that of the polynomial regression model. The results of the analysis of variance show that all principal components are significant and need to be included in the regression.

Finally, let's combine the polynomial regression model with the multiple regression model.

```{r}
# multi-linear regression with all information
model.all <- lm(height ~ flower + I(flower^2) + PC1 + PC2 + PC3 + PC4, data = data)
summary(model.all)
# compare error variance
anova(model.all, model.wgb)
```

The effect of the genetic background on plant height is very large, but it can also be seen that the explanatory power of the model improves when the effect of flowering time is added.

Lastly, let's compare the first regression model and the final multiple regression model by plotting the observed values against the fitted values.

```{r}
# compare between the simplest and final models
lim <- range(data$height, fitted(model), fitted(model.all))
plot(data$height, fitted(model), xlab = "Observed", ylab = "Fitted",
     xlim = lim, ylim = lim)
points(data$height, fitted(model.all), col = "red")
abline(0,1)
```

As a result, we can see that the explanatory power of the model is significantly improved by considering the genetic background and the second-order term. However, it can also be seen that the two varieties/lines whose flowering is very late (after 180 days) cannot be sufficiently explained even by the final model. There may be room to improve the model, for example by adding new factors as independent variables.

## Experimental design and analysis of variance

When trying to draw conclusions based on experimental results, we must always deal with the presence of errors in the observed values. Errors are inevitable no matter how precise the experiment is; in field experiments in particular, errors are caused by small environmental variations in the field. Experimental design is a methodology devised to obtain objective conclusions without being misled by such errors. First of all, what is most important in planning experiments are Fisher's three principles:
1. Replication: In order to be able to perform statistical tests on experimental results, we repeat the same treatment; for example, each variety is evaluated multiple times. The experimental unit corresponding to one replication is called a plot.
2. Randomization: An operation that makes the effect of errors random is called randomization. For example, in a field test, varieties are randomly assigned to plots in the field using dice or random numbers.
3. Local control: Local control means dividing the field into blocks and managing the environmental conditions within each block so that they are as homogeneous as possible. In the field test example, the field is divided into small units called blocks, and the cultivation environment within each block is made as homogeneous as possible. It is easier to homogenize each block than to homogenize the cultivation environment of the whole field.

The experimental method of dividing the field into several blocks and making the cultivation environment within each block as homogeneous as possible is called the randomized block design. In the randomized block design, the field is divided into blocks, and varieties are randomly assigned within each block. The number of blocks is equal to the number of replications.

Next, I will explain the method of statistical testing in the randomized block design through a simple simulation. First, let's set the "seed" of the random numbers before starting the simulation. A random seed is the source value for generating pseudorandom numbers.

```{r}
# set a seed for random number generation
set.seed(12)
```

Let's start the simulation. Here, consider a field where 16 plots are arranged in a $4 \times 4$ grid, and assume that there is a gradient of soil fertility in the field.

```{r}
# The blocks have unequal fertility among them
field.cond <- matrix(rep(c(4,2,-2,-4), each = 4), nrow = 4)
field.cond
```

Here, it is assumed that the effect on the trait is +4 where soil fertility is highest and -4 where it is lowest.

Now we arrange blocks according to Fisher's three principles. The blocks are arranged so that they reflect the difference in soil fertility well.

```{r}
# set block to consider the heterogeneity of field condition
block <- c("I", "II", "III", "IV")
blomat <- matrix(rep(block, each = 4), nrow = 4)
blomat
```

Next, varieties are randomly arranged within each block, again following Fisher's three principles. Let's prepare for that first.

```{r}
# assume that there are four varieties
variety <- c("A", "B", "C", "D")
# sample the order of the four varieties randomly
sample(variety)
sample(variety)
```

Let's allocate the varieties randomly to each block.

```{r}
# allocate the varieties randomly to each column of the field
varmat <- matrix(c(sample(variety), sample(variety),
                   sample(variety), sample(variety)), nrow = 4)
varmat
```

Consider the differences in the genetic values of the four varieties. Let the genetic values of varieties A to D be +4, +2, -2, and -4, respectively.

```{r}
# simulate genetic ability of the varieties
g.value <- matrix(NA, 4, 4)
g.value[varmat == "A"] <- 4
g.value[varmat == "B"] <- 2
g.value[varmat == "C"] <- -2
g.value[varmat == "D"] <- -4
g.value
```

Environmental variations are generated as random numbers from a normal distribution with a mean of 0 and a standard deviation of 2.5.
```{r}
# simulate error variance (variation due to the heterogeneity of local environment)
e.value <- matrix(rnorm(16, sd = 2.5), 4, 4)
e.value
```

Although the above command generates random numbers, I think you will get the same values as in the textbook. This is because the generated numbers are pseudorandom numbers, produced according to a fixed rule starting from the seed. Note that if you change the value of the random seed, the same values as above will not be generated, and if no seed is set, different random numbers are generated each time you run the code.

Finally, the overall mean, the gradient of soil fertility, the genetic values of the varieties, and the variation due to the local environment are added together to generate simulated observed values of the trait.

```{r}
# simulate phenotypic values
grand.mean <- 50
simyield <- grand.mean + field.cond + g.value + e.value
simyield
```

Before performing the analysis of variance, reshape the data from matrices into vectors and bundle them together.

```{r}
# unfold a matrix to a vector
as.vector(simyield)
as.vector(varmat)
as.vector(blomat)
```

Below, the data are bundled into a data frame.

```{r}
# create a dataframe for the analysis of variance
simdata <- data.frame(variety = as.vector(varmat),
                      block = as.vector(blomat),
                      yield = as.vector(simyield))
head(simdata, 10)
```

Let's plot the created data using the function interaction.plot.

```{r}
# draw interaction plot
interaction.plot(simdata$block, simdata$variety, simdata$yield)
```

It can be seen that the differences between blocks are as large as the differences between varieties.

Let's perform an analysis of variance using the prepared data.

```{r}
# perform the analysis of variance (ANOVA) with simulated data
res <- aov(yield ~ block + variety, data = simdata)
summary(res)
```

It can be seen that both the block and variety effects are highly significant. Note that the block effect is not the subject of the test; it is incorporated into the model in order to estimate the variety effect correctly. The analysis of variance described above can also be performed using the function "lm" for estimating regression models.

```{r}
# perform ANOVA with a linear model
res <- lm(yield ~ block + variety, data = simdata)
anova(res)
```

## Randomized complete block design and completely randomized design

Local control, one of Fisher's three principles, is very important for performing highly accurate experiments in fields with high heterogeneity among plots. Here, assuming the same environmental conditions as before, let's consider performing an experiment without setting up blocks. In the previous simulation experiment, we treated each column as a block and placed A, B, C, and D randomly within that block. Here we assign the plots of 4 varieties $\times$ 4 replicates completely at random across the field. An experiment in which no blocks are arranged and plots are allocated completely at random is called a completely randomized design.

```{r}
# completely randomize the plots of each variety in a field
varmat.crd <- matrix(sample(varmat), nrow = 4)
varmat.crd
```

Note that this time the frequency with which each variety appears differs from column to column, since the varieties are randomly assigned across the entire field. The genetic effects are assigned according to the order of the varieties in the completely random arrangement.
```{r}
# simulate genetic ability of the varieties
g.value.crd <- matrix(NA, 4, 4)
g.value.crd[varmat.crd == "A"] <- 4
g.value.crd[varmat.crd == "B"] <- 2
g.value.crd[varmat.crd == "C"] <- -2
g.value.crd[varmat.crd == "D"] <- -4
g.value.crd
```

As in the previous simulation experiment, the overall mean, the gradient of soil fertility, the genetic effects of the varieties, and the variation due to the local environment are summed up.

```{r}
# simulate phenotypic values
simyield.crd <- grand.mean + g.value.crd + field.cond + e.value
simyield.crd
```

The data are bundled into a data frame.

```{r}
# create a dataframe for the analysis of variance
simdata.crd <- data.frame(variety = as.vector(varmat.crd),
                          yield = as.vector(simyield.crd))
head(simdata.crd, 10)
```

Now let's perform an analysis of variance on the data generated in the simulation. Unlike the previous experiment, we do not set blocks. Thus, we perform the analysis with a model that includes only the varietal effect and no block effect.

```{r}
# perform ANOVA
res <- lm(yield ~ variety, data = simdata.crd)
anova(res)
```

In the above example, the varietal effect is not significant. This is considered to be because the spatial heterogeneity in the field inflates the error, so that the genetic differences between varieties cannot be estimated with sufficient accuracy.

The above simulation experiment was repeated 100 times (the code is shown at the end of this section). As a result, in the experiments using the randomized complete block design, the varietal effect was detected (at the 5% significance level) in 94 out of 100 experiments, but it was detected only 66 times with the completely randomized design. When the significance level was set to 1%, the numbers of experiments in which the varietal effect was detected were 70 and 30, respectively (with the completely randomized design, the varietal effect was missed 70 times!). From this result, it can be seen that adopting the randomized complete block design is effective when there is heterogeneity among replications, such as a gradient of soil fertility. In order to make a time-consuming and labor-intensive experiment as efficient as possible, it is important to design the experiment properly.
```{r}
# perform multiple simulations
n.rep <- 100
p.rbd <- rep(NA, n.rep)
p.crd <- rep(NA, n.rep)
for(i in 1:n.rep) {
  # experiment with randomized block design
  varmat <- matrix(c(sample(variety), sample(variety),
                     sample(variety), sample(variety)), nrow = 4)
  g.value <- matrix(NA, 4, 4)
  g.value[varmat == "A"] <- 4
  g.value[varmat == "B"] <- 2
  g.value[varmat == "C"] <- -2
  g.value[varmat == "D"] <- -4
  e.value <- matrix(rnorm(16, sd = 2.5), 4, 4)
  simyield <- grand.mean + field.cond + g.value + e.value
  simdata <- data.frame(variety = as.vector(varmat),
                        block = as.vector(blomat),
                        yield = as.vector(simyield))
  res <- lm(yield ~ block + variety, data = simdata)
  p.rbd[i] <- anova(res)$Pr[2]

  # experiment with completely randomized design
  varmat.crd <- matrix(sample(varmat), nrow = 4)
  g.value.crd <- matrix(NA, 4, 4)
  g.value.crd[varmat.crd == "A"] <- 4
  g.value.crd[varmat.crd == "B"] <- 2
  g.value.crd[varmat.crd == "C"] <- -2
  g.value.crd[varmat.crd == "D"] <- -4
  simyield.crd <- grand.mean + g.value.crd + field.cond + e.value
  simdata.crd <- data.frame(variety = as.vector(varmat.crd),
                            yield = as.vector(simyield.crd))
  res <- lm(yield ~ variety, data = simdata.crd)
  p.crd[i] <- anova(res)$Pr[1]
}
sum(p.rbd < 0.05) / n.rep
sum(p.crd < 0.05) / n.rep
sum(p.rbd < 0.01) / n.rep
sum(p.crd < 0.01) / n.rep
```

## Report assignment

1. Using the number of seeds per panicle (Seed.number.per.panicle) as the dependent variable $y$ and panicle length (Panicle.length) as the independent variable $x$, fit a simple regression model ($y_i=\mu+\beta x_i+\epsilon_i$) and find the sample intercept $m$ and the sample regression coefficient $b$, which are the estimates of $\mu$ and $\beta$.
2. Test the null hypothesis $H_0:\beta=0.02$.
3. Test the null hypothesis $H_0:\mu=5$.
4. Give the 95% confidence intervals for the estimated $y$, i.e., $\hat y$, and the predicted $y$, i.e., $\tilde y$, at $x=27$.
5. Fit the polynomial regression model ($y_i=\mu+\beta_1 x_i+\beta_2 x_i^2 + \epsilon_i$) using the first- and second-order terms of $x$ and report the coefficient of determination $R^2$ and the adjusted coefficient of determination $R_{adj}^2$.
6. Compare the regression model in 5 with the regression model in 1 by analysis of variance and consider the validity of including the second-order term of $x$ in the regression model.
7. Fit the polynomial regression model ($y_i=\mu+\beta_1 x_i+\beta_2 x_i^2+\beta_3 x_i^3+\epsilon_i$) using the first- to third-order terms of $x$ and report the coefficient of determination $R^2$ and the adjusted coefficient of determination $R_{adj}^2$.
8. Compare the regression model in 7 with the regression model in 5 by analysis of variance to examine the validity of extending the second-order polynomial regression model to a third-order polynomial regression model.

Submission method: Create the report as an html or pdf file and submit it to ITC-LMS. If you cannot submit your report to ITC-LMS for some reason, send it to report@iu.a.u-tokyo.ac.jp. Make sure to write your affiliation, student number, and name at the beginning of the report. The deadline for submission is May 6th.