A department dealing with the effects of atmospheric pollutants in the vicinity of an industrial complex has established a data table of measurements of a purity index Y, on a scale of 0 (extremely bad) to 1000 (absolutely pure), and of its dependence on component pollutant variables X1, X2, …, X6. The aim of the department is to establish which of the component variables is contributing most to local atmospheric pollution.

This report analyzed and discussed the association of the purity index Y with the component pollutant variables and developed a model to forecast the purity index. The analysis suggested that the component pollutant variables X1, X2, X4, X5, and X6 are significantly related to purity index Y (p < .05). However, only two component pollutant variables, X1 and X5, are most likely to contribute significantly to atmospheric pollution (purity index Y). The equation for the best (chosen) regression model is Y = 0.185 + 1.111X1 + 7.598X5

Further, for the chosen model, all the underlying assumptions of regression analysis (no multicollinearity, normality of errors, constant variance, and no autocorrelation) are satisfied.

A department dealing with the effects of atmospheric pollutants in the vicinity of an industrial complex has established a data table of measurements of a purity index Y and of its dependence on component pollutant variables X1, X2, …, X6. The purity index Y is measured on a scale of 0 to 1000, with 0 being extremely bad and 1000 being absolutely pure. The aim of the department is to establish which of the component variables is contributing most to local atmospheric pollution.

This report will analyze and discuss the association of the purity index Y with the component pollutant variables X1, X2, …, X6. Further, this report will develop a model for forecasting the purity index Y based on these variables. For this, sample data for a period of 50 days were obtained. The test is a blind one in the sense that none of the pollutants has been identified by name, because of their association with the source and the possibility at this stage of unwanted litigation.

Correlation and Scatterplot Analysis

Figures 1 to 6 show the scatterplots of purity index Y against the component pollutant variables X1, X2, …, X6.

Figure 1: Y versus X1

Figure 2: Y versus X2

Figure 3: Y versus X3

Figure 4: Y versus X4

Figure 5: Y versus X5

Figure 6: Y versus X6

There appears to be a strong linear relationship between Y and X1, Y and X2, and Y and X5, and a moderately strong linear relationship between Y and X6. Furthermore, there appears to be a weak or no linear relationship between Y and X3 and between Y and X4. Table 1 shows the correlation matrix (using MegaStat, an Excel add-in) for purity index Y and the component pollutant variables X1, X2, …, X6.

Table 1: Correlation Matrix

        X1      X2      X3      X4      X5      X6       Y
X1    1.000
X2     .738   1.000
X3    -.293   -.283   1.000
X4     .201    .287   -.130   1.000
X5     .605    .803   -.094    .307   1.000
X6     .491    .675   -.163    .109    .521   1.000
Y      .881    .778   -.261    .290    .805    .533   1.000

Sample size: 50
Critical value, .05 (two-tail): .279
Critical value, .01 (two-tail): .361

As shown in table 1, the correlation of Y is significant with X1, X2, X4, X5, and X6. Therefore, component pollutant variable X3 is excluded from the first multiple regression analysis on the basis of the correlation and scatterplot analysis.
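This screening step can be sketched in code. The sketch below uses the correlations with Y and the two-tail critical value reported in Table 1; the variable labels and data structure are illustrative, not part of the original analysis.

```python
# Correlations of each pollutant variable with Y, from Table 1.
corr_with_y = {"X1": .881, "X2": .778, "X3": -.261,
               "X4": .290, "X5": .805, "X6": .533}
critical_value = .279  # two-tail, alpha = .05, n = 50 (Table 1)

# A predictor is treated as significantly correlated with Y
# when |r| exceeds the critical value.
significant = {var for var, r in corr_with_y.items()
               if abs(r) > critical_value}
excluded = set(corr_with_y) - significant

print(sorted(significant))  # significant: X1, X2, X4, X5, X6
print(sorted(excluded))     # excluded: X3
```

Only X3, with |r| = .261 below the critical value of .279, fails the screen, which is why it is dropped before the first regression fit.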

Multiple Regression Model

Model with Five Independent Variables (Excluding X3)

Table 2

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9477
R Square             0.8982
Adjusted R Square    0.8866
Standard Error      44.0675
Observations        50

ANOVA
            df        SS            MS            F        Significance F
Regression   5   753910.3541   150782.0708   77.6449      0.0000
Residual    44    85445.5108     1941.9434
Total       49   839355.8649

            Coefficients  Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept     -80.3818        67.5179     -1.1905    0.2402   -216.4552     55.6917
X1              1.1879         0.1278      9.2925    0.0000      0.9302      1.4455
X2             -1.4448         1.0805     -1.3372    0.1880     -3.6223      0.7327
X4              6.2999         7.2074      0.8741    0.3868     -8.2257     20.8255
X5              8.4910         1.4413      5.8911    0.0000      5.5862     11.3959

Table 2 shows the regression model with five component pollutant variables. Although the regression model is significant (F = 77.64, p < .001), the p-values for the coefficients of component pollutant variables X2, X4, and X6 are greater than 0.05. The p-value for the coefficient of X6 (0.4621) is higher than those of X2 (0.1880) and X4 (0.3868); thus, component pollutant variable X6 is excluded from further multiple regression analysis.

Model with Four Independent Variables (Excluding X3 and X6)

Table 3

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9471
R Square             0.8969
Adjusted R Square    0.8878
Standard Error      43.8468
Observations        50

ANOVA
            df        SS            MS            F        Significance F
Regression   4   752841.5355   188210.3839   97.8967      0.0000
Residual    45    86514.3294     1922.5407
Total       49   839355.8649

            Coefficients  Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept     -46.0844        48.9615     -0.9412    0.3516   -144.6979     52.5291
X1              1.1863         0.1272      9.3280    0.0000      0.9301      1.4424
X2             -1.0792         0.9567     -1.1280    0.2653     -3.0061      0.8477
X4              5.6856         7.1238      0.7981    0.4290     -8.6625     20.0338
X5              8.4570         1.4334      5.9000    0.0000      5.5700     11.3440

Table 3 shows the regression model with four component pollutant variables. Although the regression model is significant (F = 97.90, p < .001), the p-values for the coefficients of component pollutant variables X2 and X4 are greater than 0.05. The p-value for the coefficient of X4 (0.4290) is higher than that of X2 (0.2653); thus, component pollutant variable X4 is excluded from further multiple regression analysis.

Model with Three Independent Variables (Excluding X3, X4 and X6)

Table 4

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9463
R Square             0.8955
Adjusted R Square    0.8887
Standard Error      43.6734
Observations        50

ANOVA
            df        SS            MS            F         Significance F
Regression   3   751616.9110   250538.9703   131.3532      0.0000
Residual    46    87738.9539     1907.3686
Total       49   839355.8649

            Coefficients  Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept     -12.7192        25.3857     -0.5010    0.6187    -63.8179     38.3795
X1              1.1842         0.1266      9.3505    0.0000      0.9293      1.4391
X2             -1.0248         0.9505     -1.0781    0.2866     -2.9381      0.8885
X5              8.6109         1.4147      6.0865    0.0000      5.7631     11.4586

Table 4 shows the regression model with three component pollutant variables. Although the regression model is significant (F = 131.35, p < .001), the p-value for the coefficient of component pollutant variable X2 (0.2866) is greater than 0.05; thus, component pollutant variable X2 is excluded from further multiple regression analysis.

Model with Two Independent Variables X1 and X5

Table 5

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9449
R Square             0.8928
Adjusted R Square    0.8883
Standard Error      43.7488
Observations        50

ANOVA
            df        SS            MS            F         Significance F
Regression   2   749399.7991   374699.8996   195.7722      0.0000
Residual    47    89956.0658     1913.9588
Total       49   839355.8649

            Coefficients  Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept       0.1849        22.4256      0.0082    0.9935    -44.9296     45.2995
X1              1.1114         0.1074     10.3531    0.0000      0.8955      1.3274
X5              7.5978         1.0594      7.1717    0.0000      5.4665      9.7290

Table 5 shows the regression model with two component pollutant variables, X1 and X5. The regression model is significant (F = 195.77, p < .001). The p-values for the coefficients of X1 and X5 are both significant, indicating that both component pollutant variables significantly predict purity index Y in the regression model.
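The backward-elimination procedure followed across Tables 2 to 4, refitting and dropping the predictor with the largest non-significant p-value at each step, can be sketched as follows. The per-model p-values are those reported in the tables above; the function name and data structure are illustrative.

```python
# Coefficient p-values for each fitted model, taken from Tables 2-4.
models = {
    frozenset({"X1", "X2", "X4", "X5", "X6"}):
        {"X1": .0000, "X2": .1880, "X4": .3868, "X5": .0000, "X6": .4621},
    frozenset({"X1", "X2", "X4", "X5"}):
        {"X1": .0000, "X2": .2653, "X4": .4290, "X5": .0000},
    frozenset({"X1", "X2", "X5"}):
        {"X1": .0000, "X2": .2866, "X5": .0000},
}

def backward_eliminate(predictors, alpha=0.05):
    """Drop the worst predictor until all p-values fall below alpha."""
    while True:
        pvalues = models.get(frozenset(predictors))
        if pvalues is None:          # no further refit recorded: stop
            return sorted(predictors)
        worst = max(pvalues, key=pvalues.get)
        if pvalues[worst] < alpha:   # everything significant: stop
            return sorted(predictors)
        predictors = set(predictors) - {worst}

print(backward_eliminate({"X1", "X2", "X4", "X5", "X6"}))
```

Starting from the five-variable model, the sketch drops X6, then X4, then X2, arriving at the two-variable model with X1 and X5, exactly the sequence followed in Tables 2 to 5.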

Table 6 shows the stepwise regression (using MegaStat, an Excel add-in), reporting the best model for each number of independent variables n. As shown in table 6, the best multiple regression model is the one with component pollutant variables X1 and X5, as its p-value is the lowest.

Table 6: Multiple regression model with different number of independent variables

p-values for the coefficients
Nvar    X1      X2      X3      X4      X5      X6        Se    Adj R2     R2    p-value
  1   .0000                                            62.649    .771    .776   3.46E-17
  2   .0000                           .0000            43.749    .888    .893   1.61E-23
  3   .0000   .2866                   .0000            43.673    .889    .895   1.45E-22
  4   .0000   .1933           .2597   .0000            43.530    .889    .898   9.57E-22
  5   .0000   .1428           .2482   .0000   .4818    43.772    .888    .900   7.98E-21
  6   .0000   .1282   .2802   .4404   .0000   .4349    43.970    .887    .901   5.57E-20

Adjusted R2 is a criterion for deciding the number of independent variables in a multiple regression model. Figure 7 shows Adjusted R2 versus the number of independent variables. As shown in figure 7, there is little increase in Adjusted R2 beyond the two independent variables X1 and X5; the Adjusted R2 value is approximately the same (0.888) for more than two independent variables. Therefore, the best regression model is obtained by taking only the two independent variables X1 and X5.
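The penalty that Adjusted R2 applies for extra predictors follows the standard formula Adj R2 = 1 - (1 - R2)(n - 1)/(n - k - 1). The sketch below checks it against the two-variable model in Table 5 (R2 = 0.8928, n = 50, k = 2); the small mismatch in the last digit comes from rounding R2 in the table.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k predictors on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Two-variable model from Table 5.
adj = adjusted_r2(0.8928, 50, 2)
print(round(adj, 4))  # about 0.8882, close to Table 5's 0.8883
```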


Figure 7: Adjusted R2 versus Number of Independent Variables

Chosen Multiple Regression Model

The equation for the best (chosen) regression model is given by Y = 0.185 + 1.111X1 + 7.598X5

The regression slope coefficient of 1.111 for X1 indicates that for each one-point increase in X1, purity index Y increases by about 1.111 on average, holding component pollutant variable X5 fixed.


The regression slope coefficient of 7.598 for X5 indicates that for each one-point increase in X5, purity index Y increases by about 7.598 on average, holding component pollutant variable X1 fixed.

Component pollutant variables X1 and X5 explain about 89.3% of the variation in purity index Y. The remaining 10.7% of the variation in purity index Y is unexplained and may be due to other factors.
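Using the chosen equation to forecast the purity index is then a one-line calculation. The sketch below uses the rounded coefficients from the report; the function name and the example readings are hypothetical.

```python
def predict_purity(x1, x5):
    """Forecast purity index Y from the chosen model (rounded coefficients)."""
    return 0.185 + 1.111 * x1 + 7.598 * x5

# Hypothetical pollutant readings, for illustration only.
print(predict_purity(100, 50))
```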

T-tests on Individual Coefficients

The null and alternate hypotheses are:

H0: βj = 0 (the slope coefficient is zero)
H1: βj ≠ 0 (the slope coefficient is not zero)

The selected level of significance is 0.05, and the selected test is the t-test for zero slope.

The decision rule is to reject H0 if the p-value is less than 0.05; otherwise, do not reject H0.

Component pollutant variable X1 significantly predicts purity index Y, t(47) = 10.35, p <.001.

Component pollutant variable X5 significantly predicts purity index Y, t(47) = 7.17, p <.001.

F test on All coefficients

The null and alternate hypotheses are:

H0: β1 = β5 = 0 (the model is not useful)
H1: at least one βj ≠ 0 (the model is useful)

The selected level of significance is 0.05, and the selected test is the overall F-test.

The decision rule is to reject H0 if the p-value is less than 0.05; otherwise, do not reject H0.

The regression model is significant, R2 =.893, F(2, 47) = 195.77, p <.001.
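The reported F statistic can be recovered directly from the ANOVA sums of squares in Table 5, via F = (SSR/k) / (SSE/(n - k - 1)). The sketch below checks this arithmetic:

```python
# Regression and residual sums of squares from Table 5's ANOVA block.
ssr, sse = 749399.7991, 89956.0658
n, k = 50, 2

msr = ssr / k             # mean square for regression
mse = sse / (n - k - 1)   # mean square error, df = 47
f_stat = msr / mse

print(round(f_stat, 2))  # 195.77, matching the reported F(2, 47)
```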

Assumptions of Regression Model

Multicollinearity

Klein's Rule suggests that we should worry about the stability of the regression coefficient estimates only when a pairwise predictor correlation exceeds the multiple correlation coefficient R (i.e., the square root of R2). The correlation coefficient between X1 and X5 is 0.605. The Multiple R for the final regression model with X1 and X5 is 0.945, which far exceeds 0.605; this suggests that the confidence intervals and t-tests are unlikely to be affected by multicollinearity.

Another approach for checking multicollinearity is the variance inflation factor (VIF). Figure 8 shows the interpretation of the VIF. As a rule of thumb, we should not worry about multicollinearity if the VIF for each explanatory variable is less than 10.
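With only two predictors, the VIF has a closed form, VIF = 1/(1 - r²), where r is the pairwise correlation between the predictors. The sketch below checks this against MegaStat's value using r = 0.605 from Table 1; the small difference is due to rounding r.

```python
# Correlation between X1 and X5, from the correlation matrix (Table 1).
r = 0.605

# For a two-predictor model, each predictor's VIF is 1 / (1 - r^2).
vif = 1 / (1 - r ** 2)

print(round(vif, 3))  # about 1.577, close to MegaStat's 1.576
```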


Figure 8: Variance Inflation Factor (VIF) and Interpretation

Table 7: Variance Inflation Factor (VIF) using MegaStat

Regression Analysis

R2                0.893
Adjusted R2       0.888    n          50
R                 0.945    k           2
Std. Error       43.749    Dep. Var.   Y

ANOVA table
Source        SS             df   MS             F        p-value
Regression    749,399.7991    2   374,699.8996   195.77   1.61E-23
Residual       89,956.0658   47     1,913.9588
Total         839,355.8649   49

Regression output                                                      confidence interval
variables   coefficients   std. error   t (df=47)   p-value    95% lower   95% upper   std. coeff.   VIF
Intercept       0.1849       22.4256       0.008     .9935      -44.9296     45.2995      0.000
X1              1.1114        0.1074      10.353     1.03E-13     0.8955      1.3274      0.621      1.576
X5              7.5978        1.0594       7.172     4.49E-09     5.4665      9.7290      0.430      1.576

As shown in table 7, the VIF for both X1 and X5 is 1.576; thus, there is no cause for concern.

Non-Normal Errors

Figure 9 shows the normal probability plot of residuals. As shown in figure 9, the plot is approximately linear; thus, the residuals seem to be consistent with the hypothesis of normality.


Figure 9: Normal Probability Plot of Residuals

Nonconstant Variance (Heteroscedasticity)

Figures 10 and 11 show the plots of residuals against X1 and against X5.


Figure 10: Residuals by X1


Figure 11: Residuals by X5

As shown in figures 10 and 11, the data points are scattered and there is no pattern in the residuals as we move from left to right; thus, the residuals seem to be consistent with the hypothesis of homoscedasticity (constant variance).

Autocorrelation

Autocorrelation exists when the residuals are correlated with each other. With time-series data, one needs to be aware of the possibility of autocorrelation, a pattern of nonindependent errors that violates the regression assumption that each error is independent of its predecessor. The most common test for autocorrelation is the Durbin-Watson test. The DW statistic lies between 0 and 4; for no autocorrelation, it will be near 2. In this case, DW = 2.33, which is near 2; thus, the errors are not autocorrelated. However, for cross-sectional data, the DW statistic is usually ignored.
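The Durbin-Watson statistic is the ratio of the sum of squared successive residual differences to the sum of squared residuals. A minimal sketch, applied here to a short made-up residual series (the original 50 residuals are not reproduced in the report):

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); near 2 means no autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

residuals = [2.0, -1.0, 3.0, 0.0, -2.0, 1.0]  # hypothetical residuals
dw = durbin_watson(residuals)
print(round(dw, 2))  # a value near 2 suggests no autocorrelation
```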

Figure 12 shows the residuals by observation number. As shown in figure 12, the sign of a residual cannot be predicted from the sign of the preceding one; this means that there is no autocorrelation.


Figure 12: Residuals by Observations

Thus, for the chosen model, all the underlying assumptions of the regression analysis are valid.

Pollutant Variables (X) Contributing to Atmospheric Pollution (Purity Index Y)

As shown in table 1 (correlation matrix), the component pollutant variables X1, X2, X4, X5, and X6 are significantly related to purity index Y (p < .05). Thus, all of them individually contribute significantly to atmospheric pollution. However, the multiple regression analysis indicates that only two component pollutant variables, X1 and X5, are most likely to contribute significantly to atmospheric pollution (purity index Y).