In this section, I apply linear regression to predict and evaluate salaries for data science positions. Linear regression serves as the analytical foundation for the projections and salary comparisons that follow.

**Introduction to the linear regression model**

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It assumes that the relationship between the variables is linear, meaning that the change in the dependent variable is proportional to the change in the independent variables.

The linear regression model can be expressed mathematically as:

Y = β0 + β1X1 + β2X2 + β3X3 + … + βnXn + ε

where Y is the dependent variable, X1, X2, …, Xn are the independent variables, β0 is the intercept (constant term), β1, β2, …, βn are the coefficients (slopes) of the independent variables, and ε is the error term, or residual.

The goal of linear regression is to estimate the values of the coefficients that best fit the data and can be used to make predictions about the dependent variable. This is typically done by minimizing the sum of squared errors between the observed values and the predicted values.
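As a minimal sketch of this idea (using simulated data with made-up coefficients, not the salary dataset), base R's **lm()** estimates the coefficients by least squares and should approximately recover the values used to generate the data:

```r
# Simulate data from a known linear model: Y = 2 + 3*X1 - 1.5*X2 + noise
set.seed(42)
n  <- 500
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- 2 + 3 * X1 - 1.5 * X2 + rnorm(n, sd = 0.5)

# lm() fits the coefficients by minimizing the sum of squared errors
fit <- lm(Y ~ X1 + X2)
coef(fit)  # estimates should be close to 2, 3, and -1.5
```

With a small noise level and 500 observations, the estimated coefficients land very close to the true values, which is exactly the "best fit" behavior described above.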

Linear regression can be applied to both simple and multiple regression problems, depending on the number of independent variables. It is a widely used method in many fields, such as economics, finance, engineering, and social sciences. Linear regression can also be extended to include more complex relationships, such as nonlinear and polynomial relationships, using techniques like polynomial regression and generalized linear models.

**Correlation matrix between variables**

I use the **pairs.panels()** function from the **psych** package to explore the correlation matrix, shedding light on the relationships between variables and gaining insights into the data's interconnections.

```r
library(psych)   # pairs.panels() comes from the psych package
pairs.panels(data)
```

A pair panel of a correlation matrix is a visual representation of the correlations between pairs of variables in a dataset. Each variable is compared with all other variables in the dataset, and the correlation coefficient between them is displayed in a matrix format. The pair panel allows for quick and easy identification of the strongest positive and negative correlations in the dataset. It can also be used to detect any outliers or unexpected correlations, which may indicate a problem with the data. Overall, the pair panel of a correlation matrix is a useful tool for exploring the relationships between variables in a dataset.
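The correlation coefficients shown in a pair panel can also be obtained numerically with base R's **cor()**. A quick sketch on the built-in `mtcars` dataset (not the salary data):

```r
# Correlation matrix for a few numeric columns of the built-in mtcars dataset
cm <- round(cor(mtcars[, c("mpg", "wt", "hp")]), 2)
cm
# mpg correlates negatively with both weight (wt) and horsepower (hp),
# while wt and hp correlate positively with each other
```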

**Data Processing**

To avoid duplicating the target among the predictors, I drop the 'salary' and 'salary_currency' columns, which encode the same information as the dependent variable salary_in_usd.

```r
myvars  <- names(data) %in% c("salary", "salary_currency")
newdata <- data[!myvars]
```

**Linear regression model**

```r
model_1 <- lm(salary_in_usd ~ ., data = newdata)
summary(model_1)$r.sq
## [1] 0.698794

f_statistic <- summary(model_1)$fstatistic
f_value     <- f_statistic[1]
p_value     <- pf(f_value, f_statistic[2], f_statistic[3], lower.tail = FALSE)
p_value
##        value
## 1.667176e-64
```

A multiple R-squared of about 0.7 means that nearly 70% of the variation in the dependent variable is explained by the independent variables in the model.

The p-value of the F-statistic is smaller than 0.01, so the set of independent variables collectively has a statistically significant relationship with the outcome variable.

**Evaluate model**

```r
par(mfrow = c(2, 2))
plot(model_1, pch = 19, col = rgb(1, 0, 0, 0.4))
```

The Residuals vs Fitted plot shows a pattern that indicates non-constant error variance, and the Q-Q plot shows that the residuals are not normally distributed.

```r
res <- resid(model_1)
par(mfrow = c(2, 1))
hist(newdata$salary_in_usd, col = rgb(1, 0, 0, 0.4))  # histogram of salary_in_usd
plot(density(res), col = rgb(1, 0, 0, 0.4))           # density of the residuals
```

The histogram clearly shows that the salary distribution is positively skewed. We need to transform the dependent variable to bring its distribution closer to normal.

Three commonly used data transformations are square-root transformation, cube-root transformation, and log-transformation.

- The square-root transformation is used to reduce the magnitude of large values while preserving the small ones. It is often used when dealing with count data or continuous data with a skewed distribution.
- The cube-root transformation is similar to the square-root transformation, but it is more effective in reducing the skewness of the data. It is commonly used to transform data with a positive skew.
- The log-transformation is used to reduce the variability in the data and convert it to a normal distribution. It is useful when dealing with data that contains extreme values or outliers.

```r
ex.sq  <- sqrt(newdata$salary_in_usd)    # square-root transformation
ex.cub <- (newdata$salary_in_usd)^(1/3)  # cube-root transformation
ex.ln  <- log(newdata$salary_in_usd)     # log transformation

par(mfrow = c(2, 2))
hist(ex.sq,  col = rgb(1, 0, 0, 0.4), main = "square-root transformation")
hist(ex.cub, col = rgb(1, 0, 0, 0.4), main = "cube-root transformation")
hist(ex.ln,  col = rgb(1, 0, 0, 0.4), main = "log transformation")
```

I use the **qqnorm()** function to gauge the normality of the transformed datasets. This lets us assess which transformation brings the data closest to a normal distribution, aiding statistical analysis and model validity.

“qqnorm” is a statistical function used for creating quantile-quantile (Q-Q) plots. Q-Q plots are a graphical tool used to assess whether a dataset follows a particular theoretical distribution, typically the normal distribution. By comparing the quantiles of the observed data to the quantiles of the theoretical distribution, you can visually check for departures from normality. If the points in the Q-Q plot closely align with a straight line, it suggests that the data follows the expected distribution. Deviations from the line indicate departures from normality.

```r
par(mfrow = c(2, 2))
qqnorm(ex.sq, pch = 19, col = rgb(1, 0, 0, 0.4), main = "square-root transformation")
qqline(ex.sq)
qqnorm(ex.cub, pch = 19, col = rgb(1, 0, 0, 0.4), main = "cube-root transformation")
qqline(ex.cub)
qqnorm(ex.ln, pch = 19, col = rgb(1, 0, 0, 0.4), main = "log transformation")
qqline(ex.ln)
```

The cube-root transformation looks the most convincing, so I will move forward with it and develop a new model.

I apply linear regression to the transformed data and use the **summary()** function to determine the R-squared.

```r
trans_m1 <- lm((salary_in_usd)^(1/3) ~ ., data = newdata)
summary(trans_m1)$r.sq
## [1] 0.8067955

f_statistic <- summary(trans_m1)$fstatistic
f_value     <- f_statistic[1]
p_value     <- pf(f_value, f_statistic[2], f_statistic[3], lower.tail = FALSE)
p_value
##         value
## 2.058514e-105
```

Multiple R-squared increases from 0.70 to 0.81: nearly 80% of the variation in the dependent variable is now explained by the independent variables. The p-value of the F-statistic is still smaller than 0.01, so the set of variables collectively remains statistically significant.

**Box-Cox transformation**

Box-Cox transformation is a data transformation technique that is used to improve the normality and homogeneity of variance of a dataset. The technique involves finding a power transformation of the data that maximizes the log-likelihood function, which is a measure of how well the transformed data fits a normal distribution.

The Box-Cox transformation works by applying a power function to the data, which adjusts the skewness of the distribution and makes it more symmetrical. Note that the standard Box-Cox transformation requires strictly positive data values; data containing zeros or negative values must first be shifted, or handled with a related technique such as the Yeo-Johnson transformation.

The optimal value of the transformation parameter lambda can be determined using a statistical test or by visual inspection of the transformed data. Lambda can take on any value, but commonly encountered values are 0 (log transformation), 0.5 (square root), and 1 (no transformation beyond a shift).

Box-Cox transformation is a powerful technique for data normalization and can be used in a variety of applications, including regression analysis, hypothesis testing, and data visualization. However, it should be noted that the transformation can be sensitive to outliers and may not always be appropriate for all types of data.
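The transformation itself is simple to state: for a positive response y and parameter λ, the Box-Cox transform is (y^λ − 1)/λ when λ ≠ 0, and log(y) when λ = 0. A minimal sketch (the function name `boxcox_transform` is my own; the model below uses the plain power y^λ, which differs from this form only by a linear rescaling and therefore yields the same R-squared):

```r
# Box-Cox transform: (y^lambda - 1) / lambda, with log(y) as the lambda -> 0 limit
boxcox_transform <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

boxcox_transform(100, 0)    # log(100)
boxcox_transform(100, 0.5)  # (sqrt(100) - 1) / 0.5 = 18
```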

I employ the **boxcox()** function from the **MASS** package to find the optimal lambda for the model.

```r
library(MASS)
bc <- boxcox(salary_in_usd ~ ., data = newdata)
```

Next, I select the lambda that maximizes the log-likelihood and use it for the transformation.

```r
trans <- bc$x[which.max(bc$y)]  # lambda with the highest log-likelihood
trans
## [1] 0.1818182
```

Then I fit the linear regression model to the transformed dependent variable.

```r
trans_m2 <- lm(salary_in_usd^trans ~ ., data = newdata)
summary(trans_m2)$r.sq
## [1] 0.8238061

f_statistic <- summary(trans_m2)$fstatistic
f_value     <- f_statistic[1]
p_value     <- pf(f_value, f_statistic[2], f_statistic[3], lower.tail = FALSE)
p_value
##         value
## 3.632112e-114
```

Multiple R-squared increases to 0.82: nearly 82% of the variation in the dependent variable is explained by the independent variables. The p-value of the F-statistic is still smaller than 0.01, so the model remains statistically significant.

**Compare 3 models**

- Model 1: Linear regression model
- Model 2: Cube-root transformation model
- Model 3: Box-Cox transformation model

**Step 1: Apply data to the trained models**

In this section, I employ the **predict()** function to estimate salaries for data science positions using the three models.

**Model 1**

Predict linear regression model

```r
newdata$pred <- predict(model_1, data)
```

**Model 2**

Predict the cube-root transformation model

```r
myvars    <- names(newdata) %in% c("salary_in_usd")
testmodel <- newdata[!myvars]
newdata$pred1 <- predict(trans_m1, testmodel)^3  # invert the cube-root transformation
```

**Model 3**

Predict with the Box-Cox transformation model

```r
myvars    <- names(newdata) %in% c("salary_in_usd")
testmodel <- newdata[!myvars]
newdata$pred2 <- predict(trans_m2, testmodel)^(1/trans)  # invert the Box-Cox power
```

**Step 2: Compare the correlation of models**

In this section, I use the **cor()** function to calculate the correlation between each model's predicted salaries and the actual salaries. This provides a first measure of how well each model's predictions track the observed data.

```r
cor_model1 <- cor(newdata$pred,  newdata$salary_in_usd)
cor_model2 <- cor(newdata$pred1, newdata$salary_in_usd)
cor_model3 <- cor(newdata$pred2, newdata$salary_in_usd)
```

- Correlation of model 1: 0.835939
- Correlation of model 2: 0.8458336
- Correlation of model 3: 0.8441168

The three correlation coefficients indicate a strong positive relationship between the variables in all three models. Model 2 has the highest correlation coefficient at 0.8458336, followed closely by Model 3 with a correlation coefficient of 0.8441168. Model 1 has a slightly lower correlation coefficient at 0.835939. However, the difference between the correlation coefficients is relatively small, and it is unclear if it is statistically significant.

*Plots of correlation*

I generate plots illustrating the correlations between variables. These visual representations offer a comprehensive view of the relationships among different factors and their impact on data science job salaries. This visual analysis enhances our understanding of the dataset and informs the predictive models.

```r
par(mfrow = c(3, 1))
plot(newdata$pred, newdata$salary_in_usd, main = "Model 1", pch = 20, col = rgb(0, 0, 0, 0.4))
abline(a = 0, b = 1, col = "red", lwd = 1, lty = 2)
plot(newdata$pred1, newdata$salary_in_usd, main = "Model 2", pch = 20, col = rgb(0.5, 0, 0, 0.4))
abline(a = 0, b = 1, col = "red", lwd = 1, lty = 2)
plot(newdata$pred2, newdata$salary_in_usd, main = "Model 3", pch = 20, col = rgb(1, 0, 0, 0.4))
abline(a = 0, b = 1, col = "red", lwd = 1, lty = 2)
```

**Step 3: Compare R-squared values of models**

```r
R2_model1 <- summary(model_1)$r.sq
R2_model2 <- summary(trans_m1)$r.sq
R2_model3 <- summary(trans_m2)$r.sq
```

- R-squared of model 1: 0.698794
- R-squared of model 2: 0.8067955
- R-squared of model 3: 0.8238061

R-squared improves significantly in the transformed models, and the Box-Cox transformed model has a slightly higher R-squared value than the cube-root transformed model.

**Step 4: Compare Mean Absolute Error values of models**

**MAE** stands for Mean Absolute Error, which is a common metric used to evaluate the performance of regression models. It measures the average difference between the predicted values and the actual values. Specifically, it takes the absolute value of the differences between the predicted and actual values, and then averages those differences across all observations in the dataset. The advantage of using **MAE** as an evaluation metric is that it is easy to interpret and provides a straightforward measure of the magnitude of the errors in the predictions. Additionally, it is less sensitive to outliers than other metrics like **RMSE**, which means that extreme values in the dataset will have less influence on the final score.

```r
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

MAE1 <- MAE(newdata$pred,  newdata$salary_in_usd)
MAE2 <- MAE(newdata$pred1, newdata$salary_in_usd)
MAE3 <- MAE(newdata$pred2, newdata$salary_in_usd)
```

- MAE of model 1: 2.599654 × 10^4
- MAE of model 2: 2.3964881 × 10^4
- MAE of model 3: 2.3926421 × 10^4

The three models have different Mean Absolute Error (MAE) values, indicating that they have different levels of accuracy in their predictions. Model 1 has the highest MAE of 2.599654 × 10^4, which means its average prediction error is larger than that of Model 2 and Model 3. Model 2 has a lower MAE of 2.3964881 × 10^4 than Model 1, indicating that it is more accurate on average. However, Model 3 has the lowest MAE of 2.3926421 × 10^4, making it the most accurate of the three models. Therefore, based on the MAE values, Model 3 is the best performer, followed by Model 2, and then Model 1.

**Step 5: Compare Root Mean Square Error values of models**

**RMSE** stands for Root Mean Square Error, which is a commonly used metric to evaluate the performance of regression models. It measures the average distance between the predicted values and the actual values, taking into account the squared differences between them. Specifically, it calculates the square root of the mean of the squared differences between the predicted and actual values.

The advantage of using **RMSE** as an evaluation metric is that it penalizes large errors more heavily than **MAE**, providing a more accurate measure of the model’s overall performance. Additionally, it is often used as a standard metric for comparing the performance of different models, since it provides a single number that summarizes the level of error in the predictions.

One potential disadvantage of using **RMSE** is that it is sensitive to outliers, which can have a disproportionate impact on the final score. This means that if the data contains a few extreme values, the **RMSE** may not accurately reflect the model’s performance on the majority of the data.
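A quick sketch of this sensitivity, using made-up numbers rather than the salary data: a single large error raises RMSE far more than MAE, because the error is squared before averaging.

```r
# Five predictions, one of which misses badly
actual    <- c(10, 10, 10, 10, 10)
predicted <- c(10, 10, 10, 10, 60)  # a single error of 50

mae  <- mean(abs(actual - predicted))       # 50 / 5 = 10
rmse <- sqrt(mean((actual - predicted)^2))  # sqrt(2500 / 5), about 22.4
```

Here one outlying error leaves MAE at 10 but pushes RMSE to more than double that, illustrating why RMSE penalizes large errors more heavily.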

Overall, **RMSE** is a useful metric for evaluating the performance of regression models, particularly when the data does not contain outliers or when large errors are of particular concern. However, it should be used in conjunction with other metrics like **MAE** and **R-squared** to provide a more comprehensive evaluation of the model's performance.

```r
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse1 <- rmse(newdata$pred,  newdata$salary_in_usd)
rmse2 <- rmse(newdata$pred1, newdata$salary_in_usd)
rmse3 <- rmse(newdata$pred2, newdata$salary_in_usd)
```

- RMSE of model 1: 3.891084 × 10^4
- RMSE of model 2: 3.8043792 × 10^4
- RMSE of model 3: 3.825698 × 10^4

The three **RMSE** values suggest slight differences in the models' performance. Model 2 has the lowest **RMSE** at 3.8043792 × 10^4, indicating that its predictions deviate least from the actual values. Model 3 has a slightly higher **RMSE** of 3.825698 × 10^4, and Model 1 has the highest at 3.891084 × 10^4, implying that its predictions deviate most from the actual values.

**Conclusion**

In order to choose the best model, we need to consider several factors, including the correlation coefficient, R-squared, mean absolute error (MAE), and root mean square error (RMSE) of each model.

The correlation coefficients for all three models are relatively high, with Model 2 having the highest value at 0.8458336. This indicates that there is a strong positive linear relationship between the predicted and actual values in this model. However, the differences between the correlation coefficients are relatively small and may not be significant.

The R-squared values indicate the percentage of variation in the dependent variable that is explained by the independent variables. Model 3 has the highest R-squared value of 0.8238061, which means that it explains a larger proportion of the variance in the dependent variable than the other two models.

When it comes to MAE, Models 2 and 3 have similar scores, both around 23,900, while Model 1 has a higher score of about 26,000. This suggests that Models 2 and 3 predict the dependent variable more accurately than Model 1.

Similarly, when looking at RMSE, Models 2 and 3 have lower values than Model 1, suggesting that they have smaller errors in predicting the dependent variable.

Considering all of these factors together, **Model 2** appears to be the best choice. It has the highest correlation coefficient, the second-highest R-squared, and the lowest RMSE. While Model 3 has a slightly higher R-squared and a slightly lower MAE than Model 2, the differences are not substantial enough to outweigh Model 2's higher correlation coefficient and lower RMSE.