Implementing a New Scale Technique in the M-Estimation Method to Estimate Parameters of Multiple Linear Regression: Simulation Study

Abstract: The goal of this study is to develop a new technique for estimating the parameters of a multiple linear regression model using M-estimation based on a robust scale estimator, in order to limit the influence of outliers. The root mean square error (RMSE) criterion is used to compare the efficiency of the new technique with that of the classical method. The results show that the new technique (M-estimation based on the scale estimator) yields more accurate parameter estimates than the traditional approach (OLS) in all simulated cases.


Introduction
Multiple regression analysis is an important statistical tool that is routinely applied in most sciences. Out of many possible regression techniques, the ordinary least squares (OLS) method has been generally adopted because of its tradition and ease of computation. However, there is now widespread awareness of the dangers posed by outliers, which may result from keypunch errors, misplaced decimal points, recording or transmission errors, exceptional phenomena such as earthquakes or strikes, or members of a different population slipping into the sample. Outliers occur very frequently in real data, and they often go unnoticed because much data is now processed by computers without careful inspection or screening. Not only the response variable can be outlying, but also the explanatory part, leading to so-called leverage points. Both types of outliers may totally spoil an ordinary LS analysis. Often, such influential points remain hidden to the user because they do not always show up in the usual OLS residual plots (Rousseeuw and Leroy, 1987: 216). Linear regression rests on the assumption that the errors have a constant variance and are normally distributed. Outliers, which can affect the errors significantly and unpredictably, can cause this assumption to be violated.
Outliers might result from inaccurate measurement, incorrect data entry, or unavoidable data variability. Robust regression methods are designed to reduce the impact of outliers on the regression estimates by giving them less weight or disregarding them completely (Montgomery, 2012: 369).
Outliers in linear regression models can be handled with robust regression techniques. Outliers are observations that depart drastically from the overall trend of the data and can distort the estimates of the slope and intercept of the regression line. Several popular robust regression techniques are available for this purpose (Barnett and Lewis, 1994: 290) (Ali, T. H., & Salah, D. M., 2022: 920-939).

Methodology:
This part reviews the theoretical aspects of multiple regression analysis, outlier values, and robust estimation, and compares the new M-estimation technique with the classical method (OLS) for estimating a multiple linear regression model in the presence of outliers, using the root mean square error (RMSE) as the statistical criterion.

2-1. Multiple linear regression model:
Linear regression is used to represent the relationship between a scalar response or dependent variable, denoted by Y, and one or more explanatory or independent variables, denoted by X. In linear regression, the unknown model parameters are estimated from the data using linear predictor functions (Alma, 2011: 411) (Obed, Saleh & Jamil, 2023: 1304-1324). A linear regression model with p independent variables is written as:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \varepsilon_i, \quad i = 1, 2, \dots, n$$ ….. (1)

where $Y_i$ is the dependent variable, $X_{i1}, \dots, X_{ip}$ are the explanatory variables, $\beta_0$ is the Y-intercept, $\beta_1, \dots, \beta_p$ are the slope coefficients, and $\varepsilon_i$ is the model's error.

2-2. Ordinary least squares:
The multiple linear regression model and its estimation using ordinary least squares (OLS) is doubtless the most widely used tool in econometrics. It allows estimating the relation between a dependent variable and a set of explanatory variables (Ali, 2011: 331-348).
The approach is based on minimizing the sum of squared residuals between the actual and predicted values, which identifies the best-fit line for the data. In calculus terms, minimizing the residual sum of squares requires taking the partial derivatives of the cost function with respect to the regression coefficients (Ali, T. H. & Salah, D. M, 2021: 3388-3409).
These partial derivatives are then set to zero and the resulting equations are solved for the coefficients. The parameters of the regression model are thus estimated by the ordinary least squares method, which minimizes the residual sum of squares, that is, the sum of squared differences between the fitted and observed responses (Almetwally & Almongy, 2018: 55-63).
An outlier is defined as an observation that does not conform to the pattern (model) suggested by the homogeneous majority of the observations in a data set; that is, it does not conform well to the linear regression line. Such observations have unusually large residuals. Some data sets may come from homogeneous groups, while others may come from heterogeneous groups whose features vary with respect to a certain variable. Outliers might result from inaccurate measurements, including data entry mistakes, or they can come from a different population than the rest of the data. It is therefore crucial to spot outliers for two reasons: either they point to a problem with the data that has to be corrected, or they are the first sign of a significant new trend (Rousseeuw and Leroy, 1987: 216).
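As an illustration of the least squares estimate discussed in this section, and of how a single aberrant response can distort it, the following is a minimal sketch in Python with NumPy. The function name `ols_fit` and the data are hypothetical and are not taken from the study, which used MATLAB.

```python
import numpy as np

def ols_fit(X, y):
    """Return OLS coefficients (intercept first) for y = b0 + X b + error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(y)), X])
    # Least squares solution of the normal equations (X'X) b = X'y;
    # lstsq avoids forming the inverse of X'X explicitly.
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Hypothetical data lying exactly on the line y = 1 + 2x.
x = np.arange(10.0)
y_clean = 1.0 + 2.0 * x
beta_clean = ols_fit(x.reshape(-1, 1), y_clean)

# The same data with one response replaced by a gross outlier.
y_outlier = y_clean.copy()
y_outlier[9] = 100.0
beta_outlier = ols_fit(x.reshape(-1, 1), y_outlier)

print("clean fit:       ", beta_clean)
print("fit with outlier:", beta_outlier)
```

In this toy example one contaminated response out of ten is enough to move the fitted slope far from its true value of 2, which is the sensitivity that motivates the robust methods discussed later.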
Outliers are defined by Hawkins (1980: 85) as observations that differ so significantly from other observations that they raise the possibility of having been produced by a separate mechanism. However, some definitions are considered general enough to deal with diverse types of data and methods.
❖ Extreme outlier: an observation that lies at the end of a tail of the distribution, in either the upper or the lower direction; for data following the standard normal distribution, it is a value greater than (3σ) or smaller than (-3σ) (Rousseeuw and Leroy, 1987: 216).
❖ High-leverage points: observations that have outlying values in covariate space. In a logistic regression model, the identification of high-leverage points becomes essential because of their gross effects on the parameter estimates (Hawkins, 1980: 85).
Let us consider a k-variable regression model, $Y = X\beta + \varepsilon$. The OLS residual vector can be expressed in terms of the true disturbance vector as:

$$e = (I - H)\varepsilon$$ ….. (4)

where the matrix $H = X(X^{T}X)^{-1}X^{T}$ in Equation (4) is generally known as the weight matrix or leverage matrix. Observations corresponding to excessively large $\varepsilon$ values are termed outliers. The weight matrix H reflects the joint effect of the k regressors on the fitted responses. The diagonal elements $h_{ii}$ of H are usually considered leverage values, which measure influence in the X-space. The i-th leverage value is defined as (Hawkins, 1980: 85):

$$h_{ii} = x_i^{T}(X^{T}X)^{-1}x_i$$

where $x_i^{T}$ is the i-th row of X.
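The leverage values described above are the diagonal elements of $H = X(X^{T}X)^{-1}X^{T}$. The following is a minimal sketch of how they might be computed with NumPy; the function name `leverage_values` and the data are hypothetical.

```python
import numpy as np

def leverage_values(X):
    """Diagonal elements h_ii of H = X (X'X)^{-1} X' for a design matrix X."""
    X = np.asarray(X, dtype=float)
    # Solve (X'X) Z = X' rather than inverting X'X explicitly.
    Z = np.linalg.solve(X.T @ X, X.T)
    # h_ii is the dot product of row i of X with column i of Z.
    return np.einsum("ij,ji->i", X, Z)

# Hypothetical regressor with one value far from the others in X-space.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
design = np.column_stack([np.ones_like(x), x])  # intercept column plus x
h = leverage_values(design)
print(np.round(h, 3))
```

The leverage values sum to the number of columns of the design matrix (here 2), and the observation with x = 20 receives by far the largest value, reflecting its extreme position in the X-space regardless of its response value.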
❖ Influential observation: Influential observations are those observations that, individually or collectively, excessively influence the fitted regression equation as compared to other observations in the data set (Hawkins, 1980: 85).
In this context, the following must be noted:
1. Outliers need not be influential observations.
2. Influential observations need not be outliers.
3. Observations with large residuals are undesirable, and least squares fitting tends to avoid large residuals.
4. An observation with a small residual is not necessarily a typical observation, because a high-leverage point pulls the fitted line toward itself and can therefore have a small residual while still influencing the fit.
The following example illustrates the above remarks. Adding three isolated points A, B, and C to the typical points produces the following:
Point A has a small residual (because its Y value is close to the straight line through the other points) and is a high-leverage point because its X value is extreme, but it does not influence the fitted regression equation; thus a high-leverage point is not necessarily influential.
Point B is not a high-leverage point (because it is located at the center of the X values), but it is an outlier (it has a large residual) and an influential point (including it does not change the slope, but it changes the intercept of the line with the Y axis).
Point C is an outlier (it has a large residual), a high-leverage point (because it is extreme in the X space), and an influential observation because it changes the fitted regression equation (McCann, 2006: 109).
Adding two further points, D and E, to the points of the model, we observe that:
Point D is an outlier, but it is neither influential nor a high-leverage point.
Point E is an influential observation because it changes the fitted regression equation, but it is not an outlier (it has a small residual) and it is not a leverage point (Rousseeuw and Leroy, 1987: 216).

2-4. New Scale Technique in M-Estimation method:
If the data originate from a normal distribution, the standard deviation, the most popular scale estimate, is the most efficient one. However, the standard deviation lacks robustness, in the sense that even a small change in the data can have a large impact on its estimated value (low resistance) (Chen, 2002: 08). Furthermore, it is not efficient for non-normal data. In the presence of outliers, the most widely used robust substitutes for the standard deviation are the $S_n$ scale and the median absolute deviation (MAD). An estimate of sigma is required to determine the parameters of the robust method. In this study, we use a new technique that combines the $S_n$ scale with M-estimation: we obtain a new estimate $\hat{\sigma}$, which is then applied within the robust procedure to estimate the parameters (Rousseeuw and Croux, 1993: 1273-1283). The M-estimation method was introduced by Huber and is currently the most popular robust regression methodology.
To compute $S_n$, for each observation i the median of the absolute differences $|x_i - x_j|$, $j = 1, 2, \dots, n$, is calculated. The median of these n numbers is then the estimate of $S_n$. $S_n$ becomes a consistent estimator when the constant (1/c) is used. The chosen value is 0.77519, which is necessary for $S_n$ to be a consistent estimator for normal data (Rousseeuw and Croux, 1993: 1273-1283).
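To make the contrast between the standard deviation and the robust scale estimates concrete, the following is a minimal sketch in plain Python. It uses ordinary medians rather than the low and high medians of the exact $S_n$ definition, and the default constants are assumptions of this sketch: 1.4826 is the usual normal-consistency factor for the MAD and 1.1926 is the factor given by Rousseeuw and Croux (1993) for $S_n$. The constant reported in this study can be passed through the `c` argument instead.

```python
from statistics import median, stdev

def mad_scale(values, c=1.4826):
    """Median absolute deviation about the median, scaled by c (assumed constant)."""
    values = list(values)
    m = median(values)
    return c * median(abs(v - m) for v in values)

def sn_scale(values, c=1.1926):
    """Simplified S_n: c * med_i( med_j |x_i - x_j| ), using ordinary medians."""
    values = list(values)
    # For each i, the median of the absolute differences to all observations.
    inner = [median(abs(vi - vj) for vj in values) for vi in values]
    # The median of these n numbers, rescaled by the consistency constant.
    return c * median(inner)

# Hypothetical sample with one gross outlier.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
print("standard deviation:", stdev(sample))
print("MAD (c = 1):       ", mad_scale(sample, c=1.0))
print("S_n (c = 1):       ", sn_scale(sample, c=1.0))
```

For this sample the standard deviation is inflated to more than 40 by the single outlier, while the unscaled MAD and $S_n$ remain at 1 and 2 respectively.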

….. (8)
The i-th residual is denoted by $e_i$. The following normal equations are obtained:
The iteratively reweighted least squares (IRLS) strategy is used to solve the nonlinear normal equations for the M-estimates. The iterative process is as follows (Ruckstuhl, 2014: 12):
1. Calculate the weights $w_i$.
2. Using Eq. (7), determine a revised estimate of $\beta$.
3. Repeat steps 1 and 2 until the algorithm converges.
Finally, the M-estimator's formula is: …. (10)
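The IRLS steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the study's MATLAB implementation: the weight function is taken to be Huber's with tuning constant 1.345, the scale is a simplified $S_n$ with the Rousseeuw and Croux constant 1.1926, the starting value is the OLS estimate, and the update is the usual weighted least squares step $\hat{\beta} = (X^{T}WX)^{-1}X^{T}WY$. All function names and data are hypothetical.

```python
import numpy as np

def sn_scale(r, c=1.1926):
    """Simplified S_n-type scale of residuals: c * med_i( med_j |r_i - r_j| )."""
    r = np.asarray(r, dtype=float)
    inner = np.median(np.abs(r[:, None] - r[None, :]), axis=1)
    return c * np.median(inner)

def huber_weights(u, k=1.345):
    """Huber weights (assumed): 1 if |u| <= k, otherwise k / |u|."""
    return np.minimum(1.0, k / np.maximum(np.abs(u), 1e-12))

def m_estimate(X, y, k=1.345, tol=1e-8, max_iter=100):
    """M-estimate of beta by IRLS, starting from the OLS solution."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(max_iter):
        resid = y - X @ beta
        scale = sn_scale(resid)
        if scale == 0.0:  # degenerate case: more than half the residuals coincide
            break
        w = huber_weights(resid / scale, k)       # step 1: weights
        XtW = X.T * w                             # scales column i of X' by w_i
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)  # step 2: weighted LS update
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:                             # step 3: stop at convergence
            break
    return beta

# Hypothetical data: y = 1 + 2x plus noise, with three contaminated responses.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
X = np.column_stack([np.ones_like(x), x])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=x.size)
y[-3:] += 30.0

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_m = m_estimate(X, y)
print("OLS:", beta_ols)
print("M  :", beta_m)
```

The scale is re-estimated from the current residuals at every iteration; fixing it after the first pass is an equally common design choice and would only require moving the `sn_scale` call outside the loop.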

2-6. Evaluation Criteria:
The evaluation criterion used to compare the performance of the classical and robust procedures in multiple linear regression models is the root mean square error (Ali, Albarwary and Ramadhan, 2023):

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^{2}}$$ ….. (11)

where:
$Y_i$ is the actual value for the i-th observation.
$\hat{Y}_i$ is the predicted value for the i-th observation.
n is the number of observations.
From the results discussed above, we can see that the new M-estimation technique gave the lowest root mean square error (7.3) and the highest value of $R^2$. This confirms the method's superiority over the classical method (OLS) in handling outliers and obtaining a multiple linear model with high efficiency.
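The RMSE criterion of this section, computed from the actual values $Y_i$, the predicted values $\hat{Y}_i$, and the number of observations n, can be sketched in plain Python as follows. The function name and the numbers are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sum(squared_errors) / len(actual))

# Hypothetical actual and predicted responses.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
value = rmse(actual, predicted)
print(value)
```

Here the errors are 1, 0, -1, and -2, so the mean squared error is 6/4 = 1.5 and the RMSE is its square root.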

A simulation experiment's description and analysis:
$$\varepsilon^{T}\varepsilon = (Y - X\beta)^{T}(Y - X\beta)$$ ….. (2)

Minimization of (2) yields the least squares estimate of $\beta$, which is

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$$ ….. (3)

The fitted regression model corresponding to the levels of the regressor variables is $\hat{Y} = X\hat{\beta}$. The corresponding residual or error vector is $e = Y - \hat{Y} = Y - X\hat{\beta} = Y - X(X^{T}X)^{-1}X^{T}Y$. The residual sum of squares is calculated from (2) as $e^{T}e = (Y - X\hat{\beta})^{T}(Y - X\hat{\beta})$. The residual sum of squares has $n - (p + 1)$ degrees of freedom associated with it, since $(p + 1)$ parameters are estimated in the regression model. Thus, the mean and variance of the residuals are given by $e_i \sim N(0, \hat{\sigma}^{2})$.

2-3. Outliers:
Different scientific communities define outliers differently:

The $S_n$ estimator of scale can be used as an alternative to the MAD. It shares with the MAD the desirable robustness properties of a bounded influence function and a 50% breakdown point. It does not depend on symmetry and also has a far greater normal efficiency (58%) than the MAD. It is computed for each $i = 1, 2, \dots, n$.

The likelihood function for $\beta$ and $\sigma$ is as follows under this circumstance:

Replacing the ordinary least squares criterion with a robust criterion, the M-estimator of $\beta$ is (Hisham & Ehab, 2017: 55-63):

…... (7)

Using IRLS, this system can be solved.
This part compares the new M-estimation technique with the classical ordinary least squares (OLS) regression method by means of a simulation experiment. After reviewing the main technique for handling outliers in the data, the comparison was made by assessing relative efficiency, measured by the root mean square error (RMSE). The simulation experiment was implemented with varying levels of the following factors: sample sizes n = 50, 100, and 200; k = 2, 4, and 8; and contamination of the y vector with outliers at rates of 5% and 15%, with the explanatory variables left unaltered. The new M-estimation technique and OLS were compared over 1000 replications. The simulation was carried out with a MATLAB (version 2020a) program written specifically for this work. The table below provides a summary of the algorithm used in the simulation studies.
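The design described above can be sketched in Python as follows. Several details are assumptions of this sketch rather than facts from the study: k is taken to be the number of explanatory variables, the regressors are standard normal, the true coefficients are all 1, outliers are created by shifting the selected responses upward by 10σ, the robust fit uses Huber weights with a simplified $S_n$ scale, and each fit is scored by the RMSE between its fitted values and the uncontaminated mean response (the study defines RMSE with the observed $Y_i$). All names are hypothetical.

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def huber_m(X, y, k=1.345, n_iter=50):
    """Huber M-estimate by IRLS with a simplified S_n residual scale (assumed)."""
    beta = ols(X, y)
    for _ in range(n_iter):
        r = y - X @ beta
        pairwise = np.abs(r[:, None] - r[None, :])
        s = 1.1926 * np.median(np.median(pairwise, axis=1))
        if s == 0.0:
            break
        w = np.minimum(1.0, k / np.maximum(np.abs(r / s), 1e-12))
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta

def rmse(target, fitted):
    return float(np.sqrt(np.mean((target - fitted) ** 2)))

def simulate(n, k, rate, sigma, reps, rng, shift=10.0):
    """Average RMSE of OLS and M fits over `reps` contaminated samples."""
    beta_true = np.ones(k + 1)              # hypothetical true coefficients
    total = {"OLS": 0.0, "M": 0.0}
    for _ in range(reps):
        X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
        signal = X @ beta_true              # uncontaminated mean response
        y = signal + rng.normal(scale=sigma, size=n)
        n_out = int(round(rate * n))
        idx = rng.choice(n, size=n_out, replace=False)
        y[idx] += shift * sigma             # hypothetical contamination of y only
        total["OLS"] += rmse(signal, X @ ols(X, y)) / reps
        total["M"] += rmse(signal, X @ huber_m(X, y)) / reps
    return total

# One configuration with a small number of replications to keep the run short.
result = simulate(n=50, k=2, rate=0.15, sigma=2.0, reps=50,
                  rng=np.random.default_rng(0))
print(result)
```

The full grid of the study would loop `simulate` over n in (50, 100, 200), k in (2, 4, 8), and contamination rates of 0.05 and 0.15 with 1000 replications each; the sketch runs a single cell to keep the example short.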

Figure (1): Normal probability plot of the residuals. Source: prepared by the author using MATLAB.
Figure (1) shows that the residuals from the robust fit are closer to the straight line, except for the obvious outliers.

Table (1): The distribution of Y in the simulation experiments.
Table (3): RMSE results when Y is 5% contaminated and σ = 6. Source: prepared by the author using MATLAB.
Table (4): RMSE results when Y is 15% contaminated and σ = 2. Source: prepared by the author using MATLAB.
Table (5): RMSE results when Y is 15% contaminated and σ = 6. Source: prepared by the author using MATLAB.

4. Conclusion
1. Based on the results of the relevant case study, it can be concluded that the new technique within the robust method (M-estimation based on the $S_n$ scale) has demonstrated its efficacy in estimating the regression model parameters with high accuracy when the dependent variable contains outliers.
2. In all simulation cases, the RMSE appears to decrease with increasing sample size.