RMSE with and without standardizing the output variable

by Hiyam   Last Updated September 11, 2019 16:19 PM

I have time series data that I would like to forecast. I am standardizing the data because my columns all have different ranges. I have standardized the input variables, but I am unsure whether I should standardize the output variable as well. The following code snippet shows what I did.

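import time

import matplotlib.pyplot as plt
import numpy as np
from catboost import CatBoostRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor


class OutOfSampleForecasting: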

    def __init__(self, train_df, test_df):
        train_df = train_df.dropna()
        test_df = test_df.dropna()

        #make the 'date' the index column 
        train_df = train_df.set_index('date')
        test_df = test_df.set_index('date')

        #change rows where demand = 0, make them demand = 1
        train_df.loc[train_df.demand == 0, 'demand'] = 1
        test_df.loc[test_df.demand == 0, 'demand'] = 1

        self.X_train = np.array(train_df.loc[:, train_df.columns != 'demand'])
        self.y_train = np.array(train_df.loc[:, 'demand'])

        self.X_test = np.array(test_df.loc[:, test_df.columns != 'demand'])
        self.y_test = np.array(test_df.loc[:, 'demand'])

        #standardizing only the training, applying parameters to testing 
        scaler = StandardScaler()
        self.X_train = scaler.fit_transform(self.X_train)
        self.X_test = scaler.transform(self.X_test)

        y_scaler = StandardScaler()
        self.y_train = y_scaler.fit_transform(self.y_train.reshape(-1, 1)).reshape(-1)
        self.y_test = y_scaler.transform(self.y_test.reshape(-1, 1)).reshape(-1)

        print('avg demand: %.3f' % np.mean(self.y_test))

    def forecast(self, model, model_name, isCatBoost=False):
        print('*** Results for %s ***' % model_name)
        t1 = time.time()
        if isCatBoost:
            model.fit(self.X_train, self.y_train, verbose=False)
        else:
            model.fit(self.X_train, self.y_train)
        y_pred = model.predict(self.X_test)
        t2 = time.time()
        time_taken = float(t2 - t1) / 60
        print('time taken %.3f min' % time_taken)
        self.print_stats(self.y_test, y_pred)

    def print_stats(self, y_test, y_pred):
        r2_Score = r2_score(y_test, y_pred)
        rmse_score = np.sqrt(mean_squared_error(y_test, y_pred))
        mse_score = mean_squared_error(y_test, y_pred)
        mae_score = mean_absolute_error(y_test, y_pred)
        print('R^2: %.3f\nRMSE: %.3f\nMSE: %.3f\nMAE: %.3f\n' % (r2_Score, rmse_score, mse_score, mae_score))

        plt.plot(y_test, label='actual')
        plt.plot(y_pred, label='predicted')
        plt.legend()
        plt.show()

    def run_all(self):
        self.forecast(Ridge(), 'Ridge Regression')
        self.forecast(Lasso(), 'Lasso Regression')
        self.forecast(ElasticNet(), 'Elastic Net Regression')
        self.forecast(DecisionTreeRegressor(), 'Decision Tree')
        self.forecast(RandomForestRegressor(), 'Random Forest')
        self.forecast(AdaBoostRegressor(), 'Ada Boost')
        self.forecast(GradientBoostingRegressor(), 'Gradient Boosting')
        self.forecast(XGBRegressor(), 'XGBoost')
        self.forecast(CatBoostRegressor(), 'Cat Boost', True)
        self.forecast(SVR(), 'Support Vector Regressor')

As you can see in this part, I am standardizing both the input and the output variables:

#standardizing only the training, applying parameters to testing
scaler = StandardScaler()
self.X_train = scaler.fit_transform(self.X_train)
self.X_test = scaler.transform(self.X_test)

y_scaler = StandardScaler()
self.y_train = y_scaler.fit_transform(self.y_train.reshape(-1, 1)).reshape(-1)
self.y_test = y_scaler.transform(self.y_test.reshape(-1, 1)).reshape(-1)
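
For comparison, here is a minimal sketch of one way to keep the metrics in the original demand units while still training on a standardized target: predict in standardized space, then undo the scaling with `y_scaler.inverse_transform` before scoring. The data below is synthetic and the variable names are only illustrative; this is not the code from the class above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the demand data from the post).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
X_test = rng.normal(size=(50, 3))
coef = np.array([30.0, -20.0, 10.0])
y_train = 100 + X_train @ coef + rng.normal(scale=5.0, size=200)
y_test = 100 + X_test @ coef + rng.normal(scale=5.0, size=50)

# Fit the target scaler on the training target only, as in the post.
y_scaler = StandardScaler()
y_train_std = y_scaler.fit_transform(y_train.reshape(-1, 1)).reshape(-1)

# Train and predict in standardized units.
model = Ridge().fit(X_train, y_train_std)
y_pred_std = model.predict(X_test)

# Map the predictions back to the original units before scoring,
# so the test target itself never needs to be transformed.
y_pred = y_scaler.inverse_transform(y_pred_std.reshape(-1, 1)).reshape(-1)

rmse_original_units = np.sqrt(mean_squared_error(y_test, y_pred))
print('RMSE in original units: %.3f' % rmse_original_units)
```

Scored this way, the RMSE can be read directly against the typical size of `demand`, whichever scaling was used during training.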

However, what puzzles me is the RMSE (and the other metrics) with and without standardizing the output variable:

With standardizing output variable:

RMSE: 1.213
MSE: 1.472
MAE: 1.014

Without standardizing output variable:

RMSE: 48.784
MSE: 2379.876
MAE: 42.317

So which results should I consider?

I assume that standardizing the output variable puts ALL COLUMNS on the same scale, but is an RMSE of 1.2 good? Or is it somehow a 'transformed' RMSE? And what should I do in this case?
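
To make the 'transformed' RMSE idea concrete, here is a small sketch of the arithmetic, using made-up numbers rather than the demand data. Standardizing subtracts a mean and divides by a standard deviation, so an error computed on standardized values is expressed in standard deviations of the training target. For the same set of predictions, multiplying it by that standard deviation (what `StandardScaler` stores in `scale_`) gives the error in the original units.

```python
import math
import statistics

# Hypothetical values for illustration only.
y_train = [100.0, 140.0, 60.0, 120.0, 90.0, 150.0]
y_true = [120.0, 80.0, 150.0, 95.0, 110.0]
y_pred = [100.0, 90.0, 130.0, 105.0, 125.0]


def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Training mean and population standard deviation (ddof=0),
# the same statistics StandardScaler computes.
mu = statistics.fmean(y_train)
sigma = statistics.pstdev(y_train)


def standardize(values):
    return [(v - mu) / sigma for v in values]


rmse_raw = rmse(y_true, y_pred)
rmse_std = rmse(standardize(y_true), standardize(y_pred))

# The shift by mu cancels in the differences; only the division by sigma remains.
print('RMSE raw: %.3f' % rmse_raw)
print('RMSE standardized * sigma: %.3f' % (rmse_std * sigma))
```

This identity holds only when both RMSE values come from the same predictions. If the models were refit on the scaled target, as in the two runs above, the rescaled figure may differ from the unscaled one, for example because regularized models such as Ridge or Lasso are sensitive to the scale of the target.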


