Models Modules

The Models modules provide a comprehensive suite of statistical and machine learning models for climate prediction, including linear models, EOF-based models, canonical correlation analysis (CCA), analog methods, and multi-model ensemble (MME) techniques. These models are designed to handle both deterministic and probabilistic forecasts, with support for hyperparameter tuning. Models are evaluated using cross-validation schemes.

The models modules are organized into several classes, each implementing a specific type of model:

Machine Learning Models: This includes linear models such as multiple linear regression, logistic regression and regularized models like ridge, lasso, elastic-net. Additionally, more advanced models are available, including support vector regression, random forests, XGBoost, and neural networks.
EOF and PCR Models: For dimensionality reduction and regression using principal components.
CCA Models: For identifying relationships between two multivariate datasets.
Analog Methods: For finding historical analogs to current conditions.
Multi-Model Ensemble (MME) Techniques: For combining predictions from multiple models.

Machine Learning Models

The available models are:

WAS_LinearRegression_Model:

Standard Multiple Linear Regression. * WAS_Ridge_Model: Ridge regression with L2 regularization. * WAS_Lasso_Model: Lasso regression with L1 regularization. * WAS_LassoLars_Model: Lasso regression using the LARS algorithm. * WAS_ElasticNet_Model: Elastic net regression combining L1 and L2 regularization. * WAS_LogisticRegression_Model: Logistic regression for classification. * WAS_SVR: Support vector regression. * WAS_PolynomialRegression: Polynomial regression. * WAS_PoissonRegression: Poisson regression. * WAS_RandomForest_XGBoost_ML_Stacking: Random forest and XGBoost regression with stacking. * WAS_MLP: Multi-Layer Perceptron regression. * WAS_RandomForest_XGBoost_Stacking_MLP: Random forest, XGBoost, and MLP regression with stacking. * WAS_Stacking_Ridge: Random forest, XGBoost, MLP, and Ridge regression with stacking.

Except for WAS_LogisticRegression_Model, each model class includes methods for:

compute_model: Training the model and making predictions.
compute_prob: Computing tercile probabilities for the predictions.
forecast: Making forecasts for new data.

EOF and PCR Models

The was_eof.py and was_pcr.py modules provide classes for EOF analysis and Principal Component Regression (PCR), with support for multiple EOF zones:

WAS_EOF: Performs EOF analysis with options for cosine latitude weighting, standardization, and L2 normalization.
WAS_PCR: Combines EOF analysis with a regression model for prediction, supporting multiple EOF zones.

WAS_EOF

Initialization

n_modes: Number of EOF modes to retain.
use_coslat: Apply cosine latitude weighting (default: True).
standardize: Standardize the input data (default: False).
opti_explained_variance: Target cumulative explained variance to determine modes.
L2norm: Normalize components and scores to have L2 norm (default: True).

Methods

fit: Fits the EOF model to the data, supporting multiple zones by applying EOF analysis to the entire dataset.
transform: Projects new data onto the EOF modes.
inverse_transform: Reconstructs data from principal components (PCs).
plot_EOF: Plots the EOF spatial patterns with explained variance.

WAS_PCR

Initialization

regression_model: The regression model (e.g., WAS_Ridge_Model) to use with PCs.
n_modes: Number of EOF modes to retain.
use_coslat: Apply cosine latitude weighting (default: True).
standardize: Standardize the input data (default: False).
opti_explained_variance: Target cumulative explained variance.
L2norm: Normalize EOF components and scores (default: True).

Methods

compute_model: Fits the EOF model, transforms data to PCs, and applies the regression model.
compute_prob: Computes tercile probabilities using the regression model.
forecast: Makes forecasts using EOF-transformed data.

Example Usage: Seasonal Forecasting Based on Observational Data

from wass2s import *
## Define the directory to save the data
dir_to_save_reanalysis = "/path/to/save_reanalysis"
dir_to_save_agroindicators = "/path/to/save_agroindicators"

## Define the climatology  year range and the season
clim_year_start = 1991
clim_year_end = 2020
seas_reanalysis = ["01", "02", "03"]
seas_agroindicators = ["05", "06", "07"]

## Define the variables to download
variables = ["AGRO.PRCP"]

## Define the center and the predictor variables
center_variable = ["ERA5.SST"]:

## Define the extent for reanalysis
extent = [45, -180, -45, 180] # [North, West, South, East]

## Define the extent for Observation
extent_obs = [30, -25, 0, 30] # [North, West, South, East]

## Download the predictors and the predictand
downloader = WAS_Download()

## Download the predictors
downloader.WAS_Download_Reanalysis(
    dir_to_save=dir_to_save_reanalysis,
    center_variable=center_variable,
    year_start=1991,
    year_end=2025,
    area=extent,
    seas=seas_reanalysis,
    force_download=False
)

## Download the predictand
downloader.WAS_Download_AgroIndicators(
    dir_to_save=dir_to_save_agroindicators,
    variables=["AGRO.PRCP"],
    year_start=1991,
    year_end=2024,
    area=extent_obs,
    seas=seas_agroindicators,
    force_download=False
)

Case 1: Used SST index as a predictor

# Prepare predictand and predictors
predictand = prepare_predictand(dir_to_save_agroindicators, variables, year_start, year_end, seas_agroindicators, ds=False, daily=False)

# Prepare predictors
## Print available SST indices
print(list(sst_indices_name.keys()))

## Choose yours
sst_index_name = ['NINO34','TNA', 'TSA', 'DMI']

## Plot the SST index zone
plot_map([extent[1],extent[3],extent[2],extent[0]], sst_indices = sst_index_name, title="Index Zone",fig_size=(7,4))

## Compute the SST indices
predictors = compute_sst_indices(dir_to_save_reanalysis, sst_index_name, center_variable[0], year_start, year_end, seas_reanalysis)

## Compute variance inflation factor to see multicolinearity between predictors

vif_data = pd.DataFrame()
vif_data["feature"] = predictors.to_dataframe().columns
vif_data["VIF"] = [VIF(predictors.to_dataframe(), i) for i in range(predictors.to_dataframe().shape[1])]
## Print VIF values
print(vif_data)

## Set a threshold for VIF
vif_threshold = 5
# Remove features with VIF greater than the threshold
low_vif_predictors = vif_data[vif_data["VIF"] < vif_threshold]["feature"].tolist()
filtered_predictors = predictors[low_vif_predictors].to_array()
filtered_predictors = filtered_predictors.rename({"variable": "features"}).transpose('T', 'features')

# Initialize the model class
model = WAS_LinearRegression_Model(nb_cores=2, dist_method="lognormal")
# Assuming predictand follows a lognormal distribution. otherwise, normal, student-t or gamma are available. used dist_method="normal" or dist_method="t" or dist_method="gamma".

# Perform cross-validation
was_cv = WAS_Cross_Validator(n_splits=len(predictand.get_index("T")), nb_omit=2)
hindcast_det, hindcast_prob = was_cv.cross_validate(model, predictand, filtered_predictorsisel(T=slice(None,-1)), clim_year_start, clim_year_end)
# clim_year_start and clim_year_end are the years used to compute the climatology.

# Initialize the model class
model = WAS_Ridge_Model(n_clusters=6, alpha_range=np.logspace(-4, 0.1, 20), nb_cores = 2)

# Compute alpha parameters
alpha, clusters = model.compute_hyperparameters(predictand, filtered_predictors)

# Perform cross-validation
was_cv = WAS_Cross_Validator(n_splits=len(predictand.get_index("T")), nb_omit=2)
hindcast_det_Ridge, hindcast_prob_Ridge = was_cv.cross_validate(model, predictand, filtered_predictors.isel(T=slice(None,-1)), clim_year_start, clim_year_end, alpha=alpha)

# Make a forecast
forecast_det_Ridge, forecast_prob_Ridge = model.forecast(predictand, clim_year_start, clim_year_end, filtered_predictors.isel(T=slice(None,-1)), hindcast_det_Ridge, filtered_predictors.isel(T=[-1]), alpha=alpha, l1_ratio=l1_ratio)

Case 2: Used PCRs as a predictor

# Set your own zones ( zones not available in built-in)
# define zone as dict : {'zone_name_key': ('Explicit_Zone_name', lon_min, lon_max, lat_min, lat_max)}
zones_for_PCR = {'A': ('A', -150, 150, -45, 45)}

# Set number of modes
n_modes = 6

# ElasticNet hyperparameters range
alpha_range = np.logspace(-4, 0.1, 20)
l1_ratio_range = [0.5, 0.9999]

# Initialize the model class
model = WAS_PCR_Model(n_clusters=6, alpha_range=np.logspace(-4, 0.1, 20), nb_cores = 2)
plot_map([extent[1],extent[3],extent[2],extent[0]], sst_indices = zones_for_PCR, title="Predictors Area",fig_size=(8,6))

# Retrieve predictor data for the defined zone
predictor = retrieve_single_zone_for_PCR(dir_to_save_Reanalysis, zones_for_PCR, variables_reanalysis[0], year_start, year_end, season, clim_year_start, clim_year_end)

# Load WAS_EOF Class
eof_model = WAS_EOF(n_modes=n_modes, use_coslat=True, standardize=True)

# Load predictor, compute EOFs and retrieve component, scores and explained variances
s_eofs, s_pcs, s_expvar, _ = eof_model.fit(predictor, dim="T",  clim_year_start=clim_year_start, clim_year_end=clim_year_end)

# Plot EOFs and explained variances
eof_model.plot_EOF(s_eofs, s_expvar)

# Perform Cross-validation with elastic-net

## Load class for model
regression_model = WAS_ElasticNet_Model(alpha_range = alpha_range, l1_ratio_range = l1_ratio_range, nb_cores = 2, dist_method="lognormal")
pcr_model = WAS_PCR(regression_model=regression_model, n_modes=n_modes, standardize=False)

## Compute alpha parameters
alpha, l1_ratio, clusters = regression_model.compute_hyperparameters(predictand, s_pcs.isel(T=slice(None,-1)).rename({"mode": "features"}).transpose('T', 'features'))
## Perform cross-validation
was_cv = WAS_Cross_Validator(n_splits=len(predictand.get_index("T")), nb_omit=2)
hindcast_det, hindcast_prob = was_cv.cross_validate(pcr_model, predictand, s_pcs.isel(T=slice(None,-1)).rename({"mode": "features"}).transpose('T', 'features'), clim_year_start, clim_year_end, alpha=alpha, l1_ratio=l1_ratio)

CCA Models

The was_cca.py module provides classes for Canonical Correlation Analysis (CCA):

WAS_CCA: Performs CCA to identify relationships between two multivariate datasets.

Initialization

n_modes: Number of CCA modes to retain.
n_pca_modes: Number of PCA modes to use for dimensionality reduction.
dist_method: distribution method for probability computations.

Methods

compute_model: Fits the CCA model and makes predictions.
compute_prob: Computes tercile probabilities for the predictions.

Example Usage: Recalibrating Seasonal Forecast Outputs from Global Climate Models (GCMs)

Analog Forecasting Methods

The was_analog.py module provides the WAS_Analog class for analog-based forecasting using various techniques to identify historical analogs to current conditions for prediction, particularly for seasonal rainfall forecasts using sea surface temperature (SST) data.

Initialization Parameters

dir_to_save (str): Directory path to save downloaded and processed data files.
year_start (int): Starting year for historical data.
year_forecast (int): Target forecast year.
reanalysis_name (str): Reanalysis dataset name (e.g., “ERA5.SST” or “NOAA.SST”).
model_name (str): Forecast model name (e.g., “ECMWF_51.SST”).
method_analog (str, default=”som”): Analog method to use (“som”, “cor_based”, “pca_based”).
best_prcp_models (list, optional): List of best precipitation models. Default is None.
month_of_initialization (int, optional): Forecast initialization month. Default is None (uses current month).
lead_time (list, optional): Lead times in months. Default is None (uses [1, 2, 3, 4, 5]).
ensemble_mean (str, default=”mean”): Ensemble mean method (“mean” or “median”).
clim_year_start (int, optional): Start year for climatology period.
clim_year_end (int, optional): End year for climatology period.
define_extent (tuple, optional): Bounding box as (lon_min, lon_max, lat_min, lat_max) for regional analysis.
index_compute (list, optional): Climate indices to compute (e.g., [“NINO34”, “DMI”]).
some_grid_size (tuple, default=(None, None)): SOM grid dimensions (rows, cols); None uses automatic sizing.
some_learning_rate (float, default=0.5): Learning rate for SOM training.
some_neighborhood_function (str, default=”gaussian”): Neighborhood function for SOM (“gaussian”, etc.).
some_sigma (float, default=1.0): Initial neighborhood radius for SOM.
dist_method (str, default=”gamma”): Probability method (“gamma”, “t”, “normal”, “lognormal”, “nonparam”).

Key Methods

download_sst_reanalysis(): Downloads and processes SST reanalysis data from the specified center for the given years and area.
download_models(): Downloads seasonal forecast model data for the specified model, initialization month, and lead times.
standardize_timeseries(): Standardizes time series data over a specified climatology period.
calc_index(): Computes specified climate indices (e.g., NINO34, DMI) from SST data.
compute_model(): Identifies historical analogs using the specified method and computes deterministic forecasts.
compute_prob(): Calculates tercile probabilities (Below Normal, Near Normal, Above Normal) using the specified distribution method.
forecast(): Generates deterministic and probabilistic forecasts for the target year, returning processed SST data, similar years, deterministic forecast, and probabilistic forecast.
composite_plot(): Creates composite plots of forecast results, optionally including the predictor (SST) visualization.

Example Usage

Basic analog forecast setup:

from wass2s.was_analog import WAS_Analog

# Initialize analog model
analog_model = WAS_Analog(
    dir_to_save="./s2s_data/analog",
    year_start=1990,
    year_forecast=2025,
    reanalysis_name="NOAA.SST",
    model_name="ECMWF_51.SST",
    method_analog="som",
    month_of_initialization=3,
    clim_year_start=1991,
    clim_year_end=2020,
    define_extent=(-180, 180, -45, 45),
    index_compute=["NINO34", "DMI"],
    dist_method="gamma"
)

# Download and process data
sst_hist, sst_for = analog_model.download_and_process()

# Generate forecast
ddd, similar_years, forecast_det, forecast_prob = analog_model.forecast(
    predictant=rainfall_data,
    clim_year_start=1991,
    clim_year_end=2020,
    hindcast_det=hindcast_data,
    forecast_year=2025
)

# Create composite plot
similar_years = analog_model.composite_plot(
    predictant=rainfall_data,
    clim_year_start=1991,
    clim_year_end=2020,
    hindcast_det=hindcast_data,
    plot_predictor=True
)

Cross-Validation Example

from wass2s.was_analog import WAS_Cross_Validator

# Perform cross-validation
was_analog_cv = WAS_Cross_Validator(n_splits=len(rainfall.get_index("T")), nb_omit=2)
hindcast_analog_det, hindcast_analog_prob = was_analog_cv.cross_validate(
    analog_model,
    rainfall,
    clim_year_start=1991,
    clim_year_end=2020
)

# Generate forecast using cross-validated hindcast
ddd, similar_years, forecast_det, forecast_prob = analog_model.forecast(
    predictant=rainfall,
    clim_year_start=1991,
    clim_year_end=2020,
    hindcast_det=hindcast_analog_det,
    forecast_year=2025
)

Note

Ensure WAS_Cross_Validator is correctly imported from the wass2s.was_analog module and that the rainfall variable is an xarray DataArray with appropriate dimensions (T, Y, X).