**Introduction**

The data used in this analysis comes from the Australian Bureau of Meteorology (BOM), accessed through Python's FTP library. It comprises daily weather observations for New South Wales (NSW) over the twelve months from February 2023 to January 2024, covering 79 weather stations across the region. You can follow the steps outlined in this article to capture the weather data.
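As a rough illustration of the FTP retrieval step, the sketch below uses the standard-library `ftplib` module. The host is BOM's public FTP server, but the function names, the remote directory, and the filename pattern are placeholders of my own, not actual BOM paths; check the BOM product catalogue for the real ones.

```python
from ftplib import FTP

import pandas as pd


def month_filenames(prefix: str, start: str, periods: int) -> list:
    """Hypothetical helper: filenames for consecutive monthly extracts,
    e.g. NSW_weather_202302.csv .. NSW_weather_202401.csv."""
    months = pd.period_range(start=start, periods=periods, freq='M')
    return [f"{prefix}_{m.strftime('%Y%m')}.csv" for m in months]


def fetch_bom_file(remote_dir: str, filename: str, host: str = 'ftp.bom.gov.au') -> str:
    """Download one file from the BOM FTP server into the working directory.
    remote_dir and filename are placeholders here."""
    with FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(remote_dir)
        with open(filename, 'wb') as fh:
            ftp.retrbinary(f'RETR {filename}', fh.write)
    return filename
```

Looping `fetch_bom_file` over `month_filenames('NSW_weather', '2023-02', 12)` would pull the twelve monthly extracts used below.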

**Setting the Stage: Data Preparation**

The initial step in automating weather prediction involves gathering and preprocessing the data. The following Python script snippet demonstrates this process:

```python
import os

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Read the file
filename = 'NSW_weather_data_12mth.csv'
observe = pd.read_csv(filename, sep=',', index_col=0)

# Adjust column types
int_col = ['ev_transpiration', 'rain', 'pan_ev', 'max_temp', 'min_temp',
           'max_humid', 'min_humid', 'wind', 'solar']
for col in int_col:
    observe[col] = pd.to_numeric(observe[col], errors='coerce').astype(float)

observe['date'] = pd.to_datetime(observe['date'], format='%d/%m/%Y')

# Sort values
observe = observe.sort_values(['station_name', 'date'])

# Fill NaN values with the column mean
for col in int_col:
    observe[col] = observe[col].fillna(observe[col].mean())
```

This snippet covers the initial steps of loading the data, coercing column types, and imputing missing values, groundwork that is essential for building accurate predictive models.

**Data Preprocessing**

Once the missing values are handled, we proceed to augment the dataset with additional features and perform normalization:

```python
# Add rain_today, rain_yesterday and rain_tomorrow columns.
# Shift within each station so values do not leak across station boundaries.
observe['rain_today'] = np.where(observe['rain'] > 0, 1, 0)
observe['rain_yesterday'] = (observe.groupby('station_name')['rain_today']
                             .shift(1) == 1).astype(int)
observe['rain_tomorrow'] = (observe.groupby('station_name')['rain_today']
                            .shift(-1) == 1).astype(int)

# Add temperature and humidity ranges
observe['diff_temp'] = observe['max_temp'] - observe['min_temp']
observe['diff_humid'] = observe['max_humid'] - observe['min_humid']

# Get month number from datetime
observe['month_number'] = observe['date'].dt.month

# Normalize columns to the [0, 1] range.
# Assign the scaled array back directly so row order is preserved
# (building a new DataFrame with a fresh index could misalign rows).
norm_col = ['ev_transpiration', 'rain', 'pan_ev', 'max_temp', 'min_temp',
            'max_humid', 'min_humid', 'wind', 'solar', 'diff_temp', 'diff_humid']
scaler = MinMaxScaler()
observe[norm_col] = scaler.fit_transform(observe[norm_col])
```

These steps add the prediction target and engineered features, and Min-Max scaling rescales every numeric column to a common [0, 1] range so that no single feature dominates the model.

**Exploratory Data Analysis (EDA)**

Understanding the relationships between weather variables is crucial. We use simple visualizations, such as correlation heatmaps, to explore these relationships:

```python
# Correlation heatmap
corr_col = ['rain_tomorrow', 'rain_yesterday', 'rain_today', 'rain',
            'ev_transpiration', 'pan_ev', 'max_temp', 'min_temp',
            'max_humid', 'min_humid', 'wind', 'solar', 'diff_temp', 'diff_humid']
plt.figure(figsize=(14, 14))  # create the figure before drawing on it
sns.heatmap(observe[corr_col].corr(), annot=True, cmap='coolwarm', annot_kws={'size': 6})
plt.show()
```

This visualization helps identify which weather features have the most significant influence on rain prediction.
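To turn the heatmap into a ranked list, the correlations with the target can be sorted directly. The snippet below is a self-contained sketch on synthetic data (the column names match the article, but the values are randomly generated, so the numbers are illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the observe DataFrame
demo = pd.DataFrame({
    'max_humid': rng.random(n),
    'solar': rng.random(n),
})
# Toy target driven mostly by humidity
demo['rain_tomorrow'] = ((demo['max_humid'] + rng.normal(0, 0.2, n)) > 0.6).astype(int)

# Rank features by absolute correlation with the target
ranking = (demo.corr()['rain_tomorrow']
           .drop('rain_tomorrow')
           .abs()
           .sort_values(ascending=False))
print(ranking)
```

On the real `observe` DataFrame the same pattern applied to `corr_col` gives a quick shortlist of candidate predictors.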

**Model Development: Logistic Regression**

Our focus is on building a logistic regression model, a simple yet powerful classification technique. The following script snippet demonstrates training the model:

```python
# Prepare the dataset, then split into train and validation sets
x_col = ['ev_transpiration', 'max_temp', 'min_temp', 'max_humid',
         'min_humid', 'wind', 'solar', 'month_number']
X = observe[x_col]
y = observe['rain_tomorrow']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

# Logistic Regression model
log_reg_model = LogisticRegression()
log_reg_model.fit(X_train, y_train)

# Model score / accuracy on the validation set
log_reg_model_score = log_reg_model.score(X_val, y_val)
log_reg_model_accuracy = round(log_reg_model_score * 100, 2)
print("The classification accuracy of the Logistic Regression model is "
      + str(log_reg_model_accuracy) + "%")

# Confusion matrix
y_pred = log_reg_model.predict(X_val)
cm = confusion_matrix(y_val, y_pred)

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(4, 3))
plt.imshow(cm, cmap='Reds')

# Add text annotations
for i in range(len(cm)):
    for j in range(len(cm[i])):
        plt.text(j, i, str(cm[i][j]), ha='center', va='center', color='black', fontsize=12)

# Set labels and title
plt.xlabel('Predicted', fontsize=10)
plt.ylabel('Actual', fontsize=10)
plt.title('Confusion Matrix', fontsize=10, pad=10)

# Set class labels
class_labels = ['Not Rain', 'Rain']
plt.xticks(range(len(class_labels)), class_labels, rotation=45)
plt.yticks(range(len(class_labels)), class_labels, rotation=0)
plt.colorbar()
plt.show()

# Classification report
print('Logistic Regression Classification Report')
print('=========================================')
print(classification_report(y_val, y_pred, target_names=class_labels))
```

This basic model serves as a foundation for predicting whether it will rain tomorrow based on current weather conditions.
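Once trained, the model can score a fresh day of observations. The sketch below is self-contained, so it fits a stand-in model on synthetic rows with the same eight feature columns and a toy humidity-driven label; in the article's pipeline you would call `log_reg_model.predict` on real, scaled features instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

x_col = ['ev_transpiration', 'max_temp', 'min_temp', 'max_humid',
         'min_humid', 'wind', 'solar', 'month_number']

# Synthetic stand-in for the trained model (toy label driven by humidity)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.random((200, len(x_col))), columns=x_col)
y_demo = (X_demo['max_humid'] > 0.5).astype(int)
model = LogisticRegression().fit(X_demo, y_demo)

# One new day of observations (illustrative feature values)
today = pd.DataFrame([[0.4, 0.8, 0.3, 0.9, 0.6, 0.2, 0.5, 0.1]], columns=x_col)

p_rain = model.predict_proba(today)[0, 1]
print('Rain tomorrow' if model.predict(today)[0] == 1 else 'No rain tomorrow')
print(f'P(rain) = {p_rain:.2f}')
```

`predict_proba` is often more useful than the hard 0/1 label here, since the decision threshold can then be tuned to the cost of a missed rain forecast.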

**Conclusion**

By automating weather prediction using Python and machine learning techniques, we can enhance the accuracy and reliability of forecasts, enabling better decision-making and planning in various domains. In the upcoming articles, we will explore more advanced machine learning models for weather prediction, building upon the simple logistic regression discussed here. Stay tuned for deeper insights into enhancing weather forecasting capabilities.