In the previous blog, we saw how we can use a parametrized algorithm to learn the environment in such a manner that we can both generalize across states and discriminate between states and their values.

In this blog, we will see how we can create features to better represent the states in the environment.

Let’s say we have 100 states, and we need a numerical way to represent each state. This might sound familiar: it is the same idea as one-hot encoding for categorical features. We can create a vector of 100 dimensions; if the agent is in state Si, the ith element of the vector is 1 and all other elements are 0.
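In code, this one-hot representation is just a vector with a single active element. A minimal sketch in NumPy (the function name `one_hot` is illustrative):

```python
import numpy as np

def one_hot(state_index, num_states):
    """Return a one-hot feature vector for the given state index."""
    vec = np.zeros(num_states)
    vec[state_index] = 1.0
    return vec

# Representing state 42 out of 100 states:
features = one_hot(42, 100)
# Exactly one element is active, so the vector both identifies
# the state and shares nothing with any other state.
```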

However, this approach does not scale. In the real world, an environment can have an enormous number of states (e.g., the board positions in chess).

There are various methods to represent a state beyond this simple binary encoding.

· State aggregation involves grouping similar states together into a smaller set of aggregated states.

· This reduces the number of unique states and allows the agent to generalize its learning across similar states.

· Aggregating states is typically done based on some predefined criteria, such as spatial or temporal similarity.

· By aggregating states, the learning algorithm can focus on a smaller set of representative states, which simplifies the learning process.

Example:

Suppose we have 1000 states, where state N = 0 gives a reward of −1, state N = 1000 gives a reward of +1, and all other states give zero reward.

We can group (aggregate) 100 adjacent states to form 10 new states. The values of all 100 states in a group are then updated together.

We can expect the value function to follow a step-function trend, where each step represents the value of one aggregated group.
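The aggregation scheme above can be sketched in a few lines. The group size, learning rate, and helper names (`value`, `update`) here are illustrative choices, not part of any library:

```python
import numpy as np

NUM_STATES = 1000
GROUP_SIZE = 100
NUM_GROUPS = NUM_STATES // GROUP_SIZE   # 10 aggregated states

# One weight per aggregated state; all 100 states in a group share it
weights = np.zeros(NUM_GROUPS)

def value(state):
    """Estimated value of a state = the weight of its group."""
    return weights[state // GROUP_SIZE]

def update(state, target, alpha=0.1):
    """Move the group's weight toward the target. Because the weight
    is shared, all 100 states in the group are updated together."""
    group = state // GROUP_SIZE
    weights[group] += alpha * (target - weights[group])

update(5, target=-1.0)
# States 5 and 99 are in the same group, so they now share the same
# (non-zero) value, while state 100 in the next group is untouched.
```

This shared-weight structure is exactly why the learned value function looks like a step function: every state inside a group maps to the same weight.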

· Coarse coding is a technique that uses a set of overlapping binary features to represent states.

· Each feature encodes a specific property of the state.

· Multiple features can be active simultaneously, which allows us to represent more nuanced distinctions between states.

· Coarse coding helps in generalizing the learning across similar states by activating features that are common to multiple states.

· By using overlapping features, it provides a smoother transition between similar states and facilitates better generalization.

Here, each circle represents a binary flag in a feature vector. To represent a state S, the elements of the feature vector corresponding to the circles that contain S are set to 1 (and 0 otherwise).
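A minimal sketch of this circle-based coarse coding, assuming a 2D state space; the centers and radii below are arbitrary illustrative values:

```python
import numpy as np

# Hypothetical 2D state space covered by overlapping circles.
# Each circle is (center_x, center_y, radius).
circles = [(0.0, 0.0, 1.5), (1.0, 0.0, 1.5), (0.5, 1.0, 1.5), (2.0, 2.0, 1.5)]

def coarse_code(state):
    """Binary feature vector: element i is 1 if the state lies
    inside circle i, and 0 otherwise."""
    x, y = state
    return np.array([1.0 if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2 else 0.0
                     for cx, cy, r in circles])

# A point inside several circles activates several features at once;
# nearby points share most of their active features, which is what
# drives generalization between similar states.
features = coarse_code((0.5, 0.5))
```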

Both state aggregation and coarse coding aim to reduce the dimensionality of the state space and capture relevant information in a more compact and manageable representation. These techniques help in dealing with the curse of dimensionality, where the learning becomes impractical or computationally expensive due to many states. They facilitate better generalization, faster convergence, and improved performance in RL tasks.

Tile coding is a form of coarse coding that provides a compact and efficient representation of continuous states.

The basic idea behind tile coding is to divide the continuous state space into multiple tilings (grids), where each tile covers a specific region of the state space. The tilings overlap with each other, allowing for better representation and generalization across neighboring states.

Here we have selected 4 tilings; to represent a point in the state space, one tile from each tiling is activated.
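A simple sketch of tile coding for a 2D state in [0, 1)², assuming 4 tilings each offset by a fraction of a tile width. The grid size and offset scheme are illustrative choices; practical implementations often hash the tile indices into a fixed-size feature vector:

```python
NUM_TILINGS = 4       # 4 overlapping tilings
TILES_PER_DIM = 10    # each tiling is a 10x10 grid over [0, 1)^2

def tile_code(state):
    """Return the active tile (tiling, row, col) for each tiling.
    Each tiling is shifted slightly, so a single point activates
    one tile per tiling -- 4 active features in total."""
    x, y = state
    tile_width = 1.0 / TILES_PER_DIM
    active = []
    for t in range(NUM_TILINGS):
        offset = t * tile_width / NUM_TILINGS   # shift each tiling
        col = int((x + offset) / tile_width)
        row = int((y + offset) / tile_width)
        active.append((t, row, col))
    return active

# One active tile per tiling for this point in the state space
active_tiles = tile_code((0.35, 0.72))
```

Because each tiling is offset differently, two nearby points usually share most but not all of their active tiles, which gives the smooth generalization described above.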

We have seen various ways in which we can create features to represent the states. We can play around with the shape, size, and number of tiles to get the best representation of a state.

**Is there a way to learn these parameters automatically from data?**

Neural networks can be used to replace tile coding as a function approximator for representing and generalizing the state space. Instead of manually designing and defining tiles, neural networks learn to automatically capture the underlying patterns and representations from the raw input states.

Here’s how neural networks can replace tile coding:

· Design a neural network architecture suitable for the specific reinforcement learning task.

· The architecture typically consists of input layers, hidden layers, and output layers.

· The input layer takes the raw state as input, and the output layer represents the estimated values or policy.

· Train the network on observed states, actions, and rewards.

· The network iteratively updates its parameters to minimize the discrepancy between predicted and target values.

· The network can approximate the value or policy function for any given state, even those not encountered during training.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Hyperparameters (assumed; set these to match your environment)
state_dim = 4          # e.g., CartPole observation size
action_dim = 2         # e.g., CartPole action count
discount_factor = 0.99
num_episodes = 500
max_steps = 200

# Define the Neural Network architecture
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_dim)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Create the QNetwork instance
model = QNetwork(state_dim, action_dim)

# Define the optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

# Function to compute the action given a state using an epsilon-greedy policy
def get_action(state, epsilon):
    if np.random.rand() < epsilon:
        # Explore: choose a random action
        return np.random.choice(action_dim)
    else:
        # Exploit: choose the action with the highest Q-value
        Q_values = model(torch.FloatTensor(state)).detach().numpy()
        return np.argmax(Q_values)

# Function to update the neural network using TD learning
def update_network(state, action, reward, next_state, done):
    predicted = model(torch.FloatTensor(state))
    # Build the TD target without tracking gradients through it
    with torch.no_grad():
        if done:
            target = reward
        else:
            next_Q_values = model(torch.FloatTensor(next_state))
            target = reward + discount_factor * torch.max(next_Q_values).item()
    # Copy the predictions and overwrite only the taken action's value,
    # so the loss is non-zero only for that action
    target_Q = predicted.detach().clone()
    target_Q[action] = target
    optimizer.zero_grad()
    loss = loss_fn(predicted, target_Q)
    loss.backward()
    optimizer.step()

# Training loop (assumes a Gym-style `env` has already been created)
for episode in range(num_episodes):
    state = env.reset()
    epsilon = max(0.1, 1.0 - episode / 200)  # Decrease epsilon over time
    for t in range(max_steps):
        action = get_action(state, epsilon)
        next_state, reward, done, _ = env.step(action)
        update_network(state, action, reward, next_state, done)
        state = next_state
        if done:
            break
```