My project on GitHub:

Logistic regression is a common choice for the Titanic dataset because it is well-suited to binary classification tasks, such as predicting whether a passenger survived (1 = yes, 0 = no). There are several reasons why logistic regression is often applied to this dataset:

1. Easy Interpretation: Logistic regression provides easily interpretable results in terms of probabilities. It allows us to understand the likelihood of a passenger surviving based on the given feature values.

2. Efficient for Large Datasets: Logistic regression tends to perform well even on large datasets, as it involves relatively light computations.

3. Robustness to Outliers: Logistic regression is not highly sensitive to outliers in the data, which reduces the need for complex data preprocessing.

4. Feature Selection Capabilities: Logistic regression can be used for feature selection by examining the coefficients associated with each feature.

5. Simple and Effective: Despite its simplicity, logistic regression often yields good results, especially when the relationship between features and the target is relatively linear or not too complex.

6. Ease of Implementation and Interpretation: Logistic regression is straightforward to implement and interpret, making it suitable for beginners in data analysis or machine learning.

Therefore, applying logistic regression to the Titanic dataset helps predict the likelihood of passenger survival based on various attributes present in the dataset.

The Titanic dataset consists of several features used as independent variables (X) and one binary target variable (Y), which represents whether a passenger survived the shipwreck or not. Here’s an explanation of each feature in the Titanic dataset in the context of logistic regression:

1. Id: This variable represents a unique identification of each data point in the dataset. In the context of logistic regression, this variable is usually not used as a predictive feature because it does not provide relevant information for predicting a person’s survival.

2. Pclass: This variable represents the ticket class of the passenger, with values 1, 2, and 3 indicating first, second, and third class, respectively. In logistic regression, this variable can be used as a predictive feature to estimate the likelihood of survival based on the passenger’s ticket class.

3. Sex: This variable identifies the gender of the passenger, with value 0 indicating female and value 1 indicating male. In logistic regression, gender can be an important factor in predicting survival, as evacuation policies typically prioritize women and children.

4. Age: This variable represents the age of the passenger in years. In logistic regression, age can be an important factor as evacuation policies may also consider age-based priorities, such as giving priority to children and the elderly.

5. SibSp: This variable indicates the number of siblings or spouses accompanying the passenger on the journey. In logistic regression, this information can provide insight into whether a person is traveling with family or alone, which may affect the likelihood of survival.

6. Parch: This variable indicates the number of parents or children accompanying the passenger on the journey. In logistic regression, this information can also provide insight into whether a person is traveling with family or alone, which may affect survival likelihood.

7. Fare: This variable represents the ticket fare of the passenger. In logistic regression, the ticket fare can be an indicator of the passenger’s socioeconomic status, which may affect the likelihood of survival.

8. Embarked: This variable indicates the port of embarkation of the passenger, with values 0, 1, and 2 indicating Cherbourg, Queenstown, and Southampton, respectively. In logistic regression, the port of embarkation may also affect the likelihood of survival due to potential differences in evacuation policies between ports.

Dropping the “Id” column from the Titanic dataset is a common practice in data analysis and machine learning tasks. The “Id” column typically represents a unique identifier for each row in the dataset. Since this identifier does not provide any meaningful information related to the prediction task, it is usually removed before building predictive models.
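As an illustrative sketch of this step (using a small hand-made DataFrame rather than the actual Titanic CSV, and the 0 = female / 1 = male encoding described above):

```python
import pandas as pd

# Hypothetical miniature of the Titanic data; the real project loads a CSV.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Survived": [0, 1, 1],
})

# Drop the identifier: it carries no predictive signal.
df = df.drop(columns=["Id"])

# Encode Sex numerically (0 = female, 1 = male), matching the convention above.
df["Sex"] = df["Sex"].map({"female": 0, "male": 1})
```

In the full dataset, the Embarked column would be encoded similarly, e.g. with another mapping or `pd.get_dummies`.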

In the context of logistic regression, these features are used to predict the probability of a passenger surviving the shipwreck based on their characteristics and conditions. By studying the relationship between these features and the target variable using a logistic regression model, we can make predictions about the likelihood of survival for a new passenger with similar features.

## The Sigmoid Function

The sigmoid function, also known as the logistic function, is a mathematical function that maps any real-valued number to a value between 0 and 1. It is defined as:

sigmoid(x) = 1 / (1 + e^-x)

where e is the base of the natural logarithm (approximately equal to 2.71828).

Here’s an explanation of the components of the sigmoid function:

– x : Input to the sigmoid function. It can be any real number.

– e^-x: The exponential function with base e raised to the power of negative x.

– 1 + e^-x: Adding 1 to the exponential term ensures that the denominator is always positive.

– 1/(1 + e^-x): Taking the reciprocal of the denominator scales the output to the open interval (0, 1).

The sigmoid function has an S-shaped curve, which asymptotically approaches 0 as x approaches negative infinity and approaches 1 as x approaches positive infinity. This property makes it useful for mapping arbitrary input values to a probability score between 0 and 1.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```

We can visualize the sigmoid function using Python and matplotlib:
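A minimal plotting sketch (reusing the `sigmoid` function defined above):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Evaluate the sigmoid over a symmetric range around zero.
x = np.linspace(-10, 10, 200)
y = sigmoid(x)

plt.plot(x, y)
plt.title("Sigmoid Function")
plt.xlabel("x")
plt.ylabel("sigmoid(x)")
plt.grid(True)
plt.show()
```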

This code will plot the sigmoid function, showing its characteristic S-shaped curve. As the input x increases, the output of the sigmoid function approaches 1, and as the input decreases, the output approaches 0.

## Cost Function

This script implements logistic regression using gradient descent optimization to minimize the cost function. Here’s an explanation of the components:

- Model Parameters:
  – W: weight vector of shape (n, 1), where n is the number of features.
  – B: bias scalar.
- Initialization:
  – W is initialized as a zero vector.
  – B is initialized to zero.
  – cost_list is initialized to track the cost function value at each iteration.
- Iterations: the loop runs for the specified number of iterations.
- Forward Propagation:
  – Compute the linear transformation Z = W^T·X + B, where X is the feature matrix.
  – Apply the sigmoid activation function to obtain the predicted probabilities A = sigmoid(Z).
- Cost Function:
  – Compute the logistic loss (cross-entropy) cost, where m is the number of training examples:

    cost = -(1/m) · Σ [ Y·log(A) + (1 − Y)·log(1 − A) ]

  – This cost function measures the difference between the predicted probabilities A and the actual labels Y.
- Backward Propagation (Gradient Descent):
  – Compute the gradients of the cost function with respect to the parameters:

    dW = (1/m) · X·(A − Y)^T
    dB = (1/m) · Σ (A − Y)

  – Update the parameters using the gradient descent update rule, with learning rate α:

    W = W − α·dW
    B = B − α·dB

- Cost Tracking:
  – Append the current cost to cost_list.
  – Print the cost at intervals to monitor the training progress.

This function returns the learned parameters W and B, along with the list of costs over iterations. These parameters can then be used to make predictions on new data.
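Putting the steps above together, a minimal sketch of such a `model` function might look like this (an illustration consistent with the description, not necessarily the project's exact code):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def model(X, Y, learning_rate, iterations):
    # X: feature matrix of shape (n, m); Y: labels of shape (1, m)
    n, m = X.shape
    W = np.zeros((n, 1))   # weight vector, initialized to zeros
    B = 0.0                # bias scalar, initialized to zero
    cost_list = []

    for i in range(iterations):
        # Forward propagation: Z = W^T.X + B, then sigmoid activation
        Z = np.dot(W.T, X) + B
        A = sigmoid(Z)

        # Cross-entropy cost
        cost = -(1 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))

        # Backward propagation: gradients of the cost
        dW = (1 / m) * np.dot(X, (A - Y).T)
        dB = (1 / m) * np.sum(A - Y)

        # Gradient descent update
        W = W - learning_rate * dW
        B = B - learning_rate * dB

        # Track the cost and print it at intervals
        cost_list.append(cost)
        if i % max(1, iterations // 10) == 0:
            print("cost after", i, "iterations:", cost)

    return W, B, cost_list
```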

This code snippet represents the training process of a logistic regression model using gradient descent optimization. Let’s break it down:

– `iterations = 100000`: This variable specifies the number of iterations (or epochs) for which the training process will run. Each iteration corresponds to one complete pass through the training data.

– `learning_rate = 0.001`: This variable represents the learning rate, which determines the size of the steps taken in the direction of the gradient during gradient descent. It controls the speed at which the model learns and should be carefully chosen to balance convergence speed and stability.

– `W, B, cost_list = model(X_train, Y_train, learning_rate=learning_rate, iterations=iterations)`: This line of code calls the `model` function to train the logistic regression model. It passes the training data `X_train` and corresponding labels `Y_train`, along with the specified `learning_rate` and `iterations`. The function returns the learned parameters `W` (weights) and `B` (bias), as well as a list containing the cost values computed during training.

– The subsequent lines show the cost (or loss) after certain iterations during the training process. The cost represents the difference between the predicted outputs of the model and the actual labels. In logistic regression, the cost is typically calculated using the cross-entropy loss function. As the training progresses, the cost ideally decreases, indicating that the model is improving in its predictions and moving towards convergence.

In summary, this code snippet demonstrates the training process of a logistic regression model by iteratively updating its parameters (`W` and `B`) using gradient descent to minimize the cost function. The learning rate and the number of iterations are essential hyperparameters that need to be carefully tuned to achieve optimal performance.

This function calculates the accuracy of a logistic regression model given the input features `X`, the true labels `Y`, and the learned weights `W` and bias `B`. Here’s a breakdown of each step:

1. Calculate the logits: First, it computes the logits `Z` by taking the dot product of the transpose of weights `W` and the input features `X`, and then adds the bias `B`. This is essentially the linear transformation step.

2. Apply the sigmoid function: The logits `Z` are passed through the sigmoid function to obtain the predicted probabilities `A`, which represent the likelihood of each sample belonging to the positive class.

3. Thresholding: Next, it applies a threshold of 0.5 to the predicted probabilities `A` to convert them into binary predictions. If the probability is greater than 0.5, it assigns a label of 1 (positive class); otherwise, it assigns a label of 0 (negative class).

4. Convert to integer: The binary predictions `A` are then converted to integers using `np.array(A, dtype='int64')`. This step ensures that the predictions are integers, which can be directly compared to the true labels.

5. Compute accuracy: Finally, it calculates the accuracy of the model by comparing the predicted labels `A` with the true labels `Y`. It computes the absolute difference between `A` and `Y`, sums up the differences, divides by the total number of samples (`Y.shape[1]`), and subtracts this value from 1 to get the accuracy. Multiplying by 100 converts the result to a percentage.
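The steps above can be sketched as follows (the name `accuracy` and the (n, m) / (1, m) array shapes are assumptions carried over from earlier; the project's actual function may differ):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def accuracy(X, Y, W, B):
    # Step 1-2: linear transformation, then sigmoid activation
    Z = np.dot(W.T, X) + B
    A = sigmoid(Z)

    # Step 3-4: threshold probabilities at 0.5, convert to integer labels
    A = np.array(A > 0.5, dtype='int64')

    # Step 5: fraction of matching labels, expressed as a percentage
    acc = (1 - np.sum(np.absolute(A - Y)) / Y.shape[1]) * 100
    return acc
```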

In this case, the accuracy of the model is calculated to be 92.11%. This means that the model correctly predicts the class of approximately 92.11% of the samples in the dataset.

In this project, we implemented logistic regression from scratch to predict survival on the Titanic dataset. Logistic regression proved to be a suitable model for binary classification tasks like this one. By preprocessing the data, defining the logistic regression model, and optimizing its parameters using gradient descent, we were able to achieve a reasonably good accuracy on both the training and test sets. Additionally, we visualized the training process by plotting the cost function over iterations, which helped us understand how the model learns and improves over time.