Commit abb1c8a

committed
Started Deep Learning...
1 parent 947e892 commit abb1c8a

6 files changed

+667
-0
lines changed

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
---
title: Activation Functions
sidebar_label: Activation Functions
description: "Why we need non-linearity and a deep dive into Sigmoid, Tanh, ReLU, and Softmax."
tags: [deep-learning, neural-networks, activation-functions, relu, sigmoid]
---

An **Activation Function** is a mathematical function applied to a neuron's weighted sum before the result is passed on to the next layer. Its primary job is to introduce **non-linearity** into the network. Without activation functions, no matter how many layers you add, your neural network collapses into a single linear model, much like linear regression.

## 1. Why do we need Non-Linearity?

Real-world data is rarely a straight line. If we only used linear transformations ($z = wx + b$), the composition of multiple layers would just be another linear transformation.

Non-linear activation functions allow the network to "bend" the decision boundary to fit complex patterns like images, sound, and human language.
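
The collapse of stacked linear layers can be checked numerically. The sketch below is only an illustration: the layer sizes, the random seed, and the variable names are assumptions, not part of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Two "layers" with no activation function in between
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Layer-by-layer computation
two_layer_out = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer_out = W_combined @ x + b_combined

print(np.allclose(two_layer_out, one_layer_out))  # True
```

Because the two outputs match, the extra layer adds parameters but no extra expressive power until a non-linear function is placed between the layers.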

## 2. Common Activation Functions

### A. Sigmoid

The Sigmoid function squashes any input value into a range between **0 and 1**.
* **Formula:** $\sigma(z) = \frac{1}{1 + e^{-z}}$
* **Best For:** The output layer of binary classification models.
* **Downside:** It suffers from the **Vanishing Gradient** problem: for inputs of large magnitude (very positive or very negative), the gradient is almost zero, so weight updates become tiny and learning stalls.
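
To make the saturation concrete, here is a minimal NumPy sketch of Sigmoid and its derivative $\sigma(z)(1 - \sigma(z))$. The helper names `sigmoid` and `sigmoid_grad` and the sample inputs are chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid(z))       # close to 0, exactly 0.5, close to 1
print(sigmoid_grad(z))  # peaks at 0.25 for z = 0, nearly 0 at both extremes
```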

### B. ReLU (Rectified Linear Unit)

ReLU is the default choice for hidden layers in modern deep learning.
* **Formula:** $f(z) = \max(0, z)$
* **Pros:** It is computationally cheap and, because its gradient is exactly 1 for positive inputs, it helps mitigate vanishing gradients.
* **Cons:** "Dying ReLU": if a neuron's weighted sum is negative for every training input, its output and its gradient are both 0, so its weights stop updating.
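
A short sketch of ReLU and its gradient, assuming the common convention that the derivative at exactly $z = 0$ is taken to be 0. The function names are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 for positive inputs, 0 otherwise (the value at z == 0 is a convention)
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z))       # [0. 0. 0. 2.]
print(relu_grad(z))  # [0. 0. 0. 1.]
```

The zeros in the gradient are the source of the "Dying ReLU" issue: a neuron whose inputs all land in the negative region receives no gradient signal.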

### C. Tanh (Hyperbolic Tangent)

Similar to Sigmoid, but it squashes values between **-1 and 1**.
* **Formula:** $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
* **Pros:** It is "zero-centered," meaning the average output is closer to 0, which often makes training faster than Sigmoid.
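
The "zero-centered" property can be seen by feeding both functions the same inputs. This is a rough sketch; the symmetric input grid is an assumption chosen to make the contrast obvious.

```python
import numpy as np

z = np.linspace(-4.0, 4.0, 9)  # inputs symmetric around 0
sigmoid_out = 1.0 / (1.0 + np.exp(-z))
tanh_out = np.tanh(z)

print(sigmoid_out.mean())  # about 0.5: every output is positive
print(tanh_out.mean())     # about 0.0: outputs are balanced around zero

# tanh is a rescaled, shifted sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
rescaled_sigmoid = 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0
print(np.allclose(tanh_out, rescaled_sigmoid))  # True
```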

## 3. Comparison Table

| Function | Range | Common Use Case | Main Issue |
| :--- | :--- | :--- | :--- |
| **Sigmoid** | (0, 1) | Binary Classification Output | Vanishing Gradient |
| **Tanh** | (-1, 1) | Hidden Layers (legacy) | Vanishing Gradient |
| **ReLU** | [0, $\infty$) | Hidden Layers (Standard) | Dying Neurons |
| **Softmax** | (0, 1) | Multi-class Output | Only used in Output layer |

## 4. The Softmax Function (Multi-class)

When you have more than two categories (e.g., classifying an image as a Cat, Dog, or Bird), use **Softmax** in the final layer. It turns the raw outputs (logits) into a probability distribution that sums to **1.0**.

$$
\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}
$$

Where:

* $\mathbf{z}$ = vector of raw class scores (logits)
* $K$ = total number of classes
* $\sigma(\mathbf{z})_i$ = probability of class $i$
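
The formula above can be sketched in a few lines of NumPy. Subtracting the largest logit before exponentiating is a common numerical-stability trick that leaves the result unchanged; it is an addition here, not part of the formula itself. The logit values are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Shifting by the max avoids overflow in exp without changing the output
    shifted = logits - np.max(logits)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical scores for Cat, Dog, Bird
probs = softmax(logits)
print(probs)        # roughly [0.659 0.242 0.099]
print(probs.sum())  # 1.0 up to floating-point rounding
```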

## 5. Implementation with Keras

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Binary classification: ReLU for the hidden layer, Sigmoid for the output
model = Sequential()
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Alternatively, for multi-class classification (3 classes), end with Softmax
multiclass_model = Sequential()
multiclass_model.add(Dense(64, activation='relu'))
multiclass_model.add(Dense(3, activation='softmax'))
```

---

## References

* **CS231n:** [Linear Classifiers and Activations](https://cs231n.github.io/neural-networks-1/)

---

**Now that you know how neurons fire, how do we measure how "wrong" their firing pattern is compared to the ground truth?**
Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
---
title: "Backpropagation: How Networks Learn"
sidebar_label: Backpropagation
description: "Demystifying the heart of neural network training: The Chain Rule, Gradients, and Error Attribution."
tags: [deep-learning, neural-networks, backpropagation, calculus, gradient-descent]
---

**Backpropagation** (short for "backward propagation of errors") is the central algorithm that allows neural networks to learn. If [Forward Propagation](./forward-propagation) is how the network makes a guess, Backpropagation is how it realizes how wrong it was and adjusts its internal weights to do better next time.

## 1. The High-Level Concept

Imagine you are a manager of a large factory (the network). At the end of the day, the final product is defective (a high **Loss**). To fix the problem, you don't just blame the person at the exit door; you trace the mistake backward through every department to find out who contributed most to the error and tell them to adjust their process.

## 2. The Four Steps of Training

Backpropagation is the third step in the general training loop:

1. **Forward Pass:** Calculate the prediction ($y_{pred}$).
2. **Loss Calculation:** Calculate the error using a [Loss Function](./loss-functions) (e.g., $L = (y_{actual} - y_{pred})^2$).
3. **Backward Pass (Backpropagation):** Calculate the **Gradient** of the loss with respect to every weight and bias in the network.
4. **Weight Update:** Adjust the weights slightly in the opposite direction of the gradient.
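
The four steps can be sketched for the smallest possible model, a single weight with $y_{pred} = w \cdot x$. The data values, the learning rate, and the number of steps below are assumptions picked for illustration, and the gradient is written out by hand rather than computed by a library.

```python
# Toy values (hypothetical): one input, one target, one weight
x, y_actual = 5.0, 12.0
w = 2.0
learning_rate = 0.01

for step in range(20):
    y_pred = w * x                            # 1. forward pass
    loss = (y_actual - y_pred) ** 2           # 2. loss calculation
    grad_w = -2.0 * x * (y_actual - y_pred)   # 3. backward pass: dL/dw by hand
    w -= learning_rate * grad_w               # 4. weight update

print(w, loss)  # w approaches 12 / 5 = 2.4 and the loss approaches 0
```

With these particular numbers each update halves the distance between `w` and 2.4, which is why 20 steps are enough.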

## 3. The Secret Sauce: The Chain Rule

Mathematically, we want to find out how much the Loss ($L$) changes when we change a specific weight ($w$). This is the derivative $\frac{\partial L}{\partial w}$.

Because the weight is buried deep inside the network, we use the **Chain Rule** from calculus to "unpeel" the layers:

$$
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \text{out}} \cdot \frac{\partial \text{out}}{\partial \text{net}} \cdot \frac{\partial \text{net}}{\partial w}
$$

Where:

- $\text{out}$ = output of the neuron
- $\text{net}$ = weighted sum input to the neuron

By applying the chain rule repeatedly, we can propagate the error gradient backward through the network.

This allows us to calculate the error contribution of a neuron in the 10th layer, and then use that result to calculate the error of a neuron in the 9th layer, and so on, all the way back to the input.
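
As a worked instance of the three-factor product, the sketch below assumes a single Sigmoid neuron with a squared-error loss, so $\text{net} = wx + b$, $\text{out} = \sigma(\text{net})$ and $L = (y - \text{out})^2$. All numeric values are hypothetical, and the finite-difference check is included only to confirm the hand-derived result.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single neuron and training example
x, y = 1.5, 1.0
w, b = 0.8, -0.2

net = w * x + b
out = sigmoid(net)

dL_dout = -2.0 * (y - out)     # dL/d(out) for L = (y - out)^2
dout_dnet = out * (1.0 - out)  # d(out)/d(net) for Sigmoid
dnet_dw = x                    # d(net)/dw
grad_chain = dL_dout * dout_dnet * dnet_dw

# Sanity check with a central finite difference
def loss_at(w_value):
    return (y - sigmoid(w_value * x + b)) ** 2

eps = 1e-6
grad_numeric = (loss_at(w + eps) - loss_at(w - eps)) / (2 * eps)

print(grad_chain, grad_numeric)  # the two values should agree closely
```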

## 4. Visualizing the Gradient Flow

Information flows backward through the same paths it took during the forward pass.

```mermaid
graph RL
    %% Output Layer
    Y["$$\hat{y}$$"] -->|"$$\frac{\partial L}{\partial z^{[2]}}$$"| H1
    Y -->|"$$\frac{\partial L}{\partial z^{[2]}}$$"| H2

    %% Hidden Layer
    H1["$$a_1^{[1]}$$"] -->|"$$\frac{\partial L}{\partial z_1^{[1]}}$$"| X1
    H1 -->|"$$\frac{\partial L}{\partial z_1^{[1]}}$$"| X2
    H1 -->|"$$\frac{\partial L}{\partial z_1^{[1]}}$$"| X3

    H2["$$a_2^{[1]}$$"] -->|"$$\frac{\partial L}{\partial z_2^{[1]}}$$"| X1
    H2 -->|"$$\frac{\partial L}{\partial z_2^{[1]}}$$"| X2
    H2 -->|"$$\frac{\partial L}{\partial z_2^{[1]}}$$"| X3

    %% Input Layer
    X1["$$x_1$$"]
    X2["$$x_2$$"]
    X3["$$x_3$$"]

    %% Loss
    L["$$L(\hat{y}, y)$$"] --> Y
```

In this diagram, the arrows represent the flow of gradients backward through the network. Each neuron receives gradients from the neurons it feeds into, allowing it to compute how much it contributed to the final loss.

**Quick overview of the steps during backpropagation:**

1. Start at the output layer and compute the gradient of the loss with respect to the output.
2. Use the chain rule to propagate this gradient backward through each layer.
3. At each neuron, compute the gradient with respect to its weights and biases.
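
The three steps can be traced by hand for the small network in the diagram (3 inputs, 2 hidden neurons, 1 output). The sketch below assumes Sigmoid activations in both layers and a squared-error loss; the target value, the seed, and the zero-initialized biases are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, 0.1, -0.2])                 # 3 inputs
y = 1.0                                        # hypothetical target
W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)  # hidden layer, 2 neurons
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)  # output layer, 1 neuron

# Forward pass (intermediate values are kept for the backward pass)
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)
loss = ((y - y_hat) ** 2).item()

# Step 1: gradient at the output layer
dL_dyhat = -2.0 * (y - y_hat)
dL_dz2 = dL_dyhat * y_hat * (1.0 - y_hat)      # shape (1,)

# Step 3 for the output layer: gradients of its weights and bias
dL_dW2 = np.outer(dL_dz2, a1)                  # shape (1, 2)
dL_db2 = dL_dz2

# Step 2: propagate the gradient back to the hidden layer
dL_da1 = W2.T @ dL_dz2                         # shape (2,)
dL_dz1 = dL_da1 * a1 * (1.0 - a1)

# Step 3 for the hidden layer
dL_dW1 = np.outer(dL_dz1, x)                   # shape (2, 3)
dL_db1 = dL_dz1

print(loss, dL_dW1.shape, dL_dW2.shape)
```

Note how `dL_dz2` is computed once and then reused for both the output-layer weights and the hidden-layer gradient; that reuse is what makes backpropagation efficient.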

## 5. The Vanishing Gradient Problem

In very deep networks, as we multiply many small derivatives together using the chain rule, the gradient can become extremely small by the time it reaches the first layers.

* **Result:** The early layers stop learning because their weights are barely changing.
* **A common mitigation:** Use activation functions like **ReLU** instead of Sigmoid in hidden layers. Sigmoid's derivative never exceeds 0.25, whereas ReLU's derivative is exactly 1 for positive inputs, so ReLU doesn't "squash" gradients as severely.
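
A back-of-the-envelope sketch of the shrinkage: it multiplies only the activation derivatives and deliberately ignores the weight matrices, which also scale the gradient in a real network. It also assumes the best case for Sigmoid, where every weighted sum is exactly 0.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Best case for Sigmoid: each layer contributes its maximum derivative, 0.25
for depth in (2, 5, 10, 20):
    sigmoid_factor = sigmoid_grad(0.0) ** depth
    relu_factor = 1.0 ** depth  # ReLU derivative is 1 for any positive input
    print(depth, sigmoid_factor, relu_factor)
```

Even in this best case, ten Sigmoid layers scale the gradient by less than one millionth, while the ReLU factor stays at 1 along any path of active neurons.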

## 6. Simple Implementation Logic

In modern libraries like PyTorch or TensorFlow, you don't have to write the calculus yourself. They compute gradients for you with **automatic differentiation** (PyTorch calls its engine Autograd).

```python
# A conceptual example using PyTorch logic
import torch

# 1. Initialize the weight with 'requires_grad' so PyTorch tracks operations on it
w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([5.0])
y_actual = torch.tensor([12.0])

# 2. Forward Pass
y_pred = w * x

# 3. Calculate Loss
loss = (y_actual - y_pred)**2

# 4. BACKPROPAGATION (The Magic Step)
loss.backward()

# 5. Check the Gradient
print(f"Gradient of loss w.r.t w: {w.grad}")
# Prints tensor([-20.]): dL/dw = -2 * x * (y_actual - y_pred) = -2 * 5 * 2.
# The negative sign means increasing 'w' reduces 'loss'.
```

---

**Now that we have the "Gradients" (the direction of change), how do we actually move the weights to reach the minimum error?**
Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
---
title: Forward Propagation
sidebar_label: Forward Propagation
description: "Understanding how data flows from the input layer to the output layer to generate a prediction."
tags: [deep-learning, neural-networks, forward-propagation, math]
---

**Forward Propagation** is the process by which a neural network transforms input data into an output prediction. It is the "inference" stage where data flows through the network layers, undergoing linear transformations and non-linear activations until it reaches the final layer.

## 1. The Step-by-Step Flow

In a dense (fully connected) network, the signal moves from left to right. For every neuron in a hidden or output layer, two distinct steps occur:

### Step A: The Linear Transformation (Z)
The neuron takes all inputs from the previous layer, multiplies them by their respective weights, and adds a bias term. This is essentially a multi-dimensional linear equation.

$$
z = \sum_{i=1}^{n} (w_i \cdot x_i) + b
$$

Where:

- $x_i$ = input features from the previous layer
- $w_i$ = weights associated with each input
- $b$ = bias term
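
For a single neuron, the weighted sum above amounts to a few multiplications and additions. The inputs, weights, and bias in this sketch are hypothetical values chosen so the arithmetic is easy to follow.

```python
# Hypothetical inputs, weights, and bias for one neuron
inputs = [0.5, 0.1, -0.2]
weights = [0.4, -0.6, 0.9]
bias = 0.1

# z = sum_i(w_i * x_i) + b
#   = (0.4 * 0.5) + (-0.6 * 0.1) + (0.9 * -0.2) + 0.1
z = sum(w * x for w, x in zip(weights, inputs)) + bias
print(z)  # approximately 0.06
```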

### Step B: The Non-Linear Activation (A)
The result $z$ is passed through an **Activation Function** (like ReLU or Sigmoid). This step is crucial because it allows the network to learn complex, non-linear patterns. In the formulas below, $\sigma$ stands for whichever activation function the layer uses, not only Sigmoid.

$$
a = \sigma(z)
$$

## 2. Forward Propagation in Matrix Form

In practice, we don't calculate one neuron at a time. We use **Linear Algebra** to calculate entire layers simultaneously. This is why GPUs (which are great at matrix math) are so important for Deep Learning.

If $W^{[1]}$ is the weight matrix for the first layer and $X$ is our input vector:

$$
Z^{[1]} = W^{[1]} \cdot X + b^{[1]}
$$

Then, we apply the activation function:

$$
A^{[1]} = \sigma(Z^{[1]})
$$

This output $A^{[1]}$ then becomes the "input" for the next layer.
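
The same matrix expression also handles several examples at once. The sketch below goes one step beyond the single input vector in the text and assumes that examples are stored as the columns of `X`; the shapes, the seed, and the choice of Sigmoid are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))  # 2 neurons, 3 inputs each
b1 = rng.normal(size=(2, 1))  # one bias per neuron, stored as a column
X = rng.normal(size=(3, 4))   # 4 examples stored as columns

Z1 = W1 @ X + b1              # b1 broadcasts across the 4 columns
A1 = sigmoid(Z1)

# Same result, computed one neuron and one example at a time
Z1_loop = np.array([[W1[i] @ X[:, j] + b1[i, 0] for j in range(4)]
                    for i in range(2)])

print(Z1.shape, np.allclose(Z1, Z1_loop))  # (2, 4) True
```

One matrix product replaces the nested loops, which is the kind of workload GPUs are built to accelerate.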

## 3. A Visual Example

Imagine a simple network with 1 Hidden Layer:

```mermaid
graph LR
    %% Input Layer
    X1["$$x_1$$"] -->|"$$w_{11}^{[1]}$$"| H1
    X2["$$x_2$$"] -->|"$$w_{12}^{[1]}$$"| H1
    X3["$$x_3$$"] -->|"$$w_{13}^{[1]}$$"| H1

    X1 -->|"$$w_{21}^{[1]}$$"| H2
    X2 -->|"$$w_{22}^{[1]}$$"| H2
    X3 -->|"$$w_{23}^{[1]}$$"| H2

    %% Hidden Layer
    H1["$$z_1^{[1]} \\ a_1^{[1]} = \sigma(z_1^{[1]})$$"]
    H2["$$z_2^{[1]} \\ a_2^{[1]} = \sigma(z_2^{[1]})$$"]

    %% Output Layer
    H1 -->|"$$w_1^{[2]}$$"| Y
    H2 -->|"$$w_2^{[2]}$$"| Y

    Y["$$z^{[2]} \\ \hat{y} = \sigma(z^{[2]})$$"]

    %% Bias annotations
    B1["$$b^{[1]}$$"] -.-> H1
    B1 -.-> H2
    B2["$$b^{[2]}$$"] -.-> Y
```

1. **Input:** Your features (e.g., pixel values of an image).
2. **Hidden Layer:** Extracts abstract features (e.g., edges or shapes).
3. **Output Layer:** Provides the final guess (e.g., "This is a dog with 92% probability").

## 4. Why "Propagate"?

The term "propagate" is used because the output of one layer is the input of the next. The information "spreads" through the network. Each layer acts as a filter, refining the raw data into more meaningful representations until a decision can be made at the end.

## 5. Implementation in Python with NumPy

This snippet demonstrates the math behind a single forward pass for a network with one hidden layer.

```python
import numpy as np

def sigmoid(x):
    """Squash values into the range (0, 1)."""
    return 1 / (1 + np.exp(-x))

# 1. Inputs (3 features)
X = np.array([0.5, 0.1, -0.2])

# 2. Weights and Biases (Hidden Layer with 2 neurons)
W1 = np.random.randn(2, 3)  # shape: (neurons, inputs)
b1 = np.random.randn(2)

# 3. Weights and Biases (Output Layer with 1 neuron)
W2 = np.random.randn(1, 2)
b2 = np.random.randn(1)

# --- FORWARD PASS ---

# Layer 1 (Hidden)
z1 = np.dot(W1, X) + b1
a1 = sigmoid(z1)

# Layer 2 (Output)
z2 = np.dot(W2, a1) + b2
prediction = sigmoid(z2)

# The weights are random, so the printed value changes on every run
print(f"Model Prediction: {prediction}")
```

## 6. What happens next?

Forward propagation gives us a prediction. However, at the start, the weights are random, so the prediction is usually far from the truth. To make the model "learn," we must:

1. Compare the prediction to the truth using a **Loss Function**.
2. Send the error backward through the network using **Backpropagation**.

## References

* **DeepLearning.AI:** [Neural Networks and Deep Learning (Week 2)](https://www.coursera.org/learn/neural-networks-deep-learning)
* **Khan Academy:** [Matrix Multiplication Foundations](https://www.khanacademy.org/math/precalculus/x9e81a4f98389efdf:matrices)

---

**We have the prediction. Now, how do we tell the network it made a mistake?** Head over to the [Backpropagation](./backpropagation.mdx) guide to learn how neural networks learn from their errors!
