How Neural Networks Learn: Neurons, Weights and Activation Functions — AI Beginners Guide Episode 5.1 (Updated June 2026)
ChatGPT, Tesla Autopilot, the spam filter in your Gmail and the face recognition on your phone all run on the same fundamental technology: neural networks. NASSCOM and Deloitte project that India will need 1.25 million AI professionals by 2027, and you cannot truly work with these systems unless you understand how they work from the inside. Episode 5.1 of our AI series strips away the mysticism and explains neural networks the way I explain them to engineering students at ABC Trainings: starting from a single artificial neuron and building up to a complete forward pass through a multi-layer network.
- An artificial neuron takes multiple inputs, multiplies each by a weight, sums them with a bias, then passes the result through an activation function.
- Weights determine how strongly each input influences the output — they start random and are updated during training.
- Activation functions add non-linearity — without them, a 100-layer network is mathematically equivalent to a single linear layer.
- ReLU is the most common activation for hidden layers; Sigmoid is used in the output layer for binary classification and Softmax in the output layer for multi-class classification.
- Forward propagation is the process of passing input data through all network layers to produce a prediction.
The Artificial Neuron — The Basic Unit of Every Neural Network
An artificial neuron is a mathematical function loosely inspired by a biological neuron in the brain. It works in four steps: it receives multiple inputs (numbers such as pixel values, sensor readings or any other numerical feature), multiplies each input by its corresponding weight (a number that says how important that input is), sums the weighted inputs and adds a bias term (a constant that shifts the result), then passes this sum through an activation function to produce an output. Written as an equation: output = activation(w1 × x1 + w2 × x2 + ... + wn × xn + b). A single neuron can already make simple decisions: is this email spam (1) or not spam (0)? But real problems need thousands of neurons working together in layers. The large language models behind ChatGPT have billions of parameters (weights and biases), each one learned from data during training. GPT-3, for example, has 175 billion; OpenAI has not published the size of GPT-4.
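The neuron equation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the three spam features, the weights and the bias are made-up values chosen only to show the arithmetic, and Sigmoid is assumed as the activation.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = np.dot(weights, inputs) + bias
    return sigmoid(z)

# Hypothetical spam features: [number of links, count of "free", sender is known]
x = np.array([3.0, 2.0, 0.0])
w = np.array([0.8, 1.2, -2.0])   # illustrative weights, not learned values
b = -3.0                          # illustrative bias

# Weighted sum: 0.8*3 + 1.2*2 + (-2.0)*0 - 3.0 = 1.8
spam_probability = float(neuron(x, w, b))
print(round(spam_probability, 3))
```

An output above 0.5 would be read as "spam" and below 0.5 as "not spam"; training is what would turn the made-up weights into useful ones.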

Weights and Biases — What the Network Actually Learns
Weights are the learned parameters of a neural network. Each connection between a neuron in one layer and a neuron in the next layer has its own weight. A high positive weight means that input strongly pushes the output higher; a high negative weight pushes it lower; a weight near zero means that input is nearly ignored. Biases are additional parameters added to each neuron. They allow the network to shift the activation function left or right, giving the model an extra degree of freedom. Weights start at small random values, while biases can safely start at zero. Random weight initialization is important: if every weight starts at the same value (all zeros, for example), the neurons in a layer compute identical outputs and receive identical updates, so the symmetry between them is never broken and the network cannot learn distinct features. During training, an algorithm called backpropagation calculates how each weight and bias should change to reduce the network's prediction error, and the optimizer (e.g., Adam) actually updates them. The entire learning of a neural network is the process of finding the right values for all weights and biases.
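A small sketch of why the starting values matter. It assumes a tiny two-layer network with tanh hidden units, no biases and a squared-error loss (all chosen here only to keep the gradient short), and compares the hidden-layer gradient for all-zero, constant and random starting weights.

```python
import numpy as np

def hidden_weight_gradient(W1, W2, x, y):
    """Gradient of squared error w.r.t. W1 for a tiny 2-layer tanh network (no biases)."""
    h = np.tanh(W1 @ x)              # hidden activations, shape (hidden,)
    y_hat = W2 @ h                   # scalar prediction
    d_y = 2.0 * (y_hat - y)          # dLoss/dy_hat
    d_h = d_y * W2                   # back through the output weights
    d_z = d_h * (1.0 - h ** 2)       # back through tanh
    return np.outer(d_z, x)          # dLoss/dW1, shape (hidden, inputs)

x = np.array([0.5, -1.0, 2.0])       # illustrative input
y = 1.0                              # illustrative target

# All-zero start: no error signal reaches the hidden weights at all.
grad_zero = hidden_weight_gradient(np.zeros((4, 3)), np.zeros(4), x, y)

# Constant start: every hidden neuron gets exactly the same gradient row.
grad_const = hidden_weight_gradient(np.full((4, 3), 0.1), np.full(4, 0.1), x, y)

# Small random start: each hidden neuron gets its own gradient row.
rng = np.random.default_rng(0)
grad_rand = hidden_weight_gradient(
    rng.normal(0, 0.1, size=(4, 3)), rng.normal(0, 0.1, size=4), x, y
)

zero_gradient = bool(np.all(grad_zero == 0))
rows_identical_const = bool(np.allclose(grad_const, grad_const[0]))
rows_identical_rand = bool(np.allclose(grad_rand, grad_rand[0]))
print(zero_gradient, rows_identical_const, rows_identical_rand)
```

With identical rows, a gradient step keeps the four hidden neurons identical, so the layer behaves like a single neuron no matter how long it trains.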
| Activation Function | Output Range | Best Used In | Keras Code |
|---|---|---|---|
| ReLU | 0 to +infinity | Hidden layers (default choice) | activation='relu' |
| Sigmoid | 0 to 1 | Binary classification output | activation='sigmoid' |
| Softmax | 0 to 1 (sums to 1) | Multi-class output layer | activation='softmax' |
| Tanh | -1 to 1 | RNNs and older architectures | activation='tanh' |
| Linear | -infinity to +infinity | Regression output layer | activation='linear' |
Activation Functions: ReLU, Sigmoid and Softmax — When to Use Each
Activation functions introduce non-linearity into the network — and this is absolutely critical. Without a non-linear activation function, stacking multiple layers is mathematically equivalent to a single linear transformation, no matter how many layers you add. You would lose all the representational power that makes deep learning work. ReLU (Rectified Linear Unit) is the most widely used activation for hidden layers: it outputs 0 for negative inputs and the input itself for positive inputs. It is computationally cheap and does not suffer from vanishing gradients for positive inputs. Sigmoid squashes any real number to the range 0-1. It is used in the output layer for binary classification, where the output represents a probability. Softmax extends sigmoid to multiple classes: it converts a vector of raw scores into a probability distribution that sums to 1. Use Softmax in the output layer for multi-class classification — for example, classifying a handwritten digit as one of 10 classes (0 through 9).
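The three activations described above can be written directly in NumPy. This is a minimal sketch for illustration; the input scores are arbitrary, and the max-subtraction inside softmax is a common numerical-stability trick rather than part of the definition.

```python
import numpy as np

def relu(z):
    """0 for negative inputs, the input itself for positive inputs."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squash each value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = z - np.max(z)          # avoids overflow in exp for large scores
    exp = np.exp(shifted)
    return exp / np.sum(exp)

scores = np.array([-2.0, 0.0, 3.0])  # arbitrary example scores

relu_out = relu(scores)              # negatives clipped to 0
sigmoid_out = sigmoid(scores)        # each value squashed independently
softmax_out = softmax(scores)        # values compete: they must sum to 1

print(relu_out, sigmoid_out, softmax_out)
```

Note the difference in behaviour: Sigmoid treats each score on its own, while Softmax normalizes across the whole vector, which is why it suits the case where exactly one class is correct.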

Network Layers — Input, Hidden and Output Explained
A neural network organizes neurons into layers. The Input Layer receives the raw data — if you are classifying 28x28 pixel images (like MNIST handwritten digits), the input layer has 784 neurons, one per pixel. Hidden Layers are all the layers between input and output — these are where the actual feature learning happens. Each hidden layer learns progressively more abstract representations: the first layer might detect edges, the second layer curves, deeper layers entire shapes or objects. The number of hidden layers and neurons per layer are hyperparameters you choose — there is no one right answer, and practitioners experiment. The Output Layer produces the final prediction. For binary classification (spam or not spam), it has 1 neuron with Sigmoid activation. For 10-class classification (digit recognition), it has 10 neurons with Softmax activation. The depth (number of layers) is what makes a network "deep" — hence the term Deep Learning.
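To get a feel for how layer sizes translate into learnable parameters, the sketch below counts weights and biases for a fully connected MNIST network. The 784 inputs and 10 outputs come from the description above; the hidden sizes 128 and 64 are assumptions picked only for illustration.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    Each pair of adjacent layers contributes n_in * n_out weights
    plus n_out biases (one per neuron in the receiving layer).
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# 784 input pixels -> 128 -> 64 -> 10 digit classes
# (hidden sizes 128 and 64 are illustrative, not a recommendation)
mnist_layers = [784, 128, 64, 10]
total_params = count_parameters(mnist_layers)
print(total_params)  # 100480 + 8256 + 650 = 109386
```

Almost all of the parameters sit in the first layer, because that is where the 784-value input meets the first hidden layer.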
Forward Propagation — How a Prediction Is Made Step by Step
Forward propagation is the process of taking an input, passing it through each layer of the network in sequence, and producing an output prediction. Start with the input data, say the pixel values of an image. For each neuron in the first hidden layer, multiply each input by the weight connecting it to that neuron, sum the results, add the bias and apply the activation function. The results are the outputs of layer 1. These outputs become the inputs to layer 2. Repeat through all hidden layers. The final layer produces the raw scores (called logits), which are passed through the output activation (Sigmoid or Softmax) to get probabilities. In Python with NumPy, a single forward pass through a ReLU layer is: z = np.dot(weights, inputs) + bias; output = np.maximum(0, z). In TensorFlow Keras, the entire forward pass is handled automatically when you call model.predict(x) or model(x). Understanding the mathematical sequence of forward propagation helps you debug unusual model outputs.
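The steps above can be sketched as a complete forward pass in NumPy. This is an illustration under stated assumptions: the weights are random (untrained), the single hidden layer of 32 neurons is an arbitrary choice, and the "image" is random noise, so the predicted digit is meaningless. The point is the sequence of operations and the shapes.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    exp = np.exp(z - np.max(z))      # shift for numerical stability
    return exp / np.sum(exp)

def forward(x, layers):
    """Run x through a list of (W, b) layers.

    ReLU is applied after every hidden layer; the last layer's raw
    scores (logits) go through softmax to become probabilities.
    """
    a = x
    for W, b in layers[:-1]:
        a = relu(W @ a + b)          # outputs of one layer feed the next
    W_out, b_out = layers[-1]
    logits = W_out @ a + b_out
    return softmax(logits)

rng = np.random.default_rng(42)
sizes = [784, 32, 10]                # 32 hidden neurons is an illustrative choice

# Small random weights and zero biases, one (W, b) pair per layer
layers = [
    (rng.normal(0, 0.05, size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = rng.random(784)                  # stand-in for a flattened 28x28 image
probs = forward(x, layers)
predicted_digit = int(np.argmax(probs))
print(probs.shape, predicted_digit)
```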
Loss Functions — How the Network Measures Its Own Mistakes
A loss function measures the difference between the network's predictions and the true labels. It is the signal that drives learning, and the goal of training is to minimize it. For binary classification (spam detection, fraud detection), Binary Cross-Entropy is the standard: it penalizes confident wrong predictions very heavily. For multi-class classification (image recognition across 10+ categories), Categorical Cross-Entropy (or Sparse Categorical Cross-Entropy when labels are integers) is used. For regression problems where you are predicting a continuous number, Mean Squared Error (MSE) or Mean Absolute Error (MAE) is appropriate. A decreasing loss during training means the model is learning. If the training loss stops decreasing (plateaus), try reducing the learning rate or checking for data quality issues. If training loss decreases but validation loss increases, the model is overfitting, the topic we cover in Episode 5.2. Contact ABC Trainings at 7039169629 to enroll in our AI Powered Application Development course in Pune or Sambhajinagar.
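The loss functions named in this section can be sketched in NumPy as follows. This is a simplified illustration, not the Keras implementation: the function names are local to this example, and the clipping constant eps is an assumption added to avoid taking log(0).

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average loss for 0/1 labels and predicted probabilities of class 1."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

def sparse_categorical_cross_entropy(labels, probs, eps=1e-12):
    """Average loss for integer labels and one row of class probabilities per sample."""
    probs = np.clip(probs, eps, 1.0)
    picked = probs[np.arange(len(labels)), labels]   # probability given to the true class
    return float(-np.mean(np.log(picked)))

def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

# The true label is 1 (spam). Compare a confident right and a confident wrong answer.
y = np.array([1.0])
confident_right = binary_cross_entropy(y, np.array([0.99]))
confident_wrong = binary_cross_entropy(y, np.array([0.01]))

# Two samples, three classes, integer labels
multi_class_loss = sparse_categorical_cross_entropy(
    np.array([0, 2]),
    np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8]]),
)

# Regression: one prediction is exact, one is off by 2
mse = mean_squared_error(np.array([1.0, 2.0]), np.array([1.0, 4.0]))
mae = mean_absolute_error(np.array([1.0, 2.0]), np.array([1.0, 4.0]))

print(confident_right, confident_wrong, multi_class_loss, mse, mae)
```

The confident wrong prediction costs several hundred times more than the confident right one, which is the heavy penalty described above. MSE also punishes the single large error harder than MAE does.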
About the author: Rahul Patil. 12 years of experience training engineers across Maharashtra.
Visit Our Centers
- Wagholi (Pune): 1st Floor, Laxmi Datta Arcade, Pune-Ahilyanagar Highway. Call 7039169629
- Hadapsar (Pune HQ): 1st Floor, Shree Tower, opp. Vaibhav Theater, Magarpatta. Call 7039169629
- Cidco (Chh. Sambhajinagar): Kalpana Plaza, opp. Eiffel Tower, N-1 Cidco. Call 7039169629
- Osmanpura (Chh. Sambhajinagar): S.S.C Board to Peer Bazar Road, near Jama Masjid. Call 7039169629
- Sangli: Shubham Emphoria, 1st Floor, Above US Polo Assn., Sangli-Miraj Rd, Vishrambag. Weekend batches available. Call 7039169629
FAQs
What is an artificial neuron and how is it different from a biological neuron?
An artificial neuron is a mathematical function that takes multiple numerical inputs, multiplies each by a learned weight, sums them with a bias, then applies an activation function. It is inspired by — but much simpler than — a biological neuron. A biological neuron fires electrochemical signals through synapses; an artificial neuron performs weighted summation and a mathematical transformation. A neural network has thousands to billions of artificial neurons organized in layers, all working together to transform input data into predictions.
Why do neural networks need activation functions?
Without activation functions, stacking multiple layers is mathematically equivalent to a single linear transformation regardless of how many layers you add. Non-linear activation functions allow the network to learn complex, non-linear relationships in data — which is essential for recognizing images, understanding language or predicting continuous values. A network without non-linear activations can only learn linear decision boundaries, severely limiting its capability.
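The collapse of stacked linear layers can be checked numerically. The sketch below uses small hand-picked matrices (arbitrary values chosen so the arithmetic is easy to follow) and compares two stacked layers against one combined layer, first without and then with a ReLU in between.

```python
import numpy as np

# Two tiny layers with hand-picked, illustrative values
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.array([0.5, 0.5])
W2 = np.array([[2.0, 1.0]])
b2 = np.array([0.1])
x = np.array([1.0, 2.0])

# Two layers with no activation in between
two_linear_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping folded into ONE layer: W = W2 W1, b = W2 b1 + b2
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_linear_layer = W_combined @ x + b_combined

same_without_activation = bool(np.allclose(two_linear_layers, one_linear_layer))

# Insert a ReLU between the layers and the shortcut no longer matches
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2
same_with_relu = bool(np.allclose(with_relu, one_linear_layer))

print(same_without_activation, same_with_relu)
```

Here the first hidden value is negative (-0.5), so ReLU clips it to 0 and the output changes from 0.6 to 1.6. That clipping is exactly the non-linearity a single combined matrix cannot reproduce.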
What is the difference between ReLU, Sigmoid and Softmax activation functions?
ReLU outputs 0 for negative inputs and the input itself for positive inputs. It is the default choice for hidden layers because it is computationally cheap and avoids the vanishing gradient problem for positive values. Sigmoid squashes inputs to 0-1 and is used in binary classification output layers where you need a probability. Softmax converts a vector of raw scores to a probability distribution summing to 1 — used in multi-class classification output layers where each output represents the probability of a specific class.
Does ABC Trainings offer neural network and deep learning courses in Pune?
Yes. ABC Trainings covers artificial neurons, activation functions, forward propagation, backpropagation, CNNs and hands-on TensorFlow Keras projects in our AI Powered Application Development workshop at our Wagholi and Hadapsar centres in Pune, and at Cidco and Osmanpura in Sambhajinagar. Students build working neural network classifiers as part of the course. Contact us at 7039169629 or WhatsApp 7774002496 for batch schedules and fees.


