Imbalanced Data Sampling Strategies: Mathematical Treatment of Synthetic Oversampling and Undersampling in Skewed Classification Tasks

Imbalanced classification is one of the most common reasons a model looks “accurate” in reports but fails in production. When one class heavily outweighs the other, a naïve classifier can predict the majority class most of the time and still achieve high accuracy. In practice, this is useless for tasks like fraud detection, defect identification, churn prediction, or rare disease screening, where the minority class is the one you care about. A solid response begins with sampling strategies that reshape the training distribution while preserving the real-world meaning of evaluation metrics.

For learners in a data scientist course in Ahmedabad, understanding the mathematics behind oversampling and undersampling is valuable because it helps you choose the right technique rather than applying SMOTE by default. In this article, we will break down the logic, the calculations, and the trade-offs behind common sampling methods.

Why Imbalance Breaks Standard Learning

Let the dataset contain $N$ samples with $N_{+}$ minority (positive) and $N_{-}$ majority (negative) samples, where $N_{+} \ll N_{-}$. The imbalance ratio is:

$$IR = \frac{N_{-}}{N_{+}}$$

Many training algorithms implicitly minimise empirical risk by weighting errors proportionally to class frequency. If misclassifying a minority case has the same cost as misclassifying a majority case, the optimiser will naturally favour the majority because it contributes more to the loss.

This is why accuracy becomes misleading. A better evaluation is typically based on precision, recall, F1-score, ROC-AUC, and especially PR-AUC when positives are rare. However, even with correct metrics, the training data distribution still needs attention.
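The accuracy trap is easy to demonstrate. The sketch below uses a hypothetical label vector with a 99:1 imbalance and a classifier that always predicts the majority class:

```python
import numpy as np

# Hypothetical skewed labels: 990 negatives, 10 positives (IR = 99).
y = np.array([0] * 990 + [1] * 10)

# A naive classifier that always predicts the majority class.
y_pred = np.zeros_like(y)

accuracy = (y_pred == y).mean()          # 0.99, despite catching zero positives
recall = (y_pred[y == 1] == 1).mean()    # 0.0 on the class we care about
print(accuracy, recall)
```

Despite 99% accuracy, the recall on the minority class is zero, which is why precision, recall, and PR-AUC are the metrics to watch.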

Undersampling: Reducing the Majority Class

Undersampling aims to reduce $N_{-}$ so the learner receives a more balanced view of both classes. The simplest approach is random undersampling, where you keep all minority samples and randomly select a subset of majority samples.

If you want a target ratio $r$ such that:

N−′N+=r⇒N−′=rN+\frac{N’_{-}}{N_{+}} = r \Rightarrow N’_{-} = rN_{+}N+​N−′​​=r⇒N−′​=rN+​

then you sample $N'_{-}$ points from the original $N_{-}$ majority samples. The benefit is speed and simplicity: fewer samples mean faster training and lower memory use. The risk is information loss. If the majority class contains multiple sub-patterns, random reduction may discard critical regions of the feature space, increasing variance and reducing generalisation.
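A minimal NumPy sketch of random undersampling, assuming binary labels {0, 1} with 1 as the minority class (the helper name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X, y, r=1.0):
    """Keep all minority (label 1) rows; keep r * N_+ randomly chosen majority rows."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_keep = int(r * len(pos_idx))                      # N'_- = r * N_+
    keep_neg = rng.choice(neg_idx, size=n_keep, replace=False)
    idx = np.concatenate([pos_idx, keep_neg])
    rng.shuffle(idx)                                    # avoid class-sorted order
    return X[idx], y[idx]

X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)
X_bal, y_bal = random_undersample(X, y, r=1.0)
print(int(y_bal.sum()), len(y_bal))                     # 50 positives, 100 total
```

With $r = 1$ the result is a fully balanced but much smaller training set, which is exactly the speed/information trade-off described above.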

More informed undersampling uses clustering or near-miss logic. For example, you can retain majority samples that lie close to minority samples in feature space, preserving decision boundary information. In practical terms, a data scientist course in Ahmedabad often teaches this as “keep the hard negatives,” because those are the cases that shape the classifier boundary most.
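The "keep the hard negatives" idea can be sketched by ranking majority samples by their distance to the nearest minority sample and keeping the closest ones. This is a simplified illustration, not the exact NearMiss rule from the literature:

```python
import numpy as np

def near_miss_undersample(X, y, r=1.0):
    """Informed undersampling sketch: retain the majority samples that lie
    closest to the minority class, i.e. the 'hard negatives'."""
    X_pos, X_neg = X[y == 1], X[y == 0]
    # Distance from every majority point to its nearest minority point.
    d = np.linalg.norm(X_neg[:, None, :] - X_pos[None, :, :], axis=2).min(axis=1)
    n_keep = int(r * len(X_pos))
    keep = np.argsort(d)[:n_keep]           # smallest distances = hardest negatives
    X_new = np.vstack([X_pos, X_neg[keep]])
    y_new = np.concatenate([np.ones(len(X_pos), int), np.zeros(n_keep, int)])
    return X_new, y_new

# Two negatives sit next to the single positive; two sit far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.2, 0.0], [6.0, 6.0]])
y = np.array([1, 0, 0, 0, 0])
X_new, y_new = near_miss_undersample(X, y, r=2.0)   # keeps the two near negatives
```

The far-away negatives are dropped first, preserving the samples that actually shape the decision boundary.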

Oversampling: Replicating Minority Samples

Oversampling increases $N_{+}$ to reduce imbalance. In random oversampling, minority samples are duplicated until a desired ratio is achieved. If the goal ratio is $r$ again:

N−N+′=r⇒N+′=N−r\frac{N_{-}}{N’_{+}} = r \Rightarrow N’_{+} = \frac{N_{-}}{r}N+′​N−​​=r⇒N+′​=rN−​​

The number of additional copies is $N'_{+} - N_{+}$. Random oversampling keeps all the original information but can cause overfitting because the same minority points are repeated. Many models, especially tree-based learners, may create splits that memorise these duplicates.
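A small NumPy sketch of random oversampling by duplication, again assuming binary labels {0, 1} with 1 as the minority class (the helper is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, r=1.0):
    """Duplicate minority rows (sampled with replacement) until N_- / N'_+ == r."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_target = int(len(neg_idx) / r)            # N'_+ = N_- / r
    n_extra = n_target - len(pos_idx)           # number of duplicated copies
    extra = rng.choice(pos_idx, size=n_extra, replace=True)
    idx = np.concatenate([neg_idx, pos_idx, extra])
    return X[idx], y[idx]

X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)
X_bal, y_bal = random_oversample(X, y, r=1.0)
print(int(y_bal.sum()), len(y_bal))             # 950 positives, 1900 total
```

Note the cost: the balanced set is nearly twice the original size, and every added positive is an exact duplicate that a flexible model can memorise.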

This is where synthetic oversampling becomes important.

Synthetic Oversampling: SMOTE and Its Mathematical Intuition

SMOTE (Synthetic Minority Over-sampling Technique) creates new points by interpolating between existing minority samples. For a minority sample $x_i$, choose one of its $k$-nearest minority neighbours $x_{nn}$. A synthetic point is generated as:

xnew=xi+λ(xnn−xi),λ∼U(0,1)x_{new} = x_i + \lambda (x_{nn} – x_i), \quad \lambda \sim U(0,1)xnew​=xi​+λ(xnn​−xi​),λ∼U(0,1)

This equation is simple but powerful. It creates samples along line segments joining minority examples, effectively thickening the minority region in feature space. Instead of duplicating $x_i$, you create plausible intermediate points. The mathematical advantage is reduced overfitting risk compared to replication, while improving the learner's exposure to the minority manifold.
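The interpolation step fits in a few lines of NumPy. This is a minimal sketch using brute-force $k$-NN within the minority class (production code would use imbalanced-learn's `SMOTE` instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_points(X_min, k=5, n_new=100):
    """Generate synthetic minority points: x_new = x_i + lambda * (x_nn - x_i)."""
    n = len(X_min)
    # Pairwise distances among minority samples; mask self-distances.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]       # indices of k nearest minority neighbours
    synth = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                 # pick a minority sample x_i
        m = nn[i, rng.integers(k)]          # pick one of its k neighbours x_nn
        lam = rng.random()                  # lambda ~ U(0, 1)
        synth[j] = X_min[i] + lam * (X_min[m] - X_min[i])
    return synth

X_min = rng.normal(size=(20, 2))
new_pts = smote_points(X_min, k=3, n_new=50)
```

Because every synthetic point lies on a segment between two real minority points, it can never fall outside the bounding box of the minority class, which is both the strength (plausibility) and the weakness (no coverage of unseen minority regions) of plain SMOTE.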

However, SMOTE has limits. If the minority class overlaps heavily with the majority, interpolation can create ambiguous samples that actually increase misclassification. Variants like Borderline-SMOTE focus generation near the decision boundary, while SMOTE-NC adapts for mixed numerical and categorical features.

Combining Strategies and Preserving Evaluation Integrity

In real projects, hybrid strategies often work best. A common pipeline is SMOTE followed by undersampling (or vice versa) to reach a balanced ratio without exploding dataset size.

A crucial rule is: apply sampling only on the training split, never on the full dataset before train-test split. Otherwise, synthetic points leak information into validation, inflating performance. The correct workflow is:

  1. Split the dataset into train and test.
  2. Apply oversampling/undersampling only on training folds (inside cross-validation).
  3. Train the model on resampled training data.
  4. Evaluate on untouched validation/test data.
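Assuming scikit-learn is available, the four steps above can be sketched as follows; random oversampling stands in for any resampler, and the data here is synthetic for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# Hypothetical imbalanced problem: roughly 6% positives.
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=2000) > 2.3).astype(int)

# 1. Split FIRST, so the test set never contains resampled rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 2. Resample the training split only (random oversampling for brevity).
pos = np.flatnonzero(y_tr == 1)
extra = rng.choice(pos, size=int((y_tr == 0).sum()) - len(pos), replace=True)
X_res = np.vstack([X_tr, X_tr[extra]])
y_res = np.concatenate([y_tr, y_tr[extra]])

# 3. Train on resampled data; 4. evaluate on the untouched test split.
clf = LogisticRegression().fit(X_res, y_res)
print(recall_score(y_te, clf.predict(X_te)))
```

In cross-validation the same rule applies inside every fold; imbalanced-learn's `Pipeline` exists precisely to keep resampling confined to the training folds automatically.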

This workflow is typically emphasised in a data scientist course in Ahmedabad because it prevents a very common mistake: “perfect metrics” that vanish the moment the model goes live.

Conclusion

Imbalanced data requires more than a quick resampling function call. Undersampling reduces majority dominance but may drop valuable information. Oversampling improves minority representation but can overfit if done through duplication. Synthetic oversampling methods like SMOTE use interpolation to create realistic minority samples, yet must be applied carefully to avoid boundary noise and leakage.

A practical approach is to define the target ratio, resample only within training folds, and judge success using recall, precision, PR-AUC, and error analysis rather than accuracy alone. With these principles, you can build classifiers that perform well where it truly matters: on the rare, high-impact cases.