A Few Challenges in Machine Learning

VARSHITHA GUDIMALLA
May 14, 2021
Photo by Le Wagon on Unsplash
  1. Insufficient Quantity of Training Data
  2. Poor-Quality Data
  3. Irrelevant Features
  4. Overfitting the Training Data
  5. Underfitting the Training Data

Insufficient Quantity of Training Data

A baby can learn something after being told it once or a few times. For example, for a baby to learn what a ball is, all it takes is for us to point to a ball and say "ball" (once or repeatedly); the baby will then be able to recognize balls.
Machine Learning is not there yet: it takes a lot of data for most ML algorithms to work properly. Even for a simple problem we typically need thousands of examples, and for complex problems we may need many more.

Poor-Quality Data

If our training data is full of errors, outliers, and noise, it becomes harder for the system to detect the underlying patterns, so the model is less likely to perform well. It is often well worth the effort to spend time cleaning up the training data; most data scientists spend a significant part of their time doing just that.

For example,

• If some instances are outliers, we can simply discard them, or we can try to fix them manually.

• If some instances are missing a few features, we can decide whether to ignore the attribute altogether, ignore those instances, fill in the missing values, or train one model with the feature and one model without it, and so on (a minimal sketch of these options follows below).
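
To make these options concrete, here is a minimal sketch using pandas on a made-up toy dataset (the column names and values are purely illustrative):

```python
import numpy as np
import pandas as pd

# A toy dataset with an obvious outlier and a missing value
# (column names and values are made up for illustration).
df = pd.DataFrame({
    "size_sqft": [850, 900, 1200, 100000, 1100],  # 100000 looks like an outlier
    "price": [100.0, 110.0, 150.0, np.nan, 140.0],
})

# Option 1: discard outliers, here using the common 1.5 * IQR rule.
q1, q3 = df["size_sqft"].quantile([0.25, 0.75])
iqr = q3 - q1
df_no_outliers = df[df["size_sqft"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Option 2: fill in missing values, e.g. with the column median.
df_filled = df.assign(price=df["price"].fillna(df["price"].median()))

# Option 3: drop the instances that are missing the feature.
df_dropped = df.dropna(subset=["price"])
```

Which option is best depends on the data: dropping rows is safe when they are few, while filling in values keeps the dataset's size but injects an assumption.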

Irrelevant Features

As the saying goes: garbage in, garbage out. Our model will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of a Machine Learning project is selecting a good set of features to train on. This process is called feature engineering.

Feature engineering involves:

Feature selection: selecting the most relevant features to train our model on, among the existing features (see the sketch after this list).

Feature extraction: combining existing features to produce a more useful one.

Creating new features by gathering new data.
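
As a small illustration of the first two steps, here is a sketch using scikit-learn's built-in Iris dataset: SelectKBest keeps the individually most informative features, while PCA combines the existing features into new ones (the choice of k=2 and n_components=2 is arbitrary here):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Feature selection: keep the 2 features that score highest
# against the target on a univariate ANOVA F-test.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
print(X_selected.shape)  # (150, 2)

# Feature extraction: combine all 4 features into 2 new components.
X_extracted = PCA(n_components=2).fit_transform(X)
print(X_extracted.shape)  # (150, 2)
```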

Overfitting the Training Data

Say someone visits a foreign country and the taxi driver rips her off. She might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: the model performs well on the training data, but it does not perform well on the test data; in other words, it does not generalize well.
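
Here is a minimal sketch of overfitting, assuming scikit-learn and a small synthetic dataset: a degree-15 polynomial has enough capacity to memorize the noise in a handful of training points, so its training error is tiny while its test error is much larger:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)  # noisy sine curve

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A degree-15 polynomial can bend through nearly every training point,
# fitting the noise rather than the underlying trend.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))  # near zero
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))     # much larger
```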

Underfitting the Training Data

Underfitting is the opposite of overfitting. A model is said to underfit when it cannot capture the underlying trend of the data. Underfitting destroys the accuracy of a Machine Learning model. It usually occurs when we have too little data to train on, or when we try to fit a linear model to non-linear data. An underfitted model performs poorly on both the training and the test data.
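
A matching sketch for underfitting, again assuming scikit-learn and synthetic data: a straight line fitted to clearly quadratic data cannot capture the trend, so it scores poorly on the training and test sets alike:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.2, size=200)  # non-linear (quadratic) data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# A linear model is too simple for this data, so it underfits:
# the R^2 score is low on the training set and the test set alike.
model = LinearRegression().fit(X_train, y_train)
print("train R^2:", r2_score(y_train, model.predict(X_train)))  # close to 0
print("test R^2:", r2_score(y_test, model.predict(X_test)))     # also close to 0
```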


VARSHITHA GUDIMALLA

Computer Science Engineering Graduate || Machine Learning enthusiast.