How Is a Machine Learning Project Done?
6 Steps
- Look at the big picture.
- Get the data.
- Discover and visualize the data to gain insights.
- Prepare the data for Machine Learning algorithms.
- Select a model and train it.
- Fine-tune your model.
Looking at the Big Picture
We are now going to build a machine learning model of housing prices in California using the California census data. This data has features such as the population, median income, median housing price, and so on for each block in California.
Our model should learn from the data and be able to predict the median housing price in any block, when given all the other features.
Frame the Problem
The first question to ask ourselves is: what exactly is the business objective? What does the company expect us to do, and how will it benefit from this model? This is important because it determines how we frame the problem, which algorithms we select, which performance measure we use to evaluate our model, and how much effort we spend tweaking it. We found that our model's output will be fed to another Machine Learning system, along with many other signals. This downstream system will determine whether it is worth investing in a given area or not. Getting this right is critical, as it directly affects revenue.
The next question to ask is what the current solution looks like (if any). It will often give us a reference performance, as well as insights on how to solve the problem. We found that district housing prices are currently estimated manually by experts, which is costly and time-consuming, and their estimates are not great. This is why the company thinks it would be useful to train a model to predict a district's median housing price given other data about that district.
Okay, with all this information we are now ready to start designing our system. First, we need to frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should we use batch learning or online learning techniques?
Let's see: it is clearly a typical supervised learning task, since we are given labeled training examples. Moreover, it is also a typical regression task, since we are asked to predict a value. Finally, there is no continuous flow of data coming into the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.
Select a Performance Measure
A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors.
The mathematical formula to compute the RMSE:
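Written out (with $m$ the number of instances in the dataset, $\mathbf{x}^{(i)}$ and $y^{(i)}$ the feature vector and label of the $i$-th instance, and $h$ the model's prediction function), the standard definition is:

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\Big(h\big(\mathbf{x}^{(i)}\big) - y^{(i)}\Big)^{2}}$$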
Get the Data
You can download the dataset from the following link:
Loading the data
import pandas as pd
path = "datasets/housing.csv"
housing = pd.read_csv(path)
housing.head()  # head() returns the top five rows of the DataFrame
The info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.
housing.info()
There are 20,640 instances in the dataset. Notice that the total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts are missing this feature. We will need to take care of this later.
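We can double-check that count directly with pandas (isnull().sum() counts the missing entries in a column):
housing["total_bedrooms"].isnull().sum()  # 20640 - 20433 = 207 missing values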
All attributes are numerical, except the ocean_proximity field. Its type is object, so it could hold any kind of Python object, but since we loaded this data from a CSV file we know that it must be a text attribute. When we looked at the top five rows, we probably noticed that the values in the ocean_proximity column were repetitive, which means that it is probably a categorical attribute. We can find out what categories exist and how many districts belong to each category by using the value_counts() method:
housing["ocean_proximity"].value_counts()
Let’s look at the other fields. The describe() method shows a summary of the numerical attributes.
housing.describe()
The count, mean, min, and max rows are self-explanatory. Note that the null values are ignored. The std row shows the standard deviation. The 25%, 50%, and 75% rows show the corresponding percentiles. These are often called the 1st quartile, the median, and the 3rd quartile respectively.
Another way to get a feel for the data we are dealing with is to plot a histogram for each numerical attribute. A histogram shows the number of instances that fall within a given value range. We can either plot one attribute at a time, or call the hist() method on the whole dataset, and it will plot a histogram for each numerical attribute.
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
Create a Test Set
Creating a test set is simple: just pick some instances randomly, typically 20% of the dataset, and set them aside.
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
Discover and Visualize the Data to Gain Insights
housing = train_set.copy()  # work on a copy of the training set
Visualizing Geographical Data
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
Now let's look at the housing prices. The radius of each circle represents the district's population (option s), and the color represents the price (option c). We will use a predefined color map (option cmap) called jet, which ranges from blue (low prices) to red (high prices).
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=housing["population"]/100, label="population", figsize=(10,7),
c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True, )
plt.legend()
This plot tells us that the housing prices are very much related to the location and to the population density.
Looking for Correlations
Since the dataset is not too large, we can easily compute the standard correlation coefficient between every pair of attributes using the corr() method:
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
The correlation coefficient ranges from –1 to 1. When it is close to 1, there is a strong positive correlation; when it is close to –1, there is a strong negative correlation; and coefficients close to zero mean that there is no linear correlation.
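To make this concrete, here is a tiny toy example (purely illustrative, unrelated to the housing data) using NumPy's corrcoef:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
print(np.corrcoef(x, 2 * x)[0, 1])  # 1.0 -> perfect positive correlation
print(np.corrcoef(x, -2 * x)[0, 1])  # -1.0 -> perfect negative correlation
print(np.corrcoef(x, np.array([2, 5, 1, 4, 3]))[0, 1])  # 0.1 -> almost no linear correlation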
Another way to check for correlation between attributes is to use Pandas scatter_matrix function, which plots every numerical attribute against every other numerical attribute. Since there are now 11 numerical attributes, we would get 121 plots, which would not fit, so let’s just focus on a few promising attributes that seem most correlated with the median housing value.
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
Experimenting with Attribute Combinations
One last thing we may want to do before preparing the data for Machine Learning algorithms is to try out various attribute combinations. For example, the total number of rooms in a district is not very useful if we don't know how many households there are. What we really want is the number of rooms per household. Let's create a few new attributes along these lines:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
And now let’s look at the correlation matrix again:
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
The new bedrooms_per_room attribute is much more correlated with the median house value than the total number of rooms or bedrooms. Apparently houses with a lower bedroom/room ratio tend to be more expensive. The number of rooms per household is also more informative than the total number of rooms in a district — obviously the larger the houses, the more expensive they are.
Prepare the Data for Machine Learning Algorithms
It's time to prepare the data for our Machine Learning algorithms. Let's separate the predictors and the labels:
housing_features = housing.drop("median_house_value", axis=1)  # the predictors, generally taken as X
housing_labels = housing["median_house_value"].copy()  # the labels, generally taken as y
housing_labels.head()
Data Cleaning
Most Machine Learning algorithms cannot work with missing features, so let’s take care of them. We noticed earlier that the total_bedrooms attribute has some missing values, so let’s fix this. We have three options:
- Get rid of the corresponding districts.
- Get rid of the whole attribute.
- Set the values to some value (zero, the mean, the median, etc.).
We can do these easily using DataFrame’s dropna(), drop(), and fillna() methods, respectively.
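For reference, here is a minimal sketch of what each option would look like (we only apply option 3 below, so treat the first two lines as illustrations rather than steps to run):
housing_features.dropna(subset=["total_bedrooms"])  # option 1: drop districts with missing values
housing_features.drop("total_bedrooms", axis=1)  # option 2: drop the whole attribute
housing_features["total_bedrooms"].fillna(housing_features["total_bedrooms"].median())  # option 3: fill with the median
Note that these calls return new objects and do not modify housing_features in place.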
We will go with option 3. We should compute the median value on the training set and use it to fill the missing values in the training set. We also need to save the median we computed: we will need it later to replace missing values in the test set when we evaluate our model, and again to replace missing values in new data once the model goes live.
median = housing_features["total_bedrooms"].median()  # option 3: impute with the median
housing_features["total_bedrooms"] = housing_features["total_bedrooms"].fillna(median)
housing_features.head()
Handling Text and Categorical Attributes
housing_cat = housing_features[["ocean_proximity"]]
housing_cat.head()
We can see that the ocean_proximity column contains categorical values.
Most Machine Learning algorithms prefer to work with numbers anyway, so let’s convert these categories from text to numbers. For this, we can use Scikit-Learn’s OrdinalEncoder class.
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]
housing_features["ocean_proximity"] = housing_cat_encoded.ravel()  # flatten the (n, 1) array into a 1-D column
housing_features.head()
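To check which text category each integer stands for, we can look at the encoder's categories_ attribute (a standard attribute of fitted Scikit-Learn encoders):
ordinal_encoder.categories_  # one array of category names, in encoded order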
Feature Scaling
One of the most important transformations we need to apply to our data is feature scaling. With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very different scales. This is true for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Note that scaling the target values is generally not required.
There are two common ways to get all attributes to have the same scale: min-max scaling and standardization.
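Min-max scaling (often called normalization) shifts and rescales values so that they end up ranging from 0 to 1. As a minimal sketch (not used in the rest of this post), Scikit-Learn's MinMaxScaler does this for us:
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
housing_minmax = min_max_scaler.fit_transform(housing_features)  # each column rescaled to [0, 1]; not used below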
We will be using standardization for now:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
housing_features = sc.fit_transform(housing_features)  # returns a NumPy array
housing_features = pd.DataFrame(housing_features)
housing_features.head()
Select and Train a Model
At last! We framed the problem, we got the data and explored it, we sampled a training set and a test set, and we cleaned up and prepared our data for Machine Learning algorithms. We are now ready to select and train a Machine Learning model.
Training and Evaluating on the Training Set
Let’s first train a Linear Regression model
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_features, housing_labels)
Let’s measure this regression model’s RMSE on the whole training set using Scikit-Learn’s mean_squared_error function:
from sklearn.metrics import mean_squared_error
import numpy as np
housing_predictions = lin_reg.predict(housing_features)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
Okay, this is clearly not a great score. This is an example of a model underfitting the training data. When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough. As we know, the main ways to fix underfitting are to select a more powerful model, to feed the training algorithm better features, or to reduce the constraints on the model. This model is not regularized, so that rules out the last option. We could try to add more features, but first let's try a more complex model to see how it does.
Let’s train a DecisionTreeRegressor.
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_features, housing_labels)
housing_predictions = tree_reg.predict(housing_features)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
Wait! Could this model really be absolutely perfect? It is far more likely that the model has badly overfit the data. How can we be sure? As we saw earlier, we don't want to touch the test set until we are ready to launch a model we are confident about, so we need to use part of the training set for training and part for model validation.
Better Evaluation Using Cross-Validation
We can use Scikit-Learn's K-fold cross-validation feature. The following code randomly splits the training set into 10 distinct subsets called folds, then trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_features, housing_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
print("Scores:", tree_rmse_scores)
print("Mean:", tree_rmse_scores.mean())
print("Standard deviation:", tree_rmse_scores.std())
Now the Decision Tree doesn’t look as good as it did earlier. In fact, it seems to perform worse than the Linear Regression model.
Let’s compute the same scores for the Linear Regression model:
lin_scores = cross_val_score(lin_reg, housing_features, housing_labels, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
print("Scores:", lin_rmse_scores)
print("Mean:", lin_rmse_scores.mean())
print("Standard deviation:", lin_rmse_scores.std())
That’s right: the Decision Tree model is overfitting so badly that it performs worse than the Linear Regression model.
Let’s try one last model now: the RandomForestRegressor.
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_features, housing_labels)
housing_predictions = forest_reg.predict(housing_features)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
from sklearn.model_selection import cross_val_score
scores = cross_val_score(forest_reg, housing_features, housing_labels, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-scores)
print("Scores:", forest_rmse_scores)
print("Mean:", forest_rmse_scores.mean())
print("Standard deviation:", forest_rmse_scores.std())
Wow, this is much better: Random Forests look very promising. However, note that the score on the training set is still much lower than on the validation sets, meaning that the model is still overfitting the training set.
Fine-Tune Your Model
Let's assume that we now have a shortlist of promising models. We now need to fine-tune them. Let's look at a few ways we can do that.
Grid Search
One way to do this would be to fiddle with the hyperparameters manually until we stumble upon a great combination of values, which is practically impossible. Instead, we should get Scikit-Learn's GridSearchCV to search for us.
from sklearn.model_selection import GridSearchCV
param_grid = [
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_features, housing_labels)
We can get the best combination of parameters (and the corresponding estimator) like this:
grid_search.best_params_
grid_search.best_estimator_
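We can also inspect the evaluation score of every combination that was tried through the cv_results_ attribute (its "mean_test_score" values are negative MSEs because of the scoring we chose, so we flip the sign and take the square root to get RMSEs):
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)  # RMSE for each hyperparameter combination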
Evaluate Our System on the Test Set
After tweaking our models for a while, we eventually have a system that performs sufficiently well. Now is the time to evaluate the final model on the test set.
final_model = grid_search.best_estimator_
test_set["rooms_per_household"] = test_set["total_rooms"]/test_set["households"]
test_set["bedrooms_per_room"] = test_set["total_bedrooms"]/test_set["total_rooms"]
test_set["population_per_household"]=test_set["population"]/test_set["households"]
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()
y_test.head()
# Reuse the statistics and transformers fitted on the training set (no refitting on test data)
X_test["total_bedrooms"] = X_test["total_bedrooms"].fillna(median)  # the training-set median saved earlier
X_test_cat = X_test[["ocean_proximity"]]
X_test_cat_encoded = ordinal_encoder.transform(X_test_cat)  # transform, not fit_transform
X_test["ocean_proximity"] = X_test_cat_encoded.ravel()
X_test = sc.transform(X_test)  # transform, not fit_transform
X_test = pd.DataFrame(X_test)
final_predictions = final_model.predict(X_test)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
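Optionally, to get an idea of how precise this test-set estimate is, we can compute a 95% confidence interval for the generalization error (a sketch using scipy.stats; this is an extra step, not required for the project):
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                    loc=squared_errors.mean(),
                                    scale=stats.sem(squared_errors)))
interval  # lower and upper bounds on the test RMSE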
In this blog, we just aimed to see how a real Machine Learning project is done, from framing the problem all the way to evaluating the final model on the test set.