Objective
To implement simple linear regression and predict house prices based on area using Python and scikit-learn. Linear regression is one of the most fundamental machine learning algorithms and serves as a foundation for understanding more complex models.
This experiment introduces supervised learning, regression problems, model training, evaluation metrics, and data visualization. Students will learn to build, train, and evaluate a machine learning model with scikit-learn, covering the complete ML workflow from loading data to interpreting results.
Theory
Linear regression models the relationship between a dependent variable (target) and one or more independent variables (features) using a linear equation. In simple linear regression, we have one feature and one target variable.
The linear regression model assumes: y = mx + c, where:
- y is the target variable (house price)
- x is the feature variable (area)
- m is the slope (coefficient)
- c is the y-intercept
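The algorithm finds the values of m and c that minimize the sum of squared errors between predicted and actual values. For simple linear regression, the least squares method gives these values in closed form; gradient descent reaches the same minimum iteratively. scikit-learn's LinearRegression uses ordinary least squares.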
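As a minimal sketch of the least squares idea, the slope can be computed as the covariance of x and y divided by the variance of x, and the intercept follows because the fitted line passes through the point of means. The function name fit_simple_linear_regression and the three data points are illustrative assumptions, not part of the experiment's code.

```python
import numpy as np


def fit_simple_linear_regression(x, y):
    """Return (m, c) that minimize the sum of squared errors for y = m*x + c."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of co-deviations divided by sum of squared x deviations
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the least squares line passes through (x_mean, y_mean)
    c = y_mean - m * x_mean
    return m, c


# Illustrative points only (area in sq ft, price in lakhs)
area = [1200, 1500, 1800]
price = [45, 55, 65]

m, c = fit_simple_linear_regression(area, price)
print(f"price = {m:.4f} * area + {c:.2f}")  # prints: price = 0.0333 * area + 5.00
```

On real data the points will not lie exactly on a line, so the printed equation describes the best compromise rather than an exact fit.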
Dataset
Use a CSV file with two columns, area (in square feet) and price (in lakhs; 1 lakh = 100,000 rupees). The code in this experiment expects the column headers to be exactly area and price. The dataset should have at least 20-30 data points so that both the training and test sets are meaningful. You can create synthetic data or use real estate datasets from sources like Kaggle.
Example dataset structure:
| Area (sq ft) | Price (lakhs) |
|---|---|
| 1200 | 45 |
| 1500 | 55 |
| 1800 | 65 |
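
If no real dataset is at hand, a synthetic one can be generated. The sketch below is one possible way to do it: the underlying relationship (price = 0.03 * area + 10), the noise level, and the area range are assumptions chosen only for illustration. The file name house_prices.csv and the column names area and price match what the code in this experiment reads.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed so the file is reproducible

n_samples = 30
# Areas between 800 and 2999 sq ft (upper bound of integers() is exclusive)
area = rng.integers(800, 3000, size=n_samples)
# Assumed relationship for illustration only, plus Gaussian noise
noise = rng.normal(loc=0.0, scale=3.0, size=n_samples)
price = np.round(0.03 * area + 10 + noise, 2)  # in lakhs

df = pd.DataFrame({"area": area, "price": price})
df.to_csv("house_prices.csv", index=False)
print(df.head())
```

Because the data is generated from a known line, the fitted slope and intercept should land near 0.03 and 10, which is a convenient sanity check.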
Steps (Detailed Algorithm)
- Load the CSV dataset – Use pandas.read_csv() to load the data. Inspect it with head(), info(), and describe() to understand its structure and check for missing values.
- Split into feature and target – Separate the dataset into the feature matrix X (area) and the target vector y (price). Use df[['area']] for the features and df['price'] for the target. The double brackets already give X the 2-D shape scikit-learn expects; reshape with .values.reshape(-1, 1) only if you selected the column as a 1-D Series, for example df['area'].
- Split into train and test sets – Use train_test_split() from scikit-learn with a typical split of 80% training and 20% testing. Set random_state for reproducibility. The held-out test set lets you evaluate the model on unseen data.
- Fit LinearRegression model – Create a LinearRegression() object and call fit(X_train, y_train) to train it. The model learns the slope and intercept that best fit the training data.
- Predict on test data – Use predict(X_test) to generate predictions for the test set. Compare the predicted values with the actual values to evaluate model performance.
- Compute evaluation metrics – Calculate Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R²-score using functions from sklearn.metrics. These metrics quantify how far the predictions are from the actual prices.
- Visualize results – Plot the training data points, test data points, and the regression line using matplotlib. The plot shows how well the model fits the data and makes patterns or outliers easy to spot.
Complete Python Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Step 1: Load dataset
df = pd.read_csv('house_prices.csv')
print("Dataset shape:", df.shape)
print(df.head())
print(df.describe())
# Step 2: Prepare features and target
X = df[['area']].values # Feature: area
y = df['price'].values # Target: price
# Step 3: Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Step 4: Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Get model parameters
print(f"Slope (m): {model.coef_[0]:.2f}")
print(f"Intercept (c): {model.intercept_:.2f}")
print(f"Equation: price = {model.coef_[0]:.2f} * area + {model.intercept_:.2f}")
# Step 5: Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Step 6: Evaluate model
mae = mean_absolute_error(y_test, y_test_pred)
mse = mean_squared_error(y_test, y_test_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_test_pred)
print("\nEvaluation Metrics:")
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")
# Step 7: Visualize results
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', alpha=0.5, label='Training Data')
plt.scatter(X_test, y_test, color='green', alpha=0.5, label='Test Data')
plt.plot(X_train, y_train_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Area (sq ft)')
plt.ylabel('Price (lakhs)')
plt.title('House Price Prediction using Linear Regression')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
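Once a model is fitted, predicting the price of a house that is not in the dataset is a single predict() call. The sketch below is self-contained: it fits on three illustrative rows as a stand-in for the CSV data, and the new areas are hypothetical values. Note that predict() expects a 2-D array with one row per house.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative stand-in for the CSV data (area in sq ft, price in lakhs)
X = np.array([[1200], [1500], [1800]])
y = np.array([45, 55, 65])

model = LinearRegression().fit(X, y)

# Hypothetical new houses; shape (n_houses, 1) to match the training features
new_areas = np.array([[1000], [2000]])
predicted = model.predict(new_areas)

for a, p in zip(new_areas[:, 0], predicted):
    print(f"Area {a} sq ft -> predicted price {p:.2f} lakhs")
```

Predictions for areas far outside the range seen during training are extrapolations and should be treated with caution.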
Expected Output
- Regression equation: price = m * area + c with specific numerical values for m and c. The equation shows the learned relationship between area and price.
- Error metrics: MAE, MSE, RMSE, and R²-score values. Lower MAE, MSE, and RMSE indicate better performance. An R²-score close to 1.0 indicates a good fit.
- Visualization: A scatter plot showing data points and the best-fit regression line. The line should pass through the center of the data points, indicating a good model fit.
Understanding Evaluation Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| MAE | Mean of \|actual - predicted\| | Average absolute error, easy to interpret |
| MSE | Mean of (actual - predicted)² | Penalizes large errors more, in squared units |
| RMSE | √MSE | Same units as target, most commonly used |
| R² Score | 1 - (SS_res / SS_tot) | Proportion of variance explained; at most 1, negative if worse than predicting the mean |
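
The formulas in the table can be checked by computing each metric by hand with NumPy and comparing against sklearn.metrics. The actual and predicted values below are made-up numbers used only to illustrate the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration (prices in lakhs)
y_true = np.array([45.0, 55.0, 65.0, 80.0])
y_pred = np.array([47.0, 54.0, 68.0, 76.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                    # mean of |actual - predicted|
mse = np.mean(errors ** 2)                       # mean of (actual - predicted)^2
rmse = np.sqrt(mse)                              # back in the units of the target
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.2f}")    # 2.50
print(f"MSE:  {mse:.2f}")    # 7.50
print(f"RMSE: {rmse:.2f}")   # 2.74
print(f"R²:   {r2:.4f}")     # 0.9551

# The hand-computed values agree with scikit-learn's implementations
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```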
Frequently Asked Questions
Q1: What if the relationship between area and price is not linear?
If the relationship is non-linear, simple linear regression won't work well. You can try polynomial regression, transform the features (log, square root), or use non-linear models. Check the scatter plot first to see if a linear relationship exists.
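One way to try polynomial regression is to chain scikit-learn's PolynomialFeatures with LinearRegression in a pipeline. The sketch below assumes synthetic, deliberately curved data (the coefficients and noise level are invented for illustration) and compares a straight-line fit with a degree-2 fit on the same points.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved relationship, for illustration only
rng = np.random.default_rng(0)
area = np.linspace(800, 3000, 40)
price = 10 + 0.01 * area + 8e-6 * area ** 2 + rng.normal(0, 2, size=40)
X = area.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# Straight-line model
linear = LinearRegression().fit(X, price)

# Degree-2 model: adds an area^2 column, then fits ordinary linear regression
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, price)

# R² on the training data; a fair comparison would use a held-out test set
linear_r2 = linear.score(X, price)
poly_r2 = poly.score(X, price)
print(f"Linear R²:   {linear_r2:.4f}")
print(f"Degree-2 R²: {poly_r2:.4f}")
```

Higher degrees always fit the training data at least as well, so the degree should be chosen by performance on test data, not training data.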
Q2: How do I interpret the R² score?
R² is at most 1. R² = 1 means perfect predictions, and R² = 0 means the model is no better than always predicting the mean of the target. R² is negative when the model does worse than that baseline, which can happen on test data. R² > 0.7 is often treated as good, but the threshold depends on the problem domain.
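The three reference points for R² can be reproduced with r2_score on a tiny example. The values below are illustrative: perfect predictions, a constant prediction equal to the mean, and predictions that are worse than the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([45.0, 55.0, 65.0])

# Perfect predictions: no residual error
perfect = r2_score(y_true, y_true)

# Always predicting the mean of the target: the baseline R² compares against
mean_only = r2_score(y_true, np.full(3, y_true.mean()))

# Predictions in reversed order: worse than the mean baseline
worse = r2_score(y_true, np.array([65.0, 55.0, 45.0]))

print(perfect)    # 1.0
print(mean_only)  # 0.0
print(worse)      # -3.0
```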
Q3: Why do we split data into train and test sets?
Evaluating on the same data used for training gives overly optimistic results, because a model that has overfitted (fitted the noise in the training data) would still score well. Testing on unseen data gives a realistic estimate of how the model will perform on new data, which is the real goal of machine learning.