ML Lab – Linear Regression for House Price Prediction

Objective

To implement simple linear regression and predict house prices based on area using Python and scikit-learn. Linear regression is one of the most fundamental machine learning algorithms and serves as a foundation for understanding more complex models.

This experiment introduces supervised learning, regression problems, model training, evaluation metrics, and data visualization. Students will build, train, and evaluate a machine learning model end to end with scikit-learn, covering the complete ML workflow from loading data to visualizing results.

Theory

Linear regression models the relationship between a dependent variable (target) and one or more independent variables (features) using a linear equation. In simple linear regression, we have one feature and one target variable.

The linear regression model assumes: y = mx + c, where:

y is the target variable (house price)
x is the feature variable (area)
m is the slope (coefficient)
c is the y-intercept

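The algorithm finds the values of m and c that minimize the sum of squared errors between predicted and actual values. For linear regression this minimum has a closed-form solution (ordinary least squares), which is what scikit-learn's LinearRegression computes. Gradient descent is an iterative alternative that converges toward the same solution.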
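
As a minimal sketch of the least squares idea, the closed-form estimates for simple linear regression can be computed directly with NumPy. The function name fit_line is hypothetical, and the three data points are the rows from the example dataset structure in the Dataset section.

```python
import numpy as np

def fit_line(x, y):
    """Return (m, c) minimizing the sum of squared errors for y = m*x + c."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: places the line through the point (x_mean, y_mean)
    c = y_mean - m * x_mean
    return m, c

area = [1200, 1500, 1800]   # sq ft
price = [45, 55, 65]        # lakhs
m, c = fit_line(area, price)
print(f"m = {m:.4f}, c = {c:.2f}")  # m = 0.0333, c = 5.00
```

These three points lie exactly on a line, so the fitted equation reproduces them: 0.0333 * 1200 + 5 is about 45.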

Dataset

Use a CSV file with two columns: area (in square feet) and price (in lakhs). The dataset should have sufficient data points (at least 20-30) for meaningful training and testing. You can create synthetic data or use real estate datasets from sources like Kaggle.

Example dataset structure:

Area (sq ft)	Price (lakhs)
1200	45
1500	55
1800	65
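
If no real dataset is at hand, a synthetic one can be generated along these lines. This is only a sketch: the function name make_synthetic_houses, the area range, the assumed trend (5 + area / 30), and the noise level are all made up for illustration and are not calibrated to any real market.

```python
import numpy as np
import pandas as pd

def make_synthetic_houses(n_rows=30, seed=42):
    """Build a toy area/price table; trend and noise level are assumptions."""
    rng = np.random.default_rng(seed)
    area = rng.integers(800, 3001, size=n_rows)      # sq ft, 800 to 3000 inclusive
    noise = rng.normal(0.0, 3.0, size=n_rows)        # lakhs
    price = np.round(5.0 + area / 30.0 + noise, 1)   # assumed linear trend plus noise
    return pd.DataFrame({"area": area, "price": price})

df = make_synthetic_houses()
print(df.head())
# Uncomment to save under the file name used in the lab code:
# df.to_csv("house_prices.csv", index=False)
```

Fixing the seed keeps the generated table identical across runs, which makes results comparable between students.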

Steps (Detailed Algorithm)

Load the CSV dataset – Use pandas.read_csv() to load the data. Inspect the data using head(), info(), and describe() to understand its structure and check for missing values.
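Split into feature and target – Separate the dataset into feature matrix X (area) and target vector y (price). Use df[['area']] for features (the double brackets return a 2-D table, which scikit-learn expects) and df['price'] for the target. Reshaping with .values.reshape(-1, 1) is needed only when the feature was selected as a 1-D column, for example df['area'].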
Split into train and test sets – Use train_test_split() from scikit-learn with a typical split of 80% training and 20% testing. Set random_state for reproducibility. This ensures the model is evaluated on unseen data.
Fit LinearRegression model – Create a LinearRegression() object and call fit(X_train, y_train) to train the model. The model learns the coefficients (slope and intercept) that best fit the training data.
Predict on test data – Use predict(X_test) to generate predictions for the test set. Compare predicted values with actual values to evaluate model performance.
Compute evaluation metrics – Calculate Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R²-score using functions from sklearn.metrics. These metrics help understand how well the model performs.
Visualize results – Plot the training data points, test data points, and the regression line using matplotlib. This visual representation helps understand how well the model fits the data and identifies any patterns or outliers.
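
The missing-value check in step 1 and the shape requirement in step 2 can be sketched on a tiny table. The values below are illustrative only, with one price deliberately left missing; dropping incomplete rows is just the simplest way to handle it.

```python
import numpy as np
import pandas as pd

# Tiny illustrative table with one deliberately missing price
df = pd.DataFrame({
    "area": [1200, 1500, 1800, 2100],
    "price": [45.0, 55.0, np.nan, 75.0],
})

print(df.isna().sum())      # count of missing values per column
df = df.dropna()            # simplest option: drop incomplete rows

X = df[["area"]]            # double brackets -> 2-D, shape (3, 1)
y = df["price"]             # single brackets -> 1-D, shape (3,)
print(X.shape, y.shape)

# A 1-D selection must be reshaped before use as a feature matrix
X_alt = df["area"].values.reshape(-1, 1)
print(X_alt.shape)          # (3, 1)
```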

Complete Python Code


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Step 1: Load dataset
df = pd.read_csv('house_prices.csv')
print("Dataset shape:", df.shape)
print(df.head())
print(df.describe())

# Step 2: Prepare features and target
X = df[['area']].values  # Feature: area
y = df['price'].values   # Target: price

# Step 3: Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Get model parameters (slope shown to 4 decimals: it is small when price is in lakhs per sq ft)
print(f"Slope (m): {model.coef_[0]:.4f}")
print(f"Intercept (c): {model.intercept_:.2f}")
print(f"Equation: price = {model.coef_[0]:.4f} * area + {model.intercept_:.2f}")

# Step 5: Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Step 6: Evaluate model
mae = mean_absolute_error(y_test, y_test_pred)
mse = mean_squared_error(y_test, y_test_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_test_pred)

print("\nEvaluation Metrics:")
print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")

# Step 7: Visualize results
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', alpha=0.5, label='Training Data')
plt.scatter(X_test, y_test, color='green', alpha=0.5, label='Test Data')
plt.plot(X_train, y_train_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Area (sq ft)')
plt.ylabel('Price (lakhs)')
plt.title('House Price Prediction using Linear Regression')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Expected Output

Regression equation: price = m * area + c with specific numerical values for m and c. The equation shows the learned relationship between area and price.
Error metrics: MAE, MSE, RMSE, and R²-score values. Lower MAE, MSE, and RMSE indicate better performance. R²-score close to 1.0 indicates a good fit.
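Visualization: A scatter plot showing data points and the best-fit regression line. A least-squares line always passes through the mean of the training data, so its position alone says little. A good fit shows points scattered closely and evenly around the line, with no systematic curve.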
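
Once the equation is learned, the model can estimate the price of a house it has not seen. The sketch below fits on the three example rows from the Dataset section, so its numbers will differ from those of a full dataset; the 1600 sq ft query is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The three example rows from the Dataset section
X = np.array([[1200], [1500], [1800]])   # area in sq ft
y = np.array([45, 55, 65])               # price in lakhs

model = LinearRegression().fit(X, y)

# predict() expects a 2-D array: one row per house, one column per feature
new_area = np.array([[1600]])
predicted = model.predict(new_area)[0]

print(f"price = {model.coef_[0]:.4f} * area + {model.intercept_:.2f}")
print(f"Predicted price for 1600 sq ft: {predicted:.2f} lakhs")
```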

Understanding Evaluation Metrics

Metric	Formula	Interpretation
MAE	Mean of |actual - predicted|	Average absolute error, easy to interpret
MSE	Mean of (actual - predicted)²	Penalizes large errors more, in squared units
RMSE	√MSE	Same units as target, most commonly used
R² Score	1 - (SS_res / SS_tot)	Proportion of variance explained; at most 1, negative if worse than predicting the mean
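
The formulas in the table can be checked by hand with NumPy. The actual and predicted values below are hypothetical, chosen so the arithmetic is easy to follow; the results should agree with the corresponding functions in sklearn.metrics.

```python
import numpy as np

y_true = np.array([45.0, 55.0, 65.0])   # actual prices (illustrative)
y_pred = np.array([47.0, 54.0, 66.0])   # hypothetical model predictions

errors = y_true - y_pred                 # [-2, 1, -1]
mae = np.mean(np.abs(errors))            # mean of |actual - predicted|
mse = np.mean(errors ** 2)               # mean of squared errors
rmse = np.sqrt(mse)                      # back in the units of the target
ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
# MAE=1.3333 MSE=2.0000 RMSE=1.4142 R2=0.9700
```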

Frequently Asked Questions

Q1: What if the relationship between area and price is not linear?

If the relationship is non-linear, simple linear regression won't work well. You can try polynomial regression, transform the features (log, square root), or use non-linear models. Check the scatter plot first to see if a linear relationship exists.
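
A minimal sketch of the polynomial regression option, assuming a made-up curved relationship in which price grows with the square of area. The data is synthetic and noise-free, and both scores are computed on the training data only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up curved relationship: price grows with the square of area
area = np.linspace(500, 3000, 30).reshape(-1, 1)
price = 20 + 1e-5 * area.ravel() ** 2

# Straight-line fit for comparison
linear = LinearRegression().fit(area, price)

# Degree-2 polynomial regression: expand area into [area, area^2],
# then fit an ordinary linear model on the expanded features
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly.fit(area, price)

r2_linear = linear.score(area, price)
r2_poly = poly.score(area, price)
print(f"Linear R2:   {r2_linear:.4f}")
print(f"Degree-2 R2: {r2_poly:.4f}")
```

The degree-2 model should score higher here because it matches the shape of the data. On real data, compare the two on a held-out test set rather than on the training data.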

Q2: How do I interpret the R² score?

R² is at most 1 and usually falls between 0 and 1, but it can be negative when the model performs worse than always predicting the mean. R² = 1 means perfect predictions, and R² = 0 means the model is no better than predicting the mean of the target. An R² above 0.7 is often treated as good, but the threshold depends on the problem domain.
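
The three reference cases (perfect, mean-only, and worse than the mean) can be reproduced with r2_score. The prediction lists below are hypothetical, chosen to hit each case exactly.

```python
from sklearn.metrics import r2_score

y_true = [45, 55, 65]

r2_perfect = r2_score(y_true, [45, 55, 65])    # predictions match exactly
r2_mean = r2_score(y_true, [55, 55, 55])       # always predicting the mean
r2_reversed = r2_score(y_true, [65, 55, 45])   # trend predicted backwards

print(r2_perfect)    # 1.0
print(r2_mean)       # 0.0
print(r2_reversed)   # -3.0
```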

Q3: Why do we split data into train and test sets?

Evaluating on the same data used for training gives overly optimistic results, because a model that has overfit (memorized noise in the training data) still scores well on that data. Testing on unseen data gives a realistic estimate of how the model will perform on new data, which is the real goal of machine learning.