Laptop Price Prediction


This project aims to predict laptop prices based on various features extracted from a dataset of laptop specifications. The project involves data preprocessing, feature engineering, and building a predictive model using machine learning techniques.

Data Processing notebook: here
Model Training notebook: here

Table of Contents

Introduction
Dataset
Data Preprocessing
Modeling
Conclusion
How to Run

Introduction

Accurately predicting laptop prices can assist consumers in making informed purchasing decisions and help retailers optimize their pricing strategies. This project involves preprocessing a dataset containing laptop specifications and building a machine learning model to predict laptop prices.

Dataset

The dataset contains various specifications of laptops, including: Company, TypeName, Inches, ScreenResolution, Cpu, Ram, Memory, Gpu, OpSys, Weight, and the target variable Price.

Note: Due to confidentiality, the dataset is not included in this repository. Ensure you have access to the dataset file named laptopData.csv.

Data Preprocessing

The raw data was cleaned and transformed to ensure it was suitable for machine learning. Key steps included:

1. Handling Missing Values

Feature             Missing values (%)
Company             2.302379
TypeName            2.302379
Inches              2.302379
ScreenResolution    2.302379
Cpu                 2.302379
Ram                 2.302379
Memory              2.302379
Gpu                 2.302379
OpSys               2.302379
Weight              2.302379
Price               2.302379

Interestingly, all the columns have the same percentage of missing values, which suggests that the missing values occur in related places rather than being randomly distributed. A little exploration of the dataframe revealed that the missing values are in fact entire missing rows; in that case, the cleaning procedure I adopt is simply to drop those rows, with no risk of losing additional information.

2. Cleaning Columns

  Company TypeName Inches ScreenResolution Cpu Ram Memory Gpu OpSys Weight Price
0 Apple Ultrabook 13.3 IPS Panel Retina Display 2560x1600 Intel Core i5 2.3GHz 8GB 128GB SSD Intel Iris Plus Graphics 640 macOS 1.37kg 71378.6832
1 Apple Ultrabook 13.3 1440x900 Intel Core i5 1.8GHz 8GB 128GB Flash Storage Intel HD Graphics 6000 macOS 1.34kg 47895.5232
2 HP Notebook 15.6 Full HD 1920x1080 Intel Core i5 7200U 2.5GHz 8GB 256GB SSD Intel HD Graphics 620 No OS 1.86kg 30636.0000
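
The numeric columns arrive as strings with unit suffixes (for example '8GB' and '1.37kg'). A minimal sketch of how they might be cleaned, assuming the running dataframe is called dataset, as in the dropna step later:

  # Sketch: strip the unit suffixes so 'Ram' and 'Weight' become numeric.
  # (Assumes every row carries the 'GB'/'kg' suffix, as in the preview above.)
  dataset['Ram'] = dataset['Ram'].str.replace('GB', '', regex=False).astype(int)
  dataset['Weight'] = dataset['Weight'].str.replace('kg', '', regex=False).astype(float)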

3. Processing ‘ScreenResolution’ Column

The ‘ScreenResolution’ column contains noisy and inconsistent data. To organize it effectively, we extract and categorize the key pieces of information (a parsing sketch follows the list):

Panel Type: Examples include IPS Panel and Touchscreen.

Resolution: Common formats include 1920x1080 and 2560x1600.

Additional Features: Such as Retina Display and 4K Ultra HD.
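
A minimal sketch of how these pieces might be extracted with pandas string methods (the new column names here are illustrative, not necessarily the project's):

  # Pull the 'WIDTHxHEIGHT' token out with a regular expression.
  # (Assumes every row contains a resolution, as in the preview above.)
  res = dataset['ScreenResolution'].str.extract(r'(\d+)x(\d+)')
  dataset['X_res'] = res[0].astype(int)
  dataset['Y_res'] = res[1].astype(int)

  # Binary flags for the panel type and extra features.
  dataset['IPS Panel'] = dataset['ScreenResolution'].str.contains('IPS').astype(int)
  dataset['Touchscreen'] = dataset['ScreenResolution'].str.contains('Touchscreen').astype(int)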

4. Processing ‘Cpu’ Column, ‘Gpu’ Column and ‘Memory’ Column

Here is an example of the processing of the ‘Cpu’ column.
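
A hedged sketch, assuming strings of the form 'Intel Core i5 2.3GHz' as in the preview above (the derived column names are mine):

  # 'Intel Core i5 2.3GHz' -> brand / model / clock speed.
  dataset['Cpu Brand'] = dataset['Cpu'].str.split().str[0]                    # 'Intel'
  dataset['Cpu Speed (GHz)'] = dataset['Cpu'].str.extract(r'([\d.]+)GHz')[0].astype(float)  # 2.3
  dataset['Cpu Model'] = (dataset['Cpu']
                          .str.replace(r'\s*[\d.]+GHz', '', regex=True)       # drop the speed
                          .str.split(n=1).str[1])                             # drop the brand -> 'Core i5'

The ‘Gpu’ and ‘Memory’ columns are handled analogously (brand and series for the GPU; storage type and size for ‘Memory’).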

5. Final Dataset

After handling missing values and cleaning the data, we finalize the dataset.

  dataset.dropna(subset=['CPU Model Number', 'Gpu Type', 'Main Storage Type', 'Main Storage Size'], inplace=True)
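
Assuming the same flow as the How to Run section below, the cleaned frame is then written out for the modeling stage:

  # Persist the cleaned data; the modeling script reads this file.
  dataset.to_csv('edited_dataframe.csv', index=False)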

Modeling

For this project, I used Python libraries including pandas, NumPy, Matplotlib, seaborn, and scikit-learn.

1. One-Hot Encoding

One-hot encoding is a technique used to transform categorical data into a numerical format that machine learning models can understand. It creates binary (0 or 1) columns for each unique category in a categorical variable. This ensures that the model treats these categories as distinct and unrelated.

However, we need to be careful while one-hot encoding the CPU Brand and GPU Brand, because in certain instances they might share the same company. This would needlessly result in two identical columns, which might potentially confuse the machine learning model. To avoid this issue, we add prefixes as shown below.
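
A sketch with pandas.get_dummies (the exact brand-column names are assumptions based on the columns derived earlier):

  import pandas as pd

  # Prefixes keep, e.g., an Intel CPU column distinct from an Intel GPU column.
  cpu_dummies = pd.get_dummies(dataset['Cpu Brand'], prefix='CPU')
  gpu_dummies = pd.get_dummies(dataset['Gpu Brand'], prefix='GPU')
  dataset = pd.concat([dataset.drop(columns=['Cpu Brand', 'Gpu Brand']),
                       cpu_dummies, gpu_dummies], axis=1)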

Other categorical features are encoded in the same way, such as the screen panel type, the GPU and CPU series, and the main storage type.

2. Feature Selection

We have too many features, many of which might not contribute much to predicting the price of a laptop. We will identify the most relevant variables from the dataset to use in the model. The key steps for feature selection are:

Correlations between the features and the target variable (Price) are calculated. Correlation measures the strength of the relationship between two variables, with values ranging from -1 to 1. Features with high correlation values (positive or negative) to the target variable are likely to be more useful for prediction.

  correlations = df.corr()['Price'].abs().sort_values()

This calculates the absolute value of the correlation between each feature and the target (Price), sorts the features, and identifies which are most strongly associated with Price.

Only features with correlations greater than 0.15 are retained for modeling.

The dataset is reduced to include only the selected features, simplifying the model and potentially improving performance by reducing noise.
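
Putting the three steps together, a minimal sketch of the filter (the 0.15 cutoff is the one stated above):

  # Keep only features whose absolute correlation with Price exceeds 0.15.
  # Price correlates perfectly with itself, so the target survives the filter.
  correlations = df.corr()['Price'].abs().sort_values()
  selected = correlations[correlations > 0.15].index
  df = df[selected]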

3. Data Visualization

A heatmap is useful for visualizing the correlation between features in a dataset:

Heatmap of the selected features
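
A minimal sketch of how such a heatmap might be drawn with seaborn:

  import matplotlib.pyplot as plt
  import seaborn as sns

  # Correlation heatmap of the retained features.
  plt.figure(figsize=(12, 10))
  sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
  plt.title('Heatmap of the selected features')
  plt.show()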

4. Model Training

For this project, I chose to work with Random Forest.

Random Forest is an ensemble learning method based on decision trees. It works by creating multiple decision trees during training and combining their predictions (averaging for regression or majority voting for classification).

Why Random Forest?

Random Forest handles non-linear relationships between specifications and price, is robust to outliers, and copes well with a mix of numerical and one-hot-encoded features, all with relatively little tuning.

  from sklearn.ensemble import RandomForestRegressor

The dataset is split into two parts: a training set used to fit the model and a held-out test set used to evaluate it, as sketched below.
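
A sketch of the split (the 80/20 ratio and the random seed are assumptions, not the project's confirmed settings):

  from sklearn.model_selection import train_test_split

  # Separate the features from the target, then hold out a test set.
  X = df.drop(columns=['Price'])
  y = df['Price']
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)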

Features often have different ranges (e.g., RAM might range from 4 to 64 GB, while Weight is between 1 and 3 kg). Scaling ensures that no single feature dominates due to its larger range.

  from sklearn.preprocessing import StandardScaler

  scaler = StandardScaler()
  X_train_scaled = scaler.fit_transform(X_train)
  X_test_scaled = scaler.transform(X_test)

fit_transform: computes the scaling parameters (mean and standard deviation) and applies them to the training set.
transform: applies the same scaling parameters to the test set to avoid data leakage.
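
With the data scaled, the model can be fitted using the RandomForestRegressor imported above; the hyperparameters below are assumptions, not the project's confirmed settings:

  # Train the Random Forest on the scaled training data.
  model = RandomForestRegressor(n_estimators=100, random_state=42)
  model.fit(X_train_scaled, y_train)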

5. Model Evaluation

We evaluate the model to validate its accuracy and ensure it generalizes well beyond the training data.

The R² Score measures how much of the variance in the target variable (Price) is explained by the model. Values closer to 1 indicate a better fit.

  score = model.score(X_test_scaled, y_test)
  print(f'Model R^2 Score: {score}')

Results: Model R² Score: 0.74

This plot compares the model’s predictions against the actual prices:

Predicted vs. Actual Prices
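
A minimal sketch of how such a plot might be produced with matplotlib:

  import matplotlib.pyplot as plt

  # Scatter predicted prices against actual test-set prices; the dashed
  # line marks perfect predictions.
  y_pred = model.predict(X_test_scaled)
  plt.scatter(y_test, y_pred, alpha=0.5)
  plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
  plt.xlabel('Actual Price')
  plt.ylabel('Predicted Price')
  plt.title('Predicted vs. Actual Prices')
  plt.show()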

Conclusion

In this project, we successfully built a machine learning model to predict laptop prices based on various specifications. The data preprocessing stage involved cleaning and transforming the dataset, handling missing values, and extracting meaningful features from complex strings. In the modeling stage, we utilized a Random Forest Regressor, which achieved an R² score of approximately 0.74. The model can be further improved by experimenting with different algorithms, hyperparameter tuning, and feature engineering.

How to Run

  1. Clone the Repository:

    git clone https://github.com/vishrut-b/ML-Project-Laptop-Price-Prediction.git
    
  2. Navigate to the Project Directory:

    cd ML-Project-Laptop-Price-Prediction
    
  3. Install Required Libraries:

    Ensure you have Python 3.x installed. Install the necessary libraries:

    pip install pandas numpy matplotlib seaborn scikit-learn
    
  4. Prepare the Dataset:

    • Place the laptopData.csv file in the project directory.
    • Run the data preprocessing script to generate edited_dataframe.csv.

      python data_processing.py
      
  5. Run the Modeling Script:

    Execute the script to train the model and evaluate its performance.

    python learning.py
    

    Replace data_processing.py and learning.py with the names of your scripts containing the above code.


Note: This README covers the data processing and modeling stages of the project. Further improvements, such as hyperparameter tuning and deployment, can be added in future updates.