Laptop Price Prediction
This project aims to predict laptop prices based on various features extracted from a dataset of laptop specifications. The project involves data preprocessing, feature engineering, and building a predictive model using machine learning techniques.
Data Processing notebook: here
Model Training notebook: here
Table of Contents
Introduction
Accurately predicting laptop prices can assist consumers in making informed purchasing decisions and help retailers optimize their pricing strategies. This project involves preprocessing a dataset containing laptop specifications and building a machine learning model to predict laptop prices.
Dataset
The dataset contains various specifications of laptops, including:
- Company
- Type
- Screen size and resolution
- CPU and GPU details
- RAM and storage specifications
- Operating system
- Weight
- Price
Note: Due to confidentiality, the dataset is not included in this repository. Ensure you have access to the dataset file named laptopData.csv.
Data Preprocessing
The raw data was cleaned and transformed to ensure it was suitable for machine learning. Key steps included:
1. Handling Missing Values
- Checking for Missing Values: We identify missing values in each column.
```python
missing_values = dataset.isnull().sum()
missing_percentage = dataset.isnull().mean() * 100
```
| Feature | Missing values (%) |
|---|---|
| Company | 2.302379 |
| TypeName | 2.302379 |
| Inches | 2.302379 |
| ScreenResolution | 2.302379 |
| Cpu | 2.302379 |
| Ram | 2.302379 |
| Memory | 2.302379 |
| Gpu | 2.302379 |
| OpSys | 2.302379 |
| Weight | 2.302379 |
| Price | 2.302379 |
Interestingly, all the columns have the same number of missing values, which suggests that the missing entries occur in related places rather than being randomly distributed. A quick exploration of the DataFrame confirms that the missing values correspond to entirely empty rows, so the simplest cleaning step is to drop those rows; no additional information is lost.
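A quick way to confirm this (a minimal sketch; the exact check used in the notebook may differ):
```python
# Count rows where every column is missing; if this matches the per-column
# missing counts above, the gaps are entirely empty rows rather than scattered holes.
fully_missing_rows = dataset.isnull().all(axis=1).sum()
print(fully_missing_rows, missing_values.max())
```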
- Dropping Rows with Missing Values: Since missing values occur in entire rows, we drop those rows.
```python
dataset.dropna(axis=0, inplace=True)
```
2. Cleaning Columns
| | Company | TypeName | Inches | ScreenResolution | Cpu | Ram | Memory | Gpu | OpSys | Weight | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8GB | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | 1.37kg | 71378.6832 |
| 1 | Apple | Ultrabook | 13.3 | 1440x900 | Intel Core i5 1.8GHz | 8GB | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | 1.34kg | 47895.5232 |
| 2 | HP | Notebook | 15.6 | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8GB | 256GB SSD | Intel HD Graphics 620 | No OS | 1.86kg | 30636.0000 |
- Removing Units from the ‘Ram’ and ‘Weight’ Columns: Unit strings such as "GB" and "kg" carry no additional information, so we strip them in order to treat these columns as numerical data.
```python
dataset['Ram'] = dataset['Ram'].str.replace("GB", "")
dataset['Weight'] = dataset['Weight'].str.replace("kg", "")
```
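The cast to numeric types is not shown here and presumably happens later in the notebook. A minimal sketch, assuming the unit strings have already been stripped:
```python
# Assumed follow-up (not shown above): cast the cleaned strings to numeric types
dataset['Ram'] = dataset['Ram'].astype(int)
dataset['Weight'] = dataset['Weight'].astype(float)
```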
3. Processing ‘ScreenResolution’ Column
The “Screen Resolution” column contains noisy and inconsistent data. To organize it effectively, we can extract and categorize the key pieces of information:
- Panel Type: Examples include IPS Panel and Touchscreen.
- Resolution: Common formats include 1920x1080 and 2560x1600.
- Additional Features: Such as Retina Display and 4K Ultra HD.
- Extracting Panel Type, Resolution, and Additional Features:
```python
import re

def simplify_resolution(res):
    # Panel type: IPS Panel or Touchscreen, otherwise 'Standard'
    panel = re.search(r'(IPS Panel|Touchscreen)', res)
    panel = panel.group(0) if panel else 'Standard'
    # Resolution such as 1920x1080 or 2560x1600
    resolution = re.search(r'\d{3,4}x\d{3,4}', res)
    resolution = resolution.group(0) if resolution else 'Unknown'
    # Additional features such as Retina Display or 4K Ultra HD
    feature = re.search(r'(Retina Display|4K Ultra HD|Full HD|Quad HD\+)', res)
    feature = feature.group(0) if feature else 'Standard'
    return f'{panel}, {feature}, {resolution}'

dataset['SimplifiedResolution'] = dataset['ScreenResolution'].apply(simplify_resolution)
```
- Splitting into Separate Columns:
```python
dataset[['Screen Panel Type', 'Additional Screen Features', 'Screen Resolution']] = dataset['SimplifiedResolution'].str.split(', ', expand=True)
dataset.drop(columns=['ScreenResolution', 'SimplifiedResolution'], inplace=True)
```
4. Processing the ‘Cpu’, ‘Gpu’ and ‘Memory’ Columns
The ‘Cpu’ column is shown here as an example; the ‘Gpu’ and ‘Memory’ columns are processed in a similar way (a hypothetical sketch for the ‘Memory’ column is given at the end of this step).
- Extracting CPU Features:
```python
dataset['CPU Brand'] = dataset['Cpu'].apply(lambda x: x.split()[0])
dataset['CPU Series'] = dataset['Cpu'].apply(lambda x: x.split()[1] if len(x.split()) > 1 else None)
dataset['CPU Core Type'] = dataset['Cpu'].str.extract(r'(\b(?:Quad|Dual|Octa)?\b Core)', expand=False)
dataset['CPU Model Number'] = dataset['Cpu'].str.extract(r'(\b[A-Za-z0-9\-]+[0-9]+\b)', expand=False)
dataset['CPU Clock Speed'] = dataset['Cpu'].str.extract(r'(\d+\.\d+GHz)', expand=False)
```
- Dropping the Original ‘Cpu’ Column:
```python
dataset.drop(columns=['Cpu'], inplace=True)
```
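As an illustration only (the notebook's actual parsing may differ), a hypothetical sketch for the ‘Memory’ column, producing the storage columns used in the next step, could look like this:
```python
# Hypothetical sketch -- the notebook's actual parsing may differ.
# Split entries such as "128GB SSD + 1TB HDD" into a main and an additional part.
parts = dataset['Memory'].str.partition('+')

dataset['Main Storage Size'] = parts[0].str.extract(r'(\d+(?:\.\d+)?\s*(?:GB|TB))', expand=False).str.strip()
dataset['Main Storage Type'] = parts[0].str.extract(r'(SSD|HDD|Flash Storage|Hybrid)', expand=False)

dataset['Additional Storage Size'] = parts[2].str.extract(r'(\d+(?:\.\d+)?\s*(?:GB|TB))', expand=False).str.strip()
dataset['Additional Storage Type'] = parts[2].str.extract(r'(SSD|HDD|Flash Storage|Hybrid)', expand=False)
```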
5. Final Dataset
After handling missing values and cleaning the data, we finalize the dataset.
```python
dataset.dropna(subset=['CPU Model Number', 'Gpu Type', 'Main Storage Type', 'Main Storage Size'], inplace=True)
```
- Converting Data Types:
```python
dataset['Price'] = dataset['Price'].astype(int)
dataset['Main Storage Size'] = dataset['Main Storage Size'].apply(convert).astype(int)
dataset['Additional Storage Size'] = dataset['Additional Storage Size'].fillna('0GB')
dataset['Additional Storage Size'] = dataset['Additional Storage Size'].apply(convert).astype(int)
dataset['CPU Clock Speed'] = dataset['CPU Clock Speed'].str.replace('GHz', '').astype(float)
```
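The `convert` helper used above is not defined in this README. A minimal sketch of what it might look like, assuming it maps storage strings such as '256GB' or '1TB' to a number of gigabytes:
```python
def convert(size):
    # Hypothetical helper (assumption): '256GB' -> 256, '1TB' -> 1024
    size = str(size).strip()
    if size.endswith('TB'):
        return float(size[:-2]) * 1024
    if size.endswith('GB'):
        return float(size[:-2])
    return float(size)
```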
- Saving the Cleaned Data:
```python
dataset.to_csv('edited_dataframe.csv', index=False)
```
Modeling
For this project, I used Python libraries including:
- numpy
- pandas
- matplotlib
- torch
- sklearn
- seaborn
1. One-Hot Encoding
One-hot encoding is a technique used to transform categorical data into a numerical format that machine learning models can understand. It creates binary (0 or 1) columns for each unique category in a categorical variable. This ensures that the model treats these categories as distinct and unrelated.
However, we need to be careful when one-hot encoding the CPU Brand and GPU Brand columns, because both can contain the same company name (Intel, for instance, appears as both a CPU and a GPU brand). Encoding them naively would produce clashing, duplicate-looking columns that could confuse the machine learning model. To avoid this issue, we add prefixes as shown below.
- Encoding Categorical Variables:
```python
df = df.join(pd.get_dummies(df['Company']))
df.drop('Company', axis=1, inplace=True)

df = df.join(pd.get_dummies(df['TypeName']))
df.drop('TypeName', axis=1, inplace=True)

df = df.join(pd.get_dummies(df['OpSys']))
df.drop('OpSys', axis=1, inplace=True)

# Encoding CPU and GPU Brands with prefixes to avoid confusion
cpu_brands = pd.get_dummies(df['CPU Brand'], prefix='CPU_Brand')
df = df.join(cpu_brands)
df.drop('CPU Brand', axis=1, inplace=True)

gpu_brands = pd.get_dummies(df['Gpu Brand'], prefix='GPU_Brand')
df = df.join(gpu_brands)
df.drop('Gpu Brand', axis=1, inplace=True)
```
Other categorical features, such as the screen panel type, the CPU and GPU series, and the main storage type, are encoded in the same way.
2. Feature Selection
We have many features, and some of them may not contribute much to predicting the price of a laptop. We therefore identify the most relevant variables in the dataset to use in the model. The key steps for feature selection are:
- Calculating Correlations:
Correlations between the features and the target variable (Price) are calculated. Correlation measures the strength of the relationship between two variables, with values ranging from -1 to 1. Features with high correlation values (positive or negative) to the target variable are likely to be more useful for prediction.
```python
correlations = df.corr()['Price'].abs().sort_values()
```
This calculates the absolute value of the correlation between each feature and the target (Price), sorts the features, and identifies which are most strongly associated with Price.
- Selecting Features with Correlation Above Threshold:
Only features with correlations greater than 0.15 are retained for modeling.
The dataset is reduced to include only the selected features, simplifying the model and potentially improving performance by reducing noise.
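A minimal sketch of this selection step, assuming the `correlations` Series computed above (the exact code in the notebook may differ):
```python
# Keep only the features whose absolute correlation with Price exceeds 0.15
selected_features = correlations[correlations > 0.15].index
selected_df = df[selected_features]
```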
3. Data Visualization
- Heatmap of the selected features:
A heatmap is useful for visualizing the correlation between features in a dataset:
- Diagonal Values: These represent self-correlation, which is always 1.
- High Correlation (Close to 1): Indicates a strong positive relationship.
- Low Correlation (Close to -1): Indicates a strong negative relationship.
- Near Zero: Suggests little to no relationship.
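The heatmap shown below can be generated with seaborn; a minimal sketch, assuming `selected_df` holds the features retained in the previous step:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of the features retained after selection
plt.figure(figsize=(12, 10))
sns.heatmap(selected_df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation heatmap of the selected features')
plt.tight_layout()
plt.show()
```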

4. Model Training
For this project, I chose to work with Random Forest.
Random Forest is an ensemble learning method based on decision trees. It works by creating multiple decision trees during training and combining their predictions (averaging for regression or majority voting for classification).
Why Random Forest?
- Robustness: It reduces overfitting because individual trees are trained on different subsets of the data and features.
- Handles Non-Linear Relationships: Random Forest can model complex, non-linear relationships between features and the target.
- Feature Importance: It provides a measure of feature importance, helping you understand which features contribute the most to predictions (a short sketch of this appears after the training step below).
```python
from sklearn.ensemble import RandomForestRegressor
```
- Defining Features and Target Variable:
```python
X = selected_df.drop('Price', axis=1)
y = selected_df['Price']
```
- Splitting the Data:
The dataset is split into two parts:
- Training Set: Used to train the model (learn patterns from the data).
- Testing Set: Used to evaluate how well the model performs on unseen data. This ensures that the model generalizes rather than simply memorizing the training data.
```python
from sklearn.model_selection import train_test_split

# Split the data; train_size=0.2 uses 20% of the rows for training and the rest for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)
```
- Scaling the Data:
Features often have different ranges (e.g., RAM might range from 4 to 64 GB, while Weight is between 1 and 3 kg). Scaling ensures that no single feature dominates due to its larger range.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
- `fit_transform`: Computes the scaling parameters (mean and standard deviation) from the training set and applies them to it.
- `transform`: Applies the same scaling parameters to the test set to avoid data leakage.
- Training the Random Forest Model:
```python
model = RandomForestRegressor()
model.fit(X_train_scaled, y_train)
```
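As noted under "Why Random Forest?", the trained forest exposes feature importances. A minimal sketch for inspecting them, assuming the `model` and feature matrix `X` defined above:
```python
import pandas as pd

# Rank the selected features by the importance the trained forest assigns to them
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
```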
5. Model Evaluation
We evaluate the model to validate its accuracy and ensure that it generalizes well beyond the training data.
- Evaluating the Model:
The R² Score measures how much of the variance in the target variable (Price) is explained by the model. Values closer to 1 indicate a better fit.
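For reference, the score is computed as R² = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)², where ŷᵢ are the model's predictions and ȳ is the mean of the actual prices; a value of 1 means perfect predictions, while 0 means the model does no better than predicting the mean.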
```python
score = model.score(X_test_scaled, y_test)
print(f'Model R^2 Score: {score}')
```
Results: Model R² Score: 0.74
- Plotting Predicted vs. Actual Prices:
This plot shows the model’s accuracy:
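A minimal sketch of how such a plot can be produced, assuming the trained model and the scaled test set from above:
```python
import matplotlib.pyplot as plt

# Scatter of predicted vs. actual prices; points near the diagonal indicate good predictions
y_pred = model.predict(X_test_scaled)

plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # perfect-prediction line
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Predicted vs. Actual Laptop Prices')
plt.show()
```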

Conclusion
In this project, we successfully built a machine learning model to predict laptop prices based on various specifications. The data preprocessing stage involved cleaning and transforming the dataset, handling missing values, and extracting meaningful features from complex strings. In the modeling stage, we utilized a Random Forest Regressor, which achieved an R² score of approximately 0.74. The model can be further improved by experimenting with different algorithms, hyperparameter tuning, and feature engineering.
How to Run
- Clone the Repository:
```bash
git clone https://github.com/vishrut-b/ML-Project-Laptop-Price-Prediction.git
```
- Navigate to the Project Directory:
```bash
cd ML-Project-Laptop-Price-Prediction
```
- Install Required Libraries:
Ensure you have Python 3.x installed. Install the necessary libraries:
```bash
pip install pandas numpy matplotlib seaborn scikit-learn
```
- Prepare the Dataset:
- Place the `laptopData.csv` file in the project directory.
- Run the data preprocessing script to generate `edited_dataframe.csv`:
```bash
python data_processing.py
```
- Run the Modeling Script:
Execute the script to train the model and evaluate its performance.
```bash
python learning.py
```
Replace `data_processing.py` and `learning.py` with the names of your scripts containing the above code.
Note: This README covers the data processing and modeling stages of the project. Further improvements, such as hyperparameter tuning and deployment, can be added in future updates.