Laptop Price Prediction
This project aims to predict laptop prices based on various features extracted from a dataset of laptop specifications. The project involves data preprocessing, feature engineering, and building a predictive model using machine learning techniques.
Data Processing notebook: here
Model Training notebook: here
Table of Contents
Introduction
Accurately predicting laptop prices can assist consumers in making informed purchasing decisions and help retailers optimize their pricing strategies. This project involves preprocessing a dataset containing laptop specifications and building a machine learning model to predict laptop prices.
Dataset
The dataset contains various specifications of laptops, including:
- Company
- Type
- Screen size and resolution
- CPU and GPU details
- RAM and storage specifications
- Operating system
- Weight
- Price
Note: Due to confidentiality, the dataset is not included in this repository. Ensure you have access to the dataset file named laptopData.csv.
Data Preprocessing
The raw data was cleaned and transformed to ensure it was suitable for machine learning. Key steps included:
1. Handling Missing Values
- Checking for Missing Values: We identify missing values in each column.
```python
missing_values = dataset.isnull().sum()
missing_percentage = dataset.isnull().mean() * 100
```
| Feature | Missing values (%) |
|---|---|
| Company | 2.302379 |
| TypeName | 2.302379 |
| Inches | 2.302379 |
| ScreenResolution | 2.302379 |
| Cpu | 2.302379 |
| Ram | 2.302379 |
| Memory | 2.302379 |
| Gpu | 2.302379 |
| OpSys | 2.302379 |
| Weight | 2.302379 |
| Price | 2.302379 |
Interestingly, all the columns have the same number of missing values, which suggests that the missing entries occur in related places rather than being randomly distributed. A quick exploration of the DataFrame confirms that the missing values correspond to entirely empty rows, so the simplest cleaning step is to drop those rows; no additional information is lost.
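A quick way to confirm this (a minimal sketch; the exact check used in the notebook may differ):
```python
# Count rows where every column is missing; if this matches the per-column
# missing counts above, the gaps are entirely empty rows rather than scattered holes.
fully_missing_rows = dataset.isnull().all(axis=1).sum()
print(fully_missing_rows, missing_values.max())
```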
- Dropping Rows with Missing Values: Since missing values occur in entire rows, we drop those rows.
```python
dataset.dropna(axis=0, inplace=True)
```
2. Cleaning Columns
| | Company | TypeName | Inches | ScreenResolution | Cpu | Ram | Memory | Gpu | OpSys | Weight | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | Ultrabook | 13.3 | IPS Panel Retina Display 2560x1600 | Intel Core i5 2.3GHz | 8GB | 128GB SSD | Intel Iris Plus Graphics 640 | macOS | 1.37kg | 71378.6832 |
| 1 | Apple | Ultrabook | 13.3 | 1440x900 | Intel Core i5 1.8GHz | 8GB | 128GB Flash Storage | Intel HD Graphics 6000 | macOS | 1.34kg | 47895.5232 |
| 2 | HP | Notebook | 15.6 | Full HD 1920x1080 | Intel Core i5 7200U 2.5GHz | 8GB | 256GB SSD | Intel HD Graphics 620 | No OS | 1.86kg | 30636.0000 |
- Removing Units from the ‘Ram’ and ‘Weight’ Columns: Unit strings such as "GB" and "kg" carry no additional information, so we strip them in order to treat these columns as numerical data.
```python
dataset['Ram'] = dataset['Ram'].str.replace("GB", "")
dataset['Weight'] = dataset['Weight'].str.replace("kg", "")
```
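The cast to numeric types is not shown here and presumably happens later in the notebook. A minimal sketch, assuming the unit strings have already been stripped:
```python
# Assumed follow-up (not shown above): cast the cleaned strings to numeric types
dataset['Ram'] = dataset['Ram'].astype(int)
dataset['Weight'] = dataset['Weight'].astype(float)
```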
3. Processing ‘ScreenResolution’ Column
The “Screen Resolution” column contains noisy and inconsistent data. To organize it effectively, we can extract and categorize the key pieces of information:
- Panel Type: Examples include IPS Panel and Touchscreen.
- Resolution: Common formats include 1920x1080 and 2560x1600.
- Additional Features: Such as Retina Display and 4K Ultra HD.
- Extracting Panel Type, Resolution, and Additional Features:
```python
import re

def simplify_resolution(res):
    # Panel type: IPS Panel or Touchscreen, otherwise 'Standard'
    panel = re.search(r'(IPS Panel|Touchscreen)', res)
    panel = panel.group(0) if panel else 'Standard'
    # Resolution such as 1920x1080 or 2560x1600
    resolution = re.search(r'\d{3,4}x\d{3,4}', res)
    resolution = resolution.group(0) if resolution else 'Unknown'
    # Additional features such as Retina Display or 4K Ultra HD
    feature = re.search(r'(Retina Display|4K Ultra HD|Full HD|Quad HD\+)', res)
    feature = feature.group(0) if feature else 'Standard'
    return f'{panel}, {feature}, {resolution}'

dataset['SimplifiedResolution'] = dataset['ScreenResolution'].apply(simplify_resolution)
```
- Splitting into Separate Columns:
```python
dataset[['Screen Panel Type', 'Additional Screen Features', 'Screen Resolution']] = dataset['SimplifiedResolution'].str.split(', ', expand=True)
dataset.drop(columns=['ScreenResolution', 'SimplifiedResolution'], inplace=True)
```
4. Processing the ‘Cpu’, ‘Gpu’ and ‘Memory’ Columns
The ‘Cpu’ column is shown here as an example; the ‘Gpu’ and ‘Memory’ columns are processed in a similar way (a hypothetical sketch for the ‘Memory’ column is given at the end of this step).
- Extracting CPU Features:
```python
dataset['CPU Brand'] = dataset['Cpu'].apply(lambda x: x.split()[0])
dataset['CPU Series'] = dataset['Cpu'].apply(lambda x: x.split()[1] if len(x.split()) > 1 else None)
dataset['CPU Core Type'] = dataset['Cpu'].str.extract(r'(\b(?:Quad|Dual|Octa)?\b Core)', expand=False)
dataset['CPU Model Number'] = dataset['Cpu'].str.extract(r'(\b[A-Za-z0-9\-]+[0-9]+\b)', expand=False)
dataset['CPU Clock Speed'] = dataset['Cpu'].str.extract(r'(\d+\.\d+GHz)', expand=False)
```
- Dropping the Original ‘Cpu’ Column:
```python
dataset.drop(columns=['Cpu'], inplace=True)
```
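As an illustration only (the notebook's actual parsing may differ), a hypothetical sketch for the ‘Memory’ column, producing the storage columns used in the next step, could look like this:
```python
# Hypothetical sketch -- the notebook's actual parsing may differ.
# Split entries such as "128GB SSD + 1TB HDD" into a main and an additional part.
parts = dataset['Memory'].str.partition('+')

dataset['Main Storage Size'] = parts[0].str.extract(r'(\d+(?:\.\d+)?\s*(?:GB|TB))', expand=False).str.strip()
dataset['Main Storage Type'] = parts[0].str.extract(r'(SSD|HDD|Flash Storage|Hybrid)', expand=False)

dataset['Additional Storage Size'] = parts[2].str.extract(r'(\d+(?:\.\d+)?\s*(?:GB|TB))', expand=False).str.strip()
dataset['Additional Storage Type'] = parts[2].str.extract(r'(SSD|HDD|Flash Storage|Hybrid)', expand=False)
```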
5. Final Dataset
After handling missing values and cleaning the data, we finalize the dataset.
```python
dataset.dropna(subset=['CPU Model Number', 'Gpu Type', 'Main Storage Type', 'Main Storage Size'], inplace=True)
```
- Converting Data Types:
```python
dataset['Price'] = dataset['Price'].astype(int)
dataset['Main Storage Size'] = dataset['Main Storage Size'].apply(convert).astype(int)
dataset['Additional Storage Size'] = dataset['Additional Storage Size'].fillna('0GB')
dataset['Additional Storage Size'] = dataset['Additional Storage Size'].apply(convert).astype(int)
dataset['CPU Clock Speed'] = dataset['CPU Clock Speed'].str.replace('GHz', '').astype(float)
```
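The `convert` helper used above is not defined in this README. A minimal sketch of what it might look like, assuming it maps storage strings such as '256GB' or '1TB' to a number of gigabytes:
```python
def convert(size):
    # Hypothetical helper (assumption): '256GB' -> 256, '1TB' -> 1024
    size = str(size).strip()
    if size.endswith('TB'):
        return float(size[:-2]) * 1024
    if size.endswith('GB'):
        return float(size[:-2])
    return float(size)
```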
- Saving the Cleaned Data:
```python
dataset.to_csv('edited_dataframe.csv', index=False)
```
Modeling
For this project, I used Python libraries including:
- numpy
- pandas
- matplotlib
- torch
- sklearn
- seaborn
1. One-Hot Encoding
One-hot encoding is a technique used to transform categorical data into a numerical format that machine learning models can understand. It creates binary (0 or 1) columns for each unique category in a categorical variable. This ensures that the model treats these categories as distinct and unrelated.
However, we need to be careful when one-hot encoding the CPU Brand and GPU Brand columns, because both can contain the same company name (Intel, for instance, appears as both a CPU and a GPU brand). Encoding them naively would produce clashing, duplicate-looking columns that could confuse the machine learning model. To avoid this issue, we add prefixes as shown below.
- Encoding Categorical Variables:
```python
df = df.join(pd.get_dummies(df['Company']))
df.drop('Company', axis=1, inplace=True)

df = df.join(pd.get_dummies(df['TypeName']))
df.drop('TypeName', axis=1, inplace=True)

df = df.join(pd.get_dummies(df['OpSys']))
df.drop('OpSys', axis=1, inplace=True)

# Encoding CPU and GPU Brands with prefixes to avoid confusion
cpu_brands = pd.get_dummies(df['CPU Brand'], prefix='CPU_Brand')
df = df.join(cpu_brands)
df.drop('CPU Brand', axis=1, inplace=True)

gpu_brands = pd.get_dummies(df['Gpu Brand'], prefix='GPU_Brand')
df = df.join(gpu_brands)
df.drop('Gpu Brand', axis=1, inplace=True)
```
Other categorical features, such as the screen panel type, the CPU and GPU series, and the main storage type, are encoded in the same way.
2. Feature Selection
We have many features, and some of them may not contribute much to predicting the price of a laptop. We therefore identify the most relevant variables in the dataset to use in the model. The key steps for feature selection are:
- Calculating Correlations:
Correlations between the features and the target variable (Price) are calculated. Correlation measures the strength of the relationship between two variables, with values ranging from -1 to 1. Features with high correlation values (positive or negative) to the target variable are likely to be more useful for prediction.
```python
correlations = df.corr()['Price'].abs().sort_values()
```
This calculates the absolute value of the correlation between each feature and the target (Price), sorts the features, and identifies which are most strongly associated with Price.
- Selecting Features with Correlation Above Threshold:
Only features with correlations greater than 0.15 are retained for modeling.
The dataset is reduced to include only the selected features, simplifying the model and potentially improving performance by reducing noise.
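A minimal sketch of this selection step, assuming the `correlations` Series computed above (the exact code in the notebook may differ):
```python
# Keep only the features whose absolute correlation with Price exceeds 0.15
selected_features = correlations[correlations > 0.15].index
selected_df = df[selected_features]
```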
3. Data Visualization
- Heatmap of the selected features:
A heatmap is useful for visualizing the correlation between features in a dataset:
- Diagonal Values: These represent self-correlation, which is always 1.
- High Correlation (Close to 1): Indicates a strong positive relationship.
- Low Correlation (Close to -1): Indicates a strong negative relationship.
- Near Zero: Suggests little to no relationship.
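The heatmap shown below can be generated with seaborn; a minimal sketch, assuming `selected_df` holds the features retained in the previous step:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of the features retained after selection
plt.figure(figsize=(12, 10))
sns.heatmap(selected_df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation heatmap of the selected features')
plt.tight_layout()
plt.show()
```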

4. Model Training
For this project, I chose to work with Random Forest.
Random Forest is an ensemble learning method based on decision trees. It works by creating multiple decision trees during training and combining their predictions (averaging for regression or majority voting for classification).
Why Random Forest?
- Robustness: It reduces overfitting because individual trees are trained on different subsets of the data and features.
- Handles Non-Linear Relationships: Random Forest can model complex, non-linear relationships between features and the target.
- Feature Importance: It provides a measure of feature importance, helping you understand which features contribute the most to predictions (a short sketch of this appears after the training step below).
```python
from sklearn.ensemble import RandomForestRegressor
```
- Defining Features and Target Variable:
```python
X = selected_df.drop('Price', axis=1)
y = selected_df['Price']
```
- Splitting the Data:
The dataset is split into two parts:
- Training Set: Used to train the model (learn patterns from the data).
- Testing Set: Used to evaluate how well the model performs on unseen data. This ensures that the model generalizes rather than simply memorizing the training data.
```python
from sklearn.model_selection import train_test_split

# Split the data; train_size=0.2 uses 20% of the rows for training and the rest for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)
```
- Scaling the Data:
Features often have different ranges (e.g., RAM might range from 4 to 64 GB, while Weight is between 1 and 3 kg). Scaling ensures that no single feature dominates due to its larger range.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
- `fit_transform`: Computes the scaling parameters (mean and standard deviation) from the training set and applies them to it.
- `transform`: Applies the same scaling parameters to the test set to avoid data leakage.
- Training the Random Forest Model:
```python
model = RandomForestRegressor()
model.fit(X_train_scaled, y_train)
```
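As noted under "Why Random Forest?", the trained forest exposes feature importances. A minimal sketch for inspecting them, assuming the `model` and feature matrix `X` defined above:
```python
import pandas as pd

# Rank the selected features by the importance the trained forest assigns to them
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))
```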
5. Model Evaluation
We evaluate the model to validate its accuracy and ensure that it generalizes well beyond the training data.
- Evaluating the Model:
The R² Score measures how much of the variance in the target variable (Price) is explained by the model. Values closer to 1 indicate a better fit.
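For reference, the score is computed as R² = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)², where ŷᵢ are the model's predictions and ȳ is the mean of the actual prices; a value of 1 means perfect predictions, while 0 means the model does no better than predicting the mean.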
```python
score = model.score(X_test_scaled, y_test)
print(f'Model R^2 Score: {score}')
```
Results: Model R² Score: 0.74
- Plotting Predicted vs. Actual Prices:
This plot shows the model’s accuracy:
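A minimal sketch of how such a plot can be produced, assuming the trained model and the scaled test set from above:
```python
import matplotlib.pyplot as plt

# Scatter of predicted vs. actual prices; points near the diagonal indicate good predictions
y_pred = model.predict(X_test_scaled)

plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # perfect-prediction line
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Predicted vs. Actual Laptop Prices')
plt.show()
```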

Conclusion
In this project, we successfully built a machine learning model to predict laptop prices based on various specifications. The data preprocessing stage involved cleaning and transforming the dataset, handling missing values, and extracting meaningful features from complex strings. In the modeling stage, we utilized a Random Forest Regressor, which achieved an R² score of approximately 0.74. The model can be further improved by experimenting with different algorithms, hyperparameter tuning, and feature engineering.
How to Run
- Clone the Repository:
```bash
git clone https://github.com/vishrut-b/ML-Project-Laptop-Price-Prediction.git
```
- Navigate to the Project Directory:
```bash
cd ML-Project-Laptop-Price-Prediction
```
- Install Required Libraries:
Ensure you have Python 3.x installed. Install the necessary libraries:
```bash
pip install pandas numpy matplotlib seaborn scikit-learn
```
- Prepare the Dataset:
- Place the `laptopData.csv` file in the project directory.
- Run the data preprocessing script to generate `edited_dataframe.csv`:
```bash
python data_processing.py
```
- Run the Modeling Script:
Execute the script to train the model and evaluate its performance.
```bash
python learning.py
```
Replace `data_processing.py` and `learning.py` with the names of your scripts containing the above code.
Note: This README covers the data processing and modeling stages of the project. Further improvements, such as hyperparameter tuning and deployment, can be added in future updates.