Housing Prices Data Cleaning and Preprocessing

Overview

This project cleans and preprocesses a housing prices dataset to prepare it for further analysis or modeling. The dataset contains a range of features describing each property, along with its sale price.

Dataset

The dataset used in this project is the Housing Prices dataset, which can be downloaded from Kaggle.

Features

Id: Property ID
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to the property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
... (and many more features)

Project Structure

The repository contains the following files and directories:

data/: Directory containing the dataset
- train.csv: The dataset file
Housing_Prices_Data_Cleaning.ipynb: Jupyter Notebook with the complete data cleaning and preprocessing process
README.md: This file

Steps and Analysis

1. Data Loading and Exploration

Load the dataset and display the first few rows
Summarize the dataset to understand the basic statistics
Check for missing values in the dataset
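
As a minimal sketch of this step, the snippet below loads a small CSV, previews it, summarizes it, and counts missing values per column. The three-row sample and its values are made up for illustration so the snippet runs on its own; with the real data, the read would presumably point at data/train.csv instead.

```python
import io

import pandas as pd

# Hypothetical three-row sample with made-up values, standing in for the real file
sample_csv = io.StringIO(
    "Id,LotFrontage,LotArea,Street,Alley,SalePrice\n"
    "1,60.0,8000,Pave,,200000\n"
    "2,,9500,Pave,Grvl,180000\n"
    "3,70.0,11000,Pave,,220000\n"
)
df = pd.read_csv(sample_csv)  # with the real data: pd.read_csv("data/train.csv")

print(df.head())      # first few rows
print(df.describe())  # basic statistics for the numeric columns

# Missing values per column, largest first, keeping only columns that have gaps
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Sorting the counts makes it easier to see which columns need the most attention in the next step.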

2. Handling Missing Values

Fill missing numerical values with the median
Fill missing categorical values with the mode
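
The two fill rules above could be sketched as follows. The sample frame is hypothetical, and splitting columns by dtype is an assumption about how numerical and categorical features are told apart; the notebook may select them differently.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one gap in a numerical and one in a categorical column
df = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "MSZoning": ["RL", "RL", None, "RM"],
})

# Assumption: numeric dtype means numerical feature, everything else is categorical
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Numerical columns: fill each gap with that column's median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns: fill each gap with that column's most frequent value
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```

Note that mode() returns an empty result for a column that is entirely missing, so such a column would need separate handling before this loop.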

3. Handling Outliers

Plot boxplots to detect outliers in SalePrice
Remove outliers based on the IQR method
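
A minimal sketch of IQR-based filtering is shown below. The helper name remove_iqr_outliers and the multiplier k=1.5 are assumptions (1.5 is the conventional boxplot whisker length); the README does not state which threshold the notebook uses. The prices are made up.

```python
import pandas as pd


def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends; rows with missing values are dropped
    return df[df[column].between(lower, upper)]


# Hypothetical prices; the last one sits far above the rest
prices = pd.DataFrame(
    {"SalePrice": [100000, 120000, 130000, 140000, 150000, 900000]}
)
cleaned = remove_iqr_outliers(prices, "SalePrice")
print(cleaned)

# Optional visual check before filtering (requires seaborn):
# import seaborn as sns; sns.boxplot(x=prices["SalePrice"])
```

Here Q1 is 122,500 and Q3 is 147,500, so the accepted range is 85,000 to 185,000 and only the 900,000 row is removed.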

4. Feature Engineering

Create new features from existing data (e.g., TotalSF)
Convert categorical variables into dummy/indicator variables
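
The sketch below illustrates both steps on a made-up frame. The README does not define TotalSF; summing the basement and two floor areas is an assumed definition, and the column names TotalBsmtSF, 1stFlrSF, and 2ndFlrSF are assumed to be present in the dataset.

```python
import pandas as pd

# Hypothetical two-row sample; the area column names are assumptions
df = pd.DataFrame({
    "TotalBsmtSF": [800, 0],
    "1stFlrSF": [900, 1200],
    "2ndFlrSF": [700, 0],
    "Street": ["Pave", "Grvl"],
})

# Assumed definition: total square footage = basement + first floor + second floor
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

# Replace the categorical column with one indicator column per category
df = pd.get_dummies(df, columns=["Street"])

print(df.columns.tolist())
```

Passing drop_first=True to get_dummies drops one indicator per feature, which can help when the encoded data feeds a linear model; whether the notebook does this is not stated.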

5. Scaling and Normalization

Standardize the dataset using StandardScaler
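
As an illustration, StandardScaler rescales each column to zero mean and unit variance. The sample values are made up, and which columns the notebook scales (for example, whether SalePrice is included) is not stated here, so the column choice is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features
df = pd.DataFrame({
    "LotArea": [8000.0, 10000.0, 12000.0],
    "TotalSF": [1500.0, 2000.0, 2500.0],
})

scaler = StandardScaler()

# fit_transform returns a NumPy array, so wrap it to keep column names and index
scaled = pd.DataFrame(
    scaler.fit_transform(df), columns=df.columns, index=df.index
)

print(scaled)
```

If the data is later split for modeling, fitting the scaler on the training portion only and reusing it on the test portion avoids leaking test statistics into training.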

How to Run the Notebook

Clone the Repository:
```
git clone https://github.com/yourusername/Housing_Prices_Data_Cleaning.git
cd Housing_Prices_Data_Cleaning
```

Install Required Libraries: Ensure you have the required Python libraries installed. You can install them using pip:
```
pip install pandas numpy matplotlib seaborn scikit-learn
```
Open the Jupyter Notebook: Start Jupyter Notebook and open Housing_Prices_Data_Cleaning.ipynb to run the analysis step-by-step.

Contributing

Contributions are welcome! If you have any improvements or suggestions, please create a pull request or open an issue to discuss.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Acknowledgements

The dataset used in this project is available on Kaggle.

Author

Ivy Qwinn
