This project cleans and preprocesses a housing prices dataset to prepare it for further analysis or modeling. The dataset includes various features describing housing properties, along with their sale prices.
The dataset used in this project is the Housing Prices dataset, which can be downloaded from Kaggle. Its features include:
- Id: Property ID
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to the property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- ... (and many more features)
The repository contains the following files and directories:
- data/: Directory containing the dataset
  - train.csv: The dataset file
- Housing_Prices_Data_Cleaning.ipynb: Jupyter Notebook with the complete data cleaning and preprocessing process
- README.md: This file
- Load the dataset and display the first few rows
- Summarize the dataset to understand the basic statistics
- Check for missing values in the dataset
- Fill missing numerical values with the median
- Fill missing categorical values with the mode
- Plot boxplots to detect outliers in SalePrice
- Remove outliers based on the IQR method
- Create new features from existing data (e.g., TotalSF)
- Convert categorical variables into dummy/indicator variables
- Standardize the dataset using StandardScaler
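
The steps above can be sketched in a few lines of pandas and scikit-learn. This is a minimal illustration under stated assumptions, not the notebook's actual code: the helper name `clean_housing`, the tiny `demo` frame, the 1.5 × IQR cutoff, the definition of TotalSF as the sum of TotalBsmtSF, 1stFlrSF and 2ndFlrSF, and the choice to leave Id and SalePrice unscaled are all assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def clean_housing(df, target="SalePrice"):
    """Hypothetical helper that mirrors the cleaning steps listed above."""
    df = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.select_dtypes(exclude="number").columns

    # Fill missing numerical values with the median, categorical with the mode.
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    for col in cat_cols:
        modes = df[col].mode()
        if not modes.empty:
            df[col] = df[col].fillna(modes.iloc[0])

    # Remove SalePrice outliers with the IQR method (1.5 * IQR is an assumed cutoff).
    q1, q3 = df[target].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = df[target].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    df = df.loc[keep].copy()

    # Assumed definition of TotalSF: basement + first floor + second floor area.
    parts = ["TotalBsmtSF", "1stFlrSF", "2ndFlrSF"]
    if set(parts).issubset(df.columns):
        df["TotalSF"] = df[parts].sum(axis=1)

    # Columns to standardize: numeric features, excluding Id and the target (assumption).
    scale_cols = [
        c for c in list(num_cols) + ["TotalSF"]
        if c in df.columns and c not in (target, "Id")
    ]

    # Convert categorical variables into dummy/indicator variables.
    if len(cat_cols):
        df = pd.get_dummies(df, columns=list(cat_cols))

    # Standardize the selected columns to zero mean and unit variance.
    if scale_cols:
        df = df.astype({c: "float64" for c in scale_cols})
        df[scale_cols] = StandardScaler().fit_transform(df[scale_cols])
    return df


# Tiny made-up frame for illustration only; these are not rows from train.csv.
demo = pd.DataFrame({
    "Id": range(1, 8),
    "LotFrontage": [60.0, None, 70.0, 65.0, 80.0, 75.0, 90.0],
    "Street": ["Pave", "Pave", None, "Grvl", "Pave", "Pave", "Pave"],
    "TotalBsmtSF": [800, 900, 850, 700, 950, 1000, 3000],
    "1stFlrSF": [850, 900, 880, 760, 1000, 1050, 3200],
    "2ndFlrSF": [0, 400, 0, 350, 500, 0, 1500],
    "SalePrice": [100000, 110000, 120000, 130000, 140000, 150000, 1000000],
})
cleaned = clean_housing(demo)
print(cleaned.head())
```

On the real data, the same helper could be applied to `pd.read_csv("data/train.csv")` after inspecting it with `df.head()`, `df.describe()`, and `df.isnull().sum()`. The scaling columns are chosen before encoding so that the indicator columns stay as 0/1 flags.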
- Clone the Repository:
      git clone https://github.com/yourusername/Housing_Prices_Data_Cleaning.git
      cd Housing_Prices_Data_Cleaning
- Install Required Libraries: Ensure you have the required Python libraries installed. You can install them using pip:
      pip install pandas numpy matplotlib seaborn scikit-learn
- Open the Jupyter Notebook: Start Jupyter Notebook and open Housing_Prices_Data_Cleaning.ipynb to run the analysis step-by-step.
Contributions are welcome! If you have any improvements or suggestions, please create a pull request or open an issue to discuss.
This project is licensed under the MIT License. See the LICENSE file for more details.
- The dataset used in this project is available on Kaggle.
- Ivy Qwinn