This project involves cleaning and transforming data from the Nashville Housing dataset using SQL queries. The primary tasks include exploring the data, changing date formats, populating missing property addresses, breaking down address columns, standardizing 'SoldAsVacant' values, removing duplicates, and deleting unused columns.
- Retrieve all records from the
NashvilleHousing
table.
SELECT *
FROM NashvilleHousing;
ุฏ
Alter the table to add a new column (SaleDateConverted) with a standardized date format. Update the new column with converted dates.
ALTER TABLE NashvilleHousing
ADD SaleDateConverted DATE;
UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(date, SaleDate);
Populate missing property addresses by updating records based on ParcelID.
-- Populating Property Address Data
-- (code snippet for SELECT statement not shown here)
-- Update records with missing property addresses
UPDATE N1
SET PropertyAddress = ISNULL(N1.PropertyAddress, N2.PropertyAddress)
FROM NashvilleHousing N1
JOIN NashvilleHousing N2
ON N1.ParcelID = N2.ParcelID
AND N1.UniqueID <> N2.UniqueID
WHERE N1.PropertyAddress IS NULL;
Break down the PropertyAddress column into two individual columns: Address and City.
-- Breaking Property Address Column
-- (code snippet for SELECT statement not shown here)
-- Add new columns
ALTER TABLE NashvilleHousing
ADD Address NVARCHAR(255);
-- Update the new columns
UPDATE NashvilleHousing
SET Address = SUBSTRING( PropertyAddress, 1, CHARINDEX(',', PropertyAddress)-1);
ALTER TABLE NashvilleHousing
ADD City NVARCHAR(255);
UPDATE NashvilleHousing
SET City = SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress)+1, LEN(PropertyAddress));
Standardize values in the SoldAsVacant column.
-- Changing 'Y' and 'N' to 'Yes' and 'No' in SoldAsVacant Column
-- (code snippet for SELECT statement not shown here)
-- Update the column
UPDATE NashvilleHousing
SET SoldAsVacant = CASE WHEN SoldAsVacant = 'Y' THEN 'Yes'
WHEN SoldAsVacant = 'N' THEN 'No'
ELSE SoldAsVacant
END;
Remove duplicate records based on specific columns.
-- Removing Duplicates
-- (code snippet for WITH statement not shown here)
-- Delete duplicates
WITH RowNumCte AS (
-- ...
)
DELETE
FROM RowNumCte
WHERE row_num > 1;
Remove columns that are no longer needed.
-- Deleting Unused Columns
ALTER TABLE NashvilleHousing
DROP COLUMN OwnerAddress, TaxDistrict, PropertyAddress;
### Checking the Result
Verify that the cleaning process was successful.