Data Cleaning in MS Excel and SQL Server
A comparison between MS Excel and SQL for data cleaning
Objective
The main objective of this article is to walk through a few ways to obtain clean data using MS Excel and SQL
Introduction
Data cleaning is the process of ensuring data is correct, consistent and usable. Data cleaning is really important because it improves data quality and there by increases the overall productivity. Clean data helps in accurate decision making, boost results and revenue, save money , time and minimize compliance risks. This articles presents few ways which helps in cleaning the data using MS Excel and SQL.
Null
Objective : Replace Null values with 0
EXCEL :
SQL
SELECT ISNULL(column_name, 0 ) FROM table1
OR
SELECT COALESCE(column_name, 0 ) FROM table1
Spelling
Objective : Replace Y to “Yes”, “N” to “No” and correct other spelling mistakes
In Excel, this can be achieved in the same way using Find and Replace method . Find Y and replace all with Yes in the required column
SQL :
SELECT CASE WHEN IsAvailable='Y' THEN 'Yes'
WHEN IsAvailble='N' THEN 'No'
ELSE IsAvailable END
FROM IceCream
Fill
Objective : Find the first name and last name from Email
To extract the first name and last name from email ID in Excel, we use a shortcut for Autofill Ctrl+E . Select each column and do ctrl +E to autofill first name and last name . This can be also achieved using the FlashFill option in home tab from Fill which helps in filling these fields quickly. Excel is intelligent to understand the delimiter which helps in extracting first name and last name easily.
Another alternate way is to select the cells to split. Select Data->Text to Columns-> select delimiter and apply.
Sort and Filter
Objective : Sort the table based on first name in ascending order
To filter the employees who have salary greater than average of all employees is done by right click on salary filter and select number filters. You can see many options and choose the required option. We can also add custom filters based on user requirements.
SQL:
#ORDER BY clause is used for sorting in ascending or descending orderSELECT * FROM TABLE
ORDER BY First Name ASC;
Format
Objective :
- Remove the trailing and leading spaces in product_name
- Lower case the product_name
- Change the sale_date to ‘year-month’ format
Excel:
Lowercase formula: =lower(cellrange)
Trim formula: = Trim(cellrange)
Date and Month =LEFT(number of characters) or format the cells SQL:
SELECT lower(TRIM(product_name))product_name
,LEFT(sale_date,7) sale_date
FROM Sales
GROUP BY product_name,sale_date
Duplicates
Objective : Find the duplicate Email and remove from the table
EXCEL:
Select the full data, go to Data and click on Remove duplicates.
To visualize the duplicate values, we can use conditional formatting and highlight the duplicate cells or rowSQL:
#List of Duplicate Emails
SELECT email
,count(*) cnt_email
FROM PERSON
GROUP BY email
HAVING cnt_email > 1#To Delete Duplicate Emails
Delete FROM TABLE
WHERE Id NOT IN
(select min(Id)
from Person
group by Email)
Transpose
Objective : Change rows to columns
If you need to shifts rows to become columns or vice-versa, select the cells you need to transpose, and copy it. Right click and click on paste special and select transpose. An alternate way to transpose using the transpose function which is using the format {=TRANSPOSE(Range)}
Summary
We have seen how to clean a raw data by removing duplicates, spelling mistakes, formatting, replacing null values, sorting and filling in MS Excel and SQL.