
Preparing and cleaning data

Updated: Feb 2




The goal is to prepare and clean the data for an e-commerce bike store.



Step 1: Review the raw data to locate anything that could skew the analysis.


We see blank cells, data that needs to be split into separate columns, and zero values in the Unit_Cost and Unit_Price columns.




Step 2: Finding Duplicates


The dataset may contain duplicate entries, and part of cleaning data is finding and removing them. In the Bike Sales dataset, the only column that must not contain the same value more than once is column A, Sales_Order#.


  1. Select column A, the column to check for duplicates.

  2. With column A selected, click the Conditional Formatting button on the Home tab and select Highlight Cells Rules > Duplicate Values, then click Done.

  3. Conditional formatting found two pairs of duplicates. Cells A2 and A3 both have the same sales order number of 261695. Also, cells A8 and A9 have the same sales order number of 261701. These types of duplicate entries can easily occur during manual data entry or when copying and pasting data into a worksheet.

Review the duplicated entries.


In the case of cells A2 and A3, it appears that the Sales_Order# of 261695 was incorrectly entered into cell A3. As a data analyst, I would need to go back to the source of the data and verify the correct sales order number.


For cells A8 and A9, a close review shows that both rows are exactly the same. Most likely, the same sale was entered twice.
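
For readers who prefer to script this check, here is a minimal sketch of the same duplicate search in Python. The list stands in for column A; only the order numbers 261695 and 261701 come from the walkthrough, and the other values are made up for illustration.

```python
from collections import Counter

# Stand-in for column A (cells A2..A9). Only 261695 and 261701 come from
# the walkthrough; the remaining order numbers are hypothetical.
sales_orders = [
    "261695", "261695", "261697", "261698",
    "261699", "261700", "261701", "261701",
]

def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, count in counts.items() if count > 1]

duplicates = find_duplicates(sales_orders)
print(duplicates)  # ['261695', '261701']
```

As in Excel, this only flags the candidates. Each one still has to be reviewed before anything is changed or deleted.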





Step 2 (continued): Fixing and Removing Duplicates


When duplicate data entries are identified, they need to be reviewed carefully before they are removed, so that relevant data is not accidentally deleted.

  1. To correct the duplicate entry in cell A3, change the Sales_Order# to 261696. Because the value is no longer duplicated, Excel should remove the conditional formatting from cells A2 and A3 automatically.

  2. To correct the duplicate rows 8 and 9, one of the rows needs to be removed. One way is to select the row and delete it. If a large dataset contains many duplicates, the Remove Duplicates tool is the better option.

  3. Click the Remove Duplicates tool on the Data tab.

  4. In the Remove Duplicates dialog box, select the Sales_Order# column and ensure that the My data has headers checkbox is selected. Click OK to continue.

Excel keeps the first row for each Sales_Order# value and removes any later rows that repeat it.
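
The keep-the-first-occurrence behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of how Excel implements the tool: the rows are dictionaries, and the Country values and the order numbers other than 261701 are made up.

```python
def remove_duplicates(rows, key):
    """Keep the first row for each distinct value of `key`; drop later repeats."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row[key] in seen:
            continue  # a row with this key was already kept
        seen.add(row[key])
        unique_rows.append(row)
    return unique_rows

# Hypothetical rows; only the duplicated order number 261701 is from the post.
rows = [
    {"Sales_Order#": "261700", "Country": "United States"},
    {"Sales_Order#": "261701", "Country": "United States"},
    {"Sales_Order#": "261701", "Country": "United States"},
    {"Sales_Order#": "261702", "Country": "Canada"},
]

cleaned = remove_duplicates(rows, "Sales_Order#")
print([row["Sales_Order#"] for row in cleaned])  # ['261700', '261701', '261702']
```

Note that the comparison uses only the key column, so two rows with the same order number but different details would also be collapsed. That is why the review step comes first.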


Step 3: Finding Empty Cells


There are many reasons why a cell might be blank. It could be human error from manual data entry or it could be a result of copying data from other sources. Context is key when determining what to do with empty cells. Sometimes, a data analyst will need to fill every blank cell in the data with the same constant value. Other times, there may be clues as to what should be in an empty cell from the surrounding data. Analysts may also have to go back to the source of the data to discover what the missing values should be.

To find empty cells, the Conditional Formatting tool can be used.

  1. Select the entire sheet by clicking the arrow in the top left corner of the worksheet to the left of column A.

  2. Click the Conditional Formatting tool in the Styles group of the Home tab and select Highlight Cells Rules > Text That Contains.

  3. In the Conditional Formatting window, under Rule Type, select Blanks. Change Format to Green fill with dark green text and click Done. Any blank cells in the worksheet should now be filled in green. There should be four highlighted cells: C12, G17, M23, and N24.

A data analyst would review the blank cells to see whether the missing data can be obtained. If it cannot, the only option may be to delete the rows with the missing data.
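
The same scan can be sketched in Python. This example assumes the sheet has been loaded as a list of rows with the header in the first row; the column names and values are hypothetical. Treating whitespace-only cells as blank is a choice made for this sketch.

```python
def column_letter(index):
    """Convert a zero-based column index to an Excel column letter (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters

def find_blank_cells(table):
    """Return Excel-style references (e.g. 'C12') for empty or whitespace-only cells.

    `table` is a list of rows; the first row is the header (sheet row 1).
    """
    blanks = []
    for row_number, row in enumerate(table, start=1):
        for col_index, value in enumerate(row):
            if value is None or str(value).strip() == "":
                blanks.append(f"{column_letter(col_index)}{row_number}")
    return blanks

# Hypothetical three-column extract of the sheet.
table = [
    ["Sales_Order#", "Date", "Customer_Age"],
    ["261695", "2021-01-02", "39"],
    ["261696", "", "44"],
    ["261697", "2021-01-04", None],
]

print(find_blank_cells(table))  # ['B3', 'C4']
```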



Step 4: Data Parsing from Text to Column

Some cells contain multiple data elements separated by a delimiter such as a comma. The Product_Description column is an example: each value combines the bike model and the bike color. The Text to Columns function can split such values so that each part is displayed in its own column.

In this step, you will parse the Product_Description column to move the bike color into a separate column.

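  1. Start by adding a new blank column to the right of column M, Product_Description. This new column becomes column N. Text to Columns writes the split values into the cells to the right, so the blank column keeps existing data from being overwritten.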



  2. Highlight column M, Product_Description.

  3. On the Data tab, click the Text to Columns button in the Data Tools group.

  4. In the Text to Columns window, select Comma as the only delimiter and click Apply.

  5. All of the bike colors should now be in column N.

With the colors moved to the new column, column M now contains only the bike models.
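
As a rough Python equivalent of the split performed in this step, the sketch below divides each description at the comma and pads short rows so every row has the same number of columns. The product descriptions are hypothetical examples of the "model, color" pattern. The sketch also strips the space that follows the comma, which goes a little beyond a plain delimiter split.

```python
def text_to_columns(values, delimiter=","):
    """Split each value on the delimiter and pad rows to the same width."""
    split_rows = [
        [part.strip() for part in value.split(delimiter)]
        for value in values
    ]
    width = max((len(parts) for parts in split_rows), default=0)
    return [parts + [""] * (width - len(parts)) for parts in split_rows]

# Hypothetical Product_Description values; the last one has no color.
descriptions = ["Mountain-200, Black", "Road-350-W, Yellow", "Touring-1000"]

for model, color in text_to_columns(descriptions):
    print(model, "|", color)
```

A value without a comma, like the last one, ends up with an empty color cell, which would then show up in a blank-cell check.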



Step 5: Removing Extra Spaces

When data is pasted from external sources into an Excel worksheet, cells often contain extra spaces that need to be removed so that searches and queries return accurate results. The TRIM function removes leading and trailing spaces and reduces repeated spaces between words to a single space. In this step, you will use the LEN and TRIM functions to find and clean up these cells.

  1. Insert two new blank columns to the right of column I, Country. These will become columns J and K.

  2. Name the two new columns Length and TRIM, respectively.




  3. In the Length column, in cell J2, enter the function =LEN(I2) to see how many characters are in cell I2. The result should be 13, which is the number of characters in “United States”, counting the space in the middle.



  4. Now copy this LEN function from cell J2 down through cell J8. Notice that some of the other cells with United States show a length of 14 characters instead of 13. This is because each of these cells contains an extra space. In cell I4 the extra space is at the front. In cell I6 it is between the words United and States. In cell I8 it is at the end, where it is not readily noticeable.



To remove the extra spaces in these cells, use the TRIM function in column K.

  1. In cell K4, enter =TRIM(I4). The function removes the leading space from cell I4.

  2. Copy the TRIM function to cells K6 and K8. The function removes the spaces from these cells.





  3. To ensure that any extra spaces in the Country column are removed, copy and paste the TRIM function to all the cells in column K. Column K now holds the Country values with all extra spaces removed. These values need to be pasted back into the Country column.

  4. Select cells K2 through K89 and copy them.

  5. Select cells I2 through I89, click the down arrow under the Paste tool, and then select Paste Values so that column I receives the cleaned text rather than the formulas.
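
The LEN and TRIM check above can be reproduced in Python. The helper below is a sketch that mimics what TRIM does to ordinary space characters: it strips them from both ends and collapses runs between words to one. The name `excel_trim` is made up for this example, and the four values mirror cells I2, I4, I6, and I8 from the walkthrough.

```python
import re

def excel_trim(text):
    """Strip leading/trailing spaces and collapse repeated spaces to one.

    Only the ordinary space character is handled; tabs and non-breaking
    spaces are left untouched.
    """
    return re.sub(r" {2,}", " ", text.strip(" "))

# Clean value, leading space, doubled inner space, trailing space.
countries = ["United States", " United States", "United  States", "United States "]

for value in countries:
    trimmed = excel_trim(value)
    print(len(value), "->", len(trimmed), repr(trimmed))
```

The first value has a length of 13 and the other three have a length of 14. After trimming, all four are the 13-character string 'United States'.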


Step 6: Highlight Possible Errors

In this step, you will highlight all the unit costs and unit prices that are zero. A unit cost or unit price of zero is almost certainly an entry error and will skew the analysis, so it is important to find these values and fix them.

  1. Select the Unit_Cost column and the Unit_Price column.

  2. Click the Conditional Formatting tool > Highlight Cells Rules > Equal To.

  3. In the Conditional Formatting window, enter 0 in the Equal To box and click Done to highlight all the bikes with a unit cost or unit price of zero. Cell R6 under Unit_Cost and cell S10 under Unit_Price should be highlighted in red.



A data analyst would need to decide whether to enter corrected values into these cells or to delete the two rows. In this example, we know the values that should be in these cells, because they appear in other rows of the sheet.


Correct the issue by entering $1252 in cell R6 and $769 in cell S10.
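
Finding the zeros and filling them from other rows could be sketched as below. This assumes that rows describing the same product carry the same cost and price, which is how the correct values were identified above. The product names and all figures except 1252 and 769 are hypothetical.

```python
def find_zero_cells(rows, columns):
    """Return (row_index, column) pairs where the value equals zero."""
    return [
        (index, column)
        for index, row in enumerate(rows)
        for column in columns
        if row[column] == 0
    ]

def fill_from_matching_rows(rows, key, columns):
    """Replace zeros with the nonzero value from another row sharing `key`.

    Zeros without a matching row are left as they are for manual review.
    If matching rows disagree, the last nonzero value seen is used.
    """
    for column in columns:
        known = {row[key]: row[column] for row in rows if row[column] != 0}
        for row in rows:
            if row[column] == 0 and row[key] in known:
                row[column] = known[row[key]]
    return rows

# Hypothetical rows; only the values 1252 and 769 come from the walkthrough.
rows = [
    {"Product": "Bike A", "Unit_Cost": 1252, "Unit_Price": 2295},
    {"Product": "Bike A", "Unit_Cost": 0, "Unit_Price": 2295},
    {"Product": "Bike B", "Unit_Cost": 420, "Unit_Price": 769},
    {"Product": "Bike B", "Unit_Cost": 420, "Unit_Price": 0},
]

price_columns = ["Unit_Cost", "Unit_Price"]
zeros = find_zero_cells(rows, price_columns)
print(zeros)  # [(1, 'Unit_Cost'), (3, 'Unit_Price')]

fill_from_matching_rows(rows, "Product", price_columns)
print(rows[1]["Unit_Cost"], rows[3]["Unit_Price"])  # 1252 769
```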





Step 7: Spell Check

Spell check is available on the Review tab. You can check the spelling by cell, column, row, or sheet. Running it ensures that spelling errors do not cause searches or queries to return incorrect results.

  1. Select all columns with text values.

  2. On the Review tab, click the Spelling tool. Words not found in the dictionary are displayed in the Spelling dialog box that appears on the right side of the worksheet.

  3. Click Ignore for any of the column names that are identified as not in the dictionary.

  4. Select Change All for any words found that are misspelled. (Change All is found under the drop arrow at the right end of the Suggestions box.)
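
Outside Excel, a rough substitute for spell check on a column with a limited set of valid values is to compare each entry against a list of known-good spellings. The sketch below uses Python's standard difflib module for that. It is a different technique from Excel's dictionary-based spell check, and the vocabulary and the misspelled values are made up for illustration.

```python
import difflib

def suggest_corrections(values, vocabulary, cutoff=0.8):
    """Map each value not in the vocabulary to its closest known spelling.

    Values with no sufficiently similar match map to None.
    """
    suggestions = {}
    for value in values:
        if value in vocabulary:
            continue  # already spelled correctly
        matches = difflib.get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
        suggestions[value] = matches[0] if matches else None
    return suggestions

# Hypothetical known-good spellings and column values.
vocabulary = ["United States", "Canada", "Australia", "Germany"]
values = ["United States", "Untied States", "Canda", "Australia", "Xyz"]

suggestions = suggest_corrections(values, vocabulary)
print(suggestions)
# {'Untied States': 'United States', 'Canda': 'Canada', 'Xyz': None}
```

As with Change All, the suggestions should be reviewed before they are applied, since a close match is not always the intended word.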


