How to deal with bad data from a data analytics perspective

I have a couple of columns (showing only two for reference) as shown below,

Date        Region          
Feb 2021    North america  
Jan 2021    South america
Kinsley     Norway

Here, you can see that the Date column contains an invalid value ("Kinsley") instead of a date, and I am trying to find the best way to deal with such data. You might suggest deleting the whole row or only that value, but I am not sure that is the right approach, as I might lose important information about that specific row.

Please suggest the best approach.

I am using Alteryx to read this data from an Excel file.


3 Replies

PK Poovarasan Kandasamy Syncfusion Team June 1, 2023 03:37 AM UTC

From a data analytics perspective, domain knowledge plays a crucial role in data wrangling. Sometimes a dataset has no missing values but contains many invalid values, which need to be identified and removed manually.



In Bold BI, if any invalid values are present in the date or integer columns, the entire column will be treated as a string column to avoid data loss. For the above sample file, all the values will be extracted into Bold BI and there will be no loss of data.

 

However, invalid data will cause errors when this string column is converted to a date column, so these values should be cleaned up manually before the data is loaded into Bold BI.
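
As a rough illustration of finding the values that would fail such a conversion, here is a minimal Python sketch (not Bold BI or Alteryx functionality). It assumes the valid dates follow the "Feb 2021" pattern from the sample and that month abbreviations are in English; the helper name parse_month_year is made up for this example.

```python
from datetime import datetime

def parse_month_year(value):
    """Return a datetime for strings like 'Feb 2021', or None if the value does not parse."""
    try:
        # "%b %Y" = abbreviated month name + four-digit year (assumed format)
        return datetime.strptime(value.strip(), "%b %Y")
    except (ValueError, AttributeError):
        # ValueError: text is not a date; AttributeError: value is not a string (e.g. None)
        return None

# Date column from the sample in the question
dates = ["Feb 2021", "Jan 2021", "Kinsley"]

# Values that would break a string-to-date conversion
invalid = [value for value in dates if parse_month_year(value) is None]
print(invalid)
```

Listing the failing values first makes it easier to decide, with domain knowledge, whether each one can be corrected or has to be blanked out.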




EW Eden Wheeler June 21, 2023 11:40 AM UTC

When dealing with bad data from a data analytics perspective, it's important to handle it appropriately to maintain data integrity and ensure accurate analysis. Here are some suggestions for dealing with the specific issue you mentioned:

  1. Validate and clean the data: Identify and isolate the rows with invalid or incorrect data. In this case, you can filter out rows where the Date column does not contain a valid date value.
  2. Assess the impact: Consider the importance of the affected rows and the potential impact on your analysis. If the rows with bad data contain critical information, deleting them entirely may not be the best solution.
  3. Impute or replace the bad data: Instead of deleting the entire row, you can replace the incorrect data in the Date column with a missing or placeholder value, such as "N/A" or "Unknown." This way, you can retain the other valuable information in the row while clearly indicating the issue with the Date column.
  4. Investigate the source of the bad data: Determine why the data quality issue occurred in the first place. It could be due to data entry errors, system glitches, or other factors. Understanding the root cause can help you prevent similar issues in the future.
  5. Document your data cleaning process: Keep a record of the steps you took to handle the bad data. This documentation will provide transparency and ensure that others working with the data are aware of the changes made.
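
Steps 1, 3, and 5 above could be sketched roughly as follows in plain Python (this is not an Alteryx workflow). The assumptions here are mine: dates follow the "Feb 2021" pattern, None stands in for the "N/A" placeholder, and the names clean_rows, Date_valid, and audit_log are invented for illustration.

```python
from datetime import datetime

# Sample rows from the question
rows = [
    {"Date": "Feb 2021", "Region": "North america"},
    {"Date": "Jan 2021", "Region": "South america"},
    {"Date": "Kinsley", "Region": "Norway"},
]

def clean_rows(rows, date_format="%b %Y"):
    """Keep every row; blank out unparseable dates and record each change."""
    cleaned, audit_log = [], []
    for index, row in enumerate(rows):
        new_row = dict(row)  # copy so the original data stays untouched
        try:
            new_row["Date"] = datetime.strptime(row["Date"].strip(), date_format).date()
            new_row["Date_valid"] = True
        except ValueError:
            # Step 3: replace only the bad value, not the whole row
            new_row["Date"] = None
            new_row["Date_valid"] = False
            # Step 5: document what was changed and what the original value was
            audit_log.append({"row": index, "column": "Date", "original": row["Date"]})
        cleaned.append(new_row)
    return cleaned, audit_log

cleaned, audit_log = clean_rows(rows)
for entry in audit_log:
    print(entry)
```

Keeping the original value in the log means the replacement can be reversed later if the source of the bad value (step 4) turns out to be fixable.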

By validating, cleaning, and appropriately handling bad data, you can maintain the integrity of your analysis while minimizing the risk of losing important information.



JG Jimmy Goheen replied to Poovarasan Kandasamy June 24, 2023 09:38 AM UTC

Correct, domain knowledge is indeed crucial in data wrangling and data analytics.

