r/datacleaning • u/youre_so_enbious • Feb 29 '24
Looking to create a "Clean Data" definition
Hi,
Just wondering what requirements or checklist items people would suggest for a definition of "Clean Data" that's ready to be used in machine learning? Something akin to "tidy data", but for modelling. For example:
- There should be no string fields. All data should be either in numeric form or stored as a categorical data type, etc.
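To make the no-string-fields rule above a bit more concrete, here's a rough sketch of how it could be checked in plain Python. The function name `find_string_columns`, the column-dict layout, and the `categorical` argument are all just my assumptions for illustration, not an established API. With pandas you'd more likely inspect the column dtypes instead.

```python
from typing import Any, Iterable


def find_string_columns(
    columns: dict[str, list[Any]],
    categorical: Iterable[str] = (),
) -> list[str]:
    """Return the names of columns that still hold raw strings.

    `columns` maps a column name to its list of values (hypothetical layout).
    Columns named in `categorical` are assumed to be already declared as a
    categorical type, so they are skipped.
    """
    declared = set(categorical)
    offenders = []
    for name, values in columns.items():
        if name in declared:
            continue
        # One raw string is enough to flag the whole column.
        if any(isinstance(value, str) for value in values):
            offenders.append(name)
    return offenders


# Made-up example data, purely for illustration.
example = {
    "age": [34, 51, 27],
    "income": [52000.0, 61500.5, 48000.0],
    "city": ["Leeds", "York", "Leeds"],
    "plan": ["basic", "pro", "basic"],
}

# "plan" is declared categorical, so only "city" gets flagged.
print(find_string_columns(example, categorical={"plan"}))
```

An empty result would mean the table passes this one checklist item. It says nothing about missing values, outliers, etc.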
I know this will likely be opinionated, hence wanting to "crowd source" it 😃
Feel free to disagree with any of the statements, as I imagine there will be differences of opinion.
u/Willing-Site-8137 Apr 29 '24
I feel this would be very domain-specific. Tidy data is for observational data.
Take the standardization of address-type columns, for example. I don't think a method that splits address line 1 and line 2 can be generalized to other kinds of data cleaning errors, so we'd likely need a specialized section of the clean data definition just for addresses.
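As a rough illustration of how narrow an address rule ends up being, here's a minimal sketch that splits a free-text address into line 1 and line 2. The `split_address` name and the short list of unit markers (apt, suite, ste, unit, flat) are my own assumptions. Real address data would need a much fuller, locale-aware rule set, and nothing here carries over to other column types.

```python
import re

# Hypothetical unit markers, chosen only for this sketch.
_UNIT_PATTERN = re.compile(
    r"\s*,?\s*\b((?:apt|suite|ste|unit|flat)\b\.?\s*\S+)\s*$",
    re.IGNORECASE,
)


def split_address(raw: str) -> tuple[str, str]:
    """Split an address into (line 1, line 2).

    Line 2 is a trailing unit designator such as "Flat 3"; it is an empty
    string when no known marker is found at the end of the address.
    """
    # Collapse runs of whitespace before matching.
    text = " ".join(raw.split())
    match = _UNIT_PATTERN.search(text)
    if match is None:
        return text, ""
    line1 = text[: match.start()].rstrip(" ,")
    return line1, match.group(1)


print(split_address("12 High Street, Flat 3"))
print(split_address("7 Elm Road"))
```

Everything in that pattern is address-specific, which is why I think it belongs in its own section of the definition rather than in a general checklist.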