Clean

Settings for auto-cleaning your data

Data cleanliness is a major problem for companies with years of historical data, which can take hours to sift through by hand. The prebuilt data cleaning options let you quickly apply basic cleaning steps before doing more specific work on your datasets. The options are described below.

Default Settings

Standardize Date Columns - Convert all date columns to ISO 8601 standard format (YYYY-MM-DD HH:MM).

Remove Unexpected Nulls - Remove rows that have null values in columns that are at least 99% filled in.

Replace Excess Categories with "Other" - In categorical columns, replace any value that is not among the 32 most common values with "Other".

Remove Constant Columns - Remove columns that have the same value in every row.

Remove Mostly Unreadable Numerical Columns - Remove numerical columns with at least 99% unreadable values.

Remove Mostly Unreadable Date Columns - Remove date columns with at least 99% unreadable values.

Remove Mostly Blank Columns - Remove columns in which at least 99% of the values are blank.
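To make the default settings more concrete, here is a minimal Python sketch of how several of them could behave on a small in-memory table. It is an illustration only, not the product's implementation. The table layout (a dict mapping column names to lists of values), every function name, the list of accepted input date formats, and the definition of "blank" (None or whitespace-only text) are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical input formats; the formats the product recognizes are not documented here.
ASSUMED_DATE_FORMATS = ("%Y-%m-%d %H:%M", "%Y-%m-%d", "%m/%d/%Y %H:%M", "%m/%d/%Y")


def standardize_date(text):
    """Return text as 'YYYY-MM-DD HH:MM', or None if no assumed format matches."""
    for fmt in ASSUMED_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d %H:%M")
        except (TypeError, ValueError):
            continue
    return None


def is_blank(value):
    """Assumption: a value is blank if it is None or whitespace-only text."""
    return value is None or str(value).strip() == ""


def remove_unexpected_nulls(table, min_filled=0.99):
    """Drop rows that are blank in any column that is at least min_filled full."""
    n_rows = len(next(iter(table.values()), []))
    if n_rows == 0:
        return table
    mostly_filled = [
        name for name, col in table.items()
        if sum(not is_blank(v) for v in col) / n_rows >= min_filled
    ]
    keep = [
        i for i in range(n_rows)
        if not any(is_blank(table[name][i]) for name in mostly_filled)
    ]
    return {name: [col[i] for i in keep] for name, col in table.items()}


def replace_excess_categories(values, top_n=32, other="Other"):
    """Keep the top_n most common values and replace everything else with other."""
    keep = {value for value, _ in Counter(values).most_common(top_n)}
    return [v if v in keep else other for v in values]


def remove_constant_columns(table):
    """Drop columns whose value is the same in every row."""
    return {name: col for name, col in table.items() if len(set(col)) > 1}


def remove_mostly_blank_columns(table, threshold=0.99):
    """Drop columns in which at least threshold of the values are blank."""
    return {
        name: col for name, col in table.items()
        if not col or sum(is_blank(v) for v in col) / len(col) < threshold
    }
```

For example, with a 100-row table in which a "city" column has a single missing value, remove_unexpected_nulls drops that one row, because the column is 99% filled and the null is therefore treated as unexpected. A column that is only half filled would leave all rows in place.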

Other Cleaning Options

Flag Outliers - For each numerical column, add a column that flags whether the numerical value in that row is more than three standard deviations from the mean, higher than the 99th percentile, or lower than the 1st percentile.

Flag Inliers - For each numerical column, add a column that flags whether the numerical value in that row is prevalent in the dataset.

Clamp Outliers - Replace values in numerical columns that are more than three standard deviations from the mean, higher than the 99th percentile, or lower than the 1st percentile with the nearest value in the range.
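The outlier rules above can be sketched in a few lines of Python. This is a hedged illustration rather than the product's actual logic. The function names are hypothetical, the population standard deviation and the default percentile method of statistics.quantiles are assumptions, and "the nearest value in the range" is interpreted here as the nearest boundary of the allowed range.

```python
import statistics


def outlier_bounds(values, z=3.0):
    """Return (low, high) so that values outside this range count as outliers.

    A value is an outlier if it is more than z standard deviations from the
    mean, above the 99th percentile, or below the 1st percentile, so the
    allowed range is the overlap of the two intervals.
    """
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)  # assumption: population standard deviation
    cuts = statistics.quantiles(values, n=100)  # 99 cut points: 1st to 99th percentile
    p1, p99 = cuts[0], cuts[-1]
    return max(mean - z * spread, p1), min(mean + z * spread, p99)


def flag_outliers(values):
    """Return one boolean per value: True where the value is an outlier."""
    low, high = outlier_bounds(values)
    return [v < low or v > high for v in values]


def clamp_outliers(values):
    """Replace each outlier with the nearest boundary of the allowed range."""
    low, high = outlier_bounds(values)
    return [min(max(v, low), high) for v in values]
```

Called on a column such as list(range(1, 101)) + [10_000], flag_outliers marks the final value as an outlier and clamp_outliers pulls it down to the upper bound, while typical values such as 50 pass through unchanged.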

Execute Cleaning

Once you have selected the options you want, click Preview to see how they will change your data, then apply the changes when you are satisfied.
