Overview of the Prep interface
The Prep screen houses all the tools needed to transform and clean your data before either interacting with it in Explore or building your model. Those features are detailed here.
Chat Data Prep, built on GPT-4, furthers the goal of democratizing access to data by taking even more code out of the equation. Our ML engine parses plain-text requests and transforms your incoming data before you even train your model.
Watch this video walkthrough.
Simply click on the Chat Data Prep Icon in the upper left of your Table view to begin.
Chat Data Prep Entry Screen
As with any ML-based chat solution, your options for entry are vast. If you are familiar with code-based approaches to data manipulation, you can lean on that phrasing to get very specific outcomes, but plain English works just as well. In the example above, suppose we want to compare only the clean university and high school data, not the other, less well-defined options. Simply entering 'remove all data except university and high school from education' narrows the column to those two values, which can then be fed through the flow.
Some other things you could try:
- Remove rows with any empty columns
- Remove rows with typos in any column
- Remove any rows with a value less than 18 in the age column
- Generate a text summary of age, job, and education
- Combine month and year columns
Feel free to test the limits! Once your data is cleaned up in this way, you will have an even easier time training your model to perform the way you want.
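If you think in code, the prompts above map roughly onto familiar dataframe operations. A minimal pandas sketch, with hypothetical sample data and assumed column names, of what three of those requests do behind the scenes:

```python
import pandas as pd

# Hypothetical sample data; column names are assumed for illustration.
df = pd.DataFrame({
    "age": [25, 17, 40, None],
    "job": ["engineer", "student", "teacher", "analyst"],
    "education": ["university", "high school", "university", None],
})

# "Remove rows with any empty columns"
no_blanks = df.dropna()

# "Remove any rows with a value less than 18 in the age column"
adults = df[df["age"] >= 18]

# "Remove all data except university and high school from education"
schools = df[df["education"].isin(["university", "high school"])]
```

The point of Chat Data Prep is that you never have to write this yourself; the plain-English request produces the equivalent transform.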
Let's go through an example of how Chat Data Prep functions. First we will select data to transform; for this example we will use the customer churn demo. Simply go to the home page of the Akkio app and select it from the list of options to follow along. Not there? Some users are set up in a way that does not auto-populate the example flows. Not to worry! It can be downloaded from the help center in the upper right of the screen. Download it and start a new flow using a table.
From the table screen you can see the Chat Data Prep icon. Let's start with a transform we know. Say we want to focus on more recent accounts, the ones we are losing after a few years. Most internet companies have introductory offers that expire, so this could be a useful place to focus.
With that in mind, we ask Chat Data Prep 'Remove anyone with tenure over 36'. Tenure is measured in months, so this will give us only accounts that are three years old or less.
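For reference, this request is equivalent to a simple row filter. A sketch with hypothetical churn-style data (the real demo dataset has more columns; 'tenure' in months matches the walkthrough):

```python
import pandas as pd

# Hypothetical churn-style data; 'tenure' is in months, as in the demo.
df = pd.DataFrame({"customerID": ["A", "B", "C"], "tenure": [12, 48, 36]})

# "Remove anyone with tenure over 36" keeps accounts of three years or less.
recent = df[df["tenure"] <= 36]
```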
As you can see, the AI understood completely and we are safe to execute the command. After applying, the transform in effect is noted in gray in the upper left, along with its number in the sequence.
Note where it says 'Remove Outliers'
Expanding that feature by clicking on it gives us all the relevant information about that particular transform.
Next we can try a transform that may have less success. Let's say we want to anonymize the data a bit and remove gender as a factor. We type in 'remove men and women', and the following error is generated.
The AI has interpreted our request as a removal of rows with those characteristics, which in this case would be all of the data. We need to reword: instead, we can ask it to remove the gender column. First, though, for fun, let's test our friendly neighborhood LLM. Instead of asking it to remove the gender column, we ask it to remove demographic data. As you can see, it understands what that means and would remove all identifying information that is independent of our product. Not what we want this time, but a great use for the tool.
Instead we finish off with 'remove gender' and the change is what we want. Apply the transform and we now have two active transforms on the data.
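The ambiguity here comes down to rows versus columns. A hypothetical sketch of the two readings the AI had to choose between (column names assumed for illustration):

```python
import pandas as pd

# Hypothetical data with a gender column, as in the walkthrough.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "tenure": [12, 24, 6],
})

# Row interpretation of "remove men and women": drops every row.
rows_removed = df[~df["gender"].isin(["Male", "Female"])]

# Column interpretation of "remove gender": keeps all rows, drops the column.
col_removed = df.drop(columns=["gender"])
```

The row interpretation leaves an empty dataset, which is why the AI flagged an error rather than silently deleting everything.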
Finally, if we want to take this data out for other uses, or to back up the transformed set elsewhere, there is the option to download the transformed data as a CSV.
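The CSV download gives you a plain file containing the data with all active transforms applied. In code terms it is the equivalent of a simple export and can be round-tripped anywhere that reads CSV (file name here is illustrative):

```python
import pandas as pd

# Hypothetical transformed data; the file name is an assumption.
df = pd.DataFrame({"tenure": [12, 24]})
df.to_csv("transformed_data.csv", index=False)

# Anything that reads CSV can pick the export back up.
round_trip = pd.read_csv("transformed_data.csv")
```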
Data cleanliness is a huge issue for companies with years of historical data, representing hours of work to sift through. Using the prebuilt data cleaning options you can quickly perform basic data cleanliness steps before doing any specific work on your datasets. The options are shown and described below.
- Remove Outliers - Remove values in numerical columns that are more than 3 standard deviations from the mean, higher than the 99th percentile, or lower than the 1st percentile. These values are then replaced with estimates (imputed).
- Remove Inliers - Remove values in numerical columns that are anomalously frequent. These values are then replaced with estimates (imputed).
- Remove Unexpected Nulls - Remove rows with null values in columns that are at least 99% filled in.
- Replace Excess Categories with "Other" - Replace values in categorical columns that are not among the top 32 most common values with "Other".
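To make the outlier rule concrete, here is a rough pandas sketch of the logic described above. The thresholds match the description (3 standard deviations, 1st/99th percentile); mean imputation is an assumed replacement strategy, and note that the percentile rule necessarily flags the most extreme values on both ends, so the minimum of this toy series is caught too:

```python
import pandas as pd

# Hypothetical numeric column with one extreme value.
s = pd.Series([10, 12, 11, 9, 10, 11, 500], dtype=float)

lo, hi = s.quantile(0.01), s.quantile(0.99)
mean, std = s.mean(), s.std()
is_outlier = (s < lo) | (s > hi) | ((s - mean).abs() > 3 * std)

# Replace flagged values with an estimate. Mean imputation is an assumed
# choice here; Akkio's actual imputation strategy is not documented.
cleaned = s.mask(is_outlier, s[~is_outlier].mean())
```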
- Remove Constant Columns - Remove columns that have the same value for every row.
- Remove Mostly Unreadable Numerical Columns - Remove numerical columns where at least 99% of the values are unreadable.
- Remove Mostly Unreadable Date Columns - Remove date columns where at least 99% of the values are unreadable.
- Remove Mostly Blank Columns - Remove columns that are at least 99% blank.
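The column-level cleanups above can be sketched the same way. This is an illustrative pandas equivalent, not Akkio's implementation; the blank threshold is lowered from the documented 99% just so the toy example fires:

```python
import pandas as pd

# Hypothetical frame illustrating the column-level cleanups.
df = pd.DataFrame({
    "constant": ["x", "x", "x"],        # same value in every row
    "mostly_blank": [None, None, "a"],  # 2/3 blank; Akkio's cutoff is 99%
    "useful": [1, 2, 3],
})

# Remove constant columns (a single distinct value across all rows).
df = df.loc[:, df.nunique(dropna=False) > 1]

# Remove columns above a blank-value threshold (0.66 for this toy example;
# the documented threshold is 99%).
threshold = 0.66
df = df.loc[:, df.isna().mean() <= threshold]
```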