Using ChatGPT for Machine Learning (ML): a simple example

Diogo Camacho

I am a formally trained biochemist with a Ph.D. in Computational Systems Biology, focused on machine learning applications for the analysis of large-scale omics data.

Machine learning. The words of the day. The promise of AI: how it will change everything we do, from self-driving cars to robots. Every organization wants it, demand for data scientists is through the roof, and with it comes the need to build machine learning models at an unprecedented pace. What if AI could help us with this last step, or at least cut some development time here?

In a conversation with Justin Belair, we thought it would be fun to see how we could use one of these large language models that are all the rage to build a machine learning model, and test how it could help ML research in a variety of applications. I will not give anything away, so that you can walk through this analysis with me and experience how we would tackle a machine learning problem. Let’s dive in! (GitHub here)

A good prompt is all you need

By now you have played around with ChatGPT and were able to rewrite lyrics for pop songs, get the latest recipe for sourdough bread, or get a summary of a book that you always wanted to read. Yeah, it’s fun. And in these explorations you probably realized that the more accurate your prompt is, the more targeted and insightful the response will be. This is not a post about LLMs (we still need to cover those somewhere else), but let’s focus on that last point: the “accurate prompt”.

I went into Bing and Microsoft Copilot, chose the “More Precise” option in my conversation with the bot, and input the following prompt:

[Screenshot: the prompt]

It should be clear to the model what I want. Now, ChatGPT has a lot of information that comes from all corners of the internet, so this should be a simple task: between Stack Overflow, the R/tidyverse community, GitHub repositories, and bookdown examples, all of the information needed to get a simple analysis and a model up and running should be easy to find. This is the code the bot returned:

[Screenshot: code returned by the bot]

The code looks good, and those of you who are fluent in R/tidyverse should be able to sketch out in your head what’s going on. What’s more interesting is what the bot added after generating the code:

[Screenshot: the bot's explanation of the code]

It gives a detailed explanation of what it is attempting to do and why. It also highlights that, for this to work, you may need to install specific libraries, otherwise things will fall apart. After all, the model doesn’t know anything about your personal/cloud computing environment. I will not be making any changes to the code, so that we can check how well this works.

Step by step on generated code

I’ll skip the part on loading the libraries, since I have all of them, and the iris data set comes standard with R, so all that is taken care of. Let’s start at row 10, the exploratory data analysis bit.

Exploratory data analysis

For this bit ChatGPT just followed what you would do as an undergraduate student in your freshman stats course: look at the data.

[Screenshot: exploratory data analysis code]
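The screenshot isn't reproduced here, but the EDA commands were of this general shape. A minimal sketch, assuming the bot used base R's summary() and pairs() plus a ggplot2 box plot; the variable names and plotting choices are illustrative, not the bot's exact code:

```r
library(tidyverse)

# Summary statistics for every column of the iris data set
summary(iris)

# Pair plot: every variable against every other, colored by species
pairs(iris[, 1:4], col = iris$Species, pch = 19)

# Box plots of each measurement, split by species
iris %>%
  pivot_longer(-Species, names_to = "measurement", values_to = "value") %>%
  ggplot(aes(x = Species, y = value, fill = Species)) +
  geom_boxplot() +
  facet_wrap(~ measurement, scales = "free_y")
```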

We run those commands. First, let’s get some summary statistics on the data:

[Screenshot: summary statistics for the iris data]

Cool. The bot also chose to do some plotting of the data. It starts by plotting every variable against every other variable (that’s the “Pair plot” bit):

[Figure: pairs plot of the iris variables, colored by species]

With colors, for visualization best practices. Not too shabby, ChatGPT. After this, we do a box plot of the different variables:

[Figure: box plots of the iris variables]

So far so good. We got some summary statistics (maybe we could expect more, but, again, it all comes down to the prompt) and some plots to support our exploratory data analysis. Let’s now see how well the bot did the ML part.

Data processing

I asked the bot to do a few things before the classification model: process the data and follow all the necessary ML best practices, using one of two packages, caret or tidymodels. The bot opted for tidymodels:

[Screenshot: data preprocessing code]
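The preprocessing code in the screenshot was along these lines. A minimal sketch, assuming rsample's initial_split() and a recipes normalization step; object names like iris_recipe are mine, not necessarily the bot's:

```r
library(tidymodels)

set.seed(123)

# 75% training / 25% testing split
iris_split <- initial_split(iris, prop = 0.75)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

# Recipe: predict Species from all other variables, normalizing all predictors
iris_recipe <- recipe(Species ~ ., data = iris_train) %>%
  step_normalize(all_predictors())
```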

Let’s deconstruct this, to see what we’re doing:

  • We start by splitting the data. The bot does a 75% training – 25% testing split, which is pretty common. I like to do 80/20, but let’s see where this goes.
  • After the split, we get the data into different data frames. Check.
  • The bot then uses a tidymodels recipe to normalize all the predictors with the “step_normalize(all_predictors())” step. Good.

Model specifications

I asked the bot to build me an XGBoost model for the classification of the data, and it followed suit by writing this piece of code:

[Screenshot: model specification code]
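For reference, a tidymodels XGBoost specification typically looks like the sketch below; this is my reconstruction of what the screenshot likely contained (default hyperparameters, no tuning), not the bot's verbatim output:

```r
library(tidymodels)

# XGBoost classifier with default hyperparameters
xgb_spec <- boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("classification")
```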

None of this is wrong, but there’s one small aspect that, in hindsight, I forgot to ask for: a step for optimization of the model. This would involve tuning the number of trees and the tree depth of the model, both of which impact performance. More on this in a bit.

After this, it’s a matter of putting things together and running the analyses:

[Screenshot: workflow and model fitting code]
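Putting the pieces together: the workflow bundles the recipe and the model specification, and fit() trains on the training split. A minimal sketch, reusing the hypothetical iris_recipe, xgb_spec, and iris_train names from the sketches above:

```r
library(tidymodels)

# Bundle preprocessing and model specification, then fit on the training data
iris_workflow <- workflow() %>%
  add_recipe(iris_recipe) %>%
  add_model(xgb_spec)

iris_fit <- iris_workflow %>%
  fit(data = iris_train)
```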

And there you have it. We run the data processing and feed it to the model (that’s the workflow piece), and then we fit the model to the training data. Simple enough. Let’s see how the model performs.

Evaluating the model

For the last steps, the bot makes predictions on the held-out (test) data set, and then uses the tidymodels framework to put the results together:

[Screenshot: model evaluation code]
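The evaluation step has this general shape: predict on the held-out data, attach the true labels, and compute metrics. A minimal sketch using yardstick helpers, with object names carried over from the earlier sketches; the bot's exact code may differ:

```r
library(tidymodels)

# Predict classes on the test set and bind the true labels back on
iris_preds <- predict(iris_fit, new_data = iris_test) %>%
  bind_cols(iris_test %>% select(Species))

# Overall accuracy and confusion matrix on the held-out data
iris_preds %>% accuracy(truth = Species, estimate = .pred_class)
iris_preds %>% conf_mat(truth = Species, estimate = .pred_class)
```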

First, let’s look at the predictions:

[Screenshot: predictions on the test set]

Ok. It’s a multiclass classifier (we are trying to fit the data to 3 classes – the different species of iris). In the first column we have the predicted class, and in the last column the actual class (which we already knew). There are no misclassifications in these results. This is good and bad. On the one hand, we got everything right. On the other hand, the shadow of overfitting is looming large (meaning: our model is so closely tied to these data that it cannot predict anything else with accuracy).

Edits and improvements

First the good:

  • Giving a relatively detailed prompt to ChatGPT allowed me to get the code for a machine learning model in record time
  • There are no gaps in the analysis: it is very clear why certain steps are done
  • The code is decently commented (even if only sparsely)

Where we can improve:

  • I didn’t specify to the bot that I wanted some optimization of the model. This would help with model generalizability and avoid pitfalls like overfitting (see the sketch after this list for what that tuning step could look like)
  • I only asked for an XGBoost model. Ideally, we could try a variety of models for the task at hand and choose the best model architecture from there
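For the first point, a hyperparameter tuning step in tidymodels could look roughly like this. A minimal sketch, assuming we tune the number of trees and the tree depth over a small grid with 5-fold cross-validation; this illustrates the missing step and is not something the bot produced (iris_recipe and iris_train are the hypothetical objects from the earlier sketches):

```r
library(tidymodels)

# Mark the number of trees and the tree depth as tunable
xgb_tune_spec <- boost_tree(trees = tune(), tree_depth = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

tune_wf <- workflow() %>%
  add_recipe(iris_recipe) %>%
  add_model(xgb_tune_spec)

# 5-fold cross-validation on the training data
iris_folds <- vfold_cv(iris_train, v = 5)

# Grid search over a small regular grid of candidate values
xgb_results <- tune_grid(
  tune_wf,
  resamples = iris_folds,
  grid = grid_regular(trees(range = c(100, 1000)),
                      tree_depth(range = c(2, 8)),
                      levels = 3)
)

# Keep the best setting by accuracy, finalize, and refit
best_params <- select_best(xgb_results, metric = "accuracy")
final_wf    <- finalize_workflow(tune_wf, best_params)
final_fit   <- fit(final_wf, data = iris_train)
```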

Prompt with improvements

Based on where we could improve in our first pass, I was curious to see what the answer would be if I made the prompt a little bit more precise. Here’s the prompt:

[Screenshot: the improved prompt]

The answer I got was, at a glance, satisfactory:

[Screenshot: the bot's answer to the improved prompt]

While the EDA (the exploratory data analysis piece) is underwhelming, the variety of models chosen is interesting. The bot also chose to do a 10-fold cross-validation (good), but the optimization of the models themselves is left to caret’s defaults, which is fine for this exercise but is something to be mindful of in your real-world applications.
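For context, the caret-based answer followed this general pattern. A minimal sketch, assuming 10-fold cross-validation via trainControl() and a few model families left at caret's default tuning grids; the specific model list and names are illustrative, not the bot's exact output:

```r
library(caret)

set.seed(123)

# 10-fold cross-validation, shared across all candidate models
ctrl <- trainControl(method = "cv", number = 10)

# A few model families, each with caret's default tuning grid
models <- list(
  rf  = train(Species ~ ., data = iris, method = "rf",        trControl = ctrl),
  svm = train(Species ~ ., data = iris, method = "svmRadial", trControl = ctrl),
  xgb = train(Species ~ ., data = iris, method = "xgbTree",   trControl = ctrl)
)

# Compare resampled accuracy across the candidate models
summary(resamples(models))
```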

Conclusions

With tools like ChatGPT at our fingertips, thinking about how you can use them to build machine learning models for the applications you work on can really increase your productivity. That being said, a couple of considerations:

  • Using ChatGPT is no replacement for using your brain. Look over the assumptions made by the bot, think about where there is room for improvement, take care to understand the steps, and make sure you adapt the code to the problem you are trying to solve.
  • Be mindful that, even with all the constant updates to existing LLMs and the emergence of newer ones, these models are only as good as the data they are trained on. Yes, it’s billions of data points and parameters, more than any human brain can hold, but your ingenuity and creativity are something the model can’t predict. Writing tidymodels functions that are, in essence, a copy-paste of examples you can find online is not being better than you; it’s just typing faster than you.

Overall, this was a fun exercise, and over the next few days I will be playing around with this for applications in drug discovery and see where we can go.

BONUS: When you have a hammer

… everything looks like a nail. I wanted to do the same exercise, but this time asking the bot to use a deep neural network and the keras package (keras3, to be exact). So, I tweaked the prompt a bit:

[Screenshot: the tweaked prompt]

The task remains the same. The bot gave some pretty usable code off the bat:

[Screenshot: keras code returned by the bot]

This looks ok, but there’s an error in defining the input shape when building the model, reinforcing the idea that you cannot take these answers at face value: always check what the bot outputs. There’s no magic here.

After fixing the error and adding a couple of things to the fit portion (like early stopping and changing the batch sizes), you get to an accuracy of 97% with the model. Nothing to sneeze at, but our previous examples gave us accuracy in the 94%+ ballpark. The point here is that sometimes you need to look at your data and understand the application before jumping into complex models that will be only marginally better than, or on par with, your “not-so-sexy” machine learning approaches like random forests or support vector machines.
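To give a sense of the end result, the fixed network looked roughly like the sketch below. A minimal sketch in keras3, assuming a small dense network with an input shape of 4 (the four iris measurements), one-hot encoded labels, early stopping, and a modest batch size; the layer sizes and training settings are illustrative, not the bot's exact code:

```r
library(keras3)

# Features and one-hot encoded labels
x <- as.matrix(iris[, 1:4])
y <- to_categorical(as.integer(iris$Species) - 1, num_classes = 3)

# Small dense network; the input shape is 4, one per measurement
model <- keras_model_sequential(input_shape = c(4)) |>
  layer_dense(units = 16, activation = "relu") |>
  layer_dense(units = 3, activation = "softmax")

model |> compile(
  optimizer = "adam",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)

# Fit with early stopping and a modest batch size
history <- model |> fit(
  x, y,
  epochs = 100,
  batch_size = 16,
  validation_split = 0.25,
  callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 10))
)
```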
