AIs and statistical coding?

Let’s have some fun and start discussing something that has been on my mind a lot lately. First, let me be frank and say I am not a big ChatGPT or Copilot user – I will admit that I use them to create images, but beyond that – I just haven’t used them.

Now, I have heard from fellow researchers over the past year that students are starting to use AI to create SAS and R code for their statistical analyses. I was shocked and thought No WAY! But then I met a student who was doing exactly this – I didn’t realize it until well into our meeting, when they admitted that YES, they had used ChatGPT to give them a start with their SAS coding. I was then informed that ChatGPT does a better job with R code, since R is open source and there are more examples out there.

So… much to my chagrin, this is a THING! Rather than deny it – I thought I’d try it out myself and see how it compares to what we teach. I’m going to write a few blog posts now on what I’ve discovered and highlight where students need to watch out – let’s see if we can use these tools effectively but NOT rely on them.

Setting the stage for the example – SAS

A common experimental design that we use in the agricultural field is the Randomized Complete Block Design, or RCBD. I will describe this design in a separate post in the coming week for folks who are not familiar with it. But this is a great opportunity to test out ChatGPT – so let’s ask the following question:

Create SAS code for an RCBD experiment

ChatGPT said:

Here is a basic example of SAS code for analyzing data from a Randomized Complete Block Design (RCBD) experiment using PROC GLM. This code assumes you have a response variable (e.g., yield), a treatment factor, and blocks (e.g., replicates or locations).


🔬 Example RCBD Scenario

  • Treatments: 4 different fertilizer types (Trt)
  • Blocks: 3 different fields (Block)
  • Response Variable: Yield (Yield)

📄 Sample Dataset Structure

Block   Trt   Yield
1       A     45.2
1       B     48.7
1       C     50.1
1       D     46.3
2       A     44.9
2       B     49.5

✅ SAS Code for RCBD

* Step 1: Input data;

data rcbd;
    input Block $ Trt $ Yield;
    datalines;
1 A 45.2
1 B 48.7
1 C 50.1
1 D 46.3
2 A 44.9
2 B 49.5
2 C 51.0
2 D 47.2
3 A 46.0
3 B 50.0
3 C 52.3
3 D 48.1
;

run;

* Step 2: Run the RCBD analysis using PROC GLM;
proc glm data=rcbd;
    class Block Trt;
    model Yield = Block Trt;
    means Trt / lsd;
    lsmeans Trt / pdiff adjust=tukey;
    title 'RCBD Analysis - Treatment Effects Adjusted for Block';
run;

* Optional: Diagnostics;
proc univariate data=rcbd normal;
    var Yield;
    histogram Yield / normal;
    title 'Normality Check of Residuals';
run;


🧠 Explanation

  • class Block Trt; – Declares the categorical variables.
  • model Yield = Block Trt; – The main model: Treatment effects adjusted for blocks.
  • means and lsmeans – Produce comparisons (LSD, Tukey) to assess treatment differences.
  • proc univariate – Checks normality of residuals (optional but good practice).

My Thoughts and Interpretations of ChatGPT results

WHAT?????

This is me being polite! Proc GLM???? and LSD???? Oish – this is taking us back decades in the world of statistical analyses. In my classes and workshops I teach the progression and history of SAS procedures and WHY we use Proc GLIMMIX today. I am VERY sad to see ChatGPT answer with Proc GLM and nothing about random block effects!

Let’s review why I disagree with these results:

  • There is NO mention of the random effect of BLOCK – the whole reason we use an RCBD is to incorporate the random effect of the BLOCK!
  • Proc GLM was released by SAS in 1976 and uses Type I and Type III Sums of Squares results – let’s be honest – some of us learned how to calculate these in our statistics courses, but we just don’t ask our students to do this anymore (I wish we did though, so students would have a better understanding of the F-statistic).
  • A RANDOM statement was added to Proc GLM later, BUT users need to calculate the RANDOM effects themselves using the model results provided – NOBODY does this and MOST people are not aware of it!!
  • The “Optional: Diagnostics” section is WRONG! It states that the analysis checks normality of the residuals – yet the SAS code checks normality of the response variable Yield – NO residuals are saved from the model!
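
To make these points concrete, here is a minimal sketch (not a definitive analysis) of the same RCBD fitted with Proc GLIMMIX, with Block declared as a random effect and the residuals actually saved before normality is checked. It assumes the rcbd dataset created by the data step above; the output dataset name rcbd_resid and the variable names Predicted and Residual are made up for illustration.

```sas
* Assumes the rcbd dataset from the data step above;
ods graphics on;

* Step 1: Fit the RCBD with Trt as a fixed effect and Block as a random effect;
proc glimmix data=rcbd plots=studentpanel;
    class Block Trt;
    model Yield = Trt;
    random Block;
    lsmeans Trt / pdiff adjust=tukey;
    * Save predicted values and residuals - dataset and variable names are illustrative;
    output out=rcbd_resid pred=Predicted resid=Residual;
    title 'RCBD Analysis - Block as a Random Effect';
run;

* Step 2: Check normality of the saved residuals, not of Yield;
proc univariate data=rcbd_resid normal;
    var Residual;
    histogram Residual / normal;
    title 'Normality Check of Residuals';
run;
```

With the residuals in their own dataset, Proc UNIVARIATE now tests what its title claims. With only 12 observations, the residual plots are likely to be more informative than any formal normality test.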

SIDENOTE: For a review of the history of SAS PROCs used for analysis of variance check out this post on the AgStats.ca blog.

Challenges for our students and researchers

For whatever reason, there’s a notion out there that AI will give us the “right” answer – heck there’s a lot of information out there and WHY shouldn’t a new stats user trust it?? We also know – from personal experience – that students find stats difficult – so if there’s a tool that they can use that will help them – they will use it and trust it! This is a HUGE challenge for students, instructors, and researchers. We all want to ensure that high quality research is created and one major piece of that puzzle is statistical analysis.

SO…. What do we do?

  1. Make statistics more accessible to students and researchers
  2. Be aware of how AI is being used to create SAS and R code so we can teach our students and researchers how to build on what they are provided
  3. As an instructor, keep playing with AI to be more aware of the results

Upcoming Blog post

A closer look at the SAS and R code produced by ChatGPT for an RCBD, with an example.


Michelle Edwards is a self-proclaimed data geek who teaches statistics and experimental design to anyone willing to listen. She has been told on many occasions that she can translate “statistics-ese” to English, making statistics FUN!