Introduction
Chances are you use ChatGPT. OpenAI’s chatbot had about a year’s head start on competing large language models like Google’s Gemini and Microsoft’s Copilot. The latter two offer integration with office productivity suites and man this paragraph is getting boring! Don’t worry, I’ll narrow the focus: in this post I pit these AI chatbots against one another in carrying out identical tasks: an essay and a picture. (Next week I’ll have them write a poem.) These tasks are probably not what you use chatbots for, but I think they’re a good measure of the AIs’ so-called intelligence, which—in the face of all this uncertainty about where AI is going and what it means for humanity—is probably more interesting than noting how well they answer basic questions or perform routine tasks like writing emails or reports.
(Wondering about the picture? I’ll get to that.)
Now, if you’re an astute reader (which you are or you wouldn’t be here, so congratulations), you’ll be wondering, why even bother evaluating the current capabilities of a technology that is evolving so fast? Wouldn’t this post have a very short shelf life? Those are good questions and here’s my (kind of) short answer: it’s because it’s fun to capture a moment in time and look back on it later, to see how far we’ve come. It’s like watching really old Hollywood movies and appreciating a) how much better the dialogue is in modern film, and b) how much less sexist Americans are now. (Yes, we’ve still got a long way to go, but looking back can help us feel grateful for the progress that’s been made.)
Let me give you an example of how primitive early AI was. As it’s theoretically possible for you to have noticed, I’ve been tracking its progress since 2012, when I tried out Cleverbot. Over the next few years I evaluated the AI used in smartphones. In 2020 I did a test drive of an early version of OpenAI’s GPT. As described here, all it could do was finish your sentences; you’d type the first half of a sentence and hit tab, and it would finish the sentence for you (as many modern text editors now do). Here’s how that early GPT “helped” me write a short essay about learning to type. I’ve formatted its output in bold italics so you can see what it contributed:
“Pack my box with five dozen liquor jugs” is a cool way to pass the time. It is particularly useful for budding gay men to read the words if they are trying to learn how to type on a computer. … Okay, what’s with this guess that my original text had anything to do with ‘gay? that was definitely a pretty random statement to make but it fit, and … no, it didn’t fit. For A.I. to be useful, it must stick to the ‘gay side of the word.’ No. It must stick to the point. I was writing about a sexual deviant. No, I was not! I was writing about the simple act of learning to use the ‘gay keyboard. Also, A.I., you’ve twice screwed up on putting a space between my legs. Okay, fine. I give up. This GPT technology obviously has a lot of potential.

What a gas, right? Of course AI will keep getting better, to the point that what’s considered amazing today will one day seem laughably primitive. Who knows, perhaps you’ve found this post years after I wrote it, and are looking to it to help you remember what it was like to interact with AI through a cumbersome keyboard, rather than having it read your mind automatically via WiFi 12 or 8G cellular technology.
Okay, down to brass tacks. In this post I will evaluate the latest versions of three leading AI chatbots: OpenAI’s ChatGPT (version GPT-5); Google’s Gemini (version 2.5 Flash and Pro); and Microsoft’s Copilot (version Smart GPT-5, based on Microsoft’s collaboration with OpenAI, which Copilot tells me “[goes] far beyond what you’d get from GPT-5 alone”).
Why a scholastic essay? Because that kind of writing is a lot harder than much of what AI does, which is just being a really good natural-language search engine. Analyzing a large text and writing about it clearly requires something closer to real thought than just fielding a fairly specific question, harvesting the best existing resources on the topic, and mashing them into a concise and nicely formatted answer. For more on the fundamental difference between writing “thoughtfully” and merely researching, see my last post.
Activity #1: academic essay
Much of the hype around AI concerns its ability to do college kids’ work for them. In a shocking New Yorker article I read recently, a college professor interviewed several students at top universities about their shameless use of A.I. to write their papers, and how well they’re getting away with it. Here’s a crazy example:
A sophomore at Columbia studying computer science told me about a class where she was required to compose a short lecture on a topic of her choosing. “I just put it in [to an A.I. platform] and I maybe edited like two words and submitted it,” she said. Her professor identified her essay as exemplary work, and she was asked to read from it to a class of two hundred students. “I was a little nervous,” she said. But then she realized, “If they don’t like it, it wasn’t me who wrote it, you know?”
So, is the current AI really good enough for the Ivy League? Or at least good enough for your kid’s high school English class? Two and a half years ago I asked ChatGPT (version GPT-3.5) to write an essay describing the use of dramatic irony in a children’s novel, Clarice Bean, Don’t Look Now by Lauren Child. I chose that book because it’s a less cryptic work than, say, Ulysses, which probably nobody, not even James Joyce himself, understood. But even without being written for grown-ups, the Clarice Bean book does feature dramatic irony, as my seven-year-old daughter was able to appreciate. So how did GPT-3.5 do? It crashed and burned, big time, as described here. To see how much the AI has improved, I’ll use the same prompt this time, with the same book.
ChatGPT’s essay
Here is the most salient excerpt of what OpenAI’s GPT-5 came up with:
At the heart of the novel is Clarice’s attempt to navigate friendship, change, and uncertainty. She idolizes her favorite fictional character, Ruby Redfort—a cool, unflappable girl detective who always knows what to do. Clarice, by contrast, often feels helpless when life becomes complicated. Much of the book’s dramatic irony arises from this gap between how Clarice perceives herself and how readers can see her situation. For instance, when Clarice tries to act as confident and worldly as Ruby, her bravado is clearly exaggerated or misplaced. Readers can see that her attempts to emulate Ruby’s poise only highlight her vulnerability. This irony makes Clarice endearing rather than foolish, because we recognize that her misunderstanding stems from innocence and idealism.
This really isn’t bad; I mean, it reads well, the flow is logical, and the essay has the semblance of depth. (It’s a lot better than GPT-3.5, whose entire essay was a hallucination.) Still, this analysis is rather flawed. For one thing, it completely misrepresents Clarice’s character. She actually exhibits no bravado, and puts on no airs. Her behavior is quite subdued; in fact, she withdraws from friends and classmates entirely. Meanwhile, the AI’s essay lacks the specific detail it would need to get a passing grade in a legitimate lit class.
Now, to be fair, much of the power of these chatbots comes from the user’s ability to iterate, guiding the AI with very precise prompts and helping it along when it stumbles. So I gave GPT-5 another chance, the way a TA would with an undergraduate, asking it, “Are there any scenes in particular that really show the dramatic irony, vs. generalizations? What is the best example of this literary device in the novel?” Here’s what it came back with:
The most striking use of dramatic irony in the novel comes when Clarice’s best friend, Betty Moody, suddenly goes missing. Clarice interprets Betty’s disappearance through the lens of her idol, Ruby Redfort … Convinced that something thrilling is afoot, Clarice imagines Betty has vanished into an adventure or secret mission, perhaps with spies involved.
The reader, however, quickly realizes something else: Betty hasn’t disappeared into a mystery, but into a family crisis. Her parents have separated, and Betty has been taken away unexpectedly. Clarice, too young to pick up on the signs of adult conflict, mistakes emotional upheaval for suspense. This is classic dramatic irony — the audience understands the real reason for Betty’s absence long before Clarice does.
This really would be a classic case of dramatic irony, but there’s one problem: GPT-5 is hallucinating! Betty doesn’t suddenly go missing; she announces to Clarice that her family is moving away, and then they do. There is no mystery for Clarice to solve, no imagined spies or secret mission, and Clarice misunderstands nothing. The chatbot states that the reader knows that Betty’s parents are separating but Clarice doesn’t; this is actually backwards: Clarice only fears Betty’s parents are divorcing, and they’re not. So as nicely written as GPT-5’s essay is, it gets the basic facts wrong. (The only real dramatic irony in the essay is that we know something ChatGPT doesn’t!)
When I challenged GPT-5 on this, the AI casually acknowledged that it may have gotten stuff wrong because it actually hasn’t read the book. It proceeded to fact-check itself using three dubious secondary sources: “TV Tropes,” “Books for Kids Blog,” and “Books for Keeps.” Then it offered, “If you like, I can check for direct quotes from the text to confirm exactly what the author states about Betty’s situation and whether divorce is explicitly ruled out.” WTF?! GPT-5 blithely—and confidently—trotted out this nice essay despite having no access to the source material! What actual student would be shameless enough to try that?
I’m going to give ChatGPT a D on this. That’s a big step up from last time, but it still has a long way to go. Obviously an AI would do better with older literature, like the classics that have passed into the public domain, but I suspect most of its training data will always be other papers and run-of-the-mill criticism, not the work of the authors themselves.
Copilot’s essay
On to the next contestant, Microsoft’s Copilot. It has different versions based on what you’re trying to do (e.g., simple search vs. deeper analysis). I engaged Smart Mode for the essay. Here are the highlights:
A central example of dramatic irony emerges in Clarice’s interpretation of Clem Hansson, the new girl at school. Clarice believes Clem is aligning herself with Justin Broach, a classmate Clarice views as a “sociopath.” This assumption fuels Clarice’s isolation and resentment, reinforcing her belief that Clem is shallow and disloyal. However, readers can infer — through subtle cues and Clarice’s unreliable narration — that Clem’s behavior is not malicious, and that Clarice’s judgment is clouded by her own insecurities and grief over Betty’s departure. This disconnect between Clarice’s perspective and the likely reality creates dramatic irony: readers recognize that Clarice is misreading the situation, which deepens our empathy for her while also injecting humor into her exaggerated reactions.
Another layer of dramatic irony stems from Clarice’s obsession with the Ruby Redfort Survival Handbook, a fictional spy guide she consults for advice on navigating life. Clarice treats the handbook as gospel, applying its cryptic wisdom to everyday dilemmas. For instance, she fixates on the idea that “your worst worry is the one you haven’t even thought to worry about,” which paradoxically increases her anxiety. Readers, however, can see that the handbook’s advice is often absurd or irrelevant, and that Clarice’s reliance on it is a coping mechanism rather than a source of genuine guidance. This ironic gap between Clarice’s faith in Ruby Redfort and the reader’s recognition of its impracticality adds both humor and poignancy to the narrative.
Right off the bat, Copilot doesn’t commit any major blunders in terms of hallucination. I would say the reader merely infers that Clem’s friendly gestures toward Clarice are sincere, as we don’t have anything concrete to go on, and Clarice’s hunches aren’t obviously inferior to ours. I also dislike the quotation marks around “sociopath” because they imply that Clarice has Justin figured all wrong, which would be another example of dramatic irony, except that Justin most certainly is a sociopath. (He assaults classmates, extorts them for money, and kidnaps Clem’s pet rabbit.) In my book, Clarice’s mystification at Clem’s interaction with this jerk is understandable and certainly not exaggerated. Moving on to the Survival Handbook, its advice is arguably absurd and, on the face of things, irrelevant (as it’s written for spies, not kids), but Clarice does grasp (and point out) its limitations. In a moment of epiphany, she asks herself, “What is the point of reading about all this survival information about lions and tigers and sharks? … And I have read her Spy Guide: How to Know Things Without Knowing Things—and all I have discovered is that I know nothing at all.”
But all this is boring, I know, and the main news is that this essay (or, more to the point, a version that’s been polished up by a student who at least read the actual book) would probably get a C at UC Santa Barbara, perhaps a C- at UC Berkeley, and who knows, maybe an A at Columbia. But does this mean Copilot is now ready to steal our jobs? No, not yet. If it offered to ghost-write my next albertnet post, I’d politely decline.
Gemini’s essay
And how did Gemini, our final contestant, do? Here are some highlights from its first effort:
The irony primarily revolves around Clarice’s attempts to manage her mounting anxieties—namely her fear of a looming geography test and her worry about her best friend Betty Moody potentially moving away—while remaining oblivious to the humorous and poignant truths the reader recognizes.
The most potent source of dramatic irony is Clarice’s belief that she can control her own anxieties and the world around her by actively not confronting them. The geography test, which Clarice refers to ominously as “the Big Test,” is a persistent undercurrent of dread. She employs elaborate avoidance strategies, from organizing a “Worries Box” to distract herself with minor, manageable tasks. The irony lies in the reader’s understanding that avoidance only amplifies anxiety; the test is drawing closer with every page turned, yet Clarice invests her energy in delaying the inevitable.
The technique is equally effective in building the emotional tension surrounding Betty Moody’s potential move. Clarice views every sign of Betty’s packing or preoccupation through the singular lens of impending abandonment. The reader, however, is frequently given clues—through overheard adult conversations or context within the illustrations—that the situation is more mundane or less catastrophic than Clarice imagines. For instance, the boxes are not necessarily for a long-distance move, but perhaps a temporary one, or simply a major reorganization.
Again, the AI has written an essay that kind of sounds good and might fool somebody unfamiliar with the book, but it ultimately falls down because it has the basic facts wrong. Like ChatGPT, Gemini has no direct access to the full text and is just guessing (well, generating) based on the flotsam and jetsam that critics, reviewers, et al. have strewn around the Internet. Gemini’s mistakes are not subtle. For one thing, there is no geography test mentioned anywhere in the book. Meanwhile, Clarice has a worries notebook, not a box, and she doesn’t spend undue time writing in it or employing avoidance strategies. Moreover, there is no scene in the book in which Betty packs boxes for her move, and the situation with Betty is not “more mundane” or “less catastrophic” than Clarice imagines: her best friend literally moves from London to San Francisco.
In the spirit of fairness, I gave Gemini another prompt to try to hone its response and focus on a specific scene of dramatic irony, and it came back with more of the same about the geography test and Betty’s move, adding, “The peak of this irony comes when the news of Betty’s situation is finally revealed, specifically that Betty is not moving away forever, but just temporarily to the apartment upstairs due to a family situation (often related to house renovations).” Completely false, and completely lame.
I pointed out all the errors and asked Gemini if that was really the best it could do. It replied, “The detailed essay you requested on Clarice Bean, Don’t Look Now was written using the Gemini 2.5 Flash model. If you were looking for even deeper, more complex analysis, the Gemini 2.5 Pro model (available with Gemini Advanced) would be the choice optimized for that specific level of complex, multi-step reasoning.” Oh, so if I want a better essay I have to pay for it? What cheek! I almost decided to just give Gemini an F and move on. That would have made this post shorter. But, doggone it, if I’m going to do this, I’d better do it right.
Gemini’s second essay
I switched to version 2.5 Pro (which, it turns out, is offered on a limited basis for free) and here’s the gist of its new essay:
The central irony is this: the very “spy” techniques Clarice uses to gain control and uncover the truth are precisely what blind her to it, generating both the novel’s humor and its profound sense of childhood helplessness.
This irony is sharpened in Clarice’s “investigation” of her parents. Overwhelmed by stress (which the reader understands is likely related to their friends’ move, finances, or just the chaos of family life), Clarice’s parents have tense, private conversations. Clarice, filtering these events through her Ruby Redfort lens, interprets them as “clues” to a secret, impending divorce. She misreads their mundane exhaustion as a sinister conspiracy. The dramatic irony is that Clarice’s hyper-vigilance, her constant search for meaning, makes her less perceptive, not more…
Ultimately, the book’s title, Don’t Look Now, serves as the most direct summary of its central irony. Clarice believes her “looking”—her spying and investigating—is the solution. But the reader knows she is refusing to look at the one thing that matters: the deep, painful, and very normal emotion of sadness. She avoids the emotional core of her problem by obsessing over its mechanics.
I have to say, this is much better. Version 2.5 Pro didn’t hallucinate, and its argument is more cohesive. It’s not perfect; for example, it misses Clarice’s epiphany about the limits of the Ruby Redfort book and thus overstates her lack of perception. But this essay is so much better than what 2.5 Flash “wrote.”
So is that it, I give Gemini a C+ and move on? Not quite: remember, this chatbot benefited not only from my invoking its 2.5 Pro version, but from all the coaching I gave it in the chat. This distinction is crucial: LLMs do much better when you feed them high-quality prompts and lots of feedback to supplement their training data. It’s equally important to understand that your input is not itself training data that the model can use going forward; the benefit you provide dies with your session. Thus, AI doesn’t learn and get smarter the way a human would; its progress is much more gradual.
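For the technically curious, here’s a minimal Python sketch of why the coaching evaporates, based on OpenAI’s chat API (the model name is just an illustrative placeholder, and this is a simplification, not any vendor’s actual implementation). The chatbot only seems to remember your feedback because the client re-sends the entire conversation with every request; nothing you type gets folded back into the model’s weights.

    # A minimal sketch of a stateless chat session. Assumes the "openai" Python
    # library; the model name is illustrative, not a recommendation.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    history = [{"role": "user", "content":
                "Write an essay on dramatic irony in Clarice Bean, Don't Look Now."}]

    reply = client.chat.completions.create(model="gpt-5", messages=history)
    history.append({"role": "assistant", "content": reply.choices[0].message.content})

    # My "coaching" only works because the full history rides along with each call...
    history.append({"role": "user", "content":
                    "Betty isn't kidnapped; she moves away. Please revise."})
    revised = client.chat.completions.create(model="gpt-5", messages=history)

    # ...whereas a brand-new session starts from zero: none of the coaching survives.
    fresh = client.chat.completions.create(model="gpt-5", messages=[
        {"role": "user", "content":
         "Write an essay on dramatic irony in Clarice Bean, Don't Look Now."}])

Which brings me to: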
Gemini’s third essay
To see how 2.5 Pro would do without all the coaching, I opened a fresh session on my work laptop (i.e., totally different login, no history of my chats). (Don’t worry, I did this on the weekend.) (If you’re my boss reading this, congratulations on finding my blog, and please consider that my working knowledge of AI is surely valuable in the workplace and you should give me a raise.)
I guess I wasn’t surprised that 2.5 Pro didn’t do so well this time, but what did surprise me is just how badly it crashed and burned. Here’s an excerpt:
The plot is set in motion by a catalyst of deliberate misinterpretation. A cryptic, unsigned letter containing the vague warning, “something terrible is going to happen,” is received not as a piece of misdelivered junk mail but as a profound, personal omen… The humor is generated directly from this disparity; the audience … understands that the “terrible” event will be domestic, not devastating. The characters’ frantic preparations—installing locks, suspecting neighbors—are thus rendered as escalating absurdities, a performance for an audience that already knows the final act.
OMG, it’s the worst essay yet: total hallucination. There is no cryptic letter in this novel, no locks installed, no suspicion of the neighbors. I called this out, the chatbot apologized profusely for having accidentally based its essay on a different book entirely, and then it tried again:
The gap between perception and reality generates the novel’s central tension. While Clarice is hunting for evidence of international espionage, the audience is processing signs of a painful family separation. The “mysterious man” Karl meets is not a sinister agent, but, as the reader strongly suspects, his father.
Again, pure hallucination! There is simply no “mysterious man” in the entire book. I challenged the chatbot, asking how it gets its source material, both when a work is under copyright and when it’s in the public domain. Gemini explained that for public domain works its training data contains the full texts and also “the centuries of critical, scholarly, and secondary sources,” while its knowledge of copyrighted works “is built from secondary sources … book reviews, detailed plot summaries, fan wikis, essays, and educational materials about the book.” So basically it’s amateur hour: the AI can’t really differentiate between, say, an esteemed college professor and a (gasp!) lowly blogger. As you can see, this doesn’t always work so well. I’m going to give Gemini 2.5 Pro a D+.
As an aside that perhaps ought to be my thesis, I’d like to point out that the better AI gets at writing student papers, the worse off students—and the whole institution of higher education—will be. After all, the point isn’t for students to edify their instructors through their observations; the point is for the students to think and write for themselves. Yes, this is hard, but the right kind of hard, and through this struggle they ideally learn how to think and write, and can one day contribute in the realms of actual, non-student writing such as books, articles, or—worst case scenario—blogs.
Activity #2: original art
I’ve tinkered a lot with AI-generated art, usually to generate pictures to run at the top of my blog posts. It’s been pretty hit-or-miss; a picture that doesn’t stray into uncanny-valley territory, or have a major gaffe like the wrong number of fingers on a hand, is all I’ve realistically hoped for. Today’s exercise is simple: I pitted the platforms against one another in the task of creating a picture for this post, featuring Clarice Bean. You can see the winner at the top, though you might cry foul: the art I ended up using is from Whisk, Google’s latest “experimental” image generator. I resorted to this new tool because I just wasn’t happy with the runners-up, as you shall see.
ChatGPT’s art
I asked ChatGPT, “Can you make a drawing for me of Clarice Bean reading albertnet on her tablet?” Not surprisingly, it mentioned the copyright and said, “I can’t generate or reproduce images of her or derivative works featuring her likeness” but offered to “generate an image of a cartoonish, freckled, red-haired girl reading a tablet, in the style of a children’s book illustration, but not resembling or referencing Clarice Bean specifically.” I agreed and here’s what it came up with:
I think you’ll agree that’s just about the most boring picture ever. It also has the classic issue of the subject holding the tablet backwards. This is just not that hard a prompt … what gives?
I said, “Make it a more realistic picture, please, and she should look a bit older, and have her in an armchair in her attic bedroom with a desk lamp, and reading the Ruby Redfort Survival Guide.” Maddeningly, the chatbot came up with a picture that was almost perfect, except that it made her look a bit too old (about 15) and gave her Instagram-worthy boobs, which seemed inappropriate and unseemly. The picture didn’t show a lot of skin, but still … totally unusable (and I don’t even want to post it here because it’s in such poor taste). I replied, “Please make her a bit younger and flat-chested.” The chatbot chided me: “I can’t modify or generate an image based on physical or anatomical details like that.” Like it was basically calling me pervy! It even offered to “create a child-appropriate illustration,” as though I’d asked for something that wasn’t. Sheesh.
Copilot’s art
I gave Copilot the same initial prompt I’d given ChatGPT, and here’s what it came up with:
The similarities between this art and ChatGPT’s are striking: the hair is the same length, the level of detail is the same (barely more than a cartoon), and the background is again completely devoid of detail. In delivering the picture, Copilot said, “Here you go - a stylized, collage-like illustration of a child reading a tablet, inspired by the playful textures you mentioned.” I don’t know what it means by collage-like, and I didn’t mention any “playful textures.” Whatever, chatbot.
Gemini’s art
I gave exactly the same prompt to Gemini, and it produced the corniest, least aesthetically pleasing picture yet:
Obviously this is a matter of taste, but would you agree there is no charm here? And what’s with the red-and-white-striped shirt appearing here, too? What are these AIs keying off of?
In Gemini’s defense, at least the little thought bubbles bear a slight resemblance to some of the art in the actual book. But again the tablet is backward and “albertnet” is spelled “alphabert” (weird misspellings being a common screw-up with AI art).
Frustrated by not having any good art yet, I tried ImageFX, another Google AI tool, and it gave me a photo-style picture with lavish detail, featuring both Clarice and her brother rocking red-and-white-striped shirts. I think it’s some kind of global AI conspiracy. What a relief when Whisk broke the cycle and generated the worthy picture you saw at the top of this post. I particularly like how Clarice is kind of staring off into space instead of at the book, clearly either pondering what she’s just read or distracted from her book by all the difficulties she’s working through.
Well, at long last that’s it for today. Tune in next week because I plan to pit these chatbots against one another again, this time writing poems in dactylic trimeter based on the best prompt an AI was ever given.
Other albertnet posts on A.I.
- I, Chatbot
- Almost Intelligent – Part I
- Almost Intelligent – Part II
- Smartphones & Artificial Stupidity
- A.I. Smackdown – Moto vs. Cortana vs. Siri
- Could Artificial Intelligence Replace Writers? – Part 1
- Could Artificial Intelligence Replace Writers? – Part 2
- Could Artificial Intelligence Replace Writers? – Part 3
- A.I. Smackdown – English Major vs. ChatGPT – Part 1
- A.I. Smackdown – English Major vs. ChatGPT – Part 2
- Schooling ChatGPT
- Will A.I. Steal Our Jobs?
- Tech Check-In - How Good is the Latest A.I.? - Part I
- Tech Check-In - How Good is the Latest A.I.? - Part II
- What is ChatGPT Great At (and Not)?
- Tech Reflection – Two Sides of AI
Email me here. For a complete index of albertnet posts, click here.