Thursday, August 30, 2012

Almost Intelligent - Part I

NOTE:  This post is rated R for mild strong language.

“Almost intelligent” might be a good name for somebody’s biography (or autobiography) but here I’m talking about artificial intelligence.  My last post described my experience chatting with an application called Cleverbot that tried to simulate human dialog convincingly.  Here, I’ll tackle the subject of AI language more generally, looking at speech recognition, natural language, and translation. 

Do we care?

If you really don’t care about AI at all, go read something else—or, better yet, read on to see why maybe you should care.

On the one hand, AI is very exciting.  As computers have become “smarter,” and easier to use, they’ve gotten so useful it’s hard to imagine how we ever did without them.  I’m thinking about Google, GPS and other mapping applications, package tracking, e-mail spam filters … the list goes on and on.

On the other hand, AI is a bit scary, and as a human I prefer to believe I could never be replaced by a computer.  I shudder at the thought that human behavior could be so unvarying and predictable that one day we’ll barely be better than a really good computer program.  I want my computer applications to get smart, but not too smart.

Voice recognition and natural language

There’s a button on the side of my smartphone that, when pressed, startles me by causing the speakerphone to say, “Say a command!”  I’m vaguely aware that my phone will respond to voice commands but have no interest in issuing them.  Most of the cool features of smartphones involve the silent, non-speech stuff you can do—e-mail, Internet browsing, etc.—as you’ll notice on the subway when half the people are silently tapping away.  (The popularity of texting—a way to privately communicate without being eavesdropped on by the person you’re ostensibly talking to face-to-face—is a classic example of how phones are becoming increasingly mute.)

That said, the iPhone’s voice-recognition application, Siri, seems to be making a bit of a splash.  (Nobody I know uses Siri yet, but I’m sure some will.)  This demo shows how Siri is pretty good at understanding speech and figuring out what you want it to do.  (I played with a Droid phone recently and it was also very good at typing for me as I spoke.)  The reviewer asks Siri, “Where can I have lunch?”  Siri replies, “I found fourteen restaurants whose reviews mention lunch.  Twelve of them are close to you.”  This seems easier than typing into Google on a little phone.  But the natural language feature isn’t perfect; the reviewer says, “How about downtown?” and Siri replies, “I don’t know what you mean by ‘how about downtown.’”

Perhaps Siri’s communication isn’t “connection-oriented”—that is, it doesn’t consider “how about downtown?” in the context of “Where can I have lunch?” but takes the two queries as totally discrete and unrelated.  If so, this is a major shortcoming. 

The reviewer tries again:  “I want to have lunch downtown.”  Siri replies, “I found 3 restaurants matching ‘downtown.’”  Useless!  Siri knows where the user is, geographically, but does not realize that “downtown” in this context pertains to location, not a restaurant’s name.  Here, Siri starts to look like a mere forwarder of requests, always passing the buck to Google instead of applying intelligence to the request.

Simple conversion of speech to text looks pretty good on Siri.  The reviewer dictated a message to it, and almost everything came out.  The notable exception was how Siri transcribed the reviewer’s spoken comment “I need to make some videos about the iPhone 4S.”  Siri typed, “I need to make some videos about the iPhone 4 ass.”  The reviewer doesn’t notice this gaffe, telling the YouTube viewer, “There it is.  It figured out exactly what I wanted to say.”  Dangerous, don’t you think?  What if the reviewer meant to e-mail the text “S as in Sam” but actually e-mailed “ass as in Sam,” to his boss, Sam?

Not that Siri doesn’t try hard.  When the reviewer says, “Set a timer for 3 minutes,” Siri replies, “OK, I started a three-minute timer.  Don’t overcook that egg.”  Not bad.  Actually, it is bad.  For one thing, “that egg,” when spoken by Siri, comes out “ditek.”  Without the text on the screen you’d never understand what it said.  Meanwhile, it’s obvious that Siri is trying to be funny, and completely failing.  There’s nothing witty about Siri making a lame guess as to what the timer is for.  What’s worse, Siri could create the impression that three minutes is actually how long you should cook an egg.  In fact that’s not nearly enough time, and everybody knows an undercooked egg presents a salmonella risk.

A fundamental problem

Of course I’m nitpicking with the egg timer example, and (to a lesser extent) with the “ass” example, but they bring up an important point:  language, as one of the primary interfaces between humans, requires far more than just understanding what is heard and forming sentences in response.  Having a sanity-check reflex that keeps you from using words like “ass” in mixed company, and knowing whether your joke is actually funny, are complicated processes.  Verbal communication can be a minefield, especially for a computer application that stabs around in the dark.

Consider, for example, the old joke about the Texan who gets into Harvard.  While touring the campus, he asks a student, “Excuuuse me, can you tell me where the library’s at?”  The student replies haughtily, “Here at Haaarvard, we never end a sentence with a preposition.”  The Texan replies, “Okay, can ya tell me where the library’s at, asshole?”

Upon inspection, this exchange, though brief, is quite complex.  The Harvard student’s response to the Texan’s query shows a decision that might not occur to an AI application—that is, to a) not answer the question, and b) use the opportunity to deliver a scornful message about class and intellect.  The Texan’s comeback makes a statement about a) his refusal to be cowed, and b) the difference between cultivation and innate intelligence.  Meanwhile, the joke as a whole counts on the listener enjoying an opportunity to feel superior to both Harvard students and Texans, while exulting in the surprise and wit of the punch line.  Worlds away from “Enjoy ditek.”

Maybe you think I’m overreaching here, that such nuance will never be expected of AI.  Maybe AI is just a tool to make machines more useful to humans, and little gaffes don’t matter much.  When a woman asks her husband, “Do these pants make my butt look fat?” he is instantly plunged into a terribly complicated interaction, because of his relationship to the woman.  So much hinges on his response.  If he says “yes” he’s obviously dead.  If he says “no” too vociferously, he seems patronizing.  He could try the reverse-psychology approach and say, “No, your butt makes your butt look fat,” but she’d better have a sense of humor and thick skin.  Or, he could ignore the question, or say, “Look, krill!”  Or he could say “yeahhh” lecherously (note that imparting this single syllable with the sense of “I want some of that!” is far beyond the current state of the art in AI voice synthesis).  But when a human asks Siri “Am I fat?” and gets back, “Here’s your a.m. alarm” and “I found 8 fitness centers fairly close to you,” he or she can more easily blow it off.

This idea—that computers don’t have to play nice when “talking” to humans—is strongly supported by a scene in “The Terminator” when the evil cyborg, confronted by his landlord—“Hey buddy, you got a dead cat in there, or what?”—scans through a menu of possible responses—“YES/NO; OR WHAT; GO AWAY; PLEASE COME BACK LATER; FUCK YOU, ASSHOLE; FUCK YOU”—and chooses the penultimate one.  Of course when you’re the size of Arnold Schwarzenegger you don’t have to have a friendly user interface.

That said, I would argue that, to the extent humans are to embrace AI when using electronic devices, precision and nuance do matter.  We have to trust these devices not to turn “S” into “ass,” not to waste our time with lists of restaurants we’d never eat at, and not to infuriate us with messages like “cannot undo.”  Even if you’ve never found yourself yelling profanities at your computer, I’m sure you’ve seen others do it.

Consider this cautionary tale.  My dad bought one of the first consumer-oriented computers in history, the Hewlett-Packard Model 85.  This was 1980, a year before the IBM PC.  The HP-85 was about as far from Siri (or at least the design intent of Siri) as you can get.  There was no software for it; you had to program it yourself.  Meanwhile, its version of BASIC was proprietary, diverging from the industry standard (e.g., you used the command “DISP” instead of “PRINT”).  I had my brother try out one of my first programs.  It prompted him to type his name.  With great hesitation—he was greatly fearful of doing something wrong and damaging our dad’s expensive machine—he typed “Max.”  Then he sat there waiting for something to happen.  Nothing did, because my program didn’t say anything about hitting the Enter key when done.  Max looked a bit nervous.  “It’s not working!  It’s not doing anything!” he cried.  I told him to hit Enter.  When he did, the computer promptly displayed the message “Max is a jerk” (the whole point of my program).  Max got really angry and flustered and to this day does not use a computer.  This probably isn’t just because of my program; the HP-85 was less than user-friendly and doubtless gave Max the wrong impression of where home computing was going.


Translation

Here is where the AI picture is, to me, much rosier.  Early attempts at translation, like Alta Vista’s Babelfish, were a joke.  You pasted the foreign-language text into a window, gave it the language to translate it into, and then were presented with a salad of translated words (with un-translated ones sprinkled like croutons) that made no sense at all.  The only real use for this tool was translating things into Tristan.

What’s Tristan?  Well, I used to have a colleague, a computer programmer, whose native-tongue language skills were so poor it was impossible to understand a thing he wrote.  His e-mails always gave my colleagues and me a laugh, and in his honor we invented a language and named it after him.  (It wasn’t really called Tristan, because his last name wasn’t really Tristan; I’ve changed it to protect him from possible embarrassment.)  To translate something into Tristan, you’d type normal text, translate it into French using Babelfish, and then translate it back to English.  The results were pure comedy, with not a shred of sense left intact.

I think people are naturally forgiving of poor translation, because we’ve studied grammar and foreign languages in school and can really appreciate how difficult a task this is.  Plus, the results are so often funny, they put us in a good mood.  Consider the urban legend that “Coca-Cola,” when first translated into Chinese, came out meaning “bite the wax tadpole.”  (To this day I’ll complain about something by saying it bites the wax tadpole.)  Brian Hayes, writing in “American Scientist,” makes an interesting comment about AI efforts to parse grammatical constructions when translating text:  “The failure of this approach is sometimes dramatized with the tale of the English → Russian → English translation that began with ‘The spirit is willing but the flesh is weak’ and ended with ‘The vodka is strong but the meat is rotten.’”

More recently, online translation engines such as Google Translate have gotten much, much better.  As Hayes describes, “The idea is to ignore the entire hierarchy of syntactic and semantic structures—the nouns and verbs, the subjects and predicates, even the definitions of words—and simply tabulate correlations between words in a large collection of bilingual texts.”  At first, this strikes me as a “brute force” approach that is further from artificial intelligence than earlier efforts, however hapless, to actually parse a sentence grammatically.  But as Hayes points out, the modern technique is actually a lot closer to how humans learn to talk.  (It’s also more similar to how we would learn a foreign language if we had the good fortune to go live in another country, versus making our way with a textbook and classes.)
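
For the programmers in the audience, here is roughly what "tabulating correlations" might look like in miniature.  This is only a toy sketch in Python, not how Google Translate actually works.  The four sentence pairs are made up for illustration, and the function names (tabulate, best_match) and the scoring formula (the Dice coefficient, which rewards words that show up together and rarely apart) are my own assumptions.

```python
from collections import Counter
from itertools import product

# Toy "bilingual corpus": invented sentence pairs, purely for illustration.
PAIRS = [
    ("the package is here", "le colis est ici"),
    ("the package is late", "le colis est en retard"),
    ("the office is here", "le bureau est ici"),
    ("the office is closed", "le bureau est fermé"),
]

def tabulate(pairs):
    """Count how often each English word shares a sentence pair with each French word."""
    pair_counts, en_counts, fr_counts = Counter(), Counter(), Counter()
    for en, fr in pairs:
        en_words, fr_words = set(en.split()), set(fr.split())
        en_counts.update(en_words)
        fr_counts.update(fr_words)
        pair_counts.update(product(en_words, fr_words))
    return pair_counts, en_counts, fr_counts

def best_match(word, pair_counts, en_counts, fr_counts):
    """Pick the French word most strongly correlated with the given English word."""
    def dice(fr_word):
        together = pair_counts[(word, fr_word)]
        return 2 * together / (en_counts[word] + fr_counts[fr_word])
    return max(fr_counts, key=dice)

counts = tabulate(PAIRS)
print(best_match("package", *counts))  # colis
print(best_match("office", *counts))   # bureau
```

Note that no grammar and no dictionary are involved: "package" maps to "colis" only because the two keep turning up in the same sentence pairs.  With a corpus this tiny the method cannot tell "the" from "is" (both appear in every pair, alongside both "le" and "est"), which hints at why the real thing needs a very large collection of text.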

I first tried Google Translate when I was trying to track a package that was being shipped to me from a web merchant in France.  I have studied French for years, but understanding statements about logistics and customs offices would be difficult in any language.  I was presented with this:  “Votre colis est sorti du bureau d'échange.  Il est en cours d'acheminement dans le pays de destination.”  This would have totally tripped up the original Babelfish, but Google served up an entirely comprehensible translation:  “Your package is out of the office of exchange. It is in transit in the country of destination.”  (Not only was I delighted with how clear this was, I was relieved my package wasn’t stuck in customs.)  Translating this English back into French, and then back into English, I get “Your package is out of the office of exchange. It is in transit to the destination country.”  Very little of the “Tristan effect.”  (There’s some fuzziness around “to” vs. “in” with regard to the destination country, but I can live with that.)

To reassure myself that the Man of Letters wouldn’t be replaced by a machine anytime soon, I tried some poetry:

But the Raven still beguiling all my sad soul into smiling,
Straight I wheeled a cushioned seat in front of bird and bust and door;
Then, upon the velvet sinking, I betook myself to linking
Fancy unto fancy, thinking what this ominous bird of yore—
What this grim, ungainly, ghastly, gaunt, and ominous bird of yore
Meant in croaking “Nevermore.”

When I fed this into the new version of Babelfish (which works similarly to Google’s), and translated it into French and back, the response was this:

But the Raven seductive yet all my sad soul into smiling,
Straight I wheeled a seat padded before the bird and bust and door;
Then, on the Velvet sinking, I hauled myself to tie Fancy: fancy,
Think what this bird threatening of antan - the sad bird, awkward, frightening,
Ghent and disturbing past Meant in croaking “Nevermore.”

Aha!  Gibberish!  I was about to feel all smug about the superiority of humans over AI, but then tried Google Translate with the same English → French → English task:

But the raven still beguiling all my sad soul into smiling,
I wheeled a cushioned seat in front of bird and bust and door;
Then, upon the velvet sinking me, I betook myself to linking
Fancy unto fancy, thinking what this ominous bird of yore -
What this grim, ungainly, ghastly bird, gaunt, and ominous of yesteryear
Meant in croaking “Nevermore.”

Wow.  That’s so good it’s creepy.  But before you despair and decide the computers will ultimately render the human race unnecessary, be sure to check out my next albertnet post, wherein I examine how well AI does playing games—another classic measure of its progress.