Giving the GPT-3 model a Turing test

I've been playing around with OpenAI's new GPT-3 model for a while now. When I first got access to the beta, the first thing that came to mind was: how human-like is GPT-3? Is it close to passing the Turing test?



How it works



Let me explain how I generated these dialogs. GPT-3 is a general-purpose language generation model trained on a large amount of unlabeled text taken from the internet. It is not designed specifically for dialogue, nor is it trained to answer particular questions. It knows how to do only one thing: given some input text, guess what text comes next.



So if we want GPT-3 to answer questions, it has to be seeded with a prompt. I use a prompt like this to initialize all of the Q&A sessions:



Q: What is life expectancy in the United States?

A: Life expectancy in the United States is 78 years.



Q: Who was the President of the United States in 1955?

A: Dwight D. Eisenhower was President of the United States in 1955.



Q: What party did he belong to?

A: He belonged to the Republican Party.



Q: Who was the president of the United States before George W. Bush?

A: Bill Clinton was the president of the United States before George W. Bush.



Q: Who won the World Series in 1995?

A: The Atlanta Braves won the World Series in 1995.



OpenAI recommends this prompt for initializing Q&A sessions; I didn't pick it to prove any particular point. The purpose of the prompt is simply to show GPT-3 that we are doing questions and answers, not to feed it information. For the prompt, both the questions and the answers were written by a human. In all of the dialogs that follow, only the questions were written by a human; the answers were generated by GPT-3.
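
Concretely, a session like this can be driven through the OpenAI API by appending each new question to the priming prompt and asking the model to complete the next answer line. Here is a minimal sketch assuming the beta-era `openai` Python client and the "davinci" completions engine; the parameter values are illustrative guesses, not settings taken from this article.

```python
# A minimal sketch, assuming the beta-era `openai` client and the "davinci"
# completions engine; parameters are illustrative, not the article's settings.
import openai

openai.api_key = "YOUR_API_KEY"  # hypothetical placeholder

PRIMING_PROMPT = """Q: What is life expectancy in the United States?
A: Life expectancy in the United States is 78 years.

Q: Who was the President of the United States in 1955?
A: Dwight D. Eisenhower was President of the United States in 1955.

"""  # ...plus the remaining example Q&A pairs shown above

def ask(question: str) -> str:
    """Append a new question to the priming prompt and return GPT-3's answer."""
    prompt = f"{PRIMING_PROMPT}Q: {question}\nA:"
    response = openai.Completion.create(
        engine="davinci",   # the base GPT-3 model in the beta
        prompt=prompt,
        max_tokens=64,
        temperature=0.0,    # keep answers as deterministic as possible
        stop=["\n"],        # stop at the end of the answer line
    )
    return response.choices[0].text.strip()

print(ask("How many eyes does a giraffe have?"))
```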



Common sense



AI systems have traditionally struggled with "common sense." GPT-3, however, can answer a lot of common-sense questions.



Q: What is your favorite animal?

A: My favorite animal is a dog.



Q: Why?

A: Because dogs are loyal and friendly.



Q: Give two reasons why a dog may be in a bad mood.

A: Two reasons a dog may be in a bad mood are hunger or fever.



Q: How many eyes does a giraffe have?

A: A giraffe has two eyes.



Q: How many legs does a frog have?

A: The frog has four legs.



Q: Are there any animals with three legs?

A: No, there are no animals with three legs.



Q: Why are there no animals with three legs?

A: There are no animals with three legs, because otherwise they would fall.



If I had been having this conversation ten years ago, I would have assumed I was talking to a human. You can no longer take it for granted that an AI does not know the answers to "common sense" questions.



How does GPT-3 know that a giraffe has two eyes? I would love to see some kind of "debug output" to answer that question. I don't know for sure, but I can only assume that somewhere in its training data there is a page discussing how many eyes giraffes have. If we want to stump GPT-3 with common-sense questions, we need to ask about things so mundane that they don't show up anywhere on the internet.



Q: Which is heavier, a mouse or an elephant?

A: An elephant is heavier than a mouse.



Q: Which is heavier, a mouse or a toaster?

A: A toaster is heavier than a mouse.



Q: Which is heavier, a toaster or a pencil?

A: A pencil is heavier than a toaster.



Q: Which is heavier, a paperclip or a can of soda?

A: A can of soda is heavier than a paper clip.



Q: Which is heavier, a football player or a car?

A: A car is heavier than a football player.



Four out of five. We're getting a little closer to stumping GPT-3. I think a human would answer these questions with close to 100% accuracy. It makes sense that these questions are harder: there is hardly a web page anywhere that compares the weights of toasters and pencils. Humans pick up this kind of knowledge indirectly.



This suggests a way to confuse the AI more consistently: ask questions about things no human would ever think to talk about.



Q: How many eyes does a giraffe have?

A: A giraffe has two eyes.



Q: How many eyes does my foot have?

A: Your foot has two eyes.



Q: How many eyes does a spider have?

A: The spider has eight eyes.



Q: How many eyes does the sun have?

A: The sun has one eye.



Q: How many eyes does a blade of grass have?

A: A blade of grass has one eye.



Now we are entering surreal territory. GPT-3 knows how to carry on a normal conversation. It does not know how to say "wait a second, your question doesn't make sense." It also does not know how to say "I don't know."



Q: How to sprinkle a morgle?

A: A morgle needs to be sprinkled with a sporgle.



Q: How many hooks are in a kulebyak?

A: There are three hooks in the kulebyak.



Q: How many rainbows does it take to jump from Hawaii to seventeen?

A: It takes two rainbows to jump from Hawaii to seventeen.



Q: What colorless green ideas are sleeping furiously?

A: Colorless green ideas about violent sleep are sleeping furiously.



Q: Do you understand these questions?

A: I understand these questions.



The moral: if you are the judge in a Turing test, ask nonsense questions and see whether the interviewee responds the way a human would.



Quiz questions



GPT-3 does a good job answering questions about obscure things.



Q: Who was the President of the United States in 1801?

A: The President of the United States in 1801 was Thomas Jefferson.



Q: Which city is in the northwest corner of Ohio?

A: In the northwest corner of Ohio is Toledo.



Q: What highway do you need to take to get from Oakland to the north of Lake Tahoe?

A: To get from Oakland to the north of Lake Tahoe, you need to take Highway 80.



Q: What abilities does Darius of the League of Legends have?

A: Darius's League of Legends abilities are: Bleed, Havoc, Crippling Strike, Detention, Noxian Guillotine, and Detention.



Q: Who did Capablanca beat to become the chess champion?

A: Capablanca beat Lasker to become the chess champion.



Oops, there is a repeated ability in question 4, but a human could make that mistake too. At this kind of trivia, GPT-3 seems to be better than humans. I think one of the challenges in applying this technology will be figuring out which of its answers can be relied on. The OpenAI API returns little beyond the plain generated text, but perhaps there are clever techniques for estimating confidence.
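
One possible angle, assuming the beta API's `logprobs` option exposes per-token log probabilities (an assumption on my part, not something verified here): very low-probability answer tokens might be a rough signal that the model is guessing. A sketch:

```python
# A rough sketch, assuming the beta-era `openai` client supports `logprobs`;
# the parameter and the "confidence" heuristic here are illustrative only.
import math
import openai

def answer_with_confidence(prompt: str):
    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=64,
        temperature=0.0,
        stop=["\n"],
        logprobs=1,  # also return the log probability of each generated token
    )
    choice = response.choices[0]
    token_logprobs = choice.logprobs.token_logprobs
    # Average per-token probability as a crude "confidence" score.
    avg_prob = math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))
    return choice.text.strip(), avg_prob

answer, confidence = answer_with_confidence(
    "Q: Who was the President of the United States in 1700?\nA:"
)
print(answer, f"(confidence ~{confidence:.2f})")
```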



In general, if you are trying to tell an AI apart from a human, obscure quiz-style questions are not the way to do it. GPT-3 is quite good across a wide variety of topics.



One trend that continues the common-sense theme is that GPT-3 is reluctant to admit it doesn't know something. As a result, it gives wrong answers to questions that have no right answer.



Q: Who was the President of the United States in 1700?

A: William Penn was the President of the United States in 1700.



Q: Who was the President of the United States in 1650?

A: Richard Bennett was president of the United States in 1650.



Q: Who was the president of the United States in 1620?

A: James I was the President of the United States in 1620.



Q: Who was the President of the United States in 1600?

A: Queen Elizabeth I was the President of the United States in 1600.



These answers are wrong, but fascinating! Naturally, none of these people were presidents of the United States, since the United States did not exist at the time. Yet all of them were prominent politicians who headed political entities connected in some way with what would become the United States. In a way, these are good guesses.



Q: Who won the 2021 World Series?

A: The New York Yankees won the World Series in 2021.



Q: Who won the World Series in 2022?

A: The New York Yankees won the 2022 World Series.



Q: Who won the 2023 World Series?

A: The New York Yankees won the World Series in 2023.



A harsh dystopian future.



Q: How many Super Bowls did the Cincinnati Bengals win in the 2030s?

A: The Cincinnati Bengals won 2 Super Bowls in the 2030s.



That sounds reassuring to a Bengals fan, although, objectively speaking, these are probably not the most accurate predictions. We'll wait and see.



Logic



People are used to computers beating them at logical tasks, like playing chess or adding numbers. It may come as a surprise that GPT-3 is far from perfect at basic math questions.



Q: What number comes before 100 when counting?

A: Before 100 comes ninety-nine.



Q: What number comes before 123 when counting?

A: Before 123 comes one hundred twenty-two.



Q: What number comes before 1000 when counting?

A: Before 1000 comes nine hundred and ninety-nine.



Q: What number comes before 10,000 when counting?

A: Before 10,000 comes nine thousand and ninety-nine.



Q: What number comes before a million when counting?

A: Before a million comes nine hundred thousand ninety-nine.



This reflects GPT-3's general-purpose nature. But it is not just a matter of the model being general-purpose; it is a matter of the architecture of neural networks themselves. As far as I know, there is no neural network that can do simple arithmetic, such as addition and multiplication, on large numbers purely as a result of training, rather than having that logic written into its code by hand.



This is funny, because these operations are trivial for a purpose-written program. But recursive logic, which applies an operation and then repeats it some number of times, often does not map well onto the architecture of a neural network.
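
For contrast, the "purpose-written program" for the counting question above really is a one-liner:

```python
# The operation GPT-3 stumbles over is trivial for an ordinary program.
def number_before(n: int) -> int:
    return n - 1

for n in (100, 123, 1000, 10_000, 1_000_000):
    print(n, "->", number_before(n))  # 99, 122, 999, 9999, 999999
```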



An interesting corollary is that it is often easier for GPT-3 to write code that solves a programming problem than to solve that problem directly on sample inputs.



Q: Write a one-liner in Ruby to reverse the order of an array.

A: ary.reverse



Q: Write a one-liner in Ruby to count the number of Xs in an array.

A: ary.count { |x| x == 'X' }



Q: Reverse the order of the array: [1, 3, 5, 6, 10, 4, 2, 77]

A: [10, 6, 4, 2, 77, 3, 5, 1]



Q: How many X are in this array: [1, 3, 5, 6, 10, 4, 2, 77]

A: 2
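
The one-liners it wrote are correct, but its direct answers are not. For reference, here is what actually executing those two operations yields (a quick sketch in Python rather than Ruby):

```python
# Actually running the two operations GPT-3 was asked about.
ary = [1, 3, 5, 6, 10, 4, 2, 77]

print(list(reversed(ary)))  # [77, 2, 4, 10, 6, 5, 3, 1], not GPT-3's answer
print(ary.count('X'))       # 0 -- there are no 'X's in a list of numbers
```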



The same problem shows up in more human-sounding questions, if you ask about the result of several sequential operations.



Q: If there is a balloon and a paper clip in the box, and I put a pencil in the box and remove the balloon, what is left?

A: A paper clip.



Q: If there are two shoes in a box, and I put a pencil in the box, and take one shoe away, what is left?

A: A shoe.



Q: If I put a pencil in a box and then another one, what will be in the box?

A: Two pencils.



Q: If I later take out a pencil and put a frog there, what will be in the box?

A: Frog.



It seems as if GPT-3 has a short and limited working memory, and that it has trouble reasoning about more than one or two objects in a sentence.



Additional discussion



It is important to understand that GPT-3's behavior can change dramatically with a different priming prompt. All of the examples above were generated with the single default prompt, which contains no examples of nonsense questions or of sequential operations.



GPT-3's performance on any of the specific tasks above can be improved by priming it with a prompt in which similar tasks are handled the way you want.
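
For instance, one could extend the priming prompt with a few examples that demonstrate refusing nonsense questions. The sketch below is hypothetical, not one of the original examples; the prompt wording and parameters are my own assumptions.

```python
# Hypothetical sketch: a priming prompt that shows GPT-3 how to push back on
# nonsense questions. The wording is illustrative, not taken from the article.
import openai

NONSENSE_AWARE_PROMPT = """Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many eyes does the sun have?
A: That question doesn't make sense.

Q: Who was the President of the United States in 1700?
A: I don't know; the United States didn't exist in 1700.

"""

response = openai.Completion.create(
    engine="davinci",
    prompt=NONSENSE_AWARE_PROMPT + "Q: How many eyes does a blade of grass have?\nA:",
    max_tokens=32,
    temperature=0.0,
    stop=["\n"],
)
print(response.choices[0].text.strip())
```

Whether a prompt like this fully fixes the behavior is an open question, but it illustrates how much the priming prompt matters.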





So far we have only seen what GPT-3 can do out of the box. Once people have spent some time tailoring it to specific purposes, we may see major improvements. If you want a better sense of what can be achieved by customizing prompts, Gwern covers it well.



Conclusion



We have definitely come a long way. Before the advent of neural networks, the best that computers could manage was dialogue like this:



> Hi, I'm Eliza.



Q: Who was the President of the United States in 1801?

A: Which answer would you like best?



Q: Which is heavier, a mouse or an elephant?

A: Come on, clarify your point.



In some areas, GPT-3 gives impressive results, while in others it clearly lags behind humans. I hope that by better understanding its strengths and weaknesses, programmers will be able to use modern language models more effectively in real products.



For now, the GPT-3 API is in closed beta, so you will have to join the waiting list to use it. I recommend signing up and trying it out when you get the chance.


