Using ChatGPT to program the social robot Pepper

April 04, 2023

The Pepper robot is designed to interact with humans in a social environment. However, programming the robot to perform tasks such as recognizing emotions, following commands, and reacting appropriately can be complex and time-consuming. The emergence of ChatGPT suggests a new way to make programming robots like Pepper easier. In this post, I attempt to use ChatGPT to program the Pepper robot.

Programming social robots

Programming social robots is a promising but challenging field that requires knowledge of computer science, artificial intelligence, and psychology. Python and C++ are the most common languages, with libraries and frameworks that facilitate the development of social robots. Pepper, specifically, is programmed through the NAOqi APIs. Even so, programming social robots remains a difficult and time-consuming process.

Using the NAOqi framework to program NAO and Pepper. Image: Aldebaran.

Potential solution using ChatGPT

Recently, Microsoft researchers have explored using ChatGPT to control robotic arms and drones. ChatGPT simplifies the process of commanding robots: high-level feedback given in natural language is turned into code, mostly Python, that controls the robot accordingly. Microsoft's work on ChatGPT for robotics focuses on designing prompts, using APIs, and creating new models. Overall, ChatGPT holds the potential to make robot programming accessible to a wider range of people.

Robotics with ChatGPT: design principles. Image: Microsoft.

Attempting to use ChatGPT to program Pepper

In this attempt, I use ChatGPT to generate Python code for programming Pepper. The goal is to simplify the programming process by generating code from natural language.

Hello World!

I used a basic natural language prompt without including any NAOqi API functions. As of now, ChatGPT doesn't have internet access and is limited to knowledge acquired up to 2021. However, I assumed this would be sufficient, since the NAOqi API functions it learned before 2021 still work today. ChatGPT offered two frameworks for programming Pepper: naoqi and qi. Both worked on the physical robot and in simulation.

Prompt 1

Write a Python code to let Pepper say “Hello, world and chatGPT!” using NAOqi API

Method 1: Using naoqi framework

from naoqi import ALProxy

# Set up connection with Pepper
ip = "192.168.1.100" # replace with Pepper's IP address
port = 9559
tts = ALProxy("ALTextToSpeech", ip, port)

# Make Pepper say "Hello, world and chatGPT!"
text = "Hello, world and chatGPT!"
tts.say(text)

Method 2: Using qi framework

import qi

# Connect to Pepper
ip = "192.168.1.100" # Replace with Pepper's IP address
port = 9559
session = qi.Session()
session.connect("tcp://" + ip + ":" + str(port))

# Get the ALTextToSpeech service
tts = session.service("ALTextToSpeech")

# Say "Hello, world and chatGPT!"
tts.say("Hello, world and chatGPT!")

# Disconnect from Pepper
session.disconnect()

Both methods were successful in simulation. When regenerating responses, ChatGPT produced slightly different code each time, but every version functioned correctly. Some versions included additional information on connecting the robot to the network and obtaining its IP address. However, ChatGPT kept switching between the naoqi and qi frameworks.

Prompt 2: Ask ChatGPT to generate animated speech (speech and gesture)

Rewrite the code with animated speech

ChatGPT provided working code using the ALAnimatedSpeech module of the NAOqi API.

import qi

# Connect to the robot
ip = "your_robot_IP_address"
port = 9559
session = qi.Session()
session.connect("tcp://" + ip + ":" + str(port))

# Get the animated speech service
animated_speech_service = session.service("ALAnimatedSpeech")

# Set the text to be spoken
text_to_say = "Hello, world and chatGPT!"

# Set the parameters for the animation
body_language = "{\"bodyLanguageMode\":\"contextual\",\"disableBodyTalk\":false}"
animation = {"bodyLanguage": body_language}

# Make Pepper say the text with animated speech
animated_speech_service.say(text_to_say, animation)
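The configuration format in the generated code is a bit unusual. For reference, the NAOqi documentation configures ALAnimatedSpeech with a plain dictionary; below is a minimal sketch along those lines, using the same placeholder IP as above:

import qi

# Connect to the robot (placeholder IP, as above)
ip = "your_robot_IP_address"
port = 9559
session = qi.Session()
session.connect("tcp://" + ip + ":" + str(port))

animated_speech_service = session.service("ALAnimatedSpeech")

# The documented configuration is a plain dictionary
configuration = {"bodyLanguageMode": "contextual"}
animated_speech_service.say("Hello, world and chatGPT!", configuration)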

Prompt 3: Generate a long speech

Write a Python code with animated speech to give a speech about how social robots help our society.

Here, ChatGPT really showed its advantage as an LLM.

from naoqi import ALProxy

# Connect to the robot
ip = "127.0.0.1"
port = 39119
tts = ALProxy("ALTextToSpeech", ip, port)
animated_speech = ALProxy("ALAnimatedSpeech", ip, port)

# Set the text to be spoken
text = "Hello everyone! Today, I want to talk about how social robots help our society."

# Set the animation configuration
config = {"bodyLanguageMode": "contextual"}

# Speak the text with animated speech
animated_speech.say(text, config)

# Continue with the speech
tts.say("Social robots are designed to interact with humans in social environments. They can help us in many ways, such as providing companionship, assistance with daily tasks, and even medical care.")

# Set a new animation configuration for the next part of the speech
config = {"bodyLanguageMode": "random"}

# Speak the next part of the speech with animated speech
animated_speech.say("In healthcare, social robots can assist with physical therapy, monitor vital signs, and provide emotional support to patients. They can also help children with autism by providing social interaction and learning opportunities.", config)

# Set a new animation configuration for the last part of the speech
config = {"bodyLanguageMode": "random"}

# Finish the speech with animated speech
animated_speech.say("In summary, social robots have the potential to improve our lives in many ways. They can help us with tasks, provide companionship, and even improve our health. Thank you for listening!", config)

Motion

For the motion experiments, I fixed the framework to naoqi instead of qi and told ChatGPT Pepper's IP address and port. I started from a single joint and worked up to complex motions.

Prompt 1: Rotate the head

Write a Python code for Pepper to turn its head 30 degrees to the left. The IP address is 127.0.0.1 and the port is 39119. Use 'from naoqi import ALProxy' instead of 'import qi'.

ChatGPT generated code that worked perfectly.

from naoqi import ALProxy

IP_ADDRESS = "127.0.0.1"
PORT = 39119

motion = ALProxy("ALMotion", IP_ADDRESS, PORT)
motion.setAngles("HeadYaw", 30.0 * 3.14 / 180.0, 0.2)

The explanation also makes sense.

In this code, we first import the ALProxy module from the naoqi package. Then we define the IP address and port number for the connection to Pepper. Next, we create an instance of the ALMotion module using the ALProxy method and specify the IP address and port number. We use the setAngles method of the ALMotion module to turn Pepper's head to the left by 30 degrees. The "HeadYaw" joint controls the horizontal movement of Pepper's head. The angle is specified in radians, so we convert 30 degrees to radians using the formula 30 * pi / 180. The second argument specifies the speed of the movement. In this case, we set it to 0.2, which means that Pepper will move at a slow speed.

I regenerated the code multiple times and found that ChatGPT produced mostly similar code.
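One small improvement over the generated code: the hard-coded 3.14 can be replaced with an exact degree-to-radian conversion from Python's math module. A minimal variant:

import math
from naoqi import ALProxy

motion = ALProxy("ALMotion", "127.0.0.1", 39119)

# math.radians converts degrees to radians exactly
motion.setAngles("HeadYaw", math.radians(30.0), 0.2)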

Prompt 2: Move forward 1 meter

Write a Python code for Pepper to move forward 1 meter and stop.

ChatGPT generated several versions, but none of them worked. In some cases, the methods used in the generated code did not exist. The closest attempt made Pepper move forward nonstop at 100% speed: it used the moveToward function, while the correct one is moveTo.

Note: I suspect that my prompt of “move forward” misled ChatGPT toward the moveToward function, or sometimes a moveForward function, which doesn't exist.
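For context, the two functions differ fundamentally: moveTo takes a relative displacement and stops when it is reached, while moveToward takes normalized velocities and keeps the robot walking until stopMove is called. A sketch of the velocity-based variant, with an illustrative duration:

import time
from naoqi import ALProxy

motion = ALProxy("ALMotion", "127.0.0.1", 39119)

# moveToward takes normalized velocities; here 50% of max forward speed
motion.moveToward(0.5, 0.0, 0.0)
time.sleep(4.0)  # illustrative; the distance covered depends on speed and surface
motion.stopMove()  # without this call, Pepper keeps walking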

I asked ChatGPT why Pepper did not stop. It offered reasons related to the environment or hardware failure. Here is an example:

It's possible that Pepper did not stop after moving forward 1m due to various reasons such as incorrect distance measurement, slippery surface, or mechanical issues with the robot's movement system. You can try to troubleshoot the issue by checking the robot's sensors, reviewing the code, and testing it in different environments to see if the issue persists. It may also be helpful to consult the manufacturer's documentation or seek assistance from a technical expert.

So I asked ChatGPT to make the change, and the code worked.

Use moveTo instead of moveToward

from naoqi import ALProxy

ip = "127.0.0.1"
port = 39119

motion_proxy = ALProxy("ALMotion", ip, port)
motion_proxy.moveTo(1.0, 0.0, 0.0)  # move forward 1m

Finally, I copied the full list of ALMotion functions into the chat and told ChatGPT to learn from it. It then generated better results, and most of them worked.

from naoqi import ALProxy

# Set the IP and port of Pepper
IP = "127.0.0.1"
PORT = 39119
# Create a proxy to ALMotion
motion_proxy = ALProxy("ALMotion", IP, PORT)

# Define the distance to move along the X axis
distance_x = 1.0

# Call the moveTo function to make Pepper move to the specified distance
motion_proxy.moveTo(distance_x, 0.0, 0.0)
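As far as I understand the ALMotion API, moveTo is blocking and interprets its arguments in the robot's own frame (x forward, y left, theta counterclockwise), so the same call can also turn Pepper in place:

import math
from naoqi import ALProxy

motion_proxy = ALProxy("ALMotion", "127.0.0.1", 39119)

# No translation, only a 90-degree counterclockwise turn in place
motion_proxy.moveTo(0.0, 0.0, math.radians(90))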

Prompt 3: Raise both arms higher

I gave ChatGPT the names of the joints, the D-H parameters, and example code for controlling the joints. Everything was copied and pasted from tables available online.

These are the names of the joints. Body is ... Head + LArm + LLeg + RLeg + RArm The chain ... Head LArm LLeg RLeg RArm involves the joints ... HeadYaw LShoulderPitch LHipYawPitch1 RHipYawPitch1 RShoulderPitch HeadPitch LShoulderRoll LHipRoll RHipRoll RShoulderRoll LElbowYaw LHipPitch RHipPitch RElbowYaw LElbowRoll LKneePitch RKneePitch RElbowRoll LWristYaw2 LAnklePitch RAnklePitch RWristYaw2 LHand2 RAnkleRoll LAnkleRoll RHand2
Write a Python code for Pepper to raise both arms above the shoulder.

After several messages to correct the moving direction of the joints, ChatGPT successfully generated code that did what I wanted.

import qi
import argparse
import sys
import time
import almath

def main(session):
    """
    Example showing how to use setAngles method.
    """
    # Get the services ALMotion, ALRobotPosture and ALRobotModel.
    motion_service = session.service("ALMotion")
    posture_service = session.service("ALRobotPosture")
    robot_model = session.service("ALRobotModel")

    # Wake up robot
    motion_service.wakeUp()

    # Send robot to Stand Init
    posture_service.goToPosture("StandInit", 0.5)

    # Set the fraction of max speed for the arm movement.
    # 0.5 = half the max speed
    fractionMaxSpeed = 0.5

    # Raise both arms higher above the shoulder
    names = ["LShoulderPitch", "RShoulderPitch"]
    angles = [-80.0*almath.TO_RAD, -80.0*almath.TO_RAD]
    motion_service.setAngles(names, angles, fractionMaxSpeed)

    # Wait for the movement to finish
    time.sleep(2.0)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ip", type=str, default="127.0.0.1",
                        help="Robot IP address. On robot or Local Naoqi: use '127.0.0.1'.")
    parser.add_argument("--port", type=int, default=37227,
                        help="Naoqi port number")

    args = parser.parse_args()
    session = qi.Session()
    try:
        session.connect("tcp://" + args.ip + ":" + str(args.port))
    except RuntimeError:
        print ("Can't connect to Naoqi at ip \"" + args.ip + "\" on port " + str(args.port) +".\n"
               "Please check your script arguments. Run with -h option for help.")
        sys.exit(1)
    main(session)

I also requested more abstract motions, such as sad and happy poses. According to ChatGPT, these can be achieved by lowering or raising the head, shoulders, and elbows. Although the results were not ideal, I think that with more specific instructions ChatGPT could generate more meaningful poses. Unfortunately, I did not have enough time to investigate further.
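To give an idea of what such a pose could look like, here is a minimal hand-written sketch of a "sad" posture along the lines ChatGPT suggested; the joint names are real, but the angles are my own illustrative guesses:

import math
from naoqi import ALProxy

motion = ALProxy("ALMotion", "127.0.0.1", 39119)

# Tilt the head down and let both arms hang low (angles are illustrative)
names = ["HeadPitch", "LShoulderPitch", "RShoulderPitch"]
angles = [math.radians(20), math.radians(80), math.radians(80)]
motion.setAngles(names, angles, 0.1)  # slow speed to match the mood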

Combining speech and locomotion

Write a python code of greeting an old person using animated speech and then guided them to a table at a position of (5,2) using moveTo. When arriving the table, ask them to sit down and rest.

ChatGPT generated code that worked perfectly and chose appropriate words for the interaction.

from naoqi import ALProxy

# Set up proxies for animated speech and motion
tts = ALProxy("ALTextToSpeech", "127.0.0.1", 39119)
motion = ALProxy("ALMotion", "127.0.0.1", 39119)

# Greet the old person with animated speech
tts.say("Hello, how are you feeling today?")

# Guide the person to the table using the ALMotion proxy
motion.moveTo(5.0, 2.0, 0.0) # move to the table position of (5,2)

# Ask the person to sit down and rest
tts.say("Please have a seat and take a rest. Let me know if you need anything.")

Another response:

from naoqi import ALProxy

# Set up the proxies for animated speech and motion
tts = ALProxy("ALTextToSpeech", "127.0.0.1", 39119)
motion = ALProxy("ALMotion", "127.0.0.1", 39119)

# Greet the old person using animated speech
tts.say("Hello, how are you today? My name is Pepper, and I am here to help you. Would you like me to guide you to a table?")

# Move Pepper to the table
x = 5.0
y = 2.0
theta = 0.0
motion.moveTo(x, y, theta)

# Ask the person to sit down and rest
tts.say("Here we are. Please take a seat and rest. Is there anything else I can help you with?")

Pepper as a conversational agent with ChatGPT

In my opinion, the implementation process is relatively straightforward. It would involve first capturing the voice input and converting it to text with the Whisper speech-to-text engine. Next, the text could be fed into ChatGPT (following the Bing-GPT-Voice-Assistant approach) to generate a response of suitable length (not too long). Finally, ALAnimatedSpeech and ALMotion could read the generated response out loud while adding suitable gestures and motion (including locomotion) to make the response more engaging.

Update: I tested the approach described above, but it required a complex setup, including the installation of several libraries, which proved challenging for ChatGPT to handle. As a result, I wrote the code myself. The approach worked, but the response time depended on OpenAI's servers, which made the conversation feel somewhat unnatural.

Because naoqi is designed to work with Python 2, we need two separate scripts that communicate with each other through a text file. There are cleaner ways to achieve this, but the concept remains the same.

Python3: voicegpt.py

import openai
import asyncio
import time
import whisper
import speech_recognition as sr

# Initialize the OpenAI API
openai.api_key = "YOUR OPENAI API KEY"

# Create a recognizer object and wake word variables
recognizer = sr.Recognizer()
GPT_WAKE_WORD = "hello"

def get_wake_word(phrase):
    if GPT_WAKE_WORD in phrase.lower():
        return GPT_WAKE_WORD
    else:
        return None
    
def synthesize_speech(text):
    # Write the text to a file for tts.py (Python 2) to pick up and speak
    with open("speech.txt", "w") as f:
        f.write(text)

async def main():
    while True:

        with sr.Microphone() as source:
            recognizer.adjust_for_ambient_noise(source)
            print(f"Waiting for wake words 'hello'...")
            while True:
                audio = recognizer.listen(source)
                print(f"Listenning...")
                try:
                    with open("audio.wav", "wb") as f:
                        f.write(audio.get_wav_data())
                    # Load the tiny Whisper model and transcribe the recording
                    model = whisper.load_model("tiny")
                    result = model.transcribe("audio.wav")
                    phrase = result["text"]
                    print(f"You said: {phrase}")

                    wake_word = get_wake_word(phrase)
                    if wake_word is not None:
                        break
                    else:
                        print("Not a wake word. Try again.")
                except Exception as e:
                    print("Error transcribing audio: {0}".format(e))
                    continue

            print("Speak a prompt...")
            synthesize_speech('What can I help you with?')
            # Ideally, wait until Pepper finishes talking, to prevent Pepper's own
            # voice from being picked up; this could be checked with naoqi, but I didn't have the time
            time.sleep(3)

            audio = recognizer.listen(source)

            try:
                with open("audio_prompt.wav", "wb") as f:
                    f.write(audio.get_wav_data())
                model = whisper.load_model("base")
                result = model.transcribe("audio_prompt.wav")
                user_input = result["text"]
                print(f"You said: {user_input}")
            except Exception as e:
                print("Error transcribing audio: {0}".format(e))
                continue

            if wake_word == GPT_WAKE_WORD:
                # Send prompt to GPT-3.5-turbo API
                response = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo",
                    messages=[
                        {"role": "system", "content":
                        "Imagine you are Pepper, a humanoid robot. You are a fun interactive partner. Give response in the role of Pepper."},
                        {"role": "user", "content": user_input},
                    ],
                    temperature=0.5,
                    max_tokens=150,
                    top_p=1,
                    frequency_penalty=0,
                    presence_penalty=0,
                    n=1,
                    stop=["\nUser:"],
                )

                bot_response = response["choices"][0]["message"]["content"]
                
        print("Bot's response:", bot_response)
        synthesize_speech(bot_response)
        # Ideally, wait until Pepper finishes talking, to prevent Pepper's own
        # voice from being picked up; this could be checked with naoqi, but I didn't have the time
        time.sleep(10)
        print("End")

if __name__ == "__main__":
    asyncio.run(main())

Python2: tts.py

import time
from naoqi import ALProxy

# Set up connection with Pepper
ip = "127.0.0.1" # replace with Pepper's IP address
port = [your port]
tts = ALProxy("ALAnimatedSpeech", ip, port)
text_old = ""

while True:
    f = open('speech.txt', 'r')
    text = f.read().replace('\n', ' ')
    f.close()
    if text != "" and text != text_old:
        tts.say(text)
        print(text)
        text_old = text
    time.sleep(1)
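To run the full loop, I start tts.py under Python 2 (with the NAOqi SDK on its path) and voicegpt.py under Python 3 at the same time; the shared speech.txt file is their only link. A local socket or queue would be a cleaner channel, but the one-second polling loop above is enough for a demo.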

Conclusion

Providing clear instructions and examples is crucial for getting good results. ChatGPT has broad knowledge of many topics, but relying on that knowledge alone may not be enough to deliver the desired outcome. Concrete instructions and examples clarify the user's intentions, guide the generation process, and lead to more accurate and relevant responses.

ChatGPT surprised me with some of its responses, but it took a significant amount of time to guide it and refine the generated outputs to the level of quality I wanted. For that reason, I believe a Wizard-of-Oz setup would be more practical; the only task I would trust ChatGPT with autonomously is being a conversational partner.

I do not plan to use it for my work for now, but who knows what opportunities might arise in the future.

⚠️ And yes, as you might suspect, some parts of the text may have been written or elaborated upon by ChatGPT based on my brief inputs.

