Author: Omar M Tarek (Alexandria STEM School)
Abstract
The prevalence of mental health challenges, such as depression, highlights the increasing need for innovative therapeutic tools. In response, a therapeutic robot, inspired by BMO from Adventure Time, has been designed to engage users through voice interaction, providing emotional support. The development process involved two main components: a software system featuring near-instant voice transcription using Whisper with CUDA acceleration, and a hardware design leveraging an old laptop and Arduino. Text generation experiments with local language models, including Llama2 and Claude, were conducted but proved insufficient in speed and accuracy. The integration of GPT-3.5 Turbo significantly improved response time and coherence. Text-to-speech (TTS) capabilities were provided by ElevenLabs, which produced a voice nearly identical to BMO's. Additionally, the robot's face and body were crafted from acrylic to replicate BMO's visual design. This combination of efficient voice transcription, precise language generation, and realistic TTS produced a therapeutic robot capable of delivering rapid, context-aware responses. The success of this project can be attributed to optimized software design with efficient speech and text processing, accurate character replication, and a cost-effective hardware solution. The robot demonstrates how AI and character-driven design can offer novel solutions in mental health care.
Introduction
Mental health care has emerged as a global priority, with increasing numbers of individuals experiencing conditions such as depression and anxiety. These disorders often go untreated due to barriers in accessing timely and personalized therapeutic support. The integration of robotics and artificial intelligence (AI) offers a promising avenue for addressing these gaps. This research focuses on the development of a therapeutic robot inspired by the character BMO from Adventure Time, designed to engage users in emotionally supportive interactions. By combining AI-driven language models with a familiar and comforting character, the robot aims to provide an accessible and engaging form of mental health support.
The development process encompassed both software and hardware challenges. From a software perspective, one of the primary objectives was to enable natural, real-time interaction without reliance on constant internet connectivity. Initial attempts at using local language models such as Llama2 and Claude revealed limitations in both response speed and coherence, particularly when tested on consumer-grade hardware. These limitations were overcome by implementing OpenAI's GPT-3.5 Turbo, which provided significant improvements in both the quality and speed of responses. Voice transcription, another critical component of the system, was optimized using Whisper with CUDA acceleration, resulting in near-instantaneous transcription of user input. The text-to-speech (TTS) system was equally important for maintaining the robot's character-driven design, and after experimentation with Tortoise TTS and Google TTS, ElevenLabs was chosen for its high fidelity and efficiency in replicating BMO's voice.
Hardware design was driven by cost considerations, leading to the decision to repurpose an old laptop rather than investing in a new Raspberry Pi system. The laptop, combined with an Arduino for external button control, was housed within a custom-built acrylic casing designed to visually replicate BMO’s form. The integration of a Bluetooth microphone further enhanced the robot’s ability to receive user input clearly, even in noisy environments.
The primary contribution of this research is the development of an affordable and effective therapeutic robot that utilizes cutting-edge AI technologies to provide emotional support to those who require it. By addressing key challenges in voice transcription, language processing, and cost-effective hardware design, this project demonstrates the potential of AI-driven robots in mental health care. In the sections that follow, we detail the software and hardware development processes, present the results of user testing, and discuss the implications of this work for the broader field of therapeutic robotics.
Software Development
Voice Transcription
Voice transcription is a key component of this project, enabling the robot to understand and respond to the user's voice. Initially, AssemblyAI was used for its simplicity, but it was later replaced by Whisper, which runs entirely offline with no need for an internet connection. Faster-whisper with NVIDIA's CUDA integration achieved near-instant voice transcription, with audio recorded using PyAudio and Wave.
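As a rough illustration, the recording step might look like the following minimal Python sketch using PyAudio and Wave; the sample rate, recording duration, and file name are illustrative assumptions, not the project's exact values.

```python
import pyaudio
import wave

RATE, CHUNK, SECONDS = 16000, 1024, 5   # 16 kHz mono suits Whisper well

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
stream.stop_stream()
stream.close()
pa.terminate()

# Save the capture as a WAV file for the transcription step.
with wave.open("input.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))  # 2 bytes
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))
```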
The transcription process was optimized using Whisper with CUDA acceleration, which improved speed by 357% as shown in the figure above, a method similar to those explored in human-robot interactions (Srivastava, 2024), all while remaining fully local with no need for an online connection. This speedup is possible because of Tensor Cores, specialized units in recent NVIDIA GPUs that each perform a matrix multiply-and-accumulate on 4x4 matrices per clock cycle (Al Ghadani et al., 2020; Fasi et al., 2021), greatly accelerating the large matrix operations involved in processing audio and thus producing faster transcription.
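The transcription step itself reduces to a few lines with the faster-whisper package. The model size ("base.en") is an assumption; any Whisper checkpoint can be substituted.

```python
from faster_whisper import WhisperModel

# compute_type="float16" lets the GPU's Tensor Cores do the heavy lifting;
# on machines without CUDA, device="cpu" with compute_type="int8" also works.
model = WhisperModel("base.en", device="cuda", compute_type="float16")

segments, info = model.transcribe("input.wav")
text = " ".join(segment.text.strip() for segment in segments)
print(text)
```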
Text Generation
In developing the robot's text generation capabilities, the initial objective was to implement a solution that could function without relying on internet connectivity. To achieve this, several local language models were evaluated, including Llama2, Llama3, and Claude. These models were chosen for their potential to provide high-quality responses offline, allowing for a fully self-contained system. However, a significant issue emerged during testing on a laptop equipped with an NVIDIA RTX 4060 GPU, an Intel i7-13700HX CPU, and 16GB of DDR5 RAM.
The issue was a lack of coherence and relevance in the responses. The models frequently generated excessively verbose and unrelated content. For instance, a simple greeting such as "hi" would trigger an unnecessarily long 100-word reply, often with little regard for the context or intent of the user's input. Additionally, requesting genuinely long responses led to severe delays, with the model only replying several minutes after the test had started.
After five iterations of testing and modifying different local models, none were able to resolve this core issue of relevance. As a result, the OpenAI GPT-3.5 Turbo API was adopted in the sixth attempt.
After the experiment's completion, the integration of GPT-3.5 Turbo had addressed the coherence issues seen in the earlier language models, aligning with findings from studies exploring large language models for mental health applications (Guo et al., 2024; Hua et al., 2024).
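For reference, a minimal call to GPT-3.5 Turbo with the official openai Python client might look like the sketch below; the BMO persona prompt and token limit are illustrative assumptions, not the project's exact values.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_reply(user_text: str) -> str:
    # A short system prompt keeps replies in character and concise,
    # avoiding the verbose, off-topic replies seen with the local models.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You are BMO, a cheerful, supportive companion. "
                        "Reply briefly and warmly."},
            {"role": "user", "content": user_text},
        ],
        max_tokens=150,
    )
    return response.choices[0].message.content

print(generate_reply("hi"))
```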
Text-To-Speech (TTS)
The TTS system was inspired by a YouTube series about TortoiseTTS and RVC. RVC, or Retrieval-based Voice Conversion, is an AI technology used to clone or convert one voice into another. In our case, training an RVC model to mimic the voice of BMO involved a few steps: first, vocal data of BMO was collected into a folder, and then the RVC training process was run. Since RVC needs a source voice to convert, a text-to-speech library had to be added. Locally, there was pyttsx3, whose voice generation was mediocre at best. Among online APIs, there were gTTS, Bark, and ElevenLabs, with ElevenLabs offering built-in voice cloning. A speed test was run on all four of these with and without RVC, except for ElevenLabs, which was tested with its built-in voice cloning. The experiment was controlled by using the same text prompt: "Hi, I'm BMO, nice to meet you."
The RVC training process is outlined above. Below is a comparison of the different TTS systems in audio generation time.
After analyzing the data, ElevenLabs was chosen for its high-fidelity voice reproduction, a key factor in TTS applications in therapeutic robotics (Bartal et al., 2024), and because it is 250% faster than pyttsx3. In addition, when 10 people were asked to choose which of the libraries sounded best, ElevenLabs won this category as well, making it the strongest candidate TTS library for the project. The figure below shows a qualitative analysis of listener preference among the TTS libraries.
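A minimal sketch of an ElevenLabs request, using its public REST text-to-speech endpoint, is shown below; the API key and voice ID are placeholders standing in for the project's cloned BMO voice.

```python
import requests

ELEVEN_API_KEY = "your-api-key"        # placeholder
VOICE_ID = "your-cloned-bmo-voice-id"  # placeholder for the cloned voice

# POSTing text to the voice's TTS endpoint returns MP3 audio.
response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": ELEVEN_API_KEY},
    json={"text": "Hi, I'm BMO, nice to meet you.",
          "model_id": "eleven_monolingual_v1"},
)
response.raise_for_status()
with open("bmo_reply.mp3", "wb") as f:
    f.write(response.content)
```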
Facial Display
The next part of BMO's software is the cute face he is known for. To create it, a graphics library was needed; two of the many options tested were pygame and Tkinter. Although pygame was tried first, its biggest drawback is its threading behavior, which made the window lag and freeze repeatedly. Because of this, Tkinter was deployed instead, and it worked as expected. One of the main challenges was then creating BMO's facial expressions. To make the display fun and engaging, different facial states were implemented to distinguish what the AI is currently doing. Four states were chosen, as shown below: idle, listening, processing, and speaking.
To switch between facial expressions, a GIF-based system was deployed. Transitional GIFs are played through Tkinter's GIF support whenever the face changes, after which the display switches back to a static image. This minimizes the performance cost while preserving the visuals, without the need for SVG or vector libraries.
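A minimal sketch of this GIF mechanism in Tkinter follows; the frame delay and file name are illustrative. Tkinter's PhotoImage can address individual GIF frames through its format string, which is what makes this approach work without extra libraries.

```python
import tkinter as tk

class FaceDisplay:
    """Plays an animated GIF by cycling its frames on a Tkinter label."""

    def __init__(self, root, gif_path, delay_ms=80):
        self.root = root
        self.label = tk.Label(root)
        self.label.pack()
        self.frames = self._load_frames(gif_path)
        self.delay_ms = delay_ms
        self.index = 0

    @staticmethod
    def _load_frames(gif_path):
        frames = []
        i = 0
        while True:
            try:
                # "gif -index i" selects the i-th frame of the GIF.
                frames.append(tk.PhotoImage(file=gif_path,
                                            format=f"gif -index {i}"))
                i += 1
            except tk.TclError:  # raised once we run past the last frame
                return frames

    def animate(self):
        self.label.configure(image=self.frames[self.index])
        self.index = (self.index + 1) % len(self.frames)
        self.root.after(self.delay_ms, self.animate)

root = tk.Tk()
face = FaceDisplay(root, "idle.gif")  # hypothetical expression file
face.animate()
root.mainloop()
```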
Software Manager
To make all these parts work together in unison, a software manager was deployed. It runs the display thread and the AI thread separately to avoid freezes. The AI thread handles listening, transcription, text generation, and TTS, and maintains a state value that the display thread reads to drive the facial expressions. This multi-threaded design reduces the possibility of freezing.
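A minimal sketch of this two-thread arrangement is given below; the pipeline functions are stubs standing in for the real recording, transcription, generation, and TTS modules, and the display loop is simplified to the main thread polling the shared state.

```python
import threading
import time

state = {"phase": "idle"}        # shared state written by the AI thread
state_lock = threading.Lock()

def set_phase(phase):
    with state_lock:
        state["phase"] = phase

# Stubs standing in for the real modules described in earlier sections.
def record_audio():
    time.sleep(1)                 # pretend to record with PyAudio
    return b""

def transcribe(audio):
    return "hi"                   # pretend to run faster-whisper

def generate_reply(text):
    return "Hello! I am BMO."     # pretend to call GPT-3.5 Turbo

def speak(reply):
    time.sleep(1)                 # pretend to play ElevenLabs audio

def ai_loop():
    while True:
        set_phase("listening")
        audio = record_audio()
        set_phase("processing")
        reply = generate_reply(transcribe(audio))
        set_phase("speaking")
        speak(reply)
        set_phase("idle")
        time.sleep(0.5)

threading.Thread(target=ai_loop, daemon=True).start()

# In the real system the display side is the Tkinter main loop, which
# polls the phase (e.g. via root.after) and swaps facial expressions.
while True:
    with state_lock:
        print("face:", state["phase"])
    time.sleep(0.5)
```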
Unity Deployment and Executable Generation
So far, only the runtime portion of the software has been addressed. However, because the program needs to be compatible with other hardware out of the box, a plain Python deployment requires excessive preparatory work, such as installing CUDA, RVC, and other dependencies.
Because most of the software relies on APIs, porting it to Unity was not especially difficult: the same APIs are used, and only the graphics library changed. BMO's AI software is now a standalone executable that can be run on any PC at any time.
This change led to a 410% increase in loading speed, giving the program the ability to run on weaker hardware and potentially be turned into a simulated OS, effectively making the program load instantaneously when the robot starts.
Hardware Design
Using Old Hardware
To minimize costs, an old laptop was repurposed instead of purchasing a Raspberry Pi. A monitor was refurbished to serve as the robot's face. To replicate BMO's buttons, an Arduino was connected to the laptop and physical buttons were wired to it.
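On the Python side, button events from the Arduino can be read over the serial link with pyserial, as in this minimal sketch; the port name and the message format sent by the Arduino sketch are assumptions.

```python
import serial  # pyserial

# Port name is an assumption: typically "COM3" on Windows,
# "/dev/ttyACM0" or "/dev/ttyUSB0" on Linux.
ser = serial.Serial("/dev/ttyACM0", 9600, timeout=1)

while True:
    # Assumes the Arduino sketch prints one line per button press,
    # e.g. "BTN_TALK" via Serial.println().
    line = ser.readline().decode("ascii", errors="ignore").strip()
    if line:
        print("button event:", line)  # hand off to the software manager here
```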
Wireless Microphone System
To fix the problem of the AI recording continuously because background noise never let it stop, a wireless microphone was purchased and embedded in a necklace, with its USB receiver attached to the laptop. This modification greatly attenuated background noise.
Custom Casing
Using BMO's diagram, shown below, a casing was cut from acrylic and colored to match the body's colors. A number of shapes were then 3D-printed and added as the buttons, and indents and mounting areas were made to hold the laptop and the monitor.
Results
The robot successfully performs its intended therapeutic functions. The wireless microphone necklace initiates the conversation and shields it from background noise; OpenAI's GPT-3.5 Turbo provides coherent and timely responses; and the ElevenLabs TTS system ensures natural-sounding speech. The hardware design has proven durable and visually faithful to the character of BMO, which enhances user engagement and emotional connection. However, challenges remain in optimizing the local processing of language models for offline functionality.
Recommendations
Currently, BMO has some flaws that need to be addressed. First, it cannot distinguish one speaker from another; a form of speaker diarization could be implemented to fix this (Park et al., 2022). Another improvement could be adding computer vision, making BMO able to detect faces and objects. Lastly, motors could be added to make BMO an actual moving being.
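As one possible direction, a pretrained diarization pipeline such as pyannote.audio could tag each utterance with a speaker label. This sketch assumes a Hugging Face access token and is not part of the current system.

```python
from pyannote.audio import Pipeline

# Loading the pretrained pipeline requires a Hugging Face token
# with access to the pyannote speaker-diarization model.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...")  # placeholder token

diarization = pipeline("input.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```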
References
[1] Al Ghadani, A. K. A., Mateen, W., & Ramaswamy, R. G. (2020, May). Tensor-based CUDA optimization for ANN inferencing using parallel acceleration on embedded GPU. In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 291-302). Springer International Publishing.
[2] Bartal, A., Jagodnik, K. M., Chan, S. J., & Dekel, S. (2024). OpenAI's narrative embeddings can be used for detecting post-traumatic stress following childbirth via birth stories. Research Square.
[3] Carlbring, P., Hadjistavropoulos, H., Kleiboer, A., & Andersson, G. (2023). A new era in internet interventions: The advent of ChatGPT and AI-assisted therapist guidance. Internet Interventions, 32, 100621.
[4] D’Alfonso, S. (2020). AI in mental health. Current Opinion in Psychology, 36, 112-117.
[5] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A. et al. (2024). The LLaMA 3 herd of models. arXiv Preprint, arXiv:2407.21783.
[6] Durak, H., & Uysal, G. (2021). The effect of cartoon watching and distraction cards on physiological parameters and fear levels during inhalation therapy in children: A randomized controlled study. Journal of Tropical Pediatrics, 67, fmab018.
[7] Dyshel, K. (2014). The development of primary school pupils’ affection by using animated cartoon as the therapy. Science and Education: A New Dimension. Pedagogy and Psychology, 30, 92-94.
[8] Fasi, M., Higham, N. J., Mikaitis, M., & Pranesh, S. (2021). Numerical behavior of NVIDIA tensor cores. PeerJ Computer Science, 7, e330.
[9] Guo, Z., Lai, A., Thygesen, J. H., Farrington, J., Keen, T., & Li, K. (2024). Large language models for mental health applications: A systematic review. JMIR Mental Health, 11, e57400.
[10] Hua, Y., Na, H., Li, Z., Liu, F., Fang, X., Clifton, D., & Torous, J. (2024). Applying and evaluating large language models in mental health care: A scoping review of human-assessed generative tasks. arXiv Preprint, arXiv:2408.11288.
[11] Kaur, N., & Singh, P. (2023). Conventional and contemporary approaches used in text-to-speech synthesis: A review. Artificial Intelligence Review, 56, 5837-5880.
[12] Ord, P. (2009). The therapeutic use of a cartoon as a way to gain influence over a problem. International Journal of Narrative Therapy & Community Work, 2009(1), 14-17.
[13] Park, T. J., Kanda, N., Dimitriadis, D., Han, K. J., Watanabe, S., & Narayanan, S. (2022). A review of speaker diarization: Recent advances with deep learning. Computer Speech & Language, 72, 101317.
[14] Pham, K. T., Nabizadeh, A., & Selek, S. (2022). Artificial intelligence and chatbots in psychiatry. Psychiatric Quarterly, 93, 249-253.
[15] Srivastava, A. (2024). Improving human-robot spoken interactions through Whisper. Ghent University.