“Put me on the phone with a human!”
is something I used to hear my parents say frequently when calling customer support. As calm as they usually were, they were so… irritated by the communicative limitations of a machine. Everyone, at some point, has wished they could “talk to a human instead.”
Luckily for everyone, advancements in machine learning are improving the ability of machines and chatbots to communicate with people.
AI has the potential to carry out services that otherwise only a human can do well. Chatbot therapists are one application of AI to such a service. The development of chatbot therapy is a reasonable, if not an obvious, step toward improving the quality of life of people across the globe.
I’ve heard, and still hear, people who want help complain about the systems in place to provide mental health support. Difficulty connecting with a professional and finding affordable, available care are among the common complaints, and the stigma around different illnesses can scare people away from real professionals. According to the World Health Organization, 1 in 8 people across the world struggle with a mental disorder. When you realize most of these people don’t have access to quality support, the scale of the problem becomes clear.
Providers of chatbot therapy advertise 24/7 access, reasonably priced (or free) services, and results that rival in-person treatment. The services offered depend on the provider and are expanding as the chatbot models develop. Generally, they are designed to support well-being and reduce the likelihood of a crisis; in a crisis, the AI agent provides emergency resources and a connection to a real person.
The agents can be nearly indistinguishable from a human. Elomia Health and talk2us.ai are examples; the latter’s agent will provide relationship, career, and overall life guidance. As incredible as this is, serious skepticism and criticism about these agents’ capabilities exist.
The hype behind chatbot therapy today started to pick up when users of ChatGPT told the agent to act as if it were a professional therapist and then consulted with it.
A user, JustinCord, had this to say on Reddit: “As someone who has consumed a lot of mental health services in his life, I can say that I found [ChatGPT] to be incredibly helpful, much more than many of the humans I have interacted with…”
Tebra showcased the American public’s feelings towards AI health consultation in a blog post with survey results, asking healthcare professionals and users about their experiences with the chatbot services ChatGPT, Bard, and BingAI. Some of the stats from this survey are in the graphic below.
Besides the obvious potential differences in setting, the actual dialogue can be very similar. The agent’s behaviour depends on the therapy style it’s trained on. CBT (cognitive behavioural therapy) is one of the most common and effective modalities of therapy.
The first stage of CBT is typically an evaluation stage to get a rough understanding of the patient. An AI agent that can ‘remember’ previous interactions in separate sessions can act as if it evaluated the patient as a human therapist would. An agent that cannot remember previous sessions will adapt by asking questions in the current session to understand the client for that session only.
The exact conversations vary, but generally speaking, what happens in a chatbot session is similar to a session with a person trained in CBT (I am not an expert; this is what I’ve gathered from reading and conversation). The agent or practitioner (human therapist) will (a toy script of this flow follows the list):
-set an agenda for the session
-carry out a mood check
-empathize with the client
-ask for feedback on its/their understanding of what the client is saying
-initiate a calming activity if appropriate
-identify problems and help the patient set goals
-discuss the cognitive model with the patient
-give optimism, reassure the patient their problems are real, and that it is okay to struggle
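To make that structure concrete, here is a toy, purely illustrative Python script of such a flow (the stage names and prompts are my own invention, not any product’s actual logic):

```python
# A toy, scripted session flow. Illustrative only; not a real product's logic.
SESSION_STAGES = [
    ("agenda", "What would you like to focus on today?"),
    ("mood_check", "On a scale of 1 to 10, how are you feeling right now?"),
    ("empathize", "That sounds difficult. Thank you for sharing it with me."),
    ("feedback", "Here is what I understood so far. Does that sound right?"),
    ("goals", "What is one small goal we could set around this problem?"),
    ("reassure", "Struggling with this is normal, and your concerns are real."),
]

def run_session() -> None:
    for stage, prompt in SESSION_STAGES:
        print(f"[{stage}] {prompt}")
        reply = input("> ")
        # A real agent would interpret `reply` with an NLP model and branch:
        # e.g., start a calming activity, or surface crisis resources.

if __name__ == "__main__":
    run_session()
```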
This is a brief overview of what happens in a dialogue with both AI and human therapists. AI therapists on the market now are focused on short- to medium-term aid. Treating and resolving many disorders is very challenging, and humans still struggle to do so.
To better understand some of the risks of chatbots, we need to know how they work. When I talk about an AI therapist or agent, I’m talking about a chatbot that applies NLP (natural language processing).
The chatbots that make customer service calls painful are rule-based; these bots are preprogrammed with select responses for specific inputs. NLP is a subfield of computer science and linguistics. It can be applied in many different ways depending on the purpose. Generally, its application aims to translate human language into machine-readable data and vice-versa.
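For intuition, here is a minimal sketch of a rule-based bot in Python. The keywords and replies are made up; the point is the fixed keyword-to-canned-response mapping:

```python
# Minimal rule-based bot: fixed keywords map to canned responses.
# No understanding happens here, which is why these bots frustrate callers.
RULES = {
    "billing": "I can help with billing. Please enter your account number.",
    "cancel": "To cancel your service, press 1.",
    "refund": "Refunds take 5 to 10 business days.",
}

def respond(user_input: str) -> str:
    text = user_input.lower()
    for keyword, reply in RULES.items():
        if keyword in text:
            return reply
    return "Sorry, I didn't understand that. Please try again."

print(respond("I want to cancel my subscription"))   # matches "cancel"
print(respond("Put me on the phone with a human!"))  # the dreaded fallback
```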
NLP applied in chatbots has several components (a rough sketch of them in code follows the list):
-Tokenization or lexical analysis is splitting sentences into ‘tokens’ based on their meaning or relationship to the sentence.
-Normalization checks the words for typos and reduces them to the standard form they represent (e.g., a lemma).
-Entity recognition is looking for keywords to identify the topic of conversation.
-Semantic analysis is understanding the meaning of the sentence by looking at the meanings of the words and their relation to the sentence structure.
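Here is a rough sketch of those stages using the spaCy library, assuming its small English model (en_core_web_sm) is installed; the example sentence is my own:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("I've been feeling anxious since I moved to Toronto.")

# Tokenization / lexical analysis: the text is split into tokens.
print([token.text for token in doc])

# Normalization: each token is reduced to a standard form (its lemma).
print([token.lemma_ for token in doc])

# Entity recognition: keywords that identify the topic are labelled.
print([(ent.text, ent.label_) for ent in doc.ents])  # e.g. ('Toronto', 'GPE')

# Semantic analysis: the dependency parse ties word meanings to structure.
print([(token.text, token.dep_, token.head.text) for token in doc])
```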
NLU (natural language understanding) is a subfield of NLP specifically applied to understand human language by recognizing patterns in unstructured input. NLU in chatbots uses (a toy example follows the list):
-A dictionary to determine the meaning of a word.
-A parser to check if the input syntax is appropriate for the language.
-A set of grammar rules to identify the parts of the sentence or statement.
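As a toy example, here is how those pieces might look with the NLTK library: a tiny, made-up grammar, and a parser that checks whether the input fits it:

```python
import nltk

# The "set of grammar rules": a toy context-free grammar (my own invention).
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> 'I'
    VP  -> V ADJ
    V   -> 'feel'
    ADJ -> 'anxious' | 'fine'
""")

# The parser checks whether the input's syntax is appropriate for the language.
parser = nltk.ChartParser(grammar)

tokens = "I feel anxious".split()
for tree in parser.parse(tokens):  # zero parse trees would mean a syntax error
    tree.pretty_print()
```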
Another subfield of NLP, NLG (natural language generation), produces human-readable text from structured machine-produced data. After the intent and meaning of the input are understood, NLG in chatbots works roughly like this (a toy template example follows the list):
-Existing data in the knowledge base is filtered.
-Patterns and available answers are interpreted.
-The response structure is planned narratively.
-Expressions and words for each sentence are compiled.
-Grammar rules are applied for accuracy.
-Data is inserted into language templates for a natural-sounding response.
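Here is a toy, template-based sketch of those steps in Python. The knowledge base and template are invented for illustration; real NLG systems are far more sophisticated:

```python
# Toy template-based NLG: structured data is slotted into language templates.
KNOWLEDGE_BASE = {
    ("mood", "low"): "feeling low is more common than you might think",
    ("mood", "high"): "it's great to hear you're doing well",
}

TEMPLATE = "Thanks for telling me: {fact}. Would you like to talk about your {topic}?"

def generate(topic: str, value: str) -> str:
    fact = KNOWLEDGE_BASE[(topic, value)]           # filter the knowledge base
    return TEMPLATE.format(fact=fact, topic=topic)  # fill the language template

print(generate("mood", "low"))
```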
The diagram below shows the evolution of NLP models over time. In 1990, RNNs (recurrent neural networks) were first used in NLP with the Elman network; in 2017, Ashish Vaswani et al. proposed the transformer model in their paper “Attention Is All You Need.” The transformer architecture does not need CNNs (convolutional neural networks) or RNNs. Instead, it uses self-attention mechanisms; the authors hypothesized that transformer-based models could be trained faster and perform better.
In 2018, Google introduced a family of models called BERT (Bidirectional Encoder Representations from Transformers) that outperformed RNN-based models (GPT was introduced before BERT that year). The transformer architecture has massively improved machine-human communication and is used in the viral ChatGPT. Transformers are the standard in NLP today.
To understand how transformers are used for many NLP goals (like creating a therapy chatbot), it helps to initially imagine we are using this architecture to translate one language into another. The original transformer architecture was proposed with both an encoder and a decoder.
For language translation, the encoder is intended to encode the input’s meaning in machine language. The decoder reconstructs language from the encoded meaning in the appropriate language customs.
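As a quick sketch, Hugging Face’s transformers library exposes encoder-decoder models for translation. The choice of t5-small below is mine, simply a small model that fits the example:

```python
from transformers import pipeline

# Assumes: pip install transformers sentencepiece torch
# T5 is an encoder-decoder transformer: the encoder encodes the English
# input's meaning; the decoder reconstructs it following French conventions.
translator = pipeline("translation_en_to_fr", model="t5-small")

print(translator("The therapist listened carefully.")[0]["translation_text"])
```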
Not all transformer-based models have both types of -coders, and transformers can do more than translate between languages. For example, GPT is a decoder-only model, BERT is an encoder-only model, and GPT-4 can accept images as well as text as input.
The -coder in a transformer is a ‘stack’ of ‘blocks’; the stack size can vary for the desired purpose. The mechanisms in each -coder block can differ as well. Language comprehension and generation can be achieved through different alterations to either -coder; different transformer architectures perform better at different things. All transformers use MHAs (multi-head attention mechanisms) in these blocks.
The big thing that makes transformers uniquely powerful is how they use self-attention mechanisms.
RNN-based models look at relationships sequentially, one word or token at a time, moving through a sentence and building a vector value for the sentence’s meaning. RNNs struggled with what are called long-range dependencies: in the language translation example, they struggled to capture ideas separated by many words. Long short-term memory (LSTM) was introduced to combat issues with long-range dependencies; a memory cell selectively stored data as the sequence was processed, effectively ‘remembering’ that info for later. Transformer models outperform RNNs that use LSTM.
Transformers look at the big picture of an input. To better capture the meaning of a token embedding, an attention head’s output for the current token embedding depends partly on every other embedding’s relevance to its meaning. This allows relationships between distant parts of an input to be calculated. Transformers use multiple attention heads that simultaneously look at different relevance patterns in an embedding, and they ‘stack’ the -coder blocks to capture increasingly complex patterns in the input. This is why the architecture is so great for NLP.
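As a sketch of the multi-head idea, PyTorch ships a ready-made multi-head attention module; the dimensions below are arbitrary:

```python
import torch
import torch.nn as nn

# Eight heads simultaneously score different relevance patterns across
# the whole sequence; there is no sequential pass as in an RNN.
embed_dim, num_heads, seq_len = 512, 8, 10
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)  # a batch of 10 token embeddings
out, weights = mha(x, x, x)             # self-attention: query = key = value = x

print(out.shape)      # torch.Size([1, 10, 512]): contextualized embeddings
print(weights.shape)  # torch.Size([1, 10, 10]): every token attends to every token
```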
How does the encoder work?
This is a diagram of the original transformer model, with the encoder block highlighted. There are six of these blocks in the originally proposed transformer.
Before entering the encoder block, input tokens like “apple,” “meaning,” and “-ful” are embedded, meaning they are given a numerical value (a multi-dimensional vector) so they can be processed. Each embedding is positionally encoded before it enters the block so the model remembers the order of the words or tokens in the input.
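Here is a minimal NumPy sketch of the sinusoidal positional encoding from “Attention Is All You Need”; the sequence length and embedding size below are arbitrary:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(same)."""
    pos = np.arange(seq_len)[:, None]      # token positions
    i = np.arange(0, d_model, 2)[None, :]  # even embedding dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)  # cosine on odd dimensions
    return pe

# These values are added to the token embeddings so the model keeps word order.
print(positional_encoding(seq_len=4, d_model=8).round(2))
```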
The block contains an MHA as its first sublayer. The second sublayer is the FFN (feed-forward neural network). A normalization function follows both sublayers.
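PyTorch packages exactly this sublayer structure. A minimal sketch of the six-block encoder stack, with the original paper’s dimensions and a random input:

```python
import torch
import torch.nn as nn

# One encoder block: self-attention sublayer + FFN sublayer, each wrapped
# in an add-&-norm step, as in the original architecture.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)  # the stack of 6

x = torch.randn(1, 10, 512)  # ten positionally encoded token embeddings
print(encoder(x).shape)      # torch.Size([1, 10, 512])
```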
How does the decoder work?
The original decoder stack comprises six identical copies of the highlighted block below. It has three primary sublayers.
The first is its masked multi-head attention mechanism, which uses masking to make it unidirectional so it attends only to preceding words. It implements several masked single-head attention functions in parallel.
The masked MHA sublayer is unique to the decoder. The second sublayer is another multi-head attention mechanism; in the original encoder-decoder setup, it attends over the encoder’s output (cross-attention). The third sublayer is a feed-forward network like the encoder’s.
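Here is a small NumPy sketch of what that masking does: scores for future positions are set to negative infinity, so the softmax hands them zero weight:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)  # raw attention scores

# Mask out everything after the current token: each position may only
# attend to itself and to the preceding positions.
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
scores[mask] = -np.inf

weights = softmax(scores)
print(weights.round(2))  # upper triangle is all zeros: no peeking ahead
```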
Some of what happens in MHAs (what I think is wildly smart 🧠).
I briefly explained it earlier in my writing, but I want to build some intuition and talk about the crafty way NLP is applied here.
When inputs are first embedded, they are just words in vector form. In self-attention mechanisms like in transformers, we calculate a dot product between vectors and normalize them using a softmax function.
Think about what is happening here. Dot products are used to calculate the similarity of vectors. Imagine the words “walked,” “dog,” and “friend” in the sentences:
“I walked my dog,” and “I walked with my friend.”
The vector for “walked” and the vector for either “friend” or “dog” are somewhat similar in vector space.
If you could capture that level of similarity and proportionally ‘nudge’ the vector closer to the other word in the sentence, then you would ‘contextualize’ them. This ‘nudge’ is done by multiplying the normalized dot products by the original input vector.
When I say an attention head captures the relationship between all the words/tokens in the input, it is because of this ‘contextualizing.’ A vector representing a word in a statement is compared to all other vectors in the input, and its output is the sum of all the input vectors weighted by their similarity to it. This is done for all words/tokens in the input. Note that matrix multiplication is used so that the same dot products are not repeatedly calculated.
I won’t get too technical here, but to see how self-attention mechanisms can be trained, you need to know that the system has trainable weights. Learned matrices are applied to the vector being processed, to all the other vectors before the dot products are taken, and to the vectors that get weighed by their relevance to the word being processed; in the literature, these are the query, key, and value projections.
These matrix weights explain why different heads can be trained to pick up different patterns. The heads process the same vectors simultaneously, and each outputs its own vector representation; the outputs are concatenated and then weighed by a matrix learned in training, a linear transformation that makes the values suitable for the following transformer processes.
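Putting the last few paragraphs together, here is a single attention head in NumPy. The learned matrices are randomly initialized stand-ins for trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 8, 4, 3

x = rng.normal(size=(seq_len, d_model))  # three token embeddings

# The learned matrices: query, key, and value projections.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Dot-product similarity of every token with every other token, scaled,
# then normalized with a softmax.
scores = Q @ K.T / np.sqrt(d_head)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

out = weights @ V  # each row: all value vectors, weighted by relevance
print(out.shape)   # (3, 4): one contextualized vector per input token
```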
After the multi-head attention mechanism processes the input sequence, the next step in a transformer model involves layer normalization. This technique normalizes the values of each vector/token/embedding, whatever you want to call it, within the sequence.
The primary purpose of layer normalization is to improve the model’s training and convergence. By standardizing the distribution of values, it assists in stabilizing gradients during the training process.
This step contributes to the model’s overall stability and efficiency. Consider using three terms of a Taylor series to model a sine function near an input of interest, x = 20: the truncated series doesn’t behave usefully outside a specific range of inputs, and the same is true for many parts of a transformer during training.
Hand in hand with layer normalization, a residual connection is built into the transformer architecture. This mechanism is designed to maintain valuable information from the original input embeddings. Specifically, the sublayer’s input is added to the output of the multi-head attention mechanism, and in the original transformer, that sum is what gets layer-normalized (the “Add & Norm” step). This addition allows information from the original input tokens to flow directly into the successive layers, ensuring that crucial details are retained throughout the model’s processing.
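A minimal PyTorch sketch of this “Add & Norm” step, with a plain linear layer standing in for the attention sublayer:

```python
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention (or the FFN)

x = torch.randn(10, d_model)

# Add & Norm: the sublayer's output is added back to its input (residual
# connection), and the sum is layer-normalized.
out = norm(x + sublayer(x))
print(out.mean().item(), out.std().item())  # roughly 0 and 1 after normalization
```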
Feedforward Neural Network (FFN):
The next pivotal component in the transformer pipeline is the feedforward neural network (FFN). This network consists of one or more fully connected layers, often incorporating non-linear activation functions like Rectified Linear Units (ReLU). Its role is to capture intricate patterns and relationships within the data.
The FFN enables the model to learn complex features and dependencies in the sequence data by applying transformations and non-linearities to the input representations.
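Here is the position-wise FFN as a minimal PyTorch sketch: two fully connected layers with a ReLU between them, using the original paper’s dimensions:

```python
import torch
import torch.nn as nn

# The original FFN: expand from 512 to 2048 dimensions, apply ReLU,
# project back down to 512. Applied to each token's vector independently.
d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

x = torch.randn(10, d_model)  # ten contextualized token embeddings
print(ffn(x).shape)           # torch.Size([10, 512])
```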
Another Layer Normalization and Residual Connection:
Similar to the previous stage, another residual connection and round of layer normalization wrap the FFN’s output. These components serve to maintain training stability and information flow.
Layer normalization ensures that the values of the embeddings remain in a consistent and manageable range, while the residual connection allows for the preservation of essential information from earlier stages. These measures collectively contribute to the robustness and effectiveness of the transformer model as it continues to process and refine the input data.
A note on training:
The transformer models mentioned, GPT and BERT, were trained as language models. This means they were trained on a lot of raw text using self-supervised learning, which doesn’t require humans to label the data. GPT’s self-supervised objective is causal language modelling: predicting the next word in a sentence based on the previous words (BERT’s is masked language modelling). While a model trained this way understands the language statistically, it’s not very useful for specific practical tasks. So, the pre-trained model undergoes transfer learning, fine-tuned in a supervised way using human-annotated labels for a given task.
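Here is the causal language modelling objective in miniature, in PyTorch; random token IDs stand in for real text:

```python
import torch
import torch.nn.functional as F

# Causal language modelling: the target at each position is simply the next
# token, so raw text labels itself (self-supervised learning).
vocab_size, seq_len = 100, 6
token_ids = torch.randint(0, vocab_size, (seq_len,))  # stand-in for real text

inputs = token_ids[:-1]   # tokens 0..4 are the context
targets = token_ids[1:]   # tokens 1..5 are what must be predicted

# A real model would map `inputs` to `logits`; random values stand in here.
logits = torch.randn(seq_len - 1, vocab_size)
loss = F.cross_entropy(logits, targets)
print(loss.item())  # training minimizes this next-token prediction loss
```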
Overall, transformer models can better understand a prompt’s semantics, context, meaning, and intention; they can be fine-tuned to communicate effectively for various purposes.
Off-the-rails answers, dangerous information, and false information from chatbots can happen, though safeguards do exist. In a healthcare setting, this can be very harmful. Poor training data and a shortage of domain-specific data for training increase the risk of this.
Chatbots capable of simulating all the complexities of a human aren’t here yet, limiting their effectiveness. AI models that perform at the level of a professional have not yet been developed. Right now, these chatbots are more like ‘self-help buddies’ than therapists.
Serife Tekin, a professor and mental health researcher at the University of Texas at San Antonio, has said, “The hype and promise is way ahead of the research that shows its effectiveness,” and, “My worry is [teens] will turn away from other mental health interventions, saying, ‘Oh well, I already tried this and it didn’t work.’”
Your cultural background may affect the effectiveness of chatbot therapy as well. The data and the chatbots themselves need to be tested and safe. Aniket Bera, an associate professor of computer science at Purdue, says chatbot models are more likely to misunderstand him and miss cues because he grew up in India and the available training data is biased toward the experiences of white males.
Ultimately, chatbot therapy is a step towards precision healthcare, which requires personal data. That raises another concern: the information shared with an actual therapist is much harder to compromise and is protected by legislation in most places. Policy lags behind new technology, so the protections people are accustomed to from IRL therapy may still need to be put in place. Data breaches can be catastrophic for users who share sensitive data.
Scaling up AI for therapy means more emissions. I don’t think AI emissions will be a huge problem. They definitely could be, but three trends suggest to me the opposite.
The first trend is the push for climate-friendly computing. GPT-3’s training in 2020 would have produced around 4,000 kg of CO2 emissions if it had been trained in Canada on Azure servers powered by hydroelectricity, while on Azure’s South Africa West servers it would have produced over 200,000 kg. The emissions from an AI’s computations depend on how much power is used and where that power is sourced. In time, the carbon produced per calculation in AI training will go down as renewable energy sources power more servers.
The second trend is the falling cost of training ML models. Wright’s law predicts that AI-related processing units will continue to drop in price by 57% annually. The money saved can be directed to efforts to lower carbon output. Training can also be made more efficient through many means, like transfer learning or the distribution of pre-trained models, and more efficient software will be pursued, further lowering power usage and carbon output.
The third trend I see is predictive and interpretive analysis with AI to lower carbon emissions. AI is being used to monitor and predict emissions, analyze supply chain data, and more. Companies like Boston Consulting Group are using AI to create actionable insights for businesses that are both profitable and climate-friendly. AI therapists will cause some emissions, but AI could cut far more GHG emissions than it produces; I don’t think the magnitudes compare.
Chatbot therapists often provide positive or neutral outcomes for users. People report reduced stress and anxiety after using a chatbot for a session, and greater feelings of well-being after weeks of use. Some users say they would have chosen not to use a therapy service at all if it meant talking to a human; some go on to seek more help after a positive experience with a chatbot. Right now, chatbot therapy offers a slight-to-large improvement in people’s lives without the barriers of time restrictions, stigma, or the price of a professional.
The future is bright for AI in mental healthcare. We will see chatbot therapy improve significantly as NLP applications improve and as studies continue. I expect chatbot therapy to rival the standard therapist in effectiveness within the next 10 years while staying cheap and available. And AI in mental healthcare won’t be limited to chatbots; it will be a critical tool for delivering precision care to the common person (at least in developed areas).