Last Monday, several ChatGPT users were surprised to see other people's queries in their chat histories. OpenAI later disclosed that it had to take ChatGPT offline temporarily because an unpatched bug (or vulnerability? more on that below) in an open source component had leaked some subscribers' payment-related information along with users' chat queries. The library in question is redis-py, a Python client for Redis.
Both the root cause and the impact of the incident have been analyzed in OpenAI’s postmortem.
Race condition vulnerability in redis-py
The bug, tracked as sonatype-2023-1621 (and later assigned CVE-2023-28858 and CVE-2023-28859), is a race condition in redis-py, an open source Redis client available in the PyPI repository. Because the 'bug' poses a security risk, impacting a system's confidentiality and resource availability and potentially opening doors for exploitation, it effectively becomes a vulnerability (see our explainer on bug vs. vulnerability).
The vulnerability itself is quite straightforward, but it concerns a scenario that would normally occur extremely rarely. On Monday, OpenAI inadvertently introduced a server-side change that caused a spike in Redis request cancellations, dramatically raising the odds of the race condition being triggered. That is why multiple users, rather than the odd person here and there, had their chat queries leak into other users' chat histories. For about 1.2% of ChatGPT Plus subscribers, their name, email address, payment address, and partial credit card data (last four digits and expiration date) were also exposed.
Redis is a popular in-memory data structure store, often used for distributed caching and as a large-scale NoSQL database.
Specifically, OpenAI states that it uses Redis to cache user information across its servers so it doesn’t need to query its database for every request. OpenAI further uses Redis Cluster to fairly distribute load across multiple Redis instances.
“We use the redis-py library to interface with Redis from our Python server, which runs with asyncio,” states OpenAI. “The library maintains a shared pool of connections between the server and the cluster, and recycles a connection to be used for another request once done.”
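To illustrate the pattern the postmortem describes, here is a minimal sketch (not OpenAI's code) of an asyncio application sharing a single redis-py connection pool; the host, key names, and helper functions are illustrative assumptions.

```python
# Minimal sketch of asyncio + redis-py with a shared connection pool.
# Not OpenAI's code: the host, key names, and helpers are illustrative.
import asyncio

import redis.asyncio as redis

# One pool shared by the whole server: each command borrows a connection
# and returns it to the pool once the response has been read.
pool = redis.ConnectionPool.from_url("redis://localhost:6379/0")


async def cache_user_profile(user_id: str, profile_json: str) -> None:
    client = redis.Redis(connection_pool=pool)
    # Cache the profile so later requests can skip the primary database.
    await client.set(f"user:{user_id}", profile_json, ex=300)


async def get_user_profile(user_id: str) -> bytes | None:
    client = redis.Redis(connection_pool=pool)
    return await client.get(f"user:{user_id}")


async def main() -> None:
    await cache_user_profile("42", '{"name": "example"}')
    print(await get_user_profile("42"))


asyncio.run(main())
```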
The redis-py library uses asyncio to implement its async client and cluster classes. But due to insufficient error handling in redis-py for conditions that, while extremely rare, do occur in large-scale, context-dependent applications like ChatGPT, unintended consequences can follow:
“When using asyncio, requests and responses with redis-py behave as two queues: the caller pushes a request onto the incoming queue, and will pop a response from the outgoing queue, and then return the connection to the pool,” explains the postmortem.
“If a request is canceled after the request is pushed onto the incoming queue, but before the response popped from the outgoing queue, we see our bug: the connection thus becomes corrupted and the next response that’s dequeued for an unrelated request can receive data left behind in the connection.”
In most instances this would trigger a server error, prompting the user to retry the request. But in some cases, as with ChatGPT, the corrupted data returned from the cache would appear valid, resulting in an unintended information disclosure.
In other words, if an async Redis command is canceled after it has been sent but before its response has been received and parsed, the connection is left in an "unsafe" state for future commands: the response to the canceled command can later be read, out of sequence, by an unrelated request, compromising the confidentiality of the data and potentially impacting the integrity and availability of resources.
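To make that sequence concrete, the sketch below, which assumes a vulnerable redis-py version (pre-4.5.3) and a local Redis server, cancels an in-flight command with asyncio.wait_for. On affected versions the connection may still hold the canceled request's unread response, which a later, unrelated command can then pick up. It is a contrived illustration rather than a reliable reproduction, since the outcome depends entirely on timing.

```python
# Contrived illustration of the failure mode on an affected redis-py
# version (pre-4.5.3); whether the stale data is actually returned
# depends on timing, which is why the bug surfaced so rarely.
import asyncio

import redis.asyncio as redis


async def demo() -> None:
    # single_connection_client pins one connection, making the effect easier to see.
    client = redis.Redis.from_url(
        "redis://localhost:6379/0", single_connection_client=True
    )
    await client.set("user:alice", "alice's cached data")
    await client.set("user:bob", "bob's cached data")

    try:
        # Cancel a command after it has been written to the socket but,
        # potentially, before its response has been read off the wire.
        await asyncio.wait_for(client.get("user:alice"), timeout=0.000001)
    except asyncio.TimeoutError:
        pass

    # On a vulnerable version, the connection may still hold Alice's unread
    # response, so this unrelated request can receive her data instead of
    # Bob's -- the cross-user mix-up described above.
    print(await client.get("user:bob"))

    await client.close()


asyncio.run(demo())
```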
Is the vulnerability fixed?
Although the redis-py maintainers released a fix in version 4.5.3, along with backports, some sharp-eyed testers were able to reproduce the flaw and deemed it incompletely fixed. As a result, a second identifier, CVE-2023-28859, has been assigned to track the flaw in the insufficiently fixed versions (e.g. 4.5.3 and 5.0.0b1).
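As a rough way to check an environment against the version ranges discussed in this post, a helper along these lines could be used. The ranges mirror the two CVEs as described here and should be re-verified against the official advisories; the third-party packaging dependency is an assumption.

```python
# Rough, conservative check of the installed redis-py version against the
# ranges discussed in this post; re-verify against the official advisories
# before relying on it. Requires the third-party "packaging" package.
from importlib.metadata import version

from packaging.version import Version

installed = Version(version("redis"))  # the PyPI distribution is named "redis"

# Versions that shipped an insufficient fix (CVE-2023-28859), per this post.
insufficiently_fixed = {Version("4.5.3"), Version("5.0.0b1")}

# Conservative: also flags older releases that may predate the async client.
if installed < Version("4.5.3") or installed in insufficiently_fixed:
    print(f"redis-py {installed} may be affected by CVE-2023-28858 / CVE-2023-28859")
else:
    print(f"redis-py {installed} is outside the ranges flagged here; check the latest advisory")
```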
The Sonatype security research team continues to monitor the development. As soon as OpenAI disclosed its postmortem of the incident on Friday, we flagged the vulnerable versions of redis-py and began an expedited Deep Dive research effort on the vulnerability. By Monday, our research for sonatype-2023-1621 had been updated to account for both CVEs in our security data. We continue to monitor for upcoming redis-py releases that should fully remediate the vulnerability. Customers should refer to their Nexus IQ and Nexus Lifecycle instances for up-to-date information.
What is a race condition?
In information systems, a race condition is an unintended scenario in which a program or system attempts to perform operations that should happen in a particular sequence, but, due to a flaw, those operations fall out of sequence during execution. One task finishing before another, or two tasks running concurrently over shared state, can corrupt both the operation and its data.
While race conditions may be rare and are not always directly exploitable, they can be notoriously hard to debug, since reproducing them in a test environment (as opposed to a live, large-scale production system) may not always be possible.
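As a simple, non-Redis example of the check-then-act pattern at fault, the sketch below forces two concurrent withdrawals to read the same account balance before either writes it back; the explicit yield point stands in for the network round trip during which real systems lose the race only occasionally.

```python
# A textbook check-then-act race: two tasks read the same balance, both pass
# the check, and both write back a result based on a stale value. The explicit
# yield (asyncio.sleep(0)) forces the bad interleaving for demonstration; in
# real systems the window is a network round trip and the bug fires only rarely.
import asyncio

balance = 100


async def withdraw(amount: int) -> None:
    global balance
    current = balance               # 1. read the shared state
    await asyncio.sleep(0)          # 2. yield: the other task runs here
    if current >= amount:           # 3. check a now-stale value
        balance = current - amount  # 4. write based on the stale read


async def main() -> None:
    # Both withdrawals observe a balance of 100, so both are allowed,
    # even though 60 + 70 exceeds the funds available.
    await asyncio.gather(withdraw(60), withdraw(70))
    print("final balance:", balance)  # 30: one deduction is silently lost


asyncio.run(main())
```

This is the same class of flaw that the banking writeup below walks through in the context of real transaction systems.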
A 2020 writeup by infosec researcher Vickie Li, “Hacking Banks with Race Conditions,” makes this less abstract by showing how the problem could play out in banking systems responsible for transactions.
In 2015, security researcher Egor Homakov exploited a race condition in Starbucks systems to “steal” free money for his gift card and drink “unlimited” coffee. Understandably, Starbucks did not seem too pleased about it.