Semantic LLM caching¶
Save on tokens and latency with an LLM response cache based on semantic similarity (as opposed to exact match), powered by Vector Search.
NOTE: this uses Cassandra's "Vector Search" capability. Make sure you are connecting to a vector-enabled database for this demo.
The Cassandra-backed "semantic cache" for prompt responses is imported like this:
from langchain.cache import CassandraSemanticCache
A database connection is needed. (If on a Colab, the only supported option is the cloud service Astra DB.)
# Ensure loading of database credentials into environment variables:
import os
from dotenv import load_dotenv
load_dotenv("../../../.env")
import cassio
Select your choice of database by editing this cell, if needed:
database_mode = "cassandra" # "cassandra" / "astra_db"
if database_mode == "astra_db":
    cassio.init(
        database_id=os.environ["ASTRA_DB_ID"],
        token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
        keyspace=os.environ.get("ASTRA_DB_KEYSPACE"),  # this is optional
    )
if database_mode == "cassandra":
    from cqlsession import getCassandraCQLSession, getCassandraCQLKeyspace
    cassio.init(
        session=getCassandraCQLSession(),
        keyspace=getCassandraCQLKeyspace(),
    )
An embedding function and an LLM are needed.
Below is the logic to instantiate the LLM and embeddings of choice. We chose to leave it in the notebooks for clarity.
import os
from llm_choice import suggestLLMProvider
llmProvider = suggestLLMProvider()
# (Alternatively set llmProvider to 'GCP_VertexAI', 'OpenAI', 'Azure_OpenAI' ... manually if you have credentials)
if llmProvider == 'GCP_VertexAI':
    from langchain.llms import VertexAI
    from langchain.embeddings import VertexAIEmbeddings
    llm = VertexAI()
    myEmbedding = VertexAIEmbeddings()
    print('LLM+embeddings from Vertex AI')
elif llmProvider == 'OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'open_ai'
    from langchain.llms import OpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = OpenAI(temperature=0)
    myEmbedding = OpenAIEmbeddings()
    print('LLM+embeddings from OpenAI')
elif llmProvider == 'Azure_OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'azure'
    os.environ['OPENAI_API_VERSION'] = os.environ['AZURE_OPENAI_API_VERSION']
    os.environ['OPENAI_API_BASE'] = os.environ['AZURE_OPENAI_API_BASE']
    os.environ['OPENAI_API_KEY'] = os.environ['AZURE_OPENAI_API_KEY']
    from langchain.llms import AzureOpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = AzureOpenAI(temperature=0, model_name=os.environ['AZURE_OPENAI_LLM_MODEL'],
                      engine=os.environ['AZURE_OPENAI_LLM_DEPLOYMENT'])
    myEmbedding = OpenAIEmbeddings(model=os.environ['AZURE_OPENAI_EMBEDDINGS_MODEL'],
                                   deployment=os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT'])
    print('LLM+embeddings from Azure OpenAI')
else:
    raise ValueError('Unknown LLM provider.')
LLM+embeddings from OpenAI
Create the cache¶
At this point you can instantiate the semantic cache.
Note: in the following, the way the table_name parameter is constructed makes it clear that different embeddings require separate tables. This is done here to avoid mismatches when running this demo over and over with varying embedding functions: in most applications, where a single choice of embedding is made, there's no need to be this finicky and you can usually leave the table name at its default value.
cassSemanticCache = CassandraSemanticCache(
    session=None,
    keyspace=None,
    embedding=myEmbedding,
    table_name=f'semantic_cache_{llmProvider}',
)
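As mentioned in the note above, if you settle on a single embedding model you can usually skip the explicit table name altogether. Here is a minimal sketch of that (the resulting cache object is not used in the rest of this demo):
# Sketch (not used in the rest of this demo): with a single, fixed embedding
# model you can usually omit table_name and let CassandraSemanticCache
# fall back to its default table name.
defaultNameCache = CassandraSemanticCache(
    session=None,
    keyspace=None,
    embedding=myEmbedding,
)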
Make sure the cache starts empty with:
cassSemanticCache.clear()
Configure the cache globally, at the LangChain level:
import langchain
langchain.llm_cache = cassSemanticCache
Use the cache¶
Now try submitting a few prompts to the LLM and pay attention to the response times.
If the LLM is actually run, the response times should be on the order of a few seconds; in case of a cache hit, they will be well under a second.
Notice that you get a cache hit even after rephrasing the question.
%%time
SPIDER_QUESTION_FORM_1 = "How many eyes do spiders have?"
# A new question should take long
llm(SPIDER_QUESTION_FORM_1)
CPU times: user 23.9 ms, sys: 0 ns, total: 23.9 ms Wall time: 1 s
'\n\nMost spiders have eight eyes, although some have fewer or more.'
%%time
# Second time, very same question, this should be quick
llm(SPIDER_QUESTION_FORM_1)
CPU times: user 3.44 ms, sys: 0 ns, total: 3.44 ms Wall time: 4.97 ms
'\n\nMost spiders have eight eyes, although some have fewer or more.'
%%time
SPIDER_QUESTION_FORM_2 = "How many eyes does a spider generally have?"
# Just a rephrasing: but it's the same question,
# so it will just take the time to evaluate embeddings
llm(SPIDER_QUESTION_FORM_2)
CPU times: user 6.86 ms, sys: 3.62 ms, total: 10.5 ms Wall time: 259 ms
'\n\nMost spiders have eight eyes, although some have fewer or more.'
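As an aside, you do not have to rely on timings to detect a hit: you can probe the cache explicitly. Here is a minimal sketch, where the question string is just one more illustrative rephrasing:
# Sketch: probe the semantic cache directly, instead of inferring a hit
# from wall-clock times. The question below is yet another (illustrative)
# rephrasing of the spider question cached above.
probe = cassSemanticCache.lookup_with_id_through_llm(
    "What is the usual number of eyes of a spider?",
    llm,
)
print("Semantic cache hit." if probe is not None else "No semantic match.")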
Time for a really new question:
%%time
LOGIC_QUESTION_FORM_1 = "Is absence of proof the same as proof of absence?"
# A totally new question
llm(LOGIC_QUESTION_FORM_1)
CPU times: user 28.1 ms, sys: 1.39 ms, total: 29.5 ms Wall time: 5.56 s
'\n\nNo, absence of proof is not the same as proof of absence. Absence of proof means that there is no evidence to support a claim, while proof of absence means that there is evidence to support the claim that something does not exist.'
Going back to the same question as earlier (not literally, though):
%%time
SPIDER_QUESTION_FORM_3 = "How many are the eyes on a spider usually?"
# Trying to catch the cache off-guard :)
llm(SPIDER_QUESTION_FORM_3)
CPU times: user 15.1 ms, sys: 1.01 ms, total: 16.1 ms Wall time: 246 ms
'\n\nMost spiders have eight eyes, although some have fewer or more.'
And again to the logic riddle:
%%time
LOGIC_QUESTION_FORM_2 = "Is it true that the absence of a proof equates the proof of an absence?"
# Switching to the other question again
llm(LOGIC_QUESTION_FORM_2)
CPU times: user 15.7 ms, sys: 2.66 ms, total: 18.4 ms Wall time: 257 ms
'\n\nNo, absence of proof is not the same as proof of absence. Absence of proof means that there is no evidence to support a claim, while proof of absence means that there is evidence to support the claim that something does not exist.'
Additional options¶
When creating the semantic cache, you can specify a few other options, such as the metric used to calculate the similarity (and, accordingly, the threshold for accepting a "cache hit").
Here is an example which uses the L2 (Euclidean) metric:
anotherCassSemanticCache = CassandraSemanticCache(
    session=None,
    keyspace=None,
    embedding=myEmbedding,
    table_name=f'semantic_cache_{llmProvider}',
    distance_metric='l2',
    score_threshold=0.4,
)
This cache builds on the same database table as the previous one, as can be seen e.g. with:
lookup = anotherCassSemanticCache.lookup_with_id_through_llm(
    LOGIC_QUESTION_FORM_2,
    llm,
)
if lookup:
    cache_entry_id, response = lookup
    print(f"cache_entry_id = {cache_entry_id}")
    # `response` is a List[langchain.schema.output.Generation], so:
    print(f"\n{response[0].text.strip()}")
else:
    print('No match.')
cache_entry_id = 77add13036bcaa23c74ebf2ab2c56441-0e4d63bf605cd5f4329128fcbe38762d
No, absence of proof is not the same as proof of absence. Absence of proof means that there is no evidence to support a claim, while proof of absence means that there is evidence to support the claim that something does not exist.
Caching and Chat Models¶
The CassandraSemanticCache supports caching within chat-oriented LangChain abstractions such as ChatOpenAI as well:
(Warning: the following is demonstrated with OpenAI only for the time being.)
from langchain.chat_models import ChatOpenAI
chat_llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)
%%time
print(chat_llm.predict("Can supernovae result in a black hole?"))
Yes, supernovae can result in the formation of a black hole. A supernova occurs when a massive star reaches the end of its life and undergoes a catastrophic explosion. The explosion expels most of the star's material into space, while the core collapses under its own gravity. If the core of the star is massive enough, typically more than three times the mass of the Sun, it will collapse further and form a black hole. This collapse is so intense that it creates a region of space with an extremely strong gravitational pull, from which nothing, not even light, can escape. This region is known as a black hole.
CPU times: user 25.9 ms, sys: 3.05 ms, total: 29 ms Wall time: 7.96 s
%%time
# Expect a much faster response:
print(chat_llm.predict("Is it possible that black holes come from big exploding stars?"))
Yes, supernovae can result in the formation of a black hole. A supernova occurs when a massive star reaches the end of its life and undergoes a catastrophic explosion. The explosion expels most of the star's material into space, while the core collapses under its own gravity. If the core of the star is massive enough, typically more than three times the mass of the Sun, it will collapse further and form a black hole. This collapse is so intense that it creates a region of space with an extremely strong gravitational pull, from which nothing, not even light, can escape. This region is known as a black hole.
CPU times: user 15.4 ms, sys: 694 µs, total: 16.1 ms Wall time: 253 ms
(Actually, every object that inherits from the LangChain Generation class can be seamlessly stored and retrieved in this cache.)
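To make this concrete, here is a minimal sketch that stores and retrieves Generation objects by hand through the generic LangChain cache interface (update / lookup). The prompt and the "llm string" identifier below are purely illustrative; in normal usage LangChain handles both behind the scenes:
from langchain.schema import Generation

# Sketch: manually store, then retrieve, a Generation through the generic
# cache interface. Both the prompt and the llm-string identifier are
# illustrative placeholders, not values LangChain would produce itself.
illustrative_prompt = "What is the capital of France?"
illustrative_llm_string = "illustrative-llm-identifier"

cassSemanticCache.update(
    illustrative_prompt,
    illustrative_llm_string,
    [Generation(text="The capital of France is Paris.")],
)

hit = cassSemanticCache.lookup(illustrative_prompt, illustrative_llm_string)
if hit:
    print(hit[0].text)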
Stale entry control¶
Time-To-Live (TTL)¶
You can configure a time-to-live property of the cache, with the effect of automatic eviction of cached entries after a certain time.
Setting langchain.llm_cache to the following will cause cached entries to expire (and vanish) after one hour:
cacheWithTTL = CassandraSemanticCache(
    session=None,
    keyspace=None,
    embedding=myEmbedding,
    table_name=f'semantic_cache_{llmProvider}',
    ttl_seconds=3600,
)
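To actually activate it, you would assign it to langchain.llm_cache just as before. The assignment is shown here as a sketch and left commented out, so that the earlier cache remains in effect for the rest of this demo:
# Sketch: make the TTL-equipped cache the active one. Commented out so that
# cassSemanticCache stays in effect for the remainder of this demo.
# langchain.llm_cache = cacheWithTTL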
Manual cache eviction¶
Alternatively, you can invalidate individual entries one at a time, just as you saw for the exact-match CassandraCache.
Since this cache is indexed by sentence similarity, the procedure now has two steps: first, a lookup to find the ID of the matching document:
lookup = cassSemanticCache.lookup_with_id_through_llm(SPIDER_QUESTION_FORM_1, llm)
if lookup:
    cache_entry_id, response = lookup
    print(cache_entry_id)
else:
    print('No match.')
0a1339bc659790da078a4352c05bf422-0e4d63bf605cd5f4329128fcbe38762d
You can see that querying with another form of the "same" question results in the same ID:
lookup2 = cassSemanticCache.lookup_with_id_through_llm(SPIDER_QUESTION_FORM_2, llm)
if lookup2:
    cache_entry_id2, response2 = lookup2
    print(cache_entry_id2)
else:
    print('No match.')
0a1339bc659790da078a4352c05bf422-0e4d63bf605cd5f4329128fcbe38762d
And second, the document ID is used in the actual cache eviction:
cassSemanticCache.delete_by_document_id(cache_entry_id)
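If you want a programmatic confirmation, here is a quick sketch that repeats the semantic lookup right after the eviction:
# Sketch: after the eviction, the same semantic lookup should normally
# come up empty (assuming no other similar entries are cached).
post_eviction = cassSemanticCache.lookup_with_id_through_llm(
    SPIDER_QUESTION_FORM_1,
    llm,
)
print("Still cached." if post_eviction is not None else "Entry evicted.")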
As a check, try asking that question again and note the cell execution time (you can also try re-running the above lookup cell...):
%%time
llm(SPIDER_QUESTION_FORM_1)
CPU times: user 11.2 ms, sys: 947 µs, total: 12.2 ms Wall time: 704 ms
'\n\nMost spiders have eight eyes, although some have fewer or more.'
Whole-cache deletion¶
Lastly, as you have seen earlier, you can empty the cache entirely with:
cassSemanticCache.clear()