VectorStore/QA, learn more
NOTE: this uses Cassandra's "Vector Search" capability. Make sure you are connecting to a vector-enabled database for this demo.
In the previous Quickstart, you created the index and added the corpus of text to it in a single step.
In most cases, these two operations happen at different times; moreover, new documents often keep being ingested after the index is created.
This notebook demonstrates further interactions you can have with a Cassandra Vector Store.
It is assumed you have run the "VectorStore/QA, Quickstart" notebook (so that the vector store is not empty).
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
The setup is similar to the one you saw:
from langchain.vectorstores.cassandra import Cassandra
A database connection is needed. (If on a Colab, the only supported option is the cloud service Astra DB.)
# Ensure loading of database credentials into environment variables:
import os
from dotenv import load_dotenv
load_dotenv("../../../.env")
import cassio
Select your choice of database by editing this cell, if needed:
database_mode = "cassandra"  # "cassandra" / "astra_db"
if database_mode == "astra_db":
    cassio.init(
        database_id=os.environ["ASTRA_DB_ID"],
        token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
        keyspace=os.environ.get("ASTRA_DB_KEYSPACE"),  # this is optional
    )
if database_mode == "cassandra":
    from cqlsession import getCassandraCQLSession, getCassandraCQLKeyspace
    cassio.init(
        session=getCassandraCQLSession(),
        keyspace=getCassandraCQLKeyspace(),
    )
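Since the connection relies on credentials read from environment variables, a quick check before calling `cassio.init` can give a clearer error than a `KeyError`. The following is a minimal sketch: the helper name `missing_env_vars` and the example values are made up for illustration, while the variable names are the ones used in the cell above.

```python
import os

# Variables the "astra_db" branch reads with os.environ[...] (ASTRA_DB_KEYSPACE is optional).
REQUIRED_ASTRA_VARS = ["ASTRA_DB_ID", "ASTRA_DB_APPLICATION_TOKEN"]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Hypothetical environment with only one of the two variables set.
example_env = {"ASTRA_DB_ID": "hypothetical-id"}
missing = missing_env_vars(REQUIRED_ASTRA_VARS, env=example_env)
if missing:
    print("Missing settings:", ", ".join(missing))
```

In the notebook itself you would call `missing_env_vars(REQUIRED_ASTRA_VARS)` right after `load_dotenv(...)`, so that the real environment is inspected.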
Below is the logic to instantiate the LLM and embeddings of choice. We chose to leave it in the notebooks for clarity.
import os
from llm_choice import suggestLLMProvider
llmProvider = suggestLLMProvider()
# (Alternatively set llmProvider to 'GCP_VertexAI', 'OpenAI', 'Azure_OpenAI' ... manually if you have credentials)
if llmProvider == 'GCP_VertexAI':
    from langchain.llms import VertexAI
    from langchain.embeddings import VertexAIEmbeddings
    llm = VertexAI()
    myEmbedding = VertexAIEmbeddings()
    print('LLM+embeddings from Vertex AI')
elif llmProvider == 'OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'open_ai'
    from langchain.llms import OpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = OpenAI(temperature=0)
    myEmbedding = OpenAIEmbeddings()
    print('LLM+embeddings from OpenAI')
elif llmProvider == 'Azure_OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'azure'
    os.environ['OPENAI_API_VERSION'] = os.environ['AZURE_OPENAI_API_VERSION']
    os.environ['OPENAI_API_BASE'] = os.environ['AZURE_OPENAI_API_BASE']
    os.environ['OPENAI_API_KEY'] = os.environ['AZURE_OPENAI_API_KEY']
    from langchain.llms import AzureOpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = AzureOpenAI(temperature=0, model_name=os.environ['AZURE_OPENAI_LLM_MODEL'],
                      engine=os.environ['AZURE_OPENAI_LLM_DEPLOYMENT'])
    myEmbedding = OpenAIEmbeddings(model=os.environ['AZURE_OPENAI_EMBEDDINGS_MODEL'],
                                   deployment=os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT'])
    print('LLM+embeddings from Azure OpenAI')
else:
    raise ValueError('Unknown LLM provider.')
LLM+embeddings from OpenAI
Re-use an existing Vector Store
Creating this Cassandra vector store re-connects it with the data already on the DB.
In practice, you are loading an existing, pre-populated vector store for further usage.
(Make sure you use the very same embedding function every time! In fact, this is why there is a separate table for each embedding function, i.e. for each llmProvider.)
myCassandraVStore = Cassandra(
    embedding=myEmbedding,
    session=None,
    keyspace=None,
    table_name='vs_test1_' + llmProvider,
)
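To see why each embedding function gets its own table, here is a small self-contained sketch. The two "embedding functions" below are toy stand-ins invented for illustration (they are not LangChain or provider APIs). They produce vectors of different dimensions, so a query embedded with one cannot be compared with rows stored by the other. Even when dimensions happen to match, vectors from different models live in unrelated spaces, so the scores would be meaningless.

```python
import math


def toy_embed_a(text):
    """Hypothetical 3-dimensional embedding: (length, vowel count, space count)."""
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [float(len(text)), float(vowels), float(text.count(" "))]


def toy_embed_b(text):
    """Hypothetical 2-dimensional embedding: (length, uppercase count)."""
    return [float(len(text)), float(sum(ch.isupper() for ch in text))]


def cosine_similarity(v1, v2):
    """Cosine similarity; refuses vectors of different dimension."""
    if len(v1) != len(v2):
        raise ValueError(f"dimension mismatch: {len(v1)} vs {len(v2)}")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def table_name_for(provider, prefix="vs_test1_"):
    """Same naming scheme as the cell above: one table per provider."""
    return prefix + provider


stored = toy_embed_a("Who is Luchesi?")
same_fn_score = cosine_similarity(stored, toy_embed_a("Who is Luchesi?"))

try:
    cosine_similarity(stored, toy_embed_b("Who is Luchesi?"))
    mixed_ok = True
except ValueError:
    mixed_ok = False

print(table_name_for("OpenAI"), same_fn_score, mixed_ok)
```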
You can then re-instantiate the index from the vector store with:
index = VectorStoreIndexWrapper(vectorstore=myCassandraVStore)
and use it as you saw in the "Vector Similarity Search QA Quickstart" (qa-basic.ipynb):
query = "Who is Luchesi?"
index.query(query, llm=llm)
' Luchesi is a connoisseur of wine who Fortunato believes can tell Amontillado from Sherry.'
Further usage of the vector store
These are some of the ways you can query the store:
myCassandraVStore.similarity_search_with_score(
    "Does anyone have a coughing fit?",
    k=1,
)
[(Document(page_content='"Nitre," I replied. "How long have you had that cough?"\n\n"Ugh! ugh! ugh!--ugh! ugh! ugh!--ugh! ugh! ugh!--ugh! ugh! ugh!--ugh!\nugh! ugh!"\n\nMy poor friend found it impossible to reply for many minutes.\n\n"It is nothing," he said, at last.', metadata={'source': 'texts/amontillado.txt'}), 0.9052705074079563)]
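Conceptually, a similarity search with score embeds the query, compares it with the stored vectors and returns the best `k` matches together with their scores. The sketch below shows the idea with a brute-force scan over a few made-up 3-dimensional vectors. It is only an illustration: the database uses a vector index rather than a full scan, and the scale of the scores it reports may differ from the plain cosine similarity used here.

```python
import math


def cosine_similarity(v1, v2):
    """Plain cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def brute_force_search_with_score(query_vector, rows, k=1):
    """Return the k (text, score) pairs with the highest similarity."""
    scored = [(text, cosine_similarity(query_vector, vector)) for text, vector in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up rows: (text label, toy embedding vector).
toy_rows = [
    ("cough passage", [0.9, 0.1, 0.0]),
    ("wine passage", [0.1, 0.9, 0.1]),
    ("crypt passage", [0.0, 0.2, 0.9]),
]

top_hits = brute_force_search_with_score([1.0, 0.0, 0.0], toy_rows, k=1)
print(top_hits)
```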
Adding new documents
Start with a very off-topic question, to demonstrate that no relevant documents are found (yet).
Note: depending on the embedding function, some results may still be returned at this stage, even though they are off-topic in practice. In a full end-to-end QA session, however, these would likely be discarded by the LLM, which would presumably end up saying, "I don't know".
SPIDER_QUESTION = 'Compare Agelenidae and Lycosidae'
myCassandraVStore.similarity_search_with_relevance_scores(
    SPIDER_QUESTION,
    k=1,
    score_threshold=0.8,
)
[(Document(page_content='"A huge human foot d\'or, in a field azure; the foot crushes a serpent\nrampant whose fangs are imbedded in the heel."\n\n"And the motto?"\n\n"_Nemo me impune lacessit_."\n\n"Good!" he said.', metadata={'source': 'texts/amontillado.txt'}), 0.8635507717421345)]
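The result above illustrates that a score threshold is only a coarse filter: the off-topic excerpt scores about 0.864 and therefore passes a threshold of 0.8. The sketch below reproduces that filtering step on its own. The function name is made up, and the order of operations (take the top `k`, then drop hits below the threshold) is an assumption made for the illustration.

```python
def filter_by_threshold(scored_hits, k, score_threshold):
    """Keep the k best (label, score) hits, then drop those below the threshold."""
    ranked = sorted(scored_hits, key=lambda hit: hit[1], reverse=True)[:k]
    return [hit for hit in ranked if hit[1] >= score_threshold]


# Score taken from the output above; the label is shortened for readability.
hits = [("amontillado excerpt", 0.8635507717421345)]

kept_at_080 = filter_by_threshold(hits, k=1, score_threshold=0.8)
kept_at_090 = filter_by_threshold(hits, k=1, score_threshold=0.9)
print(len(kept_at_080), len(kept_at_090))
```

A suitable threshold depends on the embedding function, so it is worth checking the scores of a few known off-topic queries before settling on a value.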
You can add a couple of relevant paragraphs to the index, using the add_texts primitive:
spiderFacts = [
"""
The Agelenidae are a large family of spiders in the suborder Araneomorphae.
The body length of the smallest Agelenidae spiders are about 4 mm (0.16 in), excluding the legs,
while the larger species grow to 20 mm (0.79 in) long. Some exceptionally large species,
such as Eratigena atrica, may reach 5 to 10 cm (2.0 to 3.9 in) in total leg span.
Agelenids have eight eyes in two horizontal rows of four. Their cephalothoraces narrow
somewhat towards the front where the eyes are. Their abdomens are more or less oval, usually
patterned with two rows of lines and spots. Some species have longitudinal lines on the dorsal
surface of the cephalothorax, whereas other species do not; for example, the hobo spider does not,
which assists in informally distinguishing it from similar-looking species.
""",
"""
Jumping spiders are a group of spiders that constitute the family Salticidae.
As of 2019, this family contained over 600 described genera and over 6,000 described species,
making it the largest family of spiders at 13% of all species.
Jumping spiders have some of the best vision among arthropods and use it
in courtship, hunting, and navigation.
Although they normally move unobtrusively and fairly slowly,
most species are capable of very agile jumps, notably when hunting,
but sometimes in response to sudden threats or crossing long gaps.
Both their book lungs and tracheal system are well-developed,
and they use both systems (bimodal breathing).
Jumping spiders are generally recognized by their eye pattern.
All jumping spiders have four pairs of eyes, with the anterior median pair
being particularly large.
""",
]
spiderMetadatas = [
{'source': 'wikipedia/Agelenidae'},
{'source': 'wikipedia/Salticidae'},
]
if llmProvider != 'Azure_OpenAI':
    ids = myCassandraVStore.add_texts(
        spiderFacts,
        spiderMetadatas,
    )
    print('\n'.join(ids))
else:
    # Note: this is a temporary mitigation of an Azure OpenAI error when asking for
    # multiple embeddings in a single request, which would error with:
    # "InvalidRequestError: Too many inputs. The max number of inputs is 1"
    for spFact, spMetadata in zip(spiderFacts, spiderMetadatas):
        thisId = myCassandraVStore.add_texts(
            [spFact],
            [spMetadata],
        )[0]
        print(thisId)
8591e0649933477ba4420aec3b8d5da2
4956716e443d432a96d9706b5cf1ffe2
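The one-text-at-a-time workaround in the `else` branch is a special case of batching with a maximum batch size. A more general version might look like the sketch below, where `add_texts_in_batches` and `fake_add` are hypothetical names: `fake_add` stands in for a store's insertion function so that the example runs without a database.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def add_texts_in_batches(add_fn, texts, metadatas, batch_size):
    """Call add_fn(texts_batch, metadatas_batch) per batch and collect the returned ids."""
    all_ids = []
    for text_batch, meta_batch in zip(batched(texts, batch_size), batched(metadatas, batch_size)):
        all_ids.extend(add_fn(text_batch, meta_batch))
    return all_ids


# Stand-in for the store: records the size of each call and returns fake ids.
calls = []


def fake_add(texts, metadatas):
    calls.append(len(texts))
    return [f"id-{len(calls)}-{i}" for i in range(len(texts))]


ids = add_texts_in_batches(fake_add, ["a", "b", "c"], [{}, {}, {}], batch_size=1)
print(calls, ids)
```

With the real store, `add_fn` would be `myCassandraVStore.add_texts`, and `batch_size=1` reproduces the Azure OpenAI mitigation shown above.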
Another way is to add a text through LangChain's Document abstraction.
Note that one of LangChain's splitters can turn long input documents into (possibly overlapping) digestible chunks without much boilerplate:
mySplitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=120)
lycoText = """
Wolf spiders are members of the family Lycosidae.
They are robust and agile hunters with excellent eyesight.
They live mostly in solitude, hunt alone, and usually do not spin webs.
Some are opportunistic hunters, pouncing upon prey as they
find it or chasing it over short distances;
others wait for passing prey in or near the mouth of a burrow.
Wolf spiders resemble nursery web spiders (family Pisauridae),
but wolf spiders carry their egg sacs by attaching them to their spinnerets,
while the Pisauridae carry their egg sacs with their chelicerae and pedipalps.
Two of the wolf spider's eight eyes are large and prominent;
this distinguishes them from nursery web spiders,
whose eyes are all of roughly equal size.
This can also help distinguish them from the similar-looking grass spiders.
"""
lycoDocument = Document(
    page_content=lycoText,
    metadata={'source': 'wikipedia/Lycosidae'}
)
Use the splitter to "shred" the input document:
lycoDocs = mySplitter.transform_documents([lycoDocument])
lycoDocs
[Document(page_content='Wolf spiders are members of the family Lycosidae.\nThey are robust and agile hunters with excellent eyesight.\nThey live mostly in solitude, hunt alone, and usually do not spin webs.\nSome are opportunistic hunters, pouncing upon prey as they', metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content='Some are opportunistic hunters, pouncing upon prey as they\nfind it or chasing it over short distances;\nothers wait for passing prey in or near the mouth of a burrow.', metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content='Wolf spiders resemble nursery web spiders (family Pisauridae),\nbut wolf spiders carry their egg sacs by attaching them to their spinnerets,\nwhile the Pisauridae carry their egg sacs with their chelicerae and pedipalps.', metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content="while the Pisauridae carry their egg sacs with their chelicerae and pedipalps.\nTwo of the wolf spider's eight eyes are large and prominent;\nthis distinguishes them from nursery web spiders,\nwhose eyes are all of roughly equal size.", metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content='this distinguishes them from nursery web spiders,\nwhose eyes are all of roughly equal size.\nThis can also help distinguish them from the similar-looking grass spiders.', metadata={'source': 'wikipedia/Lycosidae'})]
These are ready to be added to the index:
if llmProvider != 'Azure_OpenAI':
    ids = myCassandraVStore.add_documents(lycoDocs)
    print('\n'.join(ids))
else:
    # Note: this is a temporary mitigation of an Azure OpenAI error when asking for
    # multiple embeddings in a single request, which would error with:
    # "InvalidRequestError: Too many inputs. The max number of inputs is 1"
    for lycoDoc in lycoDocs:
        thisId = myCassandraVStore.add_documents([lycoDoc])[0]
        print(thisId)
b7e6eb16550d4d28b2b2fe9119bf5712
f274790734504c0dbd6becfe9f820ecb
c1c8983c557a4b0e9ee103275a289008
ace6957d368a4e1683f66f7bca4a7689
3d1de4166b454b0c91ceeb41d992d779
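The role of `chunk_size` and `chunk_overlap` in the splitter used above can be illustrated with a much simpler fixed-window chunker. This is a sketch for intuition only and not the algorithm of `RecursiveCharacterTextSplitter`, which prefers to break on separators such as newlines (this is why the chunks shown earlier end at line boundaries). The function name is made up.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghij" * 5  # 50 characters
chunks = sliding_window_chunks(sample, chunk_size=20, chunk_overlap=8)
for chunk in chunks:
    print(repr(chunk))
```

The overlap means a sentence falling on a chunk boundary is still likely to appear whole in at least one chunk, at the cost of storing some text twice.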
Querying the store again
Time to repeat the question:
myCassandraVStore.similarity_search_with_relevance_scores(
    SPIDER_QUESTION,
    k=3,
    score_threshold=0.8,
)
[(Document(page_content='\n The Agelenidae are a large family of spiders in the suborder Araneomorphae.\n The body length of the smallest Agelenidae spiders are about 4 mm (0.16 in), excluding the legs,\n while the larger species grow to 20 mm (0.79 in) long. Some exceptionally large species,\n such as Eratigena atrica, may reach 5 to 10 cm (2.0 to 3.9 in) in total leg span.\n Agelenids have eight eyes in two horizontal rows of four. Their cephalothoraces narrow\n somewhat towards the front where the eyes are. Their abdomens are more or less oval, usually\n patterned with two rows of lines and spots. Some species have longitudinal lines on the dorsal\n surface of the cephalothorax, whereas other species do not; for example, the hobo spider does not,\n which assists in informally distinguishing it from similar-looking species.\n ', metadata={'source': 'wikipedia/Agelenidae'}), 0.9029694961210916), (Document(page_content="while the Pisauridae carry their egg sacs with their chelicerae and pedipalps.\nTwo of the wolf spider's eight eyes are large and prominent;\nthis distinguishes them from nursery web spiders,\nwhose eyes are all of roughly equal size.", metadata={'source': 'wikipedia/Lycosidae'}), 0.9007341011796064), (Document(page_content='Wolf spiders resemble nursery web spiders (family Pisauridae),\nbut wolf spiders carry their egg sacs by attaching them to their spinnerets,\nwhile the Pisauridae carry their egg sacs with their chelicerae and pedipalps.', metadata={'source': 'wikipedia/Lycosidae'}), 0.893099616943348)]
Item removal and expiration
Time-To-Live (TTL)
If you provide a TTL value when creating the store, every entry will expire a fixed time after its insertion:
myCassandraVStoreWithTTL = Cassandra(
    embedding=myEmbedding,
    session=None,
    keyspace=None,
    table_name='vs_test1_shortlived_' + llmProvider,
    ttl_seconds=120,
)
The following two documents will be available for two minutes.
if llmProvider != 'Azure_OpenAI':
    ids = myCassandraVStoreWithTTL.add_documents(lycoDocs[0:2])
    print('\n'.join(ids))
else:
    # Note: this is a temporary mitigation of an Azure OpenAI error when asking for
    # multiple embeddings in a single request, which would error with:
    # "InvalidRequestError: Too many inputs. The max number of inputs is 1"
    for lycoDoc in lycoDocs[0:2]:
        thisId = myCassandraVStoreWithTTL.add_documents([lycoDoc])[0]
        print(thisId)
1e190faf6561489190d242cb96106c92
7c74e98579dc4fc495de3d8ab46c81a3
Alternatively, for finer control over the time-to-live, you can specify it at insertion time, in which case it takes precedence over the store-level setting. So, these documents will survive for twenty seconds:
if llmProvider != 'Azure_OpenAI':
    ids = myCassandraVStore.add_documents(lycoDocs[2:], ttl_seconds=20)
    print('\n'.join(ids))
else:
    # Note: this is a temporary mitigation of an Azure OpenAI error when asking for
    # multiple embeddings in a single request, which would error with:
    # "InvalidRequestError: Too many inputs. The max number of inputs is 1"
    for lycoDoc in lycoDocs[2:]:
        thisId = myCassandraVStore.add_documents([lycoDoc], ttl_seconds=20)[0]
        print(thisId)
ff69c70273644e1fbd3f1f2257ea49cc
15b764a073274ba5981364aa6a966576
1483833ffd914486b4fffbc0c01e998f
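The interplay between a store-level TTL and an insertion-level TTL can be summarized with a toy in-memory model. The class below is invented for illustration and is not how the database implements expiration. It takes the current time as an explicit argument so that the behavior is easy to follow, and it puts both kinds of TTL in a single store to show the precedence rule.

```python
class ToyTTLStore:
    """In-memory stand-in for a store with a default TTL and per-insert overrides."""

    def __init__(self, ttl_seconds=None):
        self.default_ttl = ttl_seconds
        self._rows = {}  # doc_id -> (text, expiry time or None)

    def add(self, doc_id, text, now, ttl_seconds=None):
        # An insertion-level TTL, when given, wins over the store-level default.
        effective = ttl_seconds if ttl_seconds is not None else self.default_ttl
        expires_at = None if effective is None else now + effective
        self._rows[doc_id] = (text, expires_at)

    def live_ids(self, now):
        """Ids of the rows that have not expired at time `now`."""
        return sorted(
            doc_id
            for doc_id, (_, expires_at) in self._rows.items()
            if expires_at is None or now < expires_at
        )


store = ToyTTLStore(ttl_seconds=120)
store.add("a", "uses the store-level TTL", now=0)
store.add("b", "uses its own TTL", now=0, ttl_seconds=20)

ids_at_10 = store.live_ids(now=10)
ids_at_60 = store.live_ids(now=60)
ids_at_200 = store.live_ids(now=200)
print(ids_at_10, ids_at_60, ids_at_200)
```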
Manual removal of entries
You can delete individual documents from the store.
However, you first need to retrieve their identifier with a similarity search. The following method returns a list of matching 3-tuples, whose last item is the id of the document:
spiderDocIds = []
for doc, score, docId in myCassandraVStore.similarity_search_with_score_id('Compare Agelenidae and Lycosidae'):
    print(f' * [{score:.3f}] "{doc.page_content[:32].strip()}..." ({docId})')
    spiderDocIds.append(docId)
 * [0.903] "The Agelenidae are a large..." (8591e0649933477ba4420aec3b8d5da2)
 * [0.901] "while the Pisauridae carry their..." (15b764a073274ba5981364aa6a966576)
 * [0.901] "while the Pisauridae carry their..." (ace6957d368a4e1683f66f7bca4a7689)
 * [0.893] "Wolf spiders resemble nursery we..." (ff69c70273644e1fbd3f1f2257ea49cc)
At this point you can perform the actual document deletion:
for spiderDocId in spiderDocIds:
    myCassandraVStore.delete_by_document_id(spiderDocId)
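The search-then-delete pattern just shown can be tried out without a database using the toy store below. Everything in it is hypothetical (class name, method names, vectors, ids). It only mirrors the shape of the interaction: a search returns 3-tuples ending with the id, and the ids are then used for deletion.

```python
class ToyVectorStore:
    """In-memory stand-in; not the LangChain Cassandra class."""

    def __init__(self):
        self._rows = {}  # doc_id -> (text, vector)

    def add(self, doc_id, text, vector):
        self._rows[doc_id] = (text, vector)

    def search_with_score_id(self, query_vector, k=4):
        """Return up to k (text, score, doc_id) tuples, best score first.
        The score here is a plain dot product, chosen for brevity."""
        hits = [
            (text, sum(q * v for q, v in zip(query_vector, vector)), doc_id)
            for doc_id, (text, vector) in self._rows.items()
        ]
        hits.sort(key=lambda hit: hit[1], reverse=True)
        return hits[:k]

    def delete_by_id(self, doc_id):
        """Remove one row; return True if it existed."""
        return self._rows.pop(doc_id, None) is not None

    def __len__(self):
        return len(self._rows)


store = ToyVectorStore()
store.add("id-spider-1", "Agelenidae facts", [1.0, 0.0])
store.add("id-spider-2", "Lycosidae facts", [0.9, 0.1])
store.add("id-poe", "Amontillado excerpt", [0.0, 1.0])

# Collect the ids of the two best matches, then delete them one by one.
ids_to_delete = [doc_id for _, _, doc_id in store.search_with_score_id([1.0, 0.0], k=2)]
for doc_id in ids_to_delete:
    store.delete_by_id(doc_id)
remaining = len(store)
print(ids_to_delete, remaining)
```

Note that a similarity search returns the nearest entries whether or not they are truly on-topic, so in practice it is worth inspecting the hits before deleting them.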
The last method to remove entries from the store is demonstrated next.
Cleanup
You're done.
In order to leave the index empty for the next demo run, you may want to clean the index (i.e. empty the table on DB).
Just don't take this operation lightly in production!
myCassandraVStore.clear()