{ "cells": [ { "cell_type": "markdown", "id": "6715bc2b", "metadata": {}, "source": [ "# Vector Similarity Search QA Quickstart\n", "\n", "Set up a simple Question-Answering system with LangChain and CassIO, using Cassandra / Astra DB as the Vector Database." ] }, { "cell_type": "markdown", "id": "761d9b70", "metadata": {}, "source": [ "_**NOTE:** this uses Cassandra's \"Vector Similarity Search\" capability.\n", "Make sure you are connecting to a vector-enabled database for this demo._" ] }, { "cell_type": "code", "execution_count": 1, "id": "042f832e", "metadata": {}, "outputs": [], "source": [ "from langchain.indexes import VectorstoreIndexCreator\n", "from langchain.text_splitter import (\n", " CharacterTextSplitter,\n", " RecursiveCharacterTextSplitter,\n", ")\n", "from langchain.docstore.document import Document\n", "from langchain.document_loaders import TextLoader" ] }, { "cell_type": "markdown", "id": "4388ac1d", "metadata": {}, "source": [ "The following line imports the Cassandra flavor of a LangChain vector store:" ] }, { "cell_type": "code", "execution_count": 2, "id": "d65c46f0", "metadata": {}, "outputs": [], "source": [ "from langchain.vectorstores.cassandra import Cassandra" ] }, { "cell_type": "markdown", "id": "4578a87b", "metadata": {}, "source": [ "A database connection is needed. _(If on a Colab, the only supported option is the cloud service Astra DB.)_" ] }, { "cell_type": "code", "execution_count": 3, "id": "9515d4af-9fc2-4c29-b9f2-5c90cd48cc1b", "metadata": {}, "outputs": [], "source": [ "# Ensure loading of database credentials into environment variables:\n", "import os\n", "from dotenv import load_dotenv\n", "load_dotenv(\"../../../.env\")\n", "\n", "import cassio" ] }, { "cell_type": "markdown", "id": "25ee44a4-8d6a-4b41-8a8f-7c93eed35034", "metadata": {}, "source": [ "Select your choice of database by editing this cell, if needed:" ] }, { "cell_type": "code", "execution_count": 4, "id": "bfe78b92-7194-4ee9-ba7f-cd308409212c", "metadata": {}, "outputs": [], "source": [ "database_mode = \"cassandra\" # \"cassandra\" / \"astra_db\"" ] }, { "cell_type": "code", "execution_count": 5, "id": "fb5597b3-a028-4d8f-ad6d-8eaf0c941dcc", "metadata": {}, "outputs": [], "source": [ "if database_mode == \"astra_db\":\n", " cassio.init(\n", " database_id=os.environ[\"ASTRA_DB_ID\"],\n", " token=os.environ[\"ASTRA_DB_APPLICATION_TOKEN\"],\n", " keyspace=os.environ.get(\"ASTRA_DB_KEYSPACE\"), # this is optional\n", " )" ] }, { "cell_type": "code", "execution_count": 6, "id": "329d969b-6532-4dc4-a315-1a5438a59919", "metadata": {}, "outputs": [], "source": [ "if database_mode == \"cassandra\":\n", " from cqlsession import getCassandraCQLSession, getCassandraCQLKeyspace\n", " cassio.init(\n", " session=getCassandraCQLSession(),\n", " keyspace=getCassandraCQLKeyspace(),\n", " )" ] }, { "cell_type": "markdown", "id": "32e2a156", "metadata": {}, "source": [ "Both an LLM and an embedding function are required.\n", "\n", "Below is the logic to instantiate the LLM and embeddings of choice. We chose to leave it in the notebooks for clarity." ] }, { "cell_type": "code", "execution_count": 7, "id": "124e3de4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LLM+embeddings from OpenAI\n" ] } ], "source": [ "import os\n", "from llm_choice import suggestLLMProvider\n", "\n", "llmProvider = suggestLLMProvider()\n", "# (Alternatively set llmProvider to 'GCP_VertexAI', 'OpenAI', 'Azure_OpenAI' ... 
  { "cell_type": "markdown", "id": "32e2a156", "metadata": {}, "source": [
   "Both an LLM and an embedding function are required.\n",
   "\n",
   "Below is the logic to instantiate the LLM and embeddings of your choice. We chose to leave this logic in the notebook for clarity."
  ] },
  { "cell_type": "code", "execution_count": 7, "id": "124e3de4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LLM+embeddings from OpenAI\n" ] } ], "source": [
   "import os\n",
   "from llm_choice import suggestLLMProvider\n",
   "\n",
   "llmProvider = suggestLLMProvider()\n",
   "# (Alternatively, set llmProvider manually to 'GCP_VertexAI', 'OpenAI' or 'Azure_OpenAI' if you have the credentials.)\n",
   "\n",
   "if llmProvider == 'GCP_VertexAI':\n",
   "    from langchain.llms import VertexAI\n",
   "    from langchain.embeddings import VertexAIEmbeddings\n",
   "    llm = VertexAI()\n",
   "    myEmbedding = VertexAIEmbeddings()\n",
   "    print('LLM+embeddings from Vertex AI')\n",
   "elif llmProvider == 'OpenAI':\n",
   "    os.environ['OPENAI_API_TYPE'] = 'open_ai'\n",
   "    from langchain.llms import OpenAI\n",
   "    from langchain.embeddings import OpenAIEmbeddings\n",
   "    llm = OpenAI(temperature=0)\n",
   "    myEmbedding = OpenAIEmbeddings()\n",
   "    print('LLM+embeddings from OpenAI')\n",
   "elif llmProvider == 'Azure_OpenAI':\n",
   "    os.environ['OPENAI_API_TYPE'] = 'azure'\n",
   "    os.environ['OPENAI_API_VERSION'] = os.environ['AZURE_OPENAI_API_VERSION']\n",
   "    os.environ['OPENAI_API_BASE'] = os.environ['AZURE_OPENAI_API_BASE']\n",
   "    os.environ['OPENAI_API_KEY'] = os.environ['AZURE_OPENAI_API_KEY']\n",
   "    from langchain.llms import AzureOpenAI\n",
   "    from langchain.embeddings import OpenAIEmbeddings\n",
   "    llm = AzureOpenAI(temperature=0, model_name=os.environ['AZURE_OPENAI_LLM_MODEL'],\n",
   "                      engine=os.environ['AZURE_OPENAI_LLM_DEPLOYMENT'])\n",
   "    myEmbedding = OpenAIEmbeddings(model=os.environ['AZURE_OPENAI_EMBEDDINGS_MODEL'],\n",
   "                                   deployment=os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT'])\n",
   "    print('LLM+embeddings from Azure OpenAI')\n",
   "else:\n",
   "    raise ValueError('Unknown LLM provider.')"
  ] },
  { "cell_type": "markdown", "id": "285f29cf", "metadata": {}, "source": [
   "## A minimal example"
  ] },
  { "cell_type": "markdown", "id": "5cf74a31", "metadata": {}, "source": [
   "The following is a minimal usage of the Cassandra vector store. The store is created and filled at once, and is then queried to retrieve relevant parts of the indexed text, which are then stuffed into a prompt that is finally used to answer a question."
  ] },
  { "cell_type": "markdown", "id": "6f29fc57", "metadata": {}, "source": [
   "The following creates an \"index creator\", which knows about the type of vector store, the embedding to use, and how to preprocess the input text:\n",
   "\n",
   "_(Note: stores built with different embedding functions will need different tables. This is why we append the `llmProvider` name to the table name in the next cell.)_"
  ] },
  { "cell_type": "code", "execution_count": 8, "id": "d2cfe71b", "metadata": {}, "outputs": [], "source": [
   "table_name = 'vs_test1_' + llmProvider\n",
   "\n",
   "index_creator = VectorstoreIndexCreator(\n",
   "    vectorstore_cls=Cassandra,\n",
   "    embedding=myEmbedding,\n",
   "    text_splitter=CharacterTextSplitter(\n",
   "        chunk_size=400,\n",
   "        chunk_overlap=0,\n",
   "    ),\n",
   "    vectorstore_kwargs={\n",
   "        'session': None,\n",
   "        'keyspace': None,\n",
   "        'table_name': table_name,\n",
   "    },\n",
   ")"
  ] },
  { "cell_type": "markdown", "id": "ea5ddfb6", "metadata": {}, "source": [
   "Load a local text file (a short story by E. A. Poe will do):"
  ] },
  { "cell_type": "code", "execution_count": 9, "id": "de8d65df", "metadata": {}, "outputs": [], "source": [
   "loader = TextLoader('texts/amontillado.txt', encoding='utf8')"
  ] },
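  { "cell_type": "markdown", "id": "3e8b5a2d", "metadata": {}, "source": [
   "Optionally, you can preview how the splitter will chunk the story before indexing it. This is a minimal sketch, reusing the `loader` and the splitter held by `index_creator`:"
  ] },
  { "cell_type": "code", "execution_count": null, "id": "9a6c0d1f", "metadata": {}, "outputs": [], "source": [
   "# Optional: preview how the text splitter chunks the input document.\n",
   "preview_docs = loader.load()\n",
   "preview_chunks = index_creator.text_splitter.split_documents(preview_docs)\n",
   "print(f'{len(preview_chunks)} chunks; the first one starts with:')\n",
   "print(preview_chunks[0].page_content[:200], '...')"
  ] },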
  { "cell_type": "markdown", "id": "53431eca", "metadata": {}, "source": [
   "This takes a few seconds to run, as it must calculate embedding vectors for a number of chunks of the input text:"
  ] },
  { "cell_type": "code", "execution_count": 10, "id": "a4929f2a", "metadata": {}, "outputs": [], "source": [
   "# Note: certain LLM providers need a workaround to compute batch embeddings\n",
   "# (as done in the next cell).\n",
   "# As of 2023-06-29, Azure OpenAI would error with:\n",
   "#     \"InvalidRequestError: Too many inputs. The max number of inputs is 1\"\n",
   "if llmProvider == 'Azure_OpenAI':\n",
   "    from langchain.indexes.vectorstore import VectorStoreIndexWrapper\n",
   "    docs = loader.load()\n",
   "    subdocs = index_creator.text_splitter.split_documents(docs)\n",
   "    #\n",
   "    print(f'subdocument {0} ...', end=' ')\n",
   "    vs = index_creator.vectorstore_cls.from_documents(\n",
   "        subdocs[:1],\n",
   "        index_creator.embedding,\n",
   "        **index_creator.vectorstore_kwargs,\n",
   "    )\n",
   "    print('done.')\n",
   "    for sdi, sd in enumerate(subdocs[1:]):\n",
   "        print(f'subdocument {sdi+1} ...', end=' ')\n",
   "        # ('metadatas', not 'metadata', is the keyword expected by add_texts)\n",
   "        vs.add_texts(texts=[sd.page_content], metadatas=[sd.metadata])\n",
   "        print('done.')\n",
   "    #\n",
   "    index = VectorStoreIndexWrapper(vectorstore=vs)"
  ] },
  { "cell_type": "code", "execution_count": 11, "id": "64c9970e", "metadata": {}, "outputs": [], "source": [
   "if llmProvider != 'Azure_OpenAI':\n",
   "    index = index_creator.from_loaders([loader])"
  ] },
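  { "cell_type": "markdown", "id": "5b7d3f0e", "metadata": {}, "source": [
   "At this point the index is built. As a quick, optional check, you can run a bare similarity search on the underlying store. A minimal sketch, assuming the standard `similarity_search` method that LangChain vector stores expose; `index.vectorstore` is the same object used in the retriever section below:"
  ] },
  { "cell_type": "code", "execution_count": null, "id": "c4e9a8b2", "metadata": {}, "outputs": [], "source": [
   "# Optional: query the vector store directly (standard VectorStore API),\n",
   "# bypassing the LLM entirely.\n",
   "for doc in index.vectorstore.similarity_search('the cask of Amontillado', k=2):\n",
   "    print(doc.page_content[:80], '...')"
  ] },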
  { "cell_type": "markdown", "id": "50c02057", "metadata": {}, "source": [
   "### Check what's on the DB\n",
   "\n",
   "By way of demonstration, if you were to directly read the rows stored in your database table, this is what you would now find there (not that you'll ever _have to_, for LangChain and CassIO provide an abstraction on top of that):"
  ] },
  { "cell_type": "code", "execution_count": 12, "id": "de08db04", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
   "\n",
   "Row 0:\n",
   "    row_id:      c38f50eb434e450ca6c8a8b2b582ca0d\n",
   "    vector:      [-0.009224753826856613, -0.01538837980479002, 0.0158158354461193 ...\n",
   "    body_blob:   He raised it to his lips with a leer. He paused and nodded to m ...\n",
   "    metadata_s:  {'source': 'texts/amontillado.txt'}\n",
   "\n",
   "Row 1:\n",
   "    row_id:      bc4ff32a7f854eab89a9227152b2fdea\n",
   "    vector:      [-0.0034670669119805098, -0.01440946850925684, 0.027317104861140 ...\n",
   "    body_blob:   He had a weak point--this Fortunato--although in other regards h ...\n",
   "    metadata_s:  {'source': 'texts/amontillado.txt'}\n",
   "\n",
   "Row 2:\n",
   "    row_id:      29cb41c42bd94dfc936e4f3e68b73f42\n",
   "    vector:      [-0.01560208573937416, -0.0050873467698693275, 0.020469402894377 ...\n",
   "    body_blob:   I said to him--\"My dear Fortunato, you are luckily met. How rem ...\n",
   "    metadata_s:  {'source': 'texts/amontillado.txt'}\n",
   "\n",
   "...\n"
  ] } ], "source": [
   "c_session = cassio.config.resolve_session()\n",
   "c_keyspace = cassio.config.resolve_keyspace()\n",
   "\n",
   "cqlSelect = f'SELECT * FROM {c_keyspace}.{table_name} LIMIT 3;'  # (Not a production-optimized query ...)\n",
   "rows = c_session.execute(cqlSelect)\n",
   "for row_i, row in enumerate(rows):\n",
   "    print(f'\\nRow {row_i}:')\n",
   "    # depending on the cassIO version, the underlying Cassandra table can have a different structure ...\n",
   "    try:\n",
   "        # you are using cassIO 0.1.0+ : congratulations :)\n",
   "        print(f'    row_id:      {row.row_id}')\n",
   "        print(f'    vector:      {str(row.vector)[:64]} ...')\n",
   "        print(f'    body_blob:   {row.body_blob[:64]} ...')\n",
   "        print(f'    metadata_s:  {row.metadata_s}')\n",
   "    except AttributeError:\n",
   "        # please upgrade your cassIO to the latest version ...\n",
   "        print(f'    document_id:      {row.document_id}')\n",
   "        print(f'    embedding_vector: {str(row.embedding_vector)[:64]} ...')\n",
   "        print(f'    document:         {row.document[:64]} ...')\n",
   "        print(f'    metadata_blob:    {row.metadata_blob}')\n",
   "\n",
   "print('\\n...')"
  ] },
  { "cell_type": "markdown", "id": "29d9f48b", "metadata": {}, "source": [
   "### Ask a question, get an answer"
  ] },
  { "cell_type": "code", "execution_count": 13, "id": "3f35aaa7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "' Luchesi is a connoisseur of wine who Fortunato believes can tell Amontillado from Sherry.'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [
   "query = \"Who is Luchesi?\"\n",
   "index.query(query, llm=llm)"
  ] },
  { "cell_type": "markdown", "id": "527a4697", "metadata": {}, "source": [
   "## Spawning a \"retriever\" from the index"
  ] },
  { "cell_type": "markdown", "id": "4cb6b732", "metadata": {}, "source": [
   "You just saw how easily you can plug a Cassandra-backed Vector Index into a full question-answering LangChain pipeline.\n",
   "\n",
   "But you can just as easily work at a slightly lower level: the following code spawns a `VectorStoreRetriever` from the index for manual [retrieval](https://python.langchain.com/en/latest/modules/indexes/retrievers.html) of documents related to a given query text. The results are instances of LangChain's `Document` class."
  ] },
  { "cell_type": "code", "execution_count": 14, "id": "7bae5520", "metadata": {}, "outputs": [], "source": [
   "retriever = index.vectorstore.as_retriever(search_kwargs={\n",
   "    'k': 2,\n",
   "})"
  ] },
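  { "cell_type": "markdown", "id": "8d0b6c3e", "metadata": {}, "source": [
   "As an aside, the same retriever can also be handed to higher-level LangChain constructs. The following is an illustrative, non-executed sketch (assuming the `RetrievalQA` chain shipped with this LangChain version) of a question-answering chain built directly on top of it:"
  ] },
  { "cell_type": "code", "execution_count": null, "id": "f2a7e5d9", "metadata": {}, "outputs": [], "source": [
   "# Illustrative sketch (not run here): plug the retriever into a RetrievalQA chain.\n",
   "from langchain.chains import RetrievalQA\n",
   "\n",
   "qa_chain = RetrievalQA.from_chain_type(\n",
   "    llm=llm,\n",
   "    chain_type='stuff',\n",
   "    retriever=retriever,\n",
   ")\n",
   "# qa_chain.run('Who is Luchesi?')"
  ] },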
  { "cell_type": "code", "execution_count": 15, "id": "e4e816ee", "metadata": {}, "outputs": [ { "data": { "text/plain": [
   "[Document(page_content='\"A huge human foot d\\'or, in a field azure; the foot crushes a serpent\\nrampant whose fangs are imbedded in the heel.\"\\n\\n\"And the motto?\"\\n\\n\"_Nemo me impune lacessit_.\"\\n\\n\"Good!\" he said.', metadata={'source': 'texts/amontillado.txt'}),\n",
   " Document(page_content='He raised it to his lips with a leer. He paused and nodded to me\\nfamiliarly, while his bells jingled.\\n\\n\"I drink,\" he said, \"to the buried that repose around us.\"\\n\\n\"And I to your long life.\"\\n\\nHe again took my arm, and we proceeded.\\n\\n\"These vaults,\" he said, \"are extensive.\"\\n\\n\"The Montresors,\" I replied, \"were a great and numerous family.\"\\n\\n\"I forget your arms.\"', metadata={'source': 'texts/amontillado.txt'})]"
  ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [
   "retriever.get_relevant_documents(\n",
   "    \"Check the motto of the Montresors\"\n",
   ")"
  ] }
 ],
 "metadata": {
  "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" },
  "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}