{ "cells": [ { "cell_type": "markdown", "id": "6715bc2b", "metadata": {}, "source": [ "# LlamaIndex, Vector Store Quickstart\n", "\n", "Create a Vector Store with LamaIndex and CassIO, and build a powerful search engine and text generator, backed by Apache Cassandra® / Astra DB." ] }, { "cell_type": "markdown", "id": "2a0167d4-8d57-418c-a4cb-3358e7e18db6", "metadata": {}, "source": [ "### Table of contents:\n", "\n", "- Setup (database, LLM+embeddings)\n", "- Create vector store\n", "- Insert documents in it\n", "- Answer questions using Vector Search\n", "- Remove documents\n", "- Cleanup" ] }, { "cell_type": "markdown", "id": "761d9b70", "metadata": {}, "source": [ "_**NOTE:** this uses Cassandra's \"Vector Similarity Search\" capability.\n", "Make sure you are connecting to a vector-enabled database for this demo._" ] }, { "cell_type": "markdown", "id": "582ca2f3-b7d0-4547-ab73-83c9e1e36d48", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "ea9b1235-6334-4ade-bc71-95c17fefbfe1", "metadata": {}, "outputs": [], "source": [ "from llama_index import VectorStoreIndex, SimpleDirectoryReader, StorageContext" ] }, { "cell_type": "markdown", "id": "945efcf3-a3f1-4e27-a0e7-ed589340ec62", "metadata": {}, "source": [ "This is the LlamaIndex class providing support for Astra DB / Cassandra:" ] }, { "cell_type": "code", "execution_count": 2, "id": "77a8f514-d932-4861-8868-71b243e95531", "metadata": {}, "outputs": [], "source": [ "from llama_index.vector_stores import CassandraVectorStore" ] }, { "cell_type": "markdown", "id": "4a9314fc-3269-442a-839f-a8e663c61da0", "metadata": {}, "source": [ "A database connection is needed. _(If on a Colab, the only supported option is the cloud service Astra DB.)_" ] }, { "cell_type": "code", "execution_count": 3, "id": "3e12e6e5-2239-4879-aa7a-3d867a303f1a", "metadata": {}, "outputs": [], "source": [ "# Ensure loading of database credentials into environment variables:\n", "import os\n", "from dotenv import load_dotenv\n", "load_dotenv(\"../../../.env\")\n", "\n", "import cassio" ] }, { "cell_type": "markdown", "id": "681b0da1-010d-4944-9537-dd8a5e21ac98", "metadata": {}, "source": [ "Select your choice of database by editing this cell, if needed:" ] }, { "cell_type": "code", "execution_count": 4, "id": "a6bc8547-686d-4bea-aeea-f9f44d047662", "metadata": {}, "outputs": [], "source": [ "database_mode = \"cassandra\" # \"cassandra\" / \"astra_db\"" ] }, { "cell_type": "code", "execution_count": 5, "id": "59c6a13b-847f-4704-8925-29c310788931", "metadata": {}, "outputs": [], "source": [ "if database_mode == \"astra_db\":\n", " cassio.init(\n", " database_id=os.environ[\"ASTRA_DB_ID\"],\n", " token=os.environ[\"ASTRA_DB_APPLICATION_TOKEN\"],\n", " keyspace=os.environ.get(\"ASTRA_DB_KEYSPACE\"), # this is optional\n", " )" ] }, { "cell_type": "code", "execution_count": 6, "id": "e2adf541-1086-445f-8147-8bf997cdfbbb", "metadata": {}, "outputs": [], "source": [ "if database_mode == \"cassandra\":\n", " from cqlsession import getCassandraCQLSession, getCassandraCQLKeyspace\n", " cassio.init(\n", " session=getCassandraCQLSession(),\n", " keyspace=getCassandraCQLKeyspace(),\n", " )" ] }, { "cell_type": "markdown", "id": "385d2575-d621-4586-96bb-de4092e8624a", "metadata": {}, "source": [ "Both an LLM and an embedding function are required.\n", "\n", "Below is the logic to instantiate the LLM and embeddings of choice. We chose to leave it in the notebooks for clarity." 
] }, { "cell_type": "code", "execution_count": 7, "id": "be891c7d-2aaf-452d-9310-605dbe8a0017", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LLM+embeddings from OpenAI\n", "Vector dimension for this embedding model: 1536\n" ] } ], "source": [ "import os\n", "from llm_choice import suggestLLMProvider\n", "\n", "llmProvider = suggestLLMProvider()\n", "# (Alternatively set llmProvider to 'GCP_VertexAI', 'OpenAI', 'Azure_OpenAI' ... manually if you have credentials)\n", "\n", "if llmProvider == 'OpenAI':\n", " os.environ['OPENAI_API_TYPE'] = 'open_ai'\n", " from llama_index.llms import OpenAI\n", " from llama_index.embeddings import OpenAIEmbedding\n", " llm = OpenAI(temperature=0)\n", " myEmbedding = OpenAIEmbedding()\n", " vector_dimension = 1536\n", " print(\"LLM+embeddings from OpenAI\")\n", "elif llmProvider == 'GCP_VertexAI':\n", " from llama_index import LangchainEmbedding\n", " # LlamaIndex lets you plug any LangChain's LLM+embeddings:\n", " from langchain.llms import VertexAI\n", " from langchain.embeddings import VertexAIEmbeddings\n", " llm = VertexAI()\n", " lcEmbedding = VertexAIEmbeddings()\n", " vector_dimension = len(lcEmbedding.embed_query(\"This is a sample sentence.\"))\n", " # ... if you take care of wrapping the LangChain embedding like this:\n", " myEmbedding = LangchainEmbedding(lcEmbedding)\n", " print(\"LLM+embeddings from Vertex AI\")\n", "elif llmProvider == 'Azure_OpenAI':\n", " os.environ['OPENAI_API_TYPE'] = 'azure'\n", " os.environ['OPENAI_API_VERSION'] = os.environ['AZURE_OPENAI_API_VERSION']\n", " os.environ['OPENAI_API_BASE'] = os.environ['AZURE_OPENAI_API_BASE']\n", " os.environ['OPENAI_API_KEY'] = os.environ['AZURE_OPENAI_API_KEY']\n", " from llama_index import LangchainEmbedding\n", " # LlamaIndex lets you plug any LangChain's LLM+embeddings:\n", " from langchain.llms import AzureOpenAI\n", " from langchain.embeddings import OpenAIEmbeddings\n", " llm = AzureOpenAI(temperature=0, model_name=os.environ['AZURE_OPENAI_LLM_MODEL'],\n", " engine=os.environ['AZURE_OPENAI_LLM_DEPLOYMENT'])\n", " lcEmbedding = OpenAIEmbeddings(model=os.environ['AZURE_OPENAI_EMBEDDINGS_MODEL'],\n", " deployment=os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT'])\n", " vector_dimension = len(lcEmbedding.embed_query(\"This is a sample sentence.\"))\n", " # ... 
 " myEmbedding = LangchainEmbedding(lcEmbedding)\n", " print('LLM+embeddings from Azure OpenAI')\n", "else:\n", " raise ValueError('Unknown LLM provider.')\n", "\n", "print(f\"Vector dimension for this embedding model: {vector_dimension}\")" ] }, { "cell_type": "markdown", "id": "f4b728fa-5da9-43c9-bf25-838051d8fe26", "metadata": {}, "source": [ "The following cell ensures that the LLM and embedding function instantiated above are used throughout LlamaIndex:" ] }, { "cell_type": "code", "execution_count": 8, "id": "fe7caf30-24e9-4441-b131-91ee9bb71d00", "metadata": {}, "outputs": [], "source": [ "from llama_index import ServiceContext\n", "from llama_index import set_global_service_context\n", "\n", "service_context = ServiceContext.from_defaults(\n", " llm=llm,\n", " embed_model=myEmbedding,\n", " chunk_size=256,\n", ")\n", "set_global_service_context(service_context)" ] }, { "cell_type": "markdown", "id": "d7716353-0a6f-48d1-a2da-ffdb84dc148b", "metadata": {}, "source": [ "## Create vector store" ] }, { "cell_type": "code", "execution_count": 9, "id": "8af3c333-8e7f-41db-9c8b-318903de3ac5", "metadata": {}, "outputs": [], "source": [ "table_name = 'vs_ll_' + llmProvider" ] }, { "cell_type": "markdown", "id": "0d695f72-d102-4dac-af03-9fb6497e5093", "metadata": {}, "source": [ "In the LlamaIndex abstractions, the `CassandraVectorStore` instance is best wrapped into the creation of a \"storage context\", which you'll use in a moment to create the index proper:" ] }, { "cell_type": "code", "execution_count": 10, "id": "157f1891-d637-468b-8c58-32eaa7c79184", "metadata": {}, "outputs": [], "source": [ "storage_context = StorageContext.from_defaults(\n", " vector_store=CassandraVectorStore(\n", " table=table_name,\n", " embedding_dimension=vector_dimension,\n", " insertion_batch_size=15,\n", " )\n", ")" ] }, { "cell_type": "markdown", "id": "776608ce-3bc4-4ad2-8e21-6a8b6cf15322", "metadata": {}, "source": [ "## Insert documents" ] }, { "cell_type": "markdown", "id": "e723dad4-6b1a-421e-bcba-a663ec1116d4", "metadata": {}, "source": [ "You'll want to be able to employ filtering on metadata in your question-answering process, so here is a simple way to associate a metadata dictionary with the ingested sources.\n", "\n", "LlamaIndex supports a _function_ that maps an input file name to a metadata dictionary, and will honour the latter when storing the source text along with the embedding vectors:" ] }, { "cell_type": "code", "execution_count": 11, "id": "e9256a03-3b74-4a6a-a19c-5342b0e4b7f0", "metadata": {}, "outputs": [], "source": [ "def my_file_metadata(file_name: str):\n", " if \"snow\" in file_name:\n", " return {\"story\": \"snow_white\"}\n", " elif \"rapunzel\" in file_name:\n", " return {\"story\": \"rapunzel\"}\n", " else:\n", " return {\"story\": \"other\"}" ] }, { "cell_type": "markdown", "id": "cfac2fc5-369d-49dd-a233-e8e20de15b78", "metadata": {}, "source": [ "As an example, you can load into the vector store the PDF files found in a given directory. In fact, you can load the documents _and_ instantiate the \"index\" object in a single step:"
] }, { "cell_type": "code", "execution_count": 12, "id": "fd9852e6-ad3a-4a91-abd7-c5ceb2a0c23c", "metadata": {}, "outputs": [], "source": [ "documents = SimpleDirectoryReader(\n", " 'pdfs',\n", " file_metadata=my_file_metadata\n", ").load_data()\n", "index = VectorStoreIndex.from_documents(\n", " documents,\n", " storage_context=storage_context,\n", ")" ] }, { "cell_type": "markdown", "id": "eede3b62-eec5-450a-a0d4-9ddd63b5ddfe", "metadata": {}, "source": [ "### Re-opening a preexisting vector store\n", "\n", "In most realistic cases, you want to access a vector store that was created and populated previously. To do so, create the `CassandraVectorStore` as shown above, then use the `from_vector_store` static method of `VectorStoreIndex`, obtaining an \"index\" that you can use exactly like the one you just created:" ] }, { "cell_type": "code", "execution_count": 13, "id": "aedb166e-505d-4ae0-9f78-8ccad4c933ec", "metadata": {}, "outputs": [], "source": [ "# This is how you would get an index for a pre-existing LlamaIndex vector store:\n", "vector_store = CassandraVectorStore(\n", " table=table_name,\n", " embedding_dimension=vector_dimension,\n", ")\n", "index_from_preexisting = VectorStoreIndex.from_vector_store(\n", " vector_store=vector_store\n", ")\n", "# You can try using \"index_from_preexisting\" in place of\n", "# \"index\" everywhere in the next cells ... e.g.:\n", "# query_engine = index_from_preexisting.as_query_engine(...)" ] }, { "cell_type": "markdown", "id": "520963ea-15c3-4529-b8eb-9a1b4aa1f1d5", "metadata": {}, "source": [ "## Answer questions" ] }, { "cell_type": "markdown", "id": "aabbb4d2-33a9-4eeb-8ea7-2dbec78ba7cd", "metadata": {}, "source": [ "Everything is ready to start asking questions. This cell lets you search for the answer over the whole indexed corpus:"
] }, { "cell_type": "code", "execution_count": 14, "id": "a5fe0611-6e88-437c-b798-fea06742cea7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Question on the whole store: Who is the antagonist of the young lady?\n", " ==> The antagonist of the young lady is the queen.\n" ] } ], "source": [ "question_1 = \"Who is the antagonist of the young lady?\"\n", "print(f\"\\nQuestion on the whole store: {question_1}\\n ==> \", end='')\n", "query_engine_all = index.as_query_engine(similarity_top_k=6)\n", "response_from_all = query_engine_all.query(question_1)\n", "print(response_from_all.response.strip())" ] }, { "cell_type": "markdown", "id": "3973e62e-53d6-4dde-bba2-ee71f9840a66", "metadata": {}, "source": [ "### Metadata filtering" ] }, { "cell_type": "markdown", "id": "af26695b-25de-4f3b-afab-7dcb2bcbb79d", "metadata": {}, "source": [ "You can, instead, constrain the lookup with filters on the metadata.\n", "\n", "Here is the same question, limited to one of the input documents through a filtering condition:" ] }, { "cell_type": "code", "execution_count": 15, "id": "f6f54b2a-9643-4bc9-84c6-0ab0087853cd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Question on the 'snow_white' document: Who is the antagonist of the young lady?\n", " ==> The antagonist of the young lady is the queen.\n" ] } ], "source": [ "from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters\n", "\n", "print(f\"\\nQuestion on the 'snow_white' document: {question_1}\\n ==> \", end='')\n", "query_engine_doc1 = index.as_query_engine(\n", " similarity_top_k=6,\n", " filters=MetadataFilters(filters=[\n", " ExactMatchFilter(key=\"story\", value=\"snow_white\"),\n", " ])\n", ")\n", "response_doc1 = query_engine_doc1.query(question_1)\n", "print(response_doc1.response.strip())" ] }, { "cell_type": "markdown", "id": "21468e7d-fb22-4f6b-80ed-8821f156ea2c", "metadata": {}, "source": [ "And here is the very same process, this time limiting the sources to the _other_ input PDF:" ] }, { "cell_type": "code", "execution_count": 16, "id": "5f78865d-16d1-4a8a-8c81-bc5e486e2b53", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Question on the 'rapunzel' document: Who is the antagonist of the young lady?\n", " ==> The antagonist of the young lady is the witch.\n" ] } ], "source": [ "print(f\"\\nQuestion on the 'rapunzel' document: {question_1}\\n ==> \", end='')\n", "query_engine_doc2 = index.as_query_engine(\n", " similarity_top_k=6,\n", " filters=MetadataFilters(filters=[\n", " ExactMatchFilter(key=\"story\", value=\"rapunzel\"),\n", " ])\n", ")\n", "response_doc2 = query_engine_doc2.query(question_1)\n", "print(response_doc2.response.strip())" ] }, { "cell_type": "markdown", "id": "595f1bd0-be72-4771-bc77-cdc3a8060fb6", "metadata": {}, "source": [ "### MMR (maximal marginal relevance) method" ] }, { "cell_type": "markdown", "id": "fc96feff-f0d0-4ff7-a911-e85e8b727614", "metadata": {}, "source": [ "In many cases, using the MMR method enhances the quality of the answers.\n", "\n", "In short, when retrieving the text chunks that will underpin answer generation, the method picks chunks that, while still relevant to the query, are as diverse from each other as possible.\n", "\n", "In this cell, you can see it in action combined with metadata filtering:" ] }, { "cell_type": "code", "execution_count": 17,
"99ba9dda-9503-4842-8f58-d72b71318889", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Question on 'snow_white', MMR method: Who is the antagonist of the young lady?\n", " ==> The antagonist of the young lady is the queen.\n" ] } ], "source": [ "print(f\"\\nQuestion on 'snow_white', MMR method: {question_1}\\n ==> \", end='')\n", "query_engine_doc2_mmr = index.as_query_engine(\n", " similarity_top_k=6,\n", " vector_store_query_mode=\"mmr\",\n", " vector_store_kwargs={\n", " \"mmr_prefetch_k\": 20,\n", " },\n", " filters=MetadataFilters(filters=[\n", " ExactMatchFilter(key=\"story\", value=\"snow_white\"),\n", " ])\n", ")\n", "response_doc2_mmr = query_engine_doc2_mmr.query(question_1)\n", "print(response_doc2_mmr.response.strip())" ] }, { "cell_type": "markdown", "id": "4bf48334-f078-46f2-a4d0-a4a1dc30cc0c", "metadata": {}, "source": [ "## Remove documents" ] }, { "cell_type": "markdown", "id": "dea2c4e6-a0ac-471d-95df-7f84232557ea", "metadata": {}, "source": [ "Sometimes you need to remove a document from the store. This generally entails removal of a _number_ of nodes (i.e. the individual chunks of text, each with its embedding vector, into which the input document is split at ingestion time).\n", "\n", "This is made easy with the `delete` method of the vector store. _Just keep in mind that, when indexing PDF files, LlamaIndex will treat each *page* of the input file as a separate document, so that you will erase one page at a time._\n", "\n", "In the following, you will:\n", "\n", " 1. ask a question to check the \"baseline\" result\n", " 2. get a few \"document IDs\" relevant to the question\n", " 3. use the `delete` method to remove those IDs from the vector store\n", " 4. ask the same question and compare the answer you get." 
] }, { "cell_type": "markdown", "id": "1c9bca5f-68f7-4186-9eb3-7d7bcb9e39f4", "metadata": {}, "source": [ "Ask a question to check the result:" ] }, { "cell_type": "code", "execution_count": 18, "id": "f0bad039-3b35-45b7-924a-7814ddc2f40d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Question before removal: Who is Mother Gothel?\n", " ==> Mother Gothel is a character in the story \"Rapunzel\".\n" ] } ], "source": [ "q_removal_test = \"Who is Mother Gothel?\"\n", "print(f\"\\nQuestion before removal: {q_removal_test}\\n ==> \", end='')\n", "response_before_deletion = query_engine_all.query(q_removal_test)\n", "print(response_before_deletion.response.strip())" ] }, { "cell_type": "markdown", "id": "580f14ea-8738-4f34-b6dd-6590b21b91d4", "metadata": {}, "source": [ "Use the `retrieve` query-engine primitive to get a \"raw\" list of best-match document:" ] }, { "cell_type": "code", "execution_count": 19, "id": "ee5a0119-3c7e-4ea4-8b91-5d08363313ee", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 4 nodes with their score.\n" ] } ], "source": [ "from llama_index.indices.query.schema import QueryBundle\n", "\n", "q_bundle = QueryBundle(query_str=q_removal_test)\n", "query_engine_all_manydocs = index.as_query_engine(similarity_top_k=4)\n", "nodes_with_scores = query_engine_all_manydocs.retrieve(q_bundle)\n", "\n", "print(f\"Found {len(nodes_with_scores)} nodes with their score.\")" ] }, { "cell_type": "markdown", "id": "a52fceb2-bdcc-4983-8174-d0d0b51a3c45", "metadata": {}, "source": [ "Now delete a few documents (remember that there are in general several \"nodes\" for each \"document\", as the `Counter` here may highlight):" ] }, { "cell_type": "code", "execution_count": 20, "id": "f6b939bc-4dde-4a84-9a9b-9a22a27efdcf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Deleting doc=4b3b4822-72cc-4cbb-a0c7-7d1cd4b93e87 (came up in 4 node(s))\n", "Done deleting.\n" ] } ], "source": [ "from collections import Counter\n", "\n", "nodes_per_document = Counter(nws.node.ref_doc_id for nws in nodes_with_scores)\n", "\n", "for document_id, node_count in nodes_per_document.most_common():\n", " print(f\"Deleting doc={document_id} (came up in {node_count} node(s))\")\n", " vector_store.delete(document_id)\n", "print(\"Done deleting.\")" ] }, { "cell_type": "markdown", "id": "bd5e1bab-e5ff-4214-9b35-a537bb3cfaa1", "metadata": {}, "source": [ "Now repeat the question:" ] }, { "cell_type": "code", "execution_count": 21, "id": "9c3eb56e-e5d8-4eb2-ad9b-685a6b64040c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Question after removal: Who is Mother Gothel?\n", " ==> I'm sorry, but I cannot provide an answer to that query based on the given context information.\n" ] } ], "source": [ "print(f\"\\nQuestion after removal: {q_removal_test}\\n ==> \", end='')\n", "response_after_deletion = query_engine_all.query(q_removal_test)\n", "if response_after_deletion.response:\n", " print(response_after_deletion.response.strip())\n", "else:\n", " print(\"(no answer received)\")" ] }, { "cell_type": "markdown", "id": "69a49931-875d-419d-b39e-1880de31754a", "metadata": {}, "source": [ "## (optional) Cleanup" ] }, { "cell_type": "markdown", "id": "d148e776-bb27-4e3c-85be-2f7afb11711a", "metadata": {}, "source": [ "You may want to clean up your database: in that case, simply run the following cell.\n", "\n", "_Warning: this will delete the vector store and all 
that you stored into it!_" ] }, { "cell_type": "code", "execution_count": 22, "id": "4b9ebc5f-be2c-4627-8304-b6206110e2bd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c_session = cassio.config.resolve_session()\n", "c_keyspace = cassio.config.resolve_keyspace()\n", "c_session.execute(f\"DROP TABLE IF EXISTS {c_keyspace}.{table_name};\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 5 }