{ "cells": [ { "cell_type": "markdown", "id": "6715bc2b", "metadata": {}, "source": [ "# Metadata filtering in the Vector Store\n", "\n", "Enhance a Question-Answering system with metadata filtering with LangChain and CassIO, using Cassandra as the Vector Database." ] }, { "cell_type": "markdown", "id": "761d9b70", "metadata": {}, "source": [ "_**NOTE:** this uses Cassandra's \"Vector Similarity Search\" capability.\n", "Make sure you are connecting to a vector-enabled database for this demo._" ] }, { "cell_type": "markdown", "id": "cd81a50b-8b45-4efa-a053-05ff34d30a4c", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "042f832e", "metadata": {}, "outputs": [], "source": [ "from langchain.indexes import VectorstoreIndexCreator\n", "from langchain.text_splitter import (\n", " CharacterTextSplitter,\n", " RecursiveCharacterTextSplitter,\n", ")\n", "from langchain.docstore.document import Document\n", "from langchain.document_loaders import TextLoader" ] }, { "cell_type": "markdown", "id": "4388ac1d", "metadata": {}, "source": [ "The following line imports the Cassandra flavor of a LangChain vector store:" ] }, { "cell_type": "code", "execution_count": 2, "id": "d65c46f0", "metadata": {}, "outputs": [], "source": [ "from langchain.vectorstores.cassandra import Cassandra" ] }, { "cell_type": "markdown", "id": "4578a87b", "metadata": {}, "source": [ "A database connection is needed. _(If on a Colab, the only supported option is the cloud service Astra DB.)_" ] }, { "cell_type": "code", "execution_count": 3, "id": "9515d4af-9fc2-4c29-b9f2-5c90cd48cc1b", "metadata": {}, "outputs": [], "source": [ "# Ensure loading of database credentials into environment variables:\n", "import os\n", "from dotenv import load_dotenv\n", "load_dotenv(\"../../../.env\")\n", "\n", "import cassio" ] }, { "cell_type": "markdown", "id": "25ee44a4-8d6a-4b41-8a8f-7c93eed35034", "metadata": {}, "source": [ "Select your choice of database by editing this cell, if needed:" ] }, { "cell_type": "code", "execution_count": 4, "id": "bfe78b92-7194-4ee9-ba7f-cd308409212c", "metadata": {}, "outputs": [], "source": [ "database_mode = \"cassandra\" # \"cassandra\" / \"astra_db\"" ] }, { "cell_type": "code", "execution_count": 5, "id": "fb5597b3-a028-4d8f-ad6d-8eaf0c941dcc", "metadata": {}, "outputs": [], "source": [ "if database_mode == \"astra_db\":\n", " cassio.init(\n", " database_id=os.environ[\"ASTRA_DB_ID\"],\n", " token=os.environ[\"ASTRA_DB_APPLICATION_TOKEN\"],\n", " keyspace=os.environ.get(\"ASTRA_DB_KEYSPACE\"), # this is optional\n", " )" ] }, { "cell_type": "code", "execution_count": 6, "id": "329d969b-6532-4dc4-a315-1a5438a59919", "metadata": {}, "outputs": [], "source": [ "if database_mode == \"cassandra\":\n", " from cqlsession import getCassandraCQLSession, getCassandraCQLKeyspace\n", " cassio.init(\n", " session=getCassandraCQLSession(),\n", " keyspace=getCassandraCQLKeyspace(),\n", " )" ] }, { "cell_type": "markdown", "id": "32e2a156", "metadata": {}, "source": [ "Both an LLM and an embedding function are required.\n", "\n", "Below is the logic to instantiate the LLM and embeddings of choice. We chose to leave it in the notebooks for clarity." 
] }, { "cell_type": "code", "execution_count": 7, "id": "124e3de4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LLM+embeddings from OpenAI\n" ] } ], "source": [ "import os\n", "from llm_choice import suggestLLMProvider\n", "\n", "llmProvider = suggestLLMProvider()\n", "# (Alternatively set llmProvider to 'GCP_VertexAI', 'OpenAI', 'Azure_OpenAI' ... manually if you have credentials)\n", "\n", "if llmProvider == 'GCP_VertexAI':\n", " from langchain.llms import VertexAI\n", " from langchain.embeddings import VertexAIEmbeddings\n", " llm = VertexAI()\n", " myEmbedding = VertexAIEmbeddings()\n", " print('LLM+embeddings from Vertex AI')\n", "elif llmProvider == 'OpenAI':\n", " os.environ['OPENAI_API_TYPE'] = 'open_ai'\n", " from langchain.llms import OpenAI\n", " from langchain.embeddings import OpenAIEmbeddings\n", " llm = OpenAI(temperature=0)\n", " myEmbedding = OpenAIEmbeddings()\n", " print('LLM+embeddings from OpenAI')\n", "elif llmProvider == 'Azure_OpenAI':\n", " os.environ['OPENAI_API_TYPE'] = 'azure'\n", " os.environ['OPENAI_API_VERSION'] = os.environ['AZURE_OPENAI_API_VERSION']\n", " os.environ['OPENAI_API_BASE'] = os.environ['AZURE_OPENAI_API_BASE']\n", " os.environ['OPENAI_API_KEY'] = os.environ['AZURE_OPENAI_API_KEY']\n", " from langchain.llms import AzureOpenAI\n", " from langchain.embeddings import OpenAIEmbeddings\n", " llm = AzureOpenAI(temperature=0, model_name=os.environ['AZURE_OPENAI_LLM_MODEL'],\n", " engine=os.environ['AZURE_OPENAI_LLM_DEPLOYMENT'])\n", " myEmbedding = OpenAIEmbeddings(model=os.environ['AZURE_OPENAI_EMBEDDINGS_MODEL'],\n", " deployment=os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT'])\n", " print('LLM+embeddings from Azure OpenAI')\n", "else:\n", " raise ValueError('Unknown LLM provider.')" ] }, { "cell_type": "markdown", "id": "8662dd1b-60c0-4c94-9ea1-c23554eda779", "metadata": {}, "source": [ "## Create the vector store and load data" ] }, { "cell_type": "markdown", "id": "786148f2-f058-4c0c-9b46-139e274b7a8b", "metadata": {}, "source": [ "_Note: in case you have run this demo already, skip ahead to the next subsection (\"B\"): you will directly \"re-open\" a pre-populated store._" ] }, { "cell_type": "markdown", "id": "2aac4fe5-a56c-482a-b21e-42e7902626a8", "metadata": {}, "source": [ "### A. Create store while loading new documents in it" ] }, { "cell_type": "markdown", "id": "285f29cf", "metadata": {}, "source": [ "This section creates a brand new vector store and loads some source documents in it. The store is created and filled at once, to be later queried to retrieve relevant parts of the indexed text.\n", "\n", "At question-answering time, LangChain will take care of looking for the relevant context fragments, stuff them into a prompt and finally use the prompt and an LLM to get the answer." ] }, { "cell_type": "markdown", "id": "6f29fc57", "metadata": {}, "source": [ "The following instantiates an \"index creator\", which knows about the type of vector store, the embedding to use and how to preprocess the input text:\n", "\n", "_(Note: stores built with different embedding functions will need different tables. 
This is why we append the `llmProvider` name to the table name in the next cell.)_" ] }, { "cell_type": "code", "execution_count": 8, "id": "d2cfe71b", "metadata": {}, "outputs": [], "source": [ "table_name = 'vs_test_md_' + llmProvider\n", "\n", "index_creator = VectorstoreIndexCreator(\n", "    vectorstore_cls=Cassandra,\n", "    embedding=myEmbedding,\n", "    text_splitter=CharacterTextSplitter(\n", "        chunk_size=400,\n", "        chunk_overlap=0,\n", "    ),\n", "    vectorstore_kwargs={\n", "        'session': None,\n", "        'keyspace': None,\n", "        'table_name': table_name,\n", "    },\n", ")" ] }, { "cell_type": "markdown", "id": "ea5ddfb6", "metadata": {}, "source": [ "Load some local texts (a few short stories by E. A. Poe will do):" ] }, { "cell_type": "code", "execution_count": 9, "id": "de8d65df", "metadata": {}, "outputs": [], "source": [ "loader1 = TextLoader('texts/amontillado.txt', encoding='utf8')\n", "loader2 = TextLoader('texts/mask.txt', encoding='utf8')\n", "loader3 = TextLoader('texts/manuscript.txt', encoding='utf8')\n", "loaders = [loader1, loader2, loader3]" ] }, { "cell_type": "markdown", "id": "53431eca", "metadata": {}, "source": [ "This takes a few seconds to run, as it must calculate embedding vectors for a number of chunks of the input text:" ] }, { "cell_type": "code", "execution_count": 10, "id": "a4929f2a", "metadata": {}, "outputs": [], "source": [ "# Note: certain LLM providers need a workaround to evaluate batch embeddings\n", "# (batch insertion is what the next cell does).\n", "# As of 2023-06-29, Azure OpenAI would error with:\n", "#     \"InvalidRequestError: Too many inputs. The max number of inputs is 1\"\n", "if llmProvider == 'Azure_OpenAI':\n", "    from langchain.indexes.vectorstore import VectorStoreIndexWrapper\n", "    for loader in loaders:\n", "        docs = loader.load()\n", "        subdocs = index_creator.text_splitter.split_documents(docs)\n", "        #\n", "        print('subdocument 0 ...', end=' ')\n", "        vs = index_creator.vectorstore_cls.from_documents(\n", "            subdocs[:1],\n", "            index_creator.embedding,\n", "            **index_creator.vectorstore_kwargs,\n", "        )\n", "        print('done.')\n", "        for sdi, sd in enumerate(subdocs[1:]):\n", "            print(f'subdocument {sdi+1} ...', end=' ')\n", "            vs.add_texts(texts=[sd.page_content], metadatas=[sd.metadata])\n", "            print('done.')\n", "    #\n", "    index = VectorStoreIndexWrapper(vectorstore=vs)" ] }, { "cell_type": "code", "execution_count": 11, "id": "64c9970e", "metadata": {}, "outputs": [], "source": [ "if llmProvider != 'Azure_OpenAI':\n", "    index = index_creator.from_loaders(loaders)" ] }, { "cell_type": "markdown", "id": "7b82fdb4-fb51-4623-863c-4a50722ba743", "metadata": {}, "source": [ "_Note: depending on how you load rows into your store, there may be ways to attach your own metadata (check the LangChain docs; a hedged sketch is also given a couple of cells below). For now, there is a `source` metadata field holding the file path, and that is the one we'll use._" ] }, { "cell_type": "markdown", "id": "c7649a56-a783-4195-9a90-aca56798035c", "metadata": {}, "source": [ "For later demonstration, extract the vector store itself as a stand-alone object from the index:" ] }, { "cell_type": "code", "execution_count": 12, "id": "2f487b19-76c2-4b27-b552-5abcc55e14b7", "metadata": {}, "outputs": [], "source": [ "myCassandraVStore = index.vectorstore" ] }
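, { "cell_type": "markdown", "id": "a1b2c3d4", "metadata": {}, "source": [ "_(Sketch: attaching custom metadata at insertion time. This cell is illustrative and not part of the demo: the text, the `texts/raven.txt` path and the `genre` key are made-up examples. It relies on the same `add_texts`/`metadatas` pattern used in the Azure OpenAI workaround cell above; running it would add one extra row to your table.)_" ] }, { "cell_type": "code", "execution_count": null, "id": "b2c3d4e5", "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch, not part of the demo dataset: build the `metadatas`\n", "# list yourself to attach arbitrary key-value pairs to the inserted text.\n", "# (The 'genre' key and the 'texts/raven.txt' path are hypothetical.)\n", "myCassandraVStore.add_texts(\n", "    texts=['Once upon a midnight dreary, while I pondered, weak and weary ...'],\n", "    metadatas=[{'source': 'texts/raven.txt', 'genre': 'poem'}],\n", ")" ] }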
, { "cell_type": "markdown", "id": "2ca556b1-3fba-495e-81da-6e07880ffef9", "metadata": {}, "source": [ "### B. Loading a previously-populated Vector Store\n", "\n", "If you have already ingested the documents into the vector store, this is how you would \"re-open\" an index on it:" ] }, { "cell_type": "code", "execution_count": 13, "id": "0c14bc32-7e54-4438-9ea2-21186e9c4334", "metadata": {}, "outputs": [], "source": [ "from langchain.indexes.vectorstore import VectorStoreIndexWrapper\n", "\n", "myCassandraVStore = Cassandra(\n", "    embedding=myEmbedding,\n", "    session=None,\n", "    keyspace=None,\n", "    table_name='vs_test_md_' + llmProvider,\n", ")\n", "index = VectorStoreIndexWrapper(vectorstore=myCassandraVStore)" ] }, { "cell_type": "markdown", "id": "50c02057", "metadata": {}, "source": [ "## Metadata filtering in Question Answering" ] }, { "cell_type": "markdown", "id": "f8548a0b-a1ca-4f0e-84b7-a2d69e98bc9e", "metadata": {}, "source": [ "The crucial point is that, when loading documents, LangChain automatically sets the metadata key-value pair `{\"source\": <file path>}`, so you'll use that field to constrain the answering process to specific documents.\n", "\n", "_(If you need more flexibility in handling the metadata at insertion time, look into building your own `metadatas` argument for the vector store's `add_texts` method. You can see example usages of `add_texts` a few cells above this one.)_" ] }, { "cell_type": "markdown", "id": "70cbd3b5-4d38-420c-bc0c-3962043a45c1", "metadata": {}, "source": [ "You'll concentrate on two questions, whose answers depend largely on the particular short story under scrutiny:" ] }, { "cell_type": "markdown", "id": "ee682864-bb3c-47be-9c30-ca1257a43728", "metadata": {}, "source": [ "_Technical note: make sure you wrap your `filter` argument in the right dictionary structure, which depends on whether you are working at the retriever, index, or store abstraction layer. Most of these methods tend to silently swallow unexpected parameters, so take extra care in crafting the right `retriever_kwargs`, `search_kwargs` or `filter` parameter for each method call._" ] }, { "cell_type": "code", "execution_count": 14, "id": "43abfcd2-1f87-4ed0-a6ef-93c52831b01b", "metadata": {}, "outputs": [], "source": [ "Q1 = \"Does the Captain do anything weird?\"\n", "Q2 = \"Who arrives and scares everyone?\"" ] }, { "cell_type": "markdown", "id": "d9065b47-77f7-4a8b-a47a-eb65a722fb5d", "metadata": {}, "source": [ "### Without metadata filtering (baseline case)" ] }, { "cell_type": "code", "execution_count": 15, "id": "48590e50-f342-422b-87a9-ba6d27a7d23a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------------\n", "Answer to Q1 (Does the Captain do anything weird?):\n", " ===> No, the captain does not do anything weird. He pays no attention to the narrator, and he does not seem to be aware of the narrator's presence. 
He is described as having an intense expression and evidence of old age, but he does not do anything out of the ordinary.\n", "--------------------\n", "Answer to Q2 (Who arrives and scares everyone?):\n", " ===> The Red Death.\n" ] } ], "source": [ "print(f\"{'-'*20}\\nAnswer to Q1 ({Q1}):\\n ===> \", end=\"\")\n", "print(index.query(Q1).strip())\n", "print(f\"{'-'*20}\\nAnswer to Q2 ({Q2}):\\n ===> \", end=\"\")\n", "print(index.query(Q2).strip())" ] }, { "cell_type": "markdown", "id": "a8d47074-23fc-4ee1-94c8-7a4e99de7a09", "metadata": {}, "source": [ "### With metadata filtering\n", "\n", "Conditions on metadata are ultimately passed as a key-value `filter = {\"source\": <file path>}` parameter to the vector store's similarity-search methods.\n", "\n", "When using the index's `query` method, this means supplying a `retriever_kwargs` argument as follows:" ] }, { "cell_type": "code", "execution_count": 16, "id": "d1c65a5d-36a3-4649-91e6-8cebac5dd9d4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "** Using 'manuscript.txt':\n", "--------------------\n", "Answer to Q1 (Does the Captain do anything weird?):\n", " ===> No, the captain does not do anything weird. He pays no attention to the narrator, and he does not seem to be aware of the narrator's presence. He is described as having an intense expression and evidence of old age, but he does not do anything out of the ordinary.\n", "--------------------\n", "Answer to Q2 (Who arrives and scares everyone?):\n", " ===> A gigantic ship of perhaps four thousand tons.\n" ] } ], "source": [ "retr_kwargs_manuscript = {\"search_kwargs\": {\"filter\": {\"source\": \"texts/manuscript.txt\"}}}\n", "\n", "print(\"** Using 'manuscript.txt':\")\n", "print(f\"{'-'*20}\\nAnswer to Q1 ({Q1}):\\n ===> \", end=\"\")\n", "print(index.query(Q1, retriever_kwargs=retr_kwargs_manuscript).strip())\n", "print(f\"{'-'*20}\\nAnswer to Q2 ({Q2}):\\n ===> \", end=\"\")\n", "print(index.query(Q2, retriever_kwargs=retr_kwargs_manuscript).strip())" ] }, { "cell_type": "code", "execution_count": 17, "id": "57970b3a-86e3-4a0d-a140-e34925b93204", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "** Using 'mask.txt':\n", "--------------------\n", "Answer to Q1 (Does the Captain do anything weird?):\n", " ===> No, the Captain does not do anything weird.\n", "--------------------\n", "Answer to Q2 (Who arrives and scares everyone?):\n", " ===> The Red Death.\n" ] } ], "source": [ "retr_kwargs_mask = {\"search_kwargs\": {\"filter\": {\"source\": \"texts/mask.txt\"}}}\n", "\n", "print(\"** Using 'mask.txt':\")\n", "print(f\"{'-'*20}\\nAnswer to Q1 ({Q1}):\\n ===> \", end=\"\")\n", "print(index.query(Q1, retriever_kwargs=retr_kwargs_mask).strip())\n", "print(f\"{'-'*20}\\nAnswer to Q2 ({Q2}):\\n ===> \", end=\"\")\n", "print(index.query(Q2, retriever_kwargs=retr_kwargs_mask).strip())" ] }, { "cell_type": "markdown", "id": "527a4697", "metadata": {}, "source": [ "## Spawning a \"retriever\" from the index" ] }, { "cell_type": "markdown", "id": "c573ebd3-de15-4f15-b269-1bc565391606", "metadata": {}, "source": [ "You can also create a \"retriever\" from the index and use it for subsequent document fetching (based on semantic similarity).\n", "\n", "Customizing the retriever amounts to passing a `search_kwargs` argument to the vector store's `as_retriever` method:" ] }, { "cell_type": "code", "execution_count": 18, "id": "5d90aa80-29bb-46ea-b80f-341b3d5aff99", "metadata": {}, "outputs": [], "source": [ 
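"# A probe question; without filtering, the matches will come from several stories:\n", 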
"RETRIEVER_Q = \"What does the narrator do?\"" ] }, { "cell_type": "markdown", "id": "3c4816cd-88e1-4dbd-ba72-a2bd64522655", "metadata": {}, "source": [ "### Without metadata filtering (baseline case)" ] }, { "cell_type": "code", "execution_count": 19, "id": "7bae5520", "metadata": {}, "outputs": [], "source": [ "retriever_0 = index.vectorstore.as_retriever(search_kwargs={'k': 4})" ] }, { "cell_type": "code", "execution_count": 20, "id": "e4e816ee", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[doc 0, texts/amontillado.txt] \"The gait of my friend was unsteady, and the bells ...\"\n", "[doc 1, texts/manuscript.txt] \"I had scarcely completed my work, when a footstep ...\"\n", "[doc 2, texts/manuscript.txt] \"In the meantime the wind is still in our poop, and...\"\n", "[doc 3, texts/amontillado.txt] \"\"The nitre!\" I said; \"see, it increases. It hangs...\"\n" ] } ], "source": [ "for doc_i, doc in enumerate(retriever_0.get_relevant_documents(RETRIEVER_Q)):\n", " print(f\"[doc {doc_i}, {doc.metadata['source']}] \\\"{doc.page_content[:50]}...\\\"\")" ] }, { "cell_type": "markdown", "id": "c751c0f2-63c1-424f-8c1d-539cb968c94d", "metadata": {}, "source": [ "### With metadata filtering" ] }, { "cell_type": "code", "execution_count": 21, "id": "fa26a655-3f64-4a09-9f51-cf33d0feb568", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[doc 0, texts/manuscript.txt] \"I had scarcely completed my work, when a footstep ...\"\n", "[doc 1, texts/manuscript.txt] \"In the meantime the wind is still in our poop, and...\"\n", "[doc 2, texts/manuscript.txt] \"As I fell, the ship hove in stays, and went about;...\"\n", "[doc 3, texts/manuscript.txt] \"At this instant, I know not what sudden self-posse...\"\n" ] } ], "source": [ "retriever_m = index.vectorstore.as_retriever(search_kwargs={\n", " 'k': 4,\n", " 'filter': {'source': 'texts/manuscript.txt'},\n", "})\n", "for doc_i, doc in enumerate(retriever_m.get_relevant_documents(RETRIEVER_Q)):\n", " print(f\"[doc {doc_i}, {doc.metadata['source']}] \\\"{doc.page_content[:50]}...\\\"\")" ] }, { "cell_type": "markdown", "id": "4bb97a92-c81c-415d-93ca-a04c77578966", "metadata": {}, "source": [ "## MMR (maximal-marginal-relevance) Question Answering" ] }, { "cell_type": "markdown", "id": "f33df0bf-6754-41de-bc1d-e6bc0e73301f", "metadata": {}, "source": [ "Metadata filtering can be combined with the MMR technique for fetching, for the answer generation, relevant text fragments which at the same time are as diverse as possible:" ] }, { "cell_type": "code", "execution_count": 22, "id": "d4ce1937-ec88-410e-9e73-023f7c28f329", "metadata": {}, "outputs": [], "source": [ "MMR_Q = \"Whose identity is unknown?\"" ] }, { "cell_type": "markdown", "id": "895596b2-957c-4b23-8f3e-530d7a6f126a", "metadata": {}, "source": [ "Once more, depending on whether you are working at the index, retriever or vector store level, you have to encapsulate the `filter` parameter differently. The following cells demonstrate this." 
] }, { "cell_type": "markdown", "id": "d35a0dff-8bac-4742-9401-6720b7845b26", "metadata": {}, "source": [ "### Without metadata filtering (baseline case)" ] }, { "cell_type": "code", "execution_count": 23, "id": "8e2548e1-0229-40f2-8bdb-91370b9ce108", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[doc 0, texts/mask.txt] \"In an assembly of phantasms such as I have painted...\"\n", "[doc 1, texts/mask.txt] \"“Who dares?” he demanded hoarsely of the courtiers...\"\n", "[doc 2, texts/manuscript.txt] \"I had scarcely completed my work, when a footstep ...\"\n", "[doc 3, texts/manuscript.txt] \"A feeling, for which I have no name, has taken pos...\"\n" ] } ], "source": [ "for doc_i, doc in enumerate(myCassandraVStore.search(MMR_Q, search_type='mmr', k=4)):\n", " print(f\"[doc {doc_i}, {doc.metadata['source']}] \\\"{doc.page_content[:50]}...\\\"\")" ] }, { "cell_type": "code", "execution_count": 24, "id": "176215fb-c4d4-44d5-96ac-fa0c5c0bf150", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The figure in the masquerade costume.\n" ] } ], "source": [ "print(index.query(\n", " MMR_Q,\n", " retriever_kwargs={\n", " \"k\": 4,\n", " \"search_type\": \"mmr\",\n", " }\n", ").strip())" ] }, { "cell_type": "markdown", "id": "19f85f8f-b6b0-495b-bac5-d4b30bf45ed0", "metadata": {}, "source": [ "### With metadata" ] }, { "cell_type": "code", "execution_count": 25, "id": "4943b097-f795-4ba0-b270-bd50faf6225b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[doc 3, texts/manuscript.txt] \"I had scarcely completed my work, when a footstep ...\"\n", "[doc 3, texts/manuscript.txt] \"A feeling, for which I have no name, has taken pos...\"\n", "[doc 3, texts/manuscript.txt] \"Of my country and of my family I have little to sa...\"\n", "[doc 3, texts/manuscript.txt] \"When I look around me I feel ashamed of my former ...\"\n" ] } ], "source": [ "mmr_md_filter = {'source': 'texts/manuscript.txt'}\n", "\n", "results = myCassandraVStore.search(MMR_Q, search_type='mmr', k=4, filter=mmr_md_filter)\n", "for i, doc in enumerate(results):\n", " print(f\"[doc {doc_i}, {doc.metadata['source']}] \\\"{doc.page_content[:50]}...\\\"\")" ] }, { "cell_type": "code", "execution_count": 26, "id": "47663895-4578-43dd-8b1b-f33176e16a47", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The man who passed by the speaker's place of concealment.\n" ] } ], "source": [ "print(index.query(\n", " MMR_Q,\n", " retriever_kwargs={\n", " \"search_kwargs\": {\n", " \"filter\": mmr_md_filter,\n", " },\n", " \"k\": 4,\n", " \"search_type\": \"mmr\",\n", " }\n", ").strip())" ] }, { "cell_type": "markdown", "id": "2c624035-8f00-4ab0-b6f4-f1a0d8a2b81e", "metadata": {}, "source": [ "## (optional) Cleanup" ] }, { "cell_type": "markdown", "id": "d18a2e75-5fd0-4035-9f27-f5e75436d7b3", "metadata": {}, "source": [ "If you want to delete the data from your database and drop the table altogether, run the following cell:" ] }, { "cell_type": "code", "execution_count": 27, "id": "267be95d-2375-445a-aad8-f8c575a546bf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c_session = cassio.config.resolve_session()\n", "c_keyspace = cassio.config.resolve_keyspace()\n", "c_session.execute(f\"DROP TABLE IF EXISTS {c_keyspace}.{table_name};\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", 
"language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 5 }