{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CS 329T Homework 1\n",
    "#### Student Name: \n",
    "#### SUNet ID: \n",
    "#### Due Date: Monday, October 9, 2023."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Over the course of this homework, we will create and evaluate a Retrieval Augmented Generation (RAG) application. RAG applications allow the use of LLMs with custom data sources, including those not used for training the LLM. RAG apps allow LLMs to be used for domain-specific tasks, which LLMs by themseleves are less effective at since they are trained on very general domain corpora. In this homework, we will use a RAG application to query an LLM with natural language prompts on a custom collection of Wikipedia articles. \n",
    "\n",
    "We have split the homework into three questions. In the first question, you will build a prototype version of your RAG application. In the second question, you will initialize 3 different evaluation metrics to evaluate your RAG application. In the third question, you will be asked to try and evaluate different RAG application configurations to find the best performing application for the evaluation set, and explain why it performed the best.\n",
    "\n",
    "The code pre-populated in this homework is geared towards Llama-Index v0.8.4 as the framework, Milvus as the vector store, and TruLens v0.12.0 for evaluations. Documentation for each is linked below. If it suits you, you are welcome to change out these components or add in additional components as you see fit.\n",
    "\n",
    "- [Llama-Index Documentation](https://gpt-index.readthedocs.io/en/v0.8.4/)\n",
    "- [Milvus Documentation](https://milvus.io/docs)\n",
    "- [TruLens Documentation](https://www.trulens.org/trulens_eval/install/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Install Docker Compose and Milvus\n",
    "To get up and running, you'll first need to install Docker and Milvus. Please note:\n",
    "- If you have a Windows machine, WSL 2 is required to install Milvus.\n",
    "- Installing Docker Desktop (which includes Docker Compose) is required for MacOS and Windows with WSL 2 enabled, and highly recommended for Linux to ensure Milvus works.\n",
    "- Google Colab and similar platforms are not supported by Docker Compose and Milvus.\n",
    "- When following the installation instructions for Milvus, please download `docker-compose.yml` and run Milvus in the same folder as this Python notebook. \n",
    "- You will need to have Milvus running in the same folder as this notebook to successfully run the code in this notebook. If you stop Milvus at any point, please re-run Milvus to continue working on this notebook.\n",
    "\n",
    "Find installation instructions for your system below:\n",
    "* Docker Compose ([Instructions](https://docs.docker.com/compose/install/))\n",
    "* Milvus Standalone ([Instructions](https://milvus.io/docs/install_standalone-docker.md))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Install python dependencies\n",
    "We recommend installing the dependencies below in a new environment in the python package manager of your choice (virtualenv, conda etc)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install trulens-eval==0.12.0 llama_index==0.8.4 pymilvus==2.3.0 nltk==3.8.1 html2text==2020.1.16 tenacity==8.2.3 --quiet\n",
    "%pip install wikipedia transformers sentence-transformers --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3: Add LLM API key(s)\n",
    "By default, we use OpenAI LLMs for the RAG app in this HW. For this, you will need an OpenAI API key. Please obtain your API key following the instructions [here](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key). \n",
    "\n",
    "If you'd like to follow the best practices for API key safety, you can set the environment variable containing your OpenAI API key outside of this notebook to avoid explicitly setting it below by following Step 4 from [this guide](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety). In case you elect to do this, please remove the code below.\n",
    "\n",
    "If you use models from providers other than OpenAI (e.g. HuggingFace, Cohere), you will need likely need their API keys as well. You can do so in the same format as the OpenAI key below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"YOUR KEY HERE\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining prompts and a relevant document collection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Collecting our custom dataset of Wikipedia entries\n",
    "Here, we collect our custom dataset of the Wikipedia entries of different cities. We pull the Wikipedia entries of select cities using Llama-Index and store their text in the list `wiki_docs`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import WikipediaReader\n",
    "\n",
    "cities = [\n",
    "    \"Los Angeles\", \"Houston\", \"Honolulu\", \"Tucson\", \"Mexico City\", \n",
    "    \"Cincinatti\", \"Chicago\"\n",
    "]\n",
    "\n",
    "wiki_docs = []\n",
    "for city in cities:\n",
    "    try:\n",
    "        doc = WikipediaReader().load_data(pages=[city])\n",
    "        wiki_docs.extend(doc)\n",
    "    except Exception as e:\n",
    "        print(f\"Error loading page for city {city}: {e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Test Prompts\n",
    "Below, we've defined some prompts we'd like to query our custom data with using our RAG app. You might notice that the questions below are specific, and LLMs which are trained on very general corpora may not know the factual answers to these questions. This necessitates retrieving additional data to augment our prompts with to provide the LLM with context, which is the purpose of a RAG application.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_prompts = [\n",
    "    \"What's the best national park near Honolulu\",\n",
    "    \"What are some famous universities in Tucson?\",\n",
    "    \"What bodies of water are near Chicago?\",\n",
    "    \"What is the name of Chicago's central business district?\",\n",
    "    \"What are the two most famous universities in Los Angeles?\",\n",
    "    \"What are some famous festivals in Mexico City?\",\n",
    "    \"What are some famous festivals in Los Angeles?\",\n",
    "    \"What professional sports teams are located in Los Angeles\",\n",
    "    \"How do you classify Houston's climate?\",\n",
    "    \"What landmarks should I know about in Cincinatti\"\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 1: Building a prototype RAG"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "RAG apps help augment LLMs so they can work with custom data. They generally consist of two stages – the indexing stage, in which a knowledge base is prepared, and the querying stage, in which relevant context for a question is retreived from the knowledge base to assist the LLM in responding to that question. \n",
    "\n",
    "You will be defining the Vector Store, LLM and embedding model in Parts 1, 2, 3. These 3 crucial components will allow us to build the knowledge base and run the indexing stage in Part 4, as well as run the querying stage in Part 5."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Part 1: Initializing the Vector Store\n",
    "`MilvusVectorStore` represents a type of vector database. Vector databases index, store, and provide access to structured or unstructured data (e.g., text or images) alongside their respective embedding vectors. These embedding vectors are the representation of the outside data from which our RAG application will find relevant context to augment our queries to the LLM, also known as a knowledge base. Milvus supports different types of vector indexes, which represent how the database organizes and searches through embeddding vectors. [This article](https://milvus.io/docs/v2.0.x/index.md#Indexes-supported-in-Milvus) lists the different types of vector indexes that Milvus supports. Your first task here is to pick one vector index you believe is suitable to the data we have, and finish the initialization of the `vector_store` variable, filling in `index_params` and `search_params` fields to indicate your choice of vector index. You will evaluate the impact of different vector indexes later in the homework.\n",
    "\n",
    "Note: Please ensure that Milvus is running in the same directory. You will face timeout errors if Milvus is not running."
   ]
  },
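   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "For illustration, here is a minimal sketch of one possible configuration using the `IVF_FLAT` index. The parameter names follow the Milvus index documentation linked above; please double-check the exact shapes `MilvusVectorStore` expects for `index_params` and `search_params`, and adjust them for whichever index you actually choose.\n",
     "\n",
     "```python\n",
     "# Hypothetical example configuration -- not the required answer.\n",
     "vector_store = MilvusVectorStore(\n",
     "    index_params={\n",
     "        \"index_type\": \"IVF_FLAT\",   # cluster-based index; exact search within probed clusters\n",
     "        \"metric_type\": \"L2\",        # distance metric used to compare embedding vectors\n",
     "        \"params\": {\"nlist\": 128},   # number of clusters built at index time\n",
     "    },\n",
     "    search_params={\"nprobe\": 16},   # clusters probed per query (check the Milvus docs for this shape)\n",
     "    overwrite=True)\n",
     "```"
    ]
   },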
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.vector_stores import MilvusVectorStore\n",
    "\n",
    "############################################################################\n",
    "# TODO: Initialize MilvusVectorStore with a vector index of your choice.   #\n",
    "# Note: Some vector indexes may not require search_params.                 #\n",
    "############################################################################\n",
    "# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n",
    "\n",
    "vector_store = MilvusVectorStore(\n",
    "    index_params={\n",
    "        \"index_type\": \"\"\n",
    "    },\n",
    "    search_params={},\n",
    "    overwrite=True)\n",
    "\n",
    "# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n",
    "############################################################################\n",
    "#                             END OF YOUR CODE                             #\n",
    "############################################################################"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Part 2: Choosing the LLM\n",
    "Secondly, you need to initialize the LLM to be used in our RAG application. Please initialize your choice of LLM in the `llm` variable. Please refer to the [Llama-Index LLM module documentation](https://gpt-index.readthedocs.io/en/v0.8.4/core_modules/model_modules/llms/root.html#modules) to view which LLMs are supported by Llama-Index.\n",
    "\n",
    "Note: While this homework might seem geared towards using OpenAI models, you are free to use other LLM models (e.g. HuggingFace, Cohere) if you'd like. The LangChain LLM wrapper module in Llama-Index allows you to use any LLM supported by LangChain, which includes a much larger selection of LLM providers including Cohere."
   ]
  },
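   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "As an illustration, a minimal sketch using Llama-Index's OpenAI wrapper might look like the following; the model name and temperature here are example choices, not requirements.\n",
     "\n",
     "```python\n",
     "from llama_index.llms import OpenAI\n",
     "\n",
     "# Example only -- any LLM supported by Llama-Index (or via its LangChain wrapper) works here.\n",
     "llm = OpenAI(model=\"gpt-3.5-turbo\", temperature=0)\n",
     "```"
    ]
   },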
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "############################################################################\n",
    "# TODO:  Initialize an LLM of your choice.                                 #\n",
    "############################################################################\n",
    "# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n",
    "\n",
    "llm = None\n",
    "\n",
    "# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n",
    "############################################################################\n",
    "#                             END OF YOUR CODE                             #\n",
    "############################################################################"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Part 3: Choosing the Embedding Model\n",
    "Thirdly, you need to initialize your choice of embedding model for our RAG application. Your choice of embedding can have significant impact on LLM performance in this RAG application. As a starting point, note that Llama-Index provides access to all HuggingFace embedding models through the `HuggingFaceEmbeddings` class and all OpenAI embedding models through the `OpenAIEmbeddings` class. Initializing one of these models or another embedding model of your choice in `embed_model` below is sufficient for our prototype RAG model – you will be able to evaluate the impact of different embedding models later in the homework."
   ]
  },
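   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "For example, one possible starting point is a small sentence-transformers model through LangChain's `HuggingFaceEmbeddings` wrapper; the model name below is just one illustrative choice.\n",
     "\n",
     "```python\n",
     "from langchain.embeddings import HuggingFaceEmbeddings\n",
     "\n",
     "# Example only -- swap in any embedding model you prefer.\n",
     "embed_model = HuggingFaceEmbeddings(model_name=\"sentence-transformers/all-MiniLM-L6-v2\")\n",
     "```"
    ]
   },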
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.llms import OpenAI\n",
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "\n",
    "############################################################################\n",
    "# TODO: Initialize a model embedding of your choice.                       #\n",
    "############################################################################\n",
    "# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n",
    "\n",
    "embed_model = None\n",
    "\n",
    "# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n",
    "############################################################################\n",
    "#                             END OF YOUR CODE                             #\n",
    "############################################################################"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Part 4: Indexing Stage – Building the Knowledge Base\n",
    "This code below completes the indexing stage of a RAG application, which represents the preparation of a knowledge base. `StorageContext` and `ServiceContext` are Llama-Index abstractions that allow us to use our defined vector store, LLM and embedding model to create a `VectorStoreIndex`. This `VectorStoreIndex` is our knowledge base in the RAG application. Under the hood, `VectorStoreIndex` converts the our document collection (stored in `wikidocs`) into embedding vectors in the vector embedding space of our choice of embedding model. These embedding vectors which comprise our knowledge base can now be indexed and searched through efficiently to find the relevant context for LLM queries. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.storage.storage_context import StorageContext\n",
    "from llama_index import ServiceContext, VectorStoreIndex\n",
    "\n",
    "storage_context = StorageContext.from_defaults(vector_store = vector_store)\n",
    "service_context = ServiceContext.from_defaults(embed_model = embed_model, llm = llm)\n",
    "index = VectorStoreIndex.from_documents(wiki_docs,\n",
    "            service_context=service_context,\n",
    "            storage_context=storage_context)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Part 5: Querying Stage – Augmenting LLM queries with outside context\n",
    "This part represents the querying stage, which consists of retrieving relevant context from the knowledge base to assist the LLM in responding to a question. We first define a query engine in `query_engine`. A query engine is an end-to-end pipeline which takes a query, retreives relevant context from the knowledge base to augment the query, then sends the augmented query to the LLM and finally, returns the LLM's response. It additionally returns the reference context which was retrieved and passed to the LLM with the response.\n",
    "\n",
    "To identify the relevant context to augment the query with, when we add query a prompt with the query engine, it first converts the prompt into an embedding vector. Then, it runs similarity search of the prompt embedding vector with the knowledge base to identify similar vectors and build the relevant context that it believes the LLM needs to know to answer the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tenacity import retry, stop_after_attempt, wait_exponential\n",
    "\n",
    "query_engine = index.as_query_engine(top_k = 5)\n",
    "\n",
    "# adds exponential backoff for LLM queries\n",
    "@retry(stop=stop_after_attempt(10), wait=wait_exponential(multiplier=1, min=4, max=10))\n",
    "def call_query_engine(prompt):\n",
    "        return query_engine.query(prompt)\n",
    "\n",
    "for prompt in test_prompts:\n",
    "        print(f\"Prompt: {prompt}\")\n",
    "        print(f\"Response: {call_query_engine(prompt)}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 2: Initializing RAG Evaluation Metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we will be setting up three different evaluation metrics to evaluate the quality of our LLM's output using TruLens. These three evaluation metrics are groundedness, answer relevance and context relevance. We have initialized groundedness and answer relevance – following our example, you will need to initialize context relevance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below, we initialize the OpenAI-based TruLens feedback function class. A feedback function is how TruLens implements metrics which score the output of an LLM application by analyzing generated text as part of an LLM application. The `feedback.OpenAI` class contains a collection of these feedback functions to evaluate LLMs. In this homework, we will be using the two feedback functions, namely the answer relevance and context relevance metrics, from the `feedback.OpenAI`. The third, groundedness, is defined in its own feedback class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from trulens_eval import Tru, feedback\n",
    "from trulens_eval.feedback import Groundedness\n",
    "\n",
    "# init trulens\n",
    "tru = Tru()\n",
    "\n",
    "# Initialize OpenAI-based feedback function collection class\n",
    "openai_gpt35 = feedback.OpenAI(model_engine=\"gpt-3.5-turbo\")\n",
    "\n",
    "# Initialize groundedness class for the groundedness metric\n",
    "grounded = Groundedness(groundedness_provider=openai_gpt35)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Answer relevance, also called question-answer relevance, is best for measuring the relationship of the query/prompt to the final LLM response/answer. It measures the relevance of the final answer to the original question from 0 to 1. \n",
    "\n",
    "In TruLens, this flavor of relevance is particularly optimized for the following features:\n",
    "* Relevance requires adherence to the entire prompt.\n",
    "* Responses that don't provide a definitive answer can still be relevant\n",
    "* Admitting lack of knowledge and refusals are still relevant.\n",
    "* Feedback mechanism should differentiate between seeming and actual relevance.\n",
    "* Relevant but inconclusive statements should get increasingly high scores as they are more helpful for answering the query.\n",
    "\n",
    "In the code below, after initializing the Groundedness class, we initialize a `Feedback` class with the `openai_gpt35.relevance_with_cot_reasons` metric, which measures the relevance of the response to a prompt, while using chain of thought methodology and emitting the reasoning. \n",
    "Once initialized with the metric, the Feedback object requires that the input arguments on which the metric is to be applied are specified. For answer relevance, these input arguments would be the LLM query/prompt and the LLM response/answer. We can specify this by the use of TruLens selectors. The `.on_input_output()` selector at the end of the Feedback class initialization line selects the two arguments for answer relevance, where input refers the LLM query, and output refers to the LLM response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from trulens_eval import Feedback\n",
    "\n",
    "f_answer_relevance = Feedback(openai_gpt35.relevance_with_cot_reasons, name = \"Answer Relevance\").on_input_output()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Groundedness feedback function uses OpenAI LLMs to attempt to check if an LLM response is grounded in the contexts supplied with the query on a scale from 1 to 10. First, the information overlap or entailment between the each source in the supplied context and response is measured. The highest score between sources is then averaged and scaled from 0 to 1 to arrive at the final aggregated score.\n",
    "\n",
    "In the code below, after initializing the Groundedness class, we initialize a `Feedback` class with the `grounded.groundedness_measure_with_cot_reasons` metric, which tracks whether the supplied context sources (called context chunks in implementation) support each sentence in the response statement received from the LLM. In the case of groundedness, we need to check whether the LLM response was grounded in the relevant query context that was supplied from the knowledge base. Thus, our choice of LLM input has changed from the query itself to the supplied 'context chunks', while the choice of LLM output remains the same, since we're still using the LLM response to evaluate the metric. To signify this, we bind two different TruLens selectors like so: `.on(TruLlama.select_source_nodes().node.text).on_output()`. Here, `TruLlama.select_source_nodes().node.text` extracts the relevant context that was supplied with the query for use in the metric. Thus, the `.on(TruLlama.select_source_nodes().node.text)` selector here specifies to the Feedback class that the query context is the first metric input. The following `.on_output()` specifies that the LLM response is the second input to the metric. Finally, we use the `aggregate()` function to aggregate the scores which are generated over all context chunks in the supplied context with a groundness-specific aggregation function as input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from trulens_eval import TruLlama\n",
    "\n",
    "f_groundedness = Feedback(grounded.groundedness_measure_with_cot_reasons, name = \"Groundedness\").on(\n",
    "    TruLlama.select_source_nodes().node.text # this line grabs the context that was supplied with the query\n",
    ").on_output().aggregate(grounded.grounded_statements_aggregator)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Context relevance, sometimes known as question-statement relevance, is best for measuring the relationship of the the query/prompt to the supplied context selected from the knowledge base. It is measured as the average of relevance (from 0 to 1) for each context chunk supplied to the LLM query. \n",
    "\n",
    "In TruLens, this flavor of relevance is optimized for a slightly different set of features:\n",
    "* Relevance requires adherence to the entire query.\n",
    "* Long context with small relevant chunks are relevant.\n",
    "* Context that provides no answer can still be relevant.\n",
    "* Feedback mechanism should differentiate between seeming and actual relevance.\n",
    "* Relevant but inconclusive statements should get increasingly high scores as they are more helpful for answering the query.\n",
    "\n",
    "\n",
    "Following the examples of answer relevance and groundedness above, initialize context relevance in `f_context_relevance` using the Feedback class below. Use the appropriate metric that evaluates context relevance with chain of thought methodology and returns the reasons as well. Use the appropriate selectors and an aggregate function if required.\n",
    "\n",
    "API reference for TruLens Feedback functions is available [here](https://www.trulens.org/trulens_eval/api/feedback/)."
   ]
  },
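   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "As a hint, the selector pattern mirrors the two examples above: the first metric input is the query (selected with `.on_input()`), the second is each supplied context chunk, and the per-chunk scores are aggregated (e.g., with a mean). A hedged sketch of that pattern, assuming `qs_relevance_with_cot_reasons` is the metric you choose, might look like the following; confirm the metric name and selectors against the TruLens API reference.\n",
     "\n",
     "```python\n",
     "import numpy as np\n",
     "\n",
     "# Sketch only -- adapt the metric, selectors, and aggregation for your f_context_relevance answer.\n",
     "f_example = Feedback(openai_gpt35.qs_relevance_with_cot_reasons, name = \"Context Relevance\").on_input().on(\n",
     "    TruLlama.select_source_nodes().node.text  # each context chunk supplied with the query\n",
     ").aggregate(np.mean)  # average the relevance scores over all context chunks\n",
     "```"
    ]
   },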
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "############################################################################\n",
    "# TODO: Initialize f_context_relevance to measure the context relevance    #\n",
    "# between question and each context chunk.                                 #\n",
    "############################################################################\n",
    "# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n",
    "\n",
    "f_context_relevance = None\n",
    "\n",
    "# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n",
    "############################################################################\n",
    "#                             END OF YOUR CODE                             #\n",
    "############################################################################"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 3: Finding the best RAG app configuration using evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this final question, we'd like you to try and evaluate different configurations of the RAG application you built in Question 1 using the evaluation functions you initialized in Question 2, explaining your approach in developing different configurations and analyzing their performance.\n",
    "\n",
    "To evaluate our RAG application built using Llama-Index using TruLens, we can use TruLlama, a deep TruLens integration in Llama-Index. We set up TruLlama by wrapping our query engine from Question 1 and providing the three feedback functions from Question 2, along with some metadata to help differentiate between different configurations.\n",
    "\n",
    "You have free reign over the RAG application design to develop different RAG configurations. You can try different vector index types, embeddings, parameters, context chunk sizes (i.e. size of context chunks that augment the query) and LLMs. We have provided you with the code to iterate through a few of these parameters for different configurations below. \n",
    "\n",
    "If you would like, you are welcome to change out certain components or even add in additional components as you see fit, though this is not required.\n",
    "\n",
    "You will receive full credit if you complete the RAG application such that it uses all of the default components, includes the evaluations listed and evaluate and compare meaningfully different RAG configurations, with a reasonable written approach and analysis. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import itertools\n",
    "\n",
    "############################################################################\n",
    "# TODO: Try and evaluate different RAG configurations.                     #\n",
    "############################################################################\n",
    "# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n",
    "\n",
    "index_params = []\n",
    "embed_models = []\n",
    "context_chunk_sizes = [200, 500]  # feel free to try others\n",
    "\n",
    "top_ks = [1,3]\n",
    "\n",
    "for index_param, embed_model, top_k, chunk_size in itertools.product(\n",
    "    index_params, embed_models, top_ks, context_chunk_sizes\n",
    "    ):\n",
    "    embed_model_name = None\n",
    "    vector_store = None\n",
    "    llm = None\n",
    "    storage_context = StorageContext.from_defaults(vector_store = vector_store)\n",
    "    service_context = ServiceContext.from_defaults(embed_model = embed_model, llm = llm, chunk_size=chunk_size)\n",
    "    index = VectorStoreIndex.from_documents(wiki_docs,\n",
    "            service_context=service_context,\n",
    "            storage_context=storage_context)\n",
    "    \n",
    "    query_engine = index.as_query_engine(similarity_top_k = top_k)\n",
    "    \n",
    "    # Initialize a TruLlama wrapper to connect evaluation metrics with the query engine\n",
    "    tru_query_engine = TruLlama(query_engine,\n",
    "                    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],\n",
    "                    metadata={\n",
    "                        'index_param':index_param,\n",
    "                        'embed_model':embed_model_name,\n",
    "                        'top_k':top_k,\n",
    "                        'chunk_size':chunk_size\n",
    "                        })\n",
    "    \n",
    "    @retry(stop=stop_after_attempt(10), wait=wait_exponential(multiplier=1, min=4, max=10))\n",
    "    def call_tru_query_engine(prompt):\n",
    "        # we now send the prompt through the TruLlama-wrapped query engine\n",
    "        return tru_query_engine.query(prompt)\n",
    "    for prompt in test_prompts:\n",
    "        call_tru_query_engine(prompt)\n",
    "# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n",
    "############################################################################\n",
    "#                             END OF YOUR CODE                             #\n",
    "############################################################################"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **TODO:** Student Written Response\n",
    "\n",
    "##### Please explain your approach in developing different configurations and analyze their performance in a single paragraph below. You will also need to submit a screenshot of your TruLens dashboard with the scores of multiple configurations via Gradescope."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore the performance of different configurations in the TruLens dashboard"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tru.run_dashboard()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# tru.stop_dashboard() # stop if needed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, you can run `trulens-eval` from a command line in the same folder as this notebook to start the dashboard."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Or alternatively, view results directly in your notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tru.get_records_and_feedback(app_ids=[])[0] # pass an empty list of app_ids to get all"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 4: EXTRA CREDIT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **TODO:** Student Written Response\n",
    "\n",
    "##### Please provide three examples, one per metric, of custom adversarial prompts that negatively impact each of the three evaluation metrics. You will also need to submit screenshots of the Feedback function scores for each custom prompt. To see these scores, click on your configuration to reach the Evaluations page, and then click on the row corresponding to the custom prompt. In the explanation, please explain intuitively how you developed this prompt and discuss why it's an effective adversarial example (a few sentences is sufficient).\n",
    "\n",
    "#### 1. Adversarial Groundedness Example\n",
    "Prompt:\n",
    "\n",
    "Screenshot file name (in Gradescope):\n",
    "\n",
    "Explanation:\n",
    "\n",
    "\n",
    "#### 2. Adversarial Answer Relevance Example\n",
    "Prompt:\n",
    "\n",
    "Screenshot file name (in Gradescope):\n",
    "\n",
    "Explanation:\n",
    "\n",
    "\n",
    "#### 3. Adversarial Context Relevance Example\n",
    "Prompt:\n",
    "\n",
    "Screenshot file name (in Gradescope):\n",
    "\n",
    "Explanation:\n",
    "\n"
   ]
   }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
