
EvidenceBot: A Privacy-Preserving, Customizable RAG-Based Tool for Enhancing Large Language Model Interactions

Department of Computer Science, University of California, Davis

Abstract

Large Language Models (LLMs) have become pivotal in reshaping the world by enabling advanced natural language processing tasks such as document analysis, content generation, and conversational assistance. Their ability to process and generate human-like text has unlocked unprecedented opportunities across domains such as healthcare, education, and finance. However, commercial LLM platforms face several limitations, including data privacy concerns, context size restrictions, lack of parameter configurability, and limited evaluation capabilities. These shortcomings hinder their effectiveness, particularly in scenarios involving sensitive information, large-scale document analysis, or the need for customized output. This underscores the need for a tool that combines the power of LLMs with enhanced privacy, flexibility, and usability.

To address these challenges, we present EvidenceBot, a local, Retrieval-Augmented Generation (RAG)-based solution designed to overcome the limitations of commercial LLM platforms. EvidenceBot enables secure and efficient processing of large document sets through its privacy-preserving RAG pipeline, which extracts and appends only the most relevant text chunks as context for queries. The tool allows users to experiment with hyperparameter configurations, optimizing model responses for specific tasks, and includes an evaluation module to assess LLM performance against ground truths using semantic and similarity-based metrics. By offering enhanced privacy, customization, and evaluation capabilities, EvidenceBot bridges critical gaps in the LLM ecosystem, providing a versatile resource for individuals and organizations seeking to leverage LLMs effectively.

System Architecture

Figure: System architecture of EvidenceBot

The architecture of the proposed pipeline is shown in the system diagram. The pipeline can process documents in various formats, including HTML, TXT, MD, PY, PDF, CSV, XLSX, and DOCX. To handle these different file types, it uses the LangChain document loader.
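For illustration, a minimal loader-dispatch sketch using LangChain's community loaders might look as follows; the extension-to-loader mapping here is an assumption, and EvidenceBot's actual mapping may differ:

    import os
    from langchain_community.document_loaders import (
        CSVLoader,
        Docx2txtLoader,
        PyPDFLoader,
        TextLoader,
    )

    # Map file extensions to loader classes (illustrative, not exhaustive).
    LOADER_MAP = {
        ".txt": TextLoader,
        ".md": TextLoader,
        ".py": TextLoader,
        ".pdf": PyPDFLoader,
        ".csv": CSVLoader,
        ".docx": Docx2txtLoader,
    }

    def load_document(path):
        """Pick a loader by file extension and return a list of Documents."""
        ext = os.path.splitext(path)[1].lower()
        loader_cls = LOADER_MAP.get(ext, TextLoader)  # fall back to plain text
        return loader_cls(path).load()

    docs = load_document("DATA_DOCUMENTS/example.pdf")  # hypothetical file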

Once loaded, the documents are divided into smaller text chunks (default size: 512 tokens); users can adjust the chunk size, thereby controlling the number of chunks created for a given set of documents. These text chunks are then converted into embeddings using a preselected embedding model.
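A chunking step consistent with this description could use LangChain's RecursiveCharacterTextSplitter; the overlap value below is an assumption, and note that this splitter counts characters unless configured with a token-based length function:

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # CHUNK_SIZE is user-configurable; 512 is the tool's default.
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_documents(docs)  # docs from the loading step above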

Following this, the pipeline leverages LangChain to generate semantic indexes for each embedding, facilitating the retrieval and ranking of relevant information based on context and meaning, rather than relying on simple keyword matching. These embeddings and their corresponding semantic indexes are stored in ChromaDB, a high-performance vector database optimized for efficient similarity search.
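As a sketch, embedding the chunks and indexing them in ChromaDB through LangChain might look like this; the embedding model named below is only an illustrative local choice:

    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import Chroma

    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2")  # illustrative
    db = Chroma.from_documents(chunks, embeddings, persist_directory="db")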

When a user submits a query, the pipeline converts it into an embedding using the same embedding model used for the documents. A semantic search is then performed within the vector database, returning the top k most relevant results; the user can specify the value of k within the tool. These relevant text snippets are appended to the user's query and passed to the language model, ensuring the model has the complete context required to generate an accurate response. To compile and run the language models, we use Ollama, an open-source platform that facilitates the deployment and management of local LLMs.
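Putting retrieval and generation together, a minimal end-to-end query sketch, assuming the official ollama Python client and the vector store built above (the model name and prompt template are illustrative):

    import ollama  # official Ollama Python client

    def answer(query, k=4):
        # Retrieve the top-k most similar chunks; the store embeds the
        # query with the same model used for the documents.
        hits = db.similarity_search(query, k=k)
        context = "\n\n".join(doc.page_content for doc in hits)
        prompt = f"Context:\n{context}\n\nQuestion: {query}"
        resp = ollama.chat(model="llama3",  # any locally pulled model
                           messages=[{"role": "user", "content": prompt}])
        return resp["message"]["content"]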

The EvidenceBot dashboard uses Streamlit, HTML, and CSS to create an intuitive, interactive user interface. Streamlit serves as the primary framework, enabling integration with Python logic and language models while managing user inputs, file uploads, and dynamic outputs. HTML structures key elements such as the navigation bar, input forms, and chat containers, while CSS enhances the visual design by setting background colors, adjusting container layouts, and ensuring responsiveness. Together, these tools provide a streamlined user experience.
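A stripped-down Streamlit sketch of this interaction loop (omitting the HTML/CSS styling and most controls) might look like:

    import streamlit as st

    st.title("EvidenceBot")  # header; the real app adds HTML/CSS styling
    query = st.text_input("Ask a question")
    if st.button("Submit") and query:
        st.markdown(answer(query))  # the RAG helper sketched above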

User Guide

Generating Individual Responses

By default, the application is accessible at localhost:8501 on the local machine. In this mode, users can provide a prompt and receive responses generated by a selected Large Language Model (LLM). Users can also adjust several model parameters, as described in the App Parameters section.

The model selector dynamically lists all LLMs installed locally. Upon query submission, the system builds a vector database from files placed in the DATA_DOCUMENTS folder. Relevant chunks are retrieved and provided as context to the model for response enhancement.

If any parameters are changed, the application should be restarted to regenerate the vector database accordingly.

Figure 1: Generate Individual Responses Functionality

Responses are logged in the History/log.csv file along with timestamps, original prompts, and source content.
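For illustration only, a logging helper consistent with the fields described above might look like this; the actual column layout of History/log.csv may differ:

    import csv
    from datetime import datetime

    def log_response(prompt, response, sources, path="History/log.csv"):
        """Append one interaction to the log (hypothetical column order)."""
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow(
                [datetime.now().isoformat(), prompt, response, sources])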

Figure 2: Generate Individual Responses Output

Generating Batch Responses

Selecting the Batch Question Mode enables processing a list of questions via an uploaded .csv file. Each row should contain a single question; the questions are processed sequentially by the model.
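A sketch of the batch loop under the stated one-question-per-row assumption (the file name is hypothetical):

    import csv

    with open("questions.csv", newline="") as f:
        questions = [row[0] for row in csv.reader(f) if row]

    for q in questions:
        response = answer(q)           # the RAG helper sketched earlier
        log_response(q, response, "")  # append to History/log.csv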

Figure 3: Generate Batch Responses Functionality

Progress is tracked during execution, and results are saved in History/log.csv.

Figure 4: Generate Batch Responses Output
Figure 5: Generate Batch Responses Output File Format

Evaluating Individual Responses

To evaluate an individual response, use the Evaluate mode at localhost:8502. Users must input both the reference and generated texts for comparison.

Figure 6: Evaluate Individual Responses Functionality

The tool outputs BLEU-4, ROUGE-L, BERTScore, and Cosine Similarity metrics with visualization support.
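These four metrics can be computed with common open-source libraries (nltk, rouge-score, bert-score, sentence-transformers); whether EvidenceBot uses these exact packages is an assumption, so treat the following as a sketch:

    from nltk.translate.bleu_score import sentence_bleu
    from rouge_score import rouge_scorer
    from bert_score import score as bert_score
    from sentence_transformers import SentenceTransformer, util

    def evaluate(reference, candidate):
        # BLEU-4: overlap of up to 4-grams between tokenized texts
        bleu = sentence_bleu([reference.split()], candidate.split(),
                             weights=(0.25, 0.25, 0.25, 0.25))
        # ROUGE-L: longest-common-subsequence F-measure
        rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)
        # BERTScore: token-level similarity from contextual embeddings
        _, _, f1 = bert_score([candidate], [reference], lang="en")
        # Cosine similarity of sentence embeddings (illustrative model)
        model = SentenceTransformer("all-MiniLM-L6-v2")
        emb = model.encode([reference, candidate])
        cosine = util.cos_sim(emb[0], emb[1]).item()
        return {"BLEU-4": bleu,
                "ROUGE-L": rouge["rougeL"].fmeasure,
                "BERTScore-F1": f1.item(),
                "CosineSim": cosine}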

Figure 7: Evaluate Individual Responses Output

Evaluating Batch Responses

Users can evaluate a batch of responses by uploading two .csv files: reference.csv (ground truth) and candidate.csv (model outputs).

Figure 8: Evaluate Batch Responses Input

The system visualizes metric comparisons across the dataset using bar plots.
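A sketch of the batch evaluation and bar-plot summary, assuming one text per row and no header in either file:

    import matplotlib.pyplot as plt
    import pandas as pd

    refs = pd.read_csv("reference.csv", header=None)[0]
    cands = pd.read_csv("candidate.csv", header=None)[0]
    scores = pd.DataFrame([evaluate(r, c) for r, c in zip(refs, cands)])
    scores.mean().plot.bar(title="Average metric scores")
    plt.tight_layout()
    plt.show()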

Figure 9: Evaluate Batch Responses Output

App Parameters

The app supports three RAG parameters and nine model-specific generation parameters; a usage sketch follows the parameter lists below.

RAG Parameters

  • Embedding_model_name: Identifier of the embedding model (e.g., openai/text-embedding-ada-002).
  • CHUNK_SIZE: Size of each text chunk in tokens (default: 512).
  • K: Number of top entries to retrieve from the vector store for context injection.

Model Parameters

  • temp: Controls randomness; higher = more diverse (default: 1.0).
  • top_p: Nucleus sampling threshold (default: 0.9).
  • top_k: Top-k sampling threshold (default: 40).
  • tfs_z: Tail free sampling filter (default: 1.0).
  • num_ctx: Max tokens used for context (default: 2048).
  • repeat_penalty: Penalizes token repetition (default: 1.1).
  • mirostat: Enables adaptive perplexity-based sampling (default: 0).
  • mirostat_eta: Learning rate for Mirostat adjustment (default: 0.1).
  • mirostat_tau: Target perplexity for Mirostat (default: 5.0).
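These names map onto the options accepted by Ollama's API (the dashboard's temp corresponds to Ollama's temperature key). A minimal sketch passing the defaults listed above; the model name is illustrative:

    import ollama

    options = {
        "temperature": 1.0, "top_p": 0.9, "top_k": 40, "tfs_z": 1.0,
        "num_ctx": 2048, "repeat_penalty": 1.1,
        "mirostat": 0, "mirostat_eta": 0.1, "mirostat_tau": 5.0,
    }
    resp = ollama.chat(model="llama3",  # any locally pulled model
                       messages=[{"role": "user", "content": "Hello"}],
                       options=options)
    print(resp["message"]["content"])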

Installation Manual

App Installation

To install the app, follow these steps:

  1. Install Miniconda on your computer (if not already installed). The following link can be used for installation: LINK.
  2. Clone the repo using git:
    git clone https://github.com/Nafiz43/EvidenceBot
  3. Create and activate a new virtual environment:
    conda create -n EvidenceBot python=3.10.0
    conda activate EvidenceBot
  4. Install all the requirements:
    pip install -r requirements.txt
  5. Install Ollama from the following LINK.
  6. Install models using Ollama:
    ollama pull MODEL_NAME
    Replace MODEL_NAME with your desired model name. A list of all available models can be found at this link.
  7. Open the source directory, where the source code exists, and place the documents you want to analyze in the DATA_DOCUMENTS folder.
  8. To run the app, open a CLI (Command Line Interface) in the project directory and execute the following command:
    sh command.sh

Minimum System Requirements

  • Processor: Modern multi-core CPU (8+ cores)
  • RAM: 32GB minimum
  • Storage: 20GB for application code and dependencies
  • GPU: CUDA-compatible GPU with 8GB+ VRAM

The amount of VRAM required depends on the model you want to run. Here is an estimate:

  • A 7B model requires ~4 GB
  • A 13B model requires ~8 GB
  • A 30B model requires ~16 GB
  • A 65B model requires ~32 GB

Note: While lower configurations are viable, performance may be compromised, leading to longer execution times and potential system slowdowns.

Cloud Deployment Alternative

If deploying to cloud infrastructure:

  • Standard virtual machine with 8+ vCPUs
  • 32GB RAM
  • GPU acceleration if available

Docker Installation Manual

Download the Dockerfile from the given link. Then use the following commands to build and run the Docker container.

Build and Run Docker Container

1. Build the Docker image:

docker build -t evidencebot .

2. Run the Docker container and expose port 8501:

docker run -it --rm -p 8501:8501 -v $(pwd)/DATA_DOCUMENTS:/app/DATA_DOCUMENTS evidencebot

This command:

  • Maps local port 8501 to the container's port 8501
  • Mounts the DATA_DOCUMENTS folder so your documents are accessible inside the container
  • Removes the container after it exits (--rm)

Acknowledgements

This research was supported by the National Science Foundation under Grant No. 2020751, as well as by the Alfred P. Sloan Foundation through the OSPO for UC initiative (Award No. 2024-22424).

License

The EvidenceBot project is licensed under the Apache License 2.0. This permissive license allows you to use, modify, and distribute the software for both personal and commercial purposes, as long as you include proper attribution and comply with the terms outlined in the license.

Contributing

Contributions are very welcome! If you'd like to add features, fix bugs, or improve the documentation, please feel free to fork the repository and create a pull request. Make sure your changes are well-documented and follow the project's coding standards.

We appreciate your interest in improving this project—thank you for helping make it better!

Contact

For high-level discussions, funding opportunities, or collaboration inquiries, please reach out to the project supervisor, Professor Vladimir Filkov (vfilkov@ucdavis.edu).

For technical questions, bug reports, or concerns regarding the codebase, please contact the project lead, Nafiz Imtiaz Khan (nikhan@ucdavis.edu).

We're excited to hear from you!

BibTeX

@inproceedings{khan2025evidencebot,
  author    = {Nafiz Imtiaz Khan and Vladimir Filkov},
  title     = {EvidenceBot: A Privacy-Preserving, Customizable RAG-Based Tool for Enhancing Large Language Model Interactions},
  booktitle = {Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering (FSE Companion '25)},
  year      = {2025},
  doi       = {10.1145/3696630.3728607},
  isbn      = {979-8-4007-1276-0},
  location  = {Trondheim, Norway},
  publisher = {ACM},
  address   = {New York, NY, USA}
}