Intelligent Documentation in Medical Education: Can AI Replace Manual Case Logging?

1 Department of Computer Science, University of California, Davis, CA, US
2 Department of Radiology, University of California, Davis, CA, US

Abstract

Accurate documentation of procedural case logs is a critical component of radiology training, serving as the basis for competency assessment, credentialing, and regulatory compliance. However, current case logging workflows rely heavily on manual annotation by trainees—an error-prone and time-consuming process that imposes significant administrative burden and detracts from clinical learning. Traditional approaches, such as the crosswalk method that maps structured metadata to procedure labels, often fail to capture procedures documented only in free text, resulting in incomplete logs and underreporting.

In this study, we explore the feasibility of automating procedural case log generation using large language models (LLMs). We evaluate two state-of-the-art LLMs, Qwen-2.5 (a local, open-source model) and Claude-3.5-Haiku (a commercial API model), under two prompting strategies: Instruction Prompting and Chain-of-Thought reasoning. Across 39 radiological procedures, both models substantially outperform the crosswalk benchmark in sensitivity and F1-score while maintaining high specificity. Error analysis reveals that model performance varies with procedure complexity and linguistic ambiguity. Additionally, we quantify the cost of inference in terms of latency and token generation, and estimate that LLM-based solutions could save 30-40 hours of manual annotation time per resident annually. Our findings suggest that LLMs offer a scalable, cost-effective, and high-fidelity alternative to manual case logging, paving the way for intelligent documentation systems that support medical education while reducing clerical workload.

Methodology

Our methodology follows a structured pipeline for automating procedural case logs using large language models (LLMs). The architecture below illustrates the complete system, from unstructured radiology report ingestion to structured output generation and evaluation.

[Figure: Methodology architecture diagram]

The process begins with a collection of radiology reports authored by residents during clinical practice. These reports describe diagnostic and interventional procedures using unstructured narrative text.

A data cleaning module is applied to each report to eliminate administrative blocks, signatures, disclaimers, and non-ASCII artifacts; a minimal cleaning sketch appears after the list below. The result is a cleaned report, which is used in two parallel downstream tasks:

  • Manual Annotation: Expert annotators read the reports and label them for 39 predefined procedures. These form the ground truth labels.
  • LLM-Based Inference: The cleaned reports are processed by instruction-tuned models (e.g., Qwen-2.5, Claude-3.5-Haiku) using two prompt strategies—Instruction Prompting (IP) and Chain-of-Thought (CoT).

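As a rough illustration of the cleaning step, the sketch below strips a few boilerplate patterns and non-ASCII artifacts from a raw report. The regular expressions are illustrative assumptions, not the repository's actual cleaning rules.

    import re

    # Illustrative markers for administrative blocks; the project's actual rules may differ.
    BOILERPLATE_PATTERNS = [
        r"(?im)^\s*electronically signed by.*$",     # signature lines
        r"(?im)^\s*this report is confidential.*$",  # disclaimer lines
        r"(?im)^\s*dictated by.*$",                  # administrative metadata
    ]

    def clean_report(raw_text: str) -> str:
        """Return a cleaned radiology report: boilerplate removed, non-ASCII stripped."""
        text = raw_text
        for pattern in BOILERPLATE_PATTERNS:
            text = re.sub(pattern, "", text)
        # Drop non-ASCII artifacts introduced by dictation software or PDF export.
        text = text.encode("ascii", errors="ignore").decode("ascii")
        # Collapse the blank lines left behind by the removals.
        return re.sub(r"\n{3,}", "\n\n", text).strip()
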
The LLMs receive one prompt per procedure and return a structured JSON response with a binary label and a short justification.

Prompting Strategies

The following templates guide how the LLMs are queried. Each prompt strictly enforces a JSON-based output schema to ensure consistency and downstream parsability.

Prompt Template for Instruction Prompting (IP)
    
I will provide you with a radiology report, followed by several questions about it. Your task is to determine whether a specific radiology study or procedure was performed. Please follow these strict formatting guidelines for your response:

Output must be in valid JSON format with the following keys:

{
  "reason_for_the_label": "A string explaining the reasoning behind the classification.",
  "label": 1 or 0
}

Labeling criteria:
- Return 1 if the radiology study or procedure was explicitly mentioned as performed.
- Return 0 if the study or procedure was not performed, not documented, or uncertain in the report.
- Do not include any additional text or explanations outside the JSON response.
- Ensure strict adherence to this format for every response.
    
  
Prompt Template for Chain-of-Thought (CoT) Prompting
    
I will provide you with a radiology report, followed by a question about whether a specific radiology study or procedure was performed.

<procedure_specific_question>

### Strict Output Format
Your response must be a valid JSON object with the following keys:
{
  "reason_for_the_label": "A concise explanation justifying the classification.",
  "label": 1 or 0
}
    
  
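To make the inference loop concrete, here is a minimal sketch that fills a condensed stand-in for the instruction template above with one procedure-specific question and parses the model's JSON reply. The abbreviated template text and the query_llm callable are placeholder assumptions, not the project's exact prompt wording or client code; in practice the call would go through Ollama for local models or the Anthropic API for Claude.

    import json

    # Condensed stand-in for the Instruction Prompting template shown above.
    IP_TEMPLATE = (
        "I will provide you with a radiology report, followed by a question about it. "
        "Determine whether the named procedure was performed and respond with valid JSON "
        "containing 'reason_for_the_label' and 'label' (1 or 0).\n\n"
        "Report:\n{report}\n\nWas the following procedure performed: {procedure}?"
    )

    def classify_procedure(report: str, procedure: str, query_llm) -> dict:
        """Send one procedure-specific prompt and return the parsed JSON verdict."""
        prompt = IP_TEMPLATE.format(report=report, procedure=procedure)
        raw_response = query_llm(prompt)  # placeholder for an Ollama or Anthropic API call
        try:
            parsed = json.loads(raw_response)
            label = int(parsed.get("label", 0))
            reason = str(parsed.get("reason_for_the_label", ""))
        except (json.JSONDecodeError, ValueError, TypeError, AttributeError):
            # Treat malformed output conservatively as "not performed".
            label, reason = 0, "unparseable model response"
        return {"procedure": procedure, "label": label, "reason": reason}

Running this once per report-procedure pair yields the binary labels that are later compared against the expert annotations.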

After inference, each LLM response is compared to the expert-annotated ground truth using the following metrics (a minimal computation sketch follows the list):

  • Sensitivity: Ability to correctly detect procedures present in the report.
  • Specificity: Ability to correctly ignore procedures not present.
  • F1-Score: Harmonic mean of precision and recall.
  • Inference Time: Latency per question-response cycle.
  • Token Count: Proxy for verbosity and computational cost.
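
The first three metrics can be computed directly from aggregated confusion-matrix counts, as in the minimal sketch below; the function names are illustrative, not the project's evaluation code.

    def sensitivity(tp: int, fn: int) -> float:
        """True positive rate: share of performed procedures that were detected."""
        return tp / (tp + fn) if (tp + fn) else 0.0

    def specificity(tn: int, fp: int) -> float:
        """True negative rate: share of absent procedures correctly ignored."""
        return tn / (tn + fp) if (tn + fp) else 0.0

    def f1_score(tp: int, fp: int, fn: int) -> float:
        """Harmonic mean of precision and recall, written in terms of confusion counts."""
        denominator = 2 * tp + fp + fn
        return 2 * tp / denominator if denominator else 0.0

    # Example using the crosswalk benchmark counts from the results table (All modalities):
    # sensitivity(451, 238) ~= 0.6546, specificity(15364, 93) ~= 0.9940, f1_score(451, 93, 238) ~= 0.7315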

These evaluations reveal that LLMs significantly outperform rule-based baselines (e.g., crosswalk-based detection) in terms of accuracy and sensitivity, while maintaining cost-effectiveness in both local and commercial deployments.

The system is highly modular—supporting plug-and-play for new models, prompt types, or evaluation criteria. All LLM interactions are logged for transparency, auditability, and reproducibility, making the system viable for real-world deployment in clinical residency settings.
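
One plausible shape for such an audit trail is a JSON-lines file with one record per model call, as in the sketch below. The field names and file layout are assumptions for illustration only; the repository itself writes model responses to the local_chat_history directory.

    import json
    import time
    from pathlib import Path

    def log_interaction(log_dir: str, model: str, prompt: str, response: str) -> None:
        """Append one LLM call as a JSON line so runs can be audited and reproduced."""
        record = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "model": model,
            "prompt": prompt,
            "response": response,
        }
        path = Path(log_dir) / "interactions.jsonl"
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")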

Results

Performance Comparison of Models Across Modalities
| Type       | Model            | Prompting Method | Modality                | TP  | TN    | FP  | FN  | Sensitivity (%) | Specificity (%) | F1-Score (%) |
|------------|------------------|------------------|-------------------------|-----|-------|-----|-----|-----------------|-----------------|--------------|
| Benchmark  | Cross-Walk       | NA               | All                     | 451 | 15364 | 93  | 238 | 65.46           | 99.40           | 73.15        |
| Benchmark  | Cross-Walk       | NA               | VascularDiagnosis       | 143 | 3065  | 23  | 81  | 63.84           | 99.26           | 73.33        |
| Benchmark  | Cross-Walk       | NA               | VascularIntervention    | 157 | 5906  | 38  | 109 | 59.02           | 99.36           | 68.11        |
| Benchmark  | Cross-Walk       | NA               | NonVascularIntervention | 151 | 6393  | 32  | 48  | 75.88           | 99.50           | 79.06        |
| Local      | Qwen-2.5:72B     | IP               | All                     | 649 | 15174 | 283 | 40  | 94.19           | 98.17           | 80.08        |
| Local      | Qwen-2.5:72B     | IP               | VascularDiagnosis       | 219 | 3068  | 20  | 5   | 97.77           | 99.35           | 94.60        |
| Local      | Qwen-2.5:72B     | IP               | VascularIntervention    | 247 | 5803  | 141 | 19  | 92.86           | 97.63           | 75.54        |
| Local      | Qwen-2.5:72B     | IP               | NonVascularIntervention | 183 | 6303  | 122 | 16  | 91.96           | 98.10           | 72.62        |
| Local      | Qwen-2.5:72B     | CoT              | All                     | 627 | 15326 | 131 | 62  | 91.00           | 99.15           | 86.66        |
| Local      | Qwen-2.5:72B     | CoT              | VascularDiagnosis       | 214 | 3071  | 17  | 10  | 95.54           | 99.45           | 94.07        |
| Local      | Qwen-2.5:72B     | CoT              | VascularIntervention    | 242 | 5868  | 76  | 24  | 90.98           | 98.72           | 82.88        |
| Local      | Qwen-2.5:72B     | CoT              | NonVascularIntervention | 171 | 6387  | 38  | 28  | 85.93           | 99.41           | 83.82        |
| Commercial | Claude-3.5-Haiku | IP               | All                     | 633 | 14961 | 496 | 56  | 91.87           | 96.79           | 69.64        |
| Commercial | Claude-3.5-Haiku | IP               | VascularDiagnosis       | 215 | 3067  | 21  | 9   | 95.98           | 99.32           | 93.48        |
| Commercial | Claude-3.5-Haiku | IP               | VascularIntervention    | 230 | 5737  | 207 | 36  | 86.47           | 96.52           | 65.43        |
| Commercial | Claude-3.5-Haiku | IP               | NonVascularIntervention | 188 | 6157  | 268 | 11  | 94.47           | 95.83           | 57.41        |
| Commercial | Claude-3.5-Haiku | CoT              | All                     | 613 | 15348 | 109 | 76  | 88.97           | 99.29           | 86.89        |
| Commercial | Claude-3.5-Haiku | CoT              | VascularDiagnosis       | 210 | 3069  | 19  | 14  | 93.75           | 99.38           | 92.71        |
| Commercial | Claude-3.5-Haiku | CoT              | VascularIntervention    | 228 | 5905  | 39  | 38  | 85.71           | 99.34           | 85.55        |
| Commercial | Claude-3.5-Haiku | CoT              | NonVascularIntervention | 175 | 6374  | 51  | 24  | 87.94           | 99.21           | 82.35        |

The table above summarizes the performance of different models across modalities. Both local and commercial LLMs substantially outperform the benchmark crosswalk method, particularly in sensitivity and F1-score. The choice of prompting strategy also matters: Chain-of-Thought prompting generally yields higher specificity and F1-scores, while Instruction Prompting achieves slightly higher sensitivity.

For a more detailed analysis, including error types and model-specific insights, please refer to the full paper.

Installation Manual

System Requirements

A GPU is recommended for faster responses from the models. The amount of VRAM required depends on the model you want to run. Rough estimates:

  • 7B model requires ~4 GB
  • 13B model requires ~8 GB
  • 30B model requires ~16 GB
  • 65B model requires ~32 GB

Installation

  1. Clone the repository:
    git clone https://github.com/Nafiz43/PCL-Fetcher
  2. Make sure Miniconda is installed (see the official Miniconda installation guide).
  3. Create a conda environment with Python 3.10:
    conda create -n pcl-fetcher python=3.10
  4. Activate the conda environment:
    conda activate pcl-fetcher
  5. Install all required packages:
    pip install -r requirements.txt
  6. Install Ollama:
    Visit https://ollama.com/ and follow the installation instructions.
  7. Download the Ollama models you intend to use locally.

Running the Project

  1. Place your annotated ground-truth dataset inside the data directory.
  2. Rename the dataset to ground-truth.csv to maintain consistency.
  3. Run the following command to derive responses from LLMs:
    python3 01_run_llm.py --model_name=MODEL-NAME --prompting_method=PROMPTING-METHOD --reports_to_process=-1
    Command breakdown:
    • --model_name: Name of the model to run (e.g., mixtral:8x7b-instruct-v0.1-q4_K_M)
    • --prompting_method: Prompting method to use. Two options are available: IP (Instruction Prompting) and CoT (Chain-of-Thought prompting).
    • --reports_to_process: Number of reports to process; the default value of -1 processes all reports.
    LLM responses will be stored in the local_chat_history directory.

Calculating the Evaluation Metrics

  1. Open the run_evaluation.py file.
  2. Update the value of the variable:
    file_containing_ground_truth – should point to ground truth data (e.g., 'data/PCL_p.csv').
  3. Run the following command in the shell:
    python3 run_evaluation.py --reports_to_process=-1
    Command breakdown:
    • --reports_to_process: Number of reports to process; the default value of -1 processes all reports.
  4. Results will be stored in the results/all_models.csv file.

Docker Setup Guide

Docker Commands

  1. Download the Dockerfile:
    Get it from the project repository (https://github.com/Nafiz43/PCL-Fetcher).
  2. Build the Docker image:
    sudo docker build -t pcl-container .
    This will create a Docker image named pcl-container.
  3. Run the Docker container with GPU access:
    sudo docker run -it --rm --gpus=all pcl-container /bin/bash
    This opens an interactive shell in the container and enables GPU access.
  4. Start the Ollama server inside the container:
    ollama serve &
    This will run the Ollama server in the background.
  5. Install and launch your desired LLM model:
    For example, to install and run llama3.2:latest, run:
    ollama run llama3.2:latest
    You can substitute this with any other model you wish to test. Once the model is installed and the server is up, you can run the same experiment commands used in the non-Docker version—no additional installation needed.
  6. Copy results from the container to your local machine:
    1. Open a new terminal.
    2. Check the name of the running container:
      sudo docker ps
    3. Copy the results directory using:
      docker cp CONTAINER_NAME:/app/results /path/to/local/destination
      Replace CONTAINER_NAME with the actual container name from the previous step, and /path/to/local/destination with a directory on your machine.
  7. Exit the Docker container:
    exit

License

The project is licensed under the Apache License 2.0. This permissive license allows you to use, modify, and distribute the software for both personal and commercial purposes, as long as you include proper attribution and comply with the terms outlined in the license.

Contributing

Contributions are very welcome! If you'd like to add features, fix bugs, or improve the documentation, please feel free to fork the repository and create a pull request. Make sure your changes are well-documented and follow the project's coding standards.

We appreciate your interest in improving this project—thank you for helping make it better!

Contact

For high-level discussions, funding opportunities, or collaboration inquiries, please reach out to the project supervisor, Professor Vladimir Filkov (vfilkov@ucdavis.edu).

For technical questions, bug reports, or concerns regarding the codebase, please contact the main project maintainer, Nafiz Imtiaz Khan (nikhan@ucdavis.edu).

We're excited to hear from you!