Accurate documentation of procedural case logs is a critical component of radiology training, serving as the basis for competency assessment, credentialing, and regulatory compliance. However, current case logging workflows rely heavily on manual annotation by trainees—an error-prone and time-consuming process that imposes significant administrative burden and detracts from clinical learning. Traditional approaches, such as the crosswalk method that maps structured metadata to procedure labels, often fail to capture procedures documented only in free-text, resulting in incomplete logs and underreporting.
In this study, we explore the feasibility of automating procedural case log generation using large language models (LLMs). We evaluate two state-of-the-art LLMs, Qwen-2.5 (a local, open-source model) and Claude-3.5-Haiku (a commercial API model), under two prompting strategies: Instruction Prompting (IP) and Chain-of-Thought (CoT) reasoning. Across 39 radiological procedures, both models substantially outperform the crosswalk benchmark in sensitivity and F1-score while maintaining high specificity. Error analysis reveals that model performance varies with procedure complexity and linguistic ambiguity. Additionally, we quantify the cost of inference in terms of latency and token generation, and demonstrate that LLM-based solutions could save an estimated 30-40 hours of manual annotation time per resident annually. Our findings suggest that LLMs offer a scalable, cost-effective, and high-fidelity alternative to manual case logging, paving the way for intelligent documentation systems that support medical education while reducing clerical workload.
Our methodology integrates a structured pipeline for automating procedural case logs using large language models (LLMs). The architecture below illustrates the complete system—from unstructured radiology report ingestion to structured output generation and evaluation.
The process begins with a collection of radiology reports authored by residents during clinical practice. These reports describe diagnostic and interventional procedures using unstructured narrative text.
A data cleaning module is applied to each report to eliminate administrative blocks, signatures, disclaimers, and non-ASCII artifacts. The result is a cleaned report, which is used in two parallel downstream tasks (a minimal sketch of the cleaning step appears after the task description below):
The LLMs receive one prompt per procedure and return a structured JSON response with a binary label and a short justification.
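To make the cleaning step above concrete, here is a minimal sketch of how such a filter could look; the specific markers for signatures and disclaimers are illustrative assumptions, not the pipeline's actual rules.

```python
import re

def clean_report(text: str) -> str:
    """Illustrative cleaner: drop administrative lines and non-ASCII artifacts from a report."""
    # Remove non-ASCII artifacts left over from copy/paste or PDF extraction
    text = text.encode("ascii", errors="ignore").decode("ascii")
    kept = []
    for line in text.splitlines():
        # Skip administrative blocks such as signatures and disclaimers (hypothetical markers)
        if re.match(r"\s*(electronically signed by|dictated by|disclaimer)", line, flags=re.IGNORECASE):
            continue
        kept.append(line.rstrip())
    # Collapse blank runs left behind by removed blocks
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
```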
Prompt templates guide how the LLMs are queried. Each prompt strictly enforces a JSON-based output schema to ensure consistency and downstream parsability.
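The exact prompt wording used in the study is not reproduced here; the sketch below only illustrates the general shape of a per-procedure Instruction Prompting query and the JSON schema it enforces. The wording and field names are placeholders, not the verbatim template.

```python
# Illustrative Instruction Prompting (IP) template; the wording is a placeholder,
# not the verbatim prompt used in the study.
IP_TEMPLATE = """You are reviewing a radiology report for procedural case logging.

Procedure of interest: {procedure_name}

Report:
{report_text}

Question: Does the report document that this procedure was performed?
Answer with JSON only, in exactly this schema:
{{"label": 0 or 1, "justification": "<one short sentence citing the report>"}}
"""

def build_prompt(procedure_name: str, report_text: str) -> str:
    """One prompt is issued per (report, procedure) pair, as described above."""
    return IP_TEMPLATE.format(procedure_name=procedure_name, report_text=report_text)
```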
After inference, each LLM response is compared to the expert-annotated ground truth using the metrics reported below: sensitivity, specificity, and F1-score, computed from per-procedure TP, TN, FP, and FN counts.
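These follow the standard confusion-matrix definitions. The minimal sketch below (an illustration, not the repository's evaluation code) shows the computation and reproduces the benchmark "All" row from the results table as a sanity check.

```python
def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard detection metrics from a 2x2 confusion matrix, reported as percentages."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # share of true procedures detected
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # share of absent procedures correctly rejected
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * sensitivity / (precision + sensitivity) if (precision + sensitivity) else 0.0
    return {
        "sensitivity": round(100 * sensitivity, 2),
        "specificity": round(100 * specificity, 2),
        "f1_score": round(100 * f1, 2),
    }

# Benchmark "All" row: TP=451, TN=15364, FP=93, FN=238
# -> {'sensitivity': 65.46, 'specificity': 99.4, 'f1_score': 73.15}
print(evaluation_metrics(451, 15364, 93, 238))
```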
These evaluations reveal that LLMs significantly outperform rule-based baselines (e.g., crosswalk-based detection) in terms of accuracy and sensitivity, while maintaining cost-effectiveness in both local and commercial deployments.
The system is highly modular—supporting plug-and-play for new models, prompt types, or evaluation criteria. All LLM interactions are logged for transparency, auditability, and reproducibility, making the system viable for real-world deployment in clinical residency settings.
| Type | Model | Prompting Method | Modality | TP | TN | FP | FN | Sensitivity | Specificity | F1-Score |
|---|---|---|---|---|---|---|---|---|---|---|
| Benchmark | Cross-Walk | NA | All | 451 | 15364 | 93 | 238 | 65.46 | 99.40 | 73.15 |
| | | | VascularDiagnosis | 143 | 3065 | 23 | 81 | 63.84 | 99.26 | 73.33 |
| | | | VascularIntervention | 157 | 5906 | 38 | 109 | 59.02 | 99.36 | 68.11 |
| | | | NonVascularIntervention | 151 | 6393 | 32 | 48 | 75.88 | 99.50 | 79.06 |
| Local | Qwen-2.5:72B | IP | All | 649 | 15174 | 283 | 40 | 94.19 | 98.17 | 80.08 |
| | | | VascularDiagnosis | 219 | 3068 | 20 | 5 | 97.77 | 99.35 | 94.60 |
| | | | VascularIntervention | 247 | 5803 | 141 | 19 | 92.86 | 97.63 | 75.54 |
| | | | NonVascularIntervention | 183 | 6303 | 122 | 16 | 91.96 | 98.10 | 72.62 |
| | | CoT | All | 627 | 15326 | 131 | 62 | 91.00 | 99.15 | 86.66 |
| | | | VascularDiagnosis | 214 | 3071 | 17 | 10 | 95.54 | 99.45 | 94.07 |
| | | | VascularIntervention | 242 | 5868 | 76 | 24 | 90.98 | 98.72 | 82.88 |
| | | | NonVascularIntervention | 171 | 6387 | 38 | 28 | 85.93 | 99.41 | 83.82 |
| Commercial | Claude-3.5-Haiku | IP | All | 633 | 14961 | 496 | 56 | 91.87 | 96.79 | 69.64 |
| | | | VascularDiagnosis | 215 | 3067 | 21 | 9 | 95.98 | 99.32 | 93.48 |
| | | | VascularIntervention | 230 | 5737 | 207 | 36 | 86.47 | 96.52 | 65.43 |
| | | | NonVascularIntervention | 188 | 6157 | 268 | 11 | 94.47 | 95.83 | 57.41 |
| | | CoT | All | 613 | 15348 | 109 | 76 | 88.97 | 99.29 | 86.89 |
| | | | VascularDiagnosis | 210 | 3069 | 19 | 14 | 93.75 | 99.38 | 92.71 |
| | | | VascularIntervention | 228 | 5905 | 39 | 38 | 85.71 | 99.34 | 85.55 |
| | | | NonVascularIntervention | 175 | 6374 | 51 | 24 | 87.94 | 99.21 | 82.35 |
The table above summarizes the performance of different models across various modalities. The results indicate that both local and commercial LLMs significantly outperform the benchmark crosswalk method, particularly in terms of sensitivity and F1-score. The choice of prompting strategy also plays a crucial role in model performance, with Chain-of-Thought prompting generally yielding better results.
For a more detailed analysis, including error types and model-specific insights, please refer to the full paper.
The system needs a GPU to get faster responses from the models. The amount of VRAM required depends on the model you want to run. Here is an estimate:
git clone https://github.com/Nafiz43/PCL-Fetcher
conda create -n pcl-fetcher python=3.10
conda activate pcl-fetcher
pip install -r requirements.txt

Place the radiology reports in the `data` directory and name the ground-truth file `ground-truth.csv` to maintain consistency. Then launch inference with:

python3 01_run_llm.py --model_name=MODEL-NAME --prompting_method=PROMPTING-METHOD --reports_to_process=-1
Command breakdown:
- `--model_name`: Name of the model to run (e.g., `mixtral:8x7b-instruct-v0.1-q4_K_M`).
- `--prompting_method`: Prompting method to use. Two options are available: IP (Instruction Prompting) and CoT (Chain-of-Thought prompting).
- `--reports_to_process`: Number of reports to process. The default, -1, processes all reports.

Model responses are saved in the `local_chat_history` directory.
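For example, to process all reports with Qwen-2.5 under Chain-of-Thought prompting (assuming the model is available locally through Ollama under the tag qwen2.5:72b, which may differ in your setup):

python3 01_run_llm.py --model_name=qwen2.5:72b --prompting_method=CoT --reports_to_process=-1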
Evaluation is handled by the `run_evaluation.py` file. Set `file_containing_ground_truth` to point to the ground-truth data (e.g., `data/PCL_p.csv`). Then run:
python3 run_evaluation.py --reports_to_process=-1
Command breakdown:
- `--reports_to_process`: Number of reports to process. The default, -1, processes all reports.

Evaluation results are written to the `results/all_models.csv` file.

To build the Docker image, run:

sudo docker build -t pcl-container .
This will create a Docker image named pcl-container.
sudo docker run -it --rm --gpus=all pcl-container /bin/bash
This opens an interactive shell in the container and enables GPU access.
ollama serve &
This will run the Ollama server in the background.
To pull and run a model, e.g., llama3.2:latest, run:
ollama run llama3.2:latest
You can substitute this with any other model you wish to test. Once the model is installed and the server is up, you can run the same experiment commands used in the non-Docker version—no additional installation needed.
In a separate terminal on the host, find the running container's name with:

sudo docker ps

You can then copy the results directory out of the container using:
docker cp CONTAINER_NAME:/app/results /home/nikhan/Data/Case_Log_Data/Procedural-Case-Log
Replace CONTAINER_NAME with the actual container name from the previous step.
When you are finished, exit the container shell:

exit
The project is licensed under the Apache License 2.0. This permissive license allows you to use, modify, and distribute the software for both personal and commercial purposes, as long as you include proper attribution and comply with the terms outlined in the license.
Contributions are very welcome! If you'd like to add features, fix bugs, or improve the documentation, please feel free to fork the repository and create a pull request. Make sure your changes are well-documented and follow the project's coding standards.
We appreciate your interest in improving this project—thank you for helping make it better!
For high-level discussions, funding opportunities, or collaboration inquiries, please reach out to the project supervisor, Professor Vladimir Filkov (vfilkov@ucdavis.edu).
For technical questions, bug reports, or concerns regarding the codebase, please contact the main project maintainer, Nafiz Imtiaz Khan (nikhan@ucdavis.edu).
We're excited to hear from you!