SynthCoT-RAG

2024 | GitHub

A pipeline for improving classification using synthetic chain-of-thought reasoning and retrieval-augmented generation.

This project began as part of my submission for SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. The final pipeline included an ensemble with a Graph Neural Network inspired by EmoGraph. Unfortunately, I did not have time to create a proper submission, nor did I have any paper-writing experience at the time. The original competition repository can be found here.

My interest in the project was renewed after Google DeepMind released AlphaEvolve in 2025. AlphaEvolve follows a similar strategy of using LLMs to generate reasoning that improves performance on tasks. However, it uses a MAP-Elites-style evolutionary algorithm to improve model performance.

Overview

  1. Take a labeled training dataset (CSV format)
  2. Generate synthetic chain-of-thought explanations for each training example using an LLM
  3. Encode and store examples + reasoning in a FAISS vector database
  4. At inference time, retrieve the most semantically similar examples as few-shot context
  5. Prompt an LLM with retrieved reasoning to predict emotion labels
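
The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the CSV column names (`text`, `label`), the sample rows, and every function name are hypothetical. To keep the sketch dependency-free, a toy bag-of-words vector with brute-force cosine search stands in for the sentence encoder and the FAISS index, and `generate_reasoning` is a placeholder where a real pipeline would call an LLM.

```python
import csv
import io
import math
from collections import Counter

# Hypothetical training data; the column names "text" and "label" are assumptions.
TRAIN_CSV = """text,label
I can't stop smiling today,joy
I miss her so much it hurts,sadness
How dare they lie to me,anger
What a wonderful surprise party,joy
"""

def load_examples(csv_text):
    """Step 1: read labeled examples from CSV text."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def generate_reasoning(text, label):
    """Step 2 placeholder: a real pipeline would ask an LLM to explain the label."""
    return f"The text '{text}' expresses {label}."

def embed(text):
    """Toy bag-of-words vector; stands in for a sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(examples):
    """Step 3: store each example with its reasoning and vector (a plain list here, not FAISS)."""
    index = []
    for ex in examples:
        reasoning = generate_reasoning(ex["text"], ex["label"])
        index.append({**ex, "reasoning": reasoning, "vector": embed(ex["text"])})
    return index

def retrieve(index, query, k=2):
    """Step 4: return the k stored examples most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(neighbors, query):
    """Step 5: assemble a few-shot prompt; sending it to an LLM is left out."""
    shots = "\n\n".join(
        f"Text: {n['text']}\nReasoning: {n['reasoning']}\nLabel: {n['label']}"
        for n in neighbors
    )
    return f"{shots}\n\nText: {query}\nReasoning:"

query = "they lie to me all the time"
index = build_index(load_examples(TRAIN_CSV))
neighbors = retrieve(index, query)
prompt = build_prompt(neighbors, query)
```

The prompt ends with an open `Reasoning:` field so that the model would write its own chain of thought before committing to a label, mirroring the format of the retrieved examples.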