Personal RAG for the Bac de Français 2026: Build Your Own AI Study Assistant in Six Weeks

Gerald Steiner

24 May 2026, 9 min read

Ask ChatGPT for a precise quote from Manon Lescaut on the theme of social determinism, and there is a greater than 50% probability you will get a phrasing that does not appear in Prévost's text. The model produces something plausible, stylistically apt, thematically consistent — and yet false. This drift, which specialists call "hallucination," is not a marginal bug; it is the direct consequence of an LLM (large language model) trained on general-purpose corpora, without any grounding in your edition, your class notes, your past exam papers.

RAG — Retrieval-Augmented Generation — structurally corrects this problem. Rather than letting the model improvise from its statistical memory, you first feed it the relevant passages extracted from your own documents, then ask it to write from those passages. The difference is not cosmetic: it is the difference between a witness who invents and a witness who reads from their statement.

This guide is addressed to candidates sitting the Bac de Français 2026 — native speakers and FLE learners alike — who wish to build a reliable, citable, corpus-specific personal study assistant. The central argument is this: learners of French as a foreign language hold a structural advantage in building and using such a system, because the demands of lexical precision and source-based justification they have been trained to meet correspond exactly to the skills that a quality RAG requires.

Why a Personal RAG Outperforms Generic LLMs for the Bac 2026

A generalist LLM is a remarkable tool for brainstorming, rephrasing, or exploring ideas. It becomes dangerous the moment you ask it to quote, date, or attribute with precision. The Bac de Français penalises exactly these imprecisions — an approximate quotation in a commentary can invalidate an entire analysis. A personal RAG is the architectural answer to this requirement.

The Problem of Factual Drift in LLMs Without Context

Large language models build their responses through statistical prediction: each word is chosen because it plausibly follows the preceding words, given the training corpus. For literary works, this mechanism produces plausible paraphrases, mixed attributions, and slightly altered lines. A model that has processed millions of commentaries on Les Fleurs du mal can quite easily reproduce a stanza of Baudelaire with two words out of place. The examiner, however, knows the original text. The penalty is immediate.

What a Well-Built RAG Additionally Guarantees

A correctly configured RAG system operates in two distinct phases. First, a retrieval phase: when you ask a question, the system calculates the similarity between your query and the fragments of your documents (class notes, works, study sheets), which have previously been transformed into numerical vectors, that is, mathematical representations capturing meaning. It selects the most relevant passages. Then, a generation phase: the LLM receives these passages as context and is instructed to draft its response from them alone. The result: the response is traceable, and you can verify each claim against the original source.
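
The two phases can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: to stay dependency-free it scores fragments by simple word overlap (Jaccard similarity) instead of embedding vectors, and the function names, the fragment dictionaries, and the prompt wording are all assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, fragments, k=3):
    """Retrieval phase: rank fragments by word overlap with the query.

    Word overlap stands in for vector similarity here (an assumption
    made to keep the sketch self-contained).
    """
    q = tokenize(query)
    scored = []
    for frag in fragments:
        f = tokenize(frag["text"])
        union = q | f
        score = len(q & f) / len(union) if union else 0.0
        scored.append((score, frag))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [frag for score, frag in scored[:k] if score > 0]

def build_prompt(query, passages):
    """Generation phase: assemble the context the LLM is asked to rely on."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    return (
        "Answer using only the passages below and cite the source tag "
        "of each claim.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}"
    )

# Hypothetical fragments standing in for indexed class notes.
fragments = [
    {"source": "notes-cleves",
     "text": "Internal focalization shows the conflict between passion and virtue."},
    {"source": "notes-hugo",
     "text": "Time and mourning structure Les Contemplations."},
]
hits = retrieve("conflict between passion and virtue", fragments, k=1)
prompt = build_prompt("How is the conflict between passion and virtue built?", hits)
```

The `prompt` string is what would be sent to the generation model. Because each passage carries a source tag, every claim in the answer can be traced back to a file you control.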

Specific Advantage for FLE Learners

The FLE learner has been trained to identify collocations, to justify each lexical choice, and to distinguish registers and language levels. These habits are precisely those that make a RAG useful: knowing how to formulate a precise query ("the extended metaphor of the sea in Du Bellay's sonnets" rather than "Du Bellay sea"), knowing how to assess the relevance of a retrieved passage, knowing how to recognise when a paraphrase betrays a meaning. The native speaker, less accustomed to metalinguistic explanation, will need to develop these reflexes; the FLE learner already possesses them.

The Six Source Families to Index for the Bac de Français

The quality of a RAG depends above all on the quality of the documents entrusted to it. Indexing mediocre sources — illegible scans, summaries of summaries, unrevised study sheets — amounts to building a search engine on a corrupted corpus. Here are the six indispensable families for a solid Bac de Français corpus.

Complete set texts — clean text, verified OCR
Class notes, study sheets, and personal reading journal
Official past exam papers and model answers
Related works for cursory reading and broader context
Critical lexicon and glossary of literary devices
Personal work — drafts, essays, teacher feedback

Set Texts and Related Corpus

The text of the set works is the cornerstone of the index. It must be clean — free of OCR errors, without abbreviations, with the original punctuation respected — because the vector similarity engine is sensitive to tokenisation errors. For out-of-copyright works, the Wikisource website or the BnF's Gallica editions provide reliable texts. For recent works, a quality personal scan will suffice, provided a spell check is run. Related works (cursory reading texts, texts studied as part of an associated pathway) complete the index by adding the intertextual networks that the jury expects candidates to mobilise.

Class Notes, Study Sheets, and Personal Reading Journal

Your class notes, revision sheets, and reading journal constitute the most personal, and often the most valuable, layer of the index. They encode your reading of the text, the angles retained by your teacher, and the central questions (problématiques) developed in class. A RAG that does not integrate them will return generic analyses that the examiner has read a hundred times. To make them indexable, convert your handwritten notes to digital text (voice dictation or quality OCR) and read them through to correct errors before ingestion.

Past Papers and Personal Work

The official Bac de Français past papers — available on the Eduscol website — constitute an irreplaceable training corpus. They reveal the expected phrasings, the structure of questions, and the distribution of works. Reference model answers allow the RAG to provide you with reasoning models when you ask a methodological question. Your own work — dissertation drafts, annotated commentaries, rewritten introductions — adds a reflective dimension: you can query the index about your own recurring errors and receive advice calibrated to your actual profile.

A Minimal and Accessible Technical Stack in 2026

Building a personal RAG does not require advanced programming skills or a dedicated server. In 2026, accessible tools make it possible to set up a functional system within a few hours on a standard laptop. The three essential components are: an embedding model, a vector database, and a generation LLM.

Choosing an Embedding Model Suited to French

An embedding is a mathematical function that transforms a text fragment into a high-dimensional numerical vector (typically 768 to 1536 values) such that semantically similar texts produce vectors that are close in that space. For literary French, generalist models trained mainly on English tend to underperform on syntactic subtleties and elevated or classical registers. Prefer a multilingual model such as sentence-transformers/paraphrase-multilingual-mpnet-base-v2 or a French-specific model such as dangvantuan/sentence-camembert-large, both available on Hugging Face and both able to run locally. The choice of embedding model directly determines the quality of the retrieved passages: it is the component that shapes the rest of the system most.
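
To make "close in that space" concrete, here is a minimal sketch of cosine similarity, the measure most commonly used to compare embedding vectors. The three-value vectors are invented for illustration only; real embeddings have hundreds of dimensions and come from the model, not from hand-written numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (hypothetical numbers).
sea_metaphor = [0.9, 0.1, 0.2]
ocean_imagery = [0.8, 0.2, 0.3]
exam_schedule = [0.1, 0.9, 0.1]

close = cosine_similarity(sea_metaphor, ocean_imagery)
far = cosine_similarity(sea_metaphor, exam_schedule)
```

Here `close` comes out much higher than `far`: the two fragments about maritime imagery point in nearly the same direction, while the unrelated one does not. Retrieval amounts to computing this score between the query vector and every stored fragment, then keeping the top results.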

Choosing a Simple Vector Database — Qdrant, Chroma, or pgvector

The vector database is the storage and search engine that holds your embeddings and responds to similarity queries. For personal and local use, three options dominate in 2026. Chroma is the simplest to deploy: a Python library, zero server configuration, ideal for getting started. Qdrant offers better performance on collections of several thousand fragments and comes with a web visualisation interface. pgvector extends PostgreSQL with vector capabilities: preferable if you already manage a relational database. For a Bac de Français corpus — rarely exceeding 50,000 fragments — Chroma is entirely sufficient.

The Generation LLM — Local Ollama or Cloud API

The generation LLM is the model that drafts the final response from the retrieved passages. Two approaches compete here. Locally, Ollama allows you to run models like Mistral 7B, Qwen3, or Llama 3 directly on your machine, without sending data to a third party, which is a real sovereignty advantage if you are indexing personal work. Via cloud API, Claude Sonnet or GPT-4o offer superior generation quality for literary French, at the cost of usage-based fees. For Bac revision, the recommended configuration is: Ollama locally for long sessions and exploratory queries, cloud API for final syntheses and oral examination simulation.
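
For the local option, the sketch below shows one way to send a prompt to an Ollama server over its HTTP API using only the standard library. It assumes Ollama is running on its default port (11434) and that a model tagged `mistral` has already been pulled; the function names are illustrative, and the model tag should be replaced with whichever model you actually installed.

```python
import json
import urllib.request

# Default endpoint of a local Ollama server (assumes the default port).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="mistral"):
    """Prepare a non-streaming generation request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, model="mistral"):
    """Send the request and return the generated text.

    Requires a running Ollama server; raises URLError otherwise.
    """
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

Calling `generate(prompt)` with a prompt that already contains the retrieved passages returns the model's answer as a string. Nothing leaves your machine, which is the point of the local configuration.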

The Queries That Transform Your Revision

A personal RAG is only as good as your ability to query it. The formulation of the query determines the quality of the retrieved passages, and therefore the relevance of the generated response. Three types of queries transform Bac revision: targeted methodological questions, cross-synthesis sheet generation, and oral examination simulation.

Methodology Questions Targeted by Exam Section

For the commentary, formulate queries that cross a stylistic device with an effect: "In my notes on La Princesse de Clèves, how does internal focalization construct the conflict between passion and virtue?" The RAG will retrieve the relevant passages from your notes and the text, and the LLM will draft an analysis grounded in those passages. For the dissertation, ask for arguments on both sides: "Give me three arguments for and three arguments against the thesis that the 18th-century novel is fundamentally didactic — with citations drawn from my sources." The constraint "with citations drawn from my sources" is non-negotiable.

Cross-Synthesis Sheet Generation

One of the most powerful uses of RAG for revision is the generation of synthesis sheets that cross multiple works or multiple angles. "Generate a sheet comparing the treatment of time in Hugo's Les Contemplations and Apollinaire's Alcools, drawing from my class notes and indexed extracts." The model is far less likely to invent correspondences, because it works from the real fragments you have indexed. If your index is rich, the sheet will be rich. This property transforms the construction of the index itself into a pedagogical act: the more precisely you annotate, the more relevant the generated sheets will be.

Oral Examination Simulation and Self-Assessment

For the oral section of the Bac de Français, simulating examiner questions is decisive training. Configure a RAG session in "examiner" mode: "You are an examiner for the Bac de Français. Ask me a question on the extract from Lagarce's Juste la fin du monde that I have indexed. After my answer, evaluate it by pointing out the missing elements and imprecisions, drawing solely from my class notes and the original text." This protocol keeps the system anchored to your documents and gives you contextualised feedback — exactly what you need, without the risk of a generic model answer.

Essential Safeguards to Avoid Being Misled

A poorly configured or poorly used RAG can give a false impression of reliability. Three safeguards are non-negotiable for maintaining the integrity of the system in an exam context.

Always Verify Quotations Against the Original Source

Even with a correctly configured RAG, the LLM may slightly rephrase a passage during generation. The absolute rule is: any quotation you intend to use in an exam paper must be verified word by word against the source text — the real book or the original PDF, not the RAG's response. The system tells you where to look; verification remains your responsibility. Treat every RAG response as a researcher's draft, not as a definitive reference.
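
Part of this check can be automated as a first pass before you open the book. The sketch below tests whether a quotation appears verbatim in a source text and, if not, returns the closest sentence so you can see what was altered. The function names are illustrative, and the sample text is a neutral placeholder, not an extract from a set work.

```python
import difflib
import re

def normalize(text):
    """Collapse whitespace and unify apostrophes; the wording is untouched."""
    text = text.replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip()

def check_quote(quote, source_text):
    """Return (is_exact, closest_sentence) for a quotation against a source."""
    quote_n = normalize(quote)
    source_n = normalize(source_text)
    if quote_n in source_n:
        return True, quote_n
    # Split on sentence-ending punctuation to find the nearest candidate.
    sentences = re.split(r"(?<=[.!?;])\s+", source_n)
    closest = difflib.get_close_matches(quote_n, sentences, n=1, cutoff=0.0)
    return False, (closest[0] if closest else "")

# Placeholder source text (not taken from any literary work).
source = "The sea was calm.  The ship left at dawn."
exact = check_quote("The ship left at dawn.", source)
altered = check_quote("The ship left at dusk.", source)
```

In this example `exact` reports a verbatim match, while `altered` reports a mismatch and returns the sentence the quotation most resembles. A match only proves the quotation agrees with your digital file, so the final check against the printed edition still applies.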

Refuse Generations Without Explicit Citation

Configure your system to refuse to respond if no relevant passage has been retrieved. In practice, this means adding a system instruction ("system prompt") of the type: "If no extract from the knowledge base supports the response, reply: 'No source available in the index for this question.' Do not generate a response without documentary foundation." This rule forces the system to signal its gaps rather than filling the holes by invention — and at the same time informs you about what is missing from your index.
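
A system prompt alone relies on the model obeying it. A complementary safeguard is to refuse before the model is even called, when retrieval returns nothing relevant enough. The sketch below assumes retrieval yields (score, passage) pairs with scores between 0 and 1; the 0.3 threshold is a hypothetical value that would need tuning on your own corpus.

```python
REFUSAL = "No source available in the index for this question."

SYSTEM_PROMPT = (
    "If no extract from the knowledge base supports the response, reply: "
    f"'{REFUSAL}' Do not generate a response without documentary foundation."
)

def passages_or_refusal(hits, min_score=0.3):
    """Keep hits at or above the threshold, or return the refusal message.

    `hits` is a list of (score, passage) pairs. Returns (passages, None)
    when at least one passage qualifies, otherwise (None, REFUSAL).
    """
    kept = [passage for score, passage in hits if score >= min_score]
    if not kept:
        return None, REFUSAL
    return kept, None

passages, refusal = passages_or_refusal([(0.82, "notes on act I"), (0.12, "unrelated")])
```

When `refusal` is not `None`, display it and skip generation entirely. Each refusal also points to a gap worth filling in the index.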

Keep a Log of Detected Errors

Every time you detect a factual error — inaccurate quotation, wrong attribution, displaced date — record it in a dedicated file: the error produced, the correct source, and the query context. This log has two virtues. It constitutes a document of active memorisation — reviewing your own errors is one of the most effective methods of long-term memory consolidation. It also serves to improve the index: if a recurring error reveals a gap in your sources, add the missing document.
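
One simple way to keep such a log is a CSV file that opens in any spreadsheet. The sketch below records the three elements named above plus the date; the column names and function names are assumptions chosen for the example.

```python
import csv
import os
from datetime import date

FIELDS = ["date", "query", "error_produced", "correct_source"]

def log_error(path, query, error_produced, correct_source):
    """Append one detected error to a CSV log, writing the header on first use."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "date": date.today().isoformat(),
            "query": query,
            "error_produced": error_produced,
            "correct_source": correct_source,
        })

def read_log(path):
    """Load the log back as a list of dictionaries for review."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

A call such as `log_error("errors.csv", "quote on time in Alcools", "two words displaced", "printed edition, checked by hand")` adds one row. Rereading the output of `read_log("errors.csv")` before each session serves the memorisation purpose described above.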

A Six-Week Implementation Schedule

Six weeks are sufficient to go from a raw corpus to a functional study assistant, at the rate of one to two hours of work per week. The effort is concentrated at the beginning — collection and cleaning — so that the intensive revision phase benefits from a stable index.

Weeks 1-2 — Collecting and Cleaning Sources

The first fortnight is devoted exclusively to building the corpus. List all your sources (works, class notes, past papers, personal study sheets) and rank them by priority: set works first, past papers next, personal work last. Convert paper documents to digital text. Read through each document to correct OCR errors — one hour of cleaning upfront saves ten hours of debugging downstream. Organise the files in a clear directory structure: one folder per work, one folder per document type. Do not begin any ingestion before you have a clean corpus.

Weeks 3-4 — Ingestion and First Local Deployment

Install Chroma and a multilingual embedding model. Split your documents into fragments of 300 to 500 words with a 50-word overlap between fragments: the overlap reduces the risk that a fragment boundary cuts a unit of meaning in two. Run the ingestion and verify that the number of indexed fragments matches your estimate. Test with around ten representative queries, covering all three exam sections. Correct retrieval problems (fragments too long, missing folders) before moving to the calibration phase.
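
The chunking step can be sketched as follows. This is a minimal word-based splitter under the parameters given above (400 words per fragment as a midpoint, 50 words of overlap); the function name is illustrative, and a real pipeline might prefer to split on paragraph or sentence boundaries.

```python
def chunk_words(text, size=400, overlap=50):
    """Split text into fragments of up to `size` words.

    Each fragment starts `size - overlap` words after the previous one,
    so consecutive fragments share `overlap` words.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this fragment already reaches the end of the text
    return chunks

# A synthetic 1,000-word text to check the behaviour.
sample = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(sample, size=400, overlap=50)
```

With these settings a 1,000-word text yields three fragments. For a text longer than one fragment, the expected count is roughly (word count minus overlap) divided by (size minus overlap), rounded up, which gives you the estimate to compare against the number of fragments actually indexed.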

Weeks 5-6 — Calibration and Personal Training

The final two weeks are weeks of active revision mediated by the RAG. Formulate at least five queries per study session. Record errors in your log. Adjust the parameter for the number of fragments retrieved per query (typically between 3 and 8) according to the density of your corpus. Practise oral examination simulation at least three times, asking for a structured evaluation after each response. At the end of these six weeks, your assistant knows your works, your class notes, and your analytical angles — not a generic angle, but yours.

Acknowledged Limits and Prospects Beyond the Bac

A personal RAG is a powerful tool, but neither omniscient nor infallible. Recognising its limits is as important as knowing how to exploit it — and this effort of lucidity itself prepares you for the intellectual rigour that the Bac de Français demands.

What a Personal RAG Will Never Replace

RAG does not think: it retrieves and assembles. It cannot build an original line of argument (problématique), feel the dramatic tension of a scene, or choose the angle that will make an essay singular. These operations require an understanding and sensitivity that only the human reader develops through genuine, sustained engagement with texts. RAG is an aid to memorisation and structuring: it does not substitute for reading, it extends it. A candidate who has not read the set works and attempts to rely exclusively on their assistant will produce responses that are technically sourced yet intellectually hollow, which is exactly what Bac examiners know how to detect.

Reusing the System for Future Studies

The index you build for the Bac de Français 2026 is the first link in a personal knowledge infrastructure that will accompany your entire university career. In literary preparatory classes, corpora are larger but the logic is identical. At university, the same architecture serves for research dissertations — by adding academic articles to your base. The competency you develop — building a reliable index, formulating precise queries, verifying sources — is a transversal epistemic skill, independent of discipline. FLE learners who master this system before entering higher education hold a lasting methodological advantage, well beyond the Bac.
