FIPO: Fallacy-Informed Preference Optimization - Steering LLMs Toward Logically Sound Arguments
🏆 Outstanding Paper Award Winner at NAACL 2025!
I’m thrilled to share that our paper “A Logical Fallacy-Informed Framework for Argument Generation” has received the Outstanding Paper Award at the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025)!
🚨 The Problem: LLMs Generate Fallacious Arguments
Despite their remarkable capabilities, Large Language Models (LLMs) still struggle to generate logically sound arguments, often producing content that contains logical fallacies: errors in reasoning that undermine the validity of an argument. Our preliminary study revealed a startling finding: 21% of arguments generated by ChatGPT contain logical fallacies.
Consider these examples of fallacious vs. logically sound arguments:
**Topic: AI is bad for the job market (Support)**

- ❌ Ad Hominem: “AI proponents are tech enthusiasts disconnected from real-world job market issues.”
- ❌ Circular Reasoning: “AI is harmful to the employment sector because AI is bad for the job market.”
- ❌ Appeal to Emotion: “AI will leave millions of families struggling and unemployed.”
- ✅ Logically Sound: “AI automates tasks, reducing employment opportunities by replacing humans in manufacturing and administrative jobs.”
💡 Our Solution: Fallacy-Informed Preference Optimization (FIPO)
We introduce FIPO, a novel framework that helps steer LLMs toward generating logically sound arguments by making them explicitly aware of logical fallacies during training.
Key Innovation: Weighted Classification Loss
FIPO combines traditional preference optimization with a weighted cross-entropy classification loss that penalizes the model based on fallacy frequency, applying stronger penalties for misclassifying more common fallacies.
The FIPO loss function is:

$$\mathcal{L}_{\text{FIPO}}(\pi_\theta) = \mathcal{L}_{\text{CPO}}(\pi_\theta) + \lambda\,\mathcal{L}_{\text{CLF}}(\pi_\theta)$$

where the classification loss $\mathcal{L}_{\text{CLF}}$ is a weighted cross-entropy over fallacy types:

$$\mathcal{L}_{\text{CLF}}(\pi_\theta) = -\mathbb{E}_{(t,\,s,\,y_w,\,y_l,\,k)\sim\mathcal{D}'}\left[\,w_0 \log P_0^{h_\theta}(y_w \mid t, s) + w_k \log P_k^{h_\theta}(y_l \mid t, s)\,\right]$$

Here:
- $t$ and $s$: the topic and stance; $y_w$ and $y_l$: the preferred (logically sound) and dispreferred (fallacious) arguments; $k$: the fallacy type of $y_l$
- $w_k$: weights based on fallacy frequency in the training data ($w_0$ is the weight of the non-fallacious class)
- $P_k^{h_\theta}$: probability of fallacy type $k$ from the classification head $h_\theta$
- $\lambda = 0.3$: balancing parameter (optimized through experiments)
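To make the objective concrete, here is a minimal PyTorch sketch of the classification term and how it combines with a precomputed preference loss. The function name, tensor shapes, and the `cpo_loss` input are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def fipo_loss(cpo_loss, clf_logits_w, clf_logits_l, fallacy_labels,
              class_weights, lam=0.3):
    """Sketch of the FIPO objective: preference loss + weighted classification loss.

    cpo_loss:       precomputed CPO preference loss (scalar tensor)
    clf_logits_w:   classification-head logits for preferred arguments, (B, K+1)
    clf_logits_l:   classification-head logits for dispreferred arguments, (B, K+1)
    fallacy_labels: fallacy type k of each dispreferred argument, (B,)
    class_weights:  w_0..w_K, larger for more frequent fallacies, (K+1,)
    """
    batch = clf_logits_w.size(0)
    # Preferred arguments should be classified as class 0 (no fallacy).
    labels_w = torch.zeros(batch, dtype=torch.long, device=clf_logits_w.device)
    loss_w = F.cross_entropy(clf_logits_w, labels_w, weight=class_weights)
    # Dispreferred arguments should be classified as their fallacy type k.
    loss_l = F.cross_entropy(clf_logits_l, fallacy_labels, weight=class_weights)
    return cpo_loss + lam * (loss_w + loss_l)
```

Note that the `weight` argument of `F.cross_entropy` rescales the loss per class, playing exactly the role of $w_0$ and $w_k$ in the equation above.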
🧠 The 13 Types of Logical Fallacies
We define 13 categories of logical fallacies based on centuries of logical reasoning research dating back to Aristotle:
Most Common Fallacies (in our dataset):

1. Faulty Generalization (18.0%) - Drawing conclusions about all instances from limited examples
   - Example: “I know someone who smoked cannabis and became successful. Therefore, everyone who smokes cannabis will be successful.”
2. Ad Hominem (12.3%) - Attacking the person making the argument rather than the argument itself
   - Example: “Those climate scientists are just trying to get more funding for their research.”
3. Ad Populum (9.5%) - Arguing that a claim is true because the majority believes it
   - Example: “Most people think this policy is good, so it must be correct.”
4. False Causality (8.8%) - Implying causal relationships without supporting evidence
   - Example: “Ever since we installed those wind turbines, there have been more bird deaths in the area.”
5. Circular Reasoning (7.0%) - Restating the conclusion as a premise instead of proving it
   - Example: “We should trust the news because the news says we should trust it.”
Other Fallacy Types:
- Appeal to Emotion (6.8%) - Manipulating emotions rather than using logic
- Fallacy of Relevance (6.6%) - Introducing irrelevant information
- Fallacy of Logic (6.2%) - Errors in logical structure
- Intentional (5.8%) - Deliberately wrong arguments
- False Dilemma (5.8%) - Presenting only two options when many exist
- Fallacy of Extension (5.8%) - Attacking exaggerated versions of arguments
- Fallacy of Credibility (5.4%) - Attacking the speaker’s character
- Equivocation (2.0%) - Using ambiguous language
🔬 Methodology: 4-Step Framework
Our approach involves four key steps:
Step 1: Supervised Fine-Tuning (SFT)
Train the base model on the ExplaGraphs dataset containing topics, stances, and arguments.
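As a rough illustration of this step, each ExplaGraphs record could be formatted into a prompt/completion pair for supervised fine-tuning. The field names and prompt template below are assumptions, not the dataset's actual schema:

```python
def format_sft_example(record):
    """Map an ExplaGraphs-style record to an SFT training example.

    The field names ('topic', 'stance', 'argument') and the prompt
    template are illustrative assumptions.
    """
    prompt = (f"Topic: {record['topic']}\n"
              f"Stance: {record['stance']}\n"
              f"Argument:")
    return {"prompt": prompt, "completion": " " + record["argument"]}
```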
Step 2: Preference Data Collection
Generate fallacious arguments using ChatGPT for each original argument, creating preference pairs where:
- Preferred (yw): Original logically sound argument
- Dispreferred (yl): Generated fallacious argument with label k
We generated 7,872 fallacious arguments spanning all 13 fallacy types following the natural distribution found in real-world discourse.
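A rough sketch of how such preference pairs could be assembled (the prompt text and the `chat` helper wrapping the ChatGPT API are hypothetical; the authors' exact prompts are in the paper and repository):

```python
import random

# Natural fallacy distribution from the dataset (fractions of the 13 types).
FALLACY_DIST = {
    "Faulty Generalization": 0.180, "Ad Hominem": 0.123, "Ad Populum": 0.095,
    "False Causality": 0.088, "Circular Reasoning": 0.070,
    # ... remaining 8 types with their frequencies
}

def make_preference_pair(topic, stance, argument, chat):
    """Build one (preferred, dispreferred) pair for a given argument.

    `chat` is a hypothetical callable wrapping the ChatGPT API.
    """
    # Sample a fallacy type following its natural frequency.
    types, probs = zip(*FALLACY_DIST.items())
    k = random.choices(types, weights=probs, k=1)[0]
    prompt = (f"Rewrite the following argument on '{topic}' ({stance}) so that "
              f"it commits the logical fallacy '{k}':\n{argument}")
    fallacious = chat(prompt)
    return {"topic": topic, "stance": stance,
            "chosen": argument, "rejected": fallacious, "fallacy_type": k}
```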
Step 3: Preference Optimization
Apply standard preference optimization methods (DPO, PPO, CPO, KTO) using the preference dataset.
Step 4: FIPO Enhancement
Add our weighted classification loss to the best-performing preference method (CPO) to create FIPO.
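One plausible way to realize the classification head, as a sketch rather than the authors' exact architecture: attach a small linear layer on top of the policy model's final hidden states and feed its logits into the weighted loss shown earlier.

```python
import torch
import torch.nn as nn

class FallacyClassificationHead(nn.Module):
    """Linear head mapping the LM's last hidden state to 14 classes
    (13 fallacy types + 1 'no fallacy' class). Pooling choice is assumed."""

    def __init__(self, hidden_size: int, num_classes: int = 14):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor):
        # Pool over the sequence: take the last non-padding token's hidden state.
        lengths = attention_mask.sum(dim=1) - 1                       # (B,)
        pooled = hidden_states[torch.arange(hidden_states.size(0)), lengths]
        return self.proj(pooled)                                      # (B, 14)
```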
📊 Outstanding Results
Dramatic Fallacy Reduction
| Model | Baseline (SFT) | Best Previous Method | FIPO | Improvement |
|---|---|---|---|---|
| Llama-2 (7B) | 34.5% | 26.0% (PPO) | 17.0% | 17.5-point reduction |
| Mistral (7B) | 32.5% | 27.75% (KTO) | 19.5% | 13.0-point reduction |
Quality Improvements (Win-Rate)
FIPO not only reduces fallacies but also maintains high argument quality:
- Human evaluation win rate: 46%, with the loss rate dropping from 40.3% (CPO) to 23% (FIPO)
- GPT-4 evaluation win rate: 63.5% (highest among all methods)
Fallacy-Specific Performance
FIPO excels particularly at reducing the most common fallacy types:
| Fallacy Type | Llama-2 SFT | Llama-2 FIPO | Change |
|---|---|---|---|
| Faulty Generalization | 27.5% | 7.0% | -20.5 points |
| False Causality | 2.5% | 3.5% | +1.0 point (kept low) |
| Appeal to Emotion | 1.0% | 2.5% | +1.5 points (kept low) |
The weighted classification loss ensures the model focuses on the most problematic and frequent fallacies.
🔍 Human Evaluation Validation
GPT-4 Reliability Verification
We validated GPT-4’s ability to identify fallacies through human annotation:
- Randolph’s κ agreement: 0.640 (substantial agreement)
- Majority agreement ratio: 95.5% between annotators and GPT-4
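For reference, Randolph's free-marginal multirater κ used above can be computed as in this minimal sketch (the per-item category counts are assumed inputs):

```python
def randolph_kappa(ratings, num_categories):
    """Randolph's free-marginal multirater kappa.

    ratings: list of per-item category counts, e.g. [[3, 0], [2, 1], ...],
             where each inner list sums to the number of raters n.
    """
    n = sum(ratings[0])  # raters per item (assumed constant)
    N = len(ratings)
    # Observed agreement: average pairwise agreement per item.
    p_o = sum(sum(c * (c - 1) for c in item) / (n * (n - 1))
              for item in ratings) / N
    p_e = 1.0 / num_categories  # chance agreement with free marginals
    return (p_o - p_e) / (1 - p_e)

# Example: 3 raters, 2 categories (fallacious / not fallacious)
print(randolph_kappa([[3, 0], [2, 1], [3, 0]], num_categories=2))
```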
Comparative Analysis
200 arguments were independently classified by our team and compared with GPT-4:
- Strong alignment on most fallacy types
- Main discrepancy: Fallacy of Relevance detection
- Some confusion between Faulty Generalization and False Causality (expected due to subtle differences)
🧪 Ablation Studies Confirm Design Choices
We validated our design through rigorous ablation studies:
| Approach | Fallacy Rate | Analysis |
|---|---|---|
| Uniform dataset distribution | 37.5% | Natural distribution is crucial |
| Unweighted cross-entropy | 29.0% | Weighting by frequency is essential |
| FIPO (full method) | 17.0% | Best performance |
These results confirm that both the natural fallacy distribution and weighted classification loss are essential components of FIPO.
🌍 Generalization: Out-of-Domain Performance
FIPO’s effectiveness extends beyond training data:
Debatepedia Dataset Results:
- Fallacy Rate: 45% (second-best, close to KTO’s 44%)
- Win Rate: 62% (highest among all methods)
- Excellent at reducing: False Causality and Fallacy of Relevance
This demonstrates FIPO’s ability to generalize to new domains and topics.
💻 Implementation Details
Base Models
- Llama-2 (7B) and Mistral (7B)
- LoRA fine-tuning: reduces trainable parameters from 7B to 8.3M (0.12% of the full model)
Training Configuration
- Classification loss weight (λ): 0.3 (optimal balance)
- Fallacy weights ($w_k$): based on the natural frequency distribution (see the sketch after this list)
- Base method: CPO (best trade-off between win-rate and fallacy-rate)
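A minimal sketch of how frequency-proportional weights $w_k$ could be derived from the training labels; the normalization below is an assumed choice, not necessarily the authors':

```python
from collections import Counter

def frequency_weights(fallacy_labels, num_classes):
    """Class weights proportional to fallacy frequency, so misclassifying
    common fallacies is penalized more. Normalization is an assumption."""
    counts = Counter(fallacy_labels)
    freqs = [counts.get(k, 0) / len(fallacy_labels) for k in range(num_classes)]
    total = sum(freqs)
    return [f / total for f in freqs]  # weights sum to 1
```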
Key Hyperparameters
- Learning rate: 2×10⁻⁴
- LoRA rank: 16, α = 32
- Training epochs: 3
- β (CPO): 0.1
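These settings map directly onto a PEFT configuration; here is a sketch assuming the Hugging Face `peft` library (target modules are left to the library's per-architecture defaults, which may differ from the authors' choice):

```python
from peft import LoraConfig

# LoRA setup matching the reported hyperparameters.
lora_config = LoraConfig(
    r=16,                  # LoRA rank
    lora_alpha=32,         # alpha = 32
    task_type="CAUSAL_LM",
)
# Training: learning rate 2e-4, 3 epochs; CPO beta = 0.1 is set in the trainer.
```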
🔮 Impact and Future Directions
Immediate Impact
- First work to systematically address logical fallacies in argument generation
- Novel framework combining preference optimization with classification objectives
- Significant performance gains across multiple models and evaluation metrics
Broader Implications
- Trust and Safety: Reduces spread of logically flawed arguments
- Educational Applications: Could help teach logical reasoning
- Democratic Discourse: Promotes more sound public debate
Future Research Directions
- Scaling to larger models: Apply FIPO to models >10B parameters
- Multi-domain extension: Expand to scientific, legal, and technical arguments
- Real-time fallacy detection: Interactive systems for argument evaluation
- Cross-lingual fallacy analysis: Extend to non-English languages
- Integration with fact-checking: Combine logical and factual verification
🎯 Key Takeaways
- LLMs have a significant logical fallacy problem: Even advanced models like ChatGPT generate fallacious arguments 21% of the time
- Explicit fallacy awareness helps: Making models aware of specific fallacy types significantly improves argument quality
- Weighted classification loss is crucial: Focusing on frequent fallacies yields better overall performance
- Preference optimization isn’t enough alone: Standard methods like DPO and PPO help but miss fine-grained fallacy distinctions
- FIPO provides substantial improvements: Up to a 17.5-point reduction in fallacy rates while maintaining argument quality
📚 Resources and Code
- 📄 Paper: A Logical Fallacy-Informed Framework for Argument Generation
- 💻 Code & Data: Available at github.com/lucamouchel/Logical-Fallacies
- 🏆 Award Recognition: Outstanding Paper Award - NAACL 2025
👥 Acknowledgments
This work was a collaborative effort with amazing researchers:
- Luca Mouchel (lead author; work carried out during a Master’s internship)
- Debjit Paul, Shaobo Cui, Robert West, Antoine Bosselut, Boi Faltings
- EPFL, Switzerland
Special thanks to the ICT-48 Network of AI Research Excellence Center “TAILOR”, the Swiss National Science Foundation, and our other funding partners.
📖 Citation
```bibtex
@inproceedings{mouchel-etal-2025-logical,
    title = "A Logical Fallacy-Informed Framework for Argument Generation",
    author = "Mouchel, Luca and Paul, Debjit and Cui, Shaobo and West, Robert and Bosselut, Antoine and Faltings, Boi",
    booktitle = "Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    year = "2025",
    pages = "7296--7314",
    publisher = "Association for Computational Linguistics"
}
```
🤔 Discussion Questions
For Researchers:
- How might FIPO be adapted for other types of reasoning tasks beyond argument generation?
- What other fine-grained classification objectives could improve preference optimization?
For Practitioners:
- How can we integrate fallacy detection into real-world applications like social media platforms or educational tools?
- What are the computational trade-offs when adding classification heads to large language models?
For Society:
- How do we balance improving logical reasoning with maintaining diverse perspectives in AI-generated content?
- What role should AI play in moderating online discourse and identifying fallacious arguments?
This research represents a significant step toward more trustworthy and logically sound AI systems. As LLMs become increasingly prevalent in decision-making and public discourse, ensuring they generate well-reasoned arguments is not just a technical challenge but a societal imperative.
🏆 Proud to have this work recognized with the Outstanding Paper Award at NAACL 2025!