FIPO: Fallacy-Informed Preference Optimization - Steering LLMs Toward Logically Sound Arguments
🏆 Outstanding Paper Award Winner at NAACL 2025!
I’m thrilled to share that our paper “A Logical Fallacy-Informed Framework for Argument Generation” has received the Outstanding Paper Award at the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025)!
🚨 The Problem: LLMs Generate Fallacious Arguments
Despite their remarkable capabilities, Large Language Models (LLMs) still struggle to generate logically sound arguments, often producing content that contains logical fallacies: errors in reasoning that undermine the validity of an argument. Our preliminary study revealed a startling finding: 21% of arguments generated by ChatGPT contain logical fallacies.
Consider these examples of fallacious vs. logically sound arguments:
**Topic: AI is bad for the job market (Support)**

- ❌ Ad Hominem: “AI proponents are tech enthusiasts disconnected from real-world job market issues.”
- ❌ Circular Reasoning: “AI is harmful to the employment sector because AI is bad for the job market.”
- ❌ Appeal to Emotion: “AI will leave millions of families struggling and unemployed.”
- ✅ Logically Sound: “AI automates tasks, reducing employment opportunities by replacing humans in manufacturing and administrative jobs.”
💡 Our Solution: Fallacy-Informed Preference Optimization (FIPO)
We introduce FIPO, a novel framework that helps steer LLMs toward generating logically sound arguments by making them explicitly aware of logical fallacies during training.
Key Innovation: Weighted Classification Loss
FIPO combines traditional preference optimization with a weighted cross-entropy classification loss that penalizes the model based on fallacy frequency, applying stronger penalties for misclassifying more common fallacies.
The FIPO loss function is:

$$\mathcal{L}_{\text{FIPO}}(\pi_\theta) = \mathcal{L}_{\text{CPO}}(\pi_\theta) + \lambda\,\mathcal{L}_{\text{CLF}}(\pi_\theta)$$

where the classification loss $\mathcal{L}_{\text{CLF}}$ is a weighted cross-entropy over fallacy types:

$$\mathcal{L}_{\text{CLF}}(\pi_\theta) = -\mathbb{E}_{(t,\,s,\,y_w,\,y_l,\,k)\sim\mathcal{D}'}\left[\,w_0 \log P_0^{h_\theta}(y_w \mid t, s) + w_k \log P_k^{h_\theta}(y_l \mid t, s)\,\right]$$

Here:
- $t$ and $s$: the topic and stance; $y_w$ and $y_l$: the preferred (logically sound) and dispreferred (fallacious) arguments; $k$: the fallacy type of $y_l$
- $w_k$: weights based on fallacy frequency in the training data ($w_0$ is the weight of the non-fallacious class)
- $P_k^{h_\theta}$: probability of fallacy type $k$ from the classification head $h_\theta$
- $\lambda = 0.3$: balancing parameter (optimized through experiments)
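To make the objective concrete, here is a minimal PyTorch sketch of the classification term and how it combines with a precomputed preference loss. The function name, tensor shapes, and the `cpo_loss` input are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def fipo_loss(cpo_loss, clf_logits_w, clf_logits_l, fallacy_labels,
              class_weights, lam=0.3):
    """Sketch of the FIPO objective: preference loss + weighted classification loss.

    cpo_loss:       precomputed CPO preference loss (scalar tensor)
    clf_logits_w:   classification-head logits for preferred arguments, (B, K+1)
    clf_logits_l:   classification-head logits for dispreferred arguments, (B, K+1)
    fallacy_labels: fallacy type k of each dispreferred argument, (B,)
    class_weights:  w_0..w_K, larger for more frequent fallacies, (K+1,)
    """
    batch = clf_logits_w.size(0)
    # Preferred arguments should be classified as class 0 (no fallacy).
    labels_w = torch.zeros(batch, dtype=torch.long, device=clf_logits_w.device)
    loss_w = F.cross_entropy(clf_logits_w, labels_w, weight=class_weights)
    # Dispreferred arguments should be classified as their fallacy type k.
    loss_l = F.cross_entropy(clf_logits_l, fallacy_labels, weight=class_weights)
    return cpo_loss + lam * (loss_w + loss_l)
```

Note that the `weight` argument of `F.cross_entropy` rescales the loss per class, playing exactly the role of $w_0$ and $w_k$ in the equation above.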
🧠 The 13 Types of Logical Fallacies
We define 13 categories of logical fallacies based on centuries of logical reasoning research dating back to Aristotle:
Most Common Fallacies (in our dataset):

1. Faulty Generalization (18.0%) - Drawing conclusions about all instances from limited examples
   - Example: “I know someone who smoked cannabis and became successful. Therefore, everyone who smokes cannabis will be successful.”
2. Ad Hominem (12.3%) - Attacking the person making the argument rather than the argument itself
   - Example: “Those climate scientists are just trying to get more funding for their research.”
3. Ad Populum (9.5%) - Arguing that a claim is true because the majority believes it
   - Example: “Most people think this policy is good, so it must be correct.”
4. False Causality (8.8%) - Implying causal relationships without supporting evidence
   - Example: “Ever since we installed those wind turbines, there have been more bird deaths in the area.”
5. Circular Reasoning (7.0%) - Restating the conclusion as a premise instead of proving it
   - Example: “We should trust the news because the news says we should trust it.”
Other Fallacy Types:
- Appeal to Emotion (6.8%) - Manipulating emotions rather than using logic
- Fallacy of Relevance (6.6%) - Introducing irrelevant information
- Fallacy of Logic (6.2%) - Errors in logical structure
- Intentional (5.8%) - Deliberately wrong arguments
- False Dilemma (5.8%) - Presenting only two options when many exist
- Fallacy of Extension (5.8%) - Attacking exaggerated versions of arguments
- Fallacy of Credibility (5.4%) - Attacking the speaker’s character
- Equivocation (2.0%) - Using ambiguous language
🔬 Methodology: 4-Step Framework
Our approach involves four key steps:
Step 1: Supervised Fine-Tuning (SFT)
Train the base model on the ExplaGraphs dataset containing topics, stances, and arguments.
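As a rough illustration of this step, each ExplaGraphs record could be formatted into a prompt/completion pair for supervised fine-tuning. The field names and prompt template below are assumptions, not the dataset's actual schema:

```python
def format_sft_example(record):
    """Map an ExplaGraphs-style record to an SFT training example.

    The field names ('topic', 'stance', 'argument') and the prompt
    template are illustrative assumptions.
    """
    prompt = (f"Topic: {record['topic']}\n"
              f"Stance: {record['stance']}\n"
              f"Argument:")
    return {"prompt": prompt, "completion": " " + record["argument"]}
```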
Step 2: Preference Data Collection
Generate fallacious arguments using ChatGPT for each original argument, creating preference pairs where:
- Preferred (yw): Original logically sound argument
- Dispreferred (yl): Generated fallacious argument with label k
We generated 7,872 fallacious arguments spanning all 13 fallacy types following the natural distribution found in real-world discourse.
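A rough sketch of how such preference pairs could be assembled (the prompt text and the `chat` helper wrapping the ChatGPT API are hypothetical; the authors' exact prompts are in the paper and repository):

```python
import random

# Natural fallacy distribution from the dataset (fractions of the 13 types).
FALLACY_DIST = {
    "Faulty Generalization": 0.180, "Ad Hominem": 0.123, "Ad Populum": 0.095,
    "False Causality": 0.088, "Circular Reasoning": 0.070,
    # ... remaining 8 types with their frequencies
}

def make_preference_pair(topic, stance, argument, chat):
    """Build one (preferred, dispreferred) pair for a given argument.

    `chat` is a hypothetical callable wrapping the ChatGPT API.
    """
    # Sample a fallacy type following its natural frequency.
    types, probs = zip(*FALLACY_DIST.items())
    k = random.choices(types, weights=probs, k=1)[0]
    prompt = (f"Rewrite the following argument on '{topic}' ({stance}) so that "
              f"it commits the logical fallacy '{k}':\n{argument}")
    fallacious = chat(prompt)
    return {"topic": topic, "stance": stance,
            "chosen": argument, "rejected": fallacious, "fallacy_type": k}
```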
Step 3: Preference Optimization
Apply standard preference optimization methods (DPO, PPO, CPO, KTO) using the preference dataset.
Step 4: FIPO Enhancement
Add our weighted classification loss to the best-performing preference method (CPO) to create FIPO.
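One plausible way to realize the classification head, as a sketch rather than the authors' exact architecture: attach a small linear layer on top of the policy model's final hidden states and feed its logits into the weighted loss shown earlier.

```python
import torch
import torch.nn as nn

class FallacyClassificationHead(nn.Module):
    """Linear head mapping the LM's last hidden state to 14 classes
    (13 fallacy types + 1 'no fallacy' class). Pooling choice is assumed."""

    def __init__(self, hidden_size: int, num_classes: int = 14):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor):
        # Pool over the sequence: take the last non-padding token's hidden state.
        lengths = attention_mask.sum(dim=1) - 1                       # (B,)
        pooled = hidden_states[torch.arange(hidden_states.size(0)), lengths]
        return self.proj(pooled)                                      # (B, 14)
```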
📊 Outstanding Results
Dramatic Fallacy Reduction
| Model | Baseline (SFT) | Best Previous Method | FIPO | Improvement |
|---|---|---|---|---|
| Llama-2 (7B) | 34.5% | 26.0% (PPO) | 17.0% | 17.5-point reduction |
| Mistral (7B) | 32.5% | 27.75% (KTO) | 19.5% | 13.0-point reduction |
Quality Improvements (Win-Rate)
FIPO not only reduces fallacies but also maintains high argument quality:
- Human evaluation win rate: 46%, with the loss rate dropping from 40.3% (CPO) to 23% (FIPO)
- GPT-4 evaluation win rate: 63.5% (highest among all methods)
Fallacy-Specific Performance
FIPO excels particularly at reducing the most common fallacy types:
| Fallacy Type | Llama-2 SFT | Llama-2 FIPO | Change |
|---|---|---|---|
| Faulty Generalization | 27.5% | 7.0% | -20.5 points |
| False Causality | 2.5% | 3.5% | +1.0 point (kept low) |
| Appeal to Emotion | 1.0% | 2.5% | +1.5 points (kept low) |
The weighted classification loss ensures the model focuses on the most problematic and frequent fallacies.
🔍 Human Evaluation Validation
GPT-4 Reliability Verification
We validated GPT-4’s ability to identify fallacies through human annotation:
- Randolph’s κ agreement: 0.640 (substantial agreement)
- Majority agreement ratio: 95.5% between annotators and GPT-4
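For reference, Randolph's free-marginal multirater κ used above can be computed as in this minimal sketch (the per-item category counts are assumed inputs):

```python
def randolph_kappa(ratings, num_categories):
    """Randolph's free-marginal multirater kappa.

    ratings: list of per-item category counts, e.g. [[3, 0], [2, 1], ...],
             where each inner list sums to the number of raters n.
    """
    n = sum(ratings[0])  # raters per item (assumed constant)
    N = len(ratings)
    # Observed agreement: average pairwise agreement per item.
    p_o = sum(sum(c * (c - 1) for c in item) / (n * (n - 1))
              for item in ratings) / N
    p_e = 1.0 / num_categories  # chance agreement with free marginals
    return (p_o - p_e) / (1 - p_e)

# Example: 3 raters, 2 categories (fallacious / not fallacious)
print(randolph_kappa([[3, 0], [2, 1], [3, 0]], num_categories=2))
```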
Comparative Analysis
200 arguments were independently classified by our team and compared with GPT-4:
- Strong alignment on most fallacy types
- Main discrepancy: Fallacy of Relevance detection
- Some confusion between Faulty Generalization and False Causality (expected due to subtle differences)
🧪 Ablation Studies Confirm Design Choices
We validated our design through rigorous ablation studies:
| Approach | Fallacy Rate | Analysis |
|---|---|---|
| Uniform dataset distribution | 37.5% | Natural distribution is crucial |
| Unweighted cross-entropy | 29.0% | Weighting by frequency is essential |
| FIPO (full method) | 17.0% | Best performance |
These results confirm that both the natural fallacy distribution and weighted classification loss are essential components of FIPO.
🌍 Generalization: Out-of-Domain Performance
FIPO’s effectiveness extends beyond training data:
Debatepedia Dataset Results:
- Fallacy Rate: 45% (second-best, close to KTO’s 44%)
- Win Rate: 62% (highest among all methods)
- Excellent at reducing: False Causality and Fallacy of Relevance
This demonstrates FIPO’s ability to generalize to new domains and topics.
💻 Implementation Details
Base Models
- Llama-2 (7B) and Mistral (7B)
- LoRA fine-tuning: reduces trainable parameters from 7B to 8.3M (0.12% of the full model)
Training Configuration
- Classification loss weight (λ): 0.3 (optimal balance)
- Fallacy weights ($w_k$): based on the natural frequency distribution (see the sketch after this list)
- Base method: CPO (best trade-off between win-rate and fallacy-rate)
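A minimal sketch of how frequency-proportional weights $w_k$ could be derived from the training labels; the normalization below is an assumed choice, not necessarily the authors':

```python
from collections import Counter

def frequency_weights(fallacy_labels, num_classes):
    """Class weights proportional to fallacy frequency, so misclassifying
    common fallacies is penalized more. Normalization is an assumption."""
    counts = Counter(fallacy_labels)
    freqs = [counts.get(k, 0) / len(fallacy_labels) for k in range(num_classes)]
    total = sum(freqs)
    return [f / total for f in freqs]  # weights sum to 1
```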
Key Hyperparameters
- Learning rate: 2×10⁻⁴
- LoRA rank: 16, α = 32
- Training epochs: 3
- β (CPO): 0.1
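These settings map directly onto a PEFT configuration; here is a sketch assuming the Hugging Face `peft` library (target modules are left to the library's per-architecture defaults, which may differ from the authors' choice):

```python
from peft import LoraConfig

# LoRA setup matching the reported hyperparameters.
lora_config = LoraConfig(
    r=16,                  # LoRA rank
    lora_alpha=32,         # alpha = 32
    task_type="CAUSAL_LM",
)
# Training: learning rate 2e-4, 3 epochs; CPO beta = 0.1 is set in the trainer.
```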
🔮 Impact and Future Directions
Immediate Impact
- First work to systematically address logical fallacies in argument generation
- Novel framework combining preference optimization with classification objectives
- Significant performance gains across multiple models and evaluation metrics
Broader Implications
- Trust and Safety: Reduces spread of logically flawed arguments
- Educational Applications: Could help teach logical reasoning
- Democratic Discourse: Promotes more sound public debate
Future Research Directions
- Scaling to larger models: Apply FIPO to models >10B parameters
- Multi-domain extension: Expand to scientific, legal, and technical arguments
- Real-time fallacy detection: Interactive systems for argument evaluation
- Cross-lingual fallacy analysis: Extend to non-English languages
- Integration with fact-checking: Combine logical and factual verification
🎯 Key Takeaways
- LLMs have a significant logical fallacy problem: Even advanced models like ChatGPT generate fallacious arguments 21% of the time
- Explicit fallacy awareness helps: Making models aware of specific fallacy types significantly improves argument quality
- Weighted classification loss is crucial: Focusing on frequent fallacies yields better overall performance
- Preference optimization isn’t enough alone: Standard methods like DPO and PPO help but miss fine-grained fallacy distinctions
- FIPO provides substantial improvements: Up to a 17.5-point reduction in fallacy rates while maintaining argument quality
📚 Resources and Code
- 📄 Paper: A Logical Fallacy-Informed Framework for Argument Generation
- 💻 Code & Data: Available at github.com/lucamouchel/Logical-Fallacies
- 🏆 Award Recognition: Outstanding Paper Award - NAACL 2025
👥 Acknowledgments
This work was a collaborative effort with amazing researchers:
- Luca Mouchel (lead author; work carried out during a Master’s internship)
- Debjit Paul, Shaobo Cui, Robert West, Antoine Bosselut, Boi Faltings
- EPFL, Switzerland
Special thanks to the ICT-48 Network of AI Research Excellence Center “TAILOR”, the Swiss National Science Foundation, and our other funding partners.
📖 Citation
```bibtex
@inproceedings{mouchel-etal-2025-logical,
    title = "A Logical Fallacy-Informed Framework for Argument Generation",
    author = "Mouchel, Luca and Paul, Debjit and Cui, Shaobo and West, Robert and Bosselut, Antoine and Faltings, Boi",
    booktitle = "Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    year = "2025",
    pages = "7296--7314",
    publisher = "Association for Computational Linguistics"
}
```
🤔 Discussion Questions
For Researchers:
- How might FIPO be adapted for other types of reasoning tasks beyond argument generation?
- What other fine-grained classification objectives could improve preference optimization?
For Practitioners:
- How can we integrate fallacy detection into real-world applications like social media platforms or educational tools?
- What are the computational trade-offs when adding classification heads to large language models?
For Society:
- How do we balance improving logical reasoning with maintaining diverse perspectives in AI-generated content?
- What role should AI play in moderating online discourse and identifying fallacious arguments?
This research represents a significant step toward more trustworthy and logically sound AI systems. As LLMs become increasingly prevalent in decision-making and public discourse, ensuring they generate well-reasoned arguments is not just a technical challenge but a societal imperative.
🏆 Proud to have this work recognized with the Outstanding Paper Award at NAACL 2025!