ConfliBERT

ConfliBERT is our domain-specific pre-trained BERT language model for conflict and political violence, built for conflict event data coding tasks.

Ways to access, learn about, and use ConfliBERT

HuggingFace

Main GitHub

Manual

Development Lifecycle

NSF 2024 CCSI PI Meeting

ConfliBERT Demos

ConfliBERT Demo notebook via Google Colab

ConfliBERT Demo GUI on HuggingFace
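
Beyond the demo notebook and GUI, the models can be loaded directly from the Hugging Face Hub with the `transformers` library. The sketch below shows a minimal fill-mask query; the checkpoint name `snowood1/ConfliBERT-scr-uncased` is one of the published variants (cased/uncased, pre-trained from scratch or continually pre-trained), so substitute whichever variant fits your task.

```python
# Minimal sketch: query a ConfliBERT checkpoint from the Hugging Face Hub.
# Assumes the `transformers` library is installed and network access is
# available to download the model weights.
from transformers import pipeline

# Load the from-scratch, uncased ConfliBERT variant for masked-token prediction.
fill_mask = pipeline("fill-mask", model="snowood1/ConfliBERT-scr-uncased")

# Ask the model to complete a conflict-related sentence.
results = fill_mask("The rebels attacked the [MASK] in the capital.")
for r in results:
    print(f"{r['token_str']}: {r['score']:.3f}")
```

For downstream tasks such as actor/action classification or named entity recognition, the same checkpoints can instead be loaded with `AutoModelForSequenceClassification` or `AutoModelForTokenClassification` and fine-tuned, as described in the papers below.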

Papers on ConfliBERT

  • Brandt, Patrick T., Sultan Alsarra, Vito J. D’Orazio, Dagmar Heintze, Latifur Khan, Shreyas Meher, Javier Osorio, and Marcus Sianan. 2024. ConfliBERT: A Language Model for Political Conflict. Abstract: Conflict scholars have used rule-based approaches to extract information about political violence from news reports and texts. Recent Natural Language Processing developments move beyond rigid rule-based approaches. We review our recent ConfliBERT language model (Hu et al. 2022) to process political and violence related texts. The model can be used to extract actor and action classifications from texts about political conflict. When fine-tuned, results show that ConfliBERT has superior performance in accuracy, precision and recall over other large language models (LLM) like Google’s Gemma 2 (9B), Meta’s Llama 3.1 (7B), and Alibaba’s Qwen 2.5 (14B) within its relevant domains. It is also hundreds of times faster than these more generalist LLMs. These results are illustrated using texts from the BBC, re3d, and the Global Terrorism Dataset (GTD). Finally, we discuss limitations of the models and propose extensions.
  • Yibo Hu, MohammadSaleh Hosseini, Erick Skorupa Parolin, Javier Osorio, Latifur Khan, Patrick Brandt, and Vito D’Orazio. 2022. ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5469–5482, Seattle, United States. Association for Computational Linguistics. Abstract: Analyzing conflicts and political violence around the world is a persistent challenge in the political science and policy communities due in large part to the vast volumes of specialized text needed to monitor conflict and violence on a global scale. To help advance research in political science, we introduce ConfliBERT, a domain-specific pre-trained language model for conflict and political violence. We first gather a large domain-specific text corpus for language modeling from various sources. We then build ConfliBERT using two approaches: pre-training from scratch and continual pre-training. To evaluate ConfliBERT, we collect 12 datasets and implement 18 tasks to assess the models’ practical application in conflict research. Finally, we evaluate several versions of ConfliBERT in multiple experiments. Results consistently show that ConfliBERT outperforms BERT when analyzing political violence and conflict.
  • W. Yang et al. 2023. ConfliBERT-Spanish: A Pre-trained Spanish Language Model for Political Conflict and Violence. 2023 7th IEEE Congress on Information Science and Technology (CiSt), Agadir – Essaouira, Morocco, pp. 287-292, doi: 10.1109/CiSt56084.2023.10409883. Abstract: This article introduces ConfliBERT-Spanish, a pre-trained language model specialized in political conflict and violence for text written in the Spanish language. Our methodology relies on a large corpus specialized in politics and violence to extend the capacity of pre-trained models capable of processing text in Spanish. We assess the performance of ConfliBERT-Spanish in comparison to Multilingual BERT and BETO baselines for binary classification, multi-label classification, and named entity recognition. Results show that ConfliBERT-Spanish consistently outperforms baseline models across all tasks. These results show that our domain-specific language-specific cyberinfrastructure can greatly enhance the performance of NLP models for Latin American conflict analysis. This methodological advancement opens vast opportunities to help researchers and practitioners in the security sector to effectively analyze large amounts of information with high degrees of accuracy, thus better equipping them to meet the dynamic and complex security challenges affecting the region.
  • Sultan Alsarra, Luay Abdeljaber, Wooseong Yang, Niamat Zawad, Latifur Khan, Patrick Brandt, Javier Osorio, and Vito D’Orazio. 2023. ConfliBERT-Arabic: A Pre-trained Arabic Language Model for Politics, Conflicts and Violence. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 98–108, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria. Abstract: This study investigates the use of Natural Language Processing (NLP) methods to analyze politics, conflicts and violence in the Middle East using domain-specific pre-trained language models. We introduce Arabic text and present ConfliBERT-Arabic, a pre-trained language model that can efficiently analyze political, conflict and violence-related texts. Our technique hones a pre-trained model using a corpus of Arabic texts about regional politics and conflicts. Performance of our models is compared to baseline BERT models. Our findings show that the performance of NLP models for Middle Eastern politics and conflict analysis is enhanced by the use of domain-specific pre-trained local language models. This study offers political and conflict analysts, including policymakers, scholars, and practitioners new approaches and tools for deciphering the intricate dynamics of local politics and conflicts directly in Arabic.