Project Topics

Suggestions and research areas for thesis and project activities

Overview

You can find a detailed list of available research topics along with our proposed projects.

Each project follows this template:

*Project Title*

  • Description: brief project description

  • Supervised by: supervisor #1, supervisor #2, etc...

  • Type:

    • "TO BE DEFINED (TBD)": the project describes a very generic problem without any additional detail, thus allowing ample room for individual proposals.

    • "PARTIALLY DEFINED": the project provides some details about a problem, such as method and available data. However, additional details have to be defined to set up a specific task.

    • "SET UP": the project is almost or entirely defined.

  • References: resource #1, resource #2, etc...

If you are interested

If one or more research projects match your interests, don't hesitate to get in touch with the corresponding supervisors.

All projects are supervised by prof. Paolo Torroni (p.torroni@unibo.it).

Argumentation Mining

One of our main research interests is Argument/Argumentation Mining (AM). It can be informally described as the problem of automatically detecting and extracting arguments from text. Arguments are usually represented as a combination of a premise (a fact) that supports a subjective conclusion (an opinion or claim). Argumentation Mining touches a wide variety of well-known NLP tasks, from sentiment analysis and stance detection to summarization and dialogue systems.
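As a minimal illustration of the premise/claim representation described above (the class and field names are ours, not a standard AM schema), an argument can be sketched as a claim linked to its supporting premises:

```python
from dataclasses import dataclass, field

@dataclass
class Argument:
    """Toy representation of an argument: premises supporting a claim."""
    claim: str                                      # subjective conclusion
    premises: list = field(default_factory=list)    # supporting facts

    def add_premise(self, text: str) -> None:
        self.premises.append(text)

arg = Argument(claim="Remote work should be the default.")
arg.add_premise("Commuting time drops to zero.")
arg.add_premise("Several studies report equal or higher productivity.")
```

Real AM corpora typically add further structure on top of this (attack relations, component spans, stance labels), but the premise-supports-claim backbone is the common core.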

#1 - Argument Correction/Completion

  • Description: Given an argument, or its main claim, we would like to generate an improved version of the argument itself. The argument can be generated either from scratch or from an initial, partial argument. Improvement is expressed in terms of quality (argument ranking), via models (we use an existing argument classification model and evaluate its prediction confidence), or via task-specific conditioning factors (e.g. sentiment, stance). Additionally, the generated text can be constrained so that it does not differ too much from the original input claim.

  • Supervised by: Andrea Galassi (a.galassi@unibo.it), Federico Ruggeri (federico.ruggeri6@unibo.it)

  • Type: PARTIALLY DEFINED

#2 - Multimodal Argument Mining

  • Description: We would like to make use of speech information (e.g. prosody) to enhance the set of features that can be used to detect arguments. Speech can be represented either by means of ad-hoc feature extraction methods (e.g. MFCC) or via end-to-end architectures. Few existing corpora offer both argument annotation layers and speech data for a given text document.

  • Supervised by: Eleonora Mancini (e.mancini@unibo.it), Federico Ruggeri (federico.ruggeri6@unibo.it)

  • Type: PARTIALLY DEFINED

  • Non-exhaustive list of potential projects:

    • Analyze the impact of fusion strategies. Use a multimodal AM dataset such as MM-USElec and modify the architectures employed in experiments on it, placing emphasis on modality-fusion strategies. This involves a thorough review of the current state of the art in fusion techniques for multimodal models, followed by an assessment of the impact of these strategies on performance in the designated benchmark task.

    • Multimodal sequential tagging. Conduct sequential-tagging experiments on the MM-USElec and/or MM-USElec-Fallacy datasets. Since the existing text-audio alignments are at the sentence level, word-level alignments and audio resources must first be generated from the available sources; the classification task can then be framed as sequential tagging for more granular analysis and results.

    • Creation of a multimodal AM dataset. Build a multimodal dataset (similar to MM-USElec) from an existing text-only or audio-only dataset. Since resources are often limited to a single modality, the task encompasses retrieving content for the missing modality, performing audio-text alignment, devising a validation strategy for the alignment, and formulating a baseline for testing the resource.

    • Fallacy detection. During its initial construction, MM-USED-fallacy was tailored to fallacy classification due to constraints imposed by the available annotated resources. With the recent release of textual annotations for fallacy detection, this project aims to align MM-USED-fallacy with the new resource, setting the stage for a dataset capable of supporting multimodal fallacy detection and for comparison with text-only fallacy detection methodologies.

  • References: Eleonora Mancini, Federico Ruggeri, Andrea Galassi, and Paolo Torroni. 2022. Multimodal Argument Mining: A Case Study in Political Debates. In Proceedings of the 9th Workshop on Argument Mining, pages 158–170, Online and in Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
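The fusion-strategy project above contrasts ways of combining modalities. The two most common families can be sketched as follows (the feature vectors and weights are toy values; a real system would use learned text and audio encoders):

```python
# Two common modality-fusion strategies, on toy feature vectors.

def early_fusion(text_feats, audio_feats):
    """Feature-level fusion: concatenate modality vectors, then feed the
    joint vector to a single classifier."""
    return text_feats + audio_feats

def late_fusion(text_probs, audio_probs, w_text=0.5):
    """Decision-level fusion: weighted average of per-modality class
    probabilities produced by two separate classifiers."""
    return [w_text * t + (1 - w_text) * a
            for t, a in zip(text_probs, audio_probs)]

text_feats, audio_feats = [0.1, 0.9], [0.3, 0.2, 0.5]
fused = early_fusion(text_feats, audio_feats)    # 5-dimensional joint vector
probs = late_fusion([0.8, 0.2], [0.6, 0.4])      # approx. [0.7, 0.3]
```

Intermediate (mid-level) fusion, where modality-specific encoders share hidden layers, sits between these two extremes and is where most current architectural research happens.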

 

#3 - Learning to Quantify Arguments

  • Description: There are contexts where the fine-grained detection and analysis of argumentative components is not required, and instead a macro-scale estimate of the number of components in the text would be sufficient. For example, a retrieval system for news or scientific literature may prioritize documents with many components. We are interested in developing a case study about the quantification of arguments in textual documents, first applying existing libraries and methods, and then developing ad-hoc solutions.

  • Supervised by: Andrea Galassi (a.galassi@unibo.it)

  • Type: PARTIALLY DEFINED

  • References: "LeQua@CLEF2022: Learning to Quantify" (Esuli et al. 2022) for an example of quantification; "AMICA: An Argumentative Search Engine for COVID-19 Literature" (Lippi et al. IJCAI 2022) for a possible application of the approach.
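A standard starting point for quantification is the "classify and count" (CC) baseline: run a classifier over the items and report the fraction predicted positive, ignoring per-item accuracy. The sketch below stands in a trivial keyword rule for a real argument-component classifier:

```python
# "Classify and Count" (CC): the simplest quantification baseline.
# We estimate the *prevalence* of argumentative sentences rather than
# caring about each individual prediction. The classifier here is a
# stub keyword rule, standing in for a trained model.

def is_argumentative(sentence: str) -> bool:
    cues = ("because", "therefore", "should")
    return any(cue in sentence.lower() for cue in cues)

def classify_and_count(sentences) -> float:
    """Estimated prevalence = fraction of sentences predicted positive."""
    preds = [is_argumentative(s) for s in sentences]
    return sum(preds) / len(preds)

sentences = ["Taxes should be lower.",
             "It rained yesterday.",
             "We left early because of traffic.",
             "The meeting is at noon."]
prevalence = classify_and_count(sentences)  # 0.5
```

CC is known to be biased when the classifier's error rates are asymmetric; adjusted variants (e.g. ACC) and the methods benchmarked at LeQua correct for this, which is exactly the gap ad-hoc solutions would target.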

Neural-Symbolic Machine Learning and Neural-Symbolic NLP

Neural-Symbolic techniques aim to combine the efficiency and effectiveness of neural architectures with the advantages of symbolic or relational techniques in terms of the use of prior knowledge, explainability, compliance, and interpretability.

Despite the existence of many NeSy frameworks, few of them are suited to be applied in the NLP domain for various reasons.

We are interested in extending such frameworks to NLP tasks and in applying them to challenging problems, such as Argument Mining, or to other challenging settings.

#1 - Experimental comparison of neuro-symbolic tools

  • Description: Most neuro-symbolic approaches have been tested and evaluated independently, each on its own datasets and benchmarks, but comparisons between different approaches and across different tasks remain limited. Building on the activities of the TAILOR excellence network, this project aims to fill this gap through a systematic experimental comparison.

  • Supervised by: Andrea Galassi (a.galassi@unibo.it), Marco Lippi (marco.lippi@unimore.it)

  • Type: PARTIALLY DEFINED

#2 - Ground-specific Markov Logic Networks

  • Description: Markov Logic Networks (MLNs) are a statistical relational learning paradigm that combines first-order logic and probabilistic graphical models. Ground-specific MLNs extend MLNs by combining them with (deep) neural networks. This project aims to extend an existing implementation of MLNs in order to improve their usability across different tasks.

  • Supervised by: Andrea Galassi (a.galassi@unibo.it), Marco Lippi (marco.lippi@unimore.it)

  • Type: SET UP
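The core of an MLN is its scoring rule: the unnormalized probability of a possible world x is exp(Σᵢ wᵢ·nᵢ(x)), where nᵢ(x) counts the true groundings of weighted formula i. The toy below illustrates this on the classic smokers/friends example (the domain, formula, and weight are invented for illustration):

```python
import math

# Toy MLN scoring: score(world) = exp(sum_i w_i * n_i(world)),
# where n_i counts the true groundings of formula i in the world.

def world_score(world, weighted_formulas):
    return math.exp(sum(w * n(world) for w, n in weighted_formulas))

# A possible world: who smokes, and who is friends with whom.
world = {"smokes": {"anna"}, "friends": {("anna", "bob")}}
people = ["anna", "bob"]

# Formula (weight 1.5): friends(x, y) AND smokes(x) => smokes(y).
# n counts the groundings (x, y) where the implication holds.
def n_peer_pressure(w):
    return sum(1 for x in people for y in people
               if not ((x, y) in w["friends"]
                       and x in w["smokes"]
                       and y not in w["smokes"]))

# 3 of the 4 groundings hold (only (anna, bob) is violated),
# so the score is exp(1.5 * 3).
score = world_score(world, [(1.5, n_peer_pressure)])
```

Ground-specific MLNs replace the fixed weight per formula with a weight computed by a neural network for each individual grounding, which is what makes the combination with deep architectures interesting.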

Legal Analytics

The domain of legal documents is one of those that would benefit the most from a wide development and application of NLP tools. At the same time, tasks in this domain typically require humans with a high level of specialization and background knowledge, which is difficult to transfer to an automatic tool.

In this context, we are involved in multiple projects (see CLAUDETTE, ADELE, LAILA, POLINE, PRIMA on the Projects page), which address tasks such as: argument mining, summarization, outcome prediction, detection of unfair clauses, information extraction, and cross-lingual knowledge transfer.

Our purpose is to research and develop tools that can have a meaningful impact on the community. We are in close contact with teams of legal experts who can provide their expertise, and we have access to reserved datasets that can be used to develop automatic tools.

#1 - Different approaches to multi-lingual legal tools using multi-lingual representation

  • Description: Recently, we conducted an experiment on extending an existing tool for English to other languages such as Italian, German, and Polish. The study covered several alternatives, such as re-training the tool from scratch, label projection, and automatic translation. The purpose of this project is to explore the use of multilingual embeddings for this task, comparing different types of embeddings across several scenarios and languages.

  • Supervised by: Andrea Galassi (a.galassi@unibo.it), Marco Lippi (marco.lippi@unimore.it)

  • Type: SET UP

  • References: "Unfair clause detection in terms of service across multiple languages", Galassi et al., 2024, Artificial Intelligence and Law

 

#2 - Transformers-based tools for detection and classification of unfair clauses

  • Description: A few years ago we developed a tool for the automatic detection of unfair clauses in Terms of Service and Privacy Policy documents in English. That tool was built with technologies that may now be surpassed by more recent ones. This project aims to develop a new version of the same tool using Transformer-based technologies and more recent datasets.

  • Supervised by: Andrea Galassi (a.galassi@unibo.it), Marco Lippi (marco.lippi@unimore.it)

  • Type: SET UP

  • References: "Assessing the Cross-Market Generalization Capability of the CLAUDETTE System", Jablonowska et al., 2021, JURIX

 


Development of new datasets and linguistic resources, and experiments on them

Modern machine learning techniques have proven capable of learning even very abstract and high-level concepts, but usually with a caveat: they need plenty of data! Even those techniques that allow unsupervised or semi-supervised learning still need accurate and reliable ground truth to validate and test the final models.

For these reasons, the development of corpora and datasets is a fundamental step toward the development of new models and techniques that can address complex tasks.

We are interested in creating and testing new language resources, especially for tasks that require expert knowledge/skills and/or languages other than English. We are also interested in using these resources to develop and/or test new models and techniques.

#1 - Open for proposals

Subjectivity, biases, propaganda for Fact-Checking

We are interested in developing tools that can help detect non-objective content in news for the purpose of fact-checking.

#1 - Subjectivity

  • Description: Subjectivity detection is a task that is potentially useful in practical settings such as fake news detection and document analysis. After several discussions between computer scientists and linguistic experts, we have created a set of annotation guidelines for sentence labeling in textual documents. The aim of this project is to apply these guidelines to create a new corpus, to extend existing annotated corpora for the task of subjectivity detection (i.e., determining whether a sentence is subjective or objective with respect to its author's point of view), or to apply SotA NLP methods to existing datasets. Open problems include document classification based on sentence-level annotations and transfer learning across languages.

  • Supervised by: Andrea Galassi (a.galassi@unibo.it), Federico Ruggeri (federico.ruggeri6@unibo.it)

  • Type: PARTIALLY DEFINED
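One simple take on the open problem of document classification from sentence-level annotations is threshold-based aggregation: label a document SUBJECTIVE when the share of subjective sentences exceeds a cutoff. The sentence classifier below is a stub cue-word rule (cues and threshold are invented for illustration), standing in for a trained model:

```python
# Aggregating sentence-level subjectivity predictions into a document label.
# The sentence classifier is a stub keyword rule for illustration only.

SUBJECTIVE_CUES = ("i think", "in my opinion", "awful", "wonderful")

def sentence_is_subjective(sentence: str) -> bool:
    return any(cue in sentence.lower() for cue in SUBJECTIVE_CUES)

def document_label(sentences, threshold=0.3) -> str:
    """Label the document by the ratio of subjective sentences."""
    ratio = sum(map(sentence_is_subjective, sentences)) / len(sentences)
    return "SUBJECTIVE" if ratio > threshold else "OBJECTIVE"

doc = ["The law passed in 2020.",
       "In my opinion it was a mistake.",
       "Parliament voted 301 to 250."]
label = document_label(doc)  # 1/3 of sentences subjective -> "SUBJECTIVE"
```

More interesting formulations learn the aggregation (e.g. attention over sentence predictions) instead of fixing a threshold, which is one direction such a project could explore.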

Unstructured Knowledge Integration

We are interested in developing deep learning models that are capable of employing knowledge expressed in natural language. Such knowledge is easy to interpret and to define (compared to structured representations like syntactic trees, knowledge graphs, and symbolic rules). Unstructured knowledge increases the interpretability of models and is a step toward a more realistic form of artificial intelligence. However, properly integrating this type of information is particularly challenging due to its inherent ambiguity and variability.

#1 - Scalable Input-Conditioned Knowledge Sampling

  • Description: Correctly using unstructured knowledge (text) is a challenging task. When knowledge increases in size, there's also the problem of efficiently using it while maintaining satisfying performance. As a solution, we look for a way to learn a map between each input example and the corresponding priority distribution over the knowledge: a sampling weight is associated with each element in the knowledge set. Learning this mapping and merging multiple instances together is not trivial.

  • Supervised by: Federico Ruggeri (federico.ruggeri6@unibo.it)

  • Type: PARTIALLY DEFINED
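The priority distribution described above can be sketched as a scoring function followed by a softmax: each knowledge snippet gets a sampling weight conditioned on the input. Here the scorer is a trivial word-overlap heuristic standing in for the learned mapping (all snippets and the query are invented for illustration):

```python
import math
import random

# Input-conditioned knowledge sampling: score each knowledge snippet
# against the input, turn scores into a probability distribution with a
# softmax, then sample. The scorer is a stub word-overlap heuristic.

def scores(query: str, knowledge):
    q = set(query.lower().split())
    return [len(q & set(k.lower().split())) for k in knowledge]

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

knowledge = ["contracts require mutual consent",
             "photosynthesis converts light to energy",
             "consent can be withdrawn in contracts"]

weights = softmax(scores("when is consent valid in contracts", knowledge))
sampled = random.choices(knowledge, weights=weights, k=1)[0]
```

The scalability concern in the description enters exactly here: with large knowledge sets, computing a score for every snippet per input becomes the bottleneck, motivating learned, amortized mappings.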

#2 - Conditioned Text Generation via Textual Knowledge

  • Description: The idea is to generate text conditioned on a given set of unstructured knowledge, i.e., natural language texts. Informally, the textual knowledge defines the set of properties that the generated text must adhere to, and the generation process must make use of such knowledge to achieve the desired outcome. Several case studies can be devised under this general perspective. One good example is data augmentation, where generated data must be similar to the existing data (statistical measures can be employed to compare synthetic and original data). Complex black-box methods can be employed as well to satisfy strict requirements.

  • Supervised by: Federico Ruggeri (federico.ruggeri6@unibo.it)

  • Type: TBD

#3 - Dynamic Textual Knowledge

  • Description: In the context of unstructured knowledge integration, the available knowledge set might not be in the best possible format, i.e., the one that best suits the given data. To this end, we would like to investigate the task of knowledge updating. A possible formulation is one in which a dedicated model iteratively rewrites the available knowledge based on the performance of another model that makes use of it. The main constraint is that the updated knowledge must not be altered too much.

  • Supervised by: Federico Ruggeri (federico.ruggeri6@unibo.it)

  • Type: TBD

#4 - Integrating and extracting textual knowledge from LLMs

  • Description: LLMs are everywhere, yet their interpretability and real-world understanding remain unclear. We would like to investigate how unstructured knowledge can be integrated into and/or extracted from these models.

  • Supervised by: Federico Ruggeri (federico.ruggeri6@unibo.it)

  • Type: TBD

Dialogue Systems and Chatbots

Dialogue Systems are a pervasive technology that is becoming more and more popular in every aspect of our lives.

We are especially interested in analyzing aspects that are usually not addressed by mainstream companies.

Open Projects

NLP on Scientific Papers and Grant Proposals: Automatic Review, Summarization, and more

We are interested in developing new tools that can support researchers in the analysis of large datasets of scientific publications.

This includes two connected branches. On the one hand, we want to automatically extract all the information that we consider relevant from a scientific publication. On the other hand, we want to perform a selection of publications based on their quality, their content, and their relevance with respect to a user query.

#1 - Paper Selection and Information Extraction in Healthcare

  • Description: Given an annotated dataset regarding the use of pediatric drugs, the task concerns excluding the papers that are not relevant and extracting the important information from the remaining ones.

  • Supervised by: Andrea Galassi (a.galassi@unibo.it)

  • Type: TBD

#2 - Paper Comparison

  • Description: Given a dataset with relevant and irrelevant information on a given topic, develop techniques to compare and evaluate the similarity between two scientific papers.

  • Supervised by: Andrea Galassi (a.galassi@unibo.it)

  • Type: TBD

Knowledge Graphs and LLMs

A Knowledge Graph (KG) is a graph structure used to represent the knowledge contained in a Knowledge Base. In this representation, real-world entities (e.g. objects, facts, events) are represented as nodes and their relationships as edges.
Knowledge Graphs provide a compact, usable, and human-readable world representation; however, they are discrete in nature, which makes them hard to combine with deep learning. Moreover, KGs are subject to a number of challenges (e.g. entity alignment, ontology mismatches) that render them hard to work with, especially during evaluation.
Investigating methods to integrate KGs and LLMs, especially in the field of NLP and from a computational-linguistics point of view, could enhance LLM capabilities in areas where they are lacking, such as reasoning and maintaining consistency.
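The nodes-and-labeled-edges view above is often stored as (subject, predicate, object) triples. A minimal sketch (entities and relations are invented for illustration):

```python
# Minimal triple-store view of a Knowledge Graph: entities are nodes,
# relations are labeled edges, stored as (subject, predicate, object).

triples = {
    ("Bologna", "locatedIn", "Italy"),
    ("UniBo", "locatedIn", "Bologna"),
    ("UniBo", "instanceOf", "University"),
}

def neighbors(entity, predicate):
    """Objects reachable from `entity` via edges labeled `predicate`."""
    return {o for s, p, o in triples if s == entity and p == predicate}

places = neighbors("UniBo", "locatedIn")  # {"Bologna"}
```

The discreteness problem mentioned above is visible even here: set membership is all-or-nothing, so combining such structures with gradient-based learning requires embedding entities and relations into continuous vector spaces.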

 

#1 - Knowledge Extraction

  • Description: Given a text in natural language, extract a Knowledge Graph using Language Models. The key point of this project is to extract relevant information from the text and produce a valid (and useful) knowledge base. Open problems: integration with ontologies, new concepts, unknown concepts.

  • Supervised by: Gianmarco Pappacoda (gianmarco.pappacoda@unibo.it)

  • Type: TBD

 

#2 - Knowledge Injection

  • Description: Given a Knowledge Graph and a Language Model, explore methods for enhancing the Language Model's responses with the factual knowledge contained in the Knowledge Graph. Possible applications: question answering and information retrieval systems.

  • Supervised by: Gianmarco Pappacoda (gianmarco.pappacoda@unibo.it)

  • Type: TBD

 

#3 - Ontology learning

  • Description: Given a text and a Language Model, learn the corresponding ontology describing entities and relationships.

  • Supervised by: Gianmarco Pappacoda (gianmarco.pappacoda@unibo.it)

  • Type: TBD