Project Topics

Suggestions and research areas for thesis and project activities


You can find a detailed list of available research topics along with our proposed projects.

Each project is described using the following template:

[Project # - *Project Title*]

  • Description: project description

  • Supervised by: supervisor #1, supervisor #2, etc...

  • Type:

    • "to be defined (TBD)": the project describes a very generic problem without any additional detail, thus allowing high flexibility for individual proposals. Initial meetings with the candidate will be dedicated to defining a specific task (model, data, own contribution).

    • "partially defined": the project provides some details about a problem, such as the method and the available data. However, additional details have to be defined to set up a specific task. Initial meetings with the candidate will be dedicated to defining a complete task to carry out.

    • "set up": the project is almost or entirely defined. Initial meetings with the candidate will be dedicated to explaining the project and defining a proper workplan.

If you are interested

If one or more research projects meet your interest, please contact the corresponding supervisors.

Neural-Symbolic Machine Learning and Neural-Symbolic NLP

Neural-Symbolic techniques aim to combine the efficiency and effectiveness of neural architectures with the advantages of symbolic or relational techniques in terms of the use of prior knowledge, explainability, compliance, and interpretability.

Despite the existence of many NeSy frameworks, few of them are suited to the NLP domain, for various reasons.

We are interested in extending such frameworks to NLP tasks and in applying them to challenging problems, such as Argument Mining, or to other challenging settings.

Open Projects:

[Project #1 - Experimental comparison of neuro-symbolic tools]

  • Description: Most neuro-symbolic approaches have been independently tested and evaluated on specific datasets and benchmarks, but comparisons between different approaches and across different tasks remain limited. Building on the activities of the TAILOR excellence network, this project aims to carry out an experimental study to fill this gap.

  • Supervised by: Andrea Galassi, Marco Lippi

  • Type: partially defined

[Project #2 - Ground-specific Markov Logic Networks]

  • Description: Markov Logic Networks (MLNs) are a statistical relational learning paradigm that combines first-order logic and probabilistic graphical models. Ground-specific MLNs extend MLNs by combining them with (deep) neural networks. This project aims to extend an existing implementation of MLNs in order to improve their usability across different tasks.

  • Supervised by: Andrea Galassi, Marco Lippi

  • Type: set up
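To give a feeling for the MLN semantics the project builds on, the sketch below enumerates all possible worlds of a tiny, made-up smokers-and-friends domain and computes a marginal from weighted first-order formulas. This is a minimal illustration of the paradigm (P(world) ∝ exp(Σ wᵢ·nᵢ(world))), not the implementation the project would extend; the formulas, weights, and constants are invented for the example.

```python
import itertools
import math

# Toy domain with two constants; ground atoms are tuples like ("Smokes", "A").
people = ["A", "B"]

def formulas(world):
    """Return (weight, number of satisfied groundings) per weighted formula."""
    # Formula 1 (w=1.5): Smokes(x) -> Cancer(x)
    f1 = sum(1 for x in people
             if (not world[("Smokes", x)]) or world[("Cancer", x)])
    # Formula 2 (w=0.8): Friends(x, y) -> (Smokes(x) <-> Smokes(y))
    f2 = sum(1 for x in people for y in people
             if (not world[("Friends", x, y)])
             or (world[("Smokes", x)] == world[("Smokes", y)]))
    return [(1.5, f1), (0.8, f2)]

atoms = ([("Smokes", x) for x in people]
         + [("Cancer", x) for x in people]
         + [("Friends", x, y) for x in people for y in people])

def weight(world):
    """Unnormalised probability of a world: exp(sum of w_i * n_i(world))."""
    return math.exp(sum(w * n for w, n in formulas(world)))

# Enumerate all 2^8 truth assignments (feasible only for toy domains).
worlds = [dict(zip(atoms, values))
          for values in itertools.product([False, True], repeat=len(atoms))]
Z = sum(weight(w) for w in worlds)

# Marginal probability that A has cancer, induced by the weighted formulas.
p_cancer_a = sum(weight(w) for w in worlds if w[("Cancer", "A")]) / Z
print(round(p_cancer_a, 3))
```

Because formula 1 rewards worlds where smokers have cancer, the marginal for Cancer(A) ends up above 0.5; real MLN systems replace this brute-force enumeration with approximate inference.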

Development of new datasets and linguistic resources, and experiments on them

Modern machine learning techniques have proven capable of learning even very abstract and high-level concepts, but usually with a caveat: they need plenty of data! Even techniques that allow unsupervised or semi-supervised learning still need an accurate and reliable ground truth to validate and test the final models.

For these reasons, the development of corpora and datasets is a fundamental step towards the development of new models and techniques that can address complex tasks.

We are interested in creating and testing new language resources, especially for tasks that require expert knowledge/skills and/or languages other than English.
We are also interested in using these resources to develop and/or test new models and techniques.

Possible thesis/projects:

  • Advanced experiments on subjectivity corpora, especially regarding the hierarchical relationship between sentences and documents, and transfer learning across languages and tasks.
  • Development of new datasets: we are especially interested in annotations regarding sentiment/subjectivity and argumentation, but feel free to propose your own idea.
  • ...

Legal Analytics

The domain of legal documents is one of those that would benefit the most from the wide development and application of NLP tools. At the same time, performing tasks in this context typically requires a high level of specialization and background knowledge, which is difficult to transfer to an automatic tool.

In this context, we are involved in multiple projects (see ADELE and LAILA on the Projects page), which address tasks such as argument mining, summarization, outcome prediction, and cross-lingual transfer of knowledge.

Our purpose is to research and develop tools that can have a meaningful impact on the community.
We are in close contact with teams of domain experts who can provide their expertise, and we have access to reserved datasets that can be used to develop automatic tools.

Possible thesis/projects:

  • Development of advanced neural models that can learn the motivations behind the outcome of a judgment
  • Application of neural-symbolic frameworks to legal texts, in particular on a case study regarding asylum requests
  • Study and application of advanced methods of text alignment and projection of labels in a cross-lingual setting
  • ...

Dialogue Systems and Chatbots

Dialogue Systems are a pervasive technology that is becoming more and more popular in every aspect of our lives.

We are especially interested in analyzing aspects that are usually not addressed by mainstream companies.


Possible projects/thesis are:

  • Development of a chatbot that preserves privacy and is based on NLP and Computational Argumentation: we already have a proposed architecture that we would like to apply to new case studies such as healthcare, services for citizens, and immigration.
  • Development of a module for a chatbot that is focused on emotion recognition in speech.
  • ...

Argumentation Mining

One of our main research interests is Argument/Argumentation Mining (AM). It can be informally described as the problem of automatically detecting and extracting arguments from text. Arguments are usually represented as a combination of a premise (a fact) that supports a subjective conclusion (an opinion or claim).
Argumentation Mining touches a wide variety of well-known NLP tasks, spanning from sentiment analysis and stance detection to summarization and dialogue systems.


Possible projects/thesis are:

  • Application of off-the-shelf tools to new AM corpora: we want to apply existing, "standard" NLP techniques (from BiLSTMs to Transformer models such as BERT) to recent AM corpora to establish experimental baselines for future work.
  • Data augmentation for AM: one of the main problems in AM is the imbalance of datasets; we want to experiment with data augmentation techniques to create new synthetic samples that can be used to balance the datasets.
  • Argument correction/completion: given an argument, or its main claim, we would like to generate an improved version of the argument itself. The argument can be generated either from scratch or from an initial, partial argument. Improvement is expressed in terms of quality (argument ranking), via models (we use an existing argument classification model and evaluate its prediction confidence), or through task-specific conditioning factors (e.g. sentiment, stance). Additionally, the generated text can be constrained so as not to differ too much from the original input claim.
  • Tree-constrained argument detection systems: in our recent research work on argument detection with graph neural networks (Tree-constrained GNNs for AM), we introduced a method to take constituency-tree information into account for the task of argument detection. The hypothesis is that arguments might have underlying structured patterns. These patterns are integrated into the model at the architecture level and can be easily inspected. As a next step, a multi-layer architecture of the proposed model has yet to be explored. Additionally, explicit comparison models in a fashion similar to SentenceBERT are of interest to us.
  • Integrating speech for argument detection: we would like to make use of speech information (e.g. prosody) to enhance the set of features that can be used to detect arguments. Speech can be represented either by means of ad-hoc feature-extraction methods (e.g. MFCCs) or via end-to-end architectures. Only a few existing corpora offer both argument annotation layers and speech data for a given text document. An existing project was defined a few years ago and might be completed with newly published corpora.
  • The importance of punctuation in AM: punctuation is often discarded in the text preprocessing pipeline. However, it conveys useful linguistic information and should be taken into account as well. We would like to investigate the importance of punctuation in argument detection.
  • Argumentative chatbots: we are studying techniques to mix the effectiveness of natural language models with the logic provided by argumentation. Several research directions can be defined, among which: chatbots that provide information after argumentative reasoning, collection of argumentative dialogues, and argumentative prediction on dialogical documents.
  • Development of a stand-alone AM application, comprising multiple argumentation models to choose from and pre-trained models to use for prediction. The application can be based on an already existing library for AM developed by our team.
  • Analysis of the Human Values behind arguments and their relationship with other elements, such as argumentation quality and rhetorical devices.
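As a deliberately simple illustration of the data-augmentation idea in the list above, the sketch below rebalances a minority class by oversampling it with word-dropout copies. The texts, labels, and the `word_dropout` helper are invented for the example; they are not our actual pipeline, and real AM augmentation would use stronger techniques (paraphrasing, back-translation, etc.).

```python
import random

def word_dropout(text, p=0.15, rng=random):
    """Create a noisy copy of a sentence by randomly dropping tokens."""
    tokens = text.split()
    kept = [t for t in tokens if rng.random() > p]
    return " ".join(kept) if kept else text

def augment_minority(samples, target_size, seed=0):
    """Oversample a minority class up to target_size with word-dropout copies.

    samples: list of (text, label) pairs, all from the same minority class.
    """
    rng = random.Random(seed)
    augmented = list(samples)
    while len(augmented) < target_size:
        text, label = rng.choice(samples)
        augmented.append((word_dropout(text, rng=rng), label))
    return augmented

# Hypothetical minority-class (claim) sentences from an imbalanced AM corpus.
claims = [("we should ban smoking in public parks", "claim"),
          ("public transport must be free for students", "claim")]
balanced = augment_minority(claims, target_size=6)
print(len(balanced))  # 6
```

The original samples are kept unchanged; only the synthetic copies are perturbed, so the augmented set can be mixed directly into training data.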

NLP on Scientific Papers and Grant Proposals: Automatic Review, Summarization, and More


We are interested in developing new tools that can support researchers in the analysis of large datasets of scientific publications.

This includes two connected branches. On the one hand, we want to automatically extract all the information that we consider relevant from a scientific publication. On the other hand, we want to perform a selection of publications based on their quality, their content, and their relevance with respect to a user query.

  • Selection of relevant papers and information extraction for medical publications: given an annotated dataset regarding the use of pediatric drugs, exclude the papers that are not relevant and extract the important information from the remaining ones.
  • Selection of relevant papers through snowballing: given a dataset of relevant and irrelevant publications on a given topic, develop techniques to compare and evaluate the similarity between two publications.
  • ...
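A minimal sketch of the similarity component behind the snowballing idea above, assuming plain TF-IDF vectors and cosine similarity over tokenised abstracts; the abstracts are invented examples, and a real system would likely use learned embeddings instead.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (dicts) for tokenised documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical abstracts: two on-topic (pediatric drugs), one off-topic.
abstracts = [
    "dosage of ibuprofen in pediatric patients".split(),
    "pediatric dosage guidelines for ibuprofen use".split(),
    "training deep neural networks on images".split(),
]
vecs = tfidf_vectors(abstracts)
# The two drug-related abstracts should be closer to each other
# than either is to the unrelated one.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

In a snowballing loop, papers whose similarity to the relevant set exceeds a threshold would be promoted to candidates for manual screening.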

Unstructured Knowledge Integration

We are interested in developing deep learning models that are capable of employing knowledge expressed in natural language. Such knowledge is easy to interpret and to define (compared to structured representations like syntactic trees, knowledge graphs, and symbolic rules). Unstructured knowledge increases the interpretability of models and goes in the direction of defining a realistic type of artificial intelligence. However, properly integrating this type of information is particularly challenging due to its inherent ambiguity and variability.


  • Text classification and clustering relying solely on annotation guidelines: here, unstructured knowledge comes in the form of annotation guidelines. The model is trained to imitate the annotation process of a human annotator by attributing an annotation rule to each text input (e.g. a sentence). The model can be trained in a supervised fashion if there is a one-to-one mapping between inputs and annotation rules. Alternatively, it can be trained in an unsupervised fashion by enforcing such a mapping through the annotation-rule information. Advanced similarity and rule-comparison methods might be devised to achieve high performance, interpretability, and robustness. The process can be further enhanced by dynamically modifying the annotation rules to match the data-driven learning process: modifying existing rules, or generating new ones, that better fit the given data.
  • Text generation using text rules: the idea is to generate text conditioned on a given set of unstructured knowledge, i.e. natural language texts. Informally, the textual knowledge defines the set of properties that the generation system must adhere to, and the generation process must make use of such knowledge to achieve the desired outcome. Several case studies can be devised under this general perspective. One good example is data augmentation, where generated data must be similar to the existing data (statistical measures can be employed to compare synthetic and original data). Complex black-box methods can be employed as well to satisfy strict requirements.
  • Efficient knowledge usage at large scale: correctly using unstructured knowledge (text) is a challenging task. When the knowledge grows in size, there is also the problem of using it efficiently while maintaining satisfactory performance. In the MemBERT paper, we describe some sampling techniques to iteratively consider limited-size views of the available knowledge. However, it is necessary to guarantee that each input example has its own set of textual knowledge. A naive sampling strategy is destructive in the sense that it updates its internal state based on each batch's feedback. As a solution, we look for a way to learn a mapping between each input example and a corresponding priority distribution over the knowledge: a sampling weight is associated with each element in the knowledge set. Learning this mapping and merging multiple instances together is not trivial.
  • Dynamic knowledge update: in the context of unstructured knowledge integration, the available knowledge set might not be in the best possible format, i.e. the one that best suits the given data. To this end, we would like to investigate the task of knowledge update. A possible formulation is one where a dedicated model iteratively rewrites the available knowledge based on the performance of another model that makes use of it. The main constraint is that the updated knowledge must not be altered too much; for instance, the associated constituency and dependency tree structures should be preserved.
  • Automatic logic representation from text: a logic representation of a given problem is a consistent, verifiable, and interpretable format for text encoding. We would like to learn such a representation from text as a structured learning process. The objective is to preserve the content of the input text and to be able to reconstruct it. The obtained logic representation can be cast into a known logic formalism (e.g. FOL) in order to be directly employed by existing models/programs. The general approach follows the autoencoder or machine-translation formulation; learning can then be described as learning a set of predicates combined via logic operators.
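The per-input priority distribution over the knowledge set, mentioned in the efficiency bullet above, can be sketched as softmax-normalised sampling without replacement. This is a generic illustration, not the sampling procedure described in the MemBERT paper; the scores and helper names are assumptions made for the example.

```python
import math
import random

def softmax(scores):
    """Turn raw priority scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_knowledge(scores, k, seed=0):
    """Sample k knowledge indices without replacement, weighted by priority."""
    rng = random.Random(seed)
    probs = softmax(scores)
    indices = list(range(len(scores)))
    chosen = []
    for _ in range(k):
        total = sum(probs[i] for i in indices)
        r = rng.random() * total
        acc = 0.0
        for i in indices:
            acc += probs[i]
            if acc >= r:
                chosen.append(i)
                indices.remove(i)
                break
        else:
            # Guard against floating-point rounding at the boundary.
            chosen.append(indices.pop())
    return chosen

# Hypothetical per-input priority scores over 5 knowledge snippets
# (e.g. produced by a learned input-to-knowledge relevance model).
scores = [0.1, 2.0, 0.3, 1.5, 0.2]
subset = sample_knowledge(scores, k=2)
print(sorted(subset))
```

In the envisioned setting the scores would be predicted per input example by a learned mapping, so that high-priority snippets are seen more often without ever fixing the view deterministically.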