Pretraining of a Swiss Long Legal BERT Model
We will scrape legal text in German, French and Italian to pretrain a Swiss Long Legal BERT model that performs better on NLP tasks in the Swiss legal domain.
Factsheet
- Lead school Business School
- Institute(s) Institute for Public Sector Transformation
- Research unit(s) Digital Sustainability Lab
- Funding organisation Others
- Duration (planned) 15.12.2021 - 31.12.2022
- Project management Prof. Dr. Matthias Stürmer
- Head of project Joël Niklaus
- Project staff Alperen Bektas, Veton Matoshi
- Partner Schweizerisches Bundesgericht
Situation
We see a clear research gap: BERT models capable of handling long multilingual text are currently underexplored (gap 1). Additionally, to the best of our knowledge, there is no multilingual legal BERT model available yet (gap 2). Tay et al., 2020b present a benchmark for evaluating BERT-like models capable of handling long input and conclude preliminarily that BigBird (Zaheer et al., 2020) is currently the best-performing variant.
Course of action
We thus propose to pretrain a BERT-like model (likely BigBird) on multilingual long text to fill the first research gap. To fill the second gap, we propose to further pretrain this model on multilingual legal text, following the domain-adaptive pretraining approach of Gururangan et al., 2020.
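To illustrate what such continued pretraining could look like, the following is a minimal sketch using the Hugging Face Transformers and Datasets libraries, assuming a BigBird masked language modeling setup; the checkpoint, corpus file and hyperparameters are placeholders, not the project's actual configuration.

from datasets import load_dataset
from transformers import (
    BigBirdTokenizerFast,
    BigBirdForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from a public BigBird checkpoint (placeholder) and continue pretraining on legal text.
model_name = "google/bigbird-roberta-base"
tokenizer = BigBirdTokenizerFast.from_pretrained(model_name)
model = BigBirdForMaskedLM.from_pretrained(model_name)

# Hypothetical plain-text corpus of Swiss legal documents in German, French and Italian.
corpus = load_dataset("text", data_files={"train": "swiss_legal_corpus.txt"})

def tokenize(batch):
    # BigBird's sparse attention handles much longer inputs than vanilla BERT's 512 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard masked language modeling objective, as in BERT pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="swiss-legal-bigbird", per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()

In practice, the same training loop can be run twice: first on general multilingual long text (gap 1), then continued on the legal corpus (gap 2), as proposed above.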