Slovenian large language model launched
Slovenia has joined the race to develop a national large language model, a generative artificial intelligence tool akin to the popular ChatGPT. An early version of the model has been launched as a stepping stone towards the development of a much larger and more capable model.
Called GaMS, an acronym for Generative Model for Slovenian, the model is being developed at the Centre for Language Resources and Technology at the University of Ljubljana as part of the state-funded project povejmo.si. Its second iteration, GaMS 9B-Instruct, is available for testing on the project web page.
Marko Robnik Šikonja, a project team member and professor at the Faculty of Computer and Information Science, told the press on 20 March that the model was built on top of Gemma, an open-source model by Google, and trained on Slovenian data.
Existing large language models are trained mostly on data in widely spoken languages such as English and so lack the cultural specifics of Slovenian, he said.
The model has been trained on Slovenian-language datasets containing ten billion words. The plan is to eventually expand these to 40 billion words, and efforts are under way to collect additional texts.
Large national institutions such as the National and University Library as well as media outlets and publishers have been invited to contribute their texts.
"Once we have processed the data collected in this campaign, the models will be upgraded. We are also planning the release of a larger model called GaMS 27B," Robnik Šikonja said.
The large language model runs on Vega, the Slovenian supercomputer, and is freely available to the public. It is designed to be used not just by the general population but also for industrial and research purposes.