AI research on languages related to Finnish was presented at Metropolia

10.12.2024

The 9th International Workshop on Computational Linguistics for Uralic Languages, or more familiarly IWCLUL, was held at Metropolia. The event brought a large group of international researchers to the Arabia campus, where they presented their language technology research related to Uralic languages, which are languages related to Finnish.

The challenge of being endangered

Of the Uralic languages, only Finnish, Estonian and Hungarian are majority languages with official state support. The other Uralic languages are more or less endangered. The number of speakers varies from Meadow Mari with 360,000 speakers and Erzya with 300,000 speakers, to Skolt Sámi with 300 speakers and Ume Sámi with just 5 speakers. Some languages no longer have native speakers at all. However, hope is not lost even for these languages, as Valts Ernštreits, the director of the Livonian Institute, often remarks: “Every time the last speaker of Livonian is believed to have died, a new last speaker emerges from some cottage.”

Image: Jack Rueter giving a presentation about Pikachu
Jack Rueter reminded us of the importance of popular culture also in the context of endangered languages.

Modern language technology requires a lot of data, which makes AI development for smaller languages more challenging. Often there is little or no data available, and what does exist tends to vary considerably. Spelling rules are often not as clearly defined or as deeply ingrained in speakers’ habits as they are for major languages.

Large language models sparked discussion

Large language models like ChatGPT do not currently support any small Uralic language. However, researchers have devised methods to elicit responses from these models by carefully crafting prompts. In addition to my own presentation, both Flammie Pirinen and Niko Partanen reported the results of their research.

Image: Lev Kharlashkin on stage
IWCLUL was organized through volunteer efforts. In the picture, Lev Kharlashkin is inviting the next speaker to the stage.

The problem with large language models, even for Finnish, Estonian and Hungarian, is that they split words into smaller units, called tokens, in a way that is based on English. Iaroslav Chelombitko and Aleksei Dorkin proposed solutions to this issue.
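To make the tokenization problem concrete, here is a minimal toy sketch. It is not how any particular model tokenizes: the vocabulary is hypothetical, and the greedy longest-match splitter is a simplified stand-in for the subword methods used in practice. It only shows why a vocabulary built from English pieces tends to break a Finnish word into many small fragments.

```python
# Toy vocabulary: hypothetical subword pieces of the kind an
# English-centric tokenizer might contain. Not taken from any real model.
TOY_VOCAB = {"the", "book", "books", "in", "our", "also", "house"}


def greedy_tokenize(word, vocab):
    """Split a word by greedy longest match, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece starting at position i first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            # A single character is always accepted as a last resort.
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# English word that the toy vocabulary covers as a whole.
print(greedy_tokenize("books", TOY_VOCAB))

# Finnish "kirjoissammekin" ("also in our books") falls apart into
# mostly single-character pieces, because the vocabulary has no Finnish units.
print(greedy_tokenize("kirjoissammekin", TOY_VOCAB))
```

With this toy vocabulary the English word stays in one piece, while the Finnish word is split almost character by character, so the same meaning costs far more tokens.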

Metropolia’s values on display

The work done at Metropolia in the fields of sustainable development and artificial intelligence was also highlighted at the event. Melany Macías presented our research, in which AI learns to predict sustainable development goals in Finnish based on English-language data.

Image: Melany Macías talking about her results. The slide shows that the mBART model produced the best results.
Melany Macías presented the accuracy of predicting sustainable development goals.
