
Mr Dennis Tan Lip Fong asked the Prime Minister whether the National Multimodal Large Language Model Programme will incorporate non-textual training data for commonly used languages in Singapore such as Singlish, Malay, Tamil and Chinese dialects to enhance its voice recognition capabilities and widen the accessibility of this programme to Singaporeans.
Mrs Josephine Teo (for the Prime Minister): The National Multimodal Large Language Model Programme aims to develop Large Language Models (LLMs) that are better suited to our context. The Southeast Asian Languages in One Network (SEA-LION) model that was recently released was trained on a dataset spanning more than 10 languages, including colloquial English (or Singlish), Chinese, Malay and Tamil.
In the next phase, the programme will look at techniques to incorporate speech data containing non-verbal cues, such as tone and pitch, to augment SEA-LION. To do so, the programme team will first evaluate model performance when non-textual data in standard and colloquial English are added, before moving on to other languages. As we build our local expertise in developing and training regional LLMs through this effort, we will closely monitor ongoing developments and adapt our plans as technologies in the field evolve.
Prime Minister’s Office
10 January 2024
https://sprs.parl.gov.sg/search/#/sprs3topic?reportid=written-answer-na-15440
