Language Processing Resources for Under-Resourced Languages

People all over the world need to be able to use their own language when using computers or accessing information on the Internet. Many languages lack the basic computational linguistic resources that would make this possible, and this lack has proven to be a major bottleneck in promoting the use of computers and the Internet. It is difficult to develop new linguistic resources without access to existing ones. The primary goal of the project is to develop techniques and methods for efficiently developing computational linguistic resources for new languages, based on existing tools and resources. This will be done for Amharic, the working language of the Ethiopian government. A secondary goal is to establish a network of interested institutions that can contribute to a unified approach to the development and utilization of these resources.
Summary of project objectives: 

People all over the world need to be able to use their own language when using computers or accessing information on the Internet. Yet many languages still lack the basic computational linguistic resources (such as lexica, part-of-speech taggers, parsers, corpora or treebanks) that would make this possible. This lack has proven to be a major bottleneck in promoting the use of computers and the Internet in those languages. It is difficult to develop new linguistic resources without access to existing ones. In this project we investigate how well existing linguistic knowledge can be transferred between languages with a minimum of human involvement, and we develop tools and techniques that can support such knowledge transfer. We do this by working with the case of Amharic, the official working language of the Ethiopian government, spoken by approximately 20 million people.

The primary goal of the project is to develop techniques and methods that can be used to efficiently develop computational linguistic resources for new languages. This will also result in specific linguistic tools and corpora being developed for Amharic. A secondary goal is to establish a network of interested parties and institutions that can contribute to a standardised and unified approach to the development and future utilisation of these resources in an open environment.

Partners: 
Stockholm University / KTH
SICS, Swedish Institute of Computer Science AB
Contact person: 
Lars Asker
Field of work: 
Natural Language Processing
Funding: 
SEK 800,000
Total cost: 
SEK 800,000
Project Duration: 
October 2005 - December 2006