Corpus-based approaches to natural language processing have become popular
in recent years: machine learning or statistical methods are applied to
large-scale natural language resources to produce various kinds of
linguistic knowledge. There are many initiatives around the world to
construct and maintain these resources, and the following is a partial
list that we collected.
This survey was conducted in 2001.
Natural Language Resource Projects
GSK (1999-)
GSK (an abbreviation of ``Gengo Shigen Kyoyukiko'' in Japanese, which means
``Language Resource Consortium'') is an organization that aims to develop,
share and distribute language resources for research on speech and natural
language processing. It was organized in May 1999. It is not yet active and
is still being prepared by volunteers. The GSK office will be located at the
National Institute of Advanced Industrial Science and Technology.
(KS,02/9/4)
Spontaneous Speech: Corpus and Processing Technology
A Science and Technology Agency priority program supported by the Ministry
of Education of Japan. The major themes of this program are 1) building a
large-scale spontaneous speech corpus, 2) studying spontaneous speech
understanding, and 3) exploring spontaneous speech summarization.
(TC,02/08/05)
International Research Project by CICC (1987-1994)
CICC (Center of the International Cooperation for Computerization) promoted
an international research project on machine translation. Research
institutes in China, Indonesia, Malaysia, Thailand and Japan developed
machine translation systems for their languages. The project produced an
interlingua, electronic dictionaries, corpora and evaluation methodologies
for MT systems. It was funded by the ODA (Official Development Assistance)
of Japan. (KS,02/9/4)
LDC (Linguistic Data Consortium)
A consortium supporting language-related education, research and technology
development by creating and sharing linguistic resources: data, tools and
standards. Members are entitled to one copy of each corpus released in the
years of their membership. A commercial membership costs US$20,000 per year,
and nonprofit and government memberships cost US$2,000. LDC was founded in
1992 and has been supported by ARPA and NSF. Its office is at UPenn.
(SK,02/9/2)
ELRA (European Language Resources Association) (1995-)
A non-commercial organization that currently plays a leading role in the
accumulation and circulation of language resources in Europe. ELRA was
established in 1995, on the suggestion of the RELATOR project, with the aim
of developing, evaluating and distributing language resources in Europe, and
it mainly manages the language resources developed by EU-funded projects.
ELDA (European Language resources Distribution Agency), ELRA's operational
body, distributes the collection of language resources and handles
evaluation. The data handled there are speech databases,
(monolingual/multilingual) dictionaries, text corpora and term collections.
As text data, the results of projects such as BNC, CRATER, ECI, MULTEXT,
PAROLE, AMARYLLIS, EuroWordNet, LRsP&P and CLEF are provided. ELRA/ELDA has
been playing an active part in evaluation projects such as the French
Amaryllis project, the CLEF project of the EU and the worldwide Aurora
project. Some of the language resources used there are already in the
distribution catalog, and tools related to the evaluation will be added
soon. It sponsors the biennial LRECs (1998, 2000 and 2002). (YF,02/09/04)
TELRI (Trans European Language Resources Infrastructure) I (1995-1998) and
II (1999-2001)
An initiative based on the projects carried out under the PECO/COPERNICUS
program, which aimed at providing multilingual language resources by
connecting the centres of language technology across the whole of Europe and
the NIS countries. The first term ran from January 1995 to December 1998.
The second term was originally scheduled from January 1999 to December 2001,
but was extended by six months until June 2002. Together with the
PAROLE/SIMPLE project, it contributed to the promotion of the ELAN (European
Language Activity Network) project (1998-1999). Its archive is TRACTOR
(TELRI Research Archive of Computational Tools and Resources), through which
it constructs, collects, manages and distributes language resources in many
languages, including Central and Eastern European languages. The languages
covered are Bulgarian, Croatian, Czech, Dutch, English, Estonian, Finnish,
French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Romanian,
Russian, Serbian, Slovak, Slovene, Swedish, Turkish, Ukrainian, and Uzbek.
The resulting CD-ROM of the MULTEXT-EAST project, which applied the CES
developed by the MULTEXT project to Central and Eastern European languages,
is included in the archive. A monolingual/multilingual on-line search system
is planned for the TRACTOR archive. Commercial use requires negotiation with
the copyright holder. (YF,02/09/04)
ACQUILEX (Acquisition of Lexical Knowledge) (1989-1995)
A project aimed at building a multilingual lexical knowledge base as part of
the ESPRIT program (one of the information technology promotion programs of
the European Commission). In its first phase (1989-1992), ACQUILEX produced
tools for building a multilingual lexical knowledge base from existing
electronic dictionaries. In its second phase (1992-1995), it took on the
extraction of lexical information from machine-readable language resources.
A group of the tools developed there has been presented to the public. The
results are being used in projects of the LRE (Linguistic Resources and
Engineering) program, among others. (YF,02/09/04)
MULTILEX (A Multi-Functional Standardised Lexicon for European Community
Languages) (1990-1993)
A project which aimed at creating a standard general-purpose dictionary
description for European languages. It aimed at improving the re-usability
of lexical resources in publishing, machine translation, optical character
recognition, speech understanding and information retrieval. It developed
the MULTILEX internal format (MLEXd), a specification for the description of
monolingual, multilingual and terminological dictionaries. MLEXd follows the
SGML framework. MLEXd was adopted in the GENELEX project, which aimed at
developing a general-purpose dictionary description format, and in the
EUROLANG project, which aimed at developing a machine translation system
within the EUREKA program. A lot of software was also developed for its use.
(YF, 02/09/04)
GENELEX (1990-1994)
A project of the EUREKA program which aimed at developing a common
general-purpose dictionary format independent of theories and applications,
and at developing a method for converting existing machine-readable
dictionaries of French, Italian, Spanish, and Portuguese. Dictionary models
were first designed for each language and then unified. The result was
extended to the dictionary description of three Central European languages
(Czech, Hungarian and Polish) in the CEGLEX project of the PECO/COPERNICUS
program. (YF, 02/09/04)
CEGLEX (Central European GENELEX Model) (1995-1996)
A project of the PECO/COPERNICUS program which aimed at creating a standard
for the dictionary description of Central European languages (Czech,
Hungarian and Polish) based on the general-purpose dictionary model
developed by the GENELEX project. Its morpheme-level results are being used
in the GRAMLEX project of PECO/COPERNICUS, which developed a morpheme-level
tagging program. (YF, 02/09/04)
EAGLES (The Expert Advisory Group on Language Engineering Standards) (1993-)
One of the projects of the LRE program, aimed at promoting standardization
for (a) large-scale linguistic resources such as text corpora, computational
lexica and speech corpora, (b) manipulating such knowledge via computational
linguistic formalisms, mark-up languages and various software tools, and (c)
assessing and evaluating resources, tools, and products. EAGLES proposed the
EAGLES Guidelines. The EAGLES Guidelines have been adopted in many projects
such as PAROLE, SIMPLE and EUROWORDNET, and are the de facto standard for
corpus description. EAGLES finished in 1999, and its activities were taken
over by ISLE (International Standards for Language Engineering).
(YF,02/09/04)
DELIS (Descriptive Lexical Specifications and Tools for Corpus-based Lexicon
building) (1993-1995)
One of the projects of the LRE program, which aimed at developing tools for
efficiently building corpora for dictionary making and for searching
examples. It employed Frame Semantics, an HPSG-like syntax and Typed Feature
Structures (TFS), and developed a syntactic evidence retrieval tool (Search
Condition Generator) for morphologically and syntactically annotated text
corpora. The Search Condition Generator generates corpus queries in the
format of the English Constraint Grammar (ENGCG: Helsinki). Semantic,
syntactic and morphosyntactic descriptions of verbs of perception and
communication were given as examples in five languages: English, French,
Italian, Dutch and Danish. (YF, 02/09/04)
MULTEXT (Multilingual Text Tools and Corpora) (1994-1996)
A project aimed at creating standards and specifications for the description
and processing of corpora, and at providing tools and language resources
that use them. Following the TEI guidelines, it developed the CES (Corpus
Encoding Standard) together with the EAGLES Text Representation subgroup and
the Vassar (US)/CNRS (Fr) collaboration. CES is part of the EAGLES
guidelines. (YF,02/09/04)
Multext-East (Multilingual Text Tools and Corpora for Central and Eastern
European Languages) (1995-1997)
A project of the PECO/COPERNICUS program which aimed at applying the CES of
the MULTEXT project to corpus development in six Central and Eastern
European languages (Bulgarian, Czech, Estonian, Hungarian, Romanian and
Slovene). Some feedback on CES was provided as well. Its results are
available from TELRI. (YF,02/09/04)
RELATOR (A European Network of Repositories for Linguistic Resources)
(1993-1995)
One of the projects of the LRE program, carried out by representatives of
major Europe-wide bodies and associations such as ELSNET, ESCA and EACL, and
by the RELATOR Industrial Steering Committee (ISC), composed of
representatives of leading IT companies, publishers, PTTs and other
providers of electronic information services. The objectives of the project
were:
Creation of structured and publicly available catalogues of various
existing language resources,
Discussion with the relevant actors, such as owners of language
resources, their producers, users, funding bodies, international
organizations and related associations, about various aspects of the needs,
possible solutions and conditions for joint actions,
Establishment, management and maintenance of a European repository of
reusable language resources, and identification, examination and evaluation
of the methods for distributing them to various types of users at various
levels: organizational, technical, legal and financial,
Experiments for evaluating the collection and dissemination of existing
language resources by (i) utilizing the distributed network of ELSNET and
(ii) pressing and distributing CD-ROMs,
Presentation of final recommendations toward the establishment of the
organization that should become the center for the collection,
verification, management and dissemination of language resources.
The project laid the foundation for the establishment of ELRA and was also
involved in launching language resource preparation projects such as PAROLE.
(YF, 02/09/04)
Emphasis was placed on the development of highly generalised software
for language processing. The distinction between modules for common support
functions and modules for specific purposes was clarified by combining
general libraries and project-specific modules.
The concept of "monitor corpora" was realized: each system developer
can at any time generate, from a general text collection, his/her own
virtual text collection weighted toward the problem at hand, for use in
making dictionaries suited to that purpose. In this way, the problem of
acquiring a suitable corpus is reduced to the definition of the virtual
corpus.
Emphasis was put on the importance of combining language-independent
software with software that makes minimal assumptions.
The language-specific elements concern English, German, Dutch,
French, Italian and Portuguese. The basis of the project is the COSMAS II
system for structuring and managing large-scale text corpora, which allows
linear and nested annotations and even competing annotations. Converter and
Loader programmes for TEI-compliant data provide portability between
systems. (YF, 02/09/04)
PAROLE (Preparatory Action for Linguistic Resources Organization for
Language Engineering) (1996-1998)
An EU project which aimed at preparing general dictionaries and corpora in
each language of the EU countries. The corpora are based on the CES (Corpus
Encoding Standard) of the MULTEXT project. Its lexica adopted the standard
of the GENELEX project. Its results are available from ELRA (ELDA).
(YF,02/09/04)
ELSNET (the European Network of Excellence in Human Language Technologies)
(1991-)
Aimed at the promotion of HLT (Human Language Technologies) in Europe,
ELSNET runs seminars, workshops, a Web site and a mailing list as a network
promoting communication among people concerned with research, development
and application in the field of speech and natural language technology and
related technologies. The creation and distribution of language resources
for experiments are among its objectives. It was established under the
ESPRIT program in 1991. At present, it is one of the (about) twenty Networks
of Excellence of the IST, and it covers 26 European countries. Members are
public research bodies or private enterprises aiming at the development and
use of speech and natural language technologies. Of its about 135 members,
60% are academic bodies such as universities and 40% are from industry. The
URL of its Web site is http://www.elsnet.org/. (YF,02/09/04)
ELAN (European Language Activity Network) (1998-1999)
A project of the MLIS program aimed at bringing together existing language
resources in Europe and their possible users through the cooperation of
PAROLE and TELRI. Its results are:
Development of a common query language (ELANCQL) and search software
based on it,
Implementation of a user community network, which provides means for
awareness raising, a clear copyright policy and support for users,
Preparation of standard resources for the following languages:
Albanian, Belgian French, Belorussian, Bulgarian, Catalan,
Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German,
Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Polish, Portuguese,
Romanian, Russian, Serbian, Slovakian, Slovene, Spanish, Swedish and Uzbek.
(YF, 02/09/04)
EUROMAP Language Technologies (= HOPE: HLT Opportunity Promotion in Europe)
(2000-2003)
One of the HLT-related projects in the IST program, aimed mainly at
promoting market take-up of the results of HLT-related research and
development projects in Europe. The project is scheduled to be active from
2000 to 2003. It is currently implemented by a team of 11 National Focal
Points (NFPs) in Austria, Belgium/Netherlands, Bulgaria, Denmark, Finland,
France, Germany, Greece, Italy, Spain and the UK. It initially focuses on
national-level activity by the NFPs and will then extend to cross-border
activities. The main objectives are as follows.
promotion of projects aimed at ready-to-market results
acceleration of awareness of the benefits of HLT-enabled systems,
services and applications
boosting the participation of best-of-class technology developers in
research projects
improving the match between project targets and user needs
promotion of beta tests, demonstrations, real-time utilisation
monitoring and so on, in cooperation with users and society.
ENABLER (European National Activities for Basic Language Resources)
(2001-2003)
A new project started in December 2001, aiming at promoting cooperation
between the language resource development activities carried out in each
country for its own language, transferring technologies and tools between
languages, and facilitating the development of multilingual language
resources. The main objectives are as follows:
Network building among the activities of each country, and building of
a publicly available repository
Holding of an official forum for coordination, exchange of information,
data and best practices, sharing of tools, and co-operation on specific
issues
Facilitation of technology transfer between languages by boosting the
compatibility and mutual accessibility of the results of the activities in
each country
Avoidance of duplication and divergence of activities by promoting the
exchange of tools, specifications and validation protocols
Contribution to the creation of an EU centre for harmonisation of
metadata description of speech, text, multimedia and multimodal linguistic
resources
Promotion of the industrial exploitation of language resources
Fostering the realisation of an international co-operative framework for
the provision of language resources
It is one of the HLT projects of the IST program. (YF, 02/09/04)
CLASS (Collaboration in Language and Speech Science and technology)
(2000-2002)
One of the HLT-related projects of the IST program, aimed at coordinating
the activities of the HLT-related projects through clustering. It
coordinates cluster and cross-cluster activities by fostering communication,
collaboration, international activities and liaison with the Commission. It
fosters value-added project collaboration on key topics to enable the
projects to achieve more and better results and to have stronger and more
visible impacts. CLASS promotes the synergy and critical mass of research
and the sharing of the complementary knowledge, expertise and know-how
represented in the projects. The following four clusters have been formed:
natural and multimodal interactivity
cross-lingual knowledge management
intelligent interactive information presentation
speech and language technology evaluation
It runs mailing lists and a website. Data from the HLT-related projects are
also being accumulated, but for now their use is restricted to its members.
(YF, 02/09/04)
Network-DC (Network of Regional and International Data Centers) (2000-2002)
An ELDA-LDC collaboration project aimed at constructing a global
multilingual language resources network. It will conduct the production,
acquisition, normalisation, certification and distribution of spoken and
written language resources for research and technology development.
NETWORK-DC will set up a network of data centres, thus facilitating access
to the electronic language resources currently managed by many different
regional data centres. ELDA plans to create up to five broadcast news
corpora in different languages, and LDC plans to create a linguistic corpus
including significant samples of the 45 broadcast languages used by the
Voice of America. (YF, 02/09/04)
BALKANET (Balkan Word Net) (2001-2004)
One of the HLT-related projects of the IST program, aimed at building
multilingual semantic language databases for various languages of the Balkan
Peninsula based on the model of EuroWordNet. The six target languages are
Greek, Turkish, Bulgarian, Romanian, Czech and Serbian. BalkaNet aims at
effectively combining lexicography and modern computation, and tries to
organise the lexical information of Balkan languages by representing
semantic relationships between word meanings. A WordNet is developed for
each language first, and these are then combined into one common lexical
database. It also aims at expanding the general idea of EuroWordNet by
probing into the relations between the Romance languages and the Balkan
languages. (YF, 02/09/04)
TEI-C (Text Encoding Initiative Consortium) (2000-)
TEI-C is a consortium which was established in December 2000 to sustain,
develop and financially support TEI. It has executive offices in Bergen,
Norway. (KS,02/9/4)
Project Gutenberg (1971-)
Project Gutenberg aims at gathering electronic documents which are out of
copyright, such as classic literature, and making them freely available via
its web site. It also accepts submissions of electronic documents. It was
started by Michael Hart in 1971 and is now run by volunteers. (KS,02/9/4)
OTA (Oxford Text Archive) (1976-)
This project gathers and distributes many documents and provides great
benefits for both authors and readers. It is organized by Oxford University
and was started by Lou Burnard in 1976. In principle, anyone can submit
documents to OTA and download documents from OTA freely. More than 2,500
documents in over 25 different languages are currently available in OTA.
(KS,01/10/26)
OLAC (Open Language Archives Community) (2000-)
OLAC is an international partnership for distributing digital language
resources. It is investigating methodologies for describing language
resources using Unicode and XML mark-up, proposing a metadata set (data
about data, such as name, subject language, creator, date, description,
etc.) for language resources, and developing a catalogue of language
resources using the metadata set. It was established in December 2000. It
is funded by NSF. (KS,02/9/4)
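As a rough illustration of what a record in such a metadata set might look
like, the following sketch stores the fields named above and serializes them
to XML. The class name, the element names and the sample values are
assumptions made for this example only; they are not the element names of
the actual OLAC metadata set.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, asdict


@dataclass
class ResourceRecord:
    """Hypothetical metadata record; field names follow the prose above."""
    name: str
    subject_language: str
    creator: str
    date: str
    description: str


def to_xml(record: ResourceRecord) -> str:
    """Serialize the record as one XML element per metadata field."""
    root = ET.Element("record")  # "record" is an assumed element name
    for field_name, value in asdict(record).items():
        ET.SubElement(root, field_name).text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample values, for illustration only.
sample = ResourceRecord(
    name="Sample corpus",
    subject_language="ja",
    creator="A. Author",
    date="2002-09-04",
    description="Illustrative entry",
)
sample_xml = to_xml(sample)
```

A catalogue in this sketch would simply be a collection of such records that
can be searched by any of the fields.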
WordNet (1985-)
WordNet is an English thesaurus available free of charge. It was originally
developed by George A. Miller at Princeton University in 1985, and is still
being revised by many researchers. The newest version of WordNet is 1.7.1.
It is partially funded by NSF. (KS,02/9/4)
EuroWordNet (1996-1999)
EuroWordNet is a WordNet for seven European languages: Dutch, Italian,
Spanish, German, French, Czech and Estonian. Each synset (word class) in
EuroWordNet is linked to a synset in WordNet, so that it is possible to go
from the words in one language to similar words in any other language.
Furthermore, EuroWordNet and WordNet share a top ontology. It was developed
from March 1996 to June 1999. The activity was a part of the research
project ``Human Language Technologies''. (KS,02/9/4)
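The cross-language lookup described above can be sketched as follows. This
is only a toy model: the table `SYNSET_LINKS`, the identifiers such as
`wn:dog-1` and the function `similar_words` are hypothetical and do not
reflect the actual EuroWordNet data format.

```python
from typing import Dict, List

# Hypothetical data: for each language, the words of a synset are stored
# under the identifier of the WordNet synset it is linked to.
SYNSET_LINKS: Dict[str, Dict[str, List[str]]] = {
    "nl": {"wn:dog-1": ["hond"], "wn:house-1": ["huis", "woning"]},
    "es": {"wn:dog-1": ["perro", "can"], "wn:house-1": ["casa"]},
    "it": {"wn:dog-1": ["cane"]},
}


def similar_words(word: str, source: str, target: str) -> List[str]:
    """Follow word -> linked WordNet synset -> words in the target language."""
    results: List[str] = []
    for wordnet_id, words in SYNSET_LINKS[source].items():
        if word in words:
            # The shared WordNet identifier acts as the bridge between languages.
            results.extend(SYNSET_LINKS[target].get(wordnet_id, []))
    return results
```

For example, `similar_words("hond", "nl", "es")` follows the Dutch synset to
its linked identifier and returns the Spanish words stored under it. A
language that has no synset linked to the same identifier yields an empty
list.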
ISLE (International Standards for Language Engineering) (2000-)
The aim of ISLE is to develop international standards for language
resources. Mainly researchers in the EU and the USA work on the project.
There are three working groups: the Computational Lexicons WG, the Natural
Interaction and Multimodality WG and the Evaluation WG. It runs for 27
months from January 2000. ISLE is a part of the research project ``Human
Language Technologies'', and is also supported by EAGLES. (KS,02/9/4)
COCOSDA (the International Committee for the Co-ordination and
Standardisation of Speech Databases and Assessment Techniques for Speech
Input/Output)
The International Committee for the Co-ordination and Standardisation of
Speech Databases and Assessment Techniques, COCOSDA, has been established to
encourage and promote international interaction and cooperation in the
foundation areas of Spoken Language Processing, especially for Speech
Input/Output. The importance of collaboration which transcends national
boundaries is increasingly recognized. This is both because of the practical
and scientific value attached to systematic work which encompasses a range
of languages and analytic approaches and also because of the practical need
to establish common methods of performance description and quantitative
comparison. (from COCOSDA's Page,02/08/05)
LINGUAPAX Project
A UNESCO project to promote respect for all language communities and to
protect language diversity for intercultural understanding and world peace.
(TC,02/08/05)
The Less Commonly Taught Languages (LCTL) Project
A project of the Center for Advanced Research on Language Acquisition at the
University of Minnesota, USA, to help advance the teaching and learning of
the less commonly taught languages, which include all of the world's
languages except English, French, German, and Spanish. (TC,02/08/05)
Research Project by KORTERM (1998-)
KORTERM (Korea Terminology Research Center for Language and Knowledge
Engineering) is a research institute working on terminology in Korea. This
project, conducted by Choi Key-Sun at KAIST (Korea Advanced Institute of
Science and Technology), aims at developing, distributing and standardizing
Korean technical term dictionaries. It has 4 phases. Furthermore, KORTERM is
involved in ISO/TC37, the ISO technical committee on terminology and other
language resources. (KS,02/9/4)
Evaluation Projects
TIPSTER Text Program (1991-1998)
A project led by DARPA, CIA, and NIST to enhance text processing
technologies through the cooperation of researchers and developers in
government, industry, and academia. It focused on the following
technologies.
Document Detection: the capability to locate documents containing the
type of information the user wants from either a text stream or a store of
documents.
Information Extraction: the capability to locate specified information
within a text.
Summarization: the capability to condense the size of a document or
collection while retaining the key ideas in the material.
Evaluation workshops such as TREC, MUC, and SUMMAC were held under the name
of TIPSTER. In these workshops, the participants are given the same
questions (e.g., search queries, full texts for summarization, and so on)
and submit answers (e.g., retrieved documents, summaries, and so on). The
organizers then evaluate the submitted answers to compare the techniques
used. The purpose of the evaluation workshops is not to compete with each
other but to exchange technical ideas in order to advance the target field.
TIPSTER ended in fall 1998 and was taken over by TIDES. (MI,02/01/23)
TIDES (Translingual Information Detection, Extraction, and Summarization)
(DARPA site, NIST site) (1999-)
TIDES is the successor of TIPSTER but focuses more sharply on the
translingual aspect of information processing. The goal of TIDES is to
dramatically reduce the amount of time it takes to perform cross-lingual
retrieval, information extraction, summarization and interpretation, and
machine translation for a new language. TIDES supports the DUC, TREC, and
TDT workshops. (MI,02/01/23)
TREC (Text Retrieval Conference) (1992-)
TREC is the first evaluation workshop on information retrieval. Its purposes
are constructing large-scale test collections, providing an opportunity for
technical exchange between participants, and establishing a methodology for
system evaluation. TREC began with the following two tasks.
Ad hoc Task
This is conventional text retrieval. Participants are given document
collections and queries, and retrieve the documents relevant to each query.
A method called "pooling" is used to collect potentially relevant documents
from the submitted results, and assessors judge the relevance of the
documents in the pool. Systems are evaluated by the measures of "recall"
and "precision".
Routing Task (filtering)
The target application of this task is SDI (selective dissemination of
information, i.e., information filtering). Participants are given several
documents known to be relevant to the search topics, and try to retrieve
another set of relevant documents from additionally provided document
collections.
In addition to the above two tasks, a number of new tasks have been proposed
and conducted in successive TREC workshops. The latest, TREC 2001, contained
the tasks of Web retrieval, Cross-lingual retrieval, Filtering, Interactive
retrieval, and Video retrieval. (MI,02/01/24)
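The pooling step and the recall and precision measures of the Ad hoc Task
can be sketched as follows. This is a minimal illustration, not the official
TREC evaluation procedure: the pool depth, the run names, the document
identifiers and the relevance judgments are all made up, and recall and
precision are computed here in their simple set-based form.

```python
from typing import Dict, List, Set, Tuple


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Union of the top-`depth` documents of every submitted ranked list."""
    pool: Set[str] = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


def precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall of one run for one topic."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked results of two systems for a single topic.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d3", "d5", "d1", "d6"],
}
pool = build_pool(runs, depth=2)   # only these documents go to the assessors
relevant = {"d1", "d3", "d5"}      # hypothetical judgments over the pool
precision_a, recall_a = precision_recall(runs["system_a"], relevant)
```

Because assessors only see the pool, documents outside it are treated as not
relevant in this sketch, which is why the judged set is a subset of the pool.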
MUC (Message Understanding Conference) (1990-1998)
MUC is an evaluation workshop for information extraction. The latest, MUC-7
(1998), had the following tasks.
Named Entity Task
Extracting names of persons, organizations, locations, times, and so on.
Template Element Task
Extracting attributes of proper nouns.
Template Relation Task
Extracting relations between proper nouns.
Co-reference Task
Extracting co-reference relations between nouns (e.g., anaphoric
relations).
Scenario Template Task
Extracting events related to a given scenario (e.g., a specific
accident).
Multi-lingual Entity Task (MET)
Multi-lingual version of the Named Entity Task.
News articles were used in MUC. (MI,02/01/24)
SUMMAC (TIPSTER Text Summarization Conference) (1998)
SUMMAC is the first evaluation workshop on text summarization. Since there
is no consensus on how to judge whether a summary is good or bad, exploring
the methodology of evaluation was also a target of SUMMAC. SUMMAC had the
following tasks.
Ad hoc Task
Summaries are evaluated through text retrieval. Here, assessors read
summaries instead of the original texts, and it is checked whether they can
correctly judge relevance from the summaries alone.
Categorization Task
Assessors categorize texts by reading only the summaries, and the results
are compared with the categorization obtained by reading the full texts.
Q&A
Using only the summaries, assessors answer questions that can be answered
correctly when the original texts are used.
Generally speaking, summaries can be categorized into indicative ones and
informative ones. Indicative summaries are used as references to the
original texts. Informative summaries can be used in place of the original
texts. The above "Ad hoc Task" evaluates indicative summaries and "Q&A"
evaluates informative summaries. The difference between the "Ad hoc Task"
and the "Categorization Task" is that the former considers the user's bias
(i.e., information need) in reading summaries. The summaries in the former
are therefore often called query-biased summaries and those in the latter
generic summaries. SUMMAC was taken over by DUC. (MI,02/01/29)
DUC (Document Understanding Conference) (2000-)
DUC is the successor of SUMMAC. In addition to summarization of a single
document, summarization of a group of documents was evaluated in DUC 2001.
In this task, about ten news articles related to an event were given as a
group of source documents. Systems summarized those documents into 400, 200,
100, and 50 words. The submitted summaries were compared with summaries
written by humans. (MI,02/01/29)
TDT (Topic Detection and Tracking) (1997-)
An evaluation workshop for detecting and tracking topics in streaming
information such as video news. English and Chinese (Mandarin) are covered.
TDT has the following tasks.
Story Segmentation
Detecting boundaries between cohesive stories in streaming audio data.
Textual transcriptions are also available.
Topic Tracking
Keeping track of stories similar to a set of example stories.
Topic Detection
Building clusters of stories that discuss the same topic. Unlike
conventional clustering, the target data are given in chronological order.
First Story Detection
Detecting whether a story is the first story of a new, unknown topic.
Link Detection
Detecting whether or not two stories are topically linked.
(MI,02/01/29)
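To make the Link Detection task concrete, the following sketch decides
whether two stories are topically linked by comparing their word-count
vectors with cosine similarity. This is only one simple baseline chosen for
illustration; TDT does not prescribe this method, and the whitespace
tokenization, the threshold of 0.3 and the sample stories are assumptions.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def linked(story_a: str, story_b: str, threshold: float = 0.3) -> bool:
    """Return True when the two stories are judged to share a topic."""
    vector_a = Counter(story_a.lower().split())  # naive tokenization
    vector_b = Counter(story_b.lower().split())
    return cosine(vector_a, vector_b) >= threshold


# Made-up stories, for illustration only.
story_1 = "earthquake hits coastal city"
story_2 = "coastal city earthquake damage report"
story_3 = "stock market closes higher"
```

With these samples, `linked(story_1, story_2)` holds because the stories
share three of their words, while `linked(story_1, story_3)` does not. A
similar comparison against all earlier stories could serve as a starting
point for First Story Detection.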
CLEF (Cross Language Evaluation Forum) (2000-)
CLEF is an evaluation workshop on cross-language information retrieval for
European languages. The following are the major tasks of CLEF.
Multi-lingual Information Retrieval
This task requires searching a multi-lingual document
collection for relevant documents. Using a selected topic
language, the goal is to retrieve documents in all languages of
the collection.
Bilingual Information Retrieval
In this task, any topic language can be used to search
a document collection in a language different from that of the
topics, for example searching an English collection with Dutch topics.
Monolingual Information Retrieval
Here the topic language and the target language are the
same. Since the English monolingual task is popular in TREC, CLEF
focuses on European languages other than English.
(MI,02/01/29)
IREX
(Information Retrieval and Extraction Exercise)
(1998-1999) IREX was an evaluation workshop for Japanese Named
Entity (NE) and Japanese Information Retrieval (IR). The target
corpus in IREX was a set of Mainichi news articles.
Named Entity (NE)
It is similar to the NE or MET task of MUC, that is,
extracting names of persons, organizations, locations, times, and
so on.
Information Retrieval (IR)
It is similar to the ad-hoc retrieval task of TREC.
The IREX Workshop was held along with the NTCIR-1 Workshop. (MI,02/01/29)
NTCIR (NII-NACSIS
Test Collection for IR Systems) Project
(1998-) An evaluation project led by NII focusing on Asian
languages. NTCIR includes information retrieval, question
answering, text summarization, etc. As for the tasks of
information retrieval, NTCIR has made test collections of research
paper abstracts (Japanese), news articles (Japanese/Chinese), web
pages (Japanese domain), and patents (Japanese/English
abstracts). Cross-lingual retrieval of Asian languages has also
been a major target of NTCIR. Here is the list of tasks of NTCIR-3.
Cross-lingual information retrieval task
Both the search targets and the search topics were
multilingual, that is, Chinese, Korean, Japanese, and
English. The search targets were news articles.
Web task
The search targets were web pages (10 GB or 100 GB) collected
from the Japanese "jp" domain.
Text Summarization Challenge (TSC-2)
See the "TSC" section for details.
Question Answering Challenge (QAC-1)
It was similar to the QA tasks of SUMMAC and TREC. The resource
used for answering was Japanese news articles.
Patent Retrieval Task
The search targets were two years of patent full texts. In addition to
the full texts, participants could use five years of patent
abstracts in Japanese and English (a paired corpus) to submit
cross-lingual retrieval results.
(MI,03/06/17)
TSC (Text Summarization
Challenge)
(1999-) TSC is an evaluation workshop for Japanese text
summarization. In TSC-1 (at NTCIR-2), both extract-based
and free-style summaries were evaluated with intrinsic
and extrinsic methods. In the intrinsic evaluation, human experts
directly evaluated the quality of submitted summaries, and in the
extrinsic evaluation, an IR application was used to evaluate the
submitted summaries. TSC-2 dealt with multi-document
summarization. The target documents in TSC-1 and TSC-2 were Japanese
news articles. (MI,02/01/29)
SENSEVAL
(1998-) SENSEVAL is an evaluation exercise on word sense disambiguation.
The first SENSEVAL was held in the summer of 1998 for English, French and
Italian, and 23 research groups participated in it. The second SENSEVAL,
SENSEVAL-2, was held in the spring of 2001 for 9 languages (English, Japanese,
etc.), and 37 research groups participated in it. (KS,02/9/4)
Other
Projects
Language Understanding
and Action Control The project aims at creating a new academic
discipline concerning language and action, and at establishing its basis. A
3D software robot world, free from mechanical constraints, is constructed,
and the robots are operated via natural language dialogue. The
project is supported by MEXT (Ministry of Education, Culture, Sports, Science
and Technology, Japan), started in April 2001, and ends in March 2006.
(SK,02/9/2)
Intelligent Media
Technology for Supporting Natural Communication between People The
project introduces a new concept, conversational content, which is a new
intelligent communication medium that can talk with people. Based on this
concept, the project aims at establishing a new technology which supports
people in communicating with each other through rich means of communication such as
speech, facial expression, and gesture, without being aware of the computer's
existence. The project is supported by MEXT (Ministry of Education, Culture,
Sports, Science and Technology, Japan), started in April 2001, and ends in
March 2006. (SK,02/9/2)
Kototoi
Project The project aims at realizing flexible information retrieval
by explicitly relating unstructured texts to background knowledge (ontology).
An intelligent information/knowledge management system will be constructed by
exploiting the results in four important technologies: natural language
processing such as information extraction, ontology/knowledge management,
network crawlers, and information presentation based on dialogue. The project is
supported by JST (Japan Science and Technology Corporation), started in
December 2000, and ends in March. (SK,02/9/2)
Human-centered Intelligent Information Access Technology The project
aims at establishing a new technology which enables flexible access to
information content depending on its semantics and context, based on the
shared understanding of semantics and situation by people and artifacts.
Research topics are the structuring of content semantics, the utilization of
semantically structured content, user modeling, and user interfaces. The
project is supported by JST (Japan Science and Technology Corporation),
started in December 2000, and ends in March 2005. (SK,02/9/2)
Center for
Integrated Acoustic Information Research (CIAIR) The Center for Integrated Acoustic Information
Research (CIAIR) is a project supported by the Ministry of Education, Culture,
Sports, Science and Technology of Japan. This project tries to create a center
of integrated research concerning the human-sound relation in terms of the
following five questions: 1) how can sound be spatially located?; 2) how can
the characteristics of sound be analyzed and synthesized?; 3) how can speech
and characters be transformed into each other?; 4) how do humans communicate
with speech?; and 5) how do humans interpret sounds? (TC,02/08/05)
Verbmobil
Project Verbmobil was a long-term project of the Federal Ministry of
Education, Science, Research and Technology of Germany. The long-term aim,
which was reached, was the development of a mobile system for the
translation of spontaneous speech in face-to-face situations.
(TC,02/08/05)
CHILDES project for Japanese (JCHAT) A project to develop the Japanese version of the
CHILDES (Child Language Data Exchange System) database for research on language
acquisition. (TC,02/08/05)
ETHNOLOGUE Languages of the
World A language database of more than 1,300 languages in the world
published by SIL International, a service organization that works with people
who speak the world’s lesser-known languages. (TC,02/08/05)
The International Clearing House for Endangered Languages The
homepage of the International Clearing House for Endangered Languages at the
Department of Asian and Pacific Linguistics of the University of Tokyo,
Japan. (TC,02/08/05)
UDCC (Universal Decimal Classification
Consortium) UDCC is a consortium which administers and exploits UDC. It
created an international database, the Master Reference File (MRF), which
can serve as the source of many kinds of UDC editions, and updated it once a year.
The copyright of the Japanese version was held by the Information Science and
Technology Association, Japan, which is a member of the UDC management committee.
(SK,02/9/2)
Links
to Corpus
Linguistic Grammars
Online (LinGO) A large-scale grammar for English based on the
HPSG framework, which is used in the Verbmobil machine translation system.
SEU (The Survey of English
Usage)(1959-) The Survey of English Usage (SEU) is an English Language
research unit, based in the Department of English Language and Literature at
University College London. (from SEU’s homepage, 02/08/05)
Brown Corpus (The
Standard Corpus of Present-Day Edited American English)(1961-1964) This
Standard Corpus of Present-Day American English consists of 1,014,312 words
of running text of edited English prose printed in the United States during
the calendar year 1961. So far as it has been possible to determine, the
writers were native speakers of American English. Although all of the material
first appeared in print in the year 1961, some of it was undoubtedly written
earlier. However, no material known to be a second edition or reprint of
earlier text has been included. (from the Brown Corpus Manual, Brown
University, 1964/1979) (TC,02/09/02)
LOB Corpus (The
Lancaster-Oslo/Bergen Corpus)(1970-1978) The Lancaster - Oslo/Bergen
(LOB) Corpus is a million-word collection of present-day British English
texts, compiled under the direction of Geoffrey Leech, University of
Lancaster, and Stig Johansson, University of Oslo, in collaboration with Knut
Hofland, Norwegian Computing Centre for the Humanities, Bergen. Like its
American counterpart, the Brown Corpus (see Francis and Kucera 1979), it
contains 500 text samples of approximately 2,000 words distributed over 15
text categories (from The Tagged LOB Corpus manual, Norwegian Computing Centre
for the Humanities, Bergen, 1986). (TC,02/09/02)
Freiburg-Brown
Corpus(1991-1996) In 1991, Christian Mair took the initiative to
compile a set of corpora that would match the well-known and widely used Brown
and LOB corpora with the only difference that they should represent the
language of the early 1990s. The project started in April 1991 with the
compilation of the British Press Section of the new FLOB corpus. 1992 saw the
beginning of the new Freiburg Brown Corpus, Frown. To speed up the process of
compilation, Christian Mair was granted additional funding by the DFG (German
Research Foundation) for the years 1994-1996 within the
Sonderforschungsbereich (special research group) 321 'Orality and Literacy'.
(from Manual of Information to accompany The Freiburg - Brown Corpus of
American English, Albert-Ludwigs-Universität Freiburg, 1999)
(TC,02/09/02)
Freiburg-LOB
Corpus(1991-1996) In 1991, Christian Mair took the initiative to
compile a set of corpora that would match the well-known and widely used Brown
and LOB corpora with the only difference that they should represent the
language of the early 1990s. The project started in April 1991. To speed up
the process of compilation, Christian Mair was granted additional funding by
the DFG (German Research Foundation) for the years 1994-1996 within the
Sonderforschungsbereich (special research group) 321 'Orality and Literacy'.
(from Manual of Information to accompany The Freiburg - LOB Corpus of British
English, Albert-Ludwigs-Universität Freiburg) (TC, 02/09/02)
Kolhapur
Corpus(1986) The present corpus of Indian English was conceived in
Lancaster in 1978 when the main author of this work was researching under the
supervision of Professor G.N. Leech. On his return to India he started the
project with an initial grant from the Shivaji University in 1980, and carried
it forward with a substantial financial assistance from the U.G.C.
supplemented by support from various other sources including personal funds.
(from MANUAL OF INFORMATION TO ACCOMPANY THE KOLHAPUR CORPUS OF INDIAN
ENGLISH, FOR USE WITH DIGITAL COMPUTERS, DEPARTMENT OF ENGLISH SHIVAJI
UNIVERSITY, KOLHAPUR, INDIA, 1986) (TC,02/09/02)
ACE (The Australian
Corpus of English)(1988-1989) The Australian Corpus of English (ACE)
was compiled in the department of Linguistics at Macquarie University NSW
Australia, from 1986 on. It was supported by a small grant 1988-9 from the
Australian Research Grants Council, and by a series of grants from Macquarie
University. Other support came from the National Languages and Literacy
Institute of Australia and the University of New South Wales. The project
was conceived by Pam Peters, Peter Collins and David Blair, and was carried
through with the help of a number of research assistants, notably Alison
Moore, Elizabeth Green, Robert Jenkins, Catherine Martin, Diana Grace, Heather
Middleton, Wendy Young and Adam Smith. Computational help and advice was
provided by Harry Purvis and Steve Cassidy, and the project enjoyed continuous
infrastructure support from Macquarie's Speech, Hearing and Language Research
Centre. (from MANUAL OF INFORMATION THE AUSTRALIAN CORPUS OF ENGLISH (ACE),
MACQUARIE UNIVERSITY) (TC,02/09/02)
ICE-GB (The
British component of the International Corpus of English)(1990-) ICE-GB
is the British component of the International Corpus of English (ICE). ICE
began in 1990 with the primary aim of providing material for comparative
studies of varieties of English throughout the world. Twenty centres around
the world are preparing corpora of their own national or regional variety of
English. (from ICE-GB's Homepage) (TC)
COLT (The Bergen Corpus of London
Teenage Language) The Bergen Corpus of London Teenage Language (COLT)
is the first large English Corpus focusing on the speech of teenagers. It was
collected in 1993 and consists of the spoken language of 13 to 17-year-old
teenagers from different boroughs of London. The complete corpus, half a
million words, has been orthographically transcribed and word-class tagged,
and is a constituent of the British National Corpus. (from COLT's homepage)
(TC, 02/09/02)
CHILDES (Child Language Data
Exchange System)(1984-) The CHILDES system provides tools for studying
conversational interactions. These tools include a database of transcripts,
programs for computer analysis of transcripts, methods for linguistic
coding, and systems for linking transcripts to digitized audio and video. (from
CHILDES's homepage) (TC, 02/09/02)
Helsinki Corpus of English Texts: Dialect Part A corpus of dialects of
contemporary English in England. (TC,02/09/02)
SEC (The
Lancaster/IBM Spoken English Corpus)(1984-1991) The Lancaster/IBM
Spoken English Corpus consists of 52000 words of prepared and semi-prepared
British English speech. It was developed in a collaborative venture between
the Linguistics Department at Lancaster and IBM UK Scientific Centre in the
period 1984 -1991. (from SEC's homepage) (TC,02/09/02)
MARSEC (The
Machine Readable Spoken English Corpus) The Marsec corpus of spoken
standard southern British English is a development of the Lancaster/IBM spoken
English corpus (SEC). Whereas the SEC edition of the corpus comprises
annotated orthographic transcriptions of the spoken material it does not
include the acoustic material. The MARSEC edition of the corpus adds the
acoustic recordings on a second CD-ROM (see below for ordering details) and
includes word-level time-alignment (downloadable below) between the
transcripts and the acoustic signal. (from MARSEC's homepage)
(TC,02/09/02)
LLC (The
London-Lund Corpus of Spoken English) (1990) As the name implies, the
London-Lund Corpus of Spoken English (LLC) derives from two projects. The
first is the Survey of English Usage (SEU) at University College London,
launched in 1959 by Randolph Quirk, who was succeeded as Director in 1983 by
Sidney Greenbaum. The second project is the Survey of Spoken English (SSE),
which was started by Jan Svartvik at Lund University in 1975 as a sister
project of the London Survey. The goal of the Survey of English Usage is to
provide the resources for accurate descriptions of the grammar of adult
educated speakers of English. For that purpose the major activity of the
Survey has been the assembly and analysis of a corpus comprising samples of
different types of spoken and written British English. The original target for
the corpus of one million words has now been reached, and the corpus is
therefore complete. (from LLC's homepage) (TC,02/09/02)
PoW (The
Polytechnic of Wales Corpus) (1978-1984) The corpus was originally
collected between 1978-84 for a child language development project to study
the use of various syntactico-semantic constructs in children between the ages
of six and twelve. A sample of approximately 120 children in this age range
from the Pontypridd area in South Wales was selected, and divided into four
cohorts of 30, each within three months of the ages 6, 8, 10, and 12. (from
PoW's homepage) (TC,02/09/02)
SBCSAE (The
Santa Barbara Corpus of Spoken American English) The Santa Barbara
Corpus of Spoken American English is based on hundreds of recordings of
natural speech from all over the United States, representing a wide variety of
people of different regional origins, ages, occupations, and ethnic and social
backgrounds. It reflects many ways that people use language in their lives:
conversation, gossip, arguments, on-the-job talk, card games, city council
meetings, sales pitches, classroom lectures, political speeches, bedtime
stories, sermons, weddings, and more. (from SBCSAE's homepage)
(TC,02/09/02)
BOE (The Bank
of English)(1991-1995) The Bank of English is a collection of samples
of modern English language held on computer for analysis of words, meanings,
grammar and usage. In linguistics and lexicography such a collection is termed
a corpus. In January 2002 the latest release of the corpus amounted to 450
million words and it continues to grow with the constant addition of new
material. (from BOE's homepage) (TC,02/09/02)
BNC (The British National Corpus)
(1991-) The British National Corpus (BNC) is a 100 million word
collection of samples of written and spoken language from a wide range of
sources, designed to represent a wide cross-section of current British
English, both spoken and written. (from BNC's homepage) (TC,02/09/02)
ICE (The
International Corpus of English)(1990-) The International Corpus of
English (ICE) began in 1990 with the primary aim of collecting material for
comparative studies of English worldwide. Fifteen research teams around the
world are preparing electronic corpora of their own national or regional
variety of English. (from ICE's homepage) (TC,02/09/02)
The Cambridge-Leeds Corpus of Early Modern English A full-text corpus
of Early Modern English of 1600-1800. (TC,02/09/02)
COPC (The Century of Prose Corpus) A corpus of the Century of Prose
(1680-1780). (TC,02/09/02)
The Corpus of Early American English A corpus of early American English
of 1620-1720. (TC, 02/09/02)
CELT (The Corpus of Electronic
Texts) CELT, the Corpus of Electronic Texts, brings the wealth of Irish
literary and historical culture to the Internet, for the use and benefit of
everyone worldwide. It has a searchable online database consisting of
contemporary and historical texts from many areas, including literature and
the other arts. (from CELT's homepage) (TC,02/09/02)
The
Corpus of Late Modern English Prose(1992-1994) A corpus of informal
private letters by British writers, covering the period 1861 to 1919. All
decades in that range are represented, four by about 20,000 words of text
each. The decade 1880-89 has only about 6,000, 1890-99 about 13,000. However,
the range of dates by birth-date of writer is narrower: 1837-67. Corpus
constructed 1992-1994 by David Denison with the very considerable assistance
of Graeme Trousdale and Linda van Bergen. (from The Corpus of Late Modern
English Prose's homepage) (TC,02/09/02)
Helsinki
Corpus Diachronic Part(1984-1991) The Helsinki Corpus of English Texts:
Diachronic and Dialectal is a computerized collection of extracts of
continuous text. It is the result of a project commenced in 1984 and directed
by Matti Rissanen and Ossi Ihalainen at the University of Helsinki. The Corpus
contains a diachronic part covering the period from c. 750 to c. 1700 and a
dialect part based on transcripts of interviews with speakers of British rural
dialects from the 1970's. The aim of the Corpus is to promote and facilitate
the diachronic and dialectal study of English as well as to offer computerized
material to those interested in the development and varieties of language. The
material is intended for both mainframe and microcomputer use. (from MANUAL TO
THE DIACHRONIC PART OF THE HELSINKI CORPUS OF ENGLISH TEXTS, 1996)
(TC,02/09/02)
ARCHER (A Representative Corpus of Historical English Registers) A
corpus of written and spoken English of 1650-1990. (TC)
ZEN (The Zurich English Newspaper Corpus) A corpus of newspapers from
1671-1791. (TC,02/09/02)
ICAMET
(Innsbruck Computer Archive of Middle English)(1992-1999) The Prose
Corpus of ICAMET is a compilation of 129 texts (March 1999) of Middle English
prose, digitalized from extant editions and constantly enlarged by further
files. The corpus can, of course, well be used for (comparative) linguistic
analysis. But since it is a full-text database, it particularly aims at target
groups of users who, unlike those of the Helsinki Corpus, are not so much
interested in extracts of texts, but in their complete versions. The corpus
thus allows literary, historical and topical analyses of various kinds,
particularly studies of cultural history. As to language analysis, it invites
linguists to raise questions of style, rhetorics or narrative technique, for
which one would want a lengthier piece of text or even the complete text.
(from ICAMET's homepage) (TC,02/09/02)
Lampeter
Corpus The Lampeter Corpus of Early Modern English Tracts
is a collection of texts on various subject matter published between 1640 and
1740 - a time that is marked by the rise of mass publication, the development
of a public discourse in many areas of everyday life and, last but not least,
the standardisation of British English. (from Lampeter Corpus's homepage) (TC,
02/09/02)