Raw text of Nihon Keizai Sangyo, Kin'yu, Ryutsu Shimbun newspaper articles for 1994-2000. Purchase details can be viewed at http://www.nikkeish.co.jp/gengo/zenbun.htm.
Raw text of Japanese newspaper articles of Yomiuri Shimbun for 1987-2001. Approximately 110,000 articles per year for 1987-1997, 230,000 for 1998-2000, and 340,000 for 2001. Purchase details can be viewed at http://www.ndk.co.jp/yomiuri/.
Raw text of English newspaper articles of Yomiuri Shimbun for 1989-2001. Approximately 9,000 articles per year. Purchase details can be viewed at http://www.ndk.co.jp/yomiuri/.
A corpus of articles from STAGE, a newspaper for people with intellectual disabilities, published between 1996 and 2014. Two kinds of data are included: the text as is, and the text split at each period.
Morphologically analyzed data of MITI (Ministry of International Trade and Industry, Japan) white papers for 1993-1995, manually post-edited. Distribution of this corpus is now suspended.
Morphologically analyzed data of the Japan Electronics Industry Development Agency's annual report, survey report on the trend of natural language processing. Manually post-edited. Distribution of this corpus is now suspended.
Morphologically analyzed data of Iwanami Japanese Dictionary (5th edition) with index tags. Manually post-edited. Distribution of this corpus is now suspended.
Differential data of the results of morphological analysis of the CD-Mainichi Shimbun (all articles from 1991-1995). Distribution of this corpus is now suspended.
The linguistic data in the EDR Corpus was obtained by collecting a large number of example sentences and analyzing them at the morphological, syntactic, and semantic levels. The Japanese Corpus contains approximately 200,000 sentences. Ver. 4.0 was released in 2010.
Annotation.corpus
word segmentation, part-of-speech, syntax, word sense
Creator
Japan Electronic Dictionary Research Institute, Ltd., Japan
Contact person
National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
The linguistic data in the EDR Corpus was obtained by collecting a large number of example sentences and analyzing them at the morphological, syntactic, and semantic levels. The English Corpus contains approximately 120,000 sentences. Ver. 4.0 was released in 2010.
Annotation.corpus
word segmentation, part-of-speech, syntax, word sense
Creator
Japan Electronic Dictionary Research Institute, Ltd., Japan
Contact person
National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
Morphologically and syntactically annotated corpus of 40,000 sentences from Mainichi Shimbun newspaper articles for 1995. Of these, 5,000 sentences are also annotated with case, anaphora and coreference information. The annotation is manually post-edited. Due to copyright restrictions, users can obtain only the annotation data for free and are required to purchase ``Mainichi Shimbun 1995 CD-ROM'' to reconstruct the original corpus.
Annotation.corpus
word segmentation, part-of-speech, syntax, case, anaphora, coreference
Creator
Kurohashi and Kawahara laboratory, Kyoto University
Contact person
Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
Task-oriented dialog corpus of conversations between two people. It contains 80 minutes of video of 9 dialogs for two tasks, the `face task' and the `traveling task'. Transcriptions of the dialogs are also included and are annotated with tags for dialog structure, syntax, coreference, prosody and facial expression. You can obtain it through GSK.
Annotation.corpus
word segmentation, part-of-speech, syntax, dialog structure, coreference, prosody, facial expression
Creator
Japan Electronics and Information Technology Industries Association (JEITA)
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
Analyzed blog corpus consisting of 4,186 sentences from 249 articles on 4 themes (sightseeing in Kyoto, mobile phones, sports, gourmet). It is manually annotated with morphological, syntactic, case, ellipsis and opinion tags. (Distribution is now suspended.)
Annotation.corpus
word segmentation, part-of-speech, syntax, case, ellipsis, opinion information
Creator
Kyoto University, NTT Communication Science Laboratories
Contact person
Kurohashi and Kawahara Laboratory, Kyoto University
3,000 newspaper articles (about 37,000 sentences, 910,000 words) annotated with morphological information, syntactic structures and word senses. All annotations are manually revised. It is compiled in GDA (Global Document Annotation) format. The data contains only the annotations, not the original text. To restore the complete corpus including the text, the Mainichi Shimbun CD-ROM (1994) is required.
Annotation.corpus
word segmentation, part-of-speech, syntax, word sense, co-reference
This corpus consists of 56,000 lexical items in Iwanami Japanese Dictionary Fifth Edition. The headword, structure of the senses, and other information in the dictionary are described in XML.
Annotation.corpus
structure of the dictionary
Creator
Iwanami Shoten Publishers
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
A balanced corpus of randomly sampled texts of contemporary written Japanese. It consists of a publication-based sub-corpus (35 million words), a library-based sub-corpus (30 million words) and a non-sampling sub-corpus (35 million words). Part of the corpus is annotated with manually edited morphological tags.
Creator
National Institute for Japanese Language and Linguistics
Contact person
National Institute for Japanese Language and Linguistics (kotonoha(at)ninjal.ac.jp)
A database of spontaneous Japanese speech with annotations for speech processing research. It consists of spontaneous speech recordings (660 hours), their transcriptions (7 million words) and POS tags. The core data (45 hours, 500 thousand words) is additionally annotated with articulation and intonation labels.
Creator
National Institute for Japanese Language and Linguistics, National Institute of Information and Communications Technology, Tokyo Institute of Technology
Contact person
National Institute for Japanese Language and Linguistics
A Japanese corpus consisting of 40,000 sentences excerpted from Mainichi Shimbun 1995 articles, the same sentences as in the Kyoto Text Corpus, annotated with co-reference and predicate-argument relations. Only the annotations are publicly available.
Creator
Computational Linguistics Laboratory, Nara Institute of Science and Technology
Contact person
Computational Linguistics Laboratory, Nara Institute of Science and Technology
A corpus for `idiom identification task' (a task to judge if an expression is an idiom or has a literal meaning). Each example sentence is annotated with a label `idiom' or `literal meaning'. 1,000 example sentences are collected for one idiom.
Basic Japanese sentences automatically extracted on the basis of Kyoto University Case Frame data. It contains 5,304 manually modified sentences. It also contains manual translations of the Japanese basic sentences into English and Chinese.
It consists of 233 essays written by Japanese college students, all manually annotated with grammatical errors, POS tags, and phrase structures. It also includes the error detection/correction results produced by the error detection/correction systems at the Error Detection and Correction Workshop (EDCW2012).
Annotation.corpus
part-of-speech, syntax, error correction
Creator
Nagata Laboratory, Konan University and The Japan Institute for Educational Measurement Inc. (JIEM)
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
Subject.language
English
Date
2019/5
Rights
No commercial use. Research/Education purpose only.
Texts excerpted from simulated clinical records. Texts written by a doctor are annotated with age, symptom, hospital, location, person, date, etc. The expiration date for use is March 31st, 2016.
Creator
The Joint Use Conference for Electronic Health Care Education (JUCEE), Aramaki Laboratory (Center for Knowledge Structuring, The University of Tokyo)
The REX corpora consist of 6 multimodal corpora of referring expressions in collaborative puzzle solving dialogues. The corpora have two notable features: (1) they include time-aligned extra-linguistic information (dialogue speech, movies of puzzle solving processes, participants' mouse operations and eye-gaze) on top of linguistic information (transcribed utterances, referring expressions for puzzle pieces), and (2) the dialogues were collected with various configurations in terms of the puzzle type, hinting and language.
Creator
Tokyo Institute of Technology (Tokunaga Laboratory)
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members (for educational and research purposes). 220,000 JPY for members of GSK, 440,000 JPY for non-members (for commercial use; a contract is required).
It is a large bilingual scientific paper corpus consisting of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). ASPEC-JE was constructed from Japanese-English scientific paper abstracts, which are the property of JST. NICT automatically created the 1-to-1 sentence alignments. ASPEC-JC was constructed by manually translating Japanese scientific papers into Chinese.
Annotation.corpus
sentence alignment
Creator
The Japan Science and Technology Agency (JST), The National Institute of Information and Communications Technology (NICT)
The core data of ``Balanced Corpus of Contemporary Written Japanese (BCCWJ)'' (about 2,000 documents) and the collection of newspaper articles ``Mainichi Shimbun CD-ROM 1995'' (about 8,000 documents) are manually annotated with named entity tags defined in Sekine's extended named entity hierarchy. There are 43,000 entities (100,000 tokens) in BCCWJ and 60,000 entities (240,000 tokens) in the newspaper. It does not contain the text; only the annotation is provided to users.
Text data of all entries in Japanese Wikipedia annotated with extended named entities. 20,000 entries are manually annotated, while the rest of the entries are tagged by a machine learning method.
A collection of dialogs in which four people hold a discussion to reach a decision. Each transcribed utterance is annotated with topic tags. A summary of each discussion is also included.
Creator
Shimada Laboratory, Kyushu Institute of Technology
Contact person
Shimada Laboratory, Kyushu Institute of Technology
A corpus of free conversation between two people, where each utterance is annotated with a dialog act tag and a sympathy tag. Nine categories are used for dialog acts: Self-disclosure, Question(YesNo), Question(What), Response(YesNo), Response(Declaration), Backchannel, Filler, Confirmation and Request. Three categories are used for sympathy tags: Sympathy, Antipathy and Neutral. They indicate whether a speaker shows sympathy or antipathy toward the partner. The corpus contains 97 dialogs (chats) and 92,020 utterances.
Annotation.corpus
dialog act, sympathy
Creator
Shirai Laboratory, Japan Advanced Institute of Science and Technology (JAIST)
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
`BTSJ-Japanese Natural Conversation Corpus with Transcripts and Recordings, ver. 2020' compiles 377 spontaneous conversations. BTSJ is the abbreviation of the transcription rules named `Basic Transcription System for Japanese', developed for pragmatic analyses of Japanese interaction by considering the characteristics of the Japanese language and its interactional styles. Examples of these characteristics include sentence-final forms that each indicate a different level of politeness and the frequent occurrence of backchannels. BTSJ also transcribes phenomena such as fillers and silences, which are essential for pragmatic analyses.
Creator
National Institute for Japanese Language and Linguistics (USAMI Mayumi)
Contact person
BTSJ-Japanese Natural Conversation Corpus Committee (btsjcorpus(at)ninjal.ac.jp)
A collection of English learner essays annotated with feedback comments concerning writing skills and preposition use. The base essays are excerpts from ICNALE: International Corpus Network of Asian Learners. All feedback comments are manually annotated in Japanese. A translation dictionary (into English) is also included so that some (but not all) of the feedback comments can be translated into English.
Creator
Konan University, Dr. Shin'ichiro ISHIKAWA (Kobe University)
A list of topic labels for each utterance in the ``Natural Conversation Corpus''. There are about 100 topic labels. (The ``Natural Conversation Corpus'' must be obtained to reconstruct the complete corpus.)
Creator
J-TOCC research group (Leader Dr. Naoaki Nakamata, Kyoto University of Education)
Contact person
GSK (Gengo Shigen Kyokai)
Price
free (GSK member), 32,400 JPY (GSK non-member)
Subject.language
Japanese
Date
2020/7
Rights
Limited to academic, educational and development (non-commercial) use
A dataset including learner sentences with correction difficulty weights to evaluate grammatical error correction systems. One can readily evaluate any error correction system considering correction difficulty and also can visualize correction difficulty by using this dataset and Go-To-Scorer. The learner sentences are from the Konan-JIEM Learner Corpus 6th ed.
Creator
NLP and Computational Linguistics Lab., Konan University
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
Subject.language
English, Japanese
Date
2020/12
Rights
No commercial use. Limited to academic and educational pursuits.
A collection of English learner essays excerpted from ICNALE (International Corpus Network of Asian Learners) based on two criteria: free of grammatical errors and moderately scored. Each essay is annotated with its argument structure and with sentence rearrangements for improving the essay.
A collection of essays written by native speakers of Japanese for given assignments, together with the marks assigned by human evaluators. There are 1 to 3 essay assignments for each of 9 themes. The length of the essay is limited to 100 to 800 characters for each assignment. It consists of 4,800 marked essays.
It consists of 285 essays written by 192 learners of Japanese between 2007 and 2011. They were corrected by three teachers of Japanese, and 9,023 error tags are annotated in XML format. The error tags are categorized by span, content, and cause/background.
A corpus containing 1 source text and 40 student summaries manually segmented into Idea Units. The source text is an expository text with problem-solution structures embedded in it. The summaries were written by 40 undergraduate students at a university in Japan. They were all non-native speakers of English. The summaries were collected as part of an assignment for an academic writing course in which the students were asked to read the source text (391 words) and summarise its main ideas and key details in approximately 80 words.
Transcriptions of dialogues in both Japanese and English for the same conversations. It contains 4 different conversations covering 2 topics (registration for an international conference, conversation between a travel agency and a customer) and 2 media (telephone, keyboard), each on 1 CD-ROM.
Creator
Advanced Telecommunications Research Institute International, Japan
Contact person
Advanced Telecommunications Research Institute International, Japan
Text database of Kobun (Japanese ancient writings), Waka (31-syllable Japanese poems) and Kanbun (texts written in classical Chinese). It consists of approximately 50 texts.
Text corpus of Kodansha Japanese-English Dictionary, including 38,000 example Japanese sentences with English translations. Users are required to submit a license agreement form to National Institute of Advanced Industrial Science and Technology, Japan.
Collection of business and private e-mail messages. The messages were created through simulation by men and women aged 10-39 using specified mobile phones or PCs.
Creator
Straight Word Inc.
Publisher
Power Shift Inc.
Contact person
Power Shift Inc. (http://www.powershift.co.jp/company/form.html)
The Konan Kodomo corpus (KK corpus) consists of texts written by primary school students. Texts were collected from 66 students over a period of eight months.
Corpora and databases for learning Japanese developed by CASTEL/J. It contains books, white papers, movie scripts, a Kanji database, a Japanese-English dictionary database and other materials.
Creator
CASTEL/J
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
Japanese thesaurus developed for the machine translation system ALT-J/E. It comprises about 300,000 words, which are classified into 3,000 semantic classes. It also has a valency dictionary containing 14,000 Japanese subcategorization patterns with corresponding English patterns.
Creator
NTT Communication Science Laboratories
Publisher
Iwanami Shoten, Publishers
Contact person
NTT Communication Science Laboratories, Machine Translation Research Group (mt(at)cslab.kecl.ntt.co.jp)
The BioCaster public health ontology is based on a top-level SUMO taxonomy and covers 27 high priority infectious diseases including the pathogens that cause them, their symptoms, syndrome groupings etc. in six Asia-Pacific languages: Chinese (standard), English, Japanese, Korean, Thai, and Vietnamese. Term variants are also given for all terms. Links to major external resources such as MeSH, SNOMED CT and Wikipedia are included.
Creator
Nigel Collier research group, National Institute of Informatics
Contact person
Koichi Takeuchi (Okayama University, koichi(at)cl.it.okayama-u.ac.jp), Nigel Collier and Ai Kawazoe (National Institute of Informatics, collier(at)nii.ac.jp)
Price
Free
Subject.language
Chinese (standard), English, Japanese, Korean, Thai, Vietnamese
WordNet for Japanese. Japanese equivalents are given to synsets of the Princeton WordNet 3.0. It consists of 49,190 concepts (synsets), 85,966 words and 156,684 senses (synset-word pairs).
Creator
National Institute of Information and Communications Technology
A verb dictionary for natural language processing. It consists of 4,425 verbs with 7,473 senses. It contains hierarchical semantic categories, case frames and typical example sentences for each sense.
A Japanese lexicon comprising 861 verbs, 136 adjectives and 1,081 nouns, which are considered representative examples of Japanese words. Each entry includes information about semantics, morphology, grammatical categories, case frames and idiomatic usage. You can obtain it through GSK.
The EDR Electronic Dictionary is composed of 9 types of dictionaries (Japanese Word, English Word, Concept, Japanese Co-occurrence, English Co-occurrence, Japanese-English Bilingual, Japanese-Chinese Bilingual, English-Japanese Bilingual and Technical Terminology), as well as the EDR Corpus. Ver. 4.0 was released in 2010.
Creator
Japan Electronic Dictionary Research Institute, Ltd., Japan
Contact person
National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
The basic roles of the Word Dictionary include providing the relations between words and concepts related to each other, and providing grammatical attributes regarding these relationships. The Japanese Word Dictionary contains approximately 260,000 words. Ver. 4.0 was released in 2010.
Creator
Japan Electronic Dictionary Research Institute, Ltd., Japan
Contact person
National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
The basic roles of the Word Dictionary include providing the relations between words and concepts related to each other, and providing grammatical attributes regarding these relationships. The English Word Dictionary contains approximately 190,000 words. Ver. 4.0 was released in 2010.
Creator
Japan Electronic Dictionary Research Institute, Ltd., Japan
Contact person
National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
The main role of the Japanese-English Bilingual Dictionary is to describe the correspondence between a Japanese word and the concept it represents, and to provide the corresponding English word when it is used with the given meaning. It contains approximately 240,000 words. Ver. 4.0 was released in 2010.
Creator
Japan Electronic Dictionary Research Institute, Ltd., Japan
Contact person
National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
The main role of the Japanese-Chinese Bilingual Dictionary is to describe the correspondence between a Japanese word and the concept it represents, and to provide the corresponding Chinese word when it is used with the given meaning. It contains approximately 230,000 words. It was released in 2010.
Creator
National Institute of Information and Communications Technology
Contact person
National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
The main role of the English-Japanese Bilingual Dictionary is to describe the correspondence between an English word and the concept it represents, and to provide the corresponding Japanese word when it is used with the given meaning. It contains approximately 160,000 words. Ver. 4.0 was released in 2010.
Creator
Japan Electronic Dictionary Research Institute, Ltd., Japan
Contact person
National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
The Concept Dictionary contains information on the approximately 410,000 concepts listed in the Word Dictionary and is divided according to information type into the Headconcept Dictionary, the Concept Classification Dictionary, and the Concept Description Dictionary. The Headconcept Dictionary describes information on the concepts themselves. The Concept Classification Dictionary describes the super-sub relations among the approximately 410,000 concepts. The "super-sub" relation refers to the inclusion relation between concepts, and the set of interlinked concepts can be regarded as a type of thesaurus. The Concept Description Dictionary describes the semantic (binary) relations, such as 'agent,' 'implement,' and 'place,' between concepts that co-occur in a sentence. Ver. 4.0 was released in 2010.
Creator
Japan Electronic Dictionary Research Institute, Ltd., Japan
Contact person
National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
The Co-occurrence Dictionary describes collocational information in the form of binary relations. The Japanese Co-occurrence Dictionary contains approximately 930,000 phrases. Ver. 4.0 was released in 2010.
Creator
Japan Electronic Dictionary Research Institute, Ltd., Japan
Contact person
National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
The Co-occurrence Dictionary describes collocational information in the form of binary relations. The English Co-occurrence Dictionary contains approximately 460,000 phrases. Ver. 4.0 was released in 2010.
Creator
Japan Electronic Dictionary Research Institute, Ltd., Japan
Contact person
National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
The Technical Terms Dictionary contains technical terms in English and Japanese from the field of information processing. The Technical Terms Dictionary is composed of the following subdictionaries: the Japanese Technical Terms Dictionary, the English Technical Terms Dictionary, the Japanese-English Bilingual Dictionary of Technical Terms, the English-Japanese Bilingual Dictionary of Technical Terms, the Concept Dictionary of Technical Terms, the Japanese Technical Terms Co-occurrence Data, and the English Technical Terms Co-occurrence Data. It contains 119,000 Japanese words and 78,000 English words. Ver. 4.0 was released in 2010.
Creator
Japan Electronic Dictionary Research Institute, Ltd., Japan
Contact person
National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
Vocabulary of approximately 23,000 content words in 14 ancient Japanese writings such as ``Tsurezuregusa'' and ``Hojoki''. It also contains word frequencies.
The lexicon consists of 70,000 basic Malay words. POS, grammatical information and English translation are compiled for each word. Technical terms are also included.
Creator
Center for the International Cooperation for Computerization
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
The lexicon consists of 50,000 basic Indonesian words. POS, grammatical information and English translation are compiled for each word. Idioms, acronyms and technical terms are also included.
Creator
Center for the International Cooperation for Computerization
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
The lexicon consists of 50,000 basic Chinese words. Pronunciations and grammatical information are compiled for each word. Technical terms are also included.
Creator
Center for the International Cooperation for Computerization
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
The lexicon consists of 50,000 basic Thai words. English translations are compiled for each word. Collocation and technical terms are also included.
Creator
Center for the International Cooperation for Computerization
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
The lexicon consists of technical terms in Malay, Indonesian, Chinese and Thai. Technical terms about computers, electronics, engineering and related areas are included. Japanese translations, English translations, POS, pronunciations, classifiers, syntactic information, etc. are compiled for each word.
Creator
Center for the International Cooperation for Computerization
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
Subject.language
Malay, Indonesian, Chinese, Thai
Language
Malay, Indonesian, Chinese, Thai, English, Japanese
Database of Japanese compound functional expressions and their example usage. It contains 337 compound functional expressions and at most 50 examples per expression. Examples are excerpted from newspaper articles. Mainichi Shimbun CD-ROM 1995 is required to reconstruct the complete database.
Creator
Group MUST
Contact person
Group MUST (Suguru Matsuyoshi, Takehiro Utsuro, Satoshi Sato, Masatoshi Tsuchiya)
Japanese Semantic Pattern Dictionary -Compound and Complex Sentence Eds.-, which is a lexicon consisting of 227,000 translation patterns of Japanese and English, and its related documents and tools.
A lexicon of Japanese functional expressions (both functional words and compound words). It has a 9-level hierarchical structure. The number of functional expressions at the lowest level is 16,801.
Creator
Suguru Matsuyoshi, Satoshi Sato
Contact person
tsu90tsu80ji%sslab.nuee.nagoya-75u.ac.jp (remove all numbers, replace % with (at).)
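The de-obfuscation rule stated in the contact line above (remove all numbers, replace % with (at)) can be sketched as a short Python function. This is only an illustration: the function name `deobfuscate` and the sample address are hypothetical and are not part of the catalog.

```python
import re


def deobfuscate(address: str) -> str:
    """Apply the stated rule: drop every digit, then turn '%' into '(at)'."""
    without_digits = re.sub(r"[0-9]", "", address)
    return without_digits.replace("%", "(at)")


# Hypothetical obfuscated address of the same shape, not a real contact.
print(deobfuscate("us12er%ma34il.example.org"))
```

The result keeps the catalog's own `(at)` notation rather than inserting an `@` sign.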
A machine-readable dictionary for morphological analysis of Japanese. It can be used as a dictionary for ChaSen and MeCab, which are public Japanese morphological analyzers. It contains canonical forms, word forms, writing variants, speech variants and accents. It contains 15,000 words (canonical forms) as of July 2009.
Creator
DEN Yasuharu, YAMADA Atsushi, OGURA Hideki, KOISO Hanae, OGISO Toshinobu
A Japanese lexicon which is a successor of IPAdic. Parts-of-speech of all words except for proper nouns are rechecked. Variants and compound word structures are added. It is used for ChaSen and MeCab.
Creator
Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
Contact person
Computational Linguistics Laboratory, Nara Institute of Science and Technology
A large-scale Japanese case frame dictionary automatically obtained from 10 billion Web pages. A case frame is data about a predicate and its associated nouns. It contains case frames for 110,000 predicates. It is available only for GSK members.
Creator
Language Media Lab., Kyoto University, Japan
Contact person
Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
A list of basic Japanese words. It compiles 2,800 level A words and 3,000 level B words, about 5,800 words in total. Functional expressions and idioms are included in addition to single words.
Collection of 5,000 Japanese evaluation expressions (predicates) manually annotated with polarity tags. Four classes, which are combinations of positive/negative and subjective/objective, are used as polarity tags.
A large-scale lexicon of nominal case frames. A nominal case frame is the set of elements necessary for interpreting a noun. Nominal case frames are compiled for each sense of a noun. It covers 160,000 nouns and was automatically constructed from 1.6 billion Japanese sentences on the Web.
Creator
Kurohashi and Kawahara laboratory, Kyoto University
Contact person
Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
An electronic dictionary consisting of 8,544 evaluative expressions (word senses) with their polarity (positive or negative). It can be used to classify evaluative expressions from several points of view (tender emotion view, ethic view).
Creator
Center for Corpus Development, National Institute for Japanese Language and Linguistics
A set of three kinds of dictionaries. The dictionary of places is a collection of 117,075 place names (addresses) from across Japan. Compiled information on places includes pronunciations, Romanized transliterations, orthographic variants, latitude, longitude, etc. The dictionary of facilities is a collection of about 1,000 facilities, such as art galleries, museums, amusement parks, etc. Compiled information on facilities includes pronunciations, orthographic variants, latitude, longitude, etc. The Web dictionary of facilities is a collection of facilities excerpted from Japanese Wikipedia. Compiled information on facilities includes pronunciations, addresses, categories, etc. Latitude and longitude are also compiled for some of the facilities. The number of facilities in this dictionary is 32,419 (24,859 with precise latitude and longitude). This dictionary may contain errors since it is automatically compiled.
Creator
GSK (Gengo Shigen Kyokai)
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members (for educational and research purpose). 220,000 for members of GSK, 440,000 for non-members (for commercial use, contract is required).
A comprehensive database of Japanese multiword expressions, multiword units and formulaic language. It consists of 18 lexicons.
1. JMWEL_nominal v2.0
A Lexicon of Japanese idiomatic or collocational nominal phrases, e.g., makka-na-uso 真っ赤-な-嘘 “downright lie”, amai-mi-toosi 甘い-見-通し “over-optimistic prospect”. It includes about 23500 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (8000JPY)
A Lexicon of Japanese idiomatic or collocational verbal phrases which have a form other than the NOUN-PARTICLE(“ga”, “wo”, or “ni”)-VERB construction, e.g., tama-no-kosi-ni-noru 玉-の-輿-に-乗る “marry into money”, bake-no-kawa-ga-hageru 化け-の-皮-が-剥げる “expose one's true colors”. It includes about 13800 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (15000JPY)
A Lexicon of Japanese idiomatic or collocational adjective phrases such as ki-ga-chiisai 気-が-小さい “be timid”, kigen-ga-yoi 機嫌-が-良い “be cheerful”. It includes about 3700 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (11000JPY)
6. JMWEL_adjective verbal v2.0
A Lexicon of Japanese idiomatic or collocational adjective verbal phrases such as sensaku-zuki 詮索-好き “be inquisitive”, kingen-jicchoku 謹厳-実直 “be serious-minded”. It includes about 2600 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (7000JPY)
7. JMWEL_adverbial v2.0
A Lexicon of Japanese idiomatic or collocational adverbial phrases such as omoi-mo-yora-zu 思い-も-よら-ず “unexpectedly”, ki-wo-tuke-te 気-を-付け-て “carefully”. It includes about 16200 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (15000JPY)
8. JMWEL_adnominal v2.0
A Lexicon of Japanese idiomatic or collocational adnominal phrases such as yo-ni-iu 世-に-云う “so called”, suji-no-toot-ta 筋-の-通っ-た “reasonable”. It includes about 16500 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (15000JPY)
9. JMWEL_discourse marker v2.0
A Lexicon of Japanese idiomatic or collocational discourse-marking expressions or sentence connectives such as sou-ha-it-temo そう-は-言っ-ても “however”, odoroku-beki-koto-ni 驚く-べき-こと-に “astonishingly”. It includes about 1200 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (9000JPY)
10. JMWEL_post-predicative v2.0
A Lexicon of Japanese post-predicate multiword expressions such as beki-dat-ta-n-da-kedo べき-だっ-た-ん-だ-けど “… should have Vpp …”, te-itadake-mase-n-ka-ne て-頂け-ませ-ん-かね “Would you V …”, which give the information on tense, aspect, modality, polarity, mood, speaker's attitude to the proposition, etc. It includes about 4900 head entries each of which is given the information on its notational variants, morphological and syntactic function, syntactic structure, and semantic feature. (23000JPY)
11. JMWEL_postpositional v2.0
A Lexicon of Japanese postpositional multiword expressions such as ni-kansi-te に-関し-て “about”, wo-gisei-ni を-犠牲-に “at the expense of”, ta-ato-ni た-後-に “after”, which give the semantic relationship among noun phrases and predicative phrases in the sentence. It includes about 2700 head entries each of which is given the information on its notational variants, morphological and syntactic function, syntactic structure, and samples of usage. (16000JPY)
12. JMWEL_idiom v2.0
A Lexicon of Japanese idioms such as abura-wo-uru 油-を-売る “idle away one's time”, youryou-ga-ii 要領-が-良い “know how to swim with the tide”, me-to-hana-no-saki 目-と-鼻-の-先 “just a stone's throw away”. It includes about 4500 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (17000JPY)
13. JMWEL_proverb saying cliche v2.0
A Lexicon of Japanese proverbs, sayings, cliches such as kame-no-kou-yori-tosi-no-kou 亀-の-甲-より-年-の-功 “Age and experience teach wisdom”, kabe-ni-mimi-ari 壁-に-耳-あり “Walls have ears”, gou-ni-it-te-ha-gou-ni-sitagae 郷-に-入っ-て-は-郷-に-従え “When you are in Rome, do as the Romans do”. It includes about 4000 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (9000JPY)
14. JMWEL_onomatopoeic v2.0
A Lexicon of Japanese onomatopoeic expressions such as kachikachi-ni-kooru カチカチ-に-凍る “get frozen solid”, buruburu-furueru ブルブル-震える “tremble”. It includes about 13000 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure. (20000JPY)
15. JMWEL_four character word v2.0
A Lexicon of Japanese four character words such as sessa-takuma 切磋-琢磨 “improving each other through friendly rivalry”, isseki-nichou 一石-二鳥 “killing two birds with one stone”. It includes about 3500 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure. Some head entries are given meaning explanations. (8000JPY)
16. JMWEL_incomplete phrase v2.0
A Lexicon of Japanese incomplete phrases which are commonly used in daily life, such as neko-ni-koban 猫-に-小判 “casting pearls before swine”, yamai-ha-ki-kara 病-は-気-から “worry often causes illness”. It includes about 470 head entries each of which is given the information on its notational variants, type of the incompleteness, morphological structure, syntactic function and structure. (5000JPY)
17. JMWEL_cranberry v2.0
A Lexicon of Japanese cranberry expressions which include cranberry-type morphs as substrings. For example, shigami-tuku しがみ-付く “cling to” and usiro-metai 後ろ-めたい “feel guilty” are included as head entries, because shigami and metai, respectively, are regarded as cranberry-type morphs. It includes about 180 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure. (3000JPY)
A Lexicon of Japanese calling, responding, greeting, monologue or interjective expressions, such as ara-maa あら-まあ, uso ウソ, arigatou 有難う, o-tsukare-sama お-疲れ-様. It includes about 1050 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and the semantic feature vector. (18000JPY)
A customized system dictionary for the morphological analyzer MeCab. It includes many neologisms (new words) extracted from a variety of language resources on the Web. When analyzing Web documents, it is better to use this system dictionary together with the default one (ipadic).
A list of 6,380 morphemes that make up 7,192 compound words, all of which are technical terms found in medical records. Each morpheme is annotated with one of 80 semantic medical labels, its frequency at the beginning, middle and end positions within the 7,192 compound words, and its pronunciation.
Creator
JP18H03499 research group (leader: Dr. Kaoru Aira, Seinan Jo Gakuin University)
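The positional frequencies mentioned in this entry (how often a morpheme appears at the beginning, middle or end of a compound) can be illustrated with a minimal sketch. The function name `position_frequencies`, the toy segmentations, and the choice to count a single-morpheme word as "beginning" are assumptions for illustration only; they are not taken from the resource itself.

```python
from collections import defaultdict

def position_frequencies(compounds):
    """compounds: list of morpheme lists, one list per compound word.
    Returns {morpheme: {"beginning": n, "middle": n, "end": n}}."""
    freq = defaultdict(lambda: {"beginning": 0, "middle": 0, "end": 0})
    for morphemes in compounds:
        last = len(morphemes) - 1
        for i, m in enumerate(morphemes):
            if i == 0:
                # Assumption: a one-morpheme word counts as "beginning".
                freq[m]["beginning"] += 1
            elif i == last:
                freq[m]["end"] += 1
            else:
                freq[m]["middle"] += 1
    return dict(freq)

# Hypothetical segmentations, not entries from the actual list.
toy_compounds = [["急性", "心筋", "梗塞"], ["心筋", "炎"], ["慢性", "心", "不全"]]
toy_freq = position_frequencies(toy_compounds)
```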
Reports on studies on the vocabulary of high and middle school textbooks in 1974 and 1980 in Japan. The vocabulary list is also included in ``Vocabulary Survey of Broadcasts CD-ROM''.
Vocabulary of TV programs and commercials broadcast in April-June 1989 (26,000 words). It also includes the vocabulary list of ``Studies on the Vocabulary of High and Middle School Textbooks''.
Word collocation database consisting of 1,160,000 entries, such as triples of a verb, its case filler noun and its case marker, extracted from Japanese newspaper articles.
Scenarios of radio dramas broadcast by NHK (Japan Broadcasting Corporation) from 1936 to 1955. They were written by Masaru Kobayashi. ISBN 4-89476-222-6
Collection of Japanese translations for 15,000 English sentences excerpted from books, magazines and pamphlets about science and technology. ISBN 4-621-04991-7
N-grams obtained from open Web pages written in Japanese and crawled by Google. The data contains 1-grams to 7-grams that occur 20 times or more in 20 billion sentences.
Creator
Google Inc.
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
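The frequency cutoff described for this N-gram data (keep only n-grams that occur at least 20 times) can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the data: the function name `count_ngrams`, the toy pre-segmented sentences, and the lowered threshold are all assumptions.

```python
from collections import Counter

def count_ngrams(sentences, max_n=7, min_count=20):
    """Count 1-grams to max_n-grams over tokenized sentences and keep
    only those that occur at least min_count times."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy, already segmented input (hypothetical); threshold lowered to 2.
toy_sentences = [["今日", "は", "晴れ"], ["今日", "は", "雨"], ["明日", "は", "晴れ"]]
kept = count_ngrams(toy_sentences, max_n=3, min_count=2)
```

On real Web-scale text the counts would not fit in a single in-memory dictionary; the sketch only shows the counting and thresholding logic.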
A data collection for evaluating recognition of textual entailment (RTE). It consists of 2,700 sets, each annotated with a four-level score for the degree of entailment. Each set is classified into one of 5 categories: inclusion, word (noun), word (predicate), syntax and inference.
Creator
Kurohashi and Kawahara laboratory, Kyoto University
Contact person
Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
N-grams (1-gram to 3-gram) of Japanese words obtained from BBS and blog texts crawled by Baidu. N-grams are provided for every month from January 2000 to July 2010.
Various data in Rakuten, Inc. (1) ``Rakuten Ichiba'' All product data (Approx. 50 million items). (2) ``Rakuten Travel'' Facility data (11,468 facilities), review data (350,000 reviews, 340,000 evaluations). (3) ``Rakuten GORA'' (Rakuten's golf service) Facility data (1,669 facilities), review data (320,000 reviews). Data is available via NII or ALAGIN.
A multimodal corpus consisting of speech (English, Japanese) and its interpretation (Japanese, English) in press conferences coupled with movie, speech and their transcription. The original speech and its interpretation are transcribed by the automatic speech recognition system. The average duration of press conferences is an hour. The data of each conference includes the opening speech and question answering. The number of conferences is 79 (71 of simultaneous interpreting, 8 of consecutive interpreting).
Creator
JNPC corpus team
Contact person
GSK (Gengo Shigen Kyokai)
Price
22,000 JPY for personal members of GSK, 44,000 JPY for personal non-members, 44,000 JPY for organization members, 88,000 JPY for organization non-members
A list of n-grams with their frequencies, extracted from the Japanese Web corpus (25.8 billion words) developed by the National Institute for Japanese Language and Linguistics. It contains 1-grams to 3-grams of characters, 1-grams to 6-grams of words, and 1-grams with morphological information. They are obtained by morphological analysis using MeCab-0.996 and UniDic-2.1.2.
Creator
Center for Corpus Development, National Institute for Japanese Language and Linguistics
Contact person
GSK (Gengo Shigen Kyokai)
Price
11,000 JPY for personal members of GSK, 22,000 JPY for organization members (non-commercial use) of GSK, 44,000 JPY for organization members (commercial use) of GSK, 22,000 JPY for personal non-members of GSK, 33,000 JPY for organization non-members (non-commercial use) of GSK, 66,000 JPY for organization non-members (commercial use) of GSK.
Word embeddings trained on the Japanese Web corpus (25.8 billion words) developed by the National Institute for Japanese Language and Linguistics. The set consists of 200-dimensional CBOW (word2vec), 200-dimensional skip-gram (fastText), 300-dimensional CBOW (fastText) and 300-dimensional skip-gram (fastText) embeddings. Words are segmented by morphological analysis using MeCab-0.996 and UniDic-2.1.2.
Creator
Center for Corpus Development, National Institute for Japanese Language and Linguistics
Contact person
GSK (Gengo Shigen Kyokai)
Price
11,000 JPY for personal members of GSK, 22,000 JPY for organization members (non-commercial use) of GSK, 44,000 JPY for organization members (commercial use) of GSK, 22,000 JPY for personal non-members of GSK, 33,000 JPY for organization non-members (non-commercial use) of GSK, 66,000 JPY for organization non-members (commercial use) of GSK.
A BERT model pre-trained on the Japanese Web corpus developed by the National Institute for Japanese Language and Linguistics. Sentences of more than 5 words, 22.6 billion words in total, are used for pre-training. The vocabulary is 48,914 lemmas, each of which is either a function word in UniDic or a content word found in both UniDic and the Bunrui-Goi-Hyo thesaurus. Words are segmented by morphological analysis using MeCab-0.996 and UniDic-2.1.2.
Creator
Center for Corpus Development, National Institute for Japanese Language and Linguistics
Contact person
GSK (Gengo Shigen Kyokai)
Price
11,000 JPY for personal members of GSK, 22,000 JPY for organization members (non-commercial use) of GSK, 44,000 JPY for organization members (commercial use) of GSK, 22,000 JPY for personal non-members of GSK, 33,000 JPY for organization non-members (non-commercial use) of GSK, 66,000 JPY for organization non-members (commercial use) of GSK.
5 sets of speech database of simulated conversation between a travel agency and a customer. 892 conversations in Japanese, and 618 in Japanese and English. Transcription and morphological annotation are also available.
Annotation.corpus
word segmentation, part-of-speech
Creator
Advanced Telecommunications Research Institute International, Japan
Contact person
Advanced Telecommunications Research Institute International, Japan
Speech database containing the following three contents: (a) ATR 503 phonetic balanced sentences (read speech) uttered by 64 speakers (30 males and 34 females), total 9,600 sentences. (b) Various guide task sentences (read speech) uttered by 36 speakers (18 males and 18 females), total 12,474 sentences. (c) Simulated 37 dialogues with transcribed texts uttered by 37 speakers (29 males and 8 females).
Creator
The Acoustical Society of Japan
Contact person
Nishigaki, Shigeo (AI and Fuzzy Promotion Center, Japan Information Processing Development Center (JIPDEC), 3-5-8 Shibakoen, Minatoku, Tokyo 105, JAPAN, TEL. +81-3-3432-9390, FAX. +81-3-3431-4324)
JNAS contains speech recordings and their orthographic transcriptions of 306 speakers (153 males and females each) reading excerpts from the Mainichi Newspaper and the ATR 503 PB-Sentences. All utterances and sentences are in the Japanese language.
A corpus of dialogues between a system and a human in a town guidance task, recorded with the Wizard of Oz method. It can be used for analysis of turn taking, head nodding, interruption, replies to interruption and so on. It consists of 162 dialogues by 33 speakers, totaling more than 1000 minutes. Speech data, pitch patterns, transcriptions, tags marking the beginnings and endings of dialogues, and semantic representations of utterances are included.
Creator
Advanced Industrial Science and Technology (AIST)
Contact person
GSK (Gengo Shigen Kyokai)
Price
33,000 JPY for personal members of GSK, 66,000 JPY for personal non-members, 66,000 JPY for organization members, 132,000 JPY for organization non-members
Speech data of phonetically balanced word sets uttered by 10 male speakers. WD-I consists of 492 words, while WD-II consists of 1,542 words. WD-I is a subset of WD-II.
This corpus is composed of 323 items with 4 repetitions for each item, including 110 monosyllables, 178 isolated words and 35 4-digit sequences. The total data amounts to 120 hours, contained on 76 DAT cassettes. Each item is uttered by 75 male and 75 female speakers. Speakers range in age from 20 to 60. The total number of samples is 193,800.
Creator
Japan Electronics Industry Development Association
Contact person
Sasaki (Sunrise Music Incorporated, Roppongi Fuji Bldg. 4F, 4-11-10 Roppongi, Minato-ku, Tokyo, 106 , Japan, Tel: +81-3-3408-6541, Fax: +81-3-3408-1505 )
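The total sample count stated for this corpus follows directly from the other figures in its description. A short check, with variable names chosen here for illustration:

```python
# Figures taken from the corpus description above.
items = 110 + 178 + 35        # monosyllables + isolated words + 4-digit sequences
repetitions = 4               # each item is repeated 4 times
speakers = 75 + 75            # male + female speakers

total_samples = items * repetitions * speakers
print(items, total_samples)   # prints: 323 193800
```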
Speech Database of Japanese dialects. It is only distributed to universities and national research institutes.
Creator
Tahara, Hiroshi (Osaka Shoin Women's Univ., Japan), Egawa, Kiyoshi (National Institute for Japanese Language, Japan)
Contributor
Grant-in-Aid for Scientific research on Priority Areas on ``Spoken Japanese'', provided by MEXT (Ministry of Education, Culture, Sports, Science and Technology, Japan)
Contact person
Tahara, Hiroshi (Osaka Shoin Women's Univ., Tel. +81-6-723-8181, Fax. +81-6-723-8881), Egawa, Kiyoshi (National Institute for Japanese Language, Tel. +81-3-3900-3111, Fax. +81-3-3906-3530)
This corpus consists of speech and transcriptions of 93 dialogues. (``Juten Ryoiki Kenkyu'' means grant-in-aid for scientific research on priority areas, the name of the fund from which this corpus was supported)
Creator
Doshita, Shuji
Contributor
Grant-in-Aid for Scientific research on Priority Areas on ``Understanding and Generating Dialogue by Integrated Processing of Speech, Language and Concept'' provided by MEXT (Ministry of Education, Culture, Sports, Science and Technology, Japan)
Contact person
Media Drive Co., Ltd. (juten-corpus(at)mediadrive.co.jp)
Speech and transcriptions of 28 dialogues in the task ``plan of travel abroad'' and 28 in ``purchase of a car''. Distribution of this corpus is now suspended.
A speech corpus consisting of interviews with 80 people: 60 elderly people (60-79 years old) and 20 control subjects (20-59 years old). A sample transcription of a part of the answers is also included.
Creator
Social computing laboratory, Nara Institute of Science and Technology
A morphological analyzer for Japanese that uses a Recurrent Neural Network Language Model (RNNLM). Its accuracy is much improved compared with JUMAN or MeCab because it takes the semantic fluency of the word sequence into account. The formats of the grammar, lexicon, output and so on are inherited from JUMAN.
Creator
Kurohashi and Kawahara laboratory, Kyoto University
Contact person
Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
ChaSen is a FREE Japanese morphological analyzer. It grew out of the development of JUMAN version 2.0 and significantly improves system performance. ChaSen version 1.0 was officially released on 19 February 1997 by the Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST). The latest version is 2.2.9, released on 8 February 2002.
Creator
Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
Contact person
Computational Linguistics Laboratory, Nara Institute of Science and Technology (chasen(at)is.aist-nara.ac.jp)
A morphological analyzer for Japanese. It was customized to rapidly output only the most appropriate result. It also identifies unknown words with a simple scheme.
A general toolkit developed for analyzing text, with a focus on Japanese, Chinese and other languages requiring word or morpheme segmentation. It is able to perform word segmentation, pronunciation tagging and POS tagging. Users can also train models by themselves.
A syntactic analyzer for Japanese. KNP first identifies ``Bunsetsu''(a chunk of words) boundaries for an input sentence, then analyzes dependencies between them. The latest version is 4.0 (in January 2012).
Creator
Kurohashi and Kawahara laboratory, Kyoto University
Contact person
Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
This tool kit contains a morphological and syntactic LR parser (MSLR parser) and some related tools. MSLR parser is a tool for simultaneous analysis of syntactic and morphological form. A grammar and a dictionary for analysis on Japanese are included in the package. Furthermore, users can replace them with their own grammars and dictionaries.
Creator
Tokyo Institute of Technology, Japan
Contact person
Tokunaga Laboratory, Tokyo Institute of Technology (mslr(at)cl.cs.titech.ac.jp)
It generates a syntactic analyzer, a Prolog program based on a bottom-up chart algorithm, from a definite clause grammar (DCG). It requires SICStus Prolog.
Creator
Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
Contact person
Computational Linguistics Laboratory, Nara Institute of Science and Technology (nlt(at)is.aist-nara.ac.jp)
It generates a syntactic analyzer, a Prolog program based on a left corner parsing method, from a definite clause grammar (DCG). It requires SICStus Prolog.
Creator
Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
Contact person
Computational Linguistics Laboratory, Nara Institute of Science and Technology (nlt(at)is.aist-nara.ac.jp)
assistant tool for constructing POS-tagged corpora
Description
An assistant tool for constructing POS-tagged corpora. It is a GUI tool that graphically displays the output of a morphological analysis tool and allows human annotators to modify it.
Creator
Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
Contact person
Computational Linguistics Laboratory, Nara Institute of Science and Technology (chasen(at)cl.aist-nara.ac.jp)
FuuTag is an annotation tool for SGML text. In its default configuration it is based on Sekine's extended Named Entity hierarchy. Tag descriptions can be customized with a config file.
A browser-based linguistic annotation tool for PDF documents. It offers functions for various types of linguistic annotations, including part-of-speech, named entity, dependency relation, and coreference chain.
Minise is a compact search engine supporting basic features. Minise performs full-text search queries using several types of indexes. It supports sequential search, inverted file index, character N-gram (q-gram), and suffix array. Minise is intended for small to medium-sized document sets (e.g. 200000 documents), for academic and research purposes.
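Of the index types listed for Minise, the suffix array is perhaps the least familiar. The following is a minimal sketch of the general technique, not Minise's own implementation: the function names are hypothetical, and the naive construction by sorting all suffixes is only suitable for small inputs.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order
    (naive construction by sorting the suffixes directly)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return the sorted start positions at which pattern occurs in text,
    using two binary searches over the suffix array sa."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "abracadabra"
sa = build_suffix_array(text)
positions = find_occurrences(text, sa, "abra")
```

Because all suffixes that begin with the pattern are adjacent in the sorted order, a query costs two binary searches regardless of how many documents the text was built from.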
A library for a compact trie data structure. It requires 1/4 - 1/10 of the memory usage compared to the previous implementations, and can therefore handle quite a large number of keys (e.g. 1 billion) efficiently.
A simple library for fast approximate string retrieval. It can find strings in a database whose similarity with a query string is no smaller than a threshold. It is applicable for spelling correction, flexible dictionary matching, duplicate detection and so on.
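The kind of query this entry describes (return every database string whose similarity to the query is at least a threshold) can be sketched with a brute-force scan. This is an illustration under stated assumptions, not the library's algorithm or API: the function names are hypothetical, and character-bigram Jaccard similarity is one choice of measure among several.

```python
def char_ngrams(s, n=2):
    """Return the set of character n-grams of s, padded with boundary marks."""
    padded = "$" * (n - 1) + s + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def retrieve(query, database, threshold=0.5, n=2):
    """Return the database strings whose n-gram Jaccard similarity with
    query is at least threshold (linear scan, for illustration only)."""
    q = char_ngrams(query, n)
    return [s for s in database if jaccard(q, char_ngrams(s, n)) >= threshold]

db = ["morphology", "morpheme", "syntax"]
hits = retrieve("morfology", db, threshold=0.5)
```

A linear scan like this is slow on large databases; a fast retrieval library would use an index over the n-grams instead of comparing the query with every entry.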
A concordancer for languages with little inflection, such as Japanese. Users can search for word sequence patterns in a corpus with a dictionary. It is distributed under the CeCILL (free) license.
A simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. It is designed for general-purpose use and can be applied to a variety of NLP tasks.
A library for learning hidden Markov models by using Online EM algorithm. This library is specialized for large scale data; e.g. 1 million words. The output includes parameters, and estimation results.
A library of online-learning algorithms (Perceptron, Averaged Perceptron, Passive Aggressive, ALMA, Confidence Weighted Linear-Classification), specialized for large-scale but sparse learning tasks such as Natural Language Processing tasks. While these algorithms are very efficient in terms of speed and space (linear in the number of examples and features), their performance is comparable to that of batch-style learning methods such as SVMs and MEs. It provides a C++ library and stand-alone programs for learning and predicting.
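As an illustration of one of the algorithms named in this entry, the following is a minimal sketch of a binary averaged perceptron over sparse feature dictionaries. It is not the library's C++ interface: the function names and toy data are hypothetical, and the averaging is done naively by summing the full weight vector after every example rather than with the usual lazy-update trick.

```python
from collections import defaultdict

def train_averaged_perceptron(examples, epochs=5):
    """examples: list of (features, label) pairs, where features maps a
    feature name to a value and label is +1 or -1.
    Returns the averaged weight vector as a plain dict."""
    weights = defaultdict(float)   # current weights
    totals = defaultdict(float)    # sum of the weights over all steps
    steps = 0
    for _ in range(epochs):
        for features, label in examples:
            score = sum(weights[f] * v for f, v in features.items())
            if label * score <= 0:             # mistake: online update
                for f, v in features.items():
                    weights[f] += label * v
            for f, w in weights.items():       # naive accumulation
                totals[f] += w
            steps += 1
    return {f: t / steps for f, t in totals.items()}

def predict(weights, features):
    """Return +1 or -1 according to the sign of the linear score."""
    score = sum(weights.get(f, 0.0) * v for f, v in features.items())
    return 1 if score > 0 else -1

# Hypothetical toy data: sparse bag-of-words features with +1/-1 labels.
toy_data = [
    ({"good": 1.0}, 1),
    ({"bad": 1.0}, -1),
    ({"good": 1.0, "very": 1.0}, 1),
    ({"bad": 1.0, "very": 1.0}, -1),
]
model = train_averaged_perceptron(toy_data)
```

Each example is processed once per epoch and only its active features are touched in the update, which is what makes this family of algorithms suitable for sparse data.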
A collection of machine-learning algorithms for classification. Currently, it supports L1/L2-regularized logistic regression (aka. Maximum Entropy), L1/L2-regularized L1-loss linear-kernel Support Vector Machine (SVM) and Averaged Perceptron.
A tool to solve a combinatorial optimization problem similar to knapsack problem. For example, it can be used for multiple documents summarization, i.e. to choose a small number of important sentences (extracts) in a given set of source documents.
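The summarization use case mentioned here can be illustrated with the plain 0/1 knapsack problem: choose sentences that maximize a total importance score while staying within a length budget. This is a sketch of the textbook dynamic program, not the tool's own formulation, which is described only as similar to the knapsack problem. The function name, the tuple layout and the toy scores are assumptions.

```python
def select_sentences(sentences, budget):
    """sentences: list of (text, length, score) with integer lengths.
    Returns the sorted indices of the sentences that maximize the total
    score with total length <= budget (0/1 knapsack, dynamic programming)."""
    n = len(sentences)
    # best[i][b]: best score using the first i sentences within length b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (_, length, score) in enumerate(sentences, start=1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if length <= b:
                candidate = best[i - 1][b - length] + score
                if candidate > best[i][b]:
                    best[i][b] = candidate
    # Backtrack to recover which sentences were chosen.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= sentences[i - 1][1]
    return sorted(chosen)

# Hypothetical sentences: (text, length, importance score).
toy_sentences = [("A", 5, 3.0), ("B", 4, 2.5), ("C", 3, 2.0), ("D", 6, 4.0)]
summary = select_sentences(toy_sentences, budget=9)
```

With a budget of 9, the sketch picks sentences C and D (total length 9, score 6.0) over A and B (total length 9, score 5.5).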
A tool to extract technical terms from documents. Terms are extracted in 3 steps: (1) word segmentation by morphological analyzers, (2) identification of compound words, (3) calculation of a significance score. The target languages are Japanese and English. A Web service called `Gensen-Web' is also provided.