Building the Kamus Besar Bahasa Indonesia (KBBI) Database and Its Applications

David Moeljadi1, Ian Kamajaya2, Dora Amalia3

1 Nanyang Technological University, Singapore
2 ASTrio Pte Ltd, Singapore
3 Badan Pengembangan dan Pembinaan Bahasa, Indonesia

The 11th International Conference of the Asian Association for Lexicography, Center for Linguistics and Applied Linguistics, Guangdong University of Foreign Studies

10 June 2017

Moeljadi et al. (ASIALEX 2017)

KBBI Database

10 June 2017

Outline

1. Kamus Besar Bahasa Indonesia (KBBI)
2. Cleaning-up, conversion, and database creation
3. The current state of KBBI database and its applications
4. Conclusion and future work

Kamus Besar Bahasa Indonesia (KBBI)

The official dictionary of the Indonesian language, published by Badan Pengembangan dan Pembinaan Bahasa (The Language Development and Cultivation Agency), or Badan Bahasa, under the Ministry of Education and Culture, Republic of Indonesia
The KBBI Fourth Edition [9] data was in Excel and Word files
The KBBI database was built in 2016

The Indonesian language

bahasa Indonesia “the language of Indonesia”
The sole official and national language of the Republic of Indonesia, the common language for hundreds of ethnic groups in Indonesia [1]
L1 speakers: around 43 million [6]
L2 speakers: more than 156 million (2010 census data)
Latin script
Morphologically mildly agglutinative: prefixes, suffixes, … [8]

The Online KBBI before October 2016

The data came from KBBI III and supported only simple searches by headword
The search results appeared in exactly the same format as in the printed dictionary
The data structure was not identified; there was no database

Types of lexical resources (Lim et al. 2016)

[Figure: types of lexical resources, based on digital readiness [7]]

Dictionary entries in KBBI (1)

Dictionary entries in KBBI (2) (homonymous entry)

Dictionary entries in KBBI (3) (proverbs and idioms)

Dictionary entries in KBBI (4) (cross-references)

From KBBI IV to KBBI V

Word and Excel files

From Word and Excel to Rich Text Format (rtf)

From rtf to HyperText Markup Language (html)

KBBI Cleaner

Using Python…

The data was broken down into lemmas, sublemmas (derived words, compounds, proverbs, and idioms), labels, pronunciations, definitions, examples, scientific names, and chemical formulas using regular expressions.
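As a rough illustration of this step, the sketch below splits one entry line into a few of the fields listed above with Python's `re` module. The sample entry, its layout (headword, pronunciation between slashes, a short label, numbered definitions with an optional example after a colon), and the names `ENTRY_RE`, `SENSE_RE`, and `parse_entry` are all assumptions made for the example. The real KBBI markup and the patterns actually used are not shown in these slides.

```python
import re

# Hypothetical one-line entry; NOT taken from KBBI. Assumed layout:
# headword /pronunciation/ label 1 definition: example; 2 definition
SAMPLE = "contoh /con·toh/ n 1 definisi pertama: kalimat contoh; 2 definisi kedua"

# Headword, optional pronunciation between slashes, a short label, then the rest.
ENTRY_RE = re.compile(
    r"^(?P<lemma>\S+)\s+"
    r"(?:/(?P<pron>[^/]+)/\s+)?"
    r"(?P<label>[a-z]+)\s+"
    r"(?P<body>.+)$"
)
# One numbered sense: number, definition text, optional example after a colon.
SENSE_RE = re.compile(r"(\d+)\s+([^:;]+)(?::\s*([^;]+))?")


def parse_entry(line):
    """Split an entry line into lemma, pronunciation, label, and senses."""
    match = ENTRY_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised entry layout: {line!r}")
    senses = [
        {
            "number": int(num),
            "definition": definition.strip(),
            "example": example.strip() if example else None,
        }
        for num, definition, example in SENSE_RE.findall(match["body"])
    ]
    return {
        "lemma": match["lemma"],
        "pronunciation": match["pron"],
        "label": match["label"],
        "senses": senses,
    }


entry = parse_entry(SAMPLE)
print(entry)
```

Each parsed field can then be written to its own column or table, which is what makes label-based and phrase-based lookups possible later on.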

Regular expression: a language for specifying text search strings. A search requires a pattern that we want to search for and a corpus of texts to search through [5].
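To make the pattern-plus-corpus idea concrete, here is a minimal sketch that searches a tiny made-up corpus for strings that look like chemical formulas. The corpus lines are invented stand-ins, not KBBI text, and `FORMULA_RE` is a deliberately crude, hypothetical pattern rather than the one used for the database.

```python
import re

# Made-up definition fragments standing in for the corpus; not KBBI text.
corpus = [
    "garam dapur (NaCl)",
    "air murni (H2O)",
    "gas asam arang (CO2)",
    "tanpa rumus kimia",
]

# Pattern: two or more element-like symbols in a row, each a capital letter,
# an optional lowercase letter, and optional digits (e.g. "NaCl", "H2O").
FORMULA_RE = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")

for text in corpus:
    print(text, "->", FORMULA_RE.findall(text))

# All matches across the corpus, in order of appearance.
found = [m for text in corpus for m in FORMULA_RE.findall(text)]
```

A pattern this loose would also match capitalised abbreviations, so a real clean-up pass would need stricter rules or a manual check.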

KBBI Database

SQLite (www.sqlite.org)

The current state of the KBBI Database

(as of 6 June 2017)
Headwords: 48,141
Derived words: 26,198
Compounds: 30,374
Proverbs: 2,039
Idioms: 268
Entries (total): 108,239
Definitions: 126,642
Examples: 29,260

What can we get from KBBI Database? I

1. More specific and targeted word lookups, e.g.

▶ looking up phrases and multiword expressions (MWEs) such as compound words, idioms, and proverbs, as well as derived words
SELECT entri, jenis, makna FROM baseview WHERE entri = 'sedia payung sebelum hujan';

▶ looking up entries by their labels (part-of-speech, language, and domain labels)
SELECT entri, ragam, bahasa, makna FROM baseview WHERE ragam = 'ark' AND bahasa = 'Jw';
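The two lookups can be tried from Python with the standard `sqlite3` module. The sketch below is not the real database: it builds an in-memory table named `baseview` with only the columns that appear in the queries, and fills it with placeholder rows. The `jenis` values, the entries other than the proverb, and the definition text are assumptions for illustration, as are the helper names `lookup_phrase` and `lookup_labels`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the `baseview` view; only the columns named in the queries.
conn.execute(
    "CREATE TABLE baseview "
    "(entri TEXT, jenis TEXT, ragam TEXT, bahasa TEXT, makna TEXT)"
)
# Placeholder rows: label values and definition text are illustrative only.
conn.executemany(
    "INSERT INTO baseview VALUES (?, ?, ?, ?, ?)",
    [
        ("sedia payung sebelum hujan", "peribahasa", None, None, "(definition text)"),
        ("entri contoh", "dasar", "ark", "Jw", "(definition text)"),
        ("entri lain", "dasar", "ark", None, "(definition text)"),
    ],
)


def lookup_phrase(phrase):
    """Exact-match lookup of a phrase or multiword expression."""
    return conn.execute(
        "SELECT entri, jenis, makna FROM baseview WHERE entri = ?",
        (phrase,),
    ).fetchall()


def lookup_labels(ragam, bahasa):
    """Entries that carry both of the given label values."""
    return conn.execute(
        "SELECT entri, ragam, bahasa, makna FROM baseview "
        "WHERE ragam = ? AND bahasa = ?",
        (ragam, bahasa),
    ).fetchall()


print(lookup_phrase("sedia payung sebelum hujan"))
print(lookup_labels("ark", "Jw"))
```

Passing the search values as `?` parameters instead of pasting them into the SQL string sidesteps quoting problems when an entry itself contains a quote character.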

What can we get from KBBI Database? II

2. Lexicography analysis

▶ extracting the most frequent words in the definition sentences → can be used as a lexical set for the Indonesian learner’s dictionary

Word          Freq.
yang          43,613
dan           26,221
atau          14,414
sebagainya    12,410
dengan        12,016
untuk         10,312
dalam         8,638
di            8,537
tidak         7,756
dari          7,280
pada          6,793
orang         6,110
tentang       4,746
seperti       3,422
…             …

▶ extracting the most frequent genus terms in the definition sentences

Word          Freq.
orang         2,703
proses        1,858
alat          1,595
tidak         1,526
bagian        835
perihal       823
tempat        806
menjadikan    745
yang          664
hasil         656
sesuatu       573
kata          557
pohon         547
mempunyai     526
…             …
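Counts like the ones above could be produced roughly as follows. This is a sketch under stated assumptions: the three definition sentences are made up, the tokenizer is a simple letters-only split, and treating the first word of a definition as its genus term is a simplification chosen for the example, not necessarily the method behind the reported figures.

```python
import re
from collections import Counter

# Made-up definition sentences standing in for the definitions in the database.
definitions = [
    "alat yang dipakai untuk menulis",
    "orang yang menulis dan membaca",
    "proses atau cara menulis",
]


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


# Frequency of every word across all definition sentences.
word_freq = Counter(tok for d in definitions for tok in tokenize(d))

# Rough genus-term proxy (an assumption): the first word of each definition.
genus_freq = Counter(tokenize(d)[0] for d in definitions if tokenize(d))

print(word_freq.most_common(3))
print(genus_freq.most_common())
```

In practice the definitions would be read from the database instead of a hard-coded list, and function words such as "yang" would probably be filtered out before picking genus terms.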

What can we get from KBBI Database? III

3. Linguistic analysis

▶ grouping the derived words based on affixes and patterns of reduplication in Indonesian

Affix/Redup.    Example         Number    Percentage
meN-            mengabadi       5,185     21.1%
meN-...-kan     mengabadikan    2,884     11.7%
ber-            berabang        2,704     11.0%
-an             abaian          1,873     7.6%
peN-...-an      pengabadian     1,780     7.2%
…               …               …         …
Total                           24,587    100.0%
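One simple way to approximate this grouping is to match the surface form of each derived word against a short ordered list of patterns. The sketch below uses the five example words from the table. The pattern list, the hyphen test for reduplication, and the names `PATTERNS` and `classify` are assumptions for illustration, and a surface match like this will misclassify base words that merely happen to start with "me" or end in "an".

```python
import re
from collections import Counter

# Surface patterns, checked in order so circumfixes win over bare prefixes.
# The "N" of meN-/peN- is approximated by the nasal variants ng, ny, m, n,
# or no nasal at all.
PATTERNS = [
    ("meN-...-kan", re.compile(r"me(?:ng|ny|m|n)?\w+kan")),
    ("peN-...-an", re.compile(r"pe(?:ng|ny|m|n)?\w+an")),
    ("meN-", re.compile(r"me(?:ng|ny|m|n)?\w+")),
    ("ber-", re.compile(r"ber\w+")),
    ("-an", re.compile(r"\w+an")),
]


def classify(word):
    """Return the first affix pattern whose surface form matches the word."""
    if "-" in word:
        # Assumption: a hyphen marks a reduplicated form.
        return "reduplication"
    for name, pattern in PATTERNS:
        if pattern.fullmatch(word):
            return name
    return "other"


words = ["mengabadi", "mengabadikan", "berabang", "abaian", "pengabadian"]
groups = Counter(classify(w) for w in words)

for w in words:
    print(w, "->", classify(w))
```

Because every derived word in the database is stored under its headword, a more reliable version could compare each derived form with its base instead of guessing from the spelling alone.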

What can we get from KBBI Database? IV

4. Linking to other lexical resources

▶ scientific names as a pivot to align KBBI entries to Wordnet Bahasa [4]

KBBI entry    Scientific name       Wordnet lemma         WN synset
abaka         musa textilis         abaca                 12353431-n
abalone       haliotis              Haliotis              01942724-n
abrikos       prunus armeniaca      common apricot        12641007-n
acerang       coleus amboinicus     country borage        12845187-n
adas          foeniculum vulgare    common fennel         12939282-n
adas manis    pimpinella anisum     anise, anise plant    12943049-n
…             …                     …                     …

5. Online and offline applications etc.
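The scientific-name pivot in item 4 amounts to a join on a normalized name. The sketch below uses three rows from the table on this slide plus one clearly hypothetical entry that has no match. How the scientific names are actually obtained on the Wordnet side is not described in the slides, so the `wordnet_by_scientific` lookup and the helper names `normalize` and `align` are assumptions.

```python
# Rows from the slide's table: KBBI entry -> scientific name.
# The last row is a hypothetical entry added to show a failed match.
kbbi_scientific = {
    "abaka": "musa textilis",
    "abalone": "haliotis",
    "abrikos": "prunus armeniaca",
    "entri tanpa padanan": "nomen ignotum",
}

# Assumed lookup on the Wordnet side: scientific name -> (lemma, synset ID).
wordnet_by_scientific = {
    "Musa textilis": ("abaca", "12353431-n"),
    "Haliotis": ("Haliotis", "01942724-n"),
    "Prunus armeniaca": ("common apricot", "12641007-n"),
}


def normalize(name):
    """Case-fold and collapse whitespace so both sides compare equal."""
    return " ".join(name.casefold().split())


def align(kbbi, wordnet):
    """Map each KBBI entry to a Wordnet (lemma, synset) via its scientific name."""
    index = {normalize(sci): target for sci, target in wordnet.items()}
    aligned = {}
    for entry, sci in kbbi.items():
        target = index.get(normalize(sci))
        if target is not None:
            aligned[entry] = target
    return aligned


aligned = align(kbbi_scientific, wordnet_by_scientific)
for entry, (lemma, synset) in aligned.items():
    print(entry, "->", lemma, synset)
```

Entries whose scientific name finds no counterpart are simply left out, so they can be collected separately for manual review.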

Online application

Officially launched on 28 October 2016 [2]; its user interface and system were built with ASP.NET (www.asp.net)
https://kbbi.kemdikbud.go.id/
Dictionary Writing System (DWS) [3], which enables lexicographers to compile and edit dictionary text and facilitates project management, typesetting, and output to printed or electronic media

Offline mobile applications

Officially launched on 17 November 2016
Android Play Store: play.google.com/store/apps/details?id=yuku.kbbi5
iOS App Store: itunes.apple.com/us/app/kamus-besar-bahasa-indonesia/id1173573777

Conclusion and future work

Building a database is vital for machine-tractable lexicons
The database allows lexicographers, linguists, and NLP researchers to access the rich lexicographic and linguistic contents of the Indonesian language in more flexible ways, opening up possibilities for discovering new insights into the language and helping the KBBI editorial staff work on the dictionary more effectively
The database will be expanded with etymological information (our work on compiling and editing the etymological information has been under way since 2015 and is still in progress; we have finished working on lemmas from Sanskrit and are working on lemmas originating from Old Javanese and Dutch)
The database will be connected to corpora

Acknowledgments

Thanks to Francis Bond and Luís Morgado da Costa for the precious advice on the database structure
Thanks to Ivan Lanin for improving the database and making it more efficient
Thanks to Lim Lian Tze who inspired us to write this paper
Thanks to NTU HSS library support staff: Rashidah Ismail, Raihana Abdul Wahid, and Tan Chuan Ko for allowing the first author to borrow KBBI IV paper dictionary for months; and to Wong Oi May who helped order the dictionary

References

[1] Hasan Alwi et al. Tata Bahasa Baku Bahasa Indonesia. 3rd ed. Jakarta: Balai Pustaka, 2014.
[2] Dora Amalia, ed. Kamus Besar Bahasa Indonesia. 5th ed. Jakarta: Badan Pengembangan dan Pembinaan Bahasa, 2016.
[3] B. T. Sue Atkins and Michael Rundell. The Oxford Guide to Practical Lexicography. Oxford University Press, 2008.
[4] Francis Bond et al. “The combined Wordnet Bahasa”. In: NUSA: Linguistic studies of languages in and around Indonesia 57 (2014), pp. 83–100.
[5] Daniel Jurafsky and James H. Martin. Speech and Language Processing. 2nd ed. New Jersey: Pearson Education, Inc., 2009.
[6] M. Paul Lewis. Ethnologue: Languages of the World. 16th ed. Dallas, Texas: SIL International, 2009. url: http://www.ethnologue.com (visited on 12/01/2014).
[7] Lian Tze Lim et al. “Digitising a machine-tractable version of Kamus Dewan with TEI-P5”. In: PeerJ Preprints 4 (July 2016), e2205v1. issn: 2167-9843. doi: 10.7287/peerj.preprints.2205v1. url: https://doi.org/10.7287/peerj.preprints.2205v1.
[8] James Neil Sneddon et al. Indonesian Reference Grammar. 2nd ed. New South Wales: Allen & Unwin, 2010.
[9] Dendy Sugono, ed. Kamus Besar Bahasa Indonesia Pusat Bahasa. 4th ed. Jakarta: PT Gramedia Pustaka Utama, 2008.

Thank you
