International Journal of Advanced Technology and Engineering Exploration (IJATEE)

ISSN (Print):2394-5443    ISSN (Online):2394-7454
Volume-9 Issue-91 June-2022
Paper Title : Identification and extraction of multiword expressions from Hindi & Urdu language in natural language processing
Author Name : Vaishali Gupta and Nisheeth Joshi
Abstract :

Statistical machine translation can translate text from one language to another, but gaps remain in the translations because language resources are scarce. Building a linguistic corpus requires the extraction of multiword expressions (MWEs). An MWE is a group of words that together behave idiomatically. Because the meaning of an MWE is non-compositional, that is, it cannot be derived from the meanings of its individual words, identifying and extracting MWEs is a time-consuming task. An automated system was therefore developed to extract MWEs from Hindi and Urdu language sources. The process comprises tagging, pattern matching, an identification algorithm, and the extraction of MWEs from the data. Each word is first tagged with a part-of-speech tag, and the tagged text serves as input to the pattern-matching algorithm. Pattern matching selects MWE tags that follow specific patterns, and the algorithm for automatic MWE detection is built on top of that selection. A conditional random field model (CRF++) was used to extract the MWEs from the data automatically. The proposed system was evaluated automatically using a confusion matrix. The overall accuracy is 96.82% for Hindi and 96.62% for Urdu.
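
The pattern-matching and evaluation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tag names, the patterns in `CANDIDATE_PATTERNS`, the transliterated sample sentence, and the function names are all hypothetical, and the CRF stage is omitted. It only shows how part-of-speech-tagged bigrams might be filtered into MWE candidates and how overall accuracy follows from the four cells of a binary confusion matrix.

```python
from typing import List, Tuple

# Hypothetical tag-pair patterns; the abstract does not list the paper's
# actual tagset or patterns, so these are illustrative assumptions only.
CANDIDATE_PATTERNS = {("NN", "NN"), ("NN", "VB"), ("JJ", "NN")}


def candidate_bigrams(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return adjacent word pairs whose tag pair matches a candidate pattern."""
    candidates = []
    for (word1, tag1), (word2, tag2) in zip(tagged, tagged[1:]):
        if (tag1, tag2) in CANDIDATE_PATTERNS:
            candidates.append((word1, word2))
    return candidates


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Overall accuracy from a binary confusion matrix: (TP + TN) / total."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Transliterated, hand-tagged sample sentence (an assumption for illustration).
sample = [("usne", "PRP"), ("kaam", "NN"), ("kiya", "VB"), ("hai", "AUX")]
print(candidate_bigrams(sample))
print(accuracy(tp=90, fp=5, fn=5, tn=100))
```

In a pipeline like the one the abstract outlines, such candidates would only be inputs to a later sequence-labelling step; a matching tag pair alone does not establish that a bigram is an MWE.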

Keywords : Bigrams, Tags, Multiword expression (MWE), Conditional random field (CRF), Confusion matrix.
Cite this article : Gupta V, Joshi N. Identification and extraction of multiword expressions from Hindi & Urdu language in natural language processing. International Journal of Advanced Technology and Engineering Exploration. 2022; 9(91):807-826. DOI:10.19101/IJATEE.2021.875212.