Detection of Complex Sentences in Punjabi Language Using CRF

Journal of Innovation in Electronics and Communication Engineering

Nikita1, Sanjeev Kumar Dhiman2, Sanjeev Kumar Sharma3

1, 2, 3 Department of Computer Science and Engineering, DAV University, Jalandhar, India
[email protected], [email protected], [email protected]

Abstract - Many Natural Language Processing (NLP) tasks, such as grammar checking, machine translation and summarization, require a corpus of simple sentences, so complex sentences must first be identified and then simplified. Simplification of complex sentences is therefore a major task in NLP. Complex sentences come in different types with different features. This paper focuses on the Punjabi language: it describes the different types of complex sentences and uses Conditional Random Field (CRF), a statistical approach, to identify complex sentences in a Punjabi corpus.

Keywords - Natural Language Processing, Complex Sentences, CRF, Clauses.

1. INTRODUCTION

Natural Language Processing (NLP) deals with the languages that humans speak, so that people can communicate with computer systems in natural language instead of a computer language. NLP covers various tasks, e.g. machine translation, sentence simplification, grammar checking and POS tagging. Sentence simplification is a basic task that rewrites sentences in a simpler form or breaks them into sub-parts so that a machine can understand and process them easily. A lot of work has been done on sentence simplification for foreign languages, but not much for Indian languages. The various types of sentences are [29]:

Simple sentence: A simple sentence expresses a complete idea with one independent clause. For ex: bache khed rhe han.

Compound sentence: A compound sentence contains two simple sentences joined by a coordinating conjunction.

Vol. 5(2), July – Dec 2015@ ISSN 2249-9946

For ex: Ram khed reha si ate sita padh rhi si.

Complex sentence: A sentence with one independent clause and one or more dependent clauses is a complex sentence. For ex: Rasi tapdian mere pair nu mocha a gai.

1.1 Types of Complex Sentences in Punjabi Language

Complex sentences are divided into two types [7]:

• Predicate Bound: Participial, Infinitival, Conjunctival
• Non-Predicate Bound: Sequential, Non-Sequential

Table 1: Types of complex sentences

Type | Name of the clause | Types
--- | --- | ---
Predicate Bound | Participial | Two types: Perfect and Imperfect
Predicate Bound | Infinitival | Two types: Simple and Imperfective
Predicate Bound | Conjunctival | -
Non-Predicate Bound | Sequential | -
Non-Predicate Bound | Non-Sequential | Two types: Relative and Non-Relative
Non-Predicate Bound | Relative | Two types: Restrictive and Non-Restrictive
Non-Predicate Bound | Non-relative | Two types: Conditional and Non-Conditional

2. CLAUSES AND THEIR TYPES

A clause is a group of words containing a subject and a verb. Some clauses can stand by themselves and some cannot, so clauses are divided into two types: independent clauses and dependent (subordinate) clauses.

2.1 Independent clause: An independent clause does not depend on any other clause to make sense; it can stand on its own.

2.2 Dependent clause: A dependent clause does not express a complete thought and depends on an independent clause to complete its meaning. A subordinating conjunction is used to create a subordinate clause. Dependent clauses are further divided into three types: adverbial, adjectival and nominal clauses.

3. PREVIOUS WORK DONE

Vinh Van Nguyen, Minh Le Nguyen et al., 2009 [22] presented a CRF framework for clause splitting. They proposed a new bottom-up dynamic algorithm for decoding, along with some linguistic features for clause splitting. Their system outperformed previous ones, with 90.01% precision, 78.98% recall and an overall F1 score of 84.09%.

Aniruddha Ghosh, Amitava Das, Sivaji Bandyopadhya, 2010 [5] proposed a method for the Bengali language that used two approaches to clause boundary identification: rule based and CRF. The system first identifies the verbs present in a sentence and then identifies the clauses. After clause boundary identification, the types of the clauses are identified. According to the results, the accuracy of the rule-based clause boundary identification system was 73.12% and that of clause type identification was 78.07%.

Lucia Specia, 2010 [21] introduced a new approach to text simplification based on Statistical Machine Translation, which learns how to translate complex sentences into simple ones. A corpus of natural simplifications was used in this work. The experiments showed that the framework can readily handle lexical operations, i.e. lexical simplification and simple rewriting. The results were very promising, and the quality of the sentences was not degraded.

Daraksha Parveen, Ratna Sanyal, Afreen Ansari, 2011 [14] used a hybrid approach to clause boundary identification for the Urdu language that combines two techniques: rule based and machine learning (CRF). POS tagging and chunking were done manually, and the resulting corpus was used as input data. Different types of clause markers were used, which help in relating and combining two Urdu sentences. Linguistic rules were used to identify the beginning and end of clauses in Urdu. The results obtained using clause markers were 89.2% precision and 90.0% recall.

Dharam Veer Sharma and Aarti, 2011 [17] explained various characteristics of the Punjabi language. Their paper presents the origin and symbols of the Punjabi language and elaborates on the relations that exist in a thesaurus and the role of a thesaurus in natural language processing.

Heiki-Jaan Kaalep, Kadri Muischnek, 2012 [6] described a rule-based system for tagging clause boundaries. The system also identifies parenthesis and embedded clauses. They used the Corpus of the University of Tartu, a collection of written texts containing 245 million words. Finite clauses and a subclass of finite clauses were targeted; the system does not identify the type of a clause. The algorithm proceeds in several steps and traverses a sentence several times. The results were 95% recall and 96% precision.

Lakshmi S, Lalitha Devi, et al., 2012 [16] used CRF for clause identification in the Malayalam language. Two types of features were used. The first were word-level features, which consider three things: the lexical word, its part-of-speech and its chunk. The second were structural-level features, i.e. grammatical rules. The data was presented in column format and a window of size five was used. According to the results, conditional tags were more accurate, and the approximate accuracy of conditional closing tags was 98.53%.

Navneet Kaur, Kamaldeep Garg and Sanjeev Kumar Sharma, 2013 [7] proposed a technique to identify and separate complex sentences from a Punjabi corpus. They discussed the different patterns into which complex sentences are classified. With the help of identification markers, an algorithm was used to separate predicate bound and non-predicate bound complex sentences. Two different data sets were used, on which the accuracy was 85% and 82% for predicate bound sentences and 81% and 80% for non-predicate bound sentences.

Chandni, Rajneesh Narula et al., 2014 [2] proposed a technique for identifying and separating different types of sentences in a Punjabi corpus. The work gives a complete introduction to sentences and uses different features. Their algorithm identifies the type of a sentence on the basis of the clauses present in it. The system achieves an accuracy of more than 90% based on precision and recall.

Rahul Sharma and Soma Paul, 2014 [18] proposed a hybrid approach for the Hindi language. They used a stochastic approach with two models, step-by-step and merged, trained on 14500 sentences using CRF, together with hand-crafted rules. Different rules were prescribed to mark the boundaries of clauses in order to identify them. The average accuracy of the step-by-step model was 91.53% and that of the merged model was 80.63%.

4. METHODOLOGY

4.1 Conditional Random Field

A Conditional Random Field (CRF) is a statistical approach that predicts sequences of labels or tags for given input data [8]. CRFs are undirected graphical models, also known as random fields, that are used to calculate the conditional probability p(y|x) of possible output nodes y = (y1,…,yn) given the input x = (x1,…,xn), which is also called the observation. A CRF in general can be expressed as:

$$P(y \mid x) = \frac{1}{Z(x)} \prod_{C \in \mathcal{C}} \Psi_C(x_C, y_C)$$

where \mathcal{C} is the set of cliques of the graph, \Psi_C is the potential function of clique C, and Z(x) is the normalization factor.

Typical features include neighboring words and word bigrams, prefixes and suffixes, capitalization, membership in domain-specific lexicons and semantic information from sources.

CRFs have been applied to a variety of domains, including text processing, computer vision and bioinformatics. We used the CRF++-0.58 toolkit for our system and an existing rule-based POS tagger to annotate the corpus.

4.2 Features

The main task in any machine learning approach is to select an accurate feature set so that the CRF works properly. The following features are used in this work for training the CRF:

Table 2: Features

Feature | Description | Example / notation
--- | --- | ---
Word | The word itself. | Wi
Identification marker | The most important feature, as it helps in identifying complex sentences and their types. | ਜਿਹੜਾ, ਕਿ, ਇਆਂ, ਦਿਆਂ
Part-of-speech tag | The grammatical label assigned to the word in the text. The tag of the word and the tags of its previous and next words are considered. | t-1, t, t+1
Word suffix and prefix | A fixed-length suffix and prefix of each word in our dataset is considered as a feature. | Prefix: ਮੇਰੇ, ਮੇਰ, ਮੇ, ਮ; Suffix: ਮੇਰੇ, ◌ੇਰੇ, ਰੇ, ◌ੇ
Word root | Whether the word located by an identification marker is a root itself or a word. | Root: ਕੇ, ਕਿ; Word: ਵੇਖਦਿਆਂ

4.3 Experiment

Training and testing files: We take 2725 sentences, of which 1625 are used for training the CRF and the remaining 1100 for testing the system.

4.4 Architecture

The following procedure is used:

Corpus → Tagging → Tagged data → Preprocessing → Training file + template file → Model file → Test file + model file → Evaluated data → Results
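As an illustration of how the features of Table 2 might be laid out for a column-based CRF toolkit, the following Python sketch turns tagged tokens into whitespace-separated rows, one token per line, with a blank line ending each sentence. It is only a sketch under stated assumptions: the function names, the column order, the affix length of 2, the ROOT/WORD/NONE encoding of the marker feature and the label values are all hypothetical and are not taken from the paper. The marker list is copied from Table 2. Python slices strings by code point, so for Gurmukhi text a prefix or suffix may cut through a consonant and vowel-sign cluster.

```python
# Hypothetical sketch: one row of columns per token, one sentence per block.
# Marker list copied from Table 2; everything else below is an assumption.
IDENTIFICATION_MARKERS = ("ਜਿਹੜਾ", "ਕਿ", "ਇਆਂ", "ਦਿਆਂ")
AFFIX_LENGTH = 2  # assumed; the paper only says "fixed length"


def marker_feature(word):
    """Encode the identification-marker and word-root features in one column.

    ROOT: the word is itself a marker.
    WORD: the word ends with a marker (the marker is attached to a word).
    NONE: no marker found.
    """
    if word in IDENTIFICATION_MARKERS:
        return "ROOT"
    if any(word.endswith(marker) for marker in IDENTIFICATION_MARKERS):
        return "WORD"
    return "NONE"


def token_columns(word, pos_tag, label):
    """Return the columns for one token; the label goes last."""
    prefix = word[:AFFIX_LENGTH]
    suffix = word[-AFFIX_LENGTH:]
    return [word, marker_feature(word), pos_tag, prefix, suffix, label]


def sentence_to_rows(tokens):
    """tokens: list of (word, pos_tag, label) tuples for one sentence.

    Returns tab-separated rows followed by a blank line, which marks the
    end of the sentence in column-based training files.
    """
    rows = ["\t".join(token_columns(w, t, y)) for w, t, y in tokens]
    return "\n".join(rows) + "\n\n"
```

The tags of the previous and next words (t-1, t+1) are not stored as extra columns here, because a toolkit template can usually refer to neighbouring rows by relative position.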

4.5 Training and Learning CRF

The Punjabi corpus is collected from different sources and tagging is performed on that data. The tagged data is then used for training. The training and testing files should be written in the same format. A template file is prepared that describes how the columns of the dataset correspond to features; the template file changes when the features change. The following two commands of the CRF toolkit are used for training and testing the system:

crf_learn template_file train_file model_file

This command is used for training the system. template_file and train_file are the two basic files used for training; from their combination crf_learn generates a model file, model_file, which is then used in testing.

crf_test -m model_file test_file

This command is used for testing the system: crf_test tests the model file against the test file. model_file is the file created by crf_learn and test_file is a file containing the data for testing.

5. EXPERIMENTAL RESULTS

Table 3: Experimental Results

The results of our proposed system are given in Table 3 and are considerably better than those of the existing system. In the existing system [7], the authors considered morphological features for the identification of complex sentences and did not use any statistical approach, which limited the accuracy of that system to 85.2%.

6. CONCLUSION

In this work we have proposed a CRF-based approach to identify different types of complex sentences. CRF is a statistical approach that has been used by different authors for different languages, but it has not previously been applied to the Punjabi language for the identification of complex sentences. By applying CRF, we provide a framework that takes into account all accessible features, including the interaction between sentences. Our accuracy in finding complex sentences is 95.1%.

REFERENCES

[1] C Poornima, V Dhanalakshmi, M Kumar Anand, P K Sonam, "Rule based Sentence Simplification for English to Tamil Machine Translation System." International Journal of Computer Applications (0975 - 8887), Volume 25, No. 8, July 2011. Print.

[2] Chandni, Narula Rajneesh, Sharma Sanjeev Kumar, "Identification and Separation of Simple, Compound and Complex Sentences in Punjabi Language." International Journal of Computer Applications & Information Technology, Vol. 6, Issue II, Aug-September 2014. Print.

[3] Daelemans Walter, Hothker Anja, Sang Erik Tjong Kim, "Automatic Sentence Simplification for Subtitling in Dutch and English." In: Proceedings of the 4th Conference on Language Resources and Evaluation, Lisbon, Portugal (2004) 1045-1048. Print.

[4] Fernandes Eraldo R, Santos Cicero N. dos, Milidiu Ruy L, "A Machine Learning Approach to Portuguese Clause Identification." In: Proceedings of the Computational Processing of the Portuguese Language, pp. 55-64 (2010). Print.

[5] Ghosh Aniruddha, Das Amitava, Bandyopadhya Sivaji, "Clause Identification and Classification in Bengali." Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), pages 17-25, the 23rd International Conference on Computational Linguistics (COLING), Beijing, August 2010. Print.

[6] Kaalep Heiki-Jaan, Muischnek Kadri, "Robust clause boundary identification for corpus annotation." In: Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey (2012). Print.

[7] Kaur Navneet, Garg Kamaldeep, Sharma Sanjeev Kumar, "Identification and Separation of Complex Sentences from Punjabi Language." International Journal of Computer Applications, Volume 69, No. 13, May 2013. Print.

[8] Klinger Roman, Tomanek Katrin, "Classical Probabilistic Models and Conditional Random Fields", Algorithm Engineering Report, TR07-2013, December 2007, ISSN 1864-4503. Print.

[9] Kumar Dinesh, Josan Gurpreet Singh, "Part of Speech Taggers for Morphologically Rich Indian Languages: A Survey", International Journal of Computer Applications (0975 - 8887), Volume 6, No. 5, September 2010. Print.

[10] Leffa Vilson J, "Clause Processing in Complex Sentences." First International Conference on Language Resources And Evaluation, Proceedings of the First International Conference on Language Resources and Evaluation. Granada, Espanha: 1998. v. 2, p. 937-943. Print.

[11] Lehal Gurpreet Singh, "A Survey of the State of the Art in Punjabi Language Processing", Language in India, www.languageinindia.com, Strength for Today and Bright Hope for Tomorrow, Volume 9: 10 October 2009, ISSN 1930-2940. Print.

[12] Nadkarni Prakash M, Machado LucilaOhno, Chapman Wendy W, "Natural language processing: an introduction", J Am Med Inform Assoc 2011; 18:544-551. Doi: 10.1136/amiajnl-2011-000464, Published by group.bmj.com on October 5, 2011. Print.

[13] Orasan, C.: A Hybrid Method for Clause Splitting in Unrestricted English Text. In: Proceedings of ACIDCA 2000 Corpora Processing, Monastir Tunisia, pp. 129-134 (2000). Print.

[14] Parveen Daraksha, Sanyal Ratna, Ansari Afreen, "Clause Boundary Identification using Classifier and Clause Markers in Urdu Language." Polibits Research Journal on Computer Science, 43, pp. 61-65, 2011. Print.

[15] Puscasu Georgiana, "A Multilingual Method for Clause Splitting." In: Proceedings of the 7th annual colloquium for the UK Special interest group for computational linguistics (CLUK 2004), Birmingham, UK. Print.

[16] S Lakshmi, Ram Sundar Vijay, R and Sobha, Devi Lalitha, "Clause Boundary Identification for Malayalam Using CRF." Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages (MTPIL-2012), pages 83-92, COLING 2012, Mumbai, December 2012. Print.

[17] Sharma Dharam Veer, Aarti, "Punjabi Language Characteristics and Role of Thesaurus in Natural Language processing." International Journal of Computer Science and Information Technologies, Vol. 2 (4), 2011, 1434-1437. Print.

[18] Sharma Rahul, Paul Soma, "A hybrid approach for automatic clause boundary identification in Hindi." Proceedings of the 5th Workshop on South and Southeast Asian NLP, 25th International Conference on Computational Linguistics, pages 43-49, Dublin, Ireland, August 23-29 2014. Print.

[19] Sharma Saurabh and Gupta Vishal, "Punjabi Text Clustering by Sentence Structure Analysis." pp. 237-244. 2012. Print.

[20] Siddharthan, A.: An architecture for a text simplification system. In: LREC 2002: Proceedings of the Language Engineering Conference, LEC 2002, pp. 64-71. Print.

[21] Specia Lucia, "Translating from Complex to Simplified Sentences." Lecture Notes in Computer Science, 6001:30-39, Springer-Verlag, 2010. Print.

[22] Nguyen Vinh Van, Nguyen Minh Le, Shimazu Akira, "Using Conditional Random Fields for Clause Splitting." Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, pages 58-65, 2009. Print.

[23] Yin Dapeng, Jiang Peilin, Ren Fuji, Kuroiwa Shingo, "Chinese complex long sentences processing method for Chinese-Japanese machine translation." IEEE, 2007. Print.

[24] Zhou Junsheng, Zhang Yabing, Dai Xinyu, Chen Jiajun, "Chinese Event Descriptive Clause Splitting with Structured SVMs." Springer-Verlag Berlin Heidelberg, pp. 175-183, 2010. Print.
