A Capable Text Data Mining Using in Artificial Neural Network

Author: Randolph Moore
International Journal of Engineering and Techniques - Volume 1 Issue 6, Nov – Dec 2015 RESEARCH ARTICLE

OPEN ACCESS

A Capable Text Data Mining Using in Artificial Neural Network

Mrs. R. Kalpana, Mrs. P. Padmapriya 1,2

(HEAD, Computer Science Department, Annai Vailankanni Arts and Science College, Thanjavur-7.)

Abstract: Text mining attempts to discover new, previously unknown or hidden information by automatically extracting it from various written resources. Applying knowledge discovery methods to unstructured text is known as Knowledge Discovery in Text, Text Data Mining, or simply Text Mining. Most of the techniques used in text mining are founded on the statistical analysis of a term, either a word or a phrase. Previous methods use a variety of text mining algorithms. For example, the Single-Link algorithm and the Self-Organizing Map (SOM) offer an approach for visualizing high-dimensional data and are very useful tools for processing textual data based on a projection method. Genetic and sequential algorithms provide the capability for multiscale representation of datasets and are fast to compute, needing less CPU time, based on the Isolet reduced subsets in unsupervised feature selection. We propose a Vector Space Model and concept-based analysis algorithm intended to improve text clustering quality, so that a better text clustering result may be achieved. We believe the proposed algorithm behaves well in terms of robustness and stability with respect to the formation of the neural network.

Keywords: Concept analysis, document clustering, k-Nearest Neighbor (k-NN), data visualization, Self-Organizing Map (SOM).

I. INTRODUCTION
Artificial neural networks (ANNs) are processing devices, implemented as algorithms or hardware, that are loosely modeled after the neuronal structure of the mammalian brain but on much smaller scales. A large ANN might have a large number of processor units, whereas a mammalian brain has vastly more neurons, with a corresponding increase in their overall interaction and emergent behavior. In neural networks that address classification problems, the training set, the testing set, and the learning rate are key considerations: respectively, the collection of input/output patterns used to train the network, the collection used to assess the network's performance, and the rate at which the weights are adjusted. This paper describes a proposed back-propagation neural network classifier that performs cross-validation against the original neural network, in order to improve classification accuracy and reduce training time. The algorithm is independent of specific data sets, so many of its ideas and solutions can be transferred to other classifier paradigms. We propose to apply this artificial neural network to text data mining.
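The training-set and testing-set roles used in cross-validation can be illustrated with a small sketch. It assumes plain k-fold splitting over sample indices; the function name `k_fold_splits` and the choice of 10 samples and 3 folds are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple


def k_fold_splits(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; the paper does not specify how
    its folds are built.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    splits = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]                  # held-out fold
        train = indices[:start] + indices[start + size:]    # everything else
        splits.append((train, test))
        start += size
    return splits


# Each sample is used for testing exactly once across the folds.
for train, test in k_fold_splits(n_samples=10, k=3):
    print("train:", train, "test:", test)
```

Each fold serves once as the testing set while the remaining folds form the training set, so every pattern contributes to the performance estimate.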

ISSN: 2395-1303

Clustering, or cluster analysis, is a data mining concept: an unsupervised technique that tries to identify the intrinsic groupings of a set of text documents, so that the resulting clusters exhibit high intra-cluster similarity and low inter-cluster similarity [1]. Text clustering methods commonly attempt to separate the documents into groups, where each group represents a theme that differs from the themes represented by the other groups. Most current text clustering methods are based on the Vector Space Model (VSM), a widely used data representation for text classification and clustering. Methods used for text mining include decision trees [2], conceptual clustering [3], statistical analysis [4], and clustering based on data summarization [5]. Usually, in text data mining techniques, the term frequency of a word or phrase is computed to determine the importance of that term in the document. However, two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other.
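As a minimal sketch of the term-frequency view of the Vector Space Model described above, the following code represents each document as a bag of term counts and compares documents with cosine similarity. The whitespace tokenizer, the sample documents, and the function names are assumptions made for illustration; the paper does not prescribe a particular weighting or similarity function.

```python
import math
from collections import Counter
from typing import Dict


def term_frequencies(text: str) -> Dict[str, int]:
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative documents (not from the paper).
doc1 = term_frequencies("neural network text mining")
doc2 = term_frequencies("text mining with neural network")
doc3 = term_frequencies("stock market forecast")

print(cosine_similarity(doc1, doc2))  # documents sharing most terms
print(cosine_similarity(doc1, doc3))  # documents sharing no terms
```

Because this representation looks only at counts, it cannot tell apart two terms with equal frequency but different contributions to sentence meaning, which is the limitation the concept-based model is meant to address.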

http://www.ijetjournal.org


II. Concept-based mining model
The proposed concept-based mining model consists of sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis, and a concept-based similarity measure. A raw text document is the input to the proposed model. Each document has well-defined sentence boundaries. Each sentence in the document is labeled automatically by a parser. After running the semantic role labeler, each sentence in the document might have one or more labeled verb-argument structures. In this model, both the verb and the argument are considered as terms. One term can be an argument to more than one verb in the same sentence, which means that it can have more than one semantic role in the same sentence. In such cases, this term plays important semantic roles that contribute to the meaning of the sentence. In the concept-based mining model, a labeled term, either a word or a phrase, is considered a concept.

The system architecture consists of the following main modules:
o Text preprocessing
o Concept analysis
o Concept-based similarity measure

Fig. 1 shows the architecture of the concept-based model, which consists of sentence-based concept analysis, document-based concept analysis, and a concept-based similarity measure.

[Figure 1 blocks: Text Preprocess (separate sentences, label terms, removing stop words); Concept Analysis (sentence based, document based, corpus based); Concept based similarity]

Fig.1 Architecture of Concept Based Model

A. Text Preprocessing

1) Label Terms
A raw text document is the input to the proposed model. Each document has well-defined sentence boundaries. Each sentence in the document is labeled automatically by the parser. After running the semantic role labeler, each sentence in the document might have one or more labeled verb-argument structures. The labeled verb-argument structures, the output of the role labeling task, are captured and analyzed by the concept-based mining model at the sentence and document levels. In this model, both the verb and the argument are considered as terms. One term can be an argument to more than one verb in the same sentence. This means that the term can have more than one semantic role in the same sentence. In such cases, this term plays important semantic roles that contribute to the meaning of the sentence. In the concept-based mining model, a labeled term, either a word or a phrase, is considered a concept.

2) Removing stop words
In computing, stop words are words that are filtered out before or after the processing of natural language data (text). The stop list is controlled by human input and is not automated. There is no single definitive list of stop words that all tools use, and some tools do not use one at all. Some tools specifically avoid using a stop list in order to support phrase search.

3) Stem words
In linguistic morphology, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since 1968. Many search engines treat words with the same stem as synonyms, as a kind of query broadening, a process called conflation. Stemming programs are commonly referred to as stemming algorithms or stemmers.

B. Concept Analysis
Analyzing each concept at the sentence level is called sentence-based concept analysis.
Consider the following sentence: “Texas and Australia researchers have created industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles”.


In this example, stop words are removed and concepts are shown without stemming for better readability, as follows:

1. Concepts in the first verb-argument structure of the verb created:
• Texas Australia researchers
• created
• industry-ready sheets of material nanotubes lead development of artificial muscles

2. Concepts in the second verb-argument structure of the verb made:
• materials
• nanotubes lead development artificial muscles

3. Concepts in the third verb-argument structure of the verb lead:
• nanotubes
• lead
• development artificial muscles

It is important to note that these concepts are all extracted from the same sentence. Thus, the concepts mentioned in this example sentence are:
• Texas
• Australia
• researchers
• created
• industry
• ready
• sheets
• materials
• nanotubes
• lead
• development
• artificial
• muscles

After finding the concepts at the sentence level, concepts are also found at the document level.
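The stop-word removal step applied to the example sentence can be sketched as follows. The stop list here is a small hand-picked set, an assumption chosen only so that the output matches the concept list above (which is why it includes "made"); it is not the list used by the authors, and the semantic role labeling step is not modeled.

```python
import re
from typing import List

# Illustrative stop list (an assumption); real tools use longer,
# human-curated lists, and no single definitive list exists.
STOP_WORDS = {"and", "have", "of", "made", "from", "that", "could", "to", "the"}


def remove_stop_words(text: str) -> List[str]:
    """Split text into alphabetic tokens and drop those on the stop list."""
    tokens = re.findall(r"[A-Za-z]+", text)  # "industry-ready" -> two tokens
    return [t for t in tokens if t.lower() not in STOP_WORDS]


sentence = (
    "Texas and Australia researchers have created industry-ready sheets "
    "of materials made from nanotubes that could lead to the development "
    "of artificial muscles"
)

concepts = remove_stop_words(sentence)
print(concepts)
```

The tokens are kept unstemmed, mirroring the choice made in the example for readability.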

III. Performances of Neural Network Systems
One concern in the machine learning community is that a system trained on small samples may not perform well on test data. On the other hand, if the training data sets are too large, the concern is how well and how efficiently a system can learn. The objective of this study [6] is to determine which neural network systems are better suited to applications that have small or large training data. For studying neural learning from small training data, we chose five data sets: contact-lenses, cpu, weather symbolic, Weather, and labor-nega-data. All five collections have a rather balanced distribution among all classes, and the number of pattern classes is not too large. First, we applied our text mining algorithms, including text mining techniques based on classification of the data in several data collections. After that, we employed the existing neural network to measure the training time for the five data sets. Experimental results show that the accuracy was the same for all datasets except contact-lenses, which is the only one with missing attributes. For contact-lenses, the accuracy with the Proposed Neural Network was on average around 0.3% lower than with the original Neural Network. The larger the dataset, the greater the improvement in speed. Other informal experiments with larger datasets showed that the Proposed Neural Network can be more than ten times faster when the dataset is bigger than cpu or the network has many hidden units.

IV. Advantages and Disadvantages of Neural Networks
The calculated output [7] is compared to the known output. If the calculated output is correct, nothing more is necessary. If the computed output is incorrect, the weights are adjusted so as to bring the computed output closer to the known output. This process is continued for a large number of cases, or time-series, until the net gives the correct output for a given input. The entire collection of cases learned is called a “training sample” (Connor, Martin and Atlas, 1994). In most real-world problems, the neural network is never 100% correct. Neural networks are programmed to learn up to a given threshold of error. After the neural network learns up to the error threshold, the weight adaptation mechanism is turned off and the net is tested on known cases it has not seen before. The application of the neural network to unseen cases gives the true error rate (Baets, 1994). Artificial neural networks present a number of advantages over conventional methods of analysis.


First, artificial neural networks make no assumptions about the nature of the distribution of the data and are therefore not biased in their analysis. Instead of making assumptions about the underlying population, neural networks with at least one middle layer use the data to develop an internal representation of the relationship between the variables (White, 1992). Second, since time-series data are dynamic in nature, non-linear tools are needed to discern relationships among time-series data. Neural networks are best at discovering non-linear relationships (Wasserman, 1989; Hoptroff, 1993; Moshiri, Cameron, and Scuse, 1999; Shtub and Versano, 1999; Garcia and Gencay, 2000; and Hamm and Brorsen, 2000). Third, neural networks perform well with missing or incomplete data. Whereas traditional regression analysis is not adaptive, typically processing all older data together with new data, neural networks adapt their weights as new input data become available (Kuo and Reitch, 1994). Fourth, it is relatively easy to obtain a forecast in a short period of time compared with an econometric model.

However, there are some problems connected with the use of artificial neural networks. No estimation or prediction errors are calculated with an artificial neural network (Caporaletti, Dorsey, Johnson, and Powell, 1994). Also, artificial neural networks are “black boxes,” for it is impossible to figure out how relations in hidden layers are estimated (Li, 1994). In addition, a network may become a bit overzealous and try to fit a curve to some data even when there is no relationship. Another problem is that neural networks have long training times. Reducing training time is crucial because building a neural network forecasting system is a process of trial and error. Therefore, the more experiments a researcher can run in a finite period of time, the more confident he can be of the result.
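The training procedure outlined at the start of this section (compare the computed output with the known output, adjust the weights when they differ, stop once an error threshold is reached, then test on unseen cases) can be sketched with a single-layer perceptron. This is a deliberately simplified stand-in, not the back-propagation network discussed in the paper; the toy AND-style data, the learning rate, and all function names are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (input vector, known output 0 or 1)


def predict(weights: List[float], bias: float, x: Sequence[float]) -> int:
    """Computed output: 1 if the weighted sum is positive, else 0."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0


def error_rate(weights: List[float], bias: float, samples: List[Sample]) -> float:
    """Fraction of samples whose computed output differs from the known output."""
    wrong = sum(1 for x, y in samples if predict(weights, bias, x) != y)
    return wrong / len(samples)


def train(samples: List[Sample], learning_rate: float = 0.1,
          error_threshold: float = 0.0, max_epochs: int = 100):
    """Adjust weights until the training error reaches the threshold."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        for x, target in samples:
            delta = target - predict(weights, bias, x)
            if delta != 0:
                # Incorrect output: move the weights toward the known output.
                weights = [w + learning_rate * delta * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * delta
        if error_rate(weights, bias, samples) <= error_threshold:
            break  # weight adaptation is turned off here
    return weights, bias


# Toy "training sample": output is 1 only when both inputs are 1.
training_samples: List[Sample] = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
# Cases the network has not seen during training.
unseen_samples: List[Sample] = [((2, 2), 1), ((0.5, 0), 0), ((0, 0.5), 0)]

weights, bias = train(training_samples)
print("training error:", error_rate(weights, bias, training_samples))
print("error on unseen cases:", error_rate(weights, bias, unseen_samples))
```

The error on the unseen cases plays the role of the "true error rate" mentioned above; on real data it is generally higher than the training error.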

V. CONCLUSION
This work bridges the gap between the artificial neural network processing and text data mining disciplines. A new concept-based mining model composed of four components, i.e., sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis, and a concept-based similarity measure, is proposed to improve text clustering quality. By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved. By combining the factors that affect the weights of concepts at the sentence, document, and corpus levels, a concept-based similarity measure capable of accurately calculating the similarity of pairwise documents is devised. This allows concept matching and concept-based similarity calculations among documents to be performed in a very robust and accurate way. The quality of text clustering achieved by this model considerably surpasses that of traditional single-term-based approaches. There are a number of possibilities for extending this work. One direction is to link this work to Web document clustering. Another direction is to apply the same model to text data classification.

REFERENCES
[1] Shady Shehata, Fakhri Karray, and Mohamed S. Kamel, “An Efficient Concept-Based Mining Model for Enhancing Text Clustering”, IEEE Transactions on Knowledge and Data Engineering, Vol. 22, No. 10, pp. 1360-1371, October 2010.
[2] U.Y. Nahm and R.J. Mooney, “A Mutually Beneficial Integration of Data Mining and Information Extraction”, Proc. 17th Nat’l Conf. Artificial Intelligence (AAAI ’00), pp. 627-632, 2000.
[3] L. Talavera and J. Bejar, “Generality-Based Conceptual Clustering with Probabilistic Concepts”, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 23, No. 2, pp. 196-206, Feb. 2001.
[4] T. Hofmann, “The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data”, Proc. 16th Int’l Joint Conf. Artificial Intelligence (IJCAI ’99), pp. 682-687, 1999.
[5] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen, “WEBSOM - Self Organizing Maps of Document Collections”, Proc. Workshop Self Organizing Maps (WSOM ’97), 1997.
[6] Guobin Ou and Yi Lu Murphey, “Multi-class pattern classification using neural networks”, Pattern Recognition 40 (2007).


[7] Yochanan Shachmurove (Department of Economics, The City College of the City University of New York and The University of Pennsylvania) and Dorota Witkowska (Department of Management, Technical University of Lodz), “Utilizing Artificial Neural Network Model to Predict Stock Markets”, CARESS Working Paper #00-11, September 2000.
[8] M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clustering Techniques”, Proc. Knowledge Discovery and Data Mining (KDD) Workshop Text Mining, Aug. 2000.
[9] C. Fillmore, “The Case for Case”, Universals in Linguistic Theory, Holt, Rinehart and Winston, 1968.
[10] S.Y. Lu and K.S. Fu, “A Sentence-to-Sentence Clustering Procedure for Pattern Analysis”, IEEE Trans. Systems, Man, and Cybernetics, Vol. 8, No. 5, pp. 381-389, May 1978.
[11] S. Pradhan, W. Ward, K. Hacioglu, J. Martin, and D. Jurafsky, “Shallow Semantic Parsing Using Support Vector Machines”, Proc. Human Language Technology/North Am. Assoc. for Computational Linguistics (HLT/NAACL), 2004.
