Automatic Representative News Generation using Automatic Clustering

The 14th Industrial Electronics Seminar 2012 (IES 2012) Electronic Engineering Polytechnic Institute of Surabaya (EEPIS), Indonesia, October 24, 2012 ...
Author: Rosanna Lyons

Diptia Zandra Eka Puspitasari, Ali Ridho Barakbah, Idris Winarno
Electronic Engineering Polytechnic Institute of Surabaya
Institut Teknologi Sepuluh Nopember (ITS) Surabaya
Email: [email protected], [email protected], [email protected]

Abstract

More than 2000 news items are published each day by 32 online news sites in Indonesia. Users with limited time find it difficult to choose which items are worth reading, because many of them share the same topic and content. Automatically clustering the news and providing one representative for each group of similar items addresses this redundancy problem. This paper presents a new approach to automatic representative news generation using automatic clustering. The approach involves five steps: (1) Data Acquisition, (2) Keyword Extraction, (3) Metadata Aggregation, (4) Automatic Clustering, and (5) Representation News Generation. Data Acquisition retrieves the news from RSS feeds and provides the news descriptions, which are tokenized and filtered in the Keyword Extraction process. Keyword Extraction produces tokens, token links, and token values, which are passed to the Metadata Aggregation process to build a matrix of token values for each link. Using the Automatic Clustering method, the system identifies a suitable number of clusters and clusters the news automatically in order to provide news representatives to users. The representative of each cluster is the news item with the shortest distance to the cluster centroid. The resulting representatives depend on the token values of each link: if the difference value of the clusters is too small, the news items are much-separated, and if it is too large, they are less-separated. The longer the refresh time, the more accurate the automatic clustering results, because more data are available to form clusters.

Keywords: Data Acquisition, Keyword Extraction, Metadata Aggregation, Automatic Clustering, Representation News Generation
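As an illustration of the steps named in the abstract, the following is a minimal sketch of Keyword Extraction, Metadata Aggregation, and representative selection. It is not the authors' implementation: the stop-word list, the function names, the use of raw token counts as token values, and the sample descriptions are all assumptions made for illustration. The cluster labels are supplied by hand here, since the Automatic Clustering method of [4] is not reproduced.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the filter list used in the paper is not given.
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "to", "and", "is", "for"}


def extract_tokens(description):
    """Tokenize a news description and filter out stop words."""
    words = re.findall(r"[a-z]+", description.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_matrix(descriptions):
    """Return (vocabulary, matrix); matrix[i][j] counts token j in news link i.

    Raw counts stand in for the paper's token values (an assumption).
    """
    counts = [Counter(extract_tokens(d)) for d in descriptions]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocabulary] for c in counts]
    return vocabulary, matrix


def representatives(matrix, labels):
    """Map each cluster label to the row index closest to that cluster's centroid."""
    result = {}
    dim = len(matrix[0])
    for label in sorted(set(labels)):
        members = [i for i, l in enumerate(labels) if l == label]
        centroid = [
            sum(matrix[i][j] for i in members) / len(members) for j in range(dim)
        ]
        # Euclidean distance between each member row and the centroid.
        result[label] = min(members, key=lambda i: math.dist(matrix[i], centroid))
    return result


# Made-up descriptions and hand-assigned cluster labels, for illustration only.
descriptions = [
    "Flood hits Surabaya city center",
    "Surabaya flood closes city roads",
    "Flood in Surabaya city",
    "Football team wins national cup",
    "National football cup final",
]
labels = [0, 0, 0, 1, 1]

vocabulary, matrix = build_matrix(descriptions)
for label, index in representatives(matrix, labels).items():
    print(label, descriptions[index])
```

In this sketch the representative of a cluster is simply the member row with the smallest Euclidean distance to the mean of the cluster's rows; ties are resolved in favor of the earliest item.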

1. Introduction

As the number of internet users in Indonesia increases, more sites provide information for them, including news. Many news media (both TV and newspapers) have shifted to online news sites because of this increase. Based on observations from June 20 to June 21, 2012, more than 2000 news items appear on the internet each day. Sometimes the news displayed by one site is taken from other sites. This causes news redundancy on the internet, which increases the number of news items presented. As a result, users cannot read, or even choose among, the thousands of items available. A solution to the news redundancy problem is therefore urgently needed.

The purpose of this paper is to choose representatives of the news taken from 32 online news sites in Indonesia. Using the Automatic Clustering algorithm, it produces one news item as the representative of each news cluster. In addition to finding news representatives, the system also groups the news automatically based on the descriptions of the news content.

Research related to news representation was performed by George Adam, Vassilis Poulopoulos, and Christos Bouras [1]. They describe a mechanism that fetches web pages containing news articles from major news portals and blogs. It was constructed to support tools that acquire news articles from all over the world, process them, and present them back to end users in a personalized manner. Document clustering was studied by Michael Steinbach, George Karypis, and Vipin Kumar [2], who examined some common document clustering techniques. In particular, they compare the two main approaches to document clustering: agglomerative hierarchical clustering and K-means. Another work on clustering for search engines is “A Dynamic Clustering Interface to Web Search Results” [3] by Oren Zamir and Oren Etzioni. They introduce Grouper, an interface to the results of the Husky Search meta-search engine, which dynamically groups the search results into clusters labeled by phrases extracted from the snippets.

This paper presents a new system for generating automatic representative news using automatic clustering. The system must gather news from 32 online sites in Indonesia and choose a representative for each news theme. The news data are obtained from the RSS feeds of those online news sites. Text mining is used to read the content of each news item. By using tokenizing and filtering methods, the tokens and token

ISBN: 978-602-9494-28-0


Computer Science and Engineering, Information Systems Technologies and Applications

Figure 5. The Possibility of Clustering Result after Increasing Number of News

4. Conclusion

This paper presented automatic representative news generation by applying Automatic Clustering [4]. The proposed system generated news representatives from 32 news sites in Indonesia, determining the number of news clusters automatically rather than manually. The link index values have a significant influence on the automatic clustering result. Similar link index values can cause unrelated news items to fall into one cluster, producing an unexpected result. If the difference between the number of links and the number of clusters is too large, the clusters become much-separated; if it is too small, the clusters become less-separated. The time interval of data acquisition also has a significant effect on the automatic clustering process. The longer the interval, the more news items are clustered. This widens the spread of the data, so the value of the accuracy ratio (φ) also increases.

References

[1] George, A., Christos, B., and Vassilis, P. n.d. Efficient extraction of news articles based on RSS. Computer and Informatics Engineer Department, University of Patras.
[2] Steinbach, Michael, Karypis, George, and Kumar, Vipin. 2008. A Comparison of Document Clustering Techniques. Twin Cities: University of Minnesota.
[3] Zamir, Oren and Etzioni, Oren. 2010. Grouper: A Dynamic Clustering Interface to Web Search Results. Seattle: Department of Computer Science and Engineering.
[4] Barakbah, A.R. and Arai, K. 2004. Determining Constraints of Moving Variance to Find Global Optimum and Make Automatic Clustering. Surabaya: IES, Politeknik Elektronika Negeri Surabaya.
[5] Johnson, Roger A. 2007 (orig. 1929). Advanced Euclidean Geometry. Dover. p. 173, corollary to #272.
[6] Barakbah, Ali Ridho. 2006. Clustering. Surabaya: Soft Computing Research Group EEPIS-ITS.
