
International Journal of Advanced Research in Computer Science and Software Engineering
Volume 6, Issue 8, August 2016, ISSN: 2277 128X
Research Paper available online at: www.ijarcsse.com

Survey Paper on Big Data

C. Lakshmi*, V. V. Nagendra Kumar
MCA Department, RGMCET, Nandyal, Andhra Pradesh, India

Abstract— Big data is the term for any collection of datasets so large and complex that they become difficult to process using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from datasets that are diverse, complex, and of massive scale. A big data environment is used to acquire, organize, and analyze these varied types of data. Data that is so large in volume, so diverse in variety, or moving with such velocity is called big data. Analyzing big data is challenging because it involves large distributed file systems that must be fault tolerant, flexible, and scalable. The technologies used by big data applications to handle massive data include Hadoop, MapReduce, Apache Hive, NoSQL, and HPCC. First, we present the definition of big data and discuss big data challenges. Next, we present a systematic framework that decomposes big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics; these four modules form a big data value chain. Following that, we present a detailed survey of materials and methods used in the research and industry communities, together with the prevalent Hadoop framework for addressing big data. Finally, we outline big data system architecture and present key research challenges and directions for big data systems.

Keywords— Big Data, Hadoop, MapReduce, Apache Hive, NoSQL, HPCC

I. INTRODUCTION
Big data is one of the biggest buzz phrases in the IT domain. New personal-communication technologies are driving the big data trend, and the Internet population grows day by day without ever reaching saturation. The need for big data arises in large companies such as Facebook, Yahoo, Google, and YouTube, which must analyze enormous amounts of data in both unstructured and structured form. Google alone holds a vast amount of information, so big data analytics is needed to process these complex and massive datasets. This data differs from traditional structured data in terms of five parameters: volume, variety, velocity, value, and veracity (the 5 V's). The five V's are the main challenges of big data management:
1. Volume: Data of all types grows day by day, from kilobytes and megabytes up to terabytes, petabytes, and beyond, producing very large files. Excessive data volume is primarily a storage issue, which is addressed by reducing storage cost. Data volumes are expected to grow 50 times by 2020.
2. Variety: Data sources are extremely heterogeneous. Files arrive in various formats and of any type, structured or unstructured, such as text, audio, video, log files, and more. The varieties are endless, and data enters the network without having been quantified or qualified in any way.
3. Velocity: Data arrives at high speed. Sometimes one minute is too late, so big data is time sensitive. For some organisations data velocity is the main challenge; social media messages and credit card transactions are completed in milliseconds, and the resulting data is written into databases.
4. Value: Value is the most important V in big data. Value is the main driver for big data, because it matters to businesses and IT infrastructures that large amounts of valuable data can be stored and exploited.
5. Veracity: Veracity refers to the spread of values typical of a large dataset. When dealing with high volume, velocity, and variety, the data will not be 100% correct; there will be dirty data. Big data and analytics technologies must work with these kinds of data.
Huge volumes of data, both structured and unstructured, must be managed through organization, administration, and governance. Unstructured data is data that is not held in a database; it may be textual, verbal, or in another form. Textual unstructured data includes PowerPoint presentations, email messages, Word documents, and instant messages. Data in other formats can be .jpg or .png images and audio files.


Fig. 1 Parameters of Big Data

Fig. 2 illustrates a general big data network model with MapReduce. Distinct applications in the cloud place demanding requirements on the acquisition, transportation, and analytics of structured and unstructured data. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. The trend toward larger datasets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found that "spot business trends, prevent diseases, and combat crime and so on". Scientists regularly encounter limitations due to large datasets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. The limitations also affect Internet search, finance, and business informatics. Datasets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data were created. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

Fig. 2 General Framework of Big Data Networking


The history of big data can be roughly split into the following stages:
Megabyte to Gigabyte: In the 1970s and 1980s, historical business data introduced the earliest "big data" challenge in moving from megabyte to gigabyte sizes. The urgent need at that time was to house that data and run relational queries for business analyses and reporting. Research efforts gave birth to the "database machine", which featured integrated hardware and software to solve the problem. The underlying philosophy was that such integration would provide better performance at lower cost. After a period of time, it became clear that hardware-specialized database machines could not keep pace with the progress of general-purpose computers. Thus, the descendant database systems are software systems that impose few constraints on hardware and can run on general-purpose computers.
Gigabyte to Terabyte: In the late 1980s, the popularization of digital technology caused data volumes to expand to several gigabytes or even a terabyte, which is beyond the storage and/or processing capabilities of a single large computer system. Data parallelization was proposed to extend storage capabilities and to improve performance by distributing data and related tasks, such as building indexes and evaluating queries, across disparate hardware. Based on this idea, several types of parallel databases were built, including shared-memory, shared-disk, and shared-nothing databases, as induced by the underlying hardware architecture.
Terabyte to Petabyte: During the late 1990s, when the database community was admiring its "finished" work on the parallel database, the rapid development of Web 1.0 led the whole world into the Internet era, along with massive semi-structured or unstructured web pages holding terabytes or petabytes (PBs) of data. The resulting need for search companies was to index and query the mushrooming content of the web. Unfortunately, although parallel databases handle structured data well, they provide little support for unstructured data, and their capabilities were limited to less than a few terabytes.
Petabyte to Exabyte: Under current development trends, data stored and analyzed by big companies will undoubtedly reach the PB to exabyte magnitude soon. However, current technology still handles terabyte to PB data; no revolutionary technology has yet been developed to cope with larger datasets.

II. LITERATURE SURVEY
1. Hadoop MapReduce is a large-scale, open source software framework dedicated to scalable, distributed, data-intensive computing. The framework breaks up large data into smaller parallelizable chunks and handles scheduling; it maps each piece to an intermediate value, reduces intermediate values to a solution, and offers user-specified partition and combiner options. It is fault tolerant, reliable, and supports thousands of nodes and petabytes of data. If your algorithms can be rewritten as MapReduce jobs and your problem can be broken into small pieces solvable in parallel, then Hadoop's MapReduce is a good distributed problem-solving approach for large datasets. It is tried and tested in production and offers many implementation options.
We can present the design and evaluation of a data-aware cache framework that requires minimal change to the original MapReduce programming model for provisioning incremental processing for big data applications using the MapReduce model [4].
2. The author [2] stated the importance of some of the technologies that handle big data, such as Hadoop, HDFS, and MapReduce. The author discussed the various schedulers used in Hadoop and the technical aspects of Hadoop, and also focused on the importance of YARN, which overcomes the limitations of MapReduce.
3. The authors [3] surveyed various technologies for handling big data and their architectures. The paper also discussed the challenges of big data (volume, variety, velocity, value, veracity) and the advantages and disadvantages of these technologies. It presented an architecture using Hadoop HDFS for distributed data storage, real-time NoSQL databases, and MapReduce for distributed data processing over a cluster of commodity servers. The main goal of the paper was to survey big data handling techniques that manage massive amounts of data from different sources and improve the overall performance of systems.
4. The author continues with the big data definition and enhances the definition given in [3] that includes the 5V big data properties (volume, variety, velocity, value, veracity), and suggests other dimensions for big data analysis and taxonomy, in particular comparing and contrasting big data technologies in e-Science, industry, business, social media, and healthcare. With a long tradition of working with constantly increasing volumes of data, modern e-Science can offer industry its scientific analysis methods, while industry can bring advanced and fast-developing big data technologies and tools to science and the wider public [1].
5. The author [6] stated that the need to process enormous quantities of data has never been greater. Not only are terabyte- and petabyte-scale datasets rapidly becoming commonplace, but there is consensus that great value lies buried in them, waiting to be unlocked by the right computational tools. In the commercial sphere, business intelligence is driven by the ability to gather data from a dizzying array of sources. Big data analysis tools such as MapReduce over Hadoop and HDFS promise to help organizations better understand their customers and the marketplace, hopefully leading to better business decisions and competitive advantages [3].
6. The author [5] stated that there is a need to maximize returns on BI investments and to overcome difficulties. The problems and new trends mentioned in the article, and the solutions found by combining advanced tools, techniques, and methods, would help readers in BI projects and implementations.


BI vendors are struggling and making continuous efforts to bring technical capabilities and to provide complete out-of-the-box solutions with sets of tools and techniques. In 2014, due to rapid change in BI maturity, BI teams faced a tough time building infrastructure with less-skilled resources. Consolidation and convergence are ongoing, and the market is coming up with a wide range of new technologies. Still, the ground is immature and in a state of rapid evolution.
7. The author [8] gave some important emerging framework model designs for big data analytics and a 3-tier architecture model for big data in data mining. The proposed 3-tier architecture model is more scalable across different environments and helps overcome the main issues in big data analytics of storing, analyzing, and visualizing data. The framework model is given for Hadoop HDFS distributed data storage, real-time NoSQL databases, and MapReduce distributed data processing over a cluster of commodity servers.
8. A big data framework needs to consider complex relationships between samples, models, and data sources, along with their changes over time and other possible factors. High-performance computing platforms are required to support big data mining. With big data technologies [3] we will hopefully be able to provide the most relevant and most accurate social-sensing feedback to better understand our society in real time [7].
9. There are many scheduling techniques available to improve job performance, but each has some limitation, so no single technique can overcome every parameter affecting the performance of the whole system, such as data locality, fairness, load balance, the straggler problem, and deadline constraints. Each technique has advantages over the others, so combining or interchanging techniques gives much better results than any individual scheduling technique [10].
10. The author [9] describes the concept of big data along with the 3 Vs: volume, velocity, and variety. The paper also focuses on big data processing problems. These technical challenges must be addressed for efficient and fast processing of big data. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These challenges are common across a large variety of application domains and are therefore not cost-effective to address in the context of one domain alone. The paper describes Hadoop, open source software used for processing big data.
11. The author [12] proposed a system based on an implementation of online aggregation of MapReduce in Hadoop for efficient big data processing. Traditional MapReduce implementations materialize the intermediate results of mappers and do not allow pipelining between the map and the reduce phases. This approach has the advantage of simple recovery in the case of failures; however, reducers cannot start executing tasks before all mappers have finished. As MapReduce Online is a modified version of Hadoop MapReduce, it supports online aggregation and stream processing, while also improving utilization and reducing response time.
12. The author [11] stated that, learning from application studies, they explore the design space for supporting data-intensive and compute-intensive applications on large data-center-scale computer systems. Traditional data processing and storage approaches face many challenges in meeting the continuously increasing computing demands of big data. This work focused on MapReduce, one of the key enabling approaches for meeting big data demands by means of highly parallel processing on a large number of commodity nodes.

III. TECHNOLOGIES AND METHODS
Big data is a new concept for handling massive data, so the architectural description of this technology is still evolving. The different technologies use almost the same approach, i.e., distribute the data among various local agents and reduce the load on the main server so that traffic can be avoided. There are endless articles, books, and periodicals that describe big data from a technology perspective, so we instead focus here on setting out some basic principles and the minimum technology foundation needed to relate big data to the broader information management domain.
A. Hadoop
Hadoop is a framework that can run applications on systems with thousands of nodes and terabytes of data. It distributes files among the nodes and allows the system to continue working in case of a node failure. This approach reduces the risk of catastrophic system failure. An application is broken into smaller parts (fragments or blocks). Apache Hadoop consists of the Hadoop kernel, the Hadoop Distributed File System (HDFS), MapReduce, and related projects such as ZooKeeper, HBase, and Apache Hive. The Hadoop Distributed File System consists of three components: the NameNode, the Secondary NameNode, and the DataNodes. The multilevel secure (MLS) environmental problems of Hadoop are addressed using the Security-Enhanced Linux (SELinux) protocol, in which multiple sources of Hadoop applications run at different levels; this protocol is an extension of the Hadoop Distributed File System. Hadoop is commonly used for distributed batch index building, and it is desirable to optimize the indexing capability in near real time. Hadoop provides components for storage and analysis for large-scale processing, and is nowadays used by hundreds of companies. The advantages of Hadoop are distributed storage and computational capabilities; it is extremely scalable, optimized for high throughput and large block sizes, and tolerant of software and hardware failure.
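To make the block-and-replica idea concrete, the following is a minimal, illustrative Python sketch (not Hadoop code) of how a file might be split into fixed-size blocks with each block assigned to several DataNodes. The 128 MB block size and replication factor of 3 are assumptions matching common HDFS defaults, the node names are hypothetical, and the round-robin placement is a simplification of HDFS's real rack-aware policy.

```python
# Illustrative sketch only: how HDFS-style block splitting and replica placement
# work conceptually. Block size, replication factor, and node names are assumed.
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # assumed 128 MB blocks
REPLICATION = 3                  # assumed replication factor of 3

DATA_NODES = ["datanode-1", "datanode-2", "datanode-3", "datanode-4"]  # hypothetical cluster


def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the number of blocks needed to hold a file of the given size."""
    return (file_size_bytes + block_size - 1) // block_size


def place_blocks(num_blocks, nodes=DATA_NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (simple round-robin)."""
    placements = {}
    ring = itertools.cycle(nodes)
    for block_id in range(num_blocks):
        placements[block_id] = [next(ring) for _ in range(replication)]
    return placements


if __name__ == "__main__":
    file_size = 1 * 1024 * 1024 * 1024           # a 1 GB file
    blocks = split_into_blocks(file_size)         # -> 8 blocks of 128 MB
    for block, replicas in place_blocks(blocks).items():
        print(f"block {block}: stored on {replicas}")
```

Because every block lives on three different nodes, the loss of any single node leaves at least two copies of each block available, which is the property the text above describes as tolerance of hardware failure.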


Fig. 3 Architecture of Hadoop

Components of Hadoop:
HBase: An open source, distributed, non-relational database system implemented in Java. It runs above the HDFS layer and can serve as well-structured input and output for MapReduce jobs.
Oozie: A web application that runs in a Java servlet. Oozie uses a database to store information about workflows, which are collections of actions, and manages Hadoop jobs in an orderly way.
Sqoop: A command-line interface application that provides a platform for moving data between relational databases and Hadoop, in either direction.
Avro: A system that provides data serialization and data exchange services, used primarily with Apache Hadoop. These services can be used together or independently, according to the data records.
Chukwa: A framework for data collection and analysis, used to process and analyze massive amounts of logs. It is built on top of HDFS and the MapReduce framework.
Pig: A high-level platform for creating MapReduce programs used with Hadoop. It is a high-level data processing system in which data records are analyzed using a high-level language.
ZooKeeper: A centralized service that provides distributed synchronization and group services, along with maintenance of configuration information and records.
Hive: A data warehouse application that provides an SQL interface and a relational model. The Hive infrastructure is built on top of Hadoop and helps in providing summarization and analysis for queries.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Doug Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project. Hadoop is open source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. Hadoop is:
Reliable: The software is fault tolerant; it expects and handles hardware and software failures.
Scalable: Designed for massive scale in processors, memory, and locally attached storage.
Distributed: Handles replication and offers a massively parallel programming model, MapReduce.

Fig. 4 Hadoop System


Hadoop is an open source implementation of a large-scale batch processing system. It uses the MapReduce framework introduced by Google, leveraging the map and reduce functions well known from functional programming. Although the Hadoop framework is written in Java, it allows developers to deploy custom-written programs coded in Java or any other language to process data in parallel across hundreds or thousands of commodity servers. It is optimized for contiguous read requests (streaming reads), where processing consists of scanning all the data. Depending on the complexity of the process and the volume of data, response time can vary from minutes to hours. While Hadoop can process data fast, its key advantage is its massive scalability. Hadoop is currently used for indexing web searches, email spam detection, recommendation engines, prediction in financial services, genome manipulation in the life sciences, and analysis of unstructured data such as logs, text, and clickstreams. While many of these applications could in fact be implemented in a relational database management system (RDBMS), the core of the Hadoop framework is functionally different from an RDBMS. Hadoop is particularly useful when:
Complex information processing is needed: unstructured data needs to be turned into structured data; queries cannot reasonably be expressed in SQL; and the algorithms are heavily recursive, or complex but parallelizable, such as geo-spatial analysis or genome sequencing.
Machine learning is involved: datasets are too large to fit into database RAM or disks, or require too many cores (tens of TB up to PB); the data's value does not justify the expense of constant real-time availability, so archives or special-interest information can be moved to Hadoop and remain available at lower cost; results are not needed in real time; fault tolerance is critical; or significant custom coding would otherwise be required to handle job scheduling.
Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), and a number of related projects such as Apache Hive, HBase, and ZooKeeper. The Hadoop framework is used by major players including Google, Yahoo, and IBM, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X.
HDFS
The Hadoop Distributed File System (HDFS) is the file system component of the Hadoop framework. HDFS is designed and optimized to store data over a large amount of low-cost hardware in a distributed fashion.
Name Node: The NameNode is the master node. It holds the metadata about all DataNodes: their addresses (used to communicate with them), free space, the data they store, which DataNodes are active or passive, the task tracker, the job tracker, and other configuration such as the replication of data.

Fig. 5 HDFS Architecture

The NameNode records all of the metadata, attributes, and locations of files and data blocks in the DataNodes. The attributes it records include file permissions, file modification and access times, and the namespace, which is the hierarchy of files and directories. The NameNode maps the namespace tree to file blocks in DataNodes.


When a client node wants to read a file in HDFS, it first contacts the NameNode to receive the locations of the data blocks associated with that file. The NameNode stores information about the overall system because it is the master of HDFS, with the DataNodes being the slaves. It stores the image and journal logs of the system and must always hold the most up-to-date image and journal. In short, the NameNode always knows where the data blocks and their replicas are for each file, and it also knows where the free blocks are in the system, so it keeps track of where future files can be written.
Data Node: A DataNode is a slave node in Hadoop that stores the data; it runs a task tracker that tracks the jobs running on that node and the jobs coming from the name node. The DataNodes store the blocks and block replicas of the file system. During startup, each DataNode connects to and performs a handshake with the NameNode. The DataNode checks for the correct namespace ID, and if it is not found, the DataNode automatically shuts down. New DataNodes can join the cluster by simply registering with the NameNode and receiving the namespace ID. Each DataNode maintains a block report for the blocks in its node and sends this report to the NameNode every hour, so that the NameNode always has an up-to-date view of where block replicas are located in the cluster. During normal operation of HDFS, each DataNode also sends a heartbeat to the NameNode every ten minutes, so that the NameNode knows which DataNodes are operating correctly and are available.
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop MapReduce: a programming model for large-scale data processing.
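As a conceptual illustration of the heartbeat and block-report bookkeeping just described (a sketch only, not HDFS source code; the intervals follow the figures given above, and the class and node names are hypothetical):

```python
# Conceptual sketch of NameNode-side bookkeeping for heartbeats and block reports.
# Intervals follow the figures given in the text above; this is not actual HDFS code.
import time

HEARTBEAT_INTERVAL = 10 * 60      # ten minutes, as stated above
BLOCK_REPORT_INTERVAL = 60 * 60   # one hour, as stated above


class NameNodeState:
    """Tracks which DataNodes are alive and where block replicas live."""

    def __init__(self):
        self.last_heartbeat = {}   # datanode id -> timestamp of last heartbeat
        self.block_map = {}        # block id   -> set of DataNodes holding a replica

    def receive_heartbeat(self, datanode_id, now=None):
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def receive_block_report(self, datanode_id, block_ids):
        # Replace this DataNode's entries with the freshly reported block list.
        for replicas in self.block_map.values():
            replicas.discard(datanode_id)
        for block_id in block_ids:
            self.block_map.setdefault(block_id, set()).add(datanode_id)

    def live_datanodes(self, now=None, timeout=2 * HEARTBEAT_INTERVAL):
        """Treat a DataNode that has missed roughly two heartbeats as unavailable."""
        now = now if now is not None else time.time()
        return [dn for dn, ts in self.last_heartbeat.items() if now - ts <= timeout]


# Usage: register two DataNodes, accept their block reports, then check liveness.
nn = NameNodeState()
nn.receive_heartbeat("datanode-1", now=0)
nn.receive_heartbeat("datanode-2", now=0)
nn.receive_block_report("datanode-1", ["blk_001", "blk_002"])
nn.receive_block_report("datanode-2", ["blk_002"])
print(nn.live_datanodes(now=HEARTBEAT_INTERVAL))   # both still considered live
print(nn.block_map["blk_002"])                     # replicas on both DataNodes
```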

Fig. 6 Hadoop Ecosystem

All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should therefore be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers. "Hadoop" often refers not just to the base Hadoop package but to the Hadoop ecosystem (Fig. 6), which includes all of the additional software packages that can be installed on top of or alongside Hadoop, such as Apache Hive, Apache Pig, and Apache Spark.
B. Map Reduce
MapReduce was introduced by Google in order to process and store large datasets on commodity hardware. MapReduce is a model for processing large-scale data records in clusters. The MapReduce programming model is based on two functions, map() and reduce(); users implement their own processing logic through well-defined map() and reduce() functions. The map phase works as follows: the master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. A slave node may further divide a sub-problem, leading to a hierarchical tree structure; it processes its portion and passes the result back to the master node. The MapReduce system then groups all intermediate pairs by their intermediate keys and passes them to the reduce() function to produce the final output. In the reduce phase, the master node collects the results of all the sub-problems and combines them to form the output. The function signatures are:
map(in_key, in_value) --> list(out_key, intermediate_value)
reduce(out_key, list(intermediate_value)) --> list(out_value)
or, equivalently, map(k1, v1) --> list(k2, v2) and reduce(k2, list(v2)) --> list(v2).
A MapReduce framework is based on a master-slave architecture, where one master node manages a number of slave nodes. MapReduce works by first dividing the input dataset into even-sized data blocks for equal load distribution. Each data block is then assigned to one slave node and processed by a map task, and a result is generated. A slave node notifies the master node when it is idle, and the scheduler then assigns new tasks to it. The scheduler takes data locality and resources into consideration when it distributes data blocks.
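As a small, self-contained illustration of these signatures (a sketch in plain Python, not Hadoop's Java API), the classic word-count job has exactly this shape: map() emits (word, 1) pairs, the framework groups them by key, and reduce() sums each group.

```python
# Minimal word-count sketch following map(k1, v1) -> list(k2, v2) and
# reduce(k2, list(v2)) -> list(v2). Plain Python, for illustration only;
# a real Hadoop job would implement these as Mapper/Reducer classes in Java.
from collections import defaultdict


def map_fn(doc_id, text):
    """k1 = document id, v1 = document text; emits (word, 1) pairs."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_fn(word, counts):
    """k2 = word, list(v2) = list of partial counts; emits the total."""
    return [sum(counts)]


def run_job(documents):
    # Map phase: apply map_fn to every (key, value) input record.
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_fn(doc_id, text))

    # Shuffle phase: group intermediate values by key, as the framework would.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: apply reduce_fn to each key and its list of values.
    return {key: reduce_fn(key, values)[0] for key, values in grouped.items()}


docs = {"d1": "big data needs big tools", "d2": "hadoop handles big data"}
print(run_job(docs))
# {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'hadoop': 1, 'handles': 1}
```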


Fig. 7 Architecture of MapReduce

Figure 7 shows the MapReduce architecture and how it works. The scheduler always tries to allocate a local data block to a slave node; if that attempt fails, it assigns a rack-local or random data block to the slave node instead. When the map() function completes its task, the runtime system gathers all intermediate pairs and launches a set of reduce tasks to produce the final output. Large-scale data processing is a difficult task: managing hundreds or thousands of processors, parallelization, and distributed environments makes it harder still. MapReduce provides a solution to these issues, as it supports distributed and parallel I/O scheduling, is fault tolerant, supports scalability, and has built-in processes for status monitoring of heterogeneous and large datasets such as those in big data. Using the MapReduce framework, the efficiency and the time needed to retrieve data are quite manageable. To address the volume aspect, new techniques have been proposed to enable parallel processing with the MapReduce framework; the data-aware caching (Dache) framework makes slight changes to the original MapReduce programming model to enhance processing for big data applications. The advantage of MapReduce is that a large variety of problems are easily expressible as MapReduce computations, and a cluster of machines can handle thousands of nodes with fault tolerance. The disadvantages of MapReduce are that real-time processing is not supported, it is not always easy to implement, data must be shuffled, and it is oriented toward batch processing.
MapReduce components:
1. Name Node: manages HDFS metadata and does not deal with files directly.
2. Data Node: stores blocks of HDFS; the default replication level for each block is 3.
3. Job Tracker: schedules, allocates, and monitors job execution on the slaves' Task Trackers.
4. Task Tracker: runs MapReduce operations.
MapReduce Framework
MapReduce is a software framework for distributed processing of large datasets on computer clusters, first developed by Google. MapReduce is intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster. A typical Hadoop cluster integrates the MapReduce and HDFS layers. In the MapReduce layer, the job tracker assigns tasks to the task trackers; the master node's job tracker also assigns tasks to the slave nodes' task trackers (Fig. 8).
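To illustrate how the framework decides which reduce task receives each intermediate key, and how a combiner can shrink the data shuffled between the map and reduce phases, here is a small Python sketch. It is conceptual only; Hadoop's default behaviour is hash-based partitioning, but the three-reducer setup and the sample pairs here are assumptions for illustration.

```python
# Conceptual sketch of hash partitioning and a combiner between map and reduce.
# Not Hadoop code; the reducer count and key set are invented for illustration.
from collections import defaultdict

NUM_REDUCERS = 3   # assumed number of reduce tasks


def partition(key, num_reducers=NUM_REDUCERS):
    """Hash-style partitioner: within one run, the same key always goes to the same reducer."""
    return hash(key) % num_reducers


def combine(pairs):
    """Combiner: pre-aggregate (key, count) pairs on the map side to cut shuffle volume."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


# Intermediate output of one map task before it is shuffled.
map_output = [("big", 1), ("data", 1), ("big", 1), ("hadoop", 1)]

combined = combine(map_output)           # e.g. [("big", 2), ("data", 1), ("hadoop", 1)]

# Route each combined pair to its reduce task.
per_reducer = defaultdict(list)
for key, value in combined:
    per_reducer[partition(key)].append((key, value))

for reducer_id, pairs in sorted(per_reducer.items()):
    print(f"reducer {reducer_id} receives {pairs}")
```

Because the partitioner is a pure function of the key, every occurrence of a given key from every map task lands on the same reducer, which is what lets reduce() see the complete list of values for that key.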

Fig. 8 MapReduce is based on the master-slave architecture


The master node contains:
Job tracker node (MapReduce layer)
Task tracker node (MapReduce layer)
Name node (HDFS layer)
Data node (HDFS layer)
Each of the multiple slave nodes contains:
Task tracker node (MapReduce layer)
Data node (HDFS layer)
The MapReduce layer thus has the job and task tracker nodes, while the HDFS layer has the name and data nodes.
C. Hive
Hive is a distributed agent platform, a decentralized system for building applications by networking local system resources. The Apache Hive data warehousing component, an element of the cloud-based Hadoop ecosystem, offers a query language called HiveQL that automatically translates SQL-like queries into MapReduce jobs; its applications are SQL-oriented, comparable to Oracle and IBM DB2. Its architecture is divided into a MapReduce-oriented execution engine, metadata storage, and an execution part that receives a query from a user or application for execution.
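To show what "translating an SQL-like query into a MapReduce job" means in practice, the sketch below pairs a simple HiveQL-style aggregation with the map and reduce functions it conceptually compiles to. The table name, column names, and query are hypothetical, and real Hive produces far more elaborate plans; this only conveys the idea.

```python
# Conceptual sketch: how a HiveQL-style "GROUP BY ... SUM(...)" maps onto MapReduce.
# Table and column names are invented; real Hive generates much more elaborate plans.
from collections import defaultdict

# HiveQL-style query (illustrative):
#   SELECT region, SUM(amount) FROM sales GROUP BY region;

sales = [  # rows of the hypothetical "sales" table: (region, amount)
    ("north", 120.0),
    ("south", 75.5),
    ("north", 30.0),
    ("east", 42.0),
]


def map_fn(row):
    """Map: emit (group-by key, aggregated column) for each row."""
    region, amount = row
    return [(region, amount)]


def reduce_fn(region, amounts):
    """Reduce: apply the aggregate (SUM) to all values sharing a key."""
    return (region, sum(amounts))


# Simulated job: map every row, shuffle by key, reduce each group.
grouped = defaultdict(list)
for row in sales:
    for key, value in map_fn(row):
        grouped[key].append(value)

result = [reduce_fn(region, amounts) for region, amounts in grouped.items()]
print(result)   # [('north', 150.0), ('south', 75.5), ('east', 42.0)]
```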

Fig. 9 Architecture of Hive

The advantage of Hive is that it is more secure and its implementations are mature and well tuned. The disadvantage is that it suits only ad hoc queries, and its performance is lower than that of Pig.
D. NoSQL
A NoSQL database is an approach to data management and data design that is useful for very large sets of distributed data. These databases generally participate in the real-time events detected on inbound channels, but can also be seen as an enabling technology for analytical capabilities such as relative search applications. These are made feasible by the elastic nature of the NoSQL model, in which the dimensionality of a query evolves from the data in scope and domain rather than being fixed by the developer in advance. NoSQL is useful when an enterprise needs to access huge amounts of unstructured data. There are more than one hundred NoSQL approaches that specialize in managing different multimodal data types (from structured to unstructured), each aiming to solve very specific challenges. Data scientists, researchers, and business analysts in particular value this agile approach, which yields earlier insights into datasets that might otherwise be concealed or constrained by a more formal development process. The most popular NoSQL database is Apache Cassandra. The advantages of NoSQL are that it is open source, horizontally scalable, easy to use, able to store complex data types, and very fast for adding new data and for simple operations and queries. The disadvantages of NoSQL are immaturity, lack of indexing support, lack of ACID guarantees, complex consistency models, and absence of standardization.
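As a toy illustration of the schema flexibility described above (a plain-Python document-store sketch, not the API of Cassandra or any real NoSQL product), note how records with different fields can live in the same collection and a query can be shaped around whatever fields happen to exist:

```python
# Toy document-store sketch illustrating schema-less storage; this is not the API
# of Cassandra or any real NoSQL system, just a conceptual illustration.

class DocumentStore:
    """Stores free-form dictionaries; no fixed schema is declared in advance."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(dict(doc))

    def find(self, **criteria):
        """Return documents whose fields match all given criteria (missing fields never match)."""
        return [d for d in self._docs
                if all(d.get(field) == value for field, value in criteria.items())]


store = DocumentStore()
# Documents with different shapes coexist in the same collection.
store.insert({"type": "tweet", "user": "alice", "text": "big data!"})
store.insert({"type": "sensor", "device": "s-17", "temp_c": 21.4})
store.insert({"type": "tweet", "user": "bob", "text": "hadoop at scale", "lang": "en"})

print(store.find(type="tweet"))    # both tweets, despite their differing fields
print(store.find(device="s-17"))   # the sensor reading only
```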


Fig. 10 Architecture of NoSQL

E. HPCC
HPCC is an open source computing platform that provides services for handling massive big data workflows. The HPCC data model is defined by the user according to the requirements. The HPCC system was proposed and designed to manage the most complex and data-intensive analytical problems. It is a single platform with a single architecture and a single programming language used for data processing, and was designed to analyze gigantic amounts of data in order to solve complex big data problems. HPCC is based on Enterprise Control Language (ECL), a declarative, non-procedural programming language. The main components of HPCC are:
HPCC Data Refinery: a massively parallel ETL engine.
HPCC Data Delivery: a massively parallel, structured query engine.
Enterprise Control Language distributes the workload evenly between the nodes.

IV. EXPERIMENTS ANALYSIS
Big-Data System Architecture: In this section, we focus on the value chain for big data analytics. Specifically, we describe a big data value chain that consists of four stages (generation, acquisition, storage, and processing). Next, we present a big data technology map that associates the leading technologies in this domain with specific phases in the big data value chain and a timeline.
Big-Data System: A Value-Chain View. A big data system is complex, providing functions to deal with different phases in the digital data life cycle, ranging from its birth to its destruction. At the same time, the system usually involves multiple distinct phases for different applications. In this case, we adopt a systems-engineering approach, well accepted in industry, to decompose a typical big data system into four consecutive phases: data generation, data acquisition, data storage, and data analytics, as illustrated along the horizontal axis of Fig. 11. Note that data visualization is an assisting method for data analysis: in general, one visualizes the data to find some rough patterns first and then employs specific data mining methods; we return to this in the data analytics discussion. The details of each phase are explained as follows.
Data generation concerns how data are generated. In this context, the term "big data" designates large, diverse, and complex datasets that are generated from various longitudinal and/or distributed data sources, including sensors, video, click streams, and other available digital sources. Normally, these datasets are associated with different levels of domain-specific value. In this paper, we focus on datasets from three prominent domains, business, Internet, and scientific research, for which the values are relatively easy to understand. However, there are overwhelming technical challenges in collecting, processing, and analyzing these datasets, which demand new solutions that embrace the latest advances in the information and communications technology (ICT) domain.
Data acquisition refers to the process of obtaining information and is subdivided into data collection, data transmission, and data pre-processing. First, because data may come from a diverse set of sources, such as websites that host formatted text, images, and/or videos, data collection refers to dedicated data collection technology that acquires raw data from a specific data production environment. Second, after collecting raw data, we need a high-speed transmission mechanism to transmit the data into the proper storage system for various types of analytical applications.


Fig. 11 Big data technology map. It pivots on two axes, i.e., the data value chain and a timeline. The data value chain divides the data lifecycle into four stages: data generation, data acquisition, data storage, and data analytics. In each stage, we highlight exemplary technologies over the past 10 years.

Finally, collected datasets might contain much meaningless data, which unnecessarily increases the amount of storage space required and affects the subsequent data analysis. For instance, redundancy is common in most datasets collected from sensors deployed to monitor the environment, and we can use data compression technology to address this issue. Thus, we must perform data pre-processing operations for efficient storage and mining.
Data storage concerns persistently storing and managing large-scale datasets. A data storage system can be divided into two parts: hardware infrastructure and data management. Hardware infrastructure consists of a pool of shared ICT resources organized in an elastic way for various tasks in response to their instantaneous demand. The hardware infrastructure should be able to scale up and out and be dynamically reconfigurable to address different types of application environments. Data management software is deployed on top of the hardware infrastructure to maintain large-scale datasets. Additionally, to analyze or interact with the stored data, storage systems must provide several interface functions, fast querying, and other programming models.
Data analysis leverages analytical methods or tools to inspect, transform, and model data to extract value. Many application fields leverage opportunities presented by abundant data and domain-specific analytical methods to derive the intended impact. Although various fields pose different application requirements and data characteristics, a few of these fields may leverage similar underlying technologies. Emerging analytics research can be classified into six critical technical areas: structured data analytics, text analytics, multimedia analytics, web analytics, network analytics, and mobile analytics.
Big-Data System: A Layered View. Alternatively, the big data system can be decomposed into a layered structure, as illustrated in Fig. 12.


Fig. 12 Layered architecture of a big data system. It can be decomposed into three layers, including the infrastructure layer, the computing layer, and the application layer, from bottom to top.

The layered structure is divisible into three layers, i.e., the infrastructure layer, the computing layer, and the application layer, from bottom to top. This layered view only provides a conceptual hierarchy to underscore the complexity of a big data system. The function of each layer is as follows.
The infrastructure layer consists of a pool of ICT resources, which can be organized by a cloud computing infrastructure and enabled by virtualization technology. These resources are exposed to upper-layer systems in a fine-grained manner with a specific service-level agreement (SLA). Within this model, resources must be allocated to meet the big data demand while achieving resource efficiency by maximizing system utilization, energy awareness, operational simplification, and so on.
The computing layer encapsulates various data tools into a middleware layer that runs over the raw ICT resources. In the context of big data, typical tools include data integration, data management, and the programming model. Data integration means acquiring data from disparate sources and integrating the dataset into a unified form with the necessary data pre-processing operations. Data management refers to mechanisms and tools that provide persistent data storage and highly efficient management, such as distributed file systems and SQL or NoSQL data stores. The programming model implements abstract application logic and facilitates data analysis applications. MapReduce, Dryad, Pregel, and Dremel exemplify programming models.
The application layer exploits the interface provided by the programming models to implement various data analysis functions, including querying, statistical analyses, clustering, and classification; it then combines basic analytical methods to develop various field-related applications. McKinsey presented five potential big data application domains: health care, public sector administration, retail, global manufacturing, and personal location data.
Big-Data System Challenges: Designing and deploying a big data analytics system is not a trivial or straightforward task. As one of its definitions suggests, big data is beyond the capability of current hardware and software platforms. The new hardware and software platforms in turn demand new infrastructure and models to address the wide range of challenges of big data. Recent works have discussed potential obstacles to the growth of big data applications. In this paper, we classify these challenges into three categories: data collection and management, data analytics, and system issues.
Data collection and management addresses massive amounts of heterogeneous and complex data. The following challenges of big data must be met:
Data Representation: Many datasets are heterogeneous in type, structure, semantics, organization, granularity, and accessibility. A competent data representation should be designed to reflect the structure, hierarchy, and diversity of the data, and an integration technique should be designed to enable efficient operations across different datasets.
Redundancy Reduction and Data Compression: Typically, there is a large amount of redundant data in raw datasets. Redundancy reduction and data compression without sacrificing potential value are efficient ways to lessen overall system overhead.


Data Life-Cycle Management: Pervasive sensing and computing are generating data at an unprecedented rate and scale that far exceeds the much smaller advances in storage system technologies. One of the urgent challenges is that current storage systems cannot host the massive data. In general, the value concealed in big data depends on data freshness; therefore, we should establish a data importance principle, associated with the analysis value, to decide which parts of the data should be archived and which should be discarded.
Data Privacy and Security: With the proliferation of online services and mobile phones, privacy and security concerns regarding accessing and analyzing personal information are growing. It is critical to understand what support for privacy must be provided at the platform level to eliminate privacy leakage and to facilitate various analyses.
There will be a significant impact resulting from advances in big data analytics, including interpretation, modeling, prediction, and simulation. Unfortunately, massive amounts of data, heterogeneous data structures, and diverse applications present tremendous challenges, such as the following.
Approximate Analytics: As datasets grow and the real-time requirement becomes stricter, analysis of the entire dataset is becoming more difficult. One way to potentially solve this problem is to provide approximate results, such as by means of an approximation query. The notion of approximation has two dimensions: the accuracy of the result and the groups omitted from the output.
Connecting Social Media: Social media possesses unique properties, such as vastness, statistical redundancy, and the availability of user feedback. Various extraction techniques have been successfully used to identify references from social media to specific product names, locations, or people on websites. By connecting inter-field data with social media, applications can achieve high levels of precision and distinct points of view.
Deep Analytics: One of the drivers of excitement around big data is the expectation of gaining novel insights. Sophisticated analytical technologies, such as machine learning, are necessary to unlock such insights. However, effectively leveraging these analysis toolkits requires an understanding of probability and statistics. The potential pillars of privacy and security mechanisms are mandatory access control and secure communication, multi-granularity access control, privacy-aware data mining and analysis, and secure storage and management.
Finally, large-scale parallel systems generally confront several common issues; however, the emergence of big data has amplified the following challenges in particular.
Energy Management: The energy consumption of large-scale computing systems has attracted greater concern from economic and environmental perspectives. Data transmission, storage, and processing will inevitably consume progressively more energy as data volume and analytics demand increase. Therefore, system-level power control and management mechanisms must be considered in a big data system, while continuing to provide extensibility and accessibility.
Scalability: A big data analytics system must be able to support very large datasets created now and in the future. All the components in big data systems must be capable of scaling to address the ever-growing size of complex datasets.
Collaboration: Big data analytics is an interdisciplinary research field that requires specialists from multiple professional fields collaborating to mine hidden values. A comprehensive big data cyberinfrastructure is necessary to allow broad communities of scientists and engineers to access the diverse data, apply their respective expertise, and cooperate to accomplish the goals of analysis.

V. CONCLUSIONS
In this paper we have surveyed various technologies for handling big data and their architectures. We have also discussed the challenges of big data (volume, variety, velocity, value, veracity) and the advantages and disadvantages of these technologies. The paper discussed an architecture using Hadoop HDFS for distributed data storage, real-time NoSQL databases, and MapReduce for distributed data processing over a cluster of commodity servers. The main goal of this paper was to survey big data handling techniques that manage massive amounts of data from different sources and improve the overall performance of systems.

REFERENCES
[1] Yuri Demchenko, "The Big Data Architecture Framework (BDAF)", Outcome of the Brainstorming Session at the University of Amsterdam, 17 July 2013.
[2] Amogh Pramod Kulkarni, Mahesh Khandewal, "Survey on Hadoop and Introduction to YARN", International Journal of Emerging Technology and Advanced Engineering, ISSN 2250-2459, Volume 4, Issue 5, May 2014.
[3] S. Sagiroglu, D. Sinanc, "Big Data: A Review", 2013, pp. 20-24.
[4] Vibhavari Chavan, Rajesh N. Phursule, "Survey Paper on Big Data", International Journal of Computer Science and Information Technologies, Vol. 5 (6), 2014.
[5] Margaret Rouse, "Unstructured Data", April 2010.
[6] Kyuseok Shim, "MapReduce Algorithms for Big Data Analysis", DNIS 2013, LNCS 7813, pp. 44-48, 2013.
[7] X. L. Dong, D. Srivastava, "Big Data Integration", IEEE International Conference on Data Engineering (ICDE), 29 (2013), pp. 1245-1248.


[8] F. Tekiner, J. A. Keane, "Big Data Framework", 2013 IEEE International Conference on Systems, Man and Cybernetics (SMC), 13-16 Oct. 2013, pp. 1494-1499.
[9] Mrigank Mridul, Akashdeep Khajuria, Snehasish Dutta, Kumar N., "Analysis of Big Data using Apache Hadoop and Map Reduce", Volume 4, Issue 5, May 2014.
[10] Suman Arora, Madhu Goel, "Survey Paper on Scheduling in Hadoop", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 5, May 2014.
[11] Aditya B. Patel, Manashvi Birla, Ushma Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce", in Proc. 2012 Nirma University International Conference on Engineering.
[12] Jimmy Lin, "MapReduce Is Good Enough?", The Control Project, IEEE Computer, 32 (2013).
