1. Data Management Technology Background

Big data is, at its core, a data management technology. Data management systems have a long history and are the foundation of virtually all computer applications. To understand data management software, we must consider both the hardware environment it depends on and the application scenarios it serves; Figure 1 shows where data management software sits. It is a layer of software between the hardware below and the applications above: in essence, it uses the storage and computing power of computer hardware to store, manage, and process data, and in turn supports the various applications on top.
Figure 1: data management technology background
Data management technology now has at least 50 years of development history and has gone through several stages, as shown in Figure 2. In the first stage, the relational data model was proposed, laying the theoretical foundation for relational databases. In the second stage, commercial relational databases such as Oracle and DB2 rose and matured; database products entered commercial use across all industries, and the database officially became the third necessity of IT infrastructure alongside the server and the operating system. In the third stage, driven by differing business scenarios, databases split into transaction-oriented systems and analysis-oriented systems, whose architectures and data models diverged; this was the first major separation in database technology. The fourth stage is the wave of distributed database technology. It began on the analytics side: single machines could no longer cope with the demands of massive data analysis, putting distributed, horizontally scalable systems on the agenda, and Hadoop, Spark, and the various NoSQL systems were born to meet this need. Around 2010 the distributed approach also expanded into the transactional database world, mainly to handle ever-growing Internet business.
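The transactional/analytical split described above comes down to access patterns: transactional systems fetch whole records, while analytical systems scan a few columns over many records. A minimal sketch (toy data, not any particular database's storage format) contrasts a row layout with a column layout:

```python
# Hypothetical toy table stored two ways.
rows = [
    {"id": 1, "amount": 10.0, "region": "east"},
    {"id": 2, "amount": 25.5, "region": "west"},
    {"id": 3, "amount": 4.5,  "region": "east"},
]

# Row layout (transactional): fetching one whole record is one lookup.
record = rows[1]  # -> {'id': 2, 'amount': 25.5, 'region': 'west'}

# Column layout (analytical): each column is stored contiguously,
# so an aggregate scan touches only the column it needs.
columns = {k: [r[k] for r in rows] for k in rows[0]}
total = sum(columns["amount"])
print(total)  # 40.0
```

This is why analytical databases adopted columnar storage and different architectures from their transactional counterparts, even when both speak SQL.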
Figure 2: stages of development of data management systems
The position of a data management system between hardware and applications determines that its technological evolution depends on developments in the underlying hardware and on the changing demands of the applications above. On the hardware side, from the 1970s onward, chips, memory, and general-purpose servers advanced along Moore's Law; single machines grew ever more powerful, driving the growing processing power of databases, and greater use of memory capacity became a major trend. Entering the 2000s, the growth of single-machine processing power could no longer keep up with the growth of data and services, and the bottleneck of standalone systems was exposed, pushing data management systems toward distributed architectures. On the application side, Internet-based online business made traffic and access frequency grow exponentially, so centralized single-node architectures hit processing bottlenecks; and the mobile Internet, with its hundreds of millions of users, posed the challenge of massive data analysis. Distributed architectures emerged to meet these challenges.

2. Development of Big Data Technology

Big data technology originated on the Internet. Companies running websites and search engines were the first to feel the technical challenges brought by explosively growing data, and the subsequent rise of social networks, video sites, and the mobile Internet wave intensified the challenge. Internet companies found that the growth, diversity, and freshness requirements of new data were beyond what traditional databases and scale-up business intelligence architectures could cope with.
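The horizontal scale-out that distributed systems rely on is, at its simplest, partitioning data across machines by key. A minimal sketch (hypothetical node names, using hashing for placement as many distributed stores do in some form):

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster members

def node_for(key: str) -> str:
    # Hash the key and map it onto one of the nodes, so that data
    # and load spread across the cluster instead of one machine.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# Each key deterministically lands on one node; adding nodes adds capacity.
placement = {user: node_for(user) for user in ["alice", "bob", "carol"]}
print(placement)
```

Real systems refine this idea (consistent hashing, range partitioning, replication), but the principle of dividing the keyspace across machines is what lets capacity grow with the cluster.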
Against this background, Google first proposed a complete distributed data processing technology stack in papers published from 2004: the distributed file system GFS (Google File System), the distributed computing framework MapReduce, and the distributed database BigTable. Together they solved the dilemma of big data at low cost and laid the foundation for big data technology. Inspired by Google's papers, the Apache Hadoop project implemented its own distributed file system HDFS, MapReduce computing framework, and distributed database HBase, and released them as open source; this was the starting point of the open-source big data ecosystem. Around 2008, Yahoo set up the first large-scale production Hadoop cluster, the first use of Hadoop at an Internet company; Hadoop technology later spread into the Internet, telecommunications, finance, and other industries. In 2009 the AMPLab at UC Berkeley developed Spark, which after about five years of development officially replaced MapReduce as the next-generation computing engine of the Hadoop ecosystem, while Flink, born in 2013, mounted a challenge to Spark. After about 2014, big data technology and its ecosystem entered a period of stable development.
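The MapReduce model at the heart of this stack splits a computation into a map phase that emits key/value pairs, a shuffle that groups values by key, and a reduce phase that aggregates each group. A minimal single-process sketch of the classic word count (illustrating the model only, not the Hadoop API):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in a document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one word.
    return key, sum(values)

docs = ["big data big ideas", "data systems"]
mapped = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'systems': 1}
```

In a real cluster, the map and reduce functions run in parallel on different machines over different splits of the input; the framework handles the shuffle, scheduling, and fault tolerance, which is what made the model scale.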
Figure 3: the development of big data technology
After roughly ten years of development, the field has taken on a character of open source in the lead, with multiple technologies and architectures coexisting. Viewed along the life cycle of data in an information system, big data technology spans five major ecosystem areas: data collection and transport, data storage, resource scheduling, computation, and query and analysis. In collection and transport, a series of open-source technologies such as Sqoop, Flume, and Kafka gradually formed, covering both offline and real-time data movement. In the storage layer, HDFS has become the de facto standard for big data disk storage; for non-relational data models, the open-source community has formed four classes of NoSQL systems — key-value (KV), column-family, document, and graph — with databases such as HBase, Cassandra, MongoDB, Neo4j, and Redis flourishing. In resource scheduling, YARN dominates, with Mesos seeing some adoption. Computing engines gradually came to cover offline batch computing, real-time computing, stream computing, and other scenarios, giving birth to frameworks such as MapReduce, Spark, Flink, and Storm. In query and analysis, a rich set of SQL-on-Hadoop solutions took shape — Hive, HAWQ, Impala, Presto, and Drill — competing fiercely with traditional massively parallel processing (MPP) databases.
Figure 4: the big data technology ecosystem
3. Big Data Technology Trends

After 2014, the overall big data technology stack stabilized, but the cloud, artificial intelligence, and changes at the chip and memory level have continued to drive corresponding changes in big data. In summary, there are several trends.

First, stream-first architectures. The early big data ecosystem had no unified way to compute over batch and stream data; instead it used the Lambda architecture, running batch tasks on a batch engine and streaming tasks on a streaming engine — for example, batch processing with MapReduce and stream computation with Storm. Later, Spark tried to unify stream and batch from the batch side: Spark Streaming processes stream data with a micro-batch approach. In recent years the pure streaming architecture Flink emerged suddenly; thanks to its sound architectural design and healthy ecosystem, it has developed especially fast. Spark has recently moved away from its micro-batch architecture toward a pure streaming model in Structured Streaming, so the future leader in stream computing remains to be seen.

Second, big data on the cloud. On one hand, public cloud services have matured and many big data technologies have moved onto the cloud, greatly changing how they are operated and run, with elastic scaling bringing more flexible use of storage and computing resources. On the other hand, privately deployed big data technologies are gradually adopting container virtualization, in the hope of using computing resources at a finer granularity.

Third, heterogeneous computing. In recent years, beyond the general-purpose CPU, chips such as GPUs, FPGAs, and ASICs have developed rapidly, and different chips are good at different computational tasks — GPUs, for example, excel at processing image data. Big data technologies have begun to dispatch different tasks to different chips to improve data processing efficiency.
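The micro-batch versus pure-streaming distinction above can be sketched with a running sum over a stream of numbers: a micro-batch engine buffers events and emits one result per batch, while an event-at-a-time engine updates state and emits on every event, for lower latency. A minimal illustration of the two models (conceptual only, not the Spark or Flink APIs):

```python
def micro_batch(events, batch_size=3):
    # Micro-batch (Spark Streaming style): buffer events, then
    # process each completed batch as one small batch job.
    results, buf = [], []
    for e in events:
        buf.append(e)
        if len(buf) == batch_size:
            results.append(sum(buf))  # one output per batch
            buf = []
    if buf:
        results.append(sum(buf))  # final partial batch
    return results

def event_at_a_time(events):
    # Pure streaming (Flink style): update state and emit on
    # every single event, minimizing latency.
    total, results = 0, []
    for e in events:
        total += e
        results.append(total)  # one output per event
    return results

events = [1, 2, 3, 4, 5]
print(micro_batch(events))      # [6, 9]
print(event_at_a_time(events))  # [1, 3, 6, 10, 15]
```

The trade-off is visible even in this toy: the micro-batch path amortizes per-record overhead across a batch but delays results until the batch closes, while the event-at-a-time path reacts immediately at the cost of per-event bookkeeping.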
Fourth, support for intelligent applications. With the rise of deep learning, AI applications have become increasingly widespread, and the big data stack is absorbing AI capabilities: one-stop platforms for data analysis and AI let developers, within a single tool, write SQL tasks and call machine learning and deep learning algorithms to train models and complete various data analysis tasks.

4. Summary and Outlook

Data management technology has developed for over 50 years. Big data technology, built on data management technology, is a technology stack oriented toward large-scale data analysis. Its core idea is distributed architecture: it raises processing efficiency through parallel computing and offers high scalability, expanding at any time as business needs grow. After roughly 15 years of development, the big data technology stack has matured; but in recent years the cloud, the broad adoption of artificial intelligence, and changes in chips, memory, and other underlying hardware, along with new workloads such as video, have placed new demands on big data technology. In the future, big data technology will continue to evolve along the directions of heterogeneous computing, unified batch and stream processing, cloud-native deployment, AI integration, and in-memory computing; mature 5G and IoT applications will bring massive volumes of video and Internet-of-Things data, and supporting these data is likewise the direction of big data technology's future development.