A paper summarizing the key technologies behind the major aspects of big data

In recent years, big data has come to the forefront and brought sweeping changes, making people realize that mastering the technologies for extracting meaningful data is more important than merely possessing huge volumes of data.

The key technologies of big data span data storage, processing, application, and more. Following the big data processing flow, they can be divided into stages such as big data collection, big data preprocessing, big data storage and management, and big data analysis and mining.

This article sorts out the key technologies of big data for readers.

Part 1. Big data collection

Data collection is the first step in the big data life cycle. It acquires massive volumes of structured, semi-structured, and unstructured data from sources such as RFID data, sensor data, social network data, and mobile Internet data. Because there may be thousands of concurrent accesses and operations at the same time, collection methods designed specifically for big data are required, and they mainly fall into the following three categories:

A. Database collection

Some companies store data in traditional relational databases such as MySQL and Oracle. Among the commonly mentioned tools, Sqoop serves as an ETL bridge between structured databases and the Hadoop ecosystem. The open-source Kettle and Talend also include big data integration features and can synchronize data with HDFS, HBase, and mainstream NoSQL databases.
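As a minimal sketch of database-based collection (not the Sqoop/Kettle/Talend tooling named above), the following Python snippet pulls rows from a relational source and lands them in a local file for downstream processing; sqlite3 stands in for MySQL/Oracle, and the orders table and its columns are invented for illustration.

```python
import csv
import sqlite3

# Minimal sketch of database-based collection: pull rows from a relational
# source and land them in a local file for downstream processing.
# sqlite3 stands in for MySQL/Oracle; the "orders" table and its columns
# are hypothetical.
conn = sqlite3.connect("source.db")
cur = conn.cursor()
cur.execute("SELECT id, customer, amount FROM orders WHERE amount > ?", (0,))

with open("orders_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "customer", "amount"])  # header row
    for row in cur:                                # stream rows instead of loading all at once
        writer.writerow(row)

conn.close()
```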

B. Network data collection

Network data collection mainly obtains data from websites by means of web crawlers or the open APIs that websites expose. In this way, unstructured and semi-structured data on the web can be extracted from web pages and stored locally as structured, unified data files.
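As a rough illustration (not any particular crawler framework), the following Python sketch fetches a single page with the standard library, extracts its hyperlinks, and stores them as a structured CSV file; the URL is only a placeholder, and a real crawl would add robots.txt handling, rate limiting, and a URL frontier.

```python
import csv
from html.parser import HTMLParser
from urllib.request import urlopen

# Toy crawler: fetch one page, pull out its hyperlinks, and save them as a
# structured CSV file. The URL below is a placeholder.
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

url = "https://example.com/"
html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")

parser = LinkParser()
parser.feed(html)

with open("links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["source_page", "link"])
    for link in parser.links:
        writer.writerow([url, link])
```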

C. Document collection

For file collection, the most talked-about tool is Flume, which handles real-time file collection and processing. The ELK stack (the combination of Elasticsearch, Logstash, and Kibana), although aimed at log processing, can also perform full and incremental real-time file collection through template-based configuration. If the goal is simply to collect and analyze logs, the ELK solution is entirely sufficient.
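Flume and ELK are configured rather than coded, but the idea behind incremental, real-time file collection can be sketched in a few lines of Python: remember the read offset and ship only the lines appended since the last pass. The log path is a placeholder and the "sink" here is just a print statement.

```python
import time

# Toy stand-in for incremental, real-time log collection (the idea behind a
# Flume tailing source or a Logstash file input): track the read offset and
# forward only newly appended lines. "app.log" is a placeholder path.
LOG_PATH = "app.log"
offset = 0

while True:
    with open(LOG_PATH, "r", encoding="utf-8", errors="replace") as f:
        f.seek(offset)
        for line in f:
            # A real pipeline would forward this to a sink such as HDFS,
            # Kafka, or Elasticsearch; here we just print the event.
            print("collected:", line.rstrip())
        offset = f.tell()
    time.sleep(1)  # poll for newly appended lines once per second
```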

Part 2. Big data preprocessing

The world of data is large and complex, and data can be fragmented, false, or outdated. To achieve high-quality analysis and mining results, data quality must be improved during the data preparation stage. Big data preprocessing cleans, fills, smooths, merges, normalizes, and checks the consistency of the collected raw data, turning messy data into a relatively unified, easy-to-handle form and laying the foundation for later data analysis. Data preprocessing mainly includes data cleansing, data integration, data transformation, and data reduction.

A. Data cleansing

Data cleansing mainly covers missing-value processing (attributes of interest are absent), noisy-data processing (errors in the data, or values that deviate from expectations), and inconsistent-data processing. The main cleansing tools are ETL (Extraction/Transformation/Loading) and Potter's Wheel.

Missing values can be handled by filling with a global constant, the attribute mean, or the most probable value, or by simply ignoring the record; noisy data can be smoothed by binning (grouping the raw data and then smoothing the values within each bin), clustering, combined computer and manual inspection, or regression; inconsistent data is corrected manually.
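A small pandas sketch of two of these cleansing steps, using an invented price column: fill missing values with the attribute mean, then smooth noise by binning and replacing each value with its bin mean.

```python
import pandas as pd

# Toy cleansing example: attribute-mean filling for missing values, then
# noise smoothing by binning. The "price" column is made up for illustration.
df = pd.DataFrame({"price": [10.0, 12.0, None, 11.5, 95.0, 13.0, None, 9.5]})

# Missing-value processing: fill with the attribute mean (a global constant
# or the most probable value would be handled the same way).
df["price"] = df["price"].fillna(df["price"].mean())

# Noise smoothing by binning: group values into equal-width bins, then
# replace each value with the mean of its bin.
df["bin"] = pd.cut(df["price"], bins=3)
df["price_smoothed"] = df.groupby("bin")["price"].transform("mean")

print(df)
```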

B. Data integration

Data integration refers to merging data from multiple sources into a consistent data store. The process must address three key issues: schema matching, data redundancy, and the detection and resolution of data value conflicts.

Because different data sets follow different naming conventions, the same entity may appear under different names; entity identification therefore usually relies on metadata to distinguish and match entities from different sources. Data redundancy may arise from inconsistent attribute naming; for numeric attributes, redundancy can be measured with the Pearson product-moment correlation coefficient R(a, b), where a larger absolute value indicates a stronger correlation between the two attributes. The data value conflict problem mainly shows up when the same entity from different sources carries different attribute values.
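A brief sketch of redundancy detection with the Pearson coefficient, using invented temperature readings recorded in two different units: when |R(a, b)| is close to 1, the two attributes carry essentially the same information and one of them is likely redundant.

```python
import numpy as np

# Redundancy check during integration: two numeric attributes that are almost
# perfectly correlated probably describe the same thing. The temperature
# columns below are invented for illustration.
temp_celsius = np.array([20.1, 22.4, 19.8, 25.0, 23.3])
temp_fahrenheit = np.array([68.2, 72.3, 67.6, 77.0, 73.9])

# Pearson product-moment correlation R(a, b); |R| close to 1 means the two
# attributes are redundant with respect to each other.
r = np.corrcoef(temp_celsius, temp_fahrenheit)[0, 1]
print(f"R(a, b) = {r:.4f}")
```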

C. Data transformation

Data transformation is the process of resolving inconsistencies in the extracted data. It generally falls into two categories:

The first category unifies data names and formats, that is, data granularity conversion, business rule calculations, and the standardization of names, data formats, units of measurement, and so on. The second category deals with data that exists in the data warehouse but not in the source database, which requires combining, splitting, or computing fields. Data transformation also includes data cleansing work: abnormal data must be cleaned according to business rules to ensure the accuracy of subsequent analysis results.
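A short pandas sketch of both categories, with invented column names: first unify the date format and unit of measurement, then compute a warehouse-only field from source columns.

```python
import pandas as pd

# Toy transformation example covering both categories described above.
# All column names and values are invented for illustration.
src = pd.DataFrame({
    "order_date": ["2023/01/05", "2023/02/17"],
    "weight_g":   [1500, 250],          # source stores grams
    "unit_price": [4.0, 12.5],
    "qty":        [3, 2],
})

# Category 1: unified naming, date format, and unit of measurement (g -> kg).
dw = pd.DataFrame()
dw["order_date"] = pd.to_datetime(src["order_date"], format="%Y/%m/%d")
dw["weight_kg"] = src["weight_g"] / 1000.0

# Category 2: a computed field that does not exist in the source database.
dw["order_amount"] = src["unit_price"] * src["qty"]

print(dw)
```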

D. Data reduction

Data reduction refers to minimizing the volume of data while preserving the original character of the data as much as possible; its techniques include data cube aggregation, dimensionality reduction, data compression, numerosity reduction, and concept hierarchy generation. Data reduction techniques derive a reduced representation of a data set that is much smaller in size yet remains close to the integrity of the original data. In other words, mining the reduced data set can still produce nearly the same analysis results as mining the original data set.
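As one illustration of these techniques (dimensionality reduction via PCA, here with scikit-learn on synthetic data), the following sketch projects a four-attribute data set onto two principal components and reports how much of the original variance is retained.

```python
import numpy as np
from sklearn.decomposition import PCA

# Dimensionality reduction sketch: project a 4-attribute synthetic data set
# onto 2 principal components so later mining runs on less data while most
# of the original variance is preserved.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # 100 records, 4 attributes (synthetic)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (100, 2)
print("variance retained:", pca.explained_variance_ratio_.sum())
```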
