Energy Big Data: A Survey - Universitetet i oslo · Data size in electric power systems have...

SPECIAL SECTION ON THEORETICAL FOUNDATIONS FOR BIGDATA APPLICATIONS: CHALLENGES AND OPPORTUNITIES

Received March 15, 2016, accepted March 28, 2016, date of current version August 15, 2016.

Digital Object Identifier 10.1109/ACCESS.2016.2580581

Energy Big Data: A SurveyHUI JIANG1, KUN WANG1, (Member, IEEE), YIHUI WANG1, MIN GAO2,AND YAN ZHANG3, (Senior Member, IEEE)1Key Laboratory of Broadband Wireless Communication and Sensor Network Technology, Ministry of Education, Nanjing University of Posts andTelecommunications, Nanjing 210003, China2Department of Electrical Engineering, University of California at Los Angeles, Los Angeles, CA 90095, USA3Simula Research Laboratory and University of Oslo, Norway

Corresponding author: Y. Zhang ([email protected])

This work was supported in part by the National Natural Science Foundation of China under Grant 61572262, in part by the NationalScience Foundation of Jiangsu Province under Grant BK20141427, in part by the Nanjing University of Posts andTelecommunications under Grant NY214097, in part by the Open Research Fund within the Key Laboratory ofBroadband Wireless Communication and Sensor Network Technology, Ministry of Education, NanjingUniversity of Posts and Telecommunications under Grant NYKL201507, in part by the Qinlan Projectof Jiangsu Province, and in part by the Research Council of Norway under Project 240079/F20.

ABSTRACT Energy is one of the most important parts in human life. As a significant application of energy,smart grid is a complicated interconnected power grid that involves sensors, deployment strategies, smartmeters, and real-time data processing. It continuously generates data with large volume, high velocity, anddiverse variety. In this paper, we first give a brief introduction on big data, smart grid, and big data applicationin the smart grid scenario. Then, recent studies and developments are summarized in the context of integratedarchitecture and key enabling technologies. Meanwhile, security issues are specifically addressed. Finally,we introduce several typical big data applications and point out future challenges in the energy domain.

INDEX TERMS Big data, energy, smart grids, data processing, data mining, energy internet, survey.

I. INTRODUCTIONSmart grid is a major research and development direction intoday’s energy industry. It refers to the next generation powergrid which integrates conventional power grid and manage-ment of advanced communication networks and pervasivecomputing capabilities to improve control, efficiency, relia-bility and safety of the grid [1]. Smart grid delivers electricitybetween suppliers and consumers so as to form a bidirectionalelectricity and information flow infrastructure. It fulfills thedemands of each stakeholder, functionality coordination inelectric power generation, terminal electricity consuming andpower market. It also improves the efficiency in each part ofthe system operation and reduces the cost and environmentalimpact. At the same time, the system reliability, self-healingability and stability are improved [2].

Integrating telecommunication, automation and electricnetwork control, smart grid requires reliable real-time dataprocessing. To support this requirement and benefit futureanalysis and decision making, huge amount of historic datashould be well fetched and stored in a reasonable time budget.In addition, there are various sources that huge amount of datacan be generated through diverse measurements acquired byIntelligent Electronic Devices (IEDs) in the smart grid:

• Data from power utilization habits of users;• Data from Phasor Measurement Units (PMUs) for situ-ation awareness;

• Data from energy consumption measured by thewidespread smart meters;

• Data from energy market pricing and bidding collectedby Automated Revenue Metering (ARM) system;

• Data from management, control and maintenance ofdevice and equipment in the electric power generation,transmission and distribution in the grid;

• Data from operating utilities, like financial data andlarge data sets which are not directly obtained throughthe network measurement.

All aforementioned sources increase the grid’s data vol-ume. According to the investigation in [3], by 2009 theamount of data in electric utilities’ system has alreadyreached the level of TeraBytes (TBs). With every passing day,more TBs of data are emerging at infrastructure company datacenters. This fact accelerates the pace of big data technologyapplying to the smart grid area.

To obtain a better understanding of the big data appli-cation on smart grid, we give an overview of energy bigdata (big data applied in the energy domain) covering the

38442169-3536 2016 IEEE. Translations and content mining are permitted for academic research only.

Personal use is also permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

VOLUME 4, 2016

H. Jiang, et al.: Energy Big Data: A Survey

recent researches and development in the context of anintegrated architecture, key enabling technologies, security,typical applications and challenges.

The remainder of this survey is organized as follows.Overview of the energy big data is provided in Section II,and related analytics frameworks are presented in Section III.In Section IV, key enabling technologies are listed. Section Villustrates the security issues and Section VI looks into mainapplication areas. Challenges are discussed in SectionVII andSection VIII concludes this paper.

II. BACKGROUNDA. SMART GRIDSmart grid is a form of electric power network. It is a combi-nation of all the power utilities resources, both of renewableand non-renewable. As a significant innovation, smart gridprovides automatic monitoring, protecting and optimizingfor the operation of the interconnected systems. Not onlycovering the traditional central generator in the transmissionnetwork and emerging renewal distributed generator in thedistribution system, it also serves industrial consumers andhome users in aspects of thermostats, electric vehicles andintelligent appliances [4]. Characterized by the bidirectionalconnection of electricity and information sets, the smart gridhas formed an automated, widely distributed delivery powernetwork. Taking advantages from modern communications,it can not only deliver real-time data and information but alsoimplement the instantaneous management balance of demandand supply [5].

Operations on energy storage, demand response and com-munication between terminal users and power companieswill make a real large scale data-communicated grid system.This adds complexity in the conventional grid and promoteoffering sustainable progresses to utilities and customers [6].In order to match its complexity, many technologies areexploited in the smart grid domain. Examples include wire-less networks in telecommunications, sensor networks inmanufacturing, and big data in computer science. Especially,big data is attracting more and more attention in the powergrid domain because it can help extract valuable informationfrom data sources of wide diversity and high volume.

B. BIG DATA TECHNOLOGYBig data technology is an emerging technology which appliesto data sets where data size is so large and common data-related technology tools are hard to capture, manage, andoperate under multi-limits. Common data-related technologyuse a consistent, single, traditional data management frame-work, which is only suitable for data with low diversity andvolume [7], [8]. If coming from diverse data sets, data will notbe correlated in space and time, and it is hard for those datato get correlated to a unified and generalized power systemmodel [9]. Therefore, traditional data-related technology willno longer guarantee reliable data management, operation,etc. In this way, big data technology provides integrated

architectures for complex power grid, effective data analysisto aware the unfavorable situation in advance, and variousdata processing methods for different conditions [10]–[13].

C. BIG DATA IN SMART GRIDData size in electric power systems have increased dramati-cally, which results in gaps and challenges. The 4Vs data (datawith characteristics of volume, variety, velocity, and veracity)in smart grid are difficult to be handled within a tolerableoperating time and hardware resources [14]. Therefore, thebig data technology has been introduced to the power system.The emerging big data technology does not conflict withclassical analysis or pretreatment. In fact, it has already beenapplied successfully as a powerful data-driven tool in numer-ous research areas such as financial systems [15], quantumsystems [16], biological systems [17], and wireless commu-nication networks [18].

Big data sets and their relationship with several sampleapplications are depicted in Fig. 1.

FIGURE 1. Big data and applications.

In the aspect of the field measurement showed in Fig. 1,the deployment of PMUs has improved the measurements oninstantaneous values of voltages and currents for operatorswith accurate timestamps [19]. PMUs can be imported tothe state estimator to provide direct state measurement, sincethey provide measurements in both magnitude and angleaspects [20]. This is especially interrelated when the systemtopology such as operations on automated breaker changerapidly. PMU data can also be applied in protective relayingoperations [21]. The use of PMU data is helpful for moreaccurate and more rapid tripping response in measurement.

In the aspect of distribution, the management of big datais also crucial in order to facilitate applications in the fieldsof demand response, Distributed Energy Resources (DERs),and Electric Vehicles (EVs) [22]. Improved and securebi-direction communication and management of the associ-ated big data are required for turning loads into dispatchableresources, thus promoting DERs to participate in electricitymarkets [23].

VOLUME 4, 2016 3845

H. Jiang et al.: Energy Big Data: A Survey

1) MAIN TECHNIQUESOver the past few years, almost all major companies havelaunched their projects on big data, and brought a set oftools which are brand new to the power industry. Thosecompanies include IBM, Oracle from IT domain, GeneralElectric, Siemens from the power grid domain and start-upslike AutoGrid, Opower, and C3 [24]. In addition, scientistshave proposed a variety of techniques including optimizationand data mining to capture, analyze, process and visualize bigdata that is of large volume and needed to be handled withinlimited time.

In order to tackle communication problems and data pro-cessing problems in the smart grid area, a variety of emergingscalable and distributed architectures and frameworks havebeen proposed. Rogers et al. [25] proposed an authenti-cated control framework for distributed voltage supportedon the smart grid. A layered architecture was presentedfor an optimized algorithm to follow a chain of commandfrom the transmission grid to the distribution grid [26].In addition, many optimization methods have been exploitedto solve quantitative problems in smart grid branches suchas power distribution, sensor topology, etc. Furthermore,Aquino-Lugo et al. [27] provided another analytical frame-work using agent-based technologies to manage dataprocessing.

Data mining extracts valuable information from data sets.It includes a series of clustering, classification and regres-sion. A typical example of data mining under the smartgrid scenario is the grouping process of customers. Thosecustomers are grouped into different classes according to theirelectricity consumption types and other characteristics. Then,the clustering method is applied to generating residential loaddocuments [28] and differentiating pattern changes caused byseasonal and temporal factors [29].

In smart grid, this data mining technology can be classifiedinto two categories: one is Artificial Intelligence (AI) formodeling risk and uncertainty of obtained data [30], and theother is estimation based on recorded data. Utilizing fuzzywavelet neural network [31] and estimating with nonparamet-ric methods [32], data mining methods obtain enforcementrecently. For example, the fuzzy decision tree has been usedto classify disturbances of power quality [33].

2) MAIN TOOLSApache Hadoop is one of the most well-known and powerfulbig data tools written in JAVA [34]. Providing platforms andinfrastructures for specific big data applications in businessand commerce, it plays an important role on big data relatedcompanies, such as MapR, Cloudera and Hortonworks [35].Hadoop usually consists of two major components, one is theHadoop Distributed File System (HDFS) [36] and the otheris the Hadoop execution engine: MapReduce [37]. HDFS isoften applied for data storage while MapReduce is used indata processing [38].

As for applications of data streaming, operation inthe electric power system needs real-time response for

data processing. For example, a technology called Storm isdesigned especially for real-time data stream analysis [39].In order to analyze and process interactively, the data isgenerated from interactive environments. Users are able tohave real-time interaction with computers due to the directconnection to it. Apache Drill is a distributed system usedfor interactive analysis of the big data [40]. The purpose ofDrill is to have low latency in response to ad-hoc queries.Therefore, it has the capability to process PetaBytes (PBs) ofdata and trillions of records within seconds.

III. ENERGY BIG DATA: ARCHITECTURESupported by many kinds of different techniques, big dataanalysis plays an important role in smart grid. One solutionarchitecture provides a three-tier framework for better dealingwith data high in volume, velocity and variety. For variousdata generated from smart grid, an integrated architecture canalso be used for cloud storage and efficient processing.

A. ROLE OF BIG DATA IN THE SMART GRIDThe data obtained from PMUs, smart meters, sensors andother IEDs have opened up a plethora of chances, suchas predictive analytics, real-time vulnerability assessment,theft detection, demand-side-management, economic dis-patch, energy trading, etc [41].

Big data can help improve the smart grid managementto a higher level. For example, by enhancing the acces-sibility of a customer’s electricity consumption data, thedemand response will be expanded and energy efficiencywill be improved [42]. Similarly, analyzing data obtainedfrom PMUs and IEDs will help to improve customer ser-vice, prevent outages, maximize safety and ensure servicereliability [43], [44]. Furthermore, electric utilities are usingprediction data analytics for estimating several parameterswhich are helpful to operate the smart grid in an efficient,economical and reliable way. For instance, whether the excessenergy is available from renewable sources can be predictedthrough accessing the ability of the smart grid to transmit it.At the same time, calculating equipment downtime, accessingpower system failures, and managing unit commitment canalso be done effectively in high correlation with integrat-ing distributed generation. Thus, the grid management statewill be enhanced, the efficiency will be improved and therobustness of generation and deployment will be ensured.With the integration of various distributed generating sources,refined forecasting, load planning, and unit commitment aremore convenient to avoid inefficient energy transmitting ordispatching extra generation.

B. A THREE-TIER ENERGY BIG DATAANALYTICS ARCHITECTUREA widely accepted analytical framework of big data is shownin Fig. 2, which has three layers including data access andcomputing, data privacy, domain knowledge and big datamining algorithm [45].

The inner core data mining platform is mainly responsiblefor data access and computation process. With the increasing

3846 VOLUME 4, 2016


FIGURE 2. Three-tier big data analytics architecture.

growth of the data volume, distributed storage of large scaledata need to be taken into account while computing. That isto say, data analysis and task processing are divided into mul-tiple sub tasks and executed on a large number of computingnodes through a parallel program.

The middle layer of the structure plays an important roleto connect inner layer and outer one. The data mining tech-nology in the inner layer provides a platform for data-relatedwork in middle layer such as the information sharing, privacyprotection, and knowledge acquisition from areas and appli-cations, etc. In the whole process, the information sharing isnot only the guarantee of each phase, but also the goal ofprocessing and analyzing with the big data in smart grid.

In the outer layer of the architecture, preprocessing isnecessary for the heterogeneous, uncertain, incomplete, andmulti-source data through data fusion technology. After pre-processing, complex and dynamic data will be excavated,and then, pervasive smart grid global knowledge can beobtained through local learning and model fusion. Finally,model and its parameters need adjustment according to thefeedback [46].

C. AN INTEGRATED BIG DATA ARCHITECTUREIN THE SMART GRIDBased on big data analytics and cloud technology, an inte-grated architecture can be used in the smart grid in manyways, such as optimizing power transmission, controllingpower consumption, keeping the balance between powerdemand and supply, etc [47]. This architecture takes advan-tages of the three technologies including big data analyt-ics, smart grid and cloud computing, which composes anenhanced version of smart grid. The improved smart grid hasthe following functionalities:

• Analyze the historic data and estimate the energyproduction.

• Analyze the consumer behavior patterns aboutelectricity consumption to estimate the demand inadvance.

• Keep record of the energy production from differentsources and make decision to switch demands betweenthe high/low priority.

• Balance the load of the demand/supply chain.• Do efficiently on the storage/transfer of the generatedpower.

The improved smart grid architecture is illustrated in Fig. 3,which consists of the smart grid, the HDFS and the relatedcloud environment. In order to manage the storage andretrieval of massive data, HDFS is used in this system. TheHDFS gives concentration on distributed storage to nodesin racks [48]. This architecture also contains a databaseincluding consumer behavior pattern, historic data, details inpower supply and demand. Each time the system estimatesthe demand and calculates the supply, it will refer to theconsumer behavior patterns and historic data. These data arestored in a cloud-based Cassandra DataBase (Cassandra DB).This improved smart grid utilizes a prediction algorithm [49]to estimate the demand and supply of electric power. In adistributed environment, the smart grid uses distributed powerresources like solar, wind, nuclear sources, and is appliedin many areas, such as industrial production and socialinfrastructure.

FIGURE 3. An improved smart grid architecture.

VOLUME 4, 2016 3847


IV. ENERGY BIG DATA: KEY TECHNOLOGIESWith the remarkable development of big data analysis inenergy domain recently, some significant technologies haveemerged to be applied in various fields. The key techniques ofbig data in energy system can be divided into four categories:data acquisition and storing, data correlation analysis, crowd-sourced data control and data visualization. Among them thedata acquisition and storing is the most crucial component.Crowd-sourced data control is applied especially to solve thedata problems which are hard to be processed. To obtainbetter understanding of recent research trend, we analyze thedata acquisition and storing technologies in detail and take abrief overview on big data technologies.

A. DATA ACQUISITION AND STORINGData acquisition and storing is the initial problem in big data.The big data technologies in acquisition and storing stagegather data from various information in energy system. Thecollected data are of different sources, different formats anddifferent features that is stored in data repositories.

The acquisition and storing technology of big data belongsto data management which involves data fusion, data integra-tion, data management, and data transforming which is usu-ally called Extract Transform Load (ETL) technology [50].Fig. 4 shows the data acquisition and storing flow usingETL technology.

FIGURE 4. Data acquisition and storing.

Data acquisition involves data access and collection. Sincedata have private information, its confidentiality and secu-rity should be considered during accessing and transmitting.We investigate several homomorphic encryption and decryp-tion methods for data access and storing. Further, secure dataaggregation and routing are also studied in this part.

1) HOMOMORPHIC ENCRYPTION SCHEMEThe transmitted information are always private, so encryp-tion technology is necessary to prevent unpredictableattacks [51], [52]. Paillier [53] used an homomorphicprobabilistic encryption scheme to deal with the compos-ite residuosity class problem. Boneh et al. [54] proposedanother homomorphic encryption scheme which can encrypt

the information and ensure the confidentiality. The encryp-tion protocol has three following components:• Key generation: When given a security coefficient, thisalgorithm generates public keys and global parameters.Set n = q1 · q2, where q1, q2 are primes. Select N ∈ Z∗

n2such that N has an order which is a multiple of n andZn2 = {0, 1, 2, · · · , n, · · · , n

2−1}, whereZ∗ represents

the adjoint matrix of Z. Set λ(n) = lcm(q1 − 1, q2 − 1),where lcm means the least common multiple. Let i rep-resents the i-th intended receiver that a message is sentfor. Then the public key of i-th receiver can be expressedas: PK [i] = (n,N ), the secret key can be expressed as:SK [i] = (λ(n)).

• Encryption method: Select m ∈ Zn which representsa message. Choose a random number r ∈ Z∗n. Thenthe ciphertext c can be given by: c = E(m) = Nmrn

mod n2 where rn is an enhancing factor to make thehomomorphic computation indeterministic.

• Decryption method: Decryption from c to m is accord-ing to:

m = D(c) =L(cλ(n) mod n2)L(Nλ(n) mod n2)

mod n,

where the L-function takes input from the data setU = {u|u = 1 mod nandu < n2} and the output can becalculated in: L(u) = (u− 1)/n.

Supposing c1 = E(m1) and c2 = E(m2) are two ciphertextsfor m1, m2 ∈ Zn respectively, then the sum of the plaintextsS can be obtained from the ciphertexts: S = D(c1 · c2mod n2) = (m1 + m2) mod n.Except for the additive homomorphism model above, the

same information can be encrypted into different cipher-texts by various homomorphism encryption techniques.Rivest et al. [55] used a multiplicative homomorphismencryption method to keep privacy and authority in an elec-tronic mail system. Gentry [56] proposed fully homomorphicscheme which can support complicated functions.

2) SECURE AGGREGATION BY SMART METERSIn power network, data can be aggregated through HomeArea Network (HAN), Building Area Network (BAN), andNeighboringAreaNetwork (NAN) in theway that customers’privacy is protected [57]. For each HAN, there is a gatewaysmart meter called han which collects and sends informationto the BAN. The gateway smart meter ban in BAN aggregatesthese information and sends them to the gateway smart meternan in the neighborhood area. Finally, nan in NAN reports allthe information to the data repository which is monitored bythe Remote Terminal Unit (RTU).

Ruj and Nayak [58] proposed a new decentralized securityframework to keep the whole aggregation process safe. Thei-th RTU Ri in data repository is securely given a public keyPK [i] = (n,N ) and a secret one SK [i] = (λ(n)) as in thekey generation in additive homomorphismmodel. Each smartmeter knows the public key PK [i] of its nearest RTU datarepository Ri in the network. The j-th gateway smart meter

3848 VOLUME 4, 2016


hanj in HAN sends a data packet that consists of two fields:the attribute field a and the power consumption field pj cor-responding with different attributes. The power consumptionfield can be encrypted with the data repository public key.Then the packet can be described as: PPK [hanj] = a||cj =a||E(pj) = a||(Nmjrnj mod n2), where PPK [hanj] is shortfor the packet sent by smart meter gateway hanj and rj ∈ Z∗nis randomly chosen by the smart meter.

The attribute field will be checked in BAN. When thepacket is sent to the BAN, the gateway banl will aggregate allthe packets which have the same set of attributes into a newpacket and process the aggregated power consumption. Thisaggregated result can be given by: cbanl =

∏hanj∈HAN chanj .

The new packet can be described as: PPK [banl] = a||cbanl .Packets aggregated by a set of gateway smart meters{ban1, ban2, · · · , banj} then will be sent to the NAN. TheNAN gateway nank aggregates the packets and operations theinformation at the same way of ban. The aggregated resultcan be figured out by: cnank =

∏banl∈BAN cbanl . The new

packet PPK [nank ] = a||cnank will be sent to the nearest datarepository. The RTU Ri in the data repository will receivepackets and decrypt the encrypted results by using its secretkey SK [i]. Thus the RTUs can aggregate information withoutknowing what exact data was send by the smart meters atthe HANs. The whole aggregation progress is similar to aspanning tree, where the HAN is at the bottom and datarepository is at the top.

3) ACCESS CONTROL SCHEMEAll the sensitive information must be sent separately tothe specific terminals, thus access should be restricted andcontrolled. Sahai and Waters [59] proposed a cryptographicprotocol called Attribute Based Encryption (ABE) which ismainly to distribute attributes to senders and receivers so thatonly the receiver with matching attributes set has the access tothe data. However, this protocol is restricted to only supportthreshold access structures. Therefore, many researches havebeen launched to improve this protocol. Goyal et al. [60]proposed a novel ABE scheme called Key-Policybased (KP-ABE) scheme which can handle any mono-tonic access structure. Bethencourt et al. [61] proposedanother type of protocol which is known as Ciphertext-PolicyABE (CP-ABE). In these schemes, ciphertexts are encryptedunder a given access structure with a set of attributes. Onlya receiver has a matching attributes set can it decrypt theinformation.

Lewko andWaters [62] proposed an access control schemeand gave limited access to authorized data users. It is mainlybased on a ABE cryptographic technique where RTUs haveattributes and cryptographic keys distributed by multiple KeyDistribution Centers (KDCs). When an RTU Ri wants tostore a message m, Ri defines an access structure S firstlywhich helps it to select the authorized users to access themessage m. Ri then creates a k × l-matrix R (k representsthe attributes’ number in the access structure and l is theleaves’ number in the access spanning tree). Ri also defines

a mapping function ψ of which rows is a set of attributes:ψ : {1, 2, · · · , k} → W, where W means a attributes set.Obviously, ψ is a permutation. Then the RTU Ri runs theABE encryption algorithm to calculate the ciphertext C byC = ABE .Encrypt(m,R, ψ). Ciphertext C is thus stored inthe data repository. If a user u has a valid set of attributes, hisrequest to access to the ciphertext C from the data repositorywill be accepted and theC will be decrypted into the messagem by m = ABE .Decrypt(C, {SK [i, u]}), where SK [i, u] rep-resents the secret key of KDC corresponding to the attribute igiven to the user u.

4) DATA STORING AND ROUTINGThe large amount of data generated by the widespreadsensor networks in the smart grid requires scalability,self-organization, routing algorithms and efficient data dis-semination. Generally, the content of the data should bemore important than the identity of the sensor used to storedata [63]. Therefore, comparing with the common routingtechnology, the data-centric storing and routing technologieshave been developed more rapidly in recent decades.

In the data storing and routing technology, the data isdefined and routed referring to their names instead of thestorage node’s address. The data can be queried for users afterdefining certain conditions and adopting certain name-basedrouting algorithms, like the Data-Centric Storage (DCS)and the Greedy Perimeter Stateless Routing (GPSR) algo-rithm [64]. In these algorithms, each data object belongs to anassociated key and each working node stores a group of keys.Any node in the system is allowed to locate the storage nodefor any key and every node can input and output files basedon its keys, via a hash-table-like interface. The followingexample shows the data storage algorithm in informationencryption and decryption.• Encrypt2DS: Each terminal user can encrypt informa-tion I into a ciphertextCDS with the data storage encryp-tion algorithm, which encrypts with params and thedata storage identity DS in the regional cloud nodes.Therefore, this encryption process is denoted as:

CDS ← Encrypt2DS(params,DS, I ).

• Decrypt2DS: Each regional cloud node can decrypt anaccepted ciphertext CDS into the information I with thedata storage decryption algorithm, which decrypts withparams and the private key KDS relevant to the datastorage identity DS. Therefore, this decryption processis denoted as:

I ← Encrypt2DS(params,KDS ,CDS ).

B. DATA CORRELATION ANALYSISThe big data technologies in data analysis involve correlationanalysis, data mining and machine learning. Through dataanalysis techniques, valuable information can be extractedfrom large amount of big data in a power system.

VOLUME 4, 2016 3849


Correlation is a well-known statistical and mathematicalmethod for the compatibility analysis of large data sets.Meier et al. [65] successfully used Pearson Product-Momentcorrelation to determine how well data is linearly correlated.Given two independent input data sets of X , Y with the lengthofN , and X , Y being either the phase data values of two PMUsites or the momentary magnitudes, the Pearson correlationsubmits to a correlation coefficient C (C ∈ [−1, 1]), whichis based on the following equation:

C =

∑Ni=1(XY )−

∑Ni=1 X

∑Ni=1 Y

N√(∑N

i=1(X2)− (∑N

i=1 X )2

N )× (∑N

i=1(Y 2)− (∑N

i=1 Y )2

N ).

Two application-specific improvements and modificationshave been made to this mathematical formula. Firstly,because the algorithm was made incremental, each i datapoint (xi, yi) could be gained from the Phasor Data Con-centrator (PDC) engine and incorporated into its correlationcoefficient at once. Thus, there is no need to directly calculateeach average, summation, and standard deviation at each timestep repeatedly. This helps reduce the operation time. Sec-ondly, the correlation algorithm is set to maintain correlationinformation through varying windows of time. The streamdata structure actually maintains multiple separate points toterminal positions of each defined sliding window. However,separate lists for each sliding window are not created. Rather,points from a single list are controlled to minimize memoryusage and copies of data which are already managed.

Correlation analysis techniques can figure out correlationor causal relationship and have been widely applied in powersystem for making investment decisions and forecasting theelectric power load. As mentioned above, data mining rangesin the most popular techniques of data analysis, which ismainly applied for prediction and presentation in power grid.Machine learning is also widely applied in energy system.Pandy et al. [66] used correlation analysis of big data tosupport machine learning through best fit linear regression oflarge data sets. In addition, correlation analysis also appliesfor monitoring power equipment conditions, assessing tran-sient stability, etc [67].

De Silva et al. [68] proposed an Incremental Summariza-tion and Pattern Characterization (ISPC) framework usingdata mining technology to address problems in energy con-sumption analysis. This framework has two basic functions:incremental learning and interim summarization. Incrementallearning can extend dynamic self-organization of senor nodesin power grid with the dynamic topology preservation tech-nology. Using a structure-adapting feature map, this technol-ogy can ensure natural groupings in the data being exposed.The structure adaption is affected by a total quantization errorEi,t , predefined values x andw, a spread factor which controlsgrowth of the dynamic feature map:

Ei,t =∑Hi

D∑j=1

(xj(t)− wj(t))2,

where Hi means the ith hit node and D refers to the numbersof energy environment location phases.WhenEi,t exceeds thespread factor, the network structure will be adapted to fit thischange, through growing new energy nodes or distributingthe error to neighboring nodes. Interim summarization usesdimensional model representing the power environment inorder to generate online digests of flow data. The digestsfacilitates data analysis in various directions, such as trendanalysis, granularity exploration and ad-hoc querying.

C. CROWD-SOURCED DATA CONTROLAlthough various technologies can be applied for differentdata processing periods, some vital data management andanalysis tasks in the power system cannot be addressedcompletely by data processing technologies. However, thesehard tasks such as entity resolution, sentiment analysis, andimage recognition, can be dealt with human cognitive abil-ity [69]. Generally, utilizing the capabilities of the crowdis an effective method to handle computer-hard tasks in thepower system. Therefore, crowd-sourced data managementhas aroused increasing attention in energy big data research,such as data integration [70], knowledge construction [71],data cleaning [72], etc. In the energy domain, widespreadsensor nodes are the same as workers in crowd-sourced datacontrol. As workers being task operators, sensor nodes arealways assigned to deal with diverse data collected frompower grid.

Problems in crowd-sourced data control can be dividedinto two groups: quality control and cost control. The qualitycontrol is the most important because crowd-sourcing oftengenerated relatively low-quality results or even noise. Thus,we will analyze quality control in detail.

1) QUALITY CONTROLThe first step of the quality control is to characterize workers’quality, includingmodeling workers and defining parameters.The current researches have proposed several methods tomodel a worker’s quality. We summarize some as follows.• Worker Probability [73]–[75]In worker probability, each worker’s quality is taken as a

single parameter p ∈ [0, 1], which indicates the accuracy rateof the worker’s answer:

p = Prob(the worker ′s answer = the task ′s true answer).

For example, if a worker’s answer has a probability of80 percent to be correct, the workers quality would be char-acterized as p = 0.8. Since the accuracy parameter p isdefined in the range of [0, 1], some studies extend it with twoapproaches:i) Extending to a wider range of (−∞,+∞), which

enables a higher probability to the correct answer [76].ii) Extending another parameter as a confidence interval to

describe how confident the calculated p is. For example,if the confidence interval is ranged among [0.3, 0.8], itmay narrow to [0.55, 0.6] after a worker answeringmoretasks [77].

3850 VOLUME 4, 2016


• Confusion Matrix [78]Confusion matrix is often used to represent a worker’s abil-

ity to process single-choice tasks with n potential answers.A confusion matrix is a n× n matrix:

P =

P1,1 P1,2 · · · P1,nP2,1 P2,2 · · · P2,n...

.... . .

...

Pn,1 Pn,2 · · · Pn,n

,where the i-th (i ∈ [1, n]) row represents the distribution ofthe worker’s answers for the i-th task’s answer and the j-th(j ∈ [1, n]) column represents the correct answer probabilityof the j-th worker’s answer. Therefore,

pi,j = Prob(the worker ′s answer is j

| the task ′s true answer is i).

Some techniques are studied in order to compute theparameters in worker models as above, and we just introducetwo typical examples:• Qualification Test [78]The qualification test sets a series of tasks with known

true answers. Only when the workers pass the qualificationtest and achieve good marks can they answer the real answertasks. Referred to the worker’s test answers, the worker’squality can be deduced. For instance, if a worker is modeledby Worker Probability and he answers correctly 8 out of 10test tasks, his quality is p = 0.8.• Test-Injected Method [79]The gold-injected method integrates test tasks and real

tasks. For instance, if 10 percent of tasks are set as test tasks,each time when a worker is assigned with 10 tasks, he willtake 1 test task and 9 real tasks. Based on the accuracy ofworker’s answers on test tasks, his quality can be calculated.However, this method is different from the qualification testbecause worker doesn’t know that some tasks are the tests.

When the workers are in the low quality, worker elimina-tion is essential to improve the whole quality. There are manyapproaches to detect low-quality workers. A simple one isstill using qualification test or test-injected method. It usestest tasks to compute the worker’s quality and workers withlow quality (e.g. p ≤ 0.5) are not allowed to process the realtasks. Therefore, the overall quality can keep ranging in a highlevel.

2) COST CONTROLThe crowd is not free. When there are a lager number of tasksto do, the crowd-sourcing can be quite expensive. Therefore,cost control is challenging for crowd-sourced data control.In recent researches, several effective techniques have beenproposed for cost control. One is pruning, which first appliescomputer algorithms to preprocess all the tasks and removeunnecessary tasks, guiding workers only to answer the essen-tial tasks [80]. Another technique is task selection, which canimprove quality and reduce cost by prioritizing all the tasksand selecting the most beneficial task to do first [81].

D. DATA VISUALIZATIONThe data visualization of big data has three main components:historical information visualization, geographical visualiza-tion, and 3-dimensional visualization, which can intuitivelyand accurately express information of electric power system.The visualization technologies are used for monitoring real-time power system status, which heighten the automationdegree of power system. If combined with complex systemnetwork theories, it also can identify potential relationshipsand patterns in smart grid.

Zhang et al. [82] proposed a 5ws parallel coordinate visu-alization model for big data density analytics. The 5ws iscomposed of: where did the data generate, when did thedata generate, why did the data generate, what was the datacontent, how was the data transferred and who received thedata. Therefore, each data can be denoted into a function:f (p, t, r, x, y, z), where p represents where data came from,t represents the time stamp for each data incidence, r repre-sents the reason data generated, x represents the data contentsuch as ‘‘attack’’, ‘‘like’’ or ‘‘dislike’’, y represents the waydata transferred and z represents the data receiver.Assuming all n data items are in the time period

T {t1, t2, t3, · · · , tn}, a function set F can be set and containall data items within a certain time interval:

F = {f1, f2, f3, · · · , fn}.

For a specific data item, supposing p = α, r = β,x = γ , y = δ, z = ε, the data pattern can be denoted asf (p(α), t, r(β), x(γ ), y(δ), z(ε)). Therefore, the subset f(α,β,γ,δ,ε)can be mapped in the parallel coordinates using 5 polylines,which represent the data in 5ws dimensions at a particulartime. Then, a 5Ws parallel coordinate visualization will begained.

All key technologies and related work have been listedin Table 1.

V. ENERGY BIG DATA: SECURITYAlthough the integration big data with power grid realizeshigh degree of network connectivity, it also brings complexsecurity vulnerabilities into grid [83]. From the perspectiveof big data, we discuss the main security vulnerabilities andrequirements in privacy, integrity, authentication and third-party protection. A summary of all four security aspects andrelated work have been provided in Fig. 5.

A. PRIVACYThe customer power consumption data collected from smartmeters is a certain kind of privacy that showed be protected.From the detailed consumption information, an insight intoa customer’s behavior can be obtained. Smart grid storescustomer privacy consequences and acts like an information-rich channel which is easy to exploit users’ habits and offerintelligence service. As the history shows in [84], a seriesof financial or political incentives boost the development ofinformation mining and behavior learning technology.

VOLUME 4, 2016 3851


TABLE 1. Key technologies in energy big data.

Terminal user’s privacy is a well-known security problemabout big data applied in power system. Many approacheshave been proposed to solve the privacy problems duringdata access. Li et al. [85] presented a distributed incrementaldata aggregation approach by using neighborhood gateways.Kalogridis et al. [86] introduced amethod to hide informationcontained in the energy consumption data by using batterystorage. Rastogi and Nath [87] proposed a differential privateaggregation algorithm for time-series data which offers goodpractical utility without any trusted server.

Although these solutions manage to support the utility andprivacy in various ways, they lack a robust theoretical basisfrom both utility and privacy. Rajagopalan et al. [88] pro-posed a theoretical framework which integrates most existingsolutions of the trade-off and provides a universal model insmart meters. Such a theoretical framework is suitable formany current problems of the privacy-utility tradeoff usinga technology-independent method. Assume that power con-sumption data is distributed normally, a data sequence {xk}is defined of random variables xk ∈ x, −∞ < k < ∞,which generated by stationary continuous valued sourceswith memory got via the autocorrelation function: Cxx(m) =E[xkxk+m],m = 0,±1,±2, · · · . Then, in order to estimate

the privacy loss via the mutual information, authors wouldperturb data between two data sequences.

Since continuous amplitude data transmission is lossythrough finite capacity network links, a tested sequence ofk-load measurements {xk} is usually quantized and com-pressed before the transmission. Therefore, data perturbationis needed to guarantee data privacy in some way. However,such a perturbation also needs encoding and decoding tomaintain a desired level of fidelity.• Encoding: Assuming that a smart meter can collectn(n � 1) measurements in a time interval beforecommunication, and n is large enough to capture thememory of data sources. Then the encoding functionperforms a mapping of resulting source sequence Xn

=

{X1,X2, · · · ,Xn}, where Xk ∈ R, k = 1, 2, · · · , n, anindex Wi ∈Wn will be given by:

FE : Xn→Wn

= {1, 2, · · · ,Wn},

where each index is a quantized sequence.• Decoding: The decoder at the data collector computesan output sequence X̂n

= {X̂2, X̂2, · · · , X̂n}, X̂k ∈ R, forall X̂k , using the decoding function:

FD :Wn→ X̂n.

3852 VOLUME 4, 2016


FIGURE 5. Security branches and related solutions.

The selection of the encoder should promote the input andoutput sequences to achieve a desired utility which is givenby an average distortion constraint:

Dn =1n

n∑k=1

E[(Xk − x̂k )2],

and a constraint on the privacy leakage about the desiredsequence {Yk} from the revealed one X̂k is quantized via theleakage function:

Ln =1nI(Y k ; X̂ k ),

where function E[·] computes the expectation and I(·) rep-resents the mutual information. Dn and Ln are functions ofthe number of measurements n. When stationary sources con-verge to limiting values, the corresponding limiting values forutility can be given by D = limn→∞ Dn and L = limn→∞ Lnfor privacy respectively [89].

B. INTEGRITYPreventing unauthorized persons or systems modifying infor-mation is all about integrity. In the smart grid domain theintegrity focuses on information such as sensor values, prod-uct recipes and control commands. Avoiding modificationthrough message delay, message reply and message injection

are also its relevant work. Security issues may be caused bythe violation of integrity.

Liu et al. [90] pointed out that the risk of integrity-targetingattacks in smart grid is indeed real. Compared to availability-targeting attacks, integrity-targeting attacks are less brute-force but more sophisticated. Either customers’ informationor network operation information can be the objective of theintegrity attacks. That is to say, modifying the original infor-mation deliberately and then corrupting critical data exchangein smart grid are the attempt of such attacks.

Ruj and Pal [91] proposed a targeted attack model in powernetwork where attackers attack nodes by removing them withprobability proportional to the nodes’ degree. Therefore, anode with higher degree will be removed with higher prob-ability. In the power network, once a node with higher degreehas been removed, the integrity of energy data collected bynodes will be affected more. Note øk as the probability that anode i of k degree hasn’t been removed:

øk = 1−deg(i)∑

v∈VA deg(v)= 1−

Ak−αA

2mA,

where VA is the nodes set in power network NA, αA is thepower-law coefficient, mA represents the number of edgesin NA, deg(i) = Ak−αA represents the degree of node i andαA = 0 means the random removal of nodes.

VOLUME 4, 2016 3853


Several studies have been conducted to address integrityattack. Xie et al. [92] firstly investigated the impact ofintegrity attacks on energy market through virtual bidding.They showed the process of an attack constructing a prof-itable attacking strategy without being detected. This resultis helpful to examine the potential loss caused by such attack.Kosut et al. [93] evaluated the proposed data attacks bygenerated market revenues and the work was further studiedby Jia et al. [94] in maximizing the revenues. Yuan et al. [95]found that the data integrity attacks can result in increas-ing system operating cost due to the inordinate generationdispatch and energy routing. In order to control the real-time Locational Marginal Prices (LMP) directly under dataattacks, Tan et al. [96] employed a control theory basedapproaches to analyze the attack effect on pricing stability.Esmalifalak et al. [97] adopted a novel approach to char-acterize the relationship between attackers and defenders.Tan et al. [98] proposed a novel systematic online attackconstruction strategy which doesn’t need any power gridtopology or parameter information. They presented a corre-sponding online defense strategy to identify the maliciousattack in smart grid. Further, Jia et al. [99] proposed an analy-sis framework to quantify the data quality impact on real-timeLMP.All these related researches are based on the assumptionthat attackers fully know about the targeted power systemsincluding the grid topology architecture, branch parametersand other related information.

C. AUTHENTICATIONIn general, authentication means to verify participator’s iden-tity and map this identity to the existing authentication tablein power network. A user gains access to a smart grid commu-nication system with a valid account. Most security solutionsuse authentication as a basis to distinguish legitimate andillegitimate identity [100].

Hamlyn et al. [101] proposed a utility network authen-tication and security management in smart grid operations.A new security architecture was designed for this manage-ment to deal with actions and commands requests in multiplesecurity domains. However, their research only focused onkeeping electric power systems and electric circuits securein host areas. Furthermore, Fouda et al. [102] proposeda light-weight message authentication mechanism which isbased on the Diffie-Hellman key establishment protocol.This mechanism allows distributed smart meters to makemutual authentication with low latency and has similari-ties with secure aggregation technology above. Assume thatHAN gateway i and BAN gateway j both have their pri-vate and secret key pairs which are respectively denotedas PKhani , SKhani , PKbani , SKbani . The Diffie-Hellman keyestablishment protocol will be adopted at the initial hand-shake between HAN and BAN gateways [103]. DefineG = 〈g〉 as a group of large prime order p under the Com-putational Diffie-Hellman (CDH) assumption. Otherwise, ifgiven ga, gb, for unknown a, b ∈ Z∗p, it is difficult to calculatewhether gab ∈ G. The following steps are shown the

procedure of the envisioned lightweight message authentica-tion scheme:• HAN gateway hani chooses a number a ∈ Z∗p) ran-domly and computes ga, then sends ga with an encryptedrequest packets to BAN gateway banj:

hani→ banj : {ga}{encr}PKbanj .

• Gateway banj in BAN decrypts the packets and sends aencrypted response including gb(b ∈ Z∗p):

banj→ hani : {ga ‖ gb}{encr}PKhani .

• After receiving banj’s response packets, hani recovers ga

with its Secret Key (SK). If the result is correct, it meansthat the banj is authenticated by hani. Then, with numbera and gb, hani can calculate the shared session key Ki =H ((gb)a), where H : {0, 1}∗ → Z∗p is a cryptographichash function. Then hani sends gb to banj in plaintext.

• Once banj has received the correct gb, banj will authen-ticate hani and also compute the session key using thesame way: Ki = H ((ga)b).

Because Ki is shared only between hani and banj, banjcan verify the authenticity of sender and the integrity ofmessageMi. Thus, it can provide the NAN gateways with theauthenticated messages.

D. THIRD-PARTY PROTECTIONDamages via the communication systems, except safety haz-ards of the controlled plant itself, can be averted by theprotection of a third-party, such as a cloud service provider.The power system can be used for launching various attackson the smart grid or the third party after being attacked. Thiswill do damage to a smart grid system owner’s reputation, oreven reach to legal liability for damaging the third party.

Since the third party isn’t listed in authentication table inpower network, it may have no access permission. TrustedThird Party (TTP) has been a promising approach for Person-ally Identifiable Information (PII) in power system security.Ranchal et al. [104] used a predicate encryption scheme andmulti-party computing to process identity data on untrustedhosts. Shamir [105] proposed a threshold secret sharingon private information. First a private data set D is dividedinto n shares: {d1, d2, · · · , dn}. Then a threshold t is chosento recover D. This operation may require t or more shares ofarbitrary Di. Ben-Or et al. [106] defined a multi-party com-puting protocol which uses secret input from all the parties.This protocol involves n mutual parties which only calculatepartial function outputs. In this protocol, one player froma party has been selected as the dealer(DLR) and providedpartial outputs to figure out the full computation results.Denote a linear function f of n degree which is known to nparties, an arbitrary threshold t , the i-th party Pi, and xi asthe input of Pi for f . By using a set of input {x1, x2, · · · , xn},n parties will calculate the output through f and send the resultto the DLR. Each Pi generates a polynomial hi of t degreesuch that hi(0) = xi. When Pi sends one share to Pj in the

3854 VOLUME 4, 2016


TABLE 2. Key applications in energy big data.

left parties, Pi will compute the portion of function f usingthis sent share.

VI. ENERGY BIG DATA: APPLICATIONSWe take a brief overview on three application directions ofbig data: renewable energy, demand response and EVs. Eachdirection contains numerous technical fields, as shown inTABLE 2. Although we present these three applications sep-arately, there are no clear boundaries between them. Demandresponse should be considered in both renewable energy andEVs while EVs is also a concrete implementation of renew-able energy.

A. RENEWABLE ENERGYThe new and renewable energy are integrated in the elec-tric power generation, which is different from the tradi-tional power generation mode. This difference causes the

measurement and management of generated data from elec-tric power become increasingly difficult and complex. Sincethe big data technology helps to make better prediction,management and processing complex big data in the energydomain, it has become increasingly popular and widelyapplied in renewable energy companies once emerging.

Many researches have launched on various branchesof renewable energy. Paro and Fadigas [107] proposed amethodology to measure and calculate the overall biomassenergy efficiency. This methodology can be applied in smartprograms about regeneration plants and generator groupsconnected to a grid which will improve the efficiency contin-uously. MacGillivray et al. [108] concentrated their attentionon marine energy. The marine energy have attracted consid-erable interest but its commercial prospects mainly dependon cost reduction and substantial learning. Therefore, the

VOLUME 4, 2016 3855


authors presented an explicit solution on the key uncertaintiesinvolved in marine energy learning rates analyses includingwave and tidal stream. They provided a simple learningmodel and describe a range of learning investments whichmade marine energy technologies become cost-competitive.Wool et al. [109] looked at the tribological design of threegreen marine energy systems: tidal, offshore wind, and wavemachines. They pointed out that most marine renewableenergy conversion systems need tribological components topromote rotational motion of wind or tidal streams to generateelectricity. Therefore, they studied and highlighted currentresearch thrusts to address present issues related to the tri-bology of marine energy conversion technologies. As foranother important renewable energy sources, wind energyhas also attracted wide attention. For instance, in order toimprove the efficiency of electric power generation, theVestas Wind Technology Group of Denmark optimized thewind turbine geographical placement by using big data tech-nologies to analyze the data information including weatherreports, tidal conditions, satellite images and geographicalinformation [110]. Besides, they also updated the under-lying architecture for better data collection and used bigdata processing technologies to meet their high-performancecomputing needs. Furthermore, Kaldellis [111] proposed anoptimum autonomous wind power system sizing for remoteconsumers by using long-term wind speed data. Aiming toimprove the life quality level of remote consumers, they usedthe proposed autonomous wind power systems to addressthe electricity demand requirements, especially in high wind-potential locations.

All researches above in different application areas have acommon purpose: cost minimization. In addition, owing tothe over-exploitation of fossil fuels, energy becomes preciousand energy consumption problem has arisen wide attention.Kung and Wang [112] proposed a recommender system forthe best combination of renewable energy resources withcost-benefit analysis, which include analytical module, clouddata base, and user interface. This study used Markov Chainto investigate the influences of decision-making related torenewable energy and electricity demand in random time.Since the historical electricity data is recorded in continuoustime series, Continuous Markov Chain can be applied toanalyze energy big data in order to help power enterprisesmake optimal investment of renewable energy and evaluateoptimal energy configuration.

B. ENERGY DEMAND RESPONSEThe demand of electric power consumers has been expandingalong with the increasing use of machines and emergingtypes of appliances such as EVs and smart home furniture.Thus, the concern on the stability and reliability of electricitysupply has also been aroused. However, since the traditionalpower grid lacks real-time response between demand andsupply, it cannot meet these demands due to its inflexibledistribution [113]. Therefore, demand response is expectedto be an aspect for the future smart grid.

Demand response is usually based on the classificationanalysis [114]. Users can be classified into several typesaccording to climatic features, geographic conditions andsocial stratum. Targeting ar different class of users, dif-ferent daily load curves of electrical equipments can bedrawn. These curves indicates the electrical characteris-tics, including time intervals, effective factors of electricityconsumption, etc.

Based on the classification analysis, the total demandresponse from a region or a kind of users can be availablethrough polymerization. Meanwhile, the analysis results willwork in designing incentive demand response mechanism.Liu et al. [115] provided a power demand model whichis derived from workload models and cooling demands ofthe data center. Specifically, they calculated the total powerdemand by: d(t) = dIT (t) + c(dIT (t)), where dIT (t) repre-sents the total Information Technology (IT)workload demandat time t which is the energy necessary to serve demand,and function c() represents cooling power demand which isassociated with IT demand dIT . Note that PUE(t) means thePower Usage Effectiveness (PUE) at time t , thus, c(dIT (t))can be calculated by c(dIT (t)) = (PUE(t)− 1) ∗ dIT (t).Goyena and Acciona [116] proposed an energetic balance

algorithm to warranty the electric demand in a large energystorage system. This system analyzes the measured dataconstantly and commits for high accuracy.In this algo-rithm, electric demand situations of overproduction andunderproduction are distinguished. Compared with thenone-renewable production, renewable production has beenconsidered as the priority in order to meet the energy demand.According to the results of this balance algorithm, data-storage level will be increased, decreased or limited.

C. ELECTRIC VEHICLESElectric Vehicles becomes a promising alternative transporta-tion method which is related to the smart grid domain. Withthe increasing number of EVs, many problems in variousareas of EVs such as performance evaluation, driving rangeand battery capacity become great concerns by researchers.

Besides, EV is one of the typical application examples ofbig data. Therefore, most of the issues above can be handledwith big data technologies. For instance, Wu et al. [117]launched the EV grid integration study based on drivingpatterns analysis. Recently, most researches are focused onanalyzing the energy efficiency, addressing routing prob-lems by recording driving data, estimating the driving range.Su and Chow [118] proposed an Estimation of DistributionAlgorithm (EDA) based on big data theories to allocate elec-tric energy intelligently to EVs. The simultaneous connectionof a large number of EVs to the power grid may have ainfluence on quality and stability of the overall power system.This proposed algorithm was devoted to optimally managea large number of EVs charging at a municipal parking lot.Midlam-Mohler et al. [119] focused on the the analysis ofenergy efficiency via recording driving data. They developeda data collection system to support the study which allows

3856 VOLUME 4, 2016


data acquired not only from disparate data sources that areavailable on most modern vehicles but from supplementalinstrumentation as well.

As for the issues on EVs charging and driving range,Rahimi-Eichi et al. [120] proposed another range estima-tion framework based on big data technology which cancollect energy data with various structures from variousresources and incorporate them via the range estimationalgorithm. They also used historical and real-time data forsimulation and thus calculated the remaining driving range.Zhang and Grijalva [121] proposed a data-driven queuingmodel for EV charging demand through performing bigdata analysis on smart meter measurements. In addition,Zhang et al. [122] focused on the charging-transaction-recordEV data for better charging service, by detecting, analyz-ing, and correcting the abnormal data with error types anddistributed locations. Soares et al. [123] presented a novelmethodology based on Monte Carlo Simulation (MCS) toestimate the possible EVs states referring to their locationsand periods of connection in the power grid. Futhermore,Lee and Wu [124] provided a EV-battery big data modelingmethod which has been used to improve the driving rangeestimation of EVs, using data cloud analysis and processingtechnologies. The authors first presented a simple approachto project life cycles of battery packs based on collected testdata. The operating voltage of a battery pack E(I ,SOC,T ) canbe evaluated by:

E(I ,SOC,T ) = OCV(SOC,T ) − I (R(SOC,T ) + ηr(I ,SOC,T ) )r,

where I means the current, SOC is noted as the state ofcharge, T represents the temperature, OCV is short for theopen circuit voltage, R is the resistance of battery cell andη represents the resistance caused by polarization. Then theyused a machine learning approach to cluster the collected EVdata by ever-increasing hierarchical self-organizing maps.

VII. ENERGY BIG DATA: CHALLENGESComing from wide various sources, the volume of energy bigdata is increasing at an exponential speed. At the same time,difficulties also arise up in data storage, mining, querying,processing, etc. Therefore, cryptography technologies, fuzzydata computing, qualified data processing are all essential forbig data applied better in smart grid.

A. DATA UNCERTAINTYData uncertainty comes from data complexity. Because ofthe complicated and distributed smart grid, the sources andforms of energy big data become diverse. There is no need orunrealistic to obtain accurate data samples or give a preciseresponse. Therefore, uncertain data mining and imprecisedata querying are presented for data uncertainty [125].

1) UNCERTAIN DATA MININGPrevious applications in smart grid using traditional datamining techniques require the data samples be accurateor with definite values. However, in the latest decades,there have been some researches conducted to capture the

uncertain factors of the data, which are common in the bigdata applications [15], [126]. These techniques often assumea distribution of the data values based on a probabilisticmethod, and then generate the probabilistic results of datamining. For example, the K-means clustering algorithm wasoptimized to the UK-means clustering algorithm throughspecifying a data object from an uncertain region withan uncertain probabilistic distribution method [15]. In theUK-means clustering algorithm, the precise error andthe cluster number of a data object are less important.Tsang et al. [126] developed a decision tree classifier used foruncertain data and proposed the related improved algorithms.Extension of classical decision tree in building algorithmswas proposed in to handle data tuples without certain values.

2) IMPRECISE DATA QUERYINGIn the traditional data querying techniques used in smart grid,the precise results are required in the DataBase ManagementSystem (DBMS) according to the query conditions. In recentyears, a lot of querying and indexing techniques began to con-sider about the uncertainty of the data. Instead of giving theprecise response, these techniques estimate data uncertaintyand provide probabilistic guarantee. For instance, the uncer-tainty of the data has been modeled as the stochastic valuewithin certain limits, and the data queries have been classifiedinto different classes. Different algorithms are then developedto compute the probabilistic answers. Chen and Cheng [127]studied location data with uncertainty and classified thequery issues into two types, according to the degree of datauncertainty.

B. DATA QUALITYDue to the multiple origins of big data in the compli-cated power system, the smart grid database always containsincomplete, inconsistent and uncorrected data. Therefore, aseries of data preprocessing technologies are necessary toimprove the data quality, especially after uncertain data min-ing and imprecise querying [128].

Data preprocessing contains data manipulation, compres-sion and normalization of the older data into improved for-mat [129]. Data manipulation refers to transformation, whichis used to reduce the impact from data fault sensing and pro-vide more reliable energy data. The compression operationis used for massive data measurements to be reduced into areasonable extent. Meanwhile, the normalization technologyis used to regulate the collected data by minimizing its noise,inconsistencies and incompleteness of data. These three oper-ations are main components of data preprocessing but theorder is not unique.

Furthermore, different regulations on big data cause con-flicts between collaborators and lead to inconvenience insmart grid. Therefore, a unified and complete system standardneeds to be set up in the future.

C. QUANTUM CRYPTOGRAPHY FOR DATA SECURITYSince the large amount of data is stored inmultiple distributedgrids, it is difficult to monitor the security of data storage

VOLUME 4, 2016 3857


in real time. However, most data involve personal privacy,commercial secrets, financial information, etc. It causes theproblem of data larceny, which has become increasinglyserious. Therefore, some kinds of cryptography technologieshave been proposed by researches to solute security problemsof big data in energy domain.

Modern cryptography technique is based on the Rack ScaleArchitecture (RSA), which is especially used for the hardlyfactoring large data [130]. In 1994, Peter Shor discoveredthat quantum computers can handle discrete logarithm oper-ation [131] and this would lead to the loss of the efficiencyof the RSA architecture. Since the quantum computers havebeen put into use recently, a new data cryptograph technologycalled quantum cryptography [132] is considered as the nextgeneration cryptography solution. The quantum cryptogra-phy can absolutely guarantee the security of smart grid databased on the quantum mechanics. In this technology, the datais encoded as qubits instead of transmitting bits and differentdirections will affect the qubit’s polarization. The quantumkey distribution begins when a large amount of qubits are sentfrom the sender to the receiver. If an eavesdropper wants toget the information about any qubit, he will have to measureit. However, having no idea about the qubit’s polarization, theeavesdropper can only process it by guess and then transmitanother randomly polarized qubit to the receiver. All theseoperations are under the surveillance. At the end of quantumkey distribution, the key will be identified safe or not by abrief testing.

VIII. CONCLUSIONIn this paper, we gave an introduction on big data and powergrid, presented the background in the aspects of related work,techniques and tools for energy big data. Then, we discussedthe important role of big data, which brought efficiency andaccuracy to the power system. Furthermore, an integratedarchitecture was introduced for the big data analysis, andwe discussed the key techniques of big data in the energysystem in four categories: data acquisition and storing, datacorrelation analysis, crowd-sourced data control and datavisualization. As another crucial part of the energy big data,the security in the power system was illustrated in aspects ofprivacy, integrity, authentication and the third-party protec-tion. We also discussed the applications of energy big data inbranches of renewable energy, energy demand response, andelectric vehicles. Finally, we listed some challenges need tobe addressed about data uncertainty, quality and security.

REFERENCES[1] C. W. Gellings, M. Samotyj, and B. Howe, ‘‘The future’s smart delivery

system [electric power supply],’’ IEEE Power Energy Mag., vol. 2, no. 5,pp. 40–48, Sep./Oct. 2004.

[2] Y. Yan, Y. Qian, H. Sharif, and D. Tipper, ‘‘A survey on smart gridcommunication infrastructures: Motivations, requirements and chal-lenges,’’ IEEE Commun. Surveys Tuts., vol. 15, no. 1, pp. 5–20,1st Quart., 2013.

[3] K. Wang, Y. Shao, L. Shu, Y. Zhang, and C. Zhu, ‘‘Mobile big data fault-tolerant processing for eHealth networks,’’ IEEE Netw., vol. 30, no. 1,pp. 1–7, Jan. 2016.

[4] B. Gu, X. Sun, and V. S. Sheng, ‘‘Structural minimax probabilitymachine,’’ IEEE Trans. Neural Netw. Learn. Syst., to be published,Apr. 2016, doi: 10.1109/TNNLS.2016.2544779.

[5] F. Rahimi and A. Ipakchi, ‘‘Demand response as a market resourceunder the smart grid paradigm,’’ IEEE Trans. Smart Grid, vol. 1, no. 1,pp. 82–88, Jun. 2010.

[6] S. M. Amin and B. F. Wollenberg, ‘‘Toward a smart grid: Power deliveryfor the 21st century,’’ IEEE Power Energy Mag., vol. 3, no. 5, pp. 34–41,Sep. 2005.

[7] T. Niimura, M. Dhaliwal, and K. Ozawa, ‘‘Fuzzy regression models torepresent electricity market data in deregulated power industry,’’ in Proc.IFSA World Congr. NAFIPS Int. Conf., vol. 5. Jul. 2001, pp. 2556–2561.

[8] Z. Fu, X. Sun, Q. Liu, L. Zhou, and J. Shu, ‘‘Achieving efficient cloudsearch services: Multi-keyword ranked search over encrypted cloud datasupporting parallel computing,’’ IEICE Trans. Commun., vol. E98-B,no. 1, pp. 190–200, 2015.

[9] X. He, Q. Ai, R. C. Qiu, W. Huang, L. Piao, and H. Liu, ‘‘A big dataarchitecture design for smart grids based on randommatrix theory,’’ IEEETrans. Smart Grid, Jul. 2015, doi: 10.1109/TSG.2015.2445828.

[10] M. Kezunovic and B.M. Cuka. (2013).Hierarchical Coordinated Protec-tionWith High Penetration of Smart Grid Renewable Resources. [Online].Available: http://eppe.tamu.edu

[11] Y. Zhang et al., ‘‘Wide-area frequency monitoring network (FNET)architecture and applications,’’ IEEE Trans. Smart Grid, vol. 1, no. 2,pp. 159–167, Sep. 2010.

[12] K. Wang, L. Zhuo, Y. Shao, D. Yue, and K. F. Tsang, ‘‘Towards dis-tributed data processing on intelligent leakpoints prediction in petro-chemical industries,’’ IEEE Trans. Ind. Informat., Apr. 2016, doi:10.1109/TII.2016.2537788.

[13] B. Gu, V. S. Sheng, K. Y. Tay, W. Romano, and S. Li, ‘‘Incrementalsupport vector learning for ordinal regression,’’ IEEE Trans. Neural Netw.Learn. Syst., vol. 26, no. 7, pp. 1403–1416, Jul. 2015.

[14] M. Kezunovic, L. Xie, and S. Grijalva, ‘‘The role of big data in improvingpower system operation and protection,’’ in Proc. Bulk Power Syst. Dyn.Control-IXOptim., Secur. Control Emerg. PowerGrid (IREP), Aug. 2013,pp. 1–9.

[15] H. Chen, R. H. L. Chiang, and V. C. Storey, ‘‘Business intelligence andanalytics: From big data to big impact,’’ MIS Quart., vol. 36, no. 4,pp. 1165–1188, Dec. 2012.

[16] T. A. Brody, J. Flores, J. B. French, P. A. Mello, A. Pandey, andS. S. M. Wong, ‘‘Random-matrix physics: Spectrum and strength fluc-tuations,’’ Rev. Mod. Phys., vol. 53, no. 3, pp. 385–479, Jul. 1981.

[17] D. Howe et al., ‘‘Big data: The future of biocuration,’’ Nature, vol. 455,no. 7209, pp. 47–50, Sep. 2008.

[18] C. Zhang and R. C. Qiu. (Apr. 2014). ‘‘Data modeling withlarge random matrices in a cognitive radio network testbed: Ini-tial experimental demonstrations with 70 nodes.’’ [Online]. Available:http://arxiv.org/abs/1404.3788

[19] A. Bose, ‘‘Smart transmission grid applications and their supportinginfrastructure,’’ IEEE Trans. Smart Grid, vol. 1, no. 1, pp. 11–19,Jun. 2010.

[20] A. Abur and F. Galvan, ‘‘Synchro-phasor assisted state estimation(SPASE),’’ in Proc. IEEE PES Innov. Smart Grid Technol., Jan. 2012,pp. 1–2.

[21] J. O’Brien et al., ‘‘Use of synchrophasor measurements in protec-tive relaying applications,’’ in Proc. Conf. Protective Relay Eng.,Mar./Apr. 2014, pp. 23–29.

[22] B. Gu and V. S. Sheng, ‘‘A robust regularization path algorithm forϑ-support vector classification,’’ IEEE Trans. Neural Netw. Learn. Syst.,to be published, doi: 10.1109/TNNLS.2016.2527796.

[23] H. Shen and Y. Zhu, ‘‘The collaboration between demand responseand smart grid,’’ in Proc. Int. Conf. Sustain. Power Generat. Supply,Sep. 2012, pp. 1–4.

[24] M. Arenas-Martinez et al., ‘‘A comparative study of data storage andprocessing architectures for the smart grid,’’ in Proc. IEEE Int. Conf.Smart Grid Commun. (SmartGridComm), Oct. 2010, pp. 285–290.

[25] K. M. Rogers, R. Klump, H. Khurana, A. A. Aquino-Lugo, andT. J. Overbye, ‘‘An authenticated control framework for distributed volt-age support on the smart grid,’’ IEEE Trans. Smart Grid, vol. 1, no. 1,pp. 40–47, Jun. 2010.

[26] A. Swarnkar, N. Gupta, and K. R. Niazi, ‘‘Adapted ant colony optimiza-tion for efficient reconfiguration of balanced and unbalanced distributionsystems for loss minimization,’’ Swarm Evol. Comput., vol. 1, no. 3,pp. 129–137, Sep. 2011.

3858 VOLUME 4, 2016


[27] A. A. Aquino-Lugo, R. Klump, and T. J. Overbye, ‘‘A control frameworkfor the smart grid for voltage support using agent-based technologies,’’IEEE Trans. Smart Grid, vol. 2, no. 1, pp. 173–200, Mar. 2011.

[28] W. Labeeuw and G. Deconinck, ‘‘Residential electrical load model basedon mixture model clustering and Markov models,’’ IEEE Trans. Ind.Informat., vol. 9, no. 3, pp. 1561–1569, Aug. 2013.

[29] A. M. S. Ferreira, C. A. M. T. Cavalcante, C. H. O. Fontes, andJ. E. S. Marambio, ‘‘A newmethod for pattern recognition in load profilesto support decision-making in the management of the electric sector,’’ Int.J. Elect. Power Energy Syst., vol. 53, no. 4, pp. 824–831, Dec. 2013.

[30] C. E. Borges, Y. K. Penya, and I. Fernandez, ‘‘Evaluating combined loadforecasting in large power systems and smart grids,’’ IEEE Trans. Ind.Informat., vol. 9, no. 3, pp. 1570–1577, Aug. 2013.

[31] M. Amina, V. S. Kodogiannis, I. Petrounias, and D. Tomtsis, ‘‘A hybridintelligent approach for the prediction of electricity consumption,’’ Int. J.Elect. Power Energy Syst., vol. 43, no. 1, pp. 99–108, Dec. 2012.

[32] N. Ding, Y. Besanger, F. Wurtz, and G. Antoine, ‘‘Individual nonpara-metric load estimation model for power distribution network planning,’’IEEE Trans. Ind. Informat., vol. 9, no. 3, pp. 1578–1587, Aug. 2013.

[33] M. Biswal and P. K. Dash, ‘‘Measurement and classification of simultane-ous power signal patterns with an S-transform variant and fuzzy decisiontree,’’ IEEE Trans. Ind. Informat., vol. 9, no. 4, pp. 1819–1827, Nov. 2013.

[34] A. K. Karun and K. Chitharanjan, ‘‘Locality sensitive hashing basedincremental clustering for creating affinity groups in Hadoop—HDFS—An infrastructure extension,’’ in Proc. IEEE Int. Conf. Circuits, PowerComput. Technol., Mar. 2013, pp. 1243–1249.

[35] C. Kaewkasi and W. Srisuruk, ‘‘A study of big data processing con-straints on a low-power Hadoop cluster,’’ in Proc. Int. Comput. Sci. Eng.Conf. (ICSEC), Jul./Aug. 2014, pp. 267–272.

[36] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, ‘‘The Hadoop dis-tributed file system,’’ in Proc. IEEE 26th Symp. Mass Storage Syst.Technol. (MSST), May 2014, pp. 1–10.

[37] J. Dean and S. Ghemawat, ‘‘MapReduce: Simplified dataprocessing on large clusters,’’ Commun. ACM, vol. 51, no. 1,pp. 107–113, 2008.

[38] K. V. Rama Satish and N. P. Kavya, ‘‘Big data processing with harnessingHadoop—MapReduce for optimizing analytical workloads,’’ in Proc. Int.Conf. Contemp. Comput. Informat. (IC3I), Nov. 2014, pp. 49–54.

[39] M. M. Maske and P. Prasad, ‘‘A real time processing and streaming ofwireless network data using storm,’’ in Proc. IEEE Int. Conf. Comput.Power, Energy Inf. Commun., Apr. 2015, pp. 244–249.

[40] P. Chandarana and M. Vijayalakshmi, ‘‘Big data analytics frameworks,’’in Proc. IEEE Int. Conf. Circuits, Syst., Commun. Inf. Technol. Appl.,Apr. 2014, pp. 430–434.

[41] Z. Asad and M. A. R. Chaudhry, ‘‘A two-way street: Green big dataprocessing for a greener smart grid,’’ IEEE Syst. J., Feb. 2016, doi:10.1109/JSYST.2015.2498639.

[42] S. Maharjan, Q. Zhu, Y. Zhang, S. Gjessing, and T. Basar, ‘‘Demandresponse management in the smart grid in a large population regime,’’IEEE Trans. Smart Grid, vol. 7, no. 1, pp. 189–199, Jan. 2016.

[43] M. K. H. Pulok and M. Omar Faruque, ‘‘Utilization of PMU data toevaluate the effectiveness of voltage stability boundary and indices,’’ inProc. IEEE North Amer. Power Symp., Oct. 2015, pp. 1–6.

[44] S. J. Wangeman, ‘‘IED dataset generation: Analysis across theaters,’’ inProc. IEEE Syst. Inf. Eng. Design Symp., Apr. 2013, pp. 92–97.

[45] H. Hu, Y. Wen, T.-S. Chua, and X. Li, ‘‘Toward scalable systems for bigdata analytics: A technology tutorial,’’ IEEE Access, vol. 2, pp. 652–687,2014.

[46] B. Gu, V. S. Sheng, Z. Wang, D. Ho, S. Osman, and S. Li, ‘‘Incre-mental learning for ϑ-support vector regression,’’ Neural Netw., vol. 67,pp. 140–150, Jul. 2015.

[47] Y. Simmhan et al., ‘‘An informatics approach to demand response opti-mization in smart grids,’’ Natural Gas, vol. 31, p. 60, Mar. 2011.

[48] A. Pal and S. Agrawal, ‘‘An experimental approach towards big datafor analyzing memory utilization on a Hadoop cluster using HDFS andMapReduce,’’ in Proc. IEEE Int. Conf. Netw. Soft Comput. (ICNSC),Aug. 2014, pp. 442–447.

[49] B. Matic-Cuka and M. Kezunovic, ‘‘Improving smart grid operation withnew hierarchically coordinated protection approach,’’ in Proc. Medit.Conf. Power Generat., Transmiss., Distrib. Energy Convers., Oct. 2012,pp. 1–6.

[50] S. K. Bansal, ‘‘Towards a semantic extract-transform-load (ETL) frame-work for big data integration,’’ in Proc. IEEE Int. Congr. Big Data,Jun./Jul. 2014, pp. 522–529.

[51] Z. Xia, X. Wang, X. Sun, and Q. Wang, ‘‘A secure and dynamic multi-keyword ranked search scheme over encrypted cloud data,’’ IEEE Trans.Parallel Distrib. Syst., vol. 27, no. 2, pp. 340–352, Feb. 2016.

[52] Z. Fu, K. Ren, J. Shu, X. Sun, and F. Huang, ‘‘Enabling per-sonalized search over encrypted outsourced data with efficiencyimprovement,’’ IEEE Trans. Parallel Distrib. Syst., Dec. 2015, doi:10.1109/TPDS.2015.2506573.

[53] P. Paillier, ‘‘Public-key cryptosystems based on compositedegree residuosity classes,’’ in Proc. Int. Conf. Theory Appl.Cryptograph. (EUROCRYPT), 1999, pp. 223–238.

[54] D. Boneh, E.-J. Goh, and K. Nissim, ‘‘Evaluating 2-DNF formulas onciphertexts,’’ in Proc. 2nd Theory Cryptogr., 2005, pp. 325–341.

[55] R. L. Rivest, A. Shamir, and L. Adleman, ‘‘A method for obtaining digitalsignatures and public-key cryptosystems,’’Commun. ACM, vol. 21, no. 2,pp. 120–126, Feb. 1978.

[56] C. Gentry, ‘‘Fully homomorphic encryption using ideal lattices,’’ in Proc.ACM STOC, vol. 9. 2009, pp. 169–178.

[57] Y. Yan, Y. Qian, and H. Sharif, ‘‘A secure data aggregation and dispatchscheme for home area networks in smart grid,’’ in Proc. IEEE GlobalTelecommun. Conf. (GLOBECOM), Dec. 2011, pp. 1–6.

[58] S. Ruj and A. Nayak, ‘‘A decentralized security framework for dataaggregation and access control in smart grids,’’ IEEE Trans. Smart Grid,vol. 4, no. 1, pp. 196–205, Mar. 2013.

[59] A. Sahai and B.Waters, ‘‘Fuzzy identity-based encryption,’’ in Proc. 24thAnnu. Int. Conf. Adv. Cryptol. (EUROCRYPT), 2005, pp. 457–473.

[60] V. Goyal, O. Pandey, A. Sahai, and B. Waters, ‘‘Attribute-based encryp-tion for fine-grained access control of encrypted data,’’ inProc. 13th ACMConf. Comput. Commun. Secur., 2006, pp. 89–98.

[61] J. Bethencourt, A. Sahai, and B. Waters, ‘‘Ciphertext-policy attribute-based encryption,’’ in Proc. IEEE Symp. Secur. Privacy (SP), May 2007,pp. 321–334.

[62] A. Lewko and B. Waters, ‘‘Decentralizing attribute-based encryption,’’in Proc. 30th Annu. Int. Conf. Adv. Cryptology (EUROCRYPT), 2011,pp. 568–588.

[63] S. Ratnasamy et al., ‘‘Data-centric storage in sensornets with GHT, ageographic hash table,’’ Mobile Netw. Appl., vol. 8, no. 4, pp. 427–442,Aug. 2003.

[64] B. Karp and H. T. Kung, ‘‘GPSR: Greedy perimeter stateless routingfor wireless networks,’’ in Proc. Int. Conf. Mobile Comput. Netw., 2000,pp. 243–254.

[65] R. Meier et al., ‘‘Power system data management and analysis usingsynchrophasor data,’’ in Proc. IEEE Conf. Technol. Sustain. (SusTech),Jul. 2014, pp. 225–231.

[66] R. Pandey, M. Dhoundiyal, and A. Kumar, ‘‘Correlation analysis of bigdata to support machine learning,’’ in Proc. 5th Int. Conf. Commun. Syst.Netw. Technol. (CSNT), Apr. 2015, pp. 996–999.

[67] M. Pavella, D. Ernst, and D. Ruiz-Vega, Transient Stability of PowerSystems: A Unified Approach to Assessment and Control. New York, NY,USA: Springer, 2013.

[68] D. De Silva, X. Yu, D. Alahakoon, and G. Holmes, ‘‘A data miningframework for electricity consumption analysis from meter data,’’ IEEETrans. Ind. Informat., vol. 7, no. 3, pp. 399–407, Aug. 2011.

[69] T. Yan, V. Kumar, and D. Ganesan, ‘‘CrowdSearch: Exploiting crowdsfor accurate real-time image search on mobile phones,’’ in Proc. 8th Int.Conf. Mobile Syst., Appl., Services, 2010, pp. 77–90.

[70] J.Whitehill, P. Ruvolo, T.-F.Wu, J. Bergsma, and J. R.Movellan, ‘‘Whosevote should count more: Optimal integration of labels from labelersof unknown expertise,’’ in Proc. Adv. Neural Inf. Process. Syst., 2009,pp. 2035–2043.

[71] Y. Amsterdamer, Y. Grossman, T. Milo, and P. Senellart, ‘‘Crowd min-ing,’’ in Proc. ACM SIGMOD Int. Conf. Manage. Data, Jun. 2013,pp. 241–252.

[72] J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, andT. Milo, ‘‘A sample-and-clean framework for fast and accurate queryprocessing on dirty data,’’ in Proc. ACM SIGMOD Int. Conf. Manage.Data, Jun. 2014, pp. 469–480.

[73] S. Guo, A. Parameswaran, and H. Garcia-Molina, ‘‘So who won?:Dynamic max discovery with the crowd,’’ in Proc. ACM SIGMOD Int.Conf. Manage. Data, 2012, pp. 385–396.

[74] L. Kazemi, C. Shahabi, and L. Chen, ‘‘GeoTruCrowd: Trustworthy queryanswering with spatial crowdsourcing,’’ in Proc. 21st ACM SIGSPATIALInt. Conf. Adv. Geograph. Inf. Syst., 2013, pp. 314–323.

VOLUME 4, 2016 3859


[75] C. C. Cao, J. She, Y. Tong, and L. Chen, ‘‘Whom to ask?: Jury selec-tion for decision making tasks on micro-blog services,’’ in Proc. VLDBEndowment, Jul. 2012, vol. 5. no. 11, pp. 1495–1506.

[76] D. R. Karger, S. Oh, and D. Shah, ‘‘Iterative learning for reliablecrowdsourcing systems,’’ in Proc. Adv. Neural Inf. Process. Syst., 2011,pp. 1953–1961.

[77] M. Joglekar, H. Garcia-Molina, and A. Parameswaran, ‘‘Evaluating thecrowd with confidence,’’ in Proc. 19th ACM SIGKDD Int. Conf. Knowl.Discovery Data Mining, 2013, pp. 686–694.

[78] V. C. Raykar et al., ‘‘Supervised learning from multiple experts: Whomto trust when everyone lies a bit,’’ in Proc. 26th Annu. Int. Conf. Mach.Learn., 2009, pp. 889–896.

[79] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang, ‘‘CDAS:A crowdsourcing data analytics system,’’ in Proc. VLDB Endowment,Jun. 2012, vol. 5. no. 10, pp. 1040–1051.

[80] S. Wang, X. Xiao, and C.-H. Lee, ‘‘Crowd-based deduplication: An adap-tive approach,’’ in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2015,pp. 1263–1277.

[81] Y. Zheng, J. Wang, G. Li, R. Cheng, and J. Feng, ‘‘QASCA: A quality-aware task assignment system for crowdsourcing applications,’’ in Proc.ACM SIGMOD Int. Conf. Manage. Data, 2015, pp. 1031–1046.

[82] J. Zhang, M. L. Huang, W. B. Wang, L. F. Lu, and Z.-P. Meng, ‘‘Big datadensity analytics using parallel coordinate visualization,’’ in Proc. IEEEInt. Conf. Comput. Sci. Eng. (CSE), Dec. 2014, pp. 1115–1120.

[83] D. Dzung, M. Naedele, T. P. V. Hoff, and M. Crevatin, ‘‘Securityfor industrial communication systems,’’ Proc. IEEE, vol. 93, no. 6,pp. 1152–1177, Jun. 2005.

[84] P. McDaniel and S. McLaughlin, ‘‘Security and privacy challengesin the smart grid,’’ IEEE Security Privacy, vol. 7, no. 3, pp. 75–77,May/Jun. 2009.

[85] F. Li, B. Luo, and P. Liu, ‘‘Secure information aggregation for smart gridsusing homomorphic encryption,’’ in Proc. 1st IEEE Int. Conf. Smart GridCommun. (SmartGridComm), Oct. 2010, pp. 327–332.

[86] G. Kalogridis, C. Efthymiou, S. Z. Denic, T. A. Lewis, andR. Cepeda, ‘‘Privacy for smart meters: Towards undetectable appli-ance load signatures,’’ in Proc. 1st IEEE Int. Conf. Smart GridCommun. (SmartGridComm), Oct. 2010, pp. 232–237.

[87] V. Rastogi and S. Nath, ‘‘Differentially private aggregation of distributedtime-series with transformation and encryption,’’ in Proc. ACM SIGMODInt. Conf. Manage. Data, 2010, pp. 735–746.

[88] S. R. Rajagopalan, L. Sankar, S. Mohajer, and H. V. Poor, ‘‘Smart meterprivacy: A utility-privacy framework,’’ in Proc. IEEE Int. Conf. SmartGrid Commun. (SmartGridComm), Oct. 2011, pp. 190–195.

[89] T. M. Cover and J. A. Thomas, Elements of Information Theory.New York, NY, USA: Wiley, 2012.

[90] Y. Liu, P. Ning, and M. K. Reiter, ‘‘False data injection attacks againststate estimation in electric power grids,’’ ACM Trans. Inf. Syst. Secur.,vol. 14, no. 1, p. 13, Sep. 2009.

[91] S. Ruj and A. Pal, ‘‘Analyzing cascading failures in smart grids underrandom and targeted attacks,’’ inProc. IEEE 28th Int. Conf. Adv. Inf. Netw.Appl. (AINA), May 2014, pp. 226–233.

[92] L. Xie, Y. Mo, and B. Sinopoli, ‘‘Integrity data attacks in power mar-ket operations,’’ IEEE Trans. Smart Grid, vol. 2, no. 4, pp. 659–666,Dec. 2011.

[93] O. Kosut, L. Jia, R. J. Thomas, and L. Tong, ‘‘Malicious data attackson the smart grid,’’ IEEE Trans. Smart Grid, vol. 2, no. 4, pp. 645–658,Dec. 2011.

[94] L. Jia, R. Thomas, and L. Tong, ‘‘Malicious data attack on real-timeelectricity market,’’ in Proc. IEEE Int. Conf. Acoust., Speech SignalProcess. (ICASSP), May 2011, pp. 5952–5955.

[95] Y. Yuan, Z. Li, and K. Ren, ‘‘Quantitative analysis of load redistributionattacks in power systems,’’ IEEE Trans. Parallel Distrib. Syst., vol. 23,no. 9, pp. 1731–1738, Sep. 2012.

[96] R. Tan, V. B. Krishna, D. K. Y. Yau, and Z. Kalbarczyk, ‘‘Impact ofintegrity attacks on real-time pricing in smart grids,’’ in Proc. ACMSIGSAC Conf. Comput. Commun. Secur., 2013, pp. 439–450.

[97] M. Esmalifalak, G. Shi, Z. Han, and L. Song, ‘‘Bad data injection attackand defense in electricity market using game theory study,’’ IEEE Trans.Smart Grid, vol. 4, no. 1, pp. 160–169, Mar. 2013.

[98] S. Tan, W.-Z. Song, M. Stewart, J. Yang, and L. Tong, ‘‘Online dataintegrity attacks against real-time electrical market in smart grid,’’ IEEETrans. Smart Grid, Apr. 2016, doi: 10.1109/TSG.2016.2550801.

[99] L. Jia, J. Kim, R. J. Thomas, and L. Tong, ‘‘Impact of data quality on real-time locational marginal price,’’ IEEE Trans. Power Syst., vol. 29, no. 2,pp. 627–636, Mar. 2014.

[100] H. Liu, H. Ning, Y. Zhang, and L. T. Yang, ‘‘Aggregated-proofs basedprivacy-preserving authentication for V2G networks in the smart grid,’’IEEE Trans. Smart Grid, vol. 3, no. 4, pp. 1722–1733, Dec. 2012.

[101] A. Hamlyn, H. Cheung, T. Mander, L. Wang, C. Yang, andR. Cheung, ‘‘Network security management and authentication of actionsfor smart grids operations,’’ in Proc. IEEE Elect. Power Conf., Oct. 2007,pp. 31–36.

[102] M. M. Fouda, Z. M. Fadlullah, N. Kato, R. Lu, and X. Shen, ‘‘Towardsa light-weight message authentication mechanism tailored for smartgrid communications,’’ in Proc. IEEE Conf. Comput. Commun. Work-shops (INFOCOM WKSHPS), Apr. 2011, pp. 1018–1023.

[103] D. R. Stinson, Cryptography: Theory and Practice. Boca Raton, FL,USA: CRC Press, 2005.

[104] R. Ranchal et al., ‘‘Protection of identity information in cloud computingwithout trusted third party,’’ in Proc. 29th IEEE Symp. Reliable Distrib.Syst., Oct./Nov. 2010, pp. 368–372.

[105] A. Shamir, ‘‘How to share a secret,’’ Commun. ACM, vol. 22, no. 11,pp. 612–613, Nov. 1979.

[106] M. Ben-Or, S. Goldwasser, and A. Wigderson, ‘‘Completeness theoremsfor non-cryptographic fault-tolerant distributed computation,’’ in Proc.12th Annu. ACM Symp. Theory Comput., 1988, pp. 1–10.

[107] A. C. Paro and E. A. F. A. Fadigas, ‘‘A methodology for biomass cogen-eration plants overall energy efficiency calculation and measurement—A basis for generators real time efficiency data disclosure,’’ in Proc.IEEE/PES Power Syst. Conf. Expo. (PSCE), Mar. 2011, pp. 1–7.

[108] A. MacGillivray, H. Jeffrey, M. Winskel, and I. Bryden, ‘‘Innovationand cost reduction for marine renewable energy: A learning invest-ment sensitivity analysis,’’ Technol. Forecasting Social Change, vol. 87,pp. 108–124, Sep. 2014.

[109] R. J. K. Wood, A. S. Bahaj, S. R. Turnock, L. Wang, and M. Evans,‘‘Tribological design constraints of marine renewable energy systems,’’Philos. Trans. Roy. Soc. London A, Math. Phys. Sci., vol. 368, no. 1929,pp. 4807–4827, Oct. 2010.

[110] R. Billinton and Y. Gao, ‘‘Multistate wind energy conversion systemmodels for adequacy assessment of generating systems incorporatingwind energy,’’ IEEE Trans. Energy Convers., vol. 23, no. 1, pp. 163–170,Mar. 2008.

[111] J. K. Kaldellis, ‘‘Optimum autonomous wind–power system sizing forremote consumers, using long-term wind speed data,’’ Appl. Energy,vol. 71, no. 3, pp. 215–233, Mar. 2002.

[112] L. Kung and H.-F.Wang, ‘‘A recommender system for the optimal combi-nation of energy resources with cost-benefit analysis,’’ in Proc. Int. Conf.Ind. Eng. Oper. Manage. (IEOM), Mar. 2015, pp. 1–10.

[113] S. Maharjan, Q. Zhu, Y. Zhang, S. Gjessing, and T. Basar, ‘‘Depend-able demand response management in the smart grid: A Stackelberggame approach,’’ IEEE Trans. Smart Grid, vol. 4, no. 1, pp. 120–132,Mar. 2013.

[114] J. Kwac and R. Rajagopal, ‘‘Demand response targeting using bigdata analytics,’’ in Proc. IEEE Int. Conf. Big Data, Oct. 2013,pp. 683–690.

[115] Z. Liu, A. Wierman, Y. Chen, B. Razon, and N. Chen, ‘‘Data centerdemand response: Avoiding the coincident peak via workload shifting andlocal generation,’’ Perform. Eval., vol. 70, no. 10, pp. 770–791, Oct. 2013.

[116] S. G. Goyena and S. A. Acciona, ‘‘Sizing and analysis of big scale andisolated electric systems based on renewable sources with energy stor-age,’’ in Proc. IEEE PES/IAS Conf. Sustain. Alternative Energy (SAE),Sep. 2009, pp. 1–7.

[117] Q. Wu et al., ‘‘Driving pattern analysis for electric vehicle (EV) grid inte-gration study,’’ in Proc. IEEE Innov. Smart Grid Technol. Conf. (ISGT),Oct. 2010, pp. 1–6.

[118] W. Su andM.-Y. Chow, ‘‘Performance evaluation of an EDA-based large-scale plug-in hybrid electric vehicle charging algorithm,’’ IEEE Trans.Smart Grid, vol. 3, no. 1, pp. 308–315, Mar. 2012.

[119] S. Midlam-Mohler, S. Ewing, V. Marano, Y. Guezennec, and G. Rizzoni,‘‘PHEV fleet data collection and analysis,’’ in Proc. IEEE Vehicle PowerPropuls. Conf. (VPPC), Sep. 2009, pp. 1205–1210.

[120] H. Rahimi-Eichi, P. B. Jeon, M.-Y. Chow, and T.-J. Yeo, ‘‘Incorporat-ing big data analysis in speed profile classification for range estima-tion,’’ in Proc. IEEE 13th Int. Conf. Ind. Informat. (INDIN), Jul. 2015,pp. 1290–1295.

3860 VOLUME 4, 2016


[121] X. Zhang and S. Grijalva, ‘‘An advanced data driven model for residentialelectric vehicle charging demand,’’ in Proc. IEEE Power Energy Soc.General Meeting, Jul. 2015, pp. 1–5.

[122] L. Zhang, Y. Chen, J. Zhu, M. Pan, Z. Sun, and W. Wang, ‘‘Data qual-ity analysis and improved strategy research on operations managementsystem for electric vehicles,’’ in Proc. 5th Int. Conf. Electr. Utility Dereg-ulation Restruct. Power Technol. (DRPT), Nov. 2015, pp. 2715–2720.

[123] J. Soares, N. Borges, B. Canizes, and Z. Vale, ‘‘Probabilistic estimationof the state of electric vehicles for smart grid applications in big datacontext,’’ in Proc. IEEE Power Energy Soc. General Meeting, Jul. 2015,pp. 1–5.

[124] C.-H. Lee and C.-H. Wu, ‘‘A novel big data modeling method forimproving driving range estimation of EVs,’’ IEEE Access, vol. 3,pp. 1980–1993, 2015.

[125] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, ‘‘Evaluating probabilisticqueries over imprecise data,’’ in Proc. ACM SIGMOD Int. Conf. Manage.Data, Jun. 2003, pp. 551–562.

[126] S. Tsang, B. Kao, K. Y. Yip, W.-S. Ho, and S. D. Lee, ‘‘Decision trees foruncertain data,’’ IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 64–78,Jan. 2011.

[127] J. Chen and R. Cheng, ‘‘Efficient evaluation of imprecise location-dependent queries,’’ in Proc. IEEE Int. Conf. Data Eng., Apr. 2007,pp. 586–595.

[128] W. K. Ngai, B. Kao, C. K. Chui, R. Cheng, M. Chau, and K. Y. Yip,‘‘Efficient clustering of uncertain data,’’ in Proc. Int. Conf. Data Mining,Dec. 2006, pp. 436–455.

[129] R. Gutierrez-Osuna and H. T. Nagle, ‘‘A method for evaluating data-preprocessing techniques for odour classification with an array of gassensors,’’ IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 5,pp. 626–632, Oct. 1999.

[130] J.-L. Sheu, M.-D. Shieh, C.-H. Wu, and M.-H. Sheu, ‘‘A pipelined archi-tecture of fast modular multiplication for RSA cryptography,’’ in Proc.IEEE Int. Symp. Circuits Syst., vol. 2. May/Jun. 1998, pp. 121–124.

[131] P. W. Shor, ‘‘Algorithms for quantum computation: Discrete logarithmsand factoring,’’ in Proc. Annu. Symp. Found. Comput. Sci., Nov. 1994,pp. 124–134.

[132] C. H. Bennett, F. Bessette, G. Brassard, L. Salvail, and J. Smolin,‘‘Experimental quantum cryptography,’’ J. Cryptol., vol. 5, no. 1,pp. 3–28, Jan. 1992.

HUI JIANG is currently pursuing the degree ininformation network with the Nanjing Universityof Posts and Telecommunications, China. Her cur-rent research interests include wireless sensor net-work, computer network, Internet of Things, socialnetworks, security, smart grid communications,and cyber-physical systems.

KUN WANG (M’-13) received the B.Eng. andPh.D. degrees from the School of Computer, Nan-jing University of Posts and Telecommunications,Nanjing, China, in 2004 and 2009, respectively.From 2013 to 2015, he was a Post-Doctoral Fel-low with the Electrical Engineering Department,University of California at Los Angeles, CA,USA. In 2016, he was a Research Fellow withthe School of Computer Science and Engineer-ing, University of Aizu, Fukushima, Japan. He is

currently an Associate Professor with the School of Internet of Things,Nanjing University of Posts and Telecommunications. He has authored over50 papers in referred international conferences and journals, including theIEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, ACM Transactions on Mul-timedia Computing, Communications and Applications, ACM Transactionson Embedded Computing Systems, the IEEEWireless CommunicationsMag-azine, the IEEE Communications Magazine, the IEEE Network, the IEEEACCESS, and the IEEE SENSORS journal. His current research interests aremainly in the area of big data, wireless communications and networking,smart grid, energy Internet, and information security technologies. He is amember of ACM.

YIHUI WANG is currently pursuing the degree inInternet of Things with the Nanjing University ofPosts and Telecommunications, China. Her currentresearch interests include wireless sensor network,embedded systems, and cyber-physical systems.

MIN GAO received the B.E. degree in electricalengineering and automation from the Nanjing Uni-versity of Technology in 2011, and M.S. degreein electrical engineering from the University ofCalifornia, Los Angeles, in 2012, where he iscurrently pursuing the Ph.D. degree in electricalengineering. His research interests include smartgrid, IoT, and software formal verification.

YAN ZHANG (M’05–SM’10) received thePh.D. degree from the School of Electrical andElectronics Engineering, Nanyang TechnologicalUniversity, Singapore. He is currently the Headof the Department of Networks with the Sim-ula Research Laboratory, Norway, and an Asso-ciate Professor (part-time) with the Departmentof Informatics, University of Oslo, Norway. Hiscurrent research interests include next-generationwireless networks and reliable and secure cyber-

physical systems (e.g., smart grid, transport, and healthcare). He is a SeniorMember of the IEEE ComSoc, the IEEE VT Society, the IEEE PES, andthe IEEE Computer society. He serves as the TPC Member for numerousinternational conferences, including the IEEE INFOCOM, the IEEE ICC,the IEEE GLOBECOM, and the IEEE WCNC. He has received eight bestpaper awards. He is a fellow of IET. He serves as the Chair in a numberof conferences, including the IEEE PIMRC 2016, the IEEE Cloudcom2016/2015, the IEEE CCNC 2016, the IEEE ICCC 2016, WICON 2016,and the IEEE SmartGridComm 2015. He is an Associate Editor or on theEditorial Board of a number of well-established scientific internationaljournals, e.g., Wiley Wireless Communications and Mobile Computing.He also serves as the Guest Editor of the IEEE TRANSACTIONS ON SMART

GRID, the IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, theIEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, the IEEE CommunicationsMagazine, the IEEEWireless Communications, the IEEE Network, the IEEESystems, the IEEE ACCESS, and the IEEE INTERNET OF THINGS.

VOLUME 4, 2016 3861

Energy Big Data: A Survey - Universitetet i oslo · Data size in electric power systems have...

Documents

Transcript of Energy Big Data: A Survey - Universitetet i oslo · Data size in electric power systems have...