      Global Energy Interconnection

      Volume 6, Issue 1, Feb 2023, Pages 1-14

Deep reinforcement learning based multi-level dynamic reconfiguration for urban distribution network: a cloud-edge collaboration architecture

Siyuan Jiang1, Hongjun Gao1, Xiaohui Wang2, Junyong Liu1, Kunyu Zuo3

      Abstract

With the construction of the power Internet of Things (IoT), communication between smart devices in urban distribution networks has been gradually moving towards high speed, high compatibility, and low latency, which provides reliable support for reconfiguration optimization in urban distribution networks. Thus, this study proposed a deep reinforcement learning based multi-level dynamic reconfiguration method for urban distribution networks in a cloud-edge collaboration architecture to obtain a real-time optimal multi-level dynamic reconfiguration solution. First, the multi-level dynamic reconfiguration method was discussed, which included the feeder, transformer, and substation levels. Subsequently, the multi-agent system was combined with the cloud-edge collaboration architecture to build a deep reinforcement learning model for multi-level dynamic reconfiguration in an urban distribution network. The cloud-edge collaboration architecture can effectively support the multi-agent system in conducting the "centralized training and decentralized execution" operation mode and improve the learning efficiency of the model. Thereafter, for the multi-agent system, this study adopted a combination of offline and online learning to endow the model with the ability to automatically optimize and update its strategy. In the offline learning phase, a multi-agent conservative Q-learning (MACQL) algorithm based on Q-learning was proposed to stabilize the learning results and reduce the risk of the subsequent online learning phase. In the online learning phase, a multi-agent deep deterministic policy gradient (MADDPG) algorithm based on policy gradients was proposed to explore the action space and update the experience pool. Finally, the effectiveness of the proposed method was verified through a simulation analysis of a real-world 445-node system.

      0 Introduction

In recent years, the rapid development of distribution network (DN) automation has made DN reconfiguration possible. By changing the topology of the DN to transfer load, this technology can help balance the load distribution, reduce transmission losses, and improve operational economy [1-2]. Currently, the scale of urban distribution networks (UDN) is increasing, and multi-level dynamic reconfiguration methods are being developed. These methods divide the reconfiguration range of the DN into multiple levels to satisfy the flexible and efficient optimization requirements of the UDN. However, the dynamic real-time environment of a large-scale UDN is highly complex and contains several uncertain variables; therefore, the traditional reconfiguration method based on day-ahead optimization can no longer meet the demand.

With the rapid development of 5G technology and the construction of the power Internet of Things (IoT), a cloud-edge collaboration architecture has been proposed. Under this architecture, the communication between smart devices in a UDN exhibits the characteristics of high speed, high compatibility, and low latency, which provide reliable support for real-time reconfiguration in a UDN [3-4]. In addition, the reinforcement learning (RL) method does not rely on a mathematical model and can directly realize the nonlinear mapping between input data and output data at a speed of seconds; its advantages become more obvious as the complexity of the problem increases. Consequently, RL can provide a new idea for the high-speed solution of multi-level dynamic reconfiguration in a complex UDN. Therefore, multi-level dynamic reconfiguration technology based on RL under a cloud-edge collaboration architecture must be studied.

Although several studies have focused on the field of DN reconfiguration, most belong to the field of traditional day-ahead reconfiguration. The methods employed can be mainly divided into two types: mathematical optimization [5-7] and heuristic [8-10] methods. Certain studies [11-12] have focused on traditional global reconfiguration, which is only applicable to small-scale DNs and cannot satisfy actual requirements for an increasingly large UDN. To address this problem, a multi-level mode was proposed in [13] for the first time. This mode fundamentally improves upon the defects of the traditional reconfiguration method. However, the solution adopted in that study is a heuristic method, which is slow in problem solving and may fall into a local optimum. Moreover, the method proposed in [13] is a traditional day-ahead optimization and cannot satisfy real-time requirements. Consequently, certain scholars have focused on the field of deep reinforcement learning (DRL). DRL can solve sequential decision optimization problems without the process of labelling data, which renders it suitable for the optimal decision-making problem of dynamic reconfiguration in a DN. Further, a trained DRL model can quickly realize the nonlinear optimal mapping from the environment state to the action, implying that the state-action pair between the DN environment and the switch state combination can realize the optimal mapping quickly. Previous studies [14-16] have adopted the single-agent DRL method and achieved good results on the small-scale DN reconfiguration problem. However, the huge state and action spaces make the training of agents time-consuming, and achieving convergence for a large-scale UDN is challenging. Therefore, a multi-agent deep reinforcement learning (MADRL) method is required to divide the large action space and reduce the computing pressure on each agent. However, if agents send all action strategies to a multitude of unordered remotely controllable switches simultaneously, problems such as high latency and packet loss would occur owing to channel congestion, resulting in information transmission errors or failures.

The emergence of cloud-edge collaborative architectures can provide feasible solutions to these problems. Relying on the powerful information transmission speed and bandwidth of 5G technology, this communication architecture can aid in the real-time perception and transmission of UDN information and can fully support the training and deployment of MADRL models. Studies [17-18] have shown that optimization under a cloud-edge collaboration architecture can be made closer to real time; however, owing to the limitation of solving speed, achieving true real-time optimization remains challenging. When the cloud-edge collaboration architecture is combined with a multi-agent system, real-time optimization can be achieved. Specifically, the structural characteristics of the cloud-edge collaboration architecture coincide with the 'centralized training and decentralized execution' (CTDE) paradigm of MADRL, where global information is used for centralized training in the cloud computing center, and trained agents are deployed in edge servers for decentralized execution. A study [19] showed that a MADRL method with a cloud-edge collaboration architecture can improve model performance; however, this method has not been used in the field of DN reconfiguration. In addition, that study does not combine the CTDE paradigm with the cloud-edge collaboration architecture well. Rather, it simply deploys agents in the cloud computing center for training without exploiting the edge servers to the best extent, which results in a waste of computing resources. A study [20] considered the structural similarity between the CTDE paradigm and cloud-edge collaboration architectures. However, when the cloud computing center deploys the value networks of the agents, it does not integrate all of the value networks into a whole, which makes it difficult for the cloud center to have sufficient awareness of the environment.

In addition, considering that the essence of RL is that agents train their own value and policy networks through continuous trial and error in a dynamic environment, edge servers cannot directly execute trial actions in the real-time DN environment in practice; otherwise, the stable operation of the DN would be affected. Thus, an offline RL method was introduced to address the instability of the MADRL method caused by continuous incorrect trials in the early stage of online agent training. The essential difference between offline and online RL is that the former relies on existing data to directly train value networks and policy networks without interacting with the environment and obtains better strategies through the RL algorithm. After training in this manner, agents can significantly reduce the instability of MADRL in the early stage of online training when interacting with the real-time dynamic DN environment. The effectiveness of offline RL has been demonstrated in both [21] and [22]; however, this method has not been used in the field of DN reconfiguration.

Based on the above analysis, this study proposed a deep reinforcement learning based multi-level dynamic reconfiguration method for urban distribution networks in a cloud-edge collaboration architecture to obtain a real-time optimal multi-level dynamic reconfiguration solution. First, the multi-level dynamic reconfiguration mode and its mathematical model were proposed. This study combined a multi-agent system with a cloud-edge collaboration architecture to build a DRL model for multi-level dynamic reconfiguration in an urban distribution network. The cloud-edge collaboration architecture can effectively support the multi-agent system in conducting CTDE and improve the learning efficiency of the model. For the multi-agent system, a combination of offline and online learning was adopted to endow the model with the ability to automatically optimize and update its strategy. In the offline learning phase, a multi-agent conservative Q-learning (MACQL) algorithm based on Q-learning was proposed to stabilize the learning results and reduce the risk of the subsequent online learning phase. In the online learning phase, a multi-agent deep deterministic policy gradient (MADDPG) algorithm based on policy gradients was proposed to explore the action space and update the experience pool. Finally, the effectiveness of the proposed method was verified through a simulation analysis of a real 445-node system.

In particular, the contributions of this study mainly include the following two aspects. First, MADRL was effectively combined with the cloud-edge collaboration architecture to solve the problem of multi-level dynamic reconfiguration in a UDN and to avoid wasting computing and communication resources. Second, owing to the difficulty of conducting trial-and-error exploration in a real-time dynamic DN environment, a method combining offline RL with online RL was proposed. This method can significantly reduce the instability in the early training of online learning and realize automatic optimization and updating.

The remainder of this paper is organized as follows. In Section 1, the multi-level dynamic reconfiguration of the UDN is analyzed, and a system framework combining the cloud-edge collaboration architecture and MADRL is proposed. Section 2 introduces the online MADDPG and offline MACQL methods. Section 3 analyzes the simulation results based on an actual 445-node distribution system.

      1 The cloud-edge collaboration architecture for multi-level dynamic reconfiguration

      1.1 Multi-level dynamic reconfiguration mode

The UDN has a natural layered structure that can be divided into three levels: feeder, transformer, and substation. Combinations of the feeder tie switch (FS), transformer tie switch (TS), substation tie switch (SS), and branch sectionalizing switch (BS) can realize power flow transfer at the feeder, transformer, and substation levels, respectively, that is, net load transfer within the UDN.

The multi-level dynamic reconfiguration of the UDN comprises three levels: feeder-, transformer-, and substation-level reconfigurations, and the switches involved differ at each level. As shown in Table 1, feeder-level reconfiguration only changes the state combination of FSs and BSs, and the power flow is transferred between different feeders under the same transformer. Transformer-level reconfiguration changes the state combination of TSs, FSs, and BSs simultaneously, and the power flow can be transferred between different transformers under the same substation. Finally, substation-level reconfiguration changes the state combination of all levels of switches, and the power flow can be transferred between feeders, transformers, and substations. As the reconfiguration level increases from low to high, the range of adjustable power flow also increases. To avoid large-scale power flow transfer and the non-ideal marginal benefit of global reconfiguration, a reconfiguration strategy of 'priority local autonomy and lagging global coordination' should be used.

      Table 1 Multi-level switching modes

Switching mode       Tie switch            Transfer means
                     FS    TS    SS        Feeder    Transformer    Substation
Feeder-level         √     ×     ×         √         ×              ×
Transformer-level    √     √     ×         √         √              ×
Substation-level     √     √     √         √         √              √

As shown in Fig. 1, this study assumed that the photovoltaic (PV) output on feeder S1T11 in the area indicated by the dotted line was curtailed. According to the reconfiguration strategy of 'priority local autonomy, lagging global coordination', feeder-level reconfiguration can first be performed by changing the state of FS1 to transfer the load from S1T11 to S1T12 to consume the PV output. When feeder-level reconfiguration cannot solve the curtailment problem, transformer-level reconfiguration is adopted to transfer the load from transformer S1T1 to transformer S1T2. If the problem is still not solved, the reconfiguration scope can be further expanded by changing the state of SS1 to transfer the load from substation S1 to substation S2 to jointly consume the PV output.

      Fig.1 Multi-level dynamic reconfiguration


      1.2 Multi-level dynamic reconfiguration mathematical model

In this section, a mathematical model for multi-level dynamic reconfiguration in a UDN is presented. The objective function is presented in Section 2.1, where the reward function is discussed; the constraints of the model are discussed below.

      a)Reconfiguration level decision constraints

Coupling constraints exist among the feeder-, transformer-, and substation-level reconfiguration modes.

where the three binary indicators denote the state identifiers of substation j for the feeder-, transformer-, and substation-level reconfiguration modes, respectively, BSub denotes the set of substation nodes, the total number of transformers contained in substation j is also defined, and fγ(j) denotes transformer f subordinate to substation j. Equation (1) describes the case in which substation j can only choose to perform transformer-level or substation-level reconfiguration simultaneously. Equation (2) describes the case in which the interrelated substations a and b remain consistent in their decisions on whether to perform substation-level reconfiguration. Finally, Eq. (3) describes the case in which, when transformer-level or substation-level reconfiguration is performed at substation j, feeder-level reconfiguration cannot be performed simultaneously within that substation.
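As a rough illustration only, assuming binary indicators x_j^T and x_j^S for whether substation j performs transformer-level or substation-level reconfiguration and x_{j,f}^F for whether its transformer f ∈ fγ(j) performs feeder-level reconfiguration (all notation assumed for illustration, not taken from the paper), one plausible reading of the coupling constraints described above is:

x_j^T + x_j^S \le 1, \quad \forall j \in B_{Sub}

x_a^S = x_b^S, \quad \forall \text{ interrelated } a, b \in B_{Sub}

x_{j,f}^F \le 1 - x_j^T - x_j^S, \quad \forall f \in f_{\gamma}(j), \; j \in B_{Sub}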

      b)Power flow constraints

where Pi,t and Qi,t denote the active and reactive power injections at node i at moment t, respectively, and Vi,t denotes the voltage at node i at moment t. Gij and Bij denote the conductance and susceptance between adjacent nodes i and j, respectively, and θij,t is the voltage phase-angle difference between them at moment t.
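Given the symbols defined above, these constraints presumably correspond to the standard polar-form AC power flow equations, which can be written as:

P_{i,t} = V_{i,t} \sum_{j} V_{j,t} \left( G_{ij} \cos\theta_{ij,t} + B_{ij} \sin\theta_{ij,t} \right)

Q_{i,t} = V_{i,t} \sum_{j} V_{j,t} \left( G_{ij} \sin\theta_{ij,t} - B_{ij} \cos\theta_{ij,t} \right)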

      c)Nodal power balance constraints

where Iij,t denotes the current on branch ij at moment t. Further, wij,t denotes the switch state of branch ij at moment t; a value of 1 indicates that the switch on branch ij is closed.

      d)Network reconfiguration constraints

where Ealways denotes the total number of branches in the grid that are always closed and non-adjustable, α(j) denotes the set of branch terminal nodes with j as the initial node, β(j) denotes the set of branch initial nodes with j as the terminal node, and N denotes the number of substations in the optimization subject. A DN containing DG may experience islanded operation under the constraint of Eq. (9) alone; thus, Eqs. (10-12) must be supplemented, injecting a fictitious power ε at the non-substation nodes to keep the nodes connected to each other. The auxiliary active power flow on branch ij at time t is also defined.
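A common single-commodity-flow way to express such radiality and connectivity requirements, using an assumed auxiliary flow variable F_{ij,t}, a large constant M, and N_bus for the total number of nodes (notation chosen here for illustration, not recovered from the paper), is:

\sum_{ij} w_{ij,t} = N_{bus} - N

\sum_{k \in \alpha(j)} F_{jk,t} - \sum_{i \in \beta(j)} F_{ij,t} = \varepsilon, \quad \forall j \notin B_{Sub}

-M\, w_{ij,t} \le F_{ij,t} \le M\, w_{ij,t}

where the first equation fixes the number of closed branches required for a radial topology, the second forces every non-substation node to pass its fictitious injection ε towards a substation, and the third allows the auxiliary flow only on closed branches.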

      e)Photovoltaic power output constraints

where the two quantities denote the predicted and true values of the PV power output at node i at moment t, respectively.
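Given the description above, a typical form for this constraint, written with assumed symbols \tilde{P}^{PV}_{i,t} for the forecast and P^{PV}_{i,t} for the actual usable output, is:

0 \le P^{PV}_{i,t} \le \tilde{P}^{PV}_{i,t}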

      1.3 Combination of multi-agent system and cloud-edge collaboration architecture

As it is almost impossible to build accurate mathematical models of real power systems, the data-driven, model-free MADRL method is a promising alternative. A well-trained MADRL model can obtain optimal solutions without a long solving process; therefore, it can satisfy the need for real-time dynamic reconfiguration. The multi-agent system can significantly relieve the training pressure on the agents by dividing the original discrete action space into several sub-action spaces, which can aid in solving the reconfiguration optimization problem for a large-scale UDN. In addition, the huge continuous state space in the value networks can be handled by the cloud computing center with low computational cost and high computational power; therefore, a cloud-edge collaboration architecture was introduced to cooperate with the multi-agent system to solve the existing problem.

In this context, this study proposed a MADRL system under a cloud-edge collaboration architecture, and the system structure is shown in Fig. 2. The proposed system is a top-down structure with a cloud computing center at the top, several edge servers in the middle, and several smart perceptive machines at the bottom. The cloud computing center was deployed in the grid dispatching center, edge servers were deployed in several 110 kV substations in the UDN, and smart perceptive machines were deployed in several terminals in the UDN. The structure is explained in detail as follows.

Fig. 2 Multi-level dynamic reconfiguration system architecture of the urban distribution network based on multi-agent deep reinforcement learning under the cloud-edge collaborative framework

      a)Cloud computing center

The cloud computing center was deployed in the grid dispatching center and served two purposes. First, the cloud interacted with the edges for real-time information and received all real-time information from the edge sides [23-25], that is, the global state and action information, which were the input data for learning and training of the value networks; it then transmitted the computed returns to each edge side in real time. Second, with its fast computing speed and low computing cost, the cloud can easily process real-time information and reduce information latency and packet loss rates, which yields better performance for centralized training tasks.

      b)Edge server

Each edge server was deployed at a 110 kV substation in each sub-distribution network (SDN) and served three purposes. First, the edge side interacted with the cloud for real-time information as described earlier, transmitting state and action information and obtaining the reward information. Second, the edge side interacted with the terminals for real-time information, transmitting action commands and simultaneously obtaining the observation information sensed by the terminal sensors in the SDN. Finally, the real-time observation, action, and value information were used as input data for policy network training, and the policy network was trained by the edge computer.

      c)Terminal perceptron

Each smart perceptive machine was deployed on the respective terminals in each SDN, such as meter boxes, remotely controllable switches, and PV generator sets. Its role was to interact with the edge side for real-time information. The terminal sensors were responsible for sensing and collecting operation data from the terminals, such as injected active and reactive power, voltage profiles, and switch opening and closing states. They preprocessed these operation data and transmitted them to the edges while receiving action commands from the edge servers [26]. In this study, the action commands were the opening and closing commands sent to the remotely controllable switches.

      Specifically,the multi-agent system functioned under a cloud-edge collaboration architecture,as follows:

Each agent had a value network and a policy network. The policy network assigned an action, which was a switch state combination, to interact with the DN environment to obtain a reward and a new DN environment. The value network evaluated the action made by the policy network, representing the quality of the action, to assist the policy network in making the optimal action, that is, the optimal switch state combination.

Because the state space in the input of the value network is the entire DN environment, the value networks of all agents can be deployed at the cloud computing center. The complete state space can be used to realize full perception of the distribution network operation state; however, because this state space is large, a cloud computing center with powerful computing capability and low computing cost is needed to train the value networks. As the main role of the policy network is to execute actions, the policy network of each agent can be deployed on a separate edge computer. In the training phase, the observation space in the input of each policy network was the observable local DN environment. Simultaneously, the policy network used the output value of the upper value network to assist its training, and when it obtained the optimal action, it sent the command to the remotely controllable switches to achieve the function of decentralized execution.

In contrast to common multi-agent systems, the value and policy networks of each agent in this study were deployed separately under the cloud-edge collaboration architecture, which exploits its fast and reliable information transmission.

Based on the above structure, the overall process of this method is as follows. First, the cloud center pre-trained the MACQL model based on a limited historical dataset. Second, the data of the pre-trained multi-agent actors and critics were stored in the cloud, and the actor of each agent was deployed at the edge for decentralized execution. Third, real-time state data from each edge were uploaded to the cloud. Finally, the cloud center received the global information, the parameters of the Actor-Critic (AC) model were updated using the MADDPG method and passed to the edges, and the AC model was iteratively updated based on the proposed update mechanism.
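To make the message flow concrete, the following is a minimal, self-contained Python sketch of the cloud-edge workflow described above. All class names, method names, and the trivial placeholder logic are assumptions made for illustration; they are not the authors' implementation.

```python
# Minimal sketch of the cloud-edge "centralized training, decentralized execution"
# workflow described above. All names and the placeholder logic are illustrative
# assumptions, not the authors' code.
import random

class Actor:                                  # stands in for an edge policy network
    def __init__(self, n_actions):
        self.n_actions, self.params = n_actions, 0.0
    def select_action(self, obs):
        return random.randrange(self.n_actions)   # placeholder decision rule

class EdgeServer:                             # decentralized execution at a 110 kV substation
    def __init__(self, actor):
        self.actor = actor
    def act(self, local_obs):
        return self.actor.select_action(local_obs)
    def sync(self, params):
        self.actor.params = params            # receive updated actor parameters from the cloud

class CloudCenter:                            # centralized training in the dispatching center
    def __init__(self, n_agents, n_actions):
        self.actors = [Actor(n_actions) for _ in range(n_agents)]
        self.buffer = []
    def offline_pretrain(self, historical_data):
        self.buffer.extend(historical_data)   # MACQL phase: learn from historical data only
    def online_update(self, transition):
        self.buffer.append(transition)        # MADDPG phase: learn from uploaded live data
        return [a.params + 0.01 for a in self.actors]   # dummy "updated" parameters

cloud = CloudCenter(n_agents=4, n_actions=8)
edges = [EdgeServer(a) for a in cloud.actors]
cloud.offline_pretrain(historical_data=[])
for t in range(24):                           # one day of hourly reconfiguration decisions
    actions = [e.act(local_obs=None) for e in edges]
    new_params = cloud.online_update((None, actions, 0.0, None))
    for e, p in zip(edges, new_params):
        e.sync(p)
```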

      2 Multi-agent deep reinforcement learning model in cloud-edge collaboration architecture

      2.1 Markov game process

Considering that the multi-level dynamic reconfiguration in the DN examined in this study aimed to obtain the optimal switch state combination scheme in each period of 24 h, it belongs to the class of sequential decision-making problems. In addition, the decision-making scheme of multi-level dynamic reconfiguration in the DN at the next moment is not affected by the past DN environment; rather, it is only affected by the DN environment at the current moment, satisfying the Markov property [27-28]. According to the above analysis, the multi-level dynamic reconfiguration in the UDN combined with the MADRL method can be transformed into a Markov game process.

Assuming that there are N existing 110 kV substations located in the UDN, each substation manages an SDN under the jurisdiction of the UDN, each edge server is deployed at an SDN n, and each terminal perceptron is deployed at a remotely controllable switch, load node, or PV node under SDN n. The multi-level dynamic reconfiguration problem in the UDN to be solved in this study can then be transformed into a Markov game process, as follows:

where N denotes the number of agents; S denotes the space of all possible states of the UDN, which is global information; A denotes the joint action space of the N agents; R denotes the long-term discounted reward of the N agents; P denotes the state transfer probability of the N agents, with the state transfer probability of agent n defined accordingly; O denotes the joint observation space of the N agents; and γ denotes the discount factor, which is used to weight current rewards against future rewards. The interactive game process of the multi-agent system is shown in Fig. 3.
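Collecting the elements defined above, and assuming the standard tuple notation from the Markov game literature, the game can be written compactly as:

\mathcal{G} = \left( N,\; S,\; \{A_n\}_{n=1}^{N},\; \{R_n\}_{n=1}^{N},\; P,\; \{O_n\}_{n=1}^{N},\; \gamma \right)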

      Fig.3 Architecture of multi-agent deep reinforcement learning system

      a)Global state space

The global environment denotes the overall dynamic environment in the UDN, and the global state space denotes the operational state of the entire UDN. In this study, the global state space St at time t is defined as follows:

where Pn,t, Qn,t, Vn,t, and In,t denote the sets of active power, reactive power, voltage levels, and branch currents in the SDN under substation n at time t, respectively; the set of active power outputs of all PV nodes in the SDN is also included; and wn,t denotes the set of states of all remotely controllable switches in the SDN.
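Putting these components together, and writing the PV output set with an assumed symbol P^{PV}_{n,t}, the global state at time t presumably collects the measurements of all N SDNs:

S_t = \left\{ P_{n,t},\; Q_{n,t},\; V_{n,t},\; I_{n,t},\; P^{PV}_{n,t},\; w_{n,t} \right\}_{n=1}^{N}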

      b)Joint action space

The joint action space denotes the set comprising the actions of each agent. In this study, the joint action space At at time t is defined as follows:

where wn,t denotes the combination of all remotely controllable switch states in the SDN under substation n at time t. These switch combinations are subject to the DN reconfiguration topology constraints.

      c)Discounted reward function

The discounted reward function denotes the set of rewards received by all agents. For a single agent at time t, the reward function is defined in this study as follows:

where the three terms denote the economic operating cost, voltage offset, and load balance indicators, respectively.

where the three cost terms denote the transmission loss, PVG curtailment, and switch operation costs of the SDN to which the agent belongs, respectively, and Δt denotes the time interval.

where closs and cPV denote the cost of transmission loss per unit and the cost of PVG curtailment per unit, respectively, and a third coefficient denotes the cost of a single switch operation; Bline, BPV, and BSW denote the sets of all branches, PV nodes, and remotely controllable switches in the SDN, respectively; rij denotes the resistance of branch ij; and the amount of PVG curtailment at node i at time t is also defined.

where Vi,t and the corresponding rated value denote the real-time and rated voltages at node i at time t, respectively.

where Mi,t denotes the load factor of node i at time t, the average load factor of the SDN at time t is also defined, and Pi,max denotes the maximum allowable injected active power at node i.
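One plausible form of the reward components described above, using assumed notation (c_{SW} for the single-switch operation cost, \Delta P^{PV}_{i,t} for the curtailment, V^{N}_{i} for the rated voltage, \bar{M}_{t} for the average load factor) and assumed squared-deviation penalties, is:

r_{n,t} = -\left( f^{cost}_{n,t} + f^{volt}_{n,t} + f^{bal}_{n,t} \right)

f^{cost}_{n,t} = \left( c_{loss} \sum_{ij \in B_{line}} r_{ij} I_{ij,t}^{2} + c_{PV} \sum_{i \in B_{PV}} \Delta P^{PV}_{i,t} + c_{SW} \sum_{i \in B_{SW}} \left| w_{i,t} - w_{i,t-1} \right| \right) \Delta t

f^{volt}_{n,t} = \sum_{i} \left( V_{i,t} - V^{N}_{i} \right)^{2}, \qquad f^{bal}_{n,t} = \sum_{i} \left( M_{i,t} - \bar{M}_{t} \right)^{2}, \qquad M_{i,t} = \frac{P_{i,t}}{P_{i,\max}}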

      d)Probability of state transfer

The state transfer probability describes the impact of a joint action of multiple agents on the environment; it denotes the probability that agent n, in state sn,t, takes action at and transfers to state sn,t+1. The probability of a state transfer under the current policy π is:

      e)Joint observation space

The joint observation space O is the set of observations available to all agents, that is, the observations shared between agents, and on is the local measurement information directly available to agent n. The state space of the real-time environment was simplified in this study such that the problem is fully observable and S = O.

      2.2 Online reinforcement learning based on improved MADDPG

Although each agent in the MADDPG algorithm was trained using the AC method, in contrast to the traditional single-agent case, the critic of each agent in the MADDPG algorithm had access to the policy information of the other agents.

Specifically, a game with N agents was considered, where each agent has its own strategy parameters. Denoting the set of strategies of all agents as π, the policy gradient of the expected payoff of each agent in the case of a stochastic strategy can be expressed as

where the Q function in the gradient is a centralized action-value function that takes the global state and the actions of all agents as input.

For deterministic strategies, consider N continuous strategies μn. The gradient formula for DDPG can then be obtained as:

where D is the experience replay pool used to store the data, with each entry stored as a tuple of observations, actions, rewards, and next observations. In MADDPG, the centralized action-value function can be updated according to the following loss function:

where the set of target policies, whose parameters are updated with a delay, is used to compute the value-function target. The flow of the specific algorithm is presented in Algorithm 1.
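For reference, the quantities described above match the standard MADDPG expressions from the literature; written with the usual notation (assumed here for illustration), agent n's stochastic policy gradient, deterministic policy gradient, and centralized critic loss are:

\nabla_{\theta_n} J(\theta_n) = \mathbb{E}_{s,a} \left[ \nabla_{\theta_n} \log \pi_n(a_n \mid o_n) \, Q_n^{\pi}(s, a_1, \ldots, a_N) \right]

\nabla_{\theta_n} J(\mu_n) = \mathbb{E}_{s,a \sim D} \left[ \nabla_{\theta_n} \mu_n(o_n) \, \nabla_{a_n} Q_n^{\mu}(s, a_1, \ldots, a_N) \big|_{a_n = \mu_n(o_n)} \right]

L(\theta_n) = \mathbb{E}_{s,a,r,s' \sim D} \left[ \left( Q_n^{\mu}(s, a_1, \ldots, a_N) - y_n \right)^2 \right], \qquad y_n = r_n + \gamma \, Q_n^{\mu'}(s', a'_1, \ldots, a'_N) \big|_{a'_k = \mu'_k(o_k)}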

Considering that the MADDPG algorithm deals with the interaction between continuous actions and the environment, it cannot directly address the problem of discrete switch state combinations in the distribution network. Therefore, this study made certain improvements to the MADDPG algorithm.

First, the joint action space was discretized. The joint action space at time t after discretization is as follows:

where g is the granularity of the discretization. Considering that the combination of states of the remotely controllable switches under the jurisdiction of each SDN is an enumerable quantity, g is a fixed value for each agent.

Next, the output of the strategy network in the original MADDPG algorithm is a SoftPlus distribution over the continuous action space. In this study, the output of the strategy network was modified to a softmax distribution over the discrete action space, and the input of the value network was modified to a distribution over the continuous state space and the discrete action space. Finally, considering that both the strategy and value networks receive state space data, and because the load and PV data of the UDN present obvious spatial distribution imbalance characteristics, this study modified the feedforward neural network (FNN) used in the original strategy and value networks into a convolutional neural network (CNN) for spatial feature learning to fully perceive the potential connections in their high-dimensional feature spaces.
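The following PyTorch sketch illustrates the kind of modification described above: a CNN-based actor that outputs a softmax distribution over a discretized action space, and a centralized CNN-based critic that takes the state together with the joint discrete actions. Layer sizes, channel counts, and the grid-shaped state encoding are illustrative assumptions, not the authors' exact network design.

```python
# Illustrative CNN actor/critic for discrete switch-state combinations.
# Layer sizes and the image-like state encoding are assumptions, not the paper's design.
import torch
import torch.nn as nn

class CNNActor(nn.Module):
    """Maps a 2-D grid of local measurements to a softmax over discrete switch combinations."""
    def __init__(self, in_channels: int, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, n_actions)

    def forward(self, obs):                       # obs: (batch, C, H, W)
        logits = self.head(self.features(obs))
        return torch.softmax(logits, dim=-1)      # distribution over discrete actions

class CNNCritic(nn.Module):
    """Centralized critic: global state grid plus one-hot joint actions -> Q value."""
    def __init__(self, in_channels: int, joint_action_dim: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.q = nn.Sequential(nn.Linear(16 + joint_action_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state, joint_actions):      # joint_actions: (batch, joint_action_dim)
        return self.q(torch.cat([self.features(state), joint_actions], dim=-1))

# Example shapes: one agent with 8 discrete switch combinations, 5 measurement channels.
actor = CNNActor(in_channels=5, n_actions=8)
probs = actor(torch.randn(2, 5, 16, 16))          # (2, 8) action probabilities
```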

Thus, the modified MADDPG algorithm can be used to solve the multi-level dynamic reconfiguration problem in the UDN, and the input-output mapping relationships of the value and policy networks of each agent, shown in Fig. 3, are explained in the subsequent sections.

      a)Value network

The value network is referred to as a critic, which assists the policy network in converging to the best strategy. The input data of the value network are (S, A, R, S'), and the output is the evaluation Q of the action.

      b)Policy network

The policy network is referred to as an actor, and it calculates the best action. The input data of the policy network are (o, a, r, o'), and the output is the action at the next moment, which is the combination of switch states in the UDN.

Algorithm 1: Online RL Training Process
1) Randomly initialize the actor network and critic network of each agent
2) For episode e = 1 → E do:
3) Initialize a random process H for action exploration
4) Obtain the initial observations o of all agents
5) For t = 1 → T do:
6) For each agent i, select an action a_i with the current strategy and exploration noise H
7) Perform the joint action a = (a_1, ..., a_N) and obtain the rewards r and new observations o'
8) Store (o, a, r, o') in the experience replay pool D
9) Randomly sample a mini-batch of data from D
10) For each agent i, train the critic network centrally and train its own actor network
11) For each agent i, update the target actor network and target critic network
12) End For
13) End For

      2.3 Offline reinforcement learning based on improved MACQL

Offline reinforcement learning, also known as batch-constrained learning, is a class of RL algorithms that does not require interaction with the environment. It simply learns from historical data in an experience pool to complete the training of an agent. However, this does not mean that online RL algorithms can be converted to offline RL simply by removing the interaction with the environment, because the presence of extrapolation errors can result in the overestimation of the Q function at points far from the dataset, and even in upward divergence of the value estimates. Further, offline learning does not have the opportunity to correct these errors by interacting with the environment and sampling new data in time, as online learning does. Therefore, online learning cannot be transferred directly to offline learning. To address this problem, conservative Q-learning (CQL) was proposed. The CQL algorithm introduces additional restrictions on the ordinary Bellman equation to keep the Q function at a very low value at points that deviate from the dataset, thus eliminating certain effects of extrapolation errors [29-30].

In this study, the CQL algorithm based on the Soft Actor-Critic (SAC) framework was selected. Specifically, in traditional Q-learning, the update equation for Q is

where Bπ is the Bellman operator for strategy π used in the actual computation.

Considering that the computation of the Bellman target requires the next action a', which does not exist in the experience pool, the Q values of the states in the dataset are penalized at the actions obtained by policy μ. Additionally, to prevent overfitting, a regularization term ℜ(μ) was added, which uses the KL distance DKL(μ, ρ) from a prior policy ρ. In this study, the prior policy was assumed to be uniformly distributed. The update equation for the Q function of the modified CQL algorithm is

      where β is the balance factor.
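With a uniform prior, the KL regularizer reduces to a log-sum-exp penalty, and the conservative objective takes the standard form from the CQL literature (the single-agent notation here is assumed for illustration):

\min_{Q} \; \beta \, \mathbb{E}_{s \sim D} \left[ \log \sum_{a} \exp Q(s,a) - \mathbb{E}_{a \sim \hat{\pi}_{\beta}(a \mid s)} \left[ Q(s,a) \right] \right] + \frac{1}{2} \, \mathbb{E}_{s,a,s' \sim D} \left[ \left( Q(s,a) - \mathcal{B}^{\pi} \hat{Q}(s,a) \right)^{2} \right]

where \hat{\pi}_{\beta} denotes the behavior policy that generated the dataset D.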

      The policy update equation of the CQL algorithm is expressed as:

where α denotes the entropy regularization factor, which is updated as follows:
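In the standard SAC form, with the notation assumed here for illustration, the policy update (cf. Eq. (34)) and the entropy-coefficient update (cf. Eq. (35)) can be written as:

\pi \leftarrow \arg\max_{\pi} \; \mathbb{E}_{s \sim D, \, a \sim \pi} \left[ Q(s,a) - \alpha \log \pi(a \mid s) \right]

\alpha \leftarrow \arg\min_{\alpha} \; \mathbb{E}_{s \sim D, \, a \sim \pi} \left[ -\alpha \log \pi(a \mid s) - \alpha \bar{\mathcal{H}} \right]

where \bar{\mathcal{H}} is the target entropy.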

      The specific flow of the MACQL algorithm is shown in Algorithm 2.

Algorithm 2: Offline RL Training Process
1) Initialize the Q network, target network Qθ, strategy network πφ, and entropy regularization coefficient α of each agent
2) For episode e = 1 → E do:
3) For t = 1 → T do:
4) For each agent i, update the entropy regularization coefficient according to Eq. (35)
5) For each agent i, update the Q function according to Eq. (33)
6) For each agent i, update the strategy according to Eq. (34)
7) End For
8) End For

As this study adopted the SAC framework, which is an algorithm for a continuous action space interacting with the environment, the action space had to be discretized. In addition, the same improvements were made to the value and strategy networks used in MACQL: the original feedforward neural network (FNN) was modified into a convolutional neural network (CNN) to exploit the powerful spatial feature extraction capability of the CNN and achieve full perception of the large-scale load and PV nodes of the UDN in high-dimensional space.

The modified MACQL algorithm can thus be used to solve the multi-level dynamic reconfiguration problem in the UDN, and the input-output mapping relationships of its value and policy networks are the same as those of MADDPG.

      3 Case analysis

      3.1 Case description

To verify the effectiveness of the proposed method, a modified real 445-node system was used to simulate a large-scale UDN, with a Python 3.8 simulation environment, the PyTorch library for the neural networks used in the agents, and an AMD R7-5800H CPU with 16 GB of DDR4 memory. The node system was composed of four substations, eight transformers, and 16 feeders. The system was equipped with 48 remotely controllable switches, which were divided into eight switch groups according to the number of transformers. Each switch was allowed to operate eight times in 24 h. The reference voltage for each substation was 10 kV. This study used two years of historical load data from four actual 110 kV substations, corresponding to the four SDNs in the 445-node system. The locations of the PV generation systems in the 445-node system are shown in Table 2, and the PV output data were generated by the Monte Carlo method to simulate PV output uncertainty. Further, the Pypower tool was used to calculate the power flow; after a multi-level dynamic reconfiguration scheme was obtained, the power flow result was used as input to obtain the next UDN environment state required by the agents. The costs of PVG curtailment and transmission loss were set at $500/MWh, and the cost of switch operation was $10/operation. The hyperparameter settings for the MADRL algorithm used in this study are presented in Table 3.
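As an illustration of how such a power-flow evaluation step can be wired into the environment, the sketch below uses the Pypower API mentioned above. The 445-node case itself is not available here, so the standard case30 test system stands in for it, and the mapping from switch states to branch indices is an assumption made purely for illustration.

```python
# Hedged sketch: evaluating one switch-state combination with Pypower.
# case30 stands in for the unavailable 445-node case; switch_branch_idx is an
# assumed mapping from remotely controllable switches to branch rows.
from pypower.api import case30, runpf, ppoption
from pypower.idx_brch import BR_STATUS, PF, PT

def evaluate_configuration(ppc, switch_branch_idx, switch_states):
    """Apply a switch-state combination, run a power flow, and return total active loss (MW)."""
    for b, closed in zip(switch_branch_idx, switch_states):
        ppc['branch'][b, BR_STATUS] = 1 if closed else 0    # close (1) or open (0) the branch
    results, success = runpf(ppc, ppoption(VERBOSE=0, OUT_ALL=0))
    if not success:
        return None                                         # infeasible flow -> penalize in the reward
    # The active-power loss on a branch equals the sum of the injections at both ends.
    branch = results['branch']
    return float(sum(branch[i, PF] + branch[i, PT] for i in range(branch.shape[0])))

ppc = case30()                        # placeholder network instead of the real 445-node system
print(evaluate_configuration(ppc, switch_branch_idx=[10, 20], switch_states=[1, 1]))
```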

      Table 2 Photovoltaic node information

Node type: PV; Node numbers: 29, 36, 68, 69, 105, 110, 140, 144, 187, 196, 203, 266, 278, 300, 302, 333, 338

Table 3 Hyperparameter settings

Parameter               Value
Actor learning rate     0.0005
Critic learning rate    0.0005
λ                       0.0005
γ                       0.95
Batch size              64
Conv kernel             3×3
Buffer capacity         10000


      3.2 Analysis of multi-agent reinforcement learning algorithms

To verify the effectiveness of the MADDPG algorithm proposed for the online RL phase, the algorithm was compared with the MASAC, QTRAN, and QMIX algorithms, which are also applicable to MADRL. The MASAC algorithm is based on the SAC framework, which incorporates the concept of maximum entropy compared to the AC framework; there is no target policy network for the individual agents in the SAC framework. The QTRAN and QMIX algorithms are both value function-based methods. A comparison of the performance of the four methods on the large-scale UDN multi-level dynamic reconfiguration problem is shown in Fig. 4. The four algorithms, MADDPG, MASAC, QTRAN, and QMIX, are shown in order in subfigures (1)-(4).

      Fig.4 Algorithm comparison

Before analyzing the results, the criterion used to judge convergence of the training curves in this study must be understood; the relevant parameters were set as follows:

After 770 episodes, if the average reward value over any 10 consecutive episodes changed within ±100, the training curve was considered to have converged. Further, if the absolute value of the difference between the reward value of any episode and the average reward value over the last 10 episodes was less than 50, the training curve was considered to have converged reliably.

      Analysis of Fig.4 leads to the following conclusions:

(a) In the problem considered in this study, all four algorithms converged within 800 episodes. MASAC began to converge at 200 episodes and converged reliably at approximately 760 episodes, with a stable reward value of approximately -1890. QMIX tended to converge at approximately 400 episodes, with the reward value stable at -1810; however, its training curve was not stable, indicating that it is difficult for QMIX to converge reliably on this problem. The training curve of QTRAN was unstable in the early training period; the jitter of the curve was significantly reduced after 500 episodes, but the training curve did not start to converge until 700 episodes. Finally, MADDPG began to converge at approximately 500 episodes, and its training curve was very stable after 700 episodes, indicating reliable convergence. In addition, its reward value was stable at approximately -1800, which is higher than those of MASAC, QTRAN, and QMIX.

(b) Neither QTRAN nor QMIX has a stable training curve, and both exhibit large jitter amplitudes, which indicates that value function-based MADRL algorithms cannot perform well on the problem in this study. In contrast, both MADDPG and MASAC are based on the AC framework. The MASAC algorithm inherently has an extremely strong exploration ability and thus began to converge earliest; however, because it has no target strategy network, it cannot obtain the best combination of actions well, and consequently its reward value is not as good as that of MADDPG. In comparison, MADDPG, which has an additional target strategy network, can obtain better optimization results.

(c) In the above analysis, the MADDPG algorithm exhibited the best optimization effect and the best convergence stability; therefore, the MADDPG algorithm proposed in this study exhibited the best performance.

      3.3 Analysis of neural networks

To verify the effectiveness of the proposed improvements to the MADDPG and MACQL algorithms, a comparison was made between the proposed improved algorithms and the original, unimproved algorithms. The neural networks used in the original MADDPG and MACQL algorithms were feedforward neural networks with two hidden layers containing 128 and 64 neurons, respectively. A performance comparison of the two methods for the multi-level dynamic reconfiguration of a large-scale UDN is shown in Fig. 5. Subfigures (1) and (2) show the MADRL algorithm using the CNN and the FNN, respectively.

      Fig.5 Comparison of effects of different neural networks on performance

      Analysis of Fig.5 leads to the following conclusions:

(a) In the problem of this study, both algorithms converged within 800 episodes; the original algorithm converged at approximately 700 episodes, with a reward value of approximately -1800. The analysis of the results of the algorithm proposed in this paper is given in the previous section.

(b) Comparing the performance of the two algorithms, the MADRL algorithm using the FNN converged significantly more slowly than the proposed algorithm, with almost no increase in the training curve in the early phase of training (0-200 episodes). This indicates that the original algorithm learned almost nothing useful in the early phase. Subsequently, in the middle phase of training (400-500 episodes), the training curve showed obvious jitter, which indicates that the FNN still could not adequately perceive the connections among the high-dimensional load information in the large-scale UDN environment. Finally, in the late phase of training (700-800 episodes), the original algorithm converged, indicating that the FNN had been fitted to the best policy and value networks. In contrast, because the FNN was replaced by the CNN, the multi-agent system could use the advantageous characteristics of the CNN to sense the unbalanced spatial distribution of the high-dimensional load information in the UDN environment during the early phase of training, and the learning process of the multi-agent system entered a smooth state in the middle phase of training. This suggests that, by relying on the powerful spatial feature extraction capability of the CNN, the multi-agent system had sufficiently learned the switch state combination scheme that maximized the reward value.

(c) Thus, the performance of the MADRL algorithm using the CNN proposed in this study was significantly stronger than that of the MADRL algorithm using the FNN.

      3.4 Analysis of offline reinforcement learning

To verify the effectiveness of the proposed method, which combines offline RL with online RL, the method was compared with a method that performs online RL without offline RL. A comparison of the performance of the two methods is shown in Fig. 6. Subfigures (1) and (2) show the training processes of the MADDPG algorithm combined with the MACQL algorithm and of the MADDPG algorithm without the MACQL algorithm, respectively.

      Fig.6 Comparison of model performance with or without offline reinforcement learning

      Analysis of Fig.6 leads to the following conclusions:

(a) In the problem of this study, both algorithms converged within 800 episodes. When MADDPG was not combined with MACQL, it started to converge at approximately 500 episodes, and the reward value stabilized at approximately -1880; the results of MADDPG combined with MACQL were analyzed in the previous section.

(b) The original purpose of introducing offline RL was to allow the initial phase of online RL training (0-100 episodes) to maintain a more stable learning effect and to prevent actions that exceed the constraints from being generated by interaction with the environment. Consequently, the entire interaction process of online learning could satisfy the constraints, reducing the cost of training. As evident in subfigure (2), the training curve showed large jitter in the early phase of training, indicating the existence of actions that exceeded the constraints in the early phase of interaction with the UDN environment, which is to be expected. In contrast, the training curve in subfigure (1) showed less jitter in the early phase of training, and the actions satisfied the constraints. In the middle phase of training, the training curves of the two methods showed high similarity, indicating that the influence of offline RL diminishes as training progresses. Finally, in the late phase of training, both methods showed good convergence reliability; however, owing to the learning experience brought by offline RL, the reward value of the method in subfigure (1) was higher than that in subfigure (2), indicating that the learning effect of the method in subfigure (1) was better.

(c) Thus, the MADDPG algorithm combined with the MACQL algorithm proposed in this study exhibited stronger performance than the MADDPG algorithm without the MACQL algorithm, confirming the effectiveness of combining offline and online RL.

      3.5 Comparative analysis with traditional methods

To prove the advantages of the proposed method over traditional optimization algorithms, the optimization results of the proposed and traditional methods were compared and analyzed. The traditional methods are the global reconfiguration method based on a mathematical programming algorithm, the global dynamic reconfiguration method based on a binary particle swarm optimization algorithm, and the multi-level dynamic reconfiguration method based on a mathematical programming algorithm. The proposed method is denoted as method 1, and the traditional methods are numbered 2, 3, and 4, respectively. The UDN operation scenario was obtained by random sampling of the samples. The optimization results under this scenario are presented in Table 4.

      Analysis of Table 4 leads to the following conclusions:

      Table 4 Comparison of optimization results

Method    Switch operation cost ($)    Transmission loss cost ($)    Solving speed (s)
1         320                          7548.66                       1.83
2         —                            —                             ∞
3         490                          8435.24                       64895.56
4         380                          7913.38                       879.20

(a) From the perspective of solution time, for method 2, the UDN was excessively large and the dimension of the 0-1 variables in the model was very high; thus, it could not solve the problem. Although method 3 required a long time, it obtained the global optimal solution. Method 4 used a multi-level dynamic reconfiguration method, which greatly reduced the dimension of the 0-1 variables in the model by determining the reconfiguration level, and thus offered great advantages in solving speed compared with method 3. However, the solving speed of method 1 was the fastest, reaching second-level solving speed, and it satisfied the real-time optimization requirements of a large-scale UDN.

(b) From the perspective of switch operation and transmission loss costs, method 3 belongs to global reconfiguration; thus, it required the largest number of switch actions, and the switch operation cost rose accordingly. Moreover, the repeated global reconfiguration caused the UDN to undergo large-scale power flow transfer, which considerably increased the transmission loss. Method 4 introduced a multi-level reconfiguration method, such that the number of large-scale switch operations was reduced as much as possible; thus, its switch operation and transmission loss costs were lower than those of method 3. Method 1 was solved using the MADRL method; after 800 episodes of training, the agents learned a better reconfiguration strategy than that of method 4. Thus, its switch operation and transmission loss costs were the lowest.

(c) Thus, the method proposed in this study solved the problem very quickly and was superior to the traditional methods in terms of optimization results. Further, it offers greater advantages over traditional methods when facing a large-scale UDN.

      4 Conclusion

This study proposed a DRL based multi-level dynamic reconfiguration method for urban distribution networks in a cloud-edge collaboration architecture to obtain a real-time optimal multi-level dynamic reconfiguration solution. First, the three-level dynamic reconfiguration model, including the feeder, transformer, and substation levels, was combined with the cloud-edge collaboration architecture based on its structural characteristics. Subsequently, the cloud-edge collaboration architecture was combined with multi-agent DRL to build a DRL model. The cloud-edge collaboration architecture can effectively support the multi-agent system in performing the 'centralized training and decentralized execution' principle. Further, to address the inevitable action-overrun problem in the early phase of online learning, this study proposed an improved MACQL algorithm based on Q-learning for the offline learning phase and an improved MADDPG algorithm based on the AC framework for the online learning phase to explore the action space and update the experience pool. This study showed that the proposed system framework can address the large-scale urban distribution network reconfiguration problem better than other MADRL and traditional methods, and the improved neural network design can obtain a better reconfiguration solution than the original method. In addition, offline reinforcement learning can provide better results for online reinforcement learning in the early phase of training.

      Acknowledgments

      This study was supported by the National Natural Science Foundation of China under Grant 52077146.


      Declaration of Competing Interest

      We declare that we have no conflict of interest.

      References

[1] W Huang, W Zheng, D J Hill (2021) Distribution network reconfiguration for short-term voltage stability enhancement: an efficient deep learning approach. IEEE Transactions on Smart Grid, 12(6): 5385-5395

[2] T V Tran, B H Truong, T P Nguyen, et al. (2021) Reconfiguration of distribution networks with distributed generations using an improved neural network algorithm. IEEE Access, 9: 165618-165647

[3] W Zheng, W Huang, D J Hill, et al. (2021) An adaptive distributionally robust model for three-phase distribution network reconfiguration. IEEE Transactions on Smart Grid, 12(2): 1224-1237

[4] D Vergnaud (2020) Comment on efficient and secure outsourcing scheme for RSA decryption in internet of things. IEEE Internet of Things Journal, 7(11): 11327-11329

[5] Q Peng, Y Tang, S H Low (2015) Feeder reconfiguration in distribution networks based on convex relaxation of OPF. IEEE Transactions on Power Systems, 30(4): 1793-1804

[6] R A Jabr, R Singh, B C Pal (2012) Minimum loss network reconfiguration using mixed-integer convex programming. IEEE Transactions on Power Systems, 27(2): 1106-1115

[7] N C Koutsoukis, D O Siagkas, P S Georgilakis, et al. (2017) Online reconfiguration of active distribution networks for maximum integration of distributed generation. IEEE Transactions on Automation Science and Engineering, 14(2): 437-448

[8] R Srinivasa Rao, S V L Narasimham, M Ramalinga Raju, et al. (2011) Optimal network reconfiguration of large-scale distribution system using harmony search method. IEEE Transactions on Power Systems, 26(3): 1080-1088

[9] F V Gomes, S Carneiro, J L R Pereira, et al. (2005) A new heuristic reconfiguration method for large distribution systems. IEEE Transactions on Power Systems, 20(3): 1373-1378

[10] S Chen, W Hu, Z Chen (2016) Comprehensive cost minimization in distribution networks using segmented-time feeder reconfiguration and reactive power control of distributed generators. IEEE Transactions on Power Systems, 31(2): 983-993

[11] Y Liu, J Li, L Wu (2019) Coordinated optimal network reconfiguration and voltage regulator/DER control for unbalanced distribution systems. IEEE Transactions on Smart Grid, 10(3): 2912-2922

[12] S M Razavi, H R Momeni, M R Haghifam, et al. (2022) Multi-objective optimization of distribution networks via daily reconfiguration. IEEE Transactions on Power Delivery, 37(2): 775-785

[13] H Gao, W Ma, Y Xiang, et al. (2021) Multi-objective dynamic reconfiguration for urban distribution network considering multi-level switching modes. Journal of Modern Power Systems and Clean Energy, 10(5): 1241-1255

[14] Y Li, G Hao, Y Liu, et al. (2022) Many-objective distribution network reconfiguration via deep reinforcement learning assisted optimization algorithm. IEEE Transactions on Power Delivery, 37(3): 2230-2244

[15] B Wang, H Zhu, H Xu, et al. (2021) Distribution network reconfiguration based on NoisyNet deep Q-learning network. IEEE Access, 9: 90358-90365

[16] Y Zhang, F Qiu, T Hong, et al. (2022) Hybrid imitation learning for real-time service restoration in resilient distribution systems. IEEE Transactions on Industrial Informatics, 18(3): 2089-2099

[17] H Gao, W Ma, S He, et al. (2022) Time-segmented multi-level reconfiguration in distribution network: a novel cloud-edge collaboration framework. IEEE Transactions on Smart Grid, 13(4): 3319-3322

[18] T Jing, X Tian, H Hu, et al. (2022) Deep learning-based cloud-edge collaboration framework for remaining useful life prediction of machinery. IEEE Transactions on Industrial Informatics, 18(10): 7208-7218

[19] C Xu, S Liu, C Zhang, et al. (2021) Multi-agent reinforcement learning based distributed transmission in collaborative cloud-edge systems. IEEE Transactions on Vehicular Technology, 70(2): 1658-1672

[20] Y Tao, J Qiu, S Lai (2022) A hybrid cloud and edge control strategy for demand responses using deep reinforcement learning and transfer learning. IEEE Transactions on Cloud Computing, 10(1): 56-71

[21] G Fragkos, J Johnson, E Tsiropoulou (2022) Dynamic role-based access control policy for smart grid applications: an offline deep reinforcement learning approach. IEEE Transactions on Human-Machine Systems, 52(4): 761-773

[22] B Hu, J Li (2022) A deployment-efficient energy management strategy for connected hybrid electric vehicle based on offline reinforcement learning. IEEE Transactions on Industrial Electronics, 69(9): 9644-9654

[23] Z Zhang, Y Zhang, D Yue, et al. (2022) Economic-driven hierarchical voltage regulation of incremental distribution networks: a cloud-edge collaboration based perspective. IEEE Transactions on Industrial Informatics, 18(3): 1746-1757

[24] S D Okegbile, J Cai, A S Alfa (2022) Performance analysis of blockchain-enabled data-sharing scheme in cloud-edge computing-based IoT networks. IEEE Internet of Things Journal, 9(21): 21520-21536

[25] A Bachoumis (2022) Cloud-edge interoperability for demand response-enabled fast frequency response service provision. IEEE Transactions on Cloud Computing, 10(1): 123-133

[26] L Shen, X Dou, H Long, et al. (2021) A cloud-edge cooperative dispatching method for distribution networks considering photovoltaic generation uncertainty. Journal of Modern Power Systems and Clean Energy, 9(5): 1111-1120

[27] S Y Oh, J W Song, W Chang, et al. (2019) Estimation and forecasting of sovereign credit rating migration based on regime switching markov chain. IEEE Access, 7: 115317-115330

[28] I Tzortzis, C D Charalambous, T Charalambous, et al. (2017) Approximation of markov processes by lower dimensional processes via total variation metrics. IEEE Transactions on Automatic Control, 62(3): 1030-1045

[29] J R Regatti, A Gupta (2022) Finite sample analysis of minmax variant of offline reinforcement learning for general MDPs. IEEE Open Journal of Control Systems, 1: 152-163

[30] C Shiranthika, K W Chen, C Y Wang, et al. (2022) Supervised optimal chemotherapy regimen based on offline reinforcement learning. IEEE Journal of Biomedical and Health Informatics, 26(9): 4763-4772


      Author

      • Siyuan Jiang

Siyuan Jiang received his B.S. degree in electrical engineering from Nanjing University of Information Science and Technology, Nanjing, China, in 2021, and he is currently pursuing the M.S. degree at Sichuan University. His research interests include optimal dispatching of distribution networks and machine learning.

      • Hongjun Gao

Hongjun Gao received the B.S., M.S., and Ph.D. degrees in electrical engineering from Sichuan University, Chengdu, China, in 2011, 2014, and 2017, respectively. From 2015 to 2016, he was a Visiting Scholar at the Department of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, USA. He is currently an Associate Professor at the College of Electrical Engineering, Sichuan University, Chengdu, China. His research interests include active distribution system planning and operation, unit commitment, economic dispatch, distributed generation integration, and multi-energy system optimization.

      • Xiaohui Wang

Xiaohui Wang received his Ph.D. from North China Electric Power University, Beijing, in 2012. He is currently with the China Electric Power Research Institute Co. Ltd., Beijing. His research interests include power big data technology, artificial intelligence, active distribution networks, and the Internet.

      • Junyong Liu

Junyong Liu received the Ph.D. degree in electrical engineering from Brunel University, Uxbridge, U.K., in 1998. He is currently a Professor with the College of Electrical Engineering, Sichuan University, Chengdu, China. He is the director of the Sichuan Province Key Smart Grid Laboratory. His current research interests include power system planning, operations, and computer applications.

      • Kunyu Zuo

Kunyu Zuo (S'21) received the B.S. and M.S. degrees in electrical engineering from Sichuan University, Chengdu, China, in 2016 and 2019, respectively. From 2017 to 2018, he was a Visiting Student at the School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, UK. He is currently pursuing the Ph.D. degree in the Electrical and Computer Engineering Department, Stevens Institute of Technology, Hoboken, NJ, USA. His research interests include renewable microgrid solutions and distributed control.

      Publish Info

      Received:2022-10-10

      Accepted:2022-12-07

Published: 2023-02-25

Reference: Siyuan Jiang, Hongjun Gao, Xiaohui Wang, et al. (2023) Deep reinforcement learning based multi-level dynamic reconfiguration for urban distribution network: a cloud-edge collaboration architecture. Global Energy Interconnection, 6(1): 1-14.

      (Editor Yanbo Wang)