Online monitoring of batch processes combining subspace design of latent variables with support vector data description

The correlations among batch process variables are quite complex, so the features of a local abnormality are easily overwhelmed when all variables are monitored together. In addition, batch process variables have obvious non-Gaussian distributions. In response to these two problems, a new multiple subspace monitoring method called principal component analysis - multiple subspace support vector data description (PCA-MSSVDD) is proposed.


INTRODUCTION
To ensure the safe and reliable operation of a batch process, it is necessary to find faults in time. Therefore, it is of great practical significance to apply process monitoring [1][2][3]. Generally, process monitoring methods can be divided into three types: mechanism-driven approaches, knowledge-driven approaches, and data-driven approaches [4,5]. Modern industry records and stores large amounts of data that contain a lot of information, but this information is not used effectively. In recent years, in view of the difficulty of establishing mechanism models in complex industrial processes and the difficulty of obtaining expert knowledge in practice, data-driven process monitoring methods have attracted a lot of attention [6][7][8][9].
In most batch process monitoring algorithms, the influence of complex correlations among process variables on monitoring effectiveness is not taken into account. In batch processes, the correlation among process variables is very complex; some variables have a strong correlation and some variables have a weak correlation. Variables with a strong correlation have a similar mutation behavior to faults, while those with a weak correlation have a different mutation behavior to faults. When a fault occurs, some process variables may mutate. For the above situation, if the monitoring is carried out in the whole monitoring space, there will be the risk of submergence of mutation features, thus increasing the difficulty of fault detection. In view of the complex correlation among process variables, many monitoring algorithms based on variable subspace have been studied in recent years [10][11][12][13][14] . These algorithms place variables with similar characteristics in a subspace and monitor them, highlighting the local characteristics of process variables. If some variables have mutation characteristics, the mutation characteristics will be more obvious in the subspace than in the whole space, which is conducive to detecting the fault. Meanwhile, meaningful subspace design is conducive to process understanding and learning. In addition, a monitoring method based on the subspace of latent variables is proposed by principal component analysis (PCA). This method can eliminate redundant information in the original process variables through PCA mapping [15] .
The above algorithm based on latent variable subspace design can reduce the risk of local variation characteristics being inundated; however, the calculation method of the control limit of the model still has the assumption that the data need to obey a Gaussian distribution. Since batch process data have obvious non-Gaussian characteristics, the fault detection ability of this algorithm is sometimes reduced. Support vector data description (SVDD) can adapt to the non-Gaussian features of the data [16,17] . To distinguish normal data samples from abnormal data samples, which is the purpose of statistical process monitoring, all normal samples can be used as a category to establish an SVDD monitoring model. Multiple subspaces SVDD (MSSVDD) methods have been proposed by fusing variable subspace design methods with SVDD and applied to non-Gaussian processes [18] . However, the application of subspace monitoring methods for latent variables in non-Gaussian processes still has not been studied.
Aiming at the complex correlation and the non-Gaussian distribution among batch process variables, this paper proposes a batch process monitoring algorithm called PCA-MSSVDD, which combines latent variable subspace design with SVDD. In offline modeling, firstly, the three-dimensional matrix of the batch process is expanded and converted into a two-dimensional matrix by twice expansion technology, and the original variable is converted into a latent variable by PCA transformation, eliminating redundant information. The extension matrix is defined according to the PCA transformation matrix, i.e., the load matrix. The vector of the extension matrix can reflect the influence of process variables on latent variables, and it is defined as the characteristic vector of latent variables. When the process variables which have important influence on latent variables change, the corresponding latent variables will also have the variation characteristics. Therefore, latent variables with similar eigenvectors have similar variation characteristics and should be monitored in the same subspace. K-means is introduced to cluster eigenvectors, and the clustering result is the design result of the latent variables subspace. Then, the latent variable time slice matrix is obtained by the sliding time window technique, and the latent variable subspace design results are applied to the latent variable time slice matrix. Finally, SVDD monitoring model is established based on latent variable subspace data. When online monitoring, PCA mapping of online samples is carried out to obtain the latent variables of online samples. Then, the monitoring model is selected by time. Finally, the weighted average strategy is used to fuse the monitoring results of subspace as the final monitoring results. The effective fusion of latent variable subspace design and SVDD can effectively improve the efficiency of fault detection.
The remainder of this paper is structured as follows. Section 2 introduces the SVDD and K-means. Section 3 mainly describes the proposed latent variable subspace design method based on PCA and K-means, and then describes the monitoring steps of PCA-MSSVDD. In Section 4, PCA-MSSVDD is validated by the numerical simulation model and penicillin fermentation process. Finally, Section 5 concludes.

SVDD
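SVDD is a monitoring algorithm based on pattern recognition. Samples are projected into the feature space by a nonlinear mapping, and a minimum hypersphere is found in the feature space that, under the condition of minimum structural risk, encloses as many of the samples as possible [19]. The optimization objective of SVDD is:

$$\min_{R,\,a,\,\xi}\; R^2 + C\sum_i \xi_i \quad \text{s.t.}\quad \|\Phi(x_i) - a\|^2 \le R^2 + \xi_i,\quad \xi_i \ge 0 \tag{1}$$

where $a$ represents the center, $R^2$ represents the square of the radius, $C$ represents the penalty coefficient, $\xi_i$ represents the relaxation variable, and $\Phi$ represents the feature mapping that defines the kernel function $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$. The objective function can also be written in its dual form:

$$\max_{\alpha}\; \sum_i \alpha_i K(x_i, x_i) - \sum_i \sum_j \alpha_i \alpha_j K(x_i, x_j) \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_i \alpha_i = 1 \tag{2}$$

By solving the above problem, a set of vectors $v_i$ and corresponding coefficients $\alpha_i$ can be obtained. If $\alpha_i > 0$, the corresponding vector is defined as a support vector (SV). For a support vector $v \in SV$, the radius of the hypersphere is calculated as follows:

$$R^2 = K(v, v) - 2\sum_i \alpha_i K(v_i, v) + \sum_i \sum_j \alpha_i \alpha_j K(v_i, v_j) \tag{3}$$

The square of the distance from an online sample $x$ to the center is calculated as follows:

$$D^2 = K(x, x) - 2\sum_i \alpha_i K(v_i, x) + \sum_i \sum_j \alpha_i \alpha_j K(v_i, v_j) \tag{4}$$

Therefore, the SVDD method can be used to distinguish between normal and abnormal samples. In this paper, the monitoring statistic based on SVDD is defined as:

$$DR = \frac{D^2}{R^2} \tag{5}$$

Meanwhile, the corresponding control limit is $DR_{lim} = 1$.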
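As a minimal sketch of how the distance, the radius, and the monitoring statistic fit together, the following Python snippet evaluates them for a given set of support vectors and coefficients. It assumes that the coefficients have already been obtained from the dual problem and that the statistic is the ratio of the squared distance to the squared radius, which is consistent with a control limit of 1. The linear kernel is used only so that the arithmetic can be checked by hand, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def linear_kernel(u, v):
    """Plain inner product; a Gaussian kernel could be swapped in."""
    return float(np.dot(u, v))

def squared_distance(x, sv, alpha, kernel=linear_kernel):
    """Squared feature-space distance from x to the hypersphere centre."""
    cross = sum(a * kernel(v, x) for a, v in zip(alpha, sv))
    const = sum(ai * aj * kernel(vi, vj)
                for ai, vi in zip(alpha, sv)
                for aj, vj in zip(alpha, sv))
    return kernel(x, x) - 2.0 * cross + const

def dr_statistic(x, sv, alpha, kernel=linear_kernel):
    """Assumed statistic DR = D^2 / R^2; a sample is flagged when DR > 1."""
    r2 = squared_distance(sv[0], sv, alpha, kernel)  # radius taken at one support vector
    return squared_distance(x, sv, alpha, kernel) / r2

# Two support vectors with equal weights: centre (1, 0), squared radius 1.
sv = np.array([[0.0, 0.0], [2.0, 0.0]])
alpha = np.array([0.5, 0.5])
dr_inside = dr_statistic(np.array([1.0, 0.0]), sv, alpha)   # sample at the centre
dr_outside = dr_statistic(np.array([3.0, 0.0]), sv, alpha)  # sample outside the sphere
print(dr_inside, dr_outside)
```

In this toy case the sample at the centre gives DR = 0 and the sample at (3, 0) gives DR = 4, so only the second one exceeds the control limit.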

K-means
K-means [20] uses Euclidean distance as the similarity evaluation index, that is, the closer are the two samples, the higher is their similarity. The clustering algorithm considers that each class of samples is composed of samples with close Euclidean distance, so the minimum sum of squares of errors from each class of samples to the center of the sample is taken as the objective function of clustering.
The K-means clustering algorithm clusters the column vectors of a two-dimensional matrix X (n × m). The m vectors $y_1, y_2, \ldots, y_m$ are assigned to C categories, whose cluster centers are represented as $l_1, l_2, \ldots, l_C$. The c-th category is represented as $X_c$, where c = 1,2,…,C. The detailed subspace design steps are as follows:
(1) Set the iteration counter t = 1 and randomly select the cluster centers $l_1(t), l_2(t), \ldots, l_C(t)$ from the column vectors of X (n × m).
(2) Calculate the square of the Euclidean distance between the i-th sample and the k-th cluster center:

$$d(y_i, l_k(t)) = \|y_i - l_k(t)\|^2 \tag{6}$$

(3) Update the samples of each category $X_c(t+1)$ by assigning every sample to its nearest center:

$$X_c(t+1) = \{\, y_i : d(y_i, l_c(t)) \le d(y_i, l_k(t)),\; k = 1,2,\ldots,C \,\} \tag{7}$$

(4) Update the center $l_c(t+1)$ of each category $X_c(t+1)$:

$$l_c(t+1) = \frac{1}{n_c}\sum_{y_i \in X_c(t+1)} y_i \tag{8}$$

where $n_c$ is the number of samples in $X_c(t+1)$.
(5) If $l_c(t+1) = l_c(t)$ for all c = 1,2,…,C, the algorithm has converged and ends; otherwise, set t = t + 1 and go back to Step 2.
In this paper, the formula of the matrix column vector clustering is expressed as follows: When the K-means clustering algorithm is applied, it is necessary to determine the number of classifications. When the number of classifications is known, K-means can re-classify unreasonably classified samples through its own optimization iteration steps. Therefore, in the case of a small number of data samples, it can achieve satisfactory results. In the case of an uncertain number of classifications, it is necessary to determine the number of classifications by other analysis methods.
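The five steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `kmeans_columns`, the `seed` and `max_iter` parameters, and the rule of keeping the previous center when a category becomes empty are not taken from the paper.

```python
import numpy as np

def kmeans_columns(X, C, seed=0, max_iter=100):
    """Cluster the m column vectors of X (n x m) into C categories."""
    rng = np.random.default_rng(seed)
    Y = np.asarray(X, dtype=float).T          # one row per column vector y_i
    # Step 1: pick C distinct columns as the initial centres.
    centers = Y[rng.choice(Y.shape[0], size=C, replace=False)].copy()
    for _ in range(max_iter):
        # Step 2: squared Euclidean distance from every y_i to every centre l_k.
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # Step 3: assign each vector to its nearest centre.
        labels = d2.argmin(axis=1)
        # Step 4: recompute each centre as the mean of its members.
        new_centers = centers.copy()
        for c in range(C):
            members = Y[labels == c]
            if len(members) > 0:              # assumption: keep old centre if empty
                new_centers[c] = members.mean(axis=0)
        # Step 5: stop when the centres no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Six column vectors forming two well-separated groups.
X_demo = np.array([[0.0, 0.1, 0.2, 10.0, 10.1, 10.2],
                   [0.0, 0.1, 0.0, 5.0, 5.1, 5.0]])
labels_demo, centers_demo = kmeans_columns(X_demo, C=2)
print(labels_demo)
```

The label values themselves are arbitrary; what matters is that the first three columns share one label and the last three share the other.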

Subspace design of latent variables based on PCA and K-means
The two-dimensional matrix X (n × m) has n samples and m variables. The PCA decomposition is as follows [21]:

$$X = TP^{T}$$

where P is the load matrix and T is the latent variable matrix. P can be obtained from the eigendecomposition of the covariance matrix $\Lambda$:

$$\Lambda = \frac{1}{n-1}X^{T}X, \qquad \Lambda p_i = \lambda_i p_i,\quad i = 1,2,\ldots,m$$

where $\lambda_i$ (i = 1,2,…,m) represents the m eigenvalues of the covariance matrix $\Lambda$ and $p_i$ is the corresponding eigenvector, i.e., the i-th column of P. The eigenvalues are arranged in order from largest to smallest. The number of latent variables retained is determined by the method of cumulative contribution rate. The formula is as follows:

$$\frac{\sum_{i=1}^{m_c}\lambda_i}{\sum_{i=1}^{m}\lambda_i} \ge \eta$$

where $m_c$ represents the number of latent variables retained and $\eta$ is the chosen cumulative contribution rate. Therefore, when there is a lot of redundant information in the original data, $m_c \le m$. In the process of eliminating redundant information, selecting a higher cumulative variance contribution rate reduces information loss and retains more process information. The retained load matrix and latent variable matrix are $\hat{P}$ (m × $m_c$) and $\hat{T}$ (n × $m_c$), respectively. Their mapping relationship is as follows:

$$\hat{T} = X\hat{P} \tag{14}$$

Formula (14) shows that latent variables are linear combinations of the original process variables. Therefore, different process variables have different effects on latent variables. The importance of process variables to a latent variable is defined as a feature of this latent variable. When the process variables are mutated due to faults, the latent variables will be mutated. Because latent variables with similar characteristics have similar variation characteristics, monitoring latent variables with similar characteristics in the same subspace can reduce the risk of submergence of variation characteristics.
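The PCA step can be illustrated with a short NumPy sketch. It assumes that X has already been standardized, uses a hypothetical threshold parameter `cpv` for the cumulative contribution rate (the value actually used in the paper is not stated here), and the function name `pca_latent` is illustrative.

```python
import numpy as np

def pca_latent(X, cpv=0.85):
    """Return retained latent variables, retained loadings, and all eigenvalues."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    cov = X.T @ X / (n - 1)                      # covariance of standardized data
    eigval, eigvec = np.linalg.eigh(cov)         # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]             # reorder from largest to smallest
    eigval, eigvec = eigval[order], eigvec[:, order]
    ratio = np.cumsum(eigval) / eigval.sum()     # cumulative contribution rate
    m_c = min(int(np.searchsorted(ratio, cpv)) + 1, m)
    P_hat = eigvec[:, :m_c]                      # retained load matrix (m x m_c)
    T_hat = X @ P_hat                            # latent variables (n x m_c)
    return T_hat, P_hat, eigval

# Three variables, the third being a redundant combination of the first two.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 2))
raw = np.column_stack([raw, raw[:, 0] + raw[:, 1]])
X_std = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
T_hat, P_hat, eigval = pca_latent(X_std, cpv=0.99)
print(P_hat.shape, T_hat.shape)
```

Because the third variable carries no independent information, two latent variables are enough to reach the requested contribution rate in this example.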
Based on the data transformation matrix, i.e. the load matrix, an extension matrix D is defined to represent all the characteristics of each latent variable. The element of the extension matrix D is obtained as follows: Where p j is the j-th column of P and p i,j is the i-th element of p j . The size of the numerical value indicates the significance of the latent variable characteristics, and the larger is the numerical value, the more obvious are the characteristics. The column vectors of the extension matrix D represent the eigenvectors of latent variables. K-means clustering is applied to the eigenvectors. IDX (m c × 1) represents the results of clustering. The expression of clustering results is as follows: According to the results of IDX, the calculation method of subspace segmentation results is as follows: Where IDX j is the j-th element of IDX and T c is the latent variable matrix of the c-th subspace. Therefore, the design results of latent variable subspace can be obtained offline. The design flow chart of latent subspace variable based on PCA-K-means is shown in Figure 1.
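Once the clustering result IDX is available, splitting the latent variable matrix into subspaces is a matter of column selection. The sketch below assumes IDX is a vector of integer cluster labels, one per retained latent variable; the function name `split_subspaces` is hypothetical, and the construction of the extension matrix D is not reproduced here.

```python
import numpy as np

def split_subspaces(T_hat, idx):
    """Group the columns of T_hat by their cluster label in idx."""
    T_hat = np.asarray(T_hat)
    idx = np.asarray(idx)
    # One latent-variable matrix T_c per distinct label c.
    return {int(c): T_hat[:, idx == c] for c in np.unique(idx)}

# Five latent variables observed on four samples, assigned to two subspaces.
T_demo = np.arange(20.0).reshape(4, 5)
idx_demo = [0, 1, 0, 1, 1]
subspaces = split_subspaces(T_demo, idx_demo)
print({c: T_c.shape for c, T_c in subspaces.items()})
```

Each subspace matrix keeps all samples and only the latent variables that share the same label, so a separate SVDD model can be trained on each one.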

Online monitoring of batch process based on PCA-MSSVDD
The batch process historical data comprise a three-dimensional data matrix X (I × J × K). Here, I represents the number of batches, J represents the number of variables, and K represents the number of sampling points. First, the three-dimensional data matrix X (I × J × K) is expanded into the two-dimensional matrix $X_B$ (I × JK) by batch expansion. Then, the data are standardized to mean 0 and variance 1, and the batch expansion matrix is rearranged into the variable expansion matrix $X_V$ (IK × J). The latent variables are obtained by PCA transformation of $X_V$ (IK × J). On the basis of the latent variables, the sliding time window technique and the subspace design method are applied, and an SVDD model is built for each subspace matrix of each time slice. The length of all time windows is set to 1 moment. The window length is set to 1 moment because a shorter time window contains smaller data fluctuations, which avoids the influence of data fluctuation on the modeling and thus highlights the influence of subspace design on monitoring. During online monitoring, online samples are first mapped to latent variables by the PCA model of $X_V$ (IK × J). Then, the monitoring model is selected according to the sampling time. Finally, the monitoring results of the different subspaces are fused by a weighted average (Formula (20)), where C represents the number of subspaces and $DR_c$ represents the monitoring result of the c-th subspace. The flow chart of PCA-MSSVDD is shown in Figure 2. The specific monitoring steps are as follows:
(1) The three-dimensional matrix is transformed into the two-dimensional matrix of variable expansion by two expansion techniques.
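The two expansion steps can be sketched with NumPy reshapes. The column ordering of the batch expansion (time slice by time slice) and the row ordering of the variable expansion (all batches of one sampling point stacked together) are assumptions made for this illustration, as is the function name `unfold_and_standardize`.

```python
import numpy as np

def unfold_and_standardize(X):
    """X has shape (I, J, K): batches x variables x sampling points."""
    X = np.asarray(X, dtype=float)
    I, J, K = X.shape
    # Batch expansion: one row per batch, columns ordered by time, then variable.
    X_B = X.transpose(0, 2, 1).reshape(I, K * J)
    mean = X_B.mean(axis=0)
    std = X_B.std(axis=0, ddof=1)
    Z = (X_B - mean) / std                      # mean 0, variance 1 across batches
    # Variable expansion: stack the K time slices, each of size (I, J).
    X_V = Z.reshape(I, K, J).transpose(1, 0, 2).reshape(K * I, J)
    return X_V, mean, std

# Five batches, three variables, four sampling points of synthetic data.
rng = np.random.default_rng(1)
X3d = rng.normal(loc=5.0, scale=2.0, size=(5, 3, 4))
X_V, mean, std = unfold_and_standardize(X3d)
slice_2 = X_V[2 * 5:3 * 5]                      # time slice of the third sampling point
print(X_V.shape, slice_2.shape)
```

With a window length of 1 moment, each block of I consecutive rows of `X_V` is one time slice, which is the data a per-time SVDD model would be trained on after PCA mapping and subspace splitting.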

Numerical simulation
This paper designs the following numerical simulation models with two subsystems: Here, randi (10,800,1) represents a randomly generated column vector with 800 rows and 1 column, and the sample is uniformly distributed between 0 and 10. randi (10,800,16) represents a randomly generated matrix with 800 rows and 16 columns, and the sample of each column obeys a uniform distribution between 0 and 10. x i,j is the i-th row and the j-th sample, h i is the i-th row vector in the matrix T, and noise i,j is the i-th row and the j-th sample.
The gray scale diagram of the extension indicator matrix of the load matrix in the numerical simulation process is shown in Figure 3. The comparison of SVDD, MSSVDD, and PCA-MSSVDD in monitoring the numerical process is shown in Table 3.
Comparing the false alarm rates shows that the false alarm rate of PCA-MSSVDD is 1.5%, slightly higher than the best value of 0. The missed alarm rate of PCA-MSSVDD is 15.3%, which is significantly lower than those of SVDD and MSSVDD (47.0% and 68.5%, respectively). The error rate of PCA-MSSVDD is 11.8%, which is clearly better than those of SVDD and MSSVDD (35.3% and 51.5%, respectively). Comparing the first time to detect the fault shows that PCA-MSSVDD, SVDD, and MSSVDD all first detect the fault at the 201st sample. Therefore, PCA-MSSVDD has a better monitoring effect for local faults scattered in different subsystems.
The comparison charts of SVDD, MSSVDD, and PCA-MSSVDD for the test case in monitoring the numerical process are shown in Figure 4. During the whole phase of introducing fault signals, the statistics of PCA-MSSVDD are mostly above the control limit, while only a few of those of SVDD and MSSVDD are above the control limit.
The comparison charts of the PCA-MSSVDD subspaces for the test case in monitoring the numerical process are shown in Figure 5. It can be clearly seen that most faults are detected in the second subspace, while few faults are detected in the first subspace.
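The four indices compared here can be computed from a sequence of DR statistics as in the sketch below. It assumes the usual definitions (false alarms counted over normal samples, missed alarms over faulty samples, errors over all samples) and a control limit of 1; the function name and the synthetic alarm sequence, with the fault starting at sample 201 of 800, are illustrative and are not the paper's data.

```python
import numpy as np

def monitoring_metrics(dr, fault_start, limit=1.0):
    """dr: statistic per sample; fault_start: 0-based index of the first faulty sample."""
    alarms = np.asarray(dr) > limit
    normal, faulty = alarms[:fault_start], alarms[fault_start:]
    false_alarms = int(normal.sum())
    missed = int((~faulty).sum())
    far = 100.0 * false_alarms / normal.size             # false alarm rate (%)
    mar = 100.0 * missed / faulty.size                   # missed alarm rate (%)
    err = 100.0 * (false_alarms + missed) / alarms.size  # error rate (%)
    hits = np.flatnonzero(faulty)
    # First detection as a 1-based sample number, or None if never detected.
    first = int(fault_start + hits[0] + 1) if hits.size else None
    return far, mar, err, first

# Synthetic sequence: 800 samples, fault from sample 201 onwards.
dr_demo = np.zeros(800)
dr_demo[:3] = 2.0          # three false alarms in the normal phase
dr_demo[200:] = 2.0        # alarms throughout the faulty phase ...
dr_demo[300:392] = 0.0     # ... except 92 missed samples
far, mar, err, first = monitoring_metrics(dr_demo, fault_start=200)
print(far, round(mar, 1), err, first)
```

For this synthetic sequence the false alarm rate is 1.5%, the missed alarm rate is about 15.3%, the error rate is 11.875%, and the fault is first detected at sample 201.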

Simulation test of the penicillin fermentation process
The simulation model of the penicillin fermentation process is designed to provide a standard testing platform for data-driven batch process monitoring methods [22]. Under the normal set values of the process variables, the production cycle of the penicillin fermentation process is set to 400 h and data are recorded once every 0.5 h, so 800 samples are recorded in one simulation [23]. There is random noise in the simulation model, so under the same initial set values the data of different batches fluctuate randomly. Therefore, 100 batches of simulation data are collected as a historical reference database.
The gray scale diagram of the extension indicator matrix of the load matrix retained during penicillin fermentation is shown in Figure 6. All features of each latent variable can be clearly seen. The abscissa indicates the serial number of the load vector, while the ordinate indicates the serial number of the process variable. The color of the square in the figure indicates the importance of process variables to latent variables. The lighter is the color, the more important are the process variables to latent variables. In this section, latent variables are also divided into two subspaces according to prior knowledge. Hidden variable subspace design results for the penicillin fermentation process are shown in Table 4, where Hidden Variables 1, 2, and 10-16 form subspace T 1 , while Hidden Variables 3-9 form subspace T 2 . As shown in Figure 6, the information of Variables 1, 2, and 4 is more projected in the first subspace, while the information of Variables 3, 5, and 7-9 is more projected in the second subspace.
The simulation of the penicillin fermentation process provides three types of faults: (1) the fault of the ventilation rate variable; (2) the fault of the stirring power variable; and (3) the fault of the glucose flow rate variable. In this paper, six test faults are designed through the simulation test platform to simulate abnormal operation behavior in actual production. The size and types of faults used to test the monitoring algorithm are shown in Table 5.
The above six kinds of faults are used for monitoring and comparing SVDD, MSSVDD, and PCA-MSSVDD. The comparison of the false alarm rate, missed alarm rate, error rate, and first time to detect fault using SVDD, MSSVDD, and PCA-MSSVDD in monitoring the penicillin fermentation process is shown in Table 6.
Comparing the false alarm rates shows that PCA-MSSVDD has the optimal value for Faults 4-6. Comparing the missed alarm rates shows that PCA-MSSVDD has the optimal value only for Fault 2. For Faults 1, 2, 4, and 5, the missed alarm rate of PCA-MSSVDD is lower than that of SVDD; for Faults 1, 4, and 5, it is slightly higher than that of MSSVDD. Comparing the error rates shows that PCA-MSSVDD has no optimal value. For Faults 1, 2, 4, and 5, the error rate of PCA-MSSVDD is lower than that of SVDD but slightly higher than that of MSSVDD. Comparing the first time to detect fault shows that, for Faults 1, 3, 4, and 5, PCA-MSSVDD detects the fault earliest. Therefore, PCA-MSSVDD is more sensitive to fault signals.
The comparison charts of SVDD, MSSVDD, and PCA-MSSVDD for Fault 3 in monitoring the penicillin fermentation process are shown in Figure 7. At about the 120th hour, PCA-MSSVDD has a peak value and a fault can be detected. At this time, the statistics of SVDD and MSSVDD have not exceeded the control limit. The statistics of SVDD and MSSVDD fall back to the position below the control limit after the 350th hour, while the statistics of PCA-MSSVDD are not below the control limit, and faults can still be detected.
The comparison charts of SVDD, MSSVDD, and PCA-MSSVDD for Fault 6 in monitoring the penicillin fermentation process are shown in Figure 8. From the 150th hour to the 200th hour, the statistics of SVDD, MSSVDD, and PCA-MSSVDD all have an upward trend. However, the statistics of PCA-MSSVDD have a small jump signal, which exceeds the control limit and allows the fault to be detected earlier. Combining the above two monitoring comparison charts, Figure 7 shows that PCA-MSSVDD has a better monitoring effect than MSSVDD at about the 125th hour and from the 350th hour to the end of the process; however, MSSVDD has a better monitoring effect than PCA-MSSVDD from the 150th hour to the 200th hour. Figure 8 shows that PCA-MSSVDD can detect faults earlier, but MSSVDD has a better monitoring effect than PCA-MSSVDD around the 200th hour. Therefore, the simulation test of the penicillin fermentation process shows that PCA-MSSVDD has better fault detection capability than MSSVDD in some cases but a worse one in other cases.
The comparison charts of PCA-MSSVDD subspace for Faults 1-6 in monitoring the penicillin fermentation process are shown in Figures 9-14. It can be easily seen that Faults 1 and 4 both occur on Variable 1.
Figure 6. Gray schematic diagram of the denotative matrix in the penicillin fermentation process.
Using the above six test faults of the penicillin fermentation process simulation, PCA-MSSVDD is compared with multi-way principal component analysis (MPCA) [24], multi-way independent component analysis (MICA) [25], batch dynamic principal component analysis (BDPCA) [26], and mixture probabilistic principal component analysis (MPPCA) [27]. The false alarm rates of the five methods are shown in Table 7; for Faults 4-6, PCA-MSSVDD has the optimal value. The false negative rates are shown in Table 8; for Faults 1-3, PCA-MSSVDD has the optimal value. The monitoring error rates are shown in Table 9; for Faults 1-3, PCA-MSSVDD has the optimal value. The first times to detect fault are shown in Table 10; for Fault 1, PCA-MSSVDD has the optimal value. Therefore, for some test faults, PCA-MSSVDD has better monitoring results.

CONCLUSIONS
In this paper, a batch process monitoring algorithm based on PCA-MSSVDD is proposed by combining latent variable subspace design with SVDD. Subspace monitoring by PCA and K-means can effectively reduce the risk of inundation of variation features; using SVDD to establish subspace monitoring model can make the proposed method applicable to any non-Gaussian process.
Through the numerical simulation process and penicillin fermentation simulation process test, the comparison results between PCA-MSSVDD and SVDD show that the subspace monitoring algorithm can effectively reduce the risk of variation characteristics being submerged and improve the monitoring performance. The comparison results between PCA-MSSVDD and MSSVDD show that the fault detection capability of PCA-MSSVDD may be higher than that of MSSVDD or lower than that of MSSVDD. For local failures of weakly correlated variables, the proposed PCA-MSSVDD method will have better results, while, for strongly correlated variables, the MSSVDD method will have better results, and both methods have better performance than SVDD.

Table 10. First time to detect fault in monitoring the penicillin fermentation process.
Fault   MPCA         MICA         BDPCA        MPPCA        PCA-MSSVDD
        T²     SPE   I²     SPE   T²     SPE   T²     SPE   DR
1       /      100   397    100   100    100   100    127   100
2       111    101   105    101   104    101   100    111   105
3       142    137   121    137   106    106   104    142   121
4       /      142   395    163   166    112   142    352   125
5       397    171   106    102   102    108   102    397   106
6       334    164   328    194   171    148   167    166   161