Hindawi
Journal of Spectroscopy
Volume 2020, Article ID 3590301, 12 pages
https://doi.org/10.1155/2020/3590301

Research Article

Moving-Window-Improved Monte Carlo Uninformative Variable Elimination Combining Successive Projections Algorithm for Near-Infrared Spectroscopy (NIRS)

Weiwei Jiang,1 Changhua Lu,1,2 Yujun Zhang,2 Wei Ju,3 Jizhou Wang,4 Feng Hong,1 Tao Wang,1 and Chunsheng Ou1

1 School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
2 Anhui Institute of Optics Fine Mechanics, Chinese Academy of Sciences, Hefei 230031, China
3 School of Internet, Anhui University, Hefei 230039, China
4 Department of Electronics, Hefei University, Hefei 230061, China

Correspondence should be addressed to Chunsheng Ou; [email protected]

Received 17 February 2020; Revised 26 June 2020; Accepted 11 July 2020; Published 3 August 2020

Academic Editor: Alessandra Durazzo

Copyright © 2020 Weiwei Jiang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The MC-UVE-SPA method is commonly proposed as a variable selection approach for multivariate calibration. However, the SPA tends to select wavelength variables that are sparsely distributed over the wavelength ranges of the variables selected by the MC-UVE algorithm, and the MC-UVE-SPA cascade cannot resolve the resulting discontinuity of wavelength points. This problem is addressed in this paper by proposing a moving-window- (MW-) improved MC-UVE-SPA wavelength selection algorithm. The proposed algorithm improves the continuity of the selected wavelength variables and thereby better exploits the advantages of the MC-UVE algorithm and the SPA to obtain regression models with high prediction accuracy. The MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms are applied to wavelength variable selection for the NIR spectral absorbance data of corn, diesel fuel, and ethylene. Here, partial least squares regression (PLSR) models reflecting the oil content of corn, the boiling point of diesel fuel, and the ethylene concentration are established after conducting wavelength selection using the MC-UVE algorithm, and corresponding multiple linear regression (MLR) models are established after conducting wavelength selection using the MC-UVE-SPA and MC-UVE-SPA-MW algorithms. Experimental results demonstrate that the progressive elimination of uncorrelated and collinear variables generates increasingly simplified partial-spectrum models with greater prediction accuracy than the full-spectrum model. Among the three wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelength variables, while the proposed MC-UVE-SPA-MW algorithm provided models with the greatest prediction accuracy.

1. Introduction

Being simple, rapid, and noninvasive and requiring no sample pretreatment, near-infrared (NIR) spectroscopy [1] has been adopted as a popular analytical tool for both qualitative and quantitative analyses in various fields [2–5]. The quantitative analysis of NIR spectral data is generally conducted through the construction of regression models, such as those based on principal component analysis (PCA) [6], partial least squares (PLS) regression [7], and multiple linear regression (MLR) [8], which take the characteristic wavelengths of the spectral data as input variables. However, the development of modern analytical instruments has led to the capability of acquiring NIR spectral data that can easily contain hundreds to tens of thousands of individual wavelengths [9]. When the full-band spectral data are adopted for modeling, the model contains a large amount of redundant information, which results in inefficiency [10]. In addition, spectral data usually contain noise, interference, and/or mixed spectral components that can often greatly detract from the prediction accuracy of full-spectrum models developed for spectral data analysis [11]. Yun et al. pointed out that there are three ways to address these problems, namely, regularization, dimension reduction, and variable selection [12]. Among the above-discussed methods, variable selection has become the dominant method of interest in recent years for the development of NIR spectral analysis technology and chemometrics [11–14].

The goal of wavelength selection is to identify the most informative wavelengths for use as variables in partial-spectrum regression models. Here, uninformative wavelength variables have either no effect or a negative effect on the modeling performance. The wavelength selection process fulfils three purposes: (1) providing models with greater predictive capability, (2) obtaining wavelength variables that provide greater modeling efficiency, and (3) providing simpler models with improved interpretability [9]. The most commonly employed wavelength selection algorithms developed thus far include uninformative variable elimination (UVE) and the successive projections algorithm (SPA).

The goal of UVE, first proposed by Centner et al. [15], is not to select variables directly but to effectively eliminate uninformative variables in the spectral data such that only informative wavelength variables remain. The SPA employs simple projection to select variables with a minimum of collinearity, but variables selected by SPA may make little contribution to multivariate calibration, which can affect model prediction [16]. A significant development in recent years has been the combined use of different algorithms through a cascade strategy, where the results of one wavelength selection algorithm are used as the inputs of the next selection algorithm in a stepwise manner. This can combine the advantages of various wavelength selection algorithms in a complementary way and thereby obtain better and more effective prediction results. Combining a common variable selection method with the SPA can greatly simplify the model and improve the prediction accuracy. This strategy has been effectively used in many studies to address the problem associated with the application of the SPA to NIR spectral data by first reducing the dimension of the spectral data with some initial algorithm, such as UVE, MC-UVE, particle swarm optimization (PSO), or genetic algorithm (GA) optimization [16–20]. Among them, UVE and MC-UVE are commonly used as the primary wavelength selection algorithms ahead of the SPA. For example, Ye et al. proposed the combination of UVE and SPA to integrate the strengths of each and successfully applied it to the NIR spectroscopic analysis of nicotine in tobacco lamina and active pharmaceutical ingredients in intact tablets. For variable selection, UVE was employed to select informative variables, and SPA followed to select, from the informative variables, those with minimum redundant information [20]. Li et al. proposed a new combination of MC-UVE and SPA: MC-UVE was employed to select informative variables in the full spectrum, and SPA was then employed as a powerful method for further characteristic variable selection [18].

Nonetheless, most of the informative wavelengths in a molecular NIR spectrum typically exhibit some continuity, where wavelength points adjacent to an informative wavelength point also represent informative wavelengths [21]. However, the MC-UVE algorithm and the SPA are both wavelength selection algorithms based on optimal wavelength points, which are most likely isolated points along the full NIR spectrum. The MC-UVE-SPA cascade cannot resolve this wavelength point discontinuity, so it may select the fewest wavelength variables without giving the best modeling performance. Fan et al. constructed a model for visible/NIR spectral data reflecting the lycopene content based on wavelength variable selection obtained using UVE, SPA, and CARS individually and in various two-stage cascaded combinations [22]. The UVE-SPA combination was found to retain the smallest number of wavelength variables of all the selection algorithms considered, but the prediction accuracy of the model constructed using this wavelength variable set was the worst of all the models obtained with the other wavelength selection algorithms. Sun et al. showed that the prediction results of the model constructed by the cascaded wavelength selection algorithm were not always the most accurate and that the prediction results of the improved cascaded wavelength selection algorithm were better than those of the direct two-stage cascaded strategy [23].

Few studies have considered improving the continuity of the wavelengths selected by wavelength point selection algorithms. Therefore, this paper considers the continuity of the wavelengths selected by the MC-UVE-SPA. In this study, a moving-window-improved cascade strategy is employed for wavelength selection, herein denoted as the MC-UVE-SPA-MW algorithm. First, uninformative variables are eliminated by MC-UVE and collinear variables are eliminated by SPA; then, the wavelength variables are selected by extending outward, in conjunction with a moving window, from the optimal wavelength points chosen by MC-UVE-SPA. This reduces the number of isolated wavelength variables, preserves the continuity between informative wavelength points in an NIR spectrum, and is expected to improve the accuracy of the established prediction model.

2. Materials and Methods

2.1. Experiments and Data. Experiments based on the NIR spectral absorbance data of corn, diesel fuel, and ethylene were employed for verifying the wavelength variable selection performance of the proposed MC-UVE-SPA-MW algorithm. They were conducted using the libPLS toolkit [24], while the remaining code was written and executed in the MATLAB R2017b environment.

2.1.1. Corn Spectral Data. The NIR spectral absorbance data for corn were provided by Eigenvector Research, Inc. (http://www.eigenvector.com/data/Corn/index.html). The m5 spectra of the corn data set consist of 80 corn samples measured over a wavelength range of 1100–2498 nm in 2 nm intervals. Accordingly, the data set includes a total of 700 wavelength points. It also contains four component reference values of moisture, oil, protein, and starch contents determined by chemical methods for each sample. Table 1 shows the maximum, minimum, and average values of the relative concentrations of moisture, oil, protein, and starch in the 80 corn samples.


2.1.2. Diesel Fuel Spectral Data. The NIR spectral absorbance data for diesel fuel were provided by the Southwest Research Institute (SWRI) (http://www.eigenvector.com/data/SWRI/index.html). The data set comprises unprocessed spectra derived from 784 diesel fuel samples measured over a wavelength range of 750–1550 nm in 2 nm intervals. Accordingly, the data set includes a total of 401 wavelength points. The data set also contains various properties, including the boiling point, cetane number, density, freezing point, total aromatic hydrocarbon content, and viscosity. Samples with missing values (NaN) for a given property are eliminated during the experiment. Table 2 shows the maximum, minimum, and average values of the boiling point of diesel fuel.

2.1.3. Ethylene Gas Spectral Data. Ethylene gas samples were prepared within a closed cell filled with nitrogen gas at a pressure of 1 atm and a temperature of 296 K by distributing C2H4 gas into the cell to form samples with 72 known C2H4 concentrations ranging from 6015 ppm to 2005 ppm in 2005 ppm intervals. The C2H4 gas distribution device adopted a gas distribution platform, shown in Figure 1, independently developed by the Hefei Material Science Research Institute of the Chinese Academy of Sciences. Through visual control software, the gas distribution proportion is set according to the requirements, the volume ratio of the auxiliary gas (nitrogen) to the gas to be distributed is adjusted through the high-precision gas distribution platform, and the required concentration of standard gas is configured. Fourier transform infrared (FTIR) spectroscopy was applied to capture the spectral absorbance intensity of the gas in a sealed sample cell. The optical path length of the cell was 10 m, and the range of the measured wavenumbers was 400–5000 cm⁻¹ with a resolution of 1 cm⁻¹. The apodization function was a Hamming window, the number of scans was 16, and a total of 96 spectra at different concentrations were collected.

Accordingly, the data set includes a total of 4601 wavelength points. The absorption spectrum of C2H4 gas obtained from the HITRAN database (http://hitran.iao.ru) over a wavenumber range of 400–5000 cm⁻¹ is shown in Figure 2. Figure 3 presents the background spectral intensity measured after the closed cell was filled with nitrogen gas at room temperature. Figure 4 presents the measured absorption spectral intensity of the cell after adding various concentrations of C2H4 gas. A comparison of Figures 3 and 4 indicates that the spectral intensities in the two regions of 794–1105 cm⁻¹ and 2917–3242 cm⁻¹ are drastically different due to the spectral absorption characteristics of the added C2H4 gas.
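The point counts quoted for the corn and ethylene data follow directly from the stated ranges and step sizes. The short check below is only a sketch; it assumes that both endpoints of each range are sampled, and the helper name `n_grid_points` is introduced here for illustration.

```python
def n_grid_points(start, stop, step):
    """Number of equally spaced points from start to stop, endpoints included."""
    return int(round((stop - start) / step)) + 1


# Corn: 1100 to 2498 nm in 2 nm steps.
corn_points = n_grid_points(1100, 2498, 2)
# Ethylene: 400 to 5000 cm^-1 at 1 cm^-1 resolution.
ethylene_points = n_grid_points(400, 5000, 1)

print(corn_points, ethylene_points)
```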

2.2. Evaluation Indices. The NIR spectral absorbance data are first preprocessed to generate normalized data for facilitating consistent analyses. The normalized data are then divided by the Kennard–Stone method (3 : 1) into a calibration data set and a prediction data set, which are respectively applied for establishing the various regression models and for testing the established models. The extent of information provided by the selected wavelength variables is generally difficult to evaluate directly. Therefore, indirect evaluation methods are usually adopted. Typically, the information value of wavelength variables is evaluated according to the prediction accuracy of the model constructed with the selected wavelengths. The indices for evaluating the prediction accuracy of regression models are the root mean square error of cross validation (RMSECV) for the calibration set and the root mean square error of prediction (RMSEP), the correlation coefficient (r), and the relative percent deviation (RPD) for the prediction set. These indices are defined as follows:

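$$
\begin{aligned}
\mathrm{RMSECV} &= \sqrt{\frac{\sum_{k=1}^{n}\left(\hat{y}_k - y_k\right)^2}{n-1}},\\
\mathrm{RMSEP} &= \sqrt{\frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{n-1}},\\
r &= \frac{\sum_{i=1}^{n}\left(y_i - y_{\mathrm{AVE}}\right)\left(\hat{y}_i - \hat{y}_{\mathrm{AVE}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - y_{\mathrm{AVE}}\right)^2 \sum_{i=1}^{n}\left(\hat{y}_i - \hat{y}_{\mathrm{AVE}}\right)^2}},\\
\mathrm{RPD} &= \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - y_{\mathrm{AVE}}\right)^2}{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}}.
\end{aligned}
\tag{1}
$$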

Here, n is the number of samples in the calibration set or the prediction set; $y_k$ is the measured value and $\hat{y}_k$ is the predicted value of sample k in the calibration set; $y_i$ is the measured value and $\hat{y}_i$ is the predicted value of sample i in the prediction set; and $y_{\mathrm{AVE}}$ and $\hat{y}_{\mathrm{AVE}}$ are the average measured value and the average predicted value, respectively, of all samples in the prediction set.

We note that the evaluated prediction performance increases with decreasing RMSE and increasing r and RPD. The RMSE is denoted as the RMSECV when referring to the value associated with the calibration data set and as the RMSEP when referring to the value associated with the prediction data set.
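The split and the indices of this section can be sketched in a few lines. This is a minimal pure-Python illustration rather than the authors' MATLAB/libPLS code: the Kennard–Stone routine assumes Euclidean distance between spectra, the error functions keep the n − 1 denominator of equation (1), the correlation uses the standard Pearson form, and all function names are introduced here for illustration.

```python
import math


def kennard_stone(X, n_select):
    """Pick n_select row indices of X with the Kennard-Stone rule."""
    n = len(X)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must lie between 2 and len(X)")
    dist = [[math.dist(a, b) for b in X] for a in X]
    # Seed with the two samples that are farthest apart.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    first, second = max(pairs, key=lambda p: dist[p[0]][p[1]])
    selected = [first, second]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        # Add the sample whose nearest selected neighbour is farthest away.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


def rmse(y, y_hat):
    """RMSECV or RMSEP, depending on which set y and y_hat come from."""
    squared = sum((p - t) ** 2 for t, p in zip(y, y_hat))
    return math.sqrt(squared / (len(y) - 1))


def correlation(y, y_hat):
    """Pearson correlation coefficient r between measured and predicted values."""
    y_ave = sum(y) / len(y)
    p_ave = sum(y_hat) / len(y_hat)
    cov = sum((t - y_ave) * (p - p_ave) for t, p in zip(y, y_hat))
    var_y = sum((t - y_ave) ** 2 for t in y)
    var_p = sum((p - p_ave) ** 2 for p in y_hat)
    return cov / math.sqrt(var_y * var_p)


def rpd(y, y_hat):
    """Relative percent deviation as defined in equation (1)."""
    y_ave = sum(y) / len(y)
    spread = sum((t - y_ave) ** 2 for t in y)
    error = sum((p - t) ** 2 for t, p in zip(y, y_hat))
    return math.sqrt(spread / error)
```

For a 3 : 1 split of, say, 80 samples, one would call `kennard_stone(X, 60)` and use the remaining 20 indices as the prediction set.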

Table 1: Statistical analysis of the relative concentrations of corn.

| Component to be tested | Max | Min | Ave |
|---|---|---|---|
| Moisture | 10.9930 | 9.3770 | 10.2335 |
| Oil | 3.8320 | 3.0880 | 3.4984 |
| Protein | 9.7110 | 7.6540 | 8.6683 |
| Starch | 66.4720 | 62.8260 | 64.6956 |

Table 2: Statistical analysis of the boiling point of diesel fuel.

| Property to be tested | Effective sample size | Max | Min | Ave |
|---|---|---|---|---|
| Boiling point | 395 | 297 | 182 | 258.37 |

2.3. MC-UVE-SPA Method. The fundamental basis of UVE is to use the stability of the regression coefficient vector characteristic of a constructed PLS multiple regression model as a measure of the significance of a given wavelength. However, UVE tends to suffer from model overfitting [25]. This was addressed by the development of Monte Carlo (MC) UVE (MC-UVE), proposed by Cai et al. [26], which replaces the leave-one-out cross-validation (LOOCV) process for calculating the regression coefficient matrix $\beta = [\beta_1, \beta_2, \ldots]$ in conventional UVE with the MC cross-validation (MCCV) process. The reliability of each variable j can be quantitatively measured by

$$
S_j = \frac{\mathrm{mean}(\beta_j)}{\mathrm{std}(\beta_j)},
\tag{2}
$$

where $\mathrm{mean}(\beta_j)$ and $\mathrm{std}(\beta_j)$ are the mean and standard deviation of the regression coefficients of variable j. The greater the absolute value of the stability, the more important the corresponding variable. The stability of uninformative variables should be less than a threshold.
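Equation (2) can be illustrated once the regression coefficients from the Monte Carlo runs are available. The sketch below assumes they are stored as `coef_runs[m][j]` (run m, variable j), uses the sample standard deviation, and treats the threshold as a user-supplied number; these names and choices are illustrative assumptions, and fitting the PLS models themselves is left out.

```python
import statistics


def stability(coef_runs):
    """Stability S_j = mean(beta_j) / std(beta_j) for every variable j."""
    n_vars = len(coef_runs[0])
    values = []
    for j in range(n_vars):
        column = [run[j] for run in coef_runs]
        # A coefficient that never varies would give std = 0; not handled here.
        values.append(statistics.mean(column) / statistics.stdev(column))
    return values


def keep_informative(coef_runs, threshold):
    """Indices of variables whose absolute stability exceeds the threshold."""
    return [j for j, s in enumerate(stability(coef_runs)) if abs(s) > threshold]


# Toy coefficients from three Monte Carlo runs and three variables.
runs = [[1.0, 0.1, -2.0],
        [1.2, -0.1, -2.2],
        [0.8, 0.0, -1.8]]
print(stability(runs))
print(keep_informative(runs, threshold=3.0))
```

In the toy data the second variable has coefficients scattered around zero, so its stability is close to zero and it is dropped, while a consistently negative coefficient is kept because the absolute value is compared with the threshold.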

The SPA, first proposed by Bregman [27], is a forward-cycling variable selection method. For spectral data analysis, each cycle of the process calculates the projection of a selected wavelength on an unselected wavelength and includes the unselected wavelength with the largest projection vector in the set of selected wavelengths [28]. This process is repeated for each selected wavelength as it is added to the set until the selected wavelength set includes a specified number of wavelengths [16]. More detailed information on the steps of SPA can be found in the literature [16, 29]. In selecting the next wavelength, each newly selected wavelength has the lowest correlation with the previous one. Therefore, SPA can effectively eliminate collinear wavelength variables and reduce the number of dimensions of the sample spectrum, which accordingly reduces the calculation burden of the model.
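The projection cycle can be sketched as follows. This is a hedged illustration of the usual SPA formulation, assumed here rather than taken from the paper: every unselected column is projected onto the orthogonal complement of the most recently selected column, and the column with the largest remaining norm is added next. The function name `spa` and the input layout (`columns[j]` holds the values of wavelength j over all samples) are illustrative.

```python
def spa(columns, start, n_select):
    """Select n_select column indices by successive projections, beginning at start."""
    if not 1 <= n_select <= len(columns):
        raise ValueError("n_select must lie between 1 and the number of columns")

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    work = [list(c) for c in columns]  # working copies, projected step by step
    selected = [start]
    while len(selected) < n_select:
        ref = work[selected[-1]]
        ref_norm2 = dot(ref, ref)  # zero only if the data are rank deficient
        best, best_norm2 = None, -1.0
        for j, col in enumerate(work):
            if j in selected:
                continue
            # Remove from column j its component along the last selected column.
            coef = dot(col, ref) / ref_norm2
            work[j] = [x - coef * r for x, r in zip(col, ref)]
            norm2 = dot(work[j], work[j])
            if norm2 > best_norm2:
                best, best_norm2 = j, norm2
        selected.append(best)
    return selected


# Column 1 is collinear with column 0 and is therefore never preferred.
toy = [[1.0, 0.0, 0.0],
       [2.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.5, 0.0, 3.0]]
print(spa(toy, start=0, n_select=3))
```

Because each working column has already been made orthogonal to the earlier selections, projecting against only the latest selected column is enough. Running the routine once per starting wavelength and per subset size, and scoring each subset by cross validation, would mirror the selection procedure described in the text.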

Figure 1: Schematic diagram of the standard gas spectra acquisition system (labeled components: FTIR spectrometer, infrared source, multipass gas cell, pump, detector, computer, gas distribution system, target gas, gas sampling hoses, and serial port and network connections).

Figure 2: NIR absorbance spectrum for C2H4 gas from the HITRAN database (x-axis: wavenumber, 500–5000 cm⁻¹; y-axis: absorbance).


The MC-UVE-SPA method is a combination of MC-UVE and SPA. Jiangbo Li et al. showed that the combination (MC-UVE-SPA) of Monte Carlo uninformative variable elimination (MC-UVE) and the successive projections algorithm (SPA) was more effective than MC-UVE or SPA alone [30]. Although UVE-SPA performs better than UVE or SPA alone, there is still room for improvement. In this paper, the UVE-SPA is improved by exploiting the continuity of effective wavelengths, and its effectiveness is verified by experiments.

Figure 3: Measured background NIR spectrum with the cell filled with nitrogen gas (N2) (x-axis: wavenumber, 500–5000 cm⁻¹; y-axis: intensity, ×10^4).

Figure 4: Measured C2H4 gas absorbance spectrum (x-axis: wavenumber, cm⁻¹; y-axis: intensity, ×10^4).

2.4. Proposed MC-UVE-SPA-MW Wavelength Selection Algorithm. The proposed wavelength selection algorithm first applies MC-UVE to the calibration data set to construct a PLS regression model. The threshold of the MC-UVE process is set to provide the number of wavelength variables that minimizes the RMSECV of the constructed PLS regression model. The largest number of principal components (PCs) was set to 10, and the optimal number of PCs was determined based on the minimum RMSECV value. Subsequently, the wavelength variables retained by the MC-UVE algorithm are applied as the input of the SPA. Here, an MLR model is constructed based on the wavelength variables selected by the SPA for conducting cross-validation analysis, where the number of selected wavelength variables is determined according to the minimum of the RMSECV of the constructed MLR model. To reduce the number of isolated wavelength variables and maintain the continuity of adjacent informative wavelength points of the near-infrared spectrum, the selection is then extended outward from the best wavelength points selected by UVE-SPA. In the original spectrum, each optimal wavelength point selected by the MC-UVE-SPA is used as the starting point of a moving window of width w. The wavelengths in the moving window are used as the selected wavelengths, and the number of wavelengths finally selected varies with the window width. The window width is set to w = 2 (Left), 2 (Right), or 3. Here, 2 (Left) means extending the selected wavelength point to the left to cover 2 wavelength points, 2 (Right) means extending the selected wavelength point to the right to cover 2 wavelength points, and 3 means extending to 3 wavelength points with the selected wavelength point as the center. The optimal window width is determined by the minimum RMSECV of the MLR model. The processing flow of the proposed MC-UVE-SPA-MW algorithm and the outward extension of a wavelength point are illustrated in Figure 5.
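The three window settings can be sketched on index sets. In this minimal illustration, the mode labels `"2L"`, `"2R"`, and `"3"` follow the w = 2 (Left), 2 (Right), and 3 settings of the text, while the function names and the caller-supplied `cv_error` scoring function are hypothetical stand-ins for the RMSECV of the MLR model.

```python
# Offsets added to each selected index for the three window settings.
WINDOW_OFFSETS = {"2L": (-1, 0), "2R": (0, 1), "3": (-1, 0, 1)}


def extend_selection(selected, n_vars, mode):
    """Extend every selected wavelength index by the moving window given by mode."""
    extended = set()
    for idx in selected:
        for offset in WINDOW_OFFSETS[mode]:
            j = idx + offset
            if 0 <= j < n_vars:  # stay inside the spectrum
                extended.add(j)
    return sorted(extended)


def best_window(selected, n_vars, cv_error, modes=("2L", "2R", "3")):
    """Return the mode, and its extended index set, with the smallest cv_error."""
    candidates = {m: extend_selection(selected, n_vars, m) for m in modes}
    best = min(modes, key=lambda m: cv_error(candidates[m]))
    return best, candidates[best]


# A single selected point at index 3 of a 7-point spectrum.
for mode in ("3", "2L", "2R"):
    print(mode, extend_selection([3], 7, mode))
```

Overlapping windows are merged by the set, which is why the final number of variables is usually smaller than the window width times the number of MC-UVE-SPA points.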

3. Results and Discussion

3.1. Corn Spectral Data Experiments. The wavelength variable stability distribution map of the PLS regression model reflecting the oil concentration in corn, constructed for the calibration set using the MC-UVE algorithm, is presented in Figure 6. Here, all wavelengths whose stability is greater than the threshold value shown by the horizontal red line in the figure are selected for use in the model. This threshold was selected to provide the number of wavelength variables corresponding to the minimum RMSECV of the constructed PLS regression model. This is illustrated in Figure 7, where the RMSECV of the constructed PLS regression model is plotted with respect to the number of selected wavelength variables. It can be seen from Figure 7 that the RMSECV is relatively large when the number of wavelength variables is small, and the RMSECV drops sharply as the number of selected variables increases. This is because an overly small number of wavelength variables excludes useful information, and the prediction accuracy of the model is therefore improved as an increasing amount of useful information is incorporated into the model. A minimum value of RMSECV = 0.0289 is obtained when the number of selected wavelength variables is 106, and the RMSECV increases again when the number of variables exceeds 106. This increase results from the impact of selecting an increasing number of uninformative variables on the prediction accuracy of the model. We also note that the RMSECV changes very little when the number of wavelength variables exceeds 300. Thus, the MC-UVE algorithm eliminates a large number of wavelengths that are not related to the oil concentration of corn, where the final number of selected wavelength variables is just 15.1% of the full-spectrum value of 700.
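Choosing the threshold in this way amounts to ranking the variables by absolute stability and scanning the number retained for the smallest cross-validated error. The sketch below assumes a caller-supplied `cv_error(subset)` function, a hypothetical stand-in for the RMSECV of a PLS model built on that subset; the function name `select_by_min_cv` is likewise illustrative.

```python
def select_by_min_cv(stabilities, cv_error):
    """Keep the top-k variables by |stability|, with k chosen to minimize cv_error."""
    order = sorted(range(len(stabilities)),
                   key=lambda j: abs(stabilities[j]), reverse=True)
    best_k, best_err = 0, float("inf")
    for k in range(1, len(order) + 1):
        err = cv_error(order[:k])
        if err < best_err:  # the first minimum wins on ties
            best_k, best_err = k, err
    return order[:best_k], best_err


# Toy example: the error depends only on how many variables are retained.
toy_error = {1: 0.5, 2: 0.2, 3: 0.3, 4: 0.4}
subset, err = select_by_min_cv([0.5, -4.0, 2.0, 0.1],
                               lambda s: toy_error[len(s)])
print(subset, err)
```

The stability threshold then corresponds to the absolute stability of the last variable retained.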

The 106 wavelength variables selected by MC-UVE are then used as the inputs of the SPA, which iteratively generates wavelength variable combinations using each wavelength as a starting point and applies them for constructing an MLR model. The wavelength combination corresponding to the minimum RMSECV of the MLR model is then taken as the optimal wavelength combination. The relationship between the number of selected wavelength variables and the RMSECV of the MLR model constructed from variables selected by the MC-UVE-SPA is shown in Figure 8, where we note that the minimum RMSECV is obtained when the number of selected variables is 37. Thus, the SPA further reduces the number of informative wavelengths, mainly by eliminating collinear variables in the MLR model, where the final number of selected wavelength variables is reduced to just 5.3% of the full-spectrum value of 700.

In the original spectrum, each optimal wavelength point selected by the MC-UVE-SPA is used as the starting point or center of a moving window of width w = 2 (Left), 2 (Right), or 3. The results of the PLS or MLR models constructed using the wavelength variables selected by the different algorithms are shown in Figure 9, and the details are listed in Table 3 along with the results obtained for the different models. In Table 3, the optimal number of PLS principal components was 10. As shown in Table 3, 37 characteristic wavelengths were selected by the MC-UVE-SPA, accounting for only 5.3% of the total number of wavelengths, and the accuracy of this algorithm is better than that of the MC-UVE algorithm, which is due to the elimination of wavelength collinearity. The MC-UVE-SPA-MW extends the wavelengths selected by the MC-UVE-SPA using the algorithm proposed in this paper. When the window width is w = 2 (Left) or w = 3, the model accuracy of the MC-UVE-SPA-MW algorithm is higher than that of the MC-UVE-SPA. When w = 2 (Left), the MC-UVE-SPA-MW expands the 37 wavelength variables selected by the MC-UVE-SPA to 64. At this point, the RMSEP is 0.0381, the r value is 0.9713, the RPD value is 16.3666, and the model is optimal. Although the MC-UVE-SPA-MW provides improved continuity by increasing the number of wavelength variables from those obtained by the MC-UVE-SPA, the final number is still just 9.1% of the full-spectrum value of 700.

The wavelength variables selected from the NIR spectral absorbance data of a single sample by the MC-UVE, MC-UVE-SPA, and proposed MC-UVE-SPA-MW algorithms are compared in Figure 10. The results in Figure 10 can be explained by the fact that oil is a complex organic molecule with infrared and NIR spectral absorption that occupies a wide wavenumber band ranging over 3900–12000 cm⁻¹ (833–2564 nm). This is mainly caused by the frequency doubling and frequency combinations of the stretching and vibrational energy level transitions of hydrogen-containing groups. From the results of Figure 10, we note that the wavelength variables selected by the MC-UVE, MC-UVE-SPA, and proposed MC-UVE-SPA-MW algorithms are mainly distributed between 1662–1790, 2222–2268, 2288–2316, 2390–2428, and 2476–2498 nm, which is exactly the range of the spectral absorption peaks generated by the first and second frequency doubling of the -C-H stretching vibrations of the -CH2, -CH3, and -CH-CH- functional groups of oil [31].
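The wavelength equivalents quoted for the wavenumber band follow from the relation λ (nm) = 10^7 / ν (cm⁻¹). A one-function check, with the function name introduced here for illustration:

```python
def wavenumber_to_wavelength_nm(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in nm."""
    return 1e7 / wavenumber_cm1


for nu in (3900, 12000):
    print(nu, round(wavenumber_to_wavelength_nm(nu)))
```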

We note from Figure 10 that the moving window employed by the MC-UVE-SPA-MW algorithm expands the wavelength variables selected by the MC-UVE-SPA, resulting in a greater number of wavelength variables than that obtained by the MC-UVE-SPA, and the improved continuity of the wavelength variables selected by the MC-UVE-SPA-MW algorithm is very apparent in Figure 10 compared with the wavelength variables selected by the MC-UVE-SPA. We can also note from Table 3 that the full-spectrum model was relatively complicated and its prediction accuracy was the worst of all models considered, owing to the impact of the large number of uninformative wavelength variables included within the model. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. We also note from the table that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths, and the MC-UVE-SPA-MW (w = 2L) algorithm provided a model with the greatest prediction accuracy.

32 Diesel Spectral Data Experiments e number ofwavelength variables selected from the NIR spectral data ofdiesel fuel reflecting the boiling point by the MC-UVE MC-UVE-SPA and MC-UVE-SPA-MW (w 3 2L 2R) algo-rithms were respectively 262 30 83 58 and 59 as shown inTable 4 ese respectively represent 653 75 207

Begin

Preprocessing

MCUVE

Moving window improved

End

SPA

w = 3

w = 2 L

w = 2 R

Extending the wavelength point outwardwith the window width w

0 0 0 1 0 0 0 0 0 1 1 1 0 0

0 0 0 1 0 0 0 0 0 1 1 0 0 0

0 0 0 1 0 0 0 0 0 0 1 1 0 0

Figure 5 Processing flow of the proposed moving-window-improved MC-UVE-SPA (MC-UVE-SPA-MW) wavelength selection algo-rithm based on the Monte Carlo uninformative variable elimination (MC-UVE) algorithm cascaded with the successive projectionsalgorithm (SPA)

Figure 6: Wavelength variable stability distribution map of the partial least squares (PLS) regression model reflecting the oil concentration using the MC-UVE algorithm. (Axes: wavelength variable, stability index.)


14.5%, and 14.7% of the 401 wavelength variables included in the full spectrum. The prediction results of the PLS or MLR models constructed from the selected wavelength variables are shown in Figure 11, and the details are listed in Table 4 along with the results obtained for a full-spectrum PLS model. We note from Table 4 and Figure 11 that the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are greatly simplified compared with the full-spectrum model. MC-UVE retains 262 wavelength points, and its prediction accuracy is the worst of all the models considered, which may be due to collinearity among the retained wavelengths. When the SPA is used to further screen the wavelength points selected by MC-UVE, only 30 wavelength points are retained, while the prediction accuracy of the model is greatly improved: the RMSEP is reduced to 8.8676, the r value is increased to 0.9341, and the RPD value is increased to 2.4650. We note from Figure 12 that the moving window employed by the MC-UVE-SPA-MW expands the wavelength variables selected by the MC-UVE-SPA and improves the continuity of the selected wavelength variables. When the window width is w = 2 (Left), 2 (Right), or 3, the accuracy of all three models obtained by the MC-UVE-SPA-MW is improved. When w = 3, the MC-UVE-SPA-MW expands the 30 wavelength variables selected by the MC-UVE-SPA to 83. At this point, the RMSEP is reduced to 5.9694, the r value is increased to 0.9752, the RPD value is increased to 3.9994, and the model is optimal. We can also note from Table 4 and Figure 11 that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.
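The proportions quoted at the start of this subsection follow from simple arithmetic on the variable counts. The short Python sketch below reproduces them; the function name and the dictionary keys are illustrative and do not come from the original study.

```python
def share_of_full_spectrum(n_selected, n_total):
    """Percentage of full-spectrum variables retained, rounded to one decimal."""
    return round(100.0 * n_selected / n_total, 1)


# Variable counts reported for the diesel fuel data (401 variables in total).
diesel_counts = {
    "MC-UVE": 262,
    "MC-UVE-SPA": 30,
    "MC-UVE-SPA-MW (w = 3)": 83,
    "MC-UVE-SPA-MW (w = 2L)": 58,
    "MC-UVE-SPA-MW (w = 2R)": 59,
}

diesel_shares = {
    name: share_of_full_spectrum(count, 401)
    for name, count in diesel_counts.items()
}

for name, share in diesel_shares.items():
    print(f"{name}: {share}%")
```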

3.3. Ethylene Gas Spectral Data Experiments. The numbers of wavelength variables selected from the spectral data reflecting the C2H4 concentration by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were 214, 17, 48, 34, and 34, respectively, as shown in Figure 13. These respectively represent 4.7%, 0.37%, 1.0%, 0.74%, and 0.74% of the 4601 wavelength variables included in the full spectrum. It can be determined from Figure 13 that more than half of the selected wavelength variables fall within the strong absorption regions in the wavenumber ranges 794~1105 cm⁻¹ and 2917~3242 cm⁻¹. These results can be explained by the description given on the HITRAN web page, which states that the absorption spectral band of C2H4 gas lies in the range of 614~3242 cm⁻¹ and that the two isotopologues H2¹²C¹²CH2 and H2¹²C¹³CH2 of C2H4 present strong absorption bands in the wavenumber ranges of 794~1105 cm⁻¹ and 2917~3242 cm⁻¹, respectively. It can be seen from Figure 4 that, in some regions outside the C2H4 absorption bands, the spectral intensity has a significant linear relationship with the C2H4 content. This may be due to interference from the background spectrum as the C2H4 concentration changes, which explains why wavelength points are also selected in some regions outside the C2H4 absorption bands.

The details of the prediction results of the PLS or MLR models constructed from the selected wavelength variables are listed in Table 5, along with the results obtained for a full-spectrum PLS model. We again note from the table that the full-spectrum model is the most complicated and that its prediction accuracy was the worst of all models considered. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. Of the five wavelength selection algorithms, we again note that the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.

Figure 7: Relationship between the root mean square error (RMSE) of the corn oil concentration predicted by the PLS regression model and the number of selected wavelength variables. (Axes: number of wavelength variables, RMSEC; annotated value: 0.028931.)

Figure 8: Relationship between the RMSE of the corn oil concentration predicted by the multiple linear regression (MLR) model and the number of selected wavelength variables using the MC-UVE-SPA. (Axes: number of variables included in the model, RMSE; annotation: final number of selected variables 37, RMSE = 0.011356.)


3.4. Summary of the Experimental Results. It can be noted from the above experimental results that the prediction accuracy of the models established with the wavelength selection algorithms is higher than that of the full-spectrum model. The MC-UVE-SPA selects the fewest characteristic wavelengths and eliminates the collinearity

Figure 9: Comparison of corn oil concentration predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3). (Axes in each panel: measured concentration, predicted concentration.)

Table 3: Number of selected wavelengths and corn oil concentration prediction results.

Wavelength selection algorithm   Number of wavelengths   Model type   RMSECV   RMSEP    r        RPD
Full spectrum                    700                     PLS          0.1431   0.1345   0.9257   1.0550
MC-UVE                           102                     PLS          0.0289   0.1447   0.9024   0.7948
MC-UVE-SPA                       37                      MLR          0.0114   0.1225   0.9267   2.0225
MC-UVE-SPA-MW (w = 2L)           64                      MLR          0.0035   0.0381   0.9713   16.3666
MC-UVE-SPA-MW (w = 2R)           64                      MLR          0.0101   0.1009   0.8062   1.3703
MC-UVE-SPA-MW (w = 3)            82                      MLR          0.0089   0.0532   0.9335   6.1627

Figure 10: Wavelengths selected from the NIR spectral data reflecting corn oil concentration by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms. (Axes in each panel: wavelength (nm), absorbance.)


between variables. The prediction accuracy of the model established by the MC-UVE-SPA is higher than that of the model established by MC-UVE. The MC-UVE-SPA-MW finally selects more characteristic wavelengths than the MC-UVE-SPA but achieves better prediction accuracy.

Table 4: Number of selected wavelengths and boiling point prediction results.

Wavelength selection algorithm   Number of wavelengths   Model type   RMSECV   RMSEP     r        RPD
Full spectrum                    401                     PLS          4.6350   8.3841    0.9103   2.3704
MC-UVE                           262                     PLS          4.3184   10.0221   0.8311   1.7710
MC-UVE-SPA                       30                      MLR          3.5503   8.8676    0.9341   2.4650
MC-UVE-SPA-MW (w = 2L)           58                      MLR          3.9921   7.0442    0.9534   2.8212
MC-UVE-SPA-MW (w = 2R)           59                      MLR          4.0024   6.8573    0.9556   2.8982
MC-UVE-SPA-MW (w = 3)            83                      MLR          2.6478   5.9694    0.9752   3.9994

The maximum number of PCs was again set to 10 for both PLS models.

Figure 11: Comparison of diesel fuel boiling point predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3). (Axes in each panel: measured boiling point (°C), predicted boiling point (°C).)

Figure 12: Wavelengths selected from the NIR spectral data reflecting the boiling point of diesel fuel by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms. (Axes in each panel: wavelength (nm), absorbance.)


4. Conclusions

The present study addressed the sparsity of the wavelength variables selected by the cascaded MC-UVE-SPA through the application of a moving window, which improved the continuity of the selected wavelength variables and thereby better exploited the advantages of the MC-UVE algorithm and the SPA to obtain regression models with high prediction accuracy. The advantages of the proposed MC-UVE-SPA-MW were demonstrated by applying the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms to the selection of wavelength variables from the NIR spectral absorbance data of corn, diesel fuel, and ethylene. PLS and MLR models reflecting the oil content of corn, the boiling point of diesel fuel, and the ethylene concentration were thereby established and tested. The experimental results demonstrated that the progressive elimination of uncorrelated and collinear variables generated increasingly simplified partial-spectrum models with greater prediction accuracy than the full-spectrum model. Among the three wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelength variables, and the proposed MC-UVE-SPA-MW algorithm provided the models with the greatest prediction accuracy.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by grants from the Major National Science and Technology Special Project of China (JZ2015KJZZ0254) and the Key Projects of Natural Science Research in Universities in Anhui, China (KJ2018A0544).

References

[1] C. Pasquini, "Near infrared spectroscopy: fundamentals, practical aspects and analytical applications," Journal of the Brazilian Chemical Society, vol. 14, no. 2, pp. 198–219, 2003.

[2] X. Sun, M. Zhou, and Y. Sun, "Spectroscopy quantitative analysis cotton content of blend fabrics," International Journal of Clothing Science and Technology, vol. 28, no. 1, pp. 65–76, 2016.

Figure 13: Measured C2H4 gas absorbance spectra and wavelength variables selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms. (Axes: wavenumber (cm⁻¹), absorbance; legend: Original, MC-UVE, MC-UVE-SPA, MC-UVE-SPA-MW.)

Table 5: Number of selected wavelengths and C2H4 gas concentration prediction results.

Wavelength selection algorithm   Number of wavelengths   Model type   RMSECV   RMSEP    r
Full spectrum                    4601                    PLS          0.1431   0.1342   0.9801
MC-UVE                           214                     PLS          0.1009   0.1212   0.9887
MC-UVE-SPA                       17                      MLR          0.1375   0.0537   0.9913
MC-UVE-SPA-MW (w = 2L)           34                      MLR          0.0989   0.0495   0.9988
MC-UVE-SPA-MW (w = 2R)           34                      MLR          0.0988   0.0486   0.9990
MC-UVE-SPA-MW (w = 3)            48                      MLR          0.1005   0.0434   0.9999

The maximum number of PCs was again set to 10 for both PLS models.


[3] M. Schwanninger, J. C. Rodrigues, and K. Fackler, "A review of band assignments in near infrared spectra of wood and wood components," Journal of Near Infrared Spectroscopy, vol. 19, no. 5, pp. 287–308, 2011.

[4] Y.-H. Yun, Y.-C. Wei, X.-B. Zhao, W.-J. Wu, Y.-Z. Liang, and H.-M. Lu, "A green method for the quantification of polysaccharides in Dendrobium officinale," RSC Advances, vol. 5, no. 127, pp. 105057–105065, 2015.

[5] C. K. Vance, D. R. Tolleson, K. Kinoshita, J. Rodriguez, and W. J. Foley, "Near infrared spectroscopy in wildlife and biodiversity," Journal of Near Infrared Spectroscopy, vol. 24, no. 1, pp. 1–25, 2016.

[6] H. Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, vol. 24, no. 6, pp. 417–441, 1933.

[7] P. Geladi and B. R. Kowalski, "Partial least-squares regression: a tutorial," Analytica Chimica Acta, vol. 185, no. 1, pp. 1–17, 1986.

[8] A. A. Kardamakis and N. Pasadakis, "Autoregressive modeling of near-IR spectra and MLR to predict RON values of gasolines," Fuel, vol. 89, no. 1, pp. 158–161, 2010.

[9] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[10] B. Lu, N. Liu, H. Li, et al., "Quantitative determination and characteristic wavelength selection of available nitrogen in coco-peat by NIR spectroscopy," Soil and Tillage Research, vol. 191, pp. 266–274, 2019.

[11] M. J. Anzanello and F. S. Fogliatto, "A review of recent variable selection methods in industrial and chemometrics applications," European Journal of Industrial Engineering, vol. 8, no. 5, p. 619, 2014.

[12] Y.-H. Yun, H.-D. Li, B.-C. Deng, and D.-S. Cao, "An overview of variable selection methods in multivariate analysis of near-infrared spectra," TrAC Trends in Analytical Chemistry, vol. 113, pp. 102–115, 2019.

[13] B. Nadler and R. R. Coifman, "The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration," Journal of Chemometrics, vol. 19, no. 2, pp. 107–118, 2005.

[14] T. Mehmood, K. H. Liland, L. Snipen, and S. Sæbø, "A review of variable selection methods in partial least squares regression," Chemometrics and Intelligent Laboratory Systems, vol. 118, no. 8, pp. 62–69, 2012.

[15] V. Centner, D.-L. Massart, O. E. de Noord, S. de Jong, B. M. Vandeginste, and C. Sterna, "Elimination of uninformative variables for multivariate calibration," Analytical Chemistry, vol. 68, no. 21, pp. 3851–3858, 1996.

[16] G. Tang, Y. Huang, K. Tian, et al., "A new spectral variable selection pattern using competitive adaptive reweighted sampling combined with successive projections algorithm," The Analyst, vol. 139, no. 19, pp. 4894–4902, 2014.

[17] Y. Li, Y. Guo, C. Liu, et al., "SPA combined with swarm intelligence optimization algorithms for wavelength variable selection to rapidly discriminate the adulteration of apple juice," Food Analytical Methods, vol. 10, no. 6, pp. 1965–1971, 2017.

[18] J.-B. Li, C.-J. Zhao, W.-Q. Huang, et al., "A combination algorithm for variable selection to determine soluble solid content and firmness of pears," Analytical Methods, vol. 6, no. 7, pp. 2170–2180, 2014.

[19] Z. Xiaobo, Z. Jiewen, M. Hanpin, S. Jiyong, Y. Xiaopin, and L. Yanxiao, "Genetic algorithm interval partial least squares regression combined successive projections algorithm for variable selection in near-infrared quantitative analysis of pigment in cucumber leaves," Applied Spectroscopy, vol. 64, no. 7, pp. 786–794, 2010.

[20] S. Ye, D. Wang, and S. Min, "Successive projections algorithm combined with uninformative variable elimination for spectral variable selection," Chemometrics and Intelligent Laboratory Systems, vol. 91, no. 2, pp. 194–199, 2008.

[21] B.-C. Deng, Y.-H. Yun, P. Ma, C.-C. Lin, D.-B. Ren, and Y.-Z. Liang, "A new method for wavelength interval selection that intelligently optimizes the locations, widths and combinations of the intervals," The Analyst, vol. 140, no. 6, pp. 1876–1885, 2015.

[22] W. Fan, Y.-Y. Li, Y.-K. Peng, et al., "Nondestructive determination of lycopene content based on visible/near infrared transmission spectrum," Chinese Journal of Analytical Chemistry, vol. 46, no. 9, pp. 1424–1431, 2018.

[23] Z. Sun, J. Fan, J. Wang, et al., "Assessment of the human albumin in acid precipitation process using NIRS and multi-variable selection methods combined with SPA," Journal of Molecular Structure, vol. 1199, p. 126942, 2020.

[24] H.-D. Li, Q.-S. Xu, and Y.-Z. Liang, "libPLS: an integrated library for partial least squares regression and linear discriminant analysis," Chemometrics and Intelligent Laboratory Systems, vol. 176, pp. 34–43, 2018.

[25] R. Zhang, Y.-Y. Chen, Z.-B. Wang, and L. Kewu, "A novel ensemble L1 regularization based variable selection framework with an application in near infrared spectroscopy," Chemometrics and Intelligent Laboratory Systems, vol. 163, pp. 7–15, 2017.

[26] W. Cai, Y. Li, and X. Shao, "A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra," Chemometrics and Intelligent Laboratory Systems, vol. 90, no. 2, pp. 188–194, 2008.

[27] L. M. Bregman, "Finding the common point of convex sets by the method of successive projection," Proceedings of the USSR Academy of Sciences, vol. 162, no. 3, pp. 487–490, 1965.

[28] X. Peng, T. Shi, A. Song, Y. Chen, and W. Gao, "Estimating soil organic carbon using VIS/NIR spectroscopy with SVMR and SPA methods," Remote Sensing, vol. 6, no. 4, pp. 2699–2717, 2014.

[29] Y.-H. Liu, Q.-Q. Wang, X.-W. Gao, and A.-G. Xie, "Total phenolic content prediction in Flos Lonicerae using hyperspectral imaging combined with wavelengths selection methods," Journal of Food Process Engineering, vol. 42, no. 6, Article ID e13224, 2019.

[30] J. Li, H. Zhang, B. Zhan, Y. Zhang, R. Li, and J. Li, "Nondestructive firmness measurement of the multiple cultivars of pears by Vis-NIR spectroscopy coupled with multivariate calibration analysis and MC-UVE-SPA method," Infrared Physics & Technology, vol. 104, Article ID 103154, 2020.

[31] P. Hourant, V. Baeten, M. T. Morales, M. Meurens, and R. Aparicio, "Oil and fat classification by selected bands of near-infrared spectroscopy," Applied Spectroscopy, vol. 54, no. 8, pp. 1168–1174, 2000.


and variable selection [12]. Among the above-discussed methods, variable selection has become the dominant method of interest in recent years for the development of NIR spectral analysis technology and chemometrics [11–14].

The goal of wavelength selection is to identify the most informative wavelengths for use as variables in partial-spectrum regression models. Here, uninformative wavelength variables have either no effect or a negative effect on the modeling performance. The wavelength selection process fulfils three purposes: (1) providing models with greater predictive capability, (2) obtaining wavelength variables that provide greater modeling efficiency, and (3) providing simpler models with improved interpretability [9]. The most commonly employed wavelength selection algorithms developed thus far include uninformative variable elimination (UVE) and the successive projections algorithm (SPA).

The goal of UVE, first proposed by Centner et al. [15], is not to select variables directly but to eliminate uninformative variables in the spectral data such that only informative wavelength variables remain. The SPA employs simple projections to select variables with a minimum of collinearity, but the variables selected by the SPA may contribute little to multivariate calibration, which can affect model prediction [16]. A significant development in recent years has been the combined use of different algorithms through a cascade strategy, where the results of one wavelength selection algorithm are used as the inputs of the next selection algorithm in a stepwise manner. This can combine the advantages of various wavelength selection algorithms in a complementary way and thereby obtain better and more effective prediction results. Combining a common variable selection method with the SPA can greatly simplify the model and improve the prediction accuracy. This strategy has been used effectively in many studies to address the problem associated with applying the SPA to NIR spectral data by first reducing the dimension of the spectral data with an initial algorithm such as UVE, MC-UVE, particle swarm optimization (PSO), or genetic algorithm (GA) optimization [16–20]. Among them, UVE and MC-UVE are commonly used as the first-stage algorithms ahead of the SPA. For example, Ye et al. proposed the combination of UVE and SPA to integrate the strengths of each and applied it successfully to the NIR spectroscopic analysis of nicotine in tobacco lamina and of active pharmaceutical ingredients in intact tablets. For variable selection, UVE was employed to select informative variables, and the SPA was then used to select, from these informative variables, those with minimum redundant information [20]. Li et al. proposed a new combination of MC-UVE and SPA, in which MC-UVE was employed to select informative variables in the full spectrum and the SPA was employed as a powerful method for further characteristic variable selection [18].

Nonetheless, most of the informative wavelengths in a molecular NIR spectrum typically exhibit some continuity, where wavelength points adjacent to an informative wavelength point also represent informative wavelengths [21]. However, the MC-UVE algorithm and the SPA are both wavelength selection algorithms based on optimal wavelength points, which are most likely isolated points along the full NIR spectrum. The MC-UVE-SPA cascade cannot resolve this wavelength point discontinuity, so it may select the fewest wavelength variables without delivering the best modeling performance. Fan et al. constructed a model for visible/NIR spectral data reflecting the lycopene content based on wavelength variable selection using UVE, SPA, and CARS individually and in various two-stage cascaded combinations [22]. The UVE-SPA combination was found to retain the smallest number of wavelength variables of all the selection algorithms considered, but the prediction accuracy of the model constructed using this wavelength variable set was the worst of all the models obtained. Sun et al. showed that the prediction results of the model constructed by the cascaded wavelength selection algorithm were not always the most accurate and that the prediction results of the improved cascaded wavelength selection algorithm were better than those of the direct two-stage cascaded strategy [23].

Few studies have considered improving the continuity of the wavelengths selected by wavelength point selection algorithms. Therefore, this paper considers the continuity of the wavelengths selected by the MC-UVE-SPA. To this end, a moving-window-improved cascade strategy for wavelength selection is proposed, herein denoted as the MC-UVE-SPA-MW algorithm. First, uninformative variables are eliminated by MC-UVE and collinear variables are eliminated by the SPA; the wavelength variables are then selected by extending outward, with a moving window, from the optimal wavelength points chosen by the MC-UVE-SPA. This reduces the number of isolated wavelength variables, preserves the continuity between informative wavelength points in an NIR spectrum, and is expected to improve the accuracy of the established prediction model.

2. Materials and Methods

2.1. Experiments and Data. Experiments based on the NIR spectral absorbance data of corn, diesel fuel, and ethylene were employed to verify the wavelength variable selection performance of the proposed MC-UVE-SPA-MW algorithm. They were conducted using the libPLS toolkit [24], while the remaining code was written and executed in the MATLAB R2017b environment.

2.1.1. Corn Spectral Data. The NIR spectral absorbance data for corn were provided by Eigenvector Research, Inc. (http://www.eigenvector.com/data/Corn/index.html). The m5 spectra of the corn data set consist of 80 corn samples measured over a wavelength range of 1100~2498 nm in 2 nm intervals. Accordingly, the data set includes a total of 700 wavelength points. It also contains four component reference values (moisture, oil, protein, and starch contents) determined by chemical methods for each sample. Table 1 shows the maximum, minimum, and average values of the relative concentrations of moisture, oil, protein, and starch in the 80 corn samples.


2.1.2. Diesel Fuel Spectral Data. The NIR spectral absorbance data for diesel fuel were provided by the Southwest Research Institute (SWRI) (http://www.eigenvector.com/data/SWRI/index.html). The data set comprises unprocessed spectra derived from 784 diesel fuel samples measured over a wavelength range of 750~1550 nm in 2 nm intervals. Accordingly, the data set includes a total of 401 wavelength points. The data set also contains various properties, including the boiling point, cetane number, density, freezing point, total aromatic hydrocarbon content, and viscosity. Some samples have missing values (NaN) for certain properties; these were eliminated in the experiment. Table 2 shows the maximum, minimum, and average values of the boiling point of diesel fuel.

2.1.3. Ethylene Gas Spectral Data. Ethylene gas samples were prepared within a closed cell filled with nitrogen gas at a pressure of 1 atm and a temperature of 296 K by distributing C2H4 gas into the cell to form samples with 72 known C2H4 concentrations ranging from 6015 ppm to 2005 ppm in 2005 ppm intervals. The C2H4 gas was distributed using the gas distribution platform shown in Figure 1, independently developed by the Hefei Material Science Research Institute of the Chinese Academy of Sciences. Through visual control software, the gas distribution proportion is set as required, the volume ratio of the auxiliary gas (nitrogen) to the gas to be distributed is adjusted by the high-precision gas distribution platform, and the standard gas of the required concentration is prepared. Fourier transform infrared (FTIR) spectroscopy was applied to capture the spectral absorbance intensity of the gas in a sealed sample cell. The optical path length of the cell was 10 m, and the range of the measured wavenumbers was 400~5000 cm⁻¹ with a resolution of 1 cm⁻¹. The apodization function was a Hamming window, the number of scans was 16, and a total of 96 spectra of different concentrations were collected.

Accordingly, the data set includes a total of 4601 wavelength points. The absorption spectrum of C2H4 gas obtained from the HITRAN database (http://hitran.iao.ru) over a wavenumber range of 400~5000 cm⁻¹ is shown in Figure 2. Figure 3 presents the background spectral intensity measured after the closed cell was filled with nitrogen gas at room temperature. Figure 4 presents the measured absorption spectral intensity of the cell after adding various concentrations of C2H4 gas. A comparison of Figures 3 and 4 indicates that the spectral intensities in the two regions of 794~1105 cm⁻¹ and 2917~3242 cm⁻¹ are drastically different owing to the spectral absorption characteristics of the added C2H4 gas.
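The variable counts quoted for the corn and ethylene data sets follow from the stated ranges and sampling intervals, assuming both end points are included and the spacing is uniform (for ethylene, this sketch additionally assumes that the 1 cm⁻¹ resolution is also the sampling spacing). The helper name below is illustrative only.

```python
def n_points(start, stop, step):
    """Number of points on an evenly spaced grid that includes both end points."""
    return (stop - start) // step + 1


corn_points = n_points(1100, 2498, 2)     # corn: 1100 to 2498 nm in 2 nm steps
ethylene_points = n_points(400, 5000, 1)  # ethylene: 400 to 5000 cm^-1 in 1 cm^-1 steps

print(corn_points, ethylene_points)
```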

2.2. Evaluation Indices. The NIR spectral absorbance data are first preprocessed to generate normalized data for consistent analysis. The normalized data are then divided by the Kennard–Stone method (3 : 1) into a calibration data set and a prediction data set, which are applied for establishing the various regression models and for testing the established models, respectively. The extent of the information provided by the selected wavelength variables is generally difficult to evaluate directly. Therefore, indirect evaluation methods are usually adopted. Typically, the information value of wavelength variables is evaluated according to the prediction accuracy of the model constructed with the selected wavelengths. The indices for evaluating the prediction accuracy of regression models are the root mean square error of cross validation (RMSECV) for the calibration set and the root mean square error of prediction (RMSEP), the correlation coefficient (r), and the relative percent deviation (RPD) for the prediction set. These indices are defined as follows:

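$$
\begin{aligned}
\mathrm{RMSECV} &= \sqrt{\frac{\sum_{k=1}^{n}\left(\hat{y}_k - y_k\right)^2}{n-1}}, \\
\mathrm{RMSEP} &= \sqrt{\frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{n-1}}, \\
r &= \frac{\sum_{i=1}^{n}\left(y_i - y_{\mathrm{AVE}}\right)\left(\hat{y}_i - \hat{y}_{\mathrm{AVE}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - y_{\mathrm{AVE}}\right)^2 \sum_{i=1}^{n}\left(\hat{y}_i - \hat{y}_{\mathrm{AVE}}\right)^2}}, \\
\mathrm{RPD} &= \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - y_{\mathrm{AVE}}\right)^2}{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}}.
\end{aligned}
\tag{1}
$$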

Here, n is the number of samples in the calibration set or the prediction set; y_k and ŷ_k are the measured value and the predicted value of sample k in the calibration set; y_i and ŷ_i are the measured value and the predicted value of sample i in the prediction set; and y_AVE and ŷ_AVE are the average measured value and the average predicted value, respectively, of all samples in the prediction set.

We note that the evaluated prediction performance improves with decreasing RMSE and increasing r and RPD. The RMSE is denoted as the RMSECV when referring to the value associated with the calibration data set and as the RMSEP when referring to the value associated with the prediction data set.
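As an illustration of the indices defined above, the following Python sketch computes the RMSE (with the n − 1 denominator), the correlation coefficient r, and the RPD from paired measured and predicted values. The function names and sample values are hypothetical; the study itself used MATLAB and libPLS, so this is not the authors' code. The correlation coefficient is written in the standard Pearson form.

```python
import math


def rmse(measured, predicted):
    """Root mean square error with an (n - 1) denominator."""
    n = len(measured)
    squared = sum((p - m) ** 2 for m, p in zip(measured, predicted))
    return math.sqrt(squared / (n - 1))


def correlation(measured, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(measured)
    m_avg = sum(measured) / n
    p_avg = sum(predicted) / n
    num = sum((m - m_avg) * (p - p_avg) for m, p in zip(measured, predicted))
    den = math.sqrt(
        sum((m - m_avg) ** 2 for m in measured)
        * sum((p - p_avg) ** 2 for p in predicted)
    )
    return num / den


def rpd(measured, predicted):
    """Spread of the measured values relative to the prediction residuals."""
    m_avg = sum(measured) / len(measured)
    spread = sum((m - m_avg) ** 2 for m in measured)
    residual = sum((p - m) ** 2 for m, p in zip(measured, predicted))
    return math.sqrt(spread / residual)


# Hypothetical measured and predicted values, for illustration only.
y_true = [3.0, 3.5, 4.0, 3.2]
y_pred = [3.1, 3.4, 4.1, 3.2]
print(rmse(y_true, y_pred), correlation(y_true, y_pred), rpd(y_true, y_pred))
```

Applied to a cross-validated calibration set, `rmse` plays the role of the RMSECV; applied to the prediction set, it plays the role of the RMSEP.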

2.3. MC-UVE-SPA Method. The fundamental basis of UVE is to use the stability of the regression coefficient vector of a constructed PLS multiple regression model as a measure of the significance of a given wavelength. However, UVE tends to suffer from model overfitting

Table 1: Statistical analysis of the relative concentrations of corn.

Component to be tested   Max       Min       Ave
Moisture                 10.9930   9.3770    10.2335
Oil                      3.8320    3.0880    3.4984
Protein                  9.7110    7.6540    8.6683
Starch                   66.4720   62.8260   64.6956

Table 2: Statistical analysis of the boiling point of diesel fuel.

Property to be tested   Effective sample size   Max   Min   Ave
Boiling point           395                     297   182   258.37


[25]. This was addressed by the development of Monte Carlo (MC) UVE (MC-UVE), proposed by Cai et al. [26], which replaces the leave-one-out cross-validation (LOOCV) process used to calculate the regression coefficient matrix β = [β_1, β_2, …, β_p] in conventional UVE with an MC cross-validation (MCCV) process. The reliability of each variable j can be quantitatively measured by

$$
S_j = \frac{\operatorname{mean}\left(\beta_j\right)}{\operatorname{std}\left(\beta_j\right)},
\tag{2}
$$

where mean(β_j) and std(β_j) are the mean and standard deviation of the regression coefficients of variable j. The greater the absolute value of the stability, the more important the corresponding variable. Uninformative variables are those whose stability is less than a threshold.
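The stability index can be sketched in a few lines of Python. The example assumes that the regression coefficients from each Monte Carlo run are already available as rows of a matrix (fitting the PLS models is outside the scope of the sketch), that the sample standard deviation is intended, and that every variable's coefficients vary across runs so the standard deviation is nonzero. The function names, threshold, and coefficient values are hypothetical.

```python
import statistics


def stability_index(coefficients):
    """Return mean/std of each variable's coefficients across Monte Carlo runs.

    coefficients[run][j] is the regression coefficient of variable j in one run.
    """
    n_vars = len(coefficients[0])
    stabilities = []
    for j in range(n_vars):
        column = [run[j] for run in coefficients]
        # Sample standard deviation (an assumption; the text does not specify).
        stabilities.append(statistics.mean(column) / statistics.stdev(column))
    return stabilities


def retain_informative(coefficients, threshold):
    """Indices of variables whose absolute stability reaches the threshold."""
    return [
        j
        for j, s in enumerate(stability_index(coefficients))
        if abs(s) >= threshold
    ]


# Hypothetical coefficients from three Monte Carlo runs and three variables.
mc_coefficients = [
    [0.9, 0.1, -0.5],
    [1.1, -0.1, -0.6],
    [1.0, 0.0, -0.4],
]
print(stability_index(mc_coefficients))
print(retain_informative(mc_coefficients, threshold=2.0))
```

In this toy case, the second variable has coefficients that fluctuate around zero, so its stability is near zero and it is dropped.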

The SPA, first proposed by Bregman [27], is a forward-cycling variable selection method. For spectral data analysis, each cycle of the process calculates the projection of a selected wavelength on each unselected wavelength and adds the unselected wavelength with the largest projection vector to the set of selected wavelengths [28]. This process is repeated for each selected wavelength as it is added to the set, until the selected wavelength set includes a specified number of wavelengths [16]. More detailed information on the steps of the SPA can be found in the literature [16, 29]. Each newly selected wavelength has the lowest correlation with the previously selected one. Therefore, the SPA can effectively eliminate collinear wavelength variables and reduce the number of dimensions of the sample spectrum, which accordingly reduces the calculation burden of the model.
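The projection step can be illustrated with a minimal Python sketch. It follows the commonly described formulation of the SPA, in which every unselected column is projected onto the subspace orthogonal to the most recently selected column and the column with the largest remaining norm is chosen next. The function name, the fixed starting column, and the toy data are assumptions made for illustration; in the procedure described in this paper, the number of variables is chosen by the minimum RMSECV rather than fixed in advance.

```python
def spa_select(columns, start, n_select):
    """Select n_select column indices by successive projections.

    columns[j] holds the values of variable j over all samples.
    Assumes n_select <= len(columns) and a nonzero starting column.
    """
    work = [list(col) for col in columns]  # columns that get projected in place
    selected = [start]
    while len(selected) < n_select:
        ref = work[selected[-1]]
        ref_norm_sq = sum(v * v for v in ref)
        best_j, best_norm = None, -1.0
        for j, col in enumerate(work):
            if j in selected:
                continue
            # Remove from column j its component along the last selected column.
            scale = sum(a * b for a, b in zip(col, ref)) / ref_norm_sq
            work[j] = [a - scale * b for a, b in zip(col, ref)]
            norm_sq = sum(v * v for v in work[j])
            if norm_sq > best_norm:
                best_j, best_norm = j, norm_sq
        selected.append(best_j)
    return selected


# Toy data: variable 1 is nearly collinear with variable 0, variable 2 is not.
toy_columns = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
print(spa_select(toy_columns, start=0, n_select=2))
```

Starting from variable 0, the nearly collinear variable 1 loses almost all of its norm after projection, so variable 2 is selected instead.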

Figure 1: Schematic diagram of the standard gas spectra acquisition system. (Labelled components: FTIR spectrometer, multipass gas cell, pump, detector, infrared source, computer, gas sampling hoses, target gas, serial port, network, gas distribution system.)

Figure 2: NIR absorbance spectrum for C2H4 gas from the HITRAN database. (Axes: wavenumber (cm⁻¹), absorbance.)


The MC-UVE-SPA method is a combination of MC-UVE and the SPA. Li et al. showed that the combination (MC-UVE-SPA) of Monte Carlo uninformative variable elimination (MC-UVE) and the successive projections algorithm (SPA) was more effective than MC-UVE or SPA alone [30]. Although UVE-SPA performs better than UVE or SPA alone, there is still room for improvement. In this paper, the UVE-SPA is improved by exploiting the continuity of informative wavelengths, and its effectiveness is verified by experiments.

2.4. Proposed MC-UVE-SPA-MW Wavelength Selection Algorithm. The proposed wavelength selection algorithm first applies MC-UVE to the calibration data set to construct a PLS regression model. The threshold of the MC-UVE process is set to provide the number of wavelength variables that minimizes the RMSECV of the constructed PLS regression model. The largest number of principal components (PCs) was set to 10, and the optimal number of PCs was determined based on the minimum RMSECV value. Subsequently, the wavelength variables retained by the MC-UVE algorithm are applied as the input of the SPA. Here, an MLR model is constructed based on the wavelength variables selected by the SPA for conducting cross-validation analysis, where the number of selected wavelength variables is determined according to the minimum RMSECV of the constructed MLR model. In order to reduce the number of isolated wavelength variables and maintain the continuity of adjacent informative wavelength points of the near-infrared

500 1000 1500 2000 2500 3000 3500 4000 4500 5000Wavenumber (cmndash1)

05

1

15

2

25

3

35

Inte

nsity

times104

Figure 3 Measured background NIR spectrum with the cell filled with nitrogen gas (N2)

5000

05

1

15

25

35

2

3

4

1000 1500 2000 3000 4000 50002500 3500 4500

Inte

nsity

Wavenumber (cmndash1)

times104

Figure 4 Measured C2H4 gas absorbance spectrum

Journal of Spectroscopy 5

spectrum it extends outward from the best wavelength pointselected by UVE-SPA In the original spectrum the optimalwavelength point selected by the MC-UVE-SPA is used asthe starting point of a moving window of width w ewavelengths in the moving window are used as the selectedwavelengths and the number of wavelengths finally selectedvaries with the window width Set the window width w 2(Left) 2 (Right) or 3 Here 2 (Left) means to extend theselected wavelength point to the left to 2 wavelength points2 (Right) means to extend the selected wavelength point tothe right to 2 wavelength points and 3 means to extend theselected wavelength point to 3 wavelength points using theselected wavelength point as the center point e optimalwindow width is determined by the minimum RMSECV ofthe MLR model e processing flow of the proposed MC-UVE-SPA-MW algorithm and extending the wavelengthpoint outward are illustrated in Figure 5
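The window extension described in Section 2.4 can be sketched as follows. The sketch assumes that a width of 2 covers the selected point plus one neighbor and a width of 3 covers the selected point plus one neighbor on each side, which is how the example masks in Figure 5 read; the function name `extend_selection` and the mode labels are hypothetical.

```python
def extend_selection(selected, n_points, mode):
    """Extend each selected index with a moving window.

    selected: indices chosen by MC-UVE-SPA (0-based, hypothetical input).
    n_points: number of wavelength points in the full spectrum.
    mode:     "2L", "2R", or "3" (assumed labels for the three window widths).
    """
    offsets = {"2L": (-1, 0), "2R": (0, 1), "3": (-1, 0, 1)}[mode]
    extended = set()
    for idx in selected:
        for off in offsets:
            j = idx + off
            if 0 <= j < n_points:  # do not run past either end of the spectrum
                extended.add(j)
    return sorted(extended)

left = extend_selection([3, 10], 14, "2L")
right = extend_selection([3, 10], 14, "2R")
centered = extend_selection([3, 10], 14, "3")
adjacent = extend_selection([3, 4], 14, "3")
```

Because windows around neighboring points overlap, the extended set can be smaller than the number of selected points times the window width. Under this reading, each window mode would then be scored by the RMSECV of its MLR model and the mode with the lowest value kept.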

3. Results and Discussion

3.1. Corn Spectral Data Experiments. The wavelength variable stability distribution map of the PLS regression model reflecting the oil concentration in corn, constructed for the calibration set using the MC-UVE algorithm, is presented in Figure 6. Here, all wavelengths with stability greater than the threshold value shown by the horizontal red line in the figure are selected for use in the model. This threshold was selected to provide the number of wavelength variables corresponding to the minimum RMSECV of the constructed PLS regression model. This is illustrated in Figure 7, where the RMSECV of the constructed PLS regression model is plotted with respect to the number of selected wavelength variables. It can be seen from Figure 7 that the RMSECV is relatively large when the number of wavelength variables is small and that the RMSECV drops sharply as the number of selected variables increases. This is because an overly small number of wavelength variables excludes useful information, and the prediction accuracy of the model therefore improves as an increasing amount of useful information is incorporated into the model. A minimum RMSECV of 0.0289 is obtained when the number of selected wavelength variables is 106, and the RMSECV increases again when the number of variables exceeds 106. This increase results from the impact of selecting an increasing number of uninformative variables on the prediction accuracy of the model. We also note that the RMSECV changes very little when the number of wavelength variables exceeds 300. Thus, the MC-UVE algorithm eliminates a large number of wavelengths that are not related to the oil concentration of corn, where the final number of selected wavelength variables is just 15.1% of the full-spectrum value of 700.

The optimal set of 106 wavelength variables selected by MC-UVE is then used as the input of the SPA, which iteratively generates wavelength variable combinations using each wavelength as a starting point and applies them to construct an MLR model. The wavelength combination corresponding to the minimum RMSECV of the MLR model is then taken as the optimal wavelength combination. The relationship between the number of selected wavelength variables and the RMSECV of the MLR model constructed from variables selected by the MC-UVE-SPA is shown in Figure 8, where we note that the minimum RMSECV is obtained when the number of selected variables is 37. Thus, the SPA further reduces the number of informative wavelengths, mainly by eliminating collinear variables in the MLR model, where the final number of selected wavelength variables is reduced to just 5.3% of the full-spectrum value of 700.

In the original spectrum, each optimal wavelength point selected by the MC-UVE-SPA is used as the starting point or center of a moving window of width w = 2 (Left), 2 (Right), or 3. The results of the PLS or MLR models constructed using the wavelength variables selected by the different algorithms are shown in Figure 9, and the details are listed in Table 3 along with the results obtained for the different models. In Table 3, the optimal number of PLS principal components was 10. As shown in Table 3, 37 characteristic wavelengths were selected by the MC-UVE-SPA, accounting for only 5.3% of the total number of wavelengths, and the accuracy of this algorithm is better than that of the MC-UVE algorithm, which is due to the elimination of wavelength collinearity. The wavelengths selected by the MC-UVE-SPA are extended by the MC-UVE-SPA-MW algorithm proposed in this paper. When the window width is w = 2 (Left) or w = 3, the model accuracy of the MC-UVE-SPA-MW algorithm is higher than that of the MC-UVE-SPA. When w = 2 (Left), the MC-UVE-SPA-MW expands the 37 wavelength variables selected by the MC-UVE-SPA to 64. At this point, the RMSEP is 0.0381, the r value is 0.9713, the RPD value is 16.3666, and the model is optimal. Although the MC-UVE-SPA-MW provides improved continuity by increasing the number of wavelength variables from those obtained by the MC-UVE-SPA, the final number is still just 9.1% of the full-spectrum value of 700.

The wavelength variables selected from the NIR spectral absorbance data of a single sample by the MC-UVE, MC-UVE-SPA, and proposed MC-UVE-SPA-MW algorithms are compared in Figure 10. The results in Figure 10 follow from the fact that oil is a complex organic molecule whose infrared and NIR spectral absorption occupies a wide wavenumber band ranging over 3900∼12000 cm⁻¹ (833∼2564 nm). This absorption is mainly caused by the frequency doubling and frequency combinations of the stretching and vibrational energy level transitions of hydrogen-containing groups. From the results of Figure 10, we note that the wavelength variables selected by the MC-UVE, MC-UVE-SPA, and proposed MC-UVE-SPA-MW algorithms are mainly distributed over 1662∼1790, 2222∼2268, 2288∼2316, 2390∼2428, and 2476∼2498 nm, which is exactly the range of the spectral absorption peaks generated by the first and second frequency doubling of the -C-H stretching vibrations of the -CH₂, -CH₃, and -CH=CH- functional groups of oil [31].
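The wavenumber band and the wavelength band quoted above are two views of the same interval, related by λ (nm) = 10⁷ / ν (cm⁻¹). A short check, with a hypothetical helper name:

```python
def wavenumber_to_nm(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in nm (lambda = 1e7 / nu)."""
    return 1.0e7 / wavenumber_cm1

# Band limits quoted in the text: 3900 and 12000 cm^-1.
long_edge_nm = wavenumber_to_nm(3900)    # low wavenumber maps to long wavelength
short_edge_nm = wavenumber_to_nm(12000)  # high wavenumber maps to short wavelength
```

Rounded to whole nanometers, the two edges reproduce the 833∼2564 nm range given in parentheses.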

We note from Figure 10 that the moving window employed by the MC-UVE-SPA-MW algorithm expands the wavelength variables selected by the MC-UVE-SPA, resulting in a greater number of wavelength variables than that obtained by the MC-UVE-SPA, and the improved continuity of the wavelength variables selected by the MC-UVE-SPA-MW algorithm is very apparent in Figure 10 compared with the wavelength variables selected by the MC-UVE-SPA. We can also note from Table 3 that the full-spectrum model was relatively complicated, and its prediction accuracy was the worst of all models considered due to the impact of the large number of uninformative wavelength variables included within the model. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. We also note from the table that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths, and the MC-UVE-SPA-MW (w = 2L) algorithm provided the model with the greatest prediction accuracy.

Figure 5: Processing flow of the proposed moving-window-improved MC-UVE-SPA (MC-UVE-SPA-MW) wavelength selection algorithm, based on the Monte Carlo uninformative variable elimination (MC-UVE) algorithm cascaded with the successive projections algorithm (SPA). Flowchart steps: Begin, Preprocessing, MC-UVE, SPA, moving-window improvement (extending the wavelength point outward with the window width w), End. Example selection masks shown for w = 3, w = 2L, and w = 2R, in that order: 0 0 0 1 0 0 0 0 0 1 1 1 0 0; 0 0 0 1 0 0 0 0 0 1 1 0 0 0; 0 0 0 1 0 0 0 0 0 0 1 1 0 0.

Figure 6: Wavelength variable stability distribution map of the partial least squares (PLS) regression model reflecting the oil concentration using the MC-UVE algorithm (x-axis: wavelength variable; y-axis: stability index).

3.2. Diesel Spectral Data Experiments. The numbers of wavelength variables selected from the NIR spectral data of diesel fuel reflecting the boiling point by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were, respectively, 262, 30, 83, 58, and 59, as shown in Table 4. These respectively represent 65.3%, 7.5%, 20.7%, 14.5%, and 14.7% of the 401 wavelength variables included in the full spectrum. The prediction results of the PLS or MLR models constructed from the selected wavelength variables are shown in Figure 11, and the details are listed in Table 4 along with the results obtained for a full-spectrum PLS model. We note from Table 4 and Figure 11 that the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are greatly simplified compared with the full-spectrum model. MC-UVE retains 262 wavelength points, and its prediction accuracy is the worst of all the models considered, which may be due to the existence of wavelength collinearity. When the SPA is used to further screen the wavelength points selected by MC-UVE, only 30 wavelength points are retained, while the prediction accuracy of the model is greatly improved: the RMSEP is reduced to 8.8676, the r value is increased to 0.9341, and the RPD value is increased to 2.4650. We note from Figure 12 that the moving window employed by the MC-UVE-SPA-MW expands the wavelength variables selected by the MC-UVE-SPA and improves the continuity of the selected wavelength variables. When the window width is w = 2 (Left), 2 (Right), or 3, the accuracies of the three models obtained by the MC-UVE-SPA-MW are all improved. When w = 3, the MC-UVE-SPA-MW expands the 30 wavelength variables selected by the MC-UVE-SPA to 83. At this point, the RMSEP is reduced to 5.9694, the r value is increased to 0.9752, the RPD value is increased to 3.9994, and the model is optimal. We can also note from Table 4 and Figure 11 that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths, and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.
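The retained fractions quoted for the diesel data follow directly from the wavelength counts and the 401-point full spectrum. A small check, with hypothetical names for the helper and the dictionary keys:

```python
def retained_percent(n_selected, n_total):
    """Share of the full spectrum that is kept, in percent."""
    return 100.0 * n_selected / n_total

# Wavelength counts reported for the diesel data (full spectrum: 401 points).
diesel_counts = {
    "MC-UVE": 262,
    "MC-UVE-SPA": 30,
    "MC-UVE-SPA-MW (w = 3)": 83,
    "MC-UVE-SPA-MW (w = 2L)": 58,
    "MC-UVE-SPA-MW (w = 2R)": 59,
}
diesel_percent = {
    name: round(retained_percent(n, 401), 1) for name, n in diesel_counts.items()
}
```

The same helper applies to the corn (700 points) and ethylene (4601 points) data sets.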

3.3. Ethylene Gas Spectral Data Experiments. The numbers of wavelength variables selected from the spectral data reflecting the C₂H₄ concentration by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were, respectively, 214, 17, 48, 34, and 34, as shown in Figure 13. These respectively represent 4.7%, 0.37%, 1.0%, 0.74%, and 0.74% of the 4601 wavelength variables included in the full spectrum. It can be determined from Figure 13 that more than half of the selected wavelength variables fall within the strong absorption regions in the wavenumber ranges 794∼1105 cm⁻¹ and 2917∼3242 cm⁻¹. These results can be explained according to the description given on the HITRAN web page, which states that the absorption spectral band of C₂H₄ gas is in the range of 614∼3242 cm⁻¹ and that the two isotopes H₂¹²C¹²CH₂ and H₂¹²C¹³CH₂ of C₂H₄ present strong absorption bands in the wavenumber ranges of 794∼1105 cm⁻¹ and 2917∼3242 cm⁻¹, respectively. From Figure 4, it can be seen that, in some regions that are not C₂H₄ absorption bands, the spectral intensity has a significant linear relationship with the C₂H₄ content, which may be due to interference from the background spectrum as the C₂H₄ concentration changes; wavelength points are therefore also selected in some regions outside the C₂H₄ absorption bands.

The details regarding the prediction results of the PLS or MLR models constructed from the selected wavelength variables are listed in Table 5 along with the results obtained for a full-spectrum PLS model. We again note from the table that the full-spectrum model is more complicated, and its prediction accuracy was the worst of all models considered. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. Of the five wavelength selection algorithms, we again note that the MC-UVE-SPA selected the fewest wavelengths, and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.

Figure 7: Relationship between the root mean square error (RMSE) of the corn oil concentration predicted by the PLS regression model and the number of selected wavelength variables (x-axis: number of wavelength variables; y-axis: RMSECV; annotated minimum: 0.028931).

Figure 8: Relationship between the RMSE of the corn oil concentration predicted by the multiple linear regression (MLR) model and the number of selected wavelength variables using the MC-UVE-SPA (x-axis: number of variables included in the model; y-axis: RMSE; final number of selected variables: 37, RMSE = 0.011356).

3.4. Summary of the Experimental Results. It can be noted from the above experimental results that the prediction accuracy of the models established with the wavelength selection algorithms is higher than that of the full-spectrum model. The MC-UVE-SPA selects the fewest characteristic wavelengths and eliminates the collinearity between variables. The prediction accuracy of the model established by the MC-UVE-SPA is higher than that of the model established by MC-UVE. The number of characteristic wavelengths finally selected by the MC-UVE-SPA-MW is greater than that selected by the MC-UVE-SPA, but the resulting models have better prediction accuracy.

Figure 9: Comparison of corn oil concentration predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3) models (x-axis: measured concentration (%); y-axis: predicted concentration (%)).

Table 3: Number of selected wavelengths and corn oil concentration prediction results.

| Wavelength selection algorithm | Number of wavelengths | Model type | RMSECV | RMSEP | r | RPD |
|---|---|---|---|---|---|---|
| Full spectrum | 700 | PLS | 0.1431 | 0.1345 | 0.9257 | 1.0550 |
| MC-UVE | 102 | PLS | 0.0289 | 0.1447 | 0.9024 | 0.7948 |
| MC-UVE-SPA | 37 | MLR | 0.0114 | 0.1225 | 0.9267 | 2.0225 |
| MC-UVE-SPA-MW (w = 2L) | 64 | MLR | 0.0035 | 0.0381 | 0.9713 | 16.3666 |
| MC-UVE-SPA-MW (w = 2R) | 64 | MLR | 0.0101 | 0.1009 | 0.8062 | 1.3703 |
| MC-UVE-SPA-MW (w = 3) | 82 | MLR | 0.0089 | 0.0532 | 0.9335 | 6.1627 |

Figure 10: Wavelengths selected from the NIR spectral data reflecting corn oil concentration by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms (x-axis: wavelength, nm; y-axis: absorbance).

Table 4: Number of selected wavelengths and boiling point prediction results.

| Wavelength selection algorithm | Number of wavelengths | Model type | RMSECV | RMSEP | r | RPD |
|---|---|---|---|---|---|---|
| Full spectrum | 401 | PLS | 4.6350 | 8.3841 | 0.9103 | 2.3704 |
| MC-UVE | 262 | PLS | 4.3184 | 10.0221 | 0.8311 | 1.7710 |
| MC-UVE-SPA | 30 | MLR | 3.5503 | 8.8676 | 0.9341 | 2.4650 |
| MC-UVE-SPA-MW (w = 2L) | 58 | MLR | 3.9921 | 7.0442 | 0.9534 | 2.8212 |
| MC-UVE-SPA-MW (w = 2R) | 59 | MLR | 4.0024 | 6.8573 | 0.9556 | 2.8982 |
| MC-UVE-SPA-MW (w = 3) | 83 | MLR | 2.6478 | 5.9694 | 0.9752 | 3.9994 |

The maximum number of PCs was again set to 10 for both PLS models.

Figure 11: Comparison of diesel fuel boiling point predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3) models (x-axis: measured boiling point, °C; y-axis: predicted boiling point, °C).

Figure 12: Wavelengths selected from the NIR spectral data reflecting the boiling point of diesel fuel by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms (x-axis: wavelength, nm; y-axis: absorbance).

4. Conclusions

The present study addressed the sparsity of the wavelength variables selected by the cascaded MC-UVE-SPA through the application of a moving window, which improved the continuity of the selected wavelength variables and thereby better exploited the advantages of the MC-UVE algorithm and the SPA to obtain regression models with high prediction accuracy. The advantages of the proposed MC-UVE-SPA-MW were demonstrated by applying the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms to the selection of wavelength variables from the NIR spectral absorbance data of corn, diesel fuel, and ethylene, and PLS and MLR models reflecting the oil content of corn, the boiling point of diesel fuel, and the ethylene concentration were thereby established and tested. The experimental results demonstrated that the progressive elimination of uncorrelated and collinear variables generated increasingly simplified partial-spectrum models with greater prediction accuracy than the full-spectrum model. Among the three wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelength variables, and the proposed MC-UVE-SPA-MW algorithm provided the models with the greatest prediction accuracy.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by grants from the Major National Science and Technology Special Project of China (JZ2015KJZZ0254) and the Key Projects of Natural Science Research in Universities in Anhui, China (KJ2018A0544).

Figure 13: Measured C₂H₄ gas absorbance spectra and wavelength variables selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms (x-axis: wavenumber, cm⁻¹; y-axis: absorbance, ×10⁴; legend: original, MC-UVE, MC-UVE-SPA, MC-UVE-SPA-MW).

Table 5: Number of selected wavelengths and C₂H₄ gas concentration prediction results.

| Wavelength selection algorithm | Number of wavelengths | Model type | RMSECV | RMSEP | r |
|---|---|---|---|---|---|
| Full spectrum | 4601 | PLS | 0.1431 | 0.1342 | 0.9801 |
| MC-UVE | 214 | PLS | 0.1009 | 0.1212 | 0.9887 |
| MC-UVE-SPA | 17 | MLR | 0.1375 | 0.0537 | 0.9913 |
| MC-UVE-SPA-MW (w = 2L) | 34 | MLR | 0.0989 | 0.0495 | 0.9988 |
| MC-UVE-SPA-MW (w = 2R) | 34 | MLR | 0.0988 | 0.0486 | 0.9990 |
| MC-UVE-SPA-MW (w = 3) | 48 | MLR | 0.1005 | 0.0434 | 0.9999 |

The maximum number of PCs was again set to 10 for both PLS models.

References

[1] C. Pasquini, “Near infrared spectroscopy: fundamentals, practical aspects and analytical applications,” Journal of the Brazilian Chemical Society, vol. 14, no. 2, pp. 198–219, 2003.

[2] X. Sun, M. Zhou, and Y. Sun, “Spectroscopy quantitative analysis cotton content of blend fabrics,” International Journal of Clothing Science and Technology, vol. 28, no. 1, pp. 65–76, 2016.

[3] M. Schwanninger, J. C. Rodrigues, and K. Fackler, “A review of band assignments in near infrared spectra of wood and wood components,” Journal of Near Infrared Spectroscopy, vol. 19, no. 5, pp. 287–308, 2011.

[4] Y.-H. Yun, Y.-C. Wei, X.-B. Zhao, W.-J. Wu, Y.-Z. Liang, and H.-M. Lu, “A green method for the quantification of polysaccharides in Dendrobium officinale,” RSC Advances, vol. 5, no. 127, pp. 105057–105065, 2015.

[5] C. K. Vance, D. R. Tolleson, K. Kinoshita, J. Rodriguez, and W. J. Foley, “Near infrared spectroscopy in wildlife and biodiversity,” Journal of Near Infrared Spectroscopy, vol. 24, no. 1, pp. 1–25, 2016.

[6] H. Hotelling, “Analysis of a complex of statistical variables into principal components,” Journal of Educational Psychology, vol. 24, no. 6, pp. 417–441, 1933.

[7] P. Geladi and B. R. Kowalski, “Partial least-squares regression: a tutorial,” Analytica Chimica Acta, vol. 185, no. 1, pp. 1–17, 1986.

[8] A. A. Kardamakis and N. Pasadakis, “Autoregressive modeling of near-IR spectra and MLR to predict RON values of gasolines,” Fuel, vol. 89, no. 1, pp. 158–161, 2010.

[9] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[10] B. Lu, N. Liu, H. Li et al., “Quantitative determination and characteristic wavelength selection of available nitrogen in coco-peat by NIR spectroscopy,” Soil and Tillage Research, vol. 191, pp. 266–274, 2019.

[11] M. J. Anzanello and F. S. Fogliatto, “A review of recent variable selection methods in industrial and chemometrics applications,” European Journal of Industrial Engineering, vol. 8, no. 5, p. 619, 2014.

[12] Y.-H. Yun, H.-D. Li, B.-C. Deng, and D.-S. Cao, “An overview of variable selection methods in multivariate analysis of near-infrared spectra,” TrAC Trends in Analytical Chemistry, vol. 113, pp. 102–115, 2019.

[13] B. Nadler and R. R. Coifman, “The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration,” Journal of Chemometrics, vol. 19, no. 2, pp. 107–118, 2005.

[14] T. Mehmood, K. H. Liland, L. Snipen, and S. Sæbø, “A review of variable selection methods in Partial Least Squares Regression,” Chemometrics and Intelligent Laboratory Systems, vol. 118, no. 8, pp. 62–69, 2012.

[15] V. Centner, D.-L. Massart, O. E. de Noord, S. de Jong, B. M. Vandeginste, and C. Sterna, “Elimination of uninformative variables for multivariate calibration,” Analytical Chemistry, vol. 68, no. 21, pp. 3851–3858, 1996.

[16] G. Tang, Y. Huang, K. Tian et al., “A new spectral variable selection pattern using competitive adaptive reweighted sampling combined with successive projections algorithm,” The Analyst, vol. 139, no. 19, pp. 4894–4902, 2014.

[17] Y. Li, Y. Guo, C. Liu et al., “SPA combined with swarm intelligence optimization algorithms for wavelength variable selection to rapidly discriminate the adulteration of apple juice,” Food Analytical Methods, vol. 10, no. 6, pp. 1965–1971, 2017.

[18] J.-B. Li, C.-J. Zhao, W.-Q. Huang et al., “A combination algorithm for variable selection to determine soluble solid content and firmness of pears,” Analytical Methods, vol. 6, no. 7, pp. 2170–2180, 2014.

[19] Z. Xiaobo, Z. Jiewen, M. Hanpin, S. Jiyong, Y. Xiaopin, and L. Yanxiao, “Genetic algorithm interval partial least squares regression combined successive projections algorithm for variable selection in near-infrared quantitative analysis of pigment in cucumber leaves,” Applied Spectroscopy, vol. 64, no. 7, pp. 786–794, 2010.

[20] S. Ye, D. Wang, and S. Min, “Successive projections algorithm combined with uninformative variable elimination for spectral variable selection,” Chemometrics and Intelligent Laboratory Systems, vol. 91, no. 2, pp. 194–199, 2008.

[21] B.-C. Deng, Y.-H. Yun, P. Ma, C.-C. Lin, D.-B. Ren, and Y.-Z. Liang, “A new method for wavelength interval selection that intelligently optimizes the locations, widths and combinations of the intervals,” The Analyst, vol. 140, no. 6, pp. 1876–1885, 2015.

[22] W. Fan, Y.-Y. Li, Y.-K. Peng et al., “Nondestructive determination of lycopene content based on visible/near infrared transmission spectrum,” Chinese Journal of Analytical Chemistry, vol. 46, no. 9, pp. 1424–1431, 2018.

[23] Z. Sun, J. Fan, J. Wang et al., “Assessment of the human albumin in acid precipitation process using NIRS and multi-variable selection methods combined with SPA,” Journal of Molecular Structure, vol. 1199, p. 126942, 2020.

[24] H.-D. Li, Q.-S. Xu, and Y.-Z. Liang, “libPLS: an integrated library for partial least squares regression and linear discriminant analysis,” Chemometrics and Intelligent Laboratory Systems, vol. 176, pp. 34–43, 2018.

[25] R. Zhang, Y.-Y. Chen, Z.-B. Wang, and L. Kewu, “A novel ensemble L1 regularization based variable selection framework with an application in near infrared spectroscopy,” Chemometrics and Intelligent Laboratory Systems, vol. 163, pp. 7–15, 2017.

[26] W. Cai, Y. Li, and X. Shao, “A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra,” Chemometrics and Intelligent Laboratory Systems, vol. 90, no. 2, pp. 188–194, 2008.

[27] L. M. Bregman, “Finding the common point of convex sets by the method of successive projection,” Proceedings of the USSR Academy of Sciences, vol. 162, no. 3, pp. 487–490, 1965.

[28] X. Peng, T. Shi, A. Song, Y. Chen, and W. Gao, “Estimating soil organic carbon using VIS/NIR spectroscopy with SVMR and SPA methods,” Remote Sensing, vol. 6, no. 4, pp. 2699–2717, 2014.

[29] Y.-H. Liu, Q.-Q. Wang, X.-W. Gao, and A.-G. Xie, “Total phenolic content prediction in Flos Lonicerae using hyperspectral imaging combined with wavelengths selection methods,” Journal of Food Process Engineering, vol. 42, no. 6, Article ID e13224, 2019.

[30] J. Li, H. Zhang, B. Zhan, Y. Zhang, R. Li, and J. Li, “Non-destructive firmness measurement of the multiple cultivars of pears by Vis-NIR spectroscopy coupled with multivariate calibration analysis and MC-UVE-SPA method,” Infrared Physics & Technology, vol. 104, Article ID 103154, 2020.

[31] P. Hourant, V. Baeten, M. T. Morales, M. Meurens, and R. Aparicio, “Oil and fat classification by selected bands of near-infrared spectroscopy,” Applied Spectroscopy, vol. 54, no. 8, pp. 1168–1174, 2000.


2.1.2. Diesel Fuel Spectral Data. The NIR spectral absorbance data for diesel fuel were provided by the Southwest Research Institute (SWRI) (http://www.eigenvector.com/data/SWRI/index.html). The data set comprises unprocessed spectra derived from 784 diesel fuel samples measured over a wavelength range of 750∼1550 nm in 2 nm intervals. Accordingly, the data set includes a total of 401 wavelength points. The data set also contains various properties, including the boiling point, cetane number, density, freezing point, total aromatic hydrocarbon content, and viscosity. Some property values are missing (NaN) for some samples, and these samples are eliminated during the experiment. Table 2 shows the maximum, minimum, and average values of the boiling point of diesel fuel.

2.1.3. Ethylene Gas Spectral Data. Ethylene gas samples were prepared within a closed cell filled with nitrogen gas at a pressure of 1 atm and a temperature of 296 K by distributing C₂H₄ gas into the cell to form samples with 72 known C₂H₄ concentrations ranging from 6015 ppm to 2005 ppm in 2005 ppm intervals. The C₂H₄ gas distribution device adopted a gas distribution platform, shown in Figure 1, independently developed by the Hefei Material Science Research Institute of the Chinese Academy of Sciences. Through visual control software, the gas distribution proportion is set according to the requirements, the volume ratio of the auxiliary gas (nitrogen) to the gas to be distributed is adjusted through the high-precision gas distribution platform, and the required concentration of standard gas is configured accordingly. Fourier transform infrared (FTIR) spectroscopy was applied to capture the spectral absorbance intensity of the gas in a sealed sample cell. The optical path length of the cell was 10 m, and the range of the measured wavenumbers was 400∼5000 cm⁻¹ with a resolution of 1 cm⁻¹. The apodization function used a Hamming window, the number of scans was 16, and a total of 96 spectral data of different concentrations were collected.

Accordingly, the data set includes a total of 4601 wavelength points. The absorption spectrum of C₂H₄ gas obtained from the HITRAN database (http://hitran.iao.ru) over a wavenumber range of 400∼5000 cm⁻¹ is shown in Figure 2. Figure 3 presents the background spectral intensity measured after the closed cell was filled with nitrogen gas at room temperature. Figure 4 presents the measured absorption spectral intensity of the cell after adding various concentrations of C₂H₄ gas. A comparison of Figures 3 and 4 indicates that the spectral intensities in the two regions of 794∼1105 cm⁻¹ and 2917∼3242 cm⁻¹ are drastically different due to the spectral absorption characteristics of the added C₂H₄ gas.
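The count of 4601 points follows from sampling the 400∼5000 cm⁻¹ range at 1 cm⁻¹ with both end points included. A quick check, using a hypothetical helper name:

```python
def n_grid_points(start, stop, step):
    """Number of points on an evenly spaced grid that includes both end points."""
    return int(round((stop - start) / step)) + 1

# Ethylene spectra: 400 to 5000 cm^-1 at a resolution of 1 cm^-1.
ethylene_points = n_grid_points(400, 5000, 1)
wavenumber_axis = [400 + i for i in range(ethylene_points)]
```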

2.2. Evaluation Indices. The NIR spectral absorbance data are first preprocessed to generate normalized data to facilitate consistent analyses. The normalized data are then divided by the Kennard–Stone method (3:1) into a calibration data set and a prediction data set, which are respectively applied for establishing the various regression models and for testing the established models. The extent of information provided by the selected wavelength variables is generally difficult to evaluate directly. Therefore, indirect evaluation methods are usually adopted. Typically, the information value of wavelength variables is evaluated according to the prediction accuracy of the model constructed with the selected wavelengths. The indices for evaluating the prediction accuracy of regression models are the root mean square error of cross validation (RMSECV) for the calibration set and the root mean square error of prediction (RMSEP), the correlation coefficient (r), and the relative percent deviation (RPD) for the prediction set. These indices are defined as follows:

$$
\begin{aligned}
\mathrm{RMSECV} &= \sqrt{\frac{\sum_{k=1}^{n}\left(\hat{y}_k - y_k\right)^2}{n-1}},\\
\mathrm{RMSEP} &= \sqrt{\frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{n-1}},\\
r &= \frac{\sum_{i=1}^{n}\left(y_i - y_{\mathrm{AVE}}\right)\left(\hat{y}_i - \hat{y}_{\mathrm{AVE}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - y_{\mathrm{AVE}}\right)^2\,\sum_{i=1}^{n}\left(\hat{y}_i - \hat{y}_{\mathrm{AVE}}\right)^2}},\\
\mathrm{RPD} &= \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - y_{\mathrm{AVE}}\right)^2}{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}}.
\end{aligned}
\tag{1}
$$

Here, $n$ is the number of samples in the calibration set or the prediction set; $y_k$ is the measured value and $\hat{y}_k$ is the predicted value of sample $k$ in the calibration set; $y_i$ is the measured value and $\hat{y}_i$ is the predicted value of sample $i$ in the prediction set; and $y_{\mathrm{AVE}}$ and $\hat{y}_{\mathrm{AVE}}$ are, respectively, the average measured value and the average predicted value of all samples in the prediction set.
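The indices in (1) can be computed with a few lines of Python. The sketch below follows the definitions above, including the n − 1 denominator in the RMSE, and uses the standard Pearson form for r; the function names and the toy values are illustrative assumptions.

```python
import math

def rmse(measured, predicted):
    """Root mean square error with the n - 1 denominator used in (1)."""
    n = len(measured)
    return math.sqrt(sum((p - m) ** 2 for m, p in zip(measured, predicted)) / (n - 1))

def correlation(measured, predicted):
    """Pearson correlation coefficient r between measured and predicted values."""
    m_ave = sum(measured) / len(measured)
    p_ave = sum(predicted) / len(predicted)
    cross = sum((m - m_ave) * (p - p_ave) for m, p in zip(measured, predicted))
    ss_m = sum((m - m_ave) ** 2 for m in measured)
    ss_p = sum((p - p_ave) ** 2 for p in predicted)
    return cross / math.sqrt(ss_m * ss_p)

def rpd(measured, predicted):
    """RPD as defined in (1): spread of measured values over residual spread."""
    m_ave = sum(measured) / len(measured)
    ss_m = sum((m - m_ave) ** 2 for m in measured)
    ss_res = sum((p - m) ** 2 for m, p in zip(measured, predicted))
    return math.sqrt(ss_m / ss_res)

# Toy values (not data from the paper).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
toy_rmse = rmse(y_true, y_pred)
toy_r = correlation(y_true, y_pred)
toy_rpd = rpd(y_true, y_pred)
```

The same `rmse` function gives the RMSECV when applied to cross-validated predictions on the calibration set and the RMSEP when applied to the prediction set.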

We note that the evaluated prediction performance increases with decreasing RMSE and increasing r and RPD. The RMSE is denoted as the RMSECV when referring to the value associated with the calibration data set and as the RMSEP when referring to the value associated with the prediction data set.
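The Kennard–Stone split mentioned at the start of this section can be sketched as follows. This is the common textbook form of the method (start from the two most distant samples, then repeatedly add the sample farthest from the already chosen set), written with Euclidean distances; the function name and the toy data are assumptions, and at least two calibration samples are presumed.

```python
import math

def kennard_stone(samples, n_calibration):
    """Return (calibration_indices, prediction_indices) for a list of samples."""
    n = len(samples)
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Seed the calibration set with the two most distant samples.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist[ij[0]][ij[1]],
    )
    chosen = [i0, j0]
    remaining = [k for k in range(n) if k not in chosen]
    while len(chosen) < n_calibration:
        # Add the sample whose nearest chosen neighbor is farthest away.
        nxt = max(remaining, key=lambda k: min(dist[k][c] for c in chosen))
        chosen.append(nxt)
        remaining.remove(nxt)
    return chosen, remaining

# Toy one-dimensional "spectra"; a 3:1 split of four samples keeps three for calibration.
toy_samples = [[0.0], [1.0], [2.0], [10.0]]
calibration, prediction = kennard_stone(toy_samples, n_calibration=3)
```

With a 3:1 ratio, `n_calibration` would be three quarters of the available samples, and the remaining quarter forms the prediction set.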

Table 1: Statistical analysis of the relative concentrations of corn.

Component to be tested    Max        Min        Ave
Moisture                  10.9930    9.3770     10.2335
Oil                       3.8320     3.0880     3.4984
Protein                   9.7110     7.6540     8.6683
Starch                    66.4720    62.8260    64.6956

Table 2: Statistical analysis of the boiling point of diesel fuel.

Property to be tested    Effective sample size    Max    Min    Ave
Boiling point            395                      297    182    258.37

2.3. MC-UVE-SPA Method. The fundamental basis of UVE is to use the stability of the regression coefficient vector of a constructed PLS multiple regression model as a measure of the significance of a given wavelength. However, UVE tends to suffer from model overfitting [25]. This was addressed by the development of Monte Carlo (MC) UVE (MC-UVE), proposed by Cai et al. [26], which replaces the leave-one-out cross-validation (LOOCV) process used to calculate the regression coefficient matrix β = [β1, β2, …, βp] in conventional UVE with the MC cross-validation (MCCV) process. The reliability of each variable j can be quantitatively measured by

$$
S_j = \frac{\mathrm{mean}(\beta_j)}{\mathrm{std}(\beta_j)}, \tag{2}
$$

where mean(βj) and std(βj) are the mean and standard deviation of the regression coefficients of variable j. The greater the absolute value of the stability, the more important the corresponding variable. The stability of uninformative variables should be less than a threshold.
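The stability index of equation (2) can be sketched as follows. This is only an illustrative outline under stated assumptions: the PLS1 (NIPALS) routine, the 80% subsampling ratio, the number of Monte Carlo runs, and all function names are choices made for this example and are not the authors' implementation.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """Regression coefficients of a mean-centered PLS1 model (NIPALS)."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    p = X.shape[1]
    W = np.zeros((p, n_components))
    P = np.zeros((p, n_components))
    q = np.zeros(n_components)
    for a in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        tt = t @ t
        P[:, a] = Xk.T @ t / tt
        q[a] = yk @ t / tt
        W[:, a] = w
        Xk = Xk - np.outer(t, P[:, a])   # deflate X
        yk = yk - q[a] * t               # deflate y
    return W @ np.linalg.solve(P.T @ W, q)

def mc_uve_stability(X, y, n_components=3, n_runs=100, ratio=0.8, seed=0):
    """Stability S_j = mean(beta_j) / std(beta_j) over Monte Carlo subsamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_sub = int(round(ratio * n))        # assumed subsample size
    coefs = np.empty((n_runs, X.shape[1]))
    for run in range(n_runs):
        idx = rng.choice(n, size=n_sub, replace=False)
        coefs[run] = pls1_coefficients(X[idx], y[idx], n_components)
    return coefs.mean(axis=0) / coefs.std(axis=0)
```

Variables whose absolute stability falls below the chosen threshold would then be treated as uninformative and removed.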

The SPA, first proposed by Bregman [27], is a forward-cycling variable selection method. For spectral data analysis, each cycle of the process calculates the projection of a selected wavelength on an unselected wavelength and adds the unselected wavelength with the largest projection vector to the set of selected wavelengths [28]. This process is repeated for each selected wavelength as it is added to the set until the selected wavelength set includes a specified number of wavelengths [16]. More detailed information on the steps of SPA can be found in the literature [16, 29]. Each newly selected wavelength is the one that has the lowest correlation with the previously selected one. Therefore, SPA can effectively eliminate collinear wavelength variables and reduce the number of dimensions of the sample spectrum, which accordingly reduces the computational burden of the model.
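The projection cycle described above can be sketched in a few lines. This follows the commonly used formulation of SPA, in which the remaining columns are projected onto the subspace orthogonal to the most recently selected column and the column with the largest residual norm is selected next; the function name and arguments are assumptions for illustration only.

```python
import numpy as np

def spa_select(X, start, n_select):
    """Select n_select column indices of X, beginning with column `start`."""
    Xp = np.asarray(X, dtype=float).copy()
    selected = [start]
    for _ in range(n_select - 1):
        v = Xp[:, selected[-1]]
        # Remove from every column its component along the last selected column.
        Xp = Xp - np.outer(v, v @ Xp) / (v @ v)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0           # never reselect a chosen column
        selected.append(int(np.argmax(norms)))
    return selected
```

Because a column that is collinear with an already selected column has a projection of (nearly) zero norm, it is never preferred, which is how the procedure suppresses collinear wavelengths.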

Figure 1: Schematic diagram of the standard gas spectra acquisition system (labeled components: FTIR spectrometer, multipass gas cell, pump, detector, infrared source, computer, gas sampling hoses, target gas, gas distribution system, serial port, and network).

Figure 2: NIR absorbance spectrum for C2H4 gas from the HITRAN database (x-axis: wavenumber, cm⁻¹; y-axis: absorbance).


The MC-UVE-SPA method combines MC-UVE and SPA. Jiangbo Li et al. showed that the combination (MC-UVE-SPA) of Monte Carlo uninformative variable elimination (MC-UVE) and the successive projections algorithm (SPA) was more effective than MC-UVE or SPA alone [30]. Although UVE-SPA performs better than UVE or SPA alone, there is still room for improvement. In this paper, UVE-SPA is improved by exploiting the continuity of effective wavelengths, and the effectiveness of this improvement is verified by experiments.

Figure 3: Measured background NIR spectrum with the cell filled with nitrogen gas (N2) (x-axis: wavenumber, cm⁻¹; y-axis: intensity, ×10⁴).

Figure 4: Measured C2H4 gas absorbance spectrum (x-axis: wavenumber, cm⁻¹; y-axis: intensity, ×10⁴).

2.4. Proposed MC-UVE-SPA-MW Wavelength Selection Algorithm. The proposed wavelength selection algorithm first applies MC-UVE to the calibration data set to construct a PLS regression model. The threshold of the MC-UVE process is set to provide the number of wavelength variables that minimizes the RMSECV of the constructed PLS regression model. The maximum number of principal components (PCs) was set to 10, and the optimal number of PCs was determined based on the minimum RMSECV value. Subsequently, the wavelength variables retained by the MC-UVE algorithm are used as the input of the SPA. Here, an MLR model is constructed from the wavelength variables selected by the SPA for cross-validation analysis, where the number of selected wavelength variables is determined by the minimum RMSECV of the constructed MLR model. In order to reduce the number of isolated wavelength variables and maintain the continuity of adjacent informative wavelength points of the near-infrared spectrum, the selection is extended outward from the best wavelength points selected by MC-UVE-SPA. In the original spectrum, each optimal wavelength point selected by the MC-UVE-SPA is used as the starting point of a moving window of width w. The wavelengths in the moving window are used as the selected wavelengths, and the number of wavelengths finally selected varies with the window width. The window width is set to w = 2 (Left), 2 (Right), or 3. Here, 2 (Left) means that the selected wavelength point is extended to the left to cover 2 wavelength points, 2 (Right) means that the selected wavelength point is extended to the right to cover 2 wavelength points, and 3 means that the selected wavelength point is extended to 3 wavelength points with the selected wavelength point as the center. The optimal window width is determined by the minimum RMSECV of the MLR model. The processing flow of the proposed MC-UVE-SPA-MW algorithm and the outward extension of the wavelength points are illustrated in Figure 5.
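The window extension described above can be illustrated with a short sketch. The function name, the 0-based indexing, and the mode labels "2L", "2R", and "3" are assumptions for this example; overlapping windows are merged by taking the union of indices, which is one plausible reading of why the extended selection can contain fewer than twice or three times the original number of points.

```python
def extend_selection(selected, n_wavelengths, mode):
    """Extend each selected index with a window: '2L', '2R', or '3' (centered)."""
    offsets = {"2L": (-1, 0), "2R": (0, 1), "3": (-1, 0, 1)}[mode]
    extended = set()
    for idx in selected:
        for off in offsets:
            j = idx + off
            if 0 <= j < n_wavelengths:   # stay inside the spectrum
                extended.add(j)
    return sorted(extended)
```

For a single selected point at index 10 in a 14-point spectrum, the three modes give the index sets {9, 10}, {10, 11}, and {9, 10, 11}, which matches the pattern drawn in Figure 5.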

3. Results and Discussion

3.1. Corn Spectral Data Experiments. The wavelength variable stability distribution of the PLS regression model reflecting the oil concentration in corn, constructed for the calibration set using the MC-UVE algorithm, is presented in Figure 6. Here, all wavelengths whose stability exceeds the threshold value, shown by the horizontal red line in the figure, are selected for use in the model. This threshold was selected to provide the number of wavelength variables corresponding to the minimum RMSECV of the constructed PLS regression model. This is illustrated in Figure 7, where the RMSECV of the constructed PLS regression model is plotted with respect to the number of selected wavelength variables. It can be seen from Figure 7 that the RMSECV is relatively large when the number of wavelength variables is small and that it drops sharply as the number of selected variables increases. This is because an overly small number of wavelength variables excludes useful information, and the prediction accuracy of the model therefore improves as an increasing amount of useful information is incorporated into the model. A minimum RMSECV of 0.0289 is obtained when the number of selected wavelength variables is 106, and the RMSECV increases again when the number of variables exceeds 106. This increase results from the impact of an increasing number of uninformative variables on the prediction accuracy of the model. We also note that the RMSECV changes very little when the number of wavelength variables exceeds 300. Thus, the MC-UVE algorithm eliminates a large number of wavelengths that are not related to the oil concentration of corn, and the final number of selected wavelength variables is just 15.1% of the full-spectrum value of 700.
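The threshold choice described above is equivalent to ranking the variables by absolute stability and keeping the top k variables for which the cross-validation error is smallest. The sketch below illustrates this idea; `cv_error` is a hypothetical user-supplied callable that returns the RMSECV of a model built on the given variable indices, and neither it nor the function name comes from the original study.

```python
import numpy as np

def select_by_min_cv(stability, cv_error):
    """Keep the top-k variables (by |stability|) that minimize cv_error."""
    order = np.argsort(-np.abs(np.asarray(stability, dtype=float)))
    # Evaluate the cross-validation error for every candidate subset size k.
    errors = [cv_error(order[:k]) for k in range(1, order.size + 1)]
    best_k = int(np.argmin(errors)) + 1
    return order[:best_k], errors[best_k - 1]
```

In the corn experiment reported here, this kind of search corresponds to the minimum of the curve in Figure 7, reached at 106 variables.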

The 106 wavelength variables selected by MC-UVE are then used as the inputs of the SPA, which iteratively generates wavelength variable combinations using each wavelength as a starting point and applies them to construct an MLR model. The wavelength combination corresponding to the minimum RMSECV of the MLR model is then taken as the optimal wavelength combination. The relationship between the number of selected wavelength variables and the RMSECV of the MLR model constructed from variables selected by the MC-UVE-SPA is shown in Figure 8, where we note that the minimum RMSECV is obtained when the number of selected variables is 37. Thus, the SPA further reduces the number of informative wavelengths, mainly by eliminating collinear variables in the MLR model, and the final number of selected wavelength variables is reduced to just 5.3% of the full-spectrum value of 700.
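To make the selection criterion concrete, the following sketch computes a cross-validation RMSE for an MLR model on a candidate set of wavelength columns. Leave-one-out cross-validation and ordinary least squares with an intercept are assumptions made for this example; the text does not state which cross-validation scheme was used for the MLR models.

```python
import numpy as np

def mlr_loocv_rmse(X, y):
    """Leave-one-out RMSECV of an MLR model (intercept included)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])   # design matrix with intercept
    preds = np.empty(n)
    for k in range(n):
        mask = np.arange(n) != k            # hold out sample k
        coef, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        preds[k] = A[k] @ coef
    return float(np.sqrt(np.sum((preds - y) ** 2) / (n - 1)))
```

Evaluating this quantity for each candidate wavelength combination and keeping the combination with the smallest value mirrors the selection of the 37-variable set at the minimum of Figure 8.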

In the original spectrum, each optimal wavelength point selected by the MC-UVE-SPA is used as the starting point or center of a moving window of width w = 2 (Left), 2 (Right), or 3. The results of the PLS or MLR models constructed using the wavelength variables selected by the different algorithms are shown in Figure 9, and the details are listed in Table 3. In Table 3, the optimal number of PLS principal components was 10. As shown in Table 3, 37 characteristic wavelengths were selected by the MC-UVE-SPA, accounting for only 5.3% of the total number of wavelengths, and the accuracy of the resulting model is better than that of the MC-UVE model, which is due to the elimination of wavelength collinearity. The wavelengths selected by the MC-UVE-SPA are then extended by the moving window proposed in this paper. When the window width is w = 2 (Left) or w = 3, the model accuracy of the MC-UVE-SPA-MW algorithm is higher than that of the MC-UVE-SPA. When w = 2 (Left), the MC-UVE-SPA-MW expands the 37 wavelength variables selected by the MC-UVE-SPA to 64. At this point, the RMSEP is 0.0381, the r value is 0.9713, the RPD value is 16.3666, and the model is optimal. Although the MC-UVE-SPA-MW improves continuity by increasing the number of wavelength variables relative to the MC-UVE-SPA, the final number is still just 9.1% of the full-spectrum value of 700.

The wavelength variables selected from the NIR spectral absorbance data of a single sample by the MC-UVE, MC-UVE-SPA, and proposed MC-UVE-SPA-MW algorithms are compared in Figure 10. The results in Figure 10 reflect the fact that oil is a complex organic molecule whose infrared and NIR spectral absorption occupies a wide wavenumber band ranging over 3900∼12000 cm⁻¹ (833∼2564 nm). This absorption is mainly caused by the frequency doubling and frequency combinations of the stretching and vibrational energy level transitions of hydrogen-containing groups. From Figure 10, we note that the wavelength variables selected by the MC-UVE, MC-UVE-SPA, and proposed MC-UVE-SPA-MW algorithms are mainly distributed in the ranges 1662∼1790, 2222∼2268, 2288∼2316, 2390∼2428, and 2476∼2498 nm, which correspond exactly to the spectral absorption peaks generated by the first and second frequency doubling of the C-H stretching vibrations of the -CH2, -CH3, and -CH=CH- functional groups of oil [31].

We note from Figure 10 that the moving window employed by the MC-UVE-SPA-MW algorithm expands the wavelength variables selected by the MC-UVE-SPA, resulting in a greater number of wavelength variables than that obtained by the MC-UVE-SPA, and the improved continuity of the wavelength variables selected by the MC-UVE-SPA-MW algorithm is very apparent in Figure 10 when compared with those selected by the MC-UVE-SPA. We can also note from Table 3 that the full-spectrum model was relatively complicated and that its prediction accuracy was the worst of all models considered, owing to the impact of the large number of uninformative wavelength variables included in the model. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. We also note from the table that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 2L) algorithm provided the model with the greatest prediction accuracy.

Figure 5: Processing flow of the proposed moving-window-improved MC-UVE-SPA (MC-UVE-SPA-MW) wavelength selection algorithm, based on the Monte Carlo uninformative variable elimination (MC-UVE) algorithm cascaded with the successive projections algorithm (SPA). Flow: Begin, Preprocessing, MC-UVE, SPA, Moving-window improvement, End. The inset shows a wavelength point being extended outward with window width w (1 marks a selected wavelength):

w = 3     0 0 0 1 0 0 0 0 0 1 1 1 0 0
w = 2L    0 0 0 1 0 0 0 0 0 1 1 0 0 0
w = 2R    0 0 0 1 0 0 0 0 0 0 1 1 0 0

Figure 6: Wavelength variable stability distribution map of the partial least squares (PLS) regression model reflecting the oil concentration using the MC-UVE algorithm (x-axis: wavelength variable; y-axis: stability index).

3.2. Diesel Spectral Data Experiments. The numbers of wavelength variables selected from the NIR spectral data of diesel fuel reflecting the boiling point by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were 262, 30, 83, 58, and 59, respectively, as shown in Table 4. These represent 65.3%, 7.5%, 20.7%, 14.5%, and 14.7%, respectively, of the 401 wavelength variables included in the full spectrum. The prediction results of the PLS or MLR models constructed from the selected wavelength variables are shown in Figure 11, and the details are listed in Table 4, along with the results obtained for a full-spectrum PLS model. We note from Table 4 and Figure 11 that the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are greatly simplified compared with the full-spectrum model. MC-UVE retains 262 wavelength points, and its prediction accuracy is the worst of all the models considered, which may be due to the existence of wavelength collinearity. When the SPA is used to further screen the wavelength points selected by MC-UVE, only 30 wavelength points are retained, while the prediction accuracy of the model is greatly improved: the RMSEP is reduced to 8.8676, the r value is increased to 0.9341, and the RPD value is increased to 2.4650. We note from Figure 12 that the moving window employed by the MC-UVE-SPA-MW expands the wavelength variables selected by the MC-UVE-SPA and improves their continuity. When the window width is w = 2 (Left), 2 (Right), or 3, the accuracy of all three models obtained by the MC-UVE-SPA-MW is improved. When w = 3, the MC-UVE-SPA-MW expands the 30 wavelength variables selected by the MC-UVE-SPA to 83. At this point, the RMSEP is reduced to 5.9694, the r value is increased to 0.9752, the RPD value is increased to 3.9994, and the model is optimal. We can also note from Table 4 and Figure 11 that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.

3.3. Ethylene Gas Spectral Data Experiments. The numbers of wavelength variables selected from the spectral data reflecting the C2H4 concentration by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were 214, 17, 48, 34, and 34, respectively, as shown in Figure 13. These represent 4.7%, 0.37%, 1.0%, 0.74%, and 0.74%, respectively, of the 4601 wavelength variables included in the full spectrum. It can be determined from Figure 13 that more than half of the selected wavelength variables fall within the strong absorption regions in the wavenumber ranges 794∼1105 cm⁻¹ and 2917∼3242 cm⁻¹. These results can be explained according to the description given on the HITRAN web page, which states that the absorption spectral band of C2H4 gas is in the range of 614∼3242 cm⁻¹ and that the two isotopes H2¹²C¹²CH2 and H2¹²C¹³CH2 of C2H4 present strong absorption bands in the wavenumber ranges of 794∼1105 cm⁻¹ and 2917∼3242 cm⁻¹, respectively. From Figure 4, it can be seen that, in some regions that are not C2H4 absorption bands, the spectral intensity has a significant linear relationship with the C2H4 content, which may be due to interference from the background spectrum as the C2H4 concentration changes; wavelength points are therefore also selected in some regions outside the C2H4 absorption bands.

The details of the prediction results of the PLS or MLR models constructed from the selected wavelength variables are listed in Table 5, along with the results obtained for a full-spectrum PLS model. We again note from the table that the full-spectrum model is more complicated and that its prediction accuracy was the worst of all models considered. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. Of the five wavelength selection algorithms, we again note that the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.

Figure 7: Relationship between the root mean square error (RMSE) of the corn oil concentration predicted by the PLS regression model and the number of selected wavelength variables (x-axis: number of wavelength variables; y-axis: RMSECV; annotated minimum: 0.028931).

Figure 8: Relationship between the RMSE of the corn oil concentration predicted by the multiple linear regression (MLR) model and the number of selected wavelength variables using the MC-UVE-SPA (x-axis: number of variables included in the model; y-axis: RMSE; annotation: final number of selected variables 37, RMSE = 0.011356).


Figure 9: Comparison of corn oil concentration predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3) models (each panel: x-axis, measured concentration; y-axis, predicted concentration).

Table 3: Number of selected wavelengths and corn oil concentration prediction results.

Wavelength selection algorithm    Number of wavelengths    Model type    RMSECV    RMSEP     r         RPD
Full spectrum                     700                      PLS           0.1431    0.1345    0.9257    1.0550
MC-UVE                            102                      PLS           0.0289    0.1447    0.9024    0.7948
MC-UVE-SPA                        37                       MLR           0.0114    0.1225    0.9267    2.0225
MC-UVE-SPA-MW (w = 2L)            64                       MLR           0.0035    0.0381    0.9713    16.3666
MC-UVE-SPA-MW (w = 2R)            64                       MLR           0.0101    0.1009    0.8062    1.3703
MC-UVE-SPA-MW (w = 3)             82                       MLR           0.0089    0.0532    0.9335    6.1627

Figure 10: Wavelengths selected from the NIR spectral data reflecting corn oil concentration by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms (each panel: x-axis, wavelength in nm; y-axis, absorbance).

3.4. Summary of the Experimental Results. It can be noted from the above experimental results that the prediction accuracy of the models established with the wavelength selection algorithms is higher than that of the full-spectrum model. The MC-UVE-SPA selects the fewest characteristic wavelengths and eliminates the collinearity between variables. The prediction accuracy of the model established by the MC-UVE-SPA is higher than that of the model established by MC-UVE. The MC-UVE-SPA-MW finally selects more characteristic wavelengths than the MC-UVE-SPA but achieves better prediction accuracy.

Table 4: Number of selected wavelengths and boiling point prediction results.

Wavelength selection algorithm    Number of wavelengths    Model type    RMSECV    RMSEP      r         RPD
Full spectrum                     401                      PLS           4.6350    8.3841     0.9103    2.3704
MC-UVE                            262                      PLS           4.3184    10.0221    0.8311    1.7710
MC-UVE-SPA                        30                       MLR           3.5503    8.8676     0.9341    2.4650
MC-UVE-SPA-MW (w = 2L)            58                       MLR           3.9921    7.0442     0.9534    2.8212
MC-UVE-SPA-MW (w = 2R)            59                       MLR           4.0024    6.8573     0.9556    2.8982
MC-UVE-SPA-MW (w = 3)             83                       MLR           2.6478    5.9694     0.9752    3.9994

The maximum number of PCs was again set to 10 for both PLS models.

Figure 11: Comparison of diesel fuel boiling point predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3) models (each panel: x-axis, measured boiling point in °C; y-axis, predicted boiling point in °C).

Figure 12: Wavelengths selected from the NIR spectral data reflecting the boiling point of diesel fuel by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms (each panel: x-axis, wavelength in nm; y-axis, absorbance).


4. Conclusions

The present study addressed the sparsity of the wavelength variables selected by the cascaded MC-UVE-SPA through the application of a moving window, which improved the continuity of the selected wavelength variables and thereby better exploited the advantages of the MC-UVE algorithm and the SPA to obtain regression models with high prediction accuracy. The advantages of the proposed MC-UVE-SPA-MW were demonstrated by applying the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms to the selection of wavelength variables from the NIR spectral absorbance data of corn, diesel fuel, and ethylene; PLS and MLR models reflecting the oil content of corn, the boiling point of diesel fuel, and the ethylene concentration were thereby established and tested. The experimental results demonstrated that the progressive elimination of uncorrelated and collinear variables generated increasingly simplified partial-spectrum models with greater prediction accuracy than the full-spectrum model. Among the three wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelength variables, and the proposed MC-UVE-SPA-MW algorithm provided the models with the greatest prediction accuracy.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by grants from the Major National Science and Technology Special Project of China (JZ2015KJZZ0254) and the Key Projects of Natural Science Research in Universities in Anhui, China (KJ2018A0544).

Figure 13: Measured C2H4 gas absorbance spectra and wavelength variables selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms (x-axis: wavenumber, cm⁻¹; y-axis: absorbance, ×10⁴; legend: original, MC-UVE, MC-UVE-SPA, MC-UVE-SPA-MW).

Table 5: Number of selected wavelengths and C2H4 gas concentration prediction results.

Wavelength selection algorithm    Number of wavelengths    Model type    RMSECV    RMSEP     r
Full spectrum                     4601                     PLS           0.1431    0.1342    0.9801
MC-UVE                            214                      PLS           0.1009    0.1212    0.9887
MC-UVE-SPA                        17                       MLR           0.1375    0.0537    0.9913
MC-UVE-SPA-MW (w = 2L)            34                       MLR           0.0989    0.0495    0.9988
MC-UVE-SPA-MW (w = 2R)            34                       MLR           0.0988    0.0486    0.9990
MC-UVE-SPA-MW (w = 3)             48                       MLR           0.1005    0.0434    0.9999

The maximum number of PCs was again set to 10 for both PLS models.

References

[1] C. Pasquini, "Near infrared spectroscopy: fundamentals, practical aspects and analytical applications," Journal of the Brazilian Chemical Society, vol. 14, no. 2, pp. 198–219, 2003.

[2] X. Sun, M. Zhou, and Y. Sun, "Spectroscopy quantitative analysis cotton content of blend fabrics," International Journal of Clothing Science and Technology, vol. 28, no. 1, pp. 65–76, 2016.

[3] M. Schwanninger, J. C. Rodrigues, and K. Fackler, "A review of band assignments in near infrared spectra of wood and wood components," Journal of Near Infrared Spectroscopy, vol. 19, no. 5, pp. 287–308, 2011.

[4] Y.-H. Yun, Y.-C. Wei, X.-B. Zhao, W.-J. Wu, Y.-Z. Liang, and H.-M. Lu, "A green method for the quantification of polysaccharides in Dendrobium officinale," RSC Advances, vol. 5, no. 127, pp. 105057–105065, 2015.

[5] C. K. Vance, D. R. Tolleson, K. Kinoshita, J. Rodriguez, and W. J. Foley, "Near infrared spectroscopy in wildlife and biodiversity," Journal of Near Infrared Spectroscopy, vol. 24, no. 1, pp. 1–25, 2016.

[6] H. Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, vol. 24, no. 6, pp. 417–441, 1933.

[7] P. Geladi and B. R. Kowalski, "Partial least-squares regression: a tutorial," Analytica Chimica Acta, vol. 185, no. 1, pp. 1–17, 1986.

[8] A. A. Kardamakis and N. Pasadakis, "Autoregressive modeling of near-IR spectra and MLR to predict RON values of gasolines," Fuel, vol. 89, no. 1, pp. 158–161, 2010.

[9] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[10] B. Lu, N. Liu, H. Li, et al., "Quantitative determination and characteristic wavelength selection of available nitrogen in coco-peat by NIR spectroscopy," Soil and Tillage Research, vol. 191, pp. 266–274, 2019.

[11] M. J. Anzanello and F. S. Fogliatto, "A review of recent variable selection methods in industrial and chemometrics applications," European Journal of Industrial Engineering, vol. 8, no. 5, p. 619, 2014.

[12] Y.-H. Yun, H.-D. Li, B.-C. Deng, and D.-S. Cao, "An overview of variable selection methods in multivariate analysis of near-infrared spectra," TrAC Trends in Analytical Chemistry, vol. 113, pp. 102–115, 2019.

[13] B. Nadler and R. R. Coifman, "The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration," Journal of Chemometrics, vol. 19, no. 2, pp. 107–118, 2005.

[14] T. Mehmood, K. H. Liland, L. Snipen, and S. Sæbø, "A review of variable selection methods in Partial Least Squares Regression," Chemometrics and Intelligent Laboratory Systems, vol. 118, no. 8, pp. 62–69, 2012.

[15] V. Centner, D.-L. Massart, O. E. de Noord, S. de Jong, B. M. Vandeginste, and C. Sterna, "Elimination of uninformative variables for multivariate calibration," Analytical Chemistry, vol. 68, no. 21, pp. 3851–3858, 1996.

[16] G. Tang, Y. Huang, K. Tian, et al., "A new spectral variable selection pattern using competitive adaptive reweighted sampling combined with successive projections algorithm," The Analyst, vol. 139, no. 19, pp. 4894–4902, 2014.

[17] Y. Li, Y. Guo, C. Liu, et al., "SPA combined with swarm intelligence optimization algorithms for wavelength variable selection to rapidly discriminate the adulteration of apple juice," Food Analytical Methods, vol. 10, no. 6, pp. 1965–1971, 2017.

[18] J.-B. Li, C.-J. Zhao, W.-Q. Huang, et al., "A combination algorithm for variable selection to determine soluble solid content and firmness of pears," Analytical Methods, vol. 6, no. 7, pp. 2170–2180, 2014.

[19] Z. Xiaobo, Z. Jiewen, M. Hanpin, S. Jiyong, Y. Xiaopin, and L. Yanxiao, "Genetic algorithm interval partial least squares regression combined successive projections algorithm for variable selection in near-infrared quantitative analysis of pigment in cucumber leaves," Applied Spectroscopy, vol. 64, no. 7, pp. 786–794, 2010.

[20] S. Ye, D. Wang, and S. Min, "Successive projections algorithm combined with uninformative variable elimination for spectral variable selection," Chemometrics and Intelligent Laboratory Systems, vol. 91, no. 2, pp. 194–199, 2008.

[21] B.-C. Deng, Y.-H. Yun, P. Ma, C.-C. Lin, D.-B. Ren, and Y.-Z. Liang, "A new method for wavelength interval selection that intelligently optimizes the locations, widths and combinations of the intervals," The Analyst, vol. 140, no. 6, pp. 1876–1885, 2015.

[22] W. Fan, Y.-Y. Li, Y.-K. Peng, et al., "Nondestructive determination of lycopene content based on visible/near infrared transmission spectrum," Chinese Journal of Analytical Chemistry, vol. 46, no. 9, pp. 1424–1431, 2018.

[23] Z. Sun, J. Fan, J. Wang, et al., "Assessment of the human albumin in acid precipitation process using NIRS and multi-variable selection methods combined with SPA," Journal of Molecular Structure, vol. 1199, p. 126942, 2020.

[24] H.-D. Li, Q.-S. Xu, and Y.-Z. Liang, "libPLS: an integrated library for partial least squares regression and linear discriminant analysis," Chemometrics and Intelligent Laboratory Systems, vol. 176, pp. 34–43, 2018.

[25] R. Zhang, Y.-Y. Chen, Z.-B. Wang, and L. Kewu, "A novel ensemble L1 regularization based variable selection framework with an application in near infrared spectroscopy," Chemometrics and Intelligent Laboratory Systems, vol. 163, pp. 7–15, 2017.

[26] W. Cai, Y. Li, and X. Shao, "A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra," Chemometrics and Intelligent Laboratory Systems, vol. 90, no. 2, pp. 188–194, 2008.

[27] L. M. Bregman, "Finding the common point of convex sets by the method of successive projection," Proceedings of the USSR Academy of Sciences, vol. 162, no. 3, pp. 487–490, 1965.

[28] X. Peng, T. Shi, A. Song, Y. Chen, and W. Gao, "Estimating soil organic carbon using VIS/NIR spectroscopy with SVMR and SPA methods," Remote Sensing, vol. 6, no. 4, pp. 2699–2717, 2014.

[29] Y.-H. Liu, Q.-Q. Wang, X.-W. Gao, and A.-G. Xie, "Total phenolic content prediction in Flos Lonicerae using hyperspectral imaging combined with wavelengths selection methods," Journal of Food Process Engineering, vol. 42, no. 6, Article ID e13224, 2019.

[30] J. Li, H. Zhang, B. Zhan, Y. Zhang, R. Li, and J. Li, "Nondestructive firmness measurement of the multiple cultivars of pears by Vis-NIR spectroscopy coupled with multivariate calibration analysis and MC-UVE-SPA method," Infrared Physics & Technology, vol. 104, Article ID 103154, 2020.

[31] P. Hourant, V. Baeten, M. T. Morales, M. Meurens, and R. Aparicio, "Oil and fat classification by selected bands of near-infrared spectroscopy," Applied Spectroscopy, vol. 54, no. 8, pp. 1168–1174, 2000.

12 Journal of Spectroscopy

Page 4: Moving-Window-ImprovedMonteCarloUninformativeVariable ... · 2020. 8. 3. · to the NIR spectroscopic analysis of nicotine in tobacco lamina and active pharmaceutical ingredients

[25]is was addressed by the development of Monte Carlo(MC) UVE (MC-UVE) proposed by Cai et al [26] whichreplaces the leave-one-out cross-validation (LOOCV) pro-cess calculating the regression coefficient matrixβ [β1 β2 β2 ] in conventional UVE with the MC cross-validation (MCCV) process e reliability of each variable jcan be quantitatively measured by

Sj mean βj1113872 1113873

std βj1113872 1113873 (2)

where mean (βj) and std (βj) are the mean and standarddeviation of the regression coefficients of variable j egreater the absolute value of stability the more importantthe corresponding variable e stability of uninformativevariables should be less than a threshold

e SPA first proposed by Bregman [27] is a forward-cycling variable selection method For spectral data analysiseach cycle of the process calculates the projection of a se-lected wavelength on an unselected wavelength and includesthe unselected wavelength with the largest projection vectorin the set of selected wavelengths [28] is process is re-peated for each selected wavelength as it is added to the setuntil the selected wavelength set includes a specified numberof wavelengths [16] More detailed information on the stepsof SPA can be seen in literature [16 29] In selecting the nextwavelength each of the newly selected wavelengths has thelowest correlation with the previous oneerefore SPA caneffectively eliminate collinear wavelength variables and re-duce the number of dimensions of the sample spectrumwhich accordingly reduces the calculation burden of themodel

FTIRspectrometer

MultipassgascellPump

Detector

Infraredsource

Computer

Gas sampling hose Gas sampling hose

Target gas

Serial port

NetworkGas distribution

system

Figure 1 Schematic diagram of the standard gas spectra acquisition system

500 1000 1500 2000 2500 3000 3500 4000 4500 5000Wavenumber (cmndash1)

0

5

10

15

20

25

Abs

orba

nce

Figure 2 NIR absorbance spectrum for C2H4 gas from the HITRAN database

4 Journal of Spectroscopy

e MC-UVE-SPA method is a combination method ofMC-UVE and SPA Jiangbo Li et al proved that the com-bination (MC-UVE-SPA) of both Monte Carlo uninfor-mative variable elimination (MC-UVE) and successiveprojections algorithm (SPA) was more effective than MC-UVE or SPA alone [30] Although the effect of UVE-SPA isbetter than that of using UVE or SPA alone there is stillsomething to be improved In this paper the UVE-SPA isimproved by using the wavelength effective continuity andits effectiveness is verified by experiments

Figure 3: Measured background NIR spectrum with the cell filled with nitrogen gas (N2) (x-axis: wavenumber, cm⁻¹; y-axis: intensity, ×10⁴).

Figure 4: Measured C2H4 gas absorbance spectrum (x-axis: wavenumber, cm⁻¹; y-axis: intensity, ×10⁴).

2.4. Proposed MC-UVE-SPA-MW Wavelength Selection Algorithm. The proposed wavelength selection algorithm first applies MC-UVE to the calibration data set to construct a PLS regression model. The threshold of the MC-UVE process is set to provide the number of wavelength variables that minimizes the RMSECV of the constructed PLS regression model. The largest number of principal components (PCs) was set to 10, and the optimal number of PCs was determined based on the minimum RMSECV value. Subsequently, the wavelength variables retained by the MC-UVE algorithm are used as the input of the SPA. Here, an MLR model is constructed from the wavelength variables selected by the SPA for cross-validation analysis, where the number of selected wavelength variables is determined according to the minimum RMSECV of the constructed MLR model. To reduce the number of isolated wavelength variables and maintain the continuity of adjacent informative wavelength points of the near-infrared spectrum, the selection is extended outward from the optimal wavelength points selected by the MC-UVE-SPA. In the original spectrum, each optimal wavelength point selected by the MC-UVE-SPA is used as the starting point of a moving window of width w. The wavelengths in the moving window are used as the selected wavelengths, and the number of wavelengths finally selected varies with the window width. The window width is set to w = 2 (Left), 2 (Right), or 3. Here, 2 (Left) means extending the selected wavelength point to the left to cover 2 wavelength points, 2 (Right) means extending the selected wavelength point to the right to cover 2 wavelength points, and 3 means extending the selected wavelength point to 3 wavelength points with the selected wavelength point as the center. The optimal window width is determined by the minimum RMSECV of the MLR model. The processing flow of the proposed MC-UVE-SPA-MW algorithm and the outward extension of the wavelength points are illustrated in Figure 5.
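The window definitions above can be sketched as a small index-expansion routine. This is only an illustration under stated assumptions: wavelength points are represented by integer indices into the original spectrum, windows are clipped at the spectrum edges, and overlapping windows are merged. The function name and the mode labels "2L", "2R", and "3" are hypothetical shorthand for w = 2 (Left), 2 (Right), and 3.

```python
def expand_selection(selected, n_vars, mode):
    """Extend each selected wavelength index with a moving window.

    selected: indices chosen by the preceding selection step.
    n_vars:   total number of wavelength points in the spectrum.
    mode:     "2L" (point plus left neighbor), "2R" (point plus right
              neighbor), or "3" (point plus both neighbors).
    """
    offsets = {"2L": (-1, 0), "2R": (0, 1), "3": (-1, 0, 1)}[mode]
    expanded = set()
    for idx in selected:
        for off in offsets:
            pos = idx + off
            if 0 <= pos < n_vars:  # clip windows at the spectrum edges
                expanded.add(pos)
    return sorted(expanded)        # merged windows, duplicates removed


# One selected point at index 10 in a 14-point spectrum.
for mode in ("2L", "2R", "3"):
    print(mode, expand_selection([10], 14, mode))
```

Because neighboring windows can overlap, the final count is generally smaller than the number of selected points multiplied by the window width, which is consistent with the statement that the number of wavelengths finally selected varies with the window width.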

3. Results and Discussion

3.1. Corn Spectral Data Experiments. The wavelength variable stability distribution map of the PLS regression model reflecting the oil concentration in corn, constructed for the calibration set using the MC-UVE algorithm, is presented in Figure 6. Here, all wavelengths with stability greater than the threshold value, shown by the horizontal red line in the figure, are selected for use in the model. This threshold was selected to provide the number of wavelength variables corresponding to the minimum RMSECV of the constructed PLS regression model. This is illustrated in Figure 7, where the RMSECV of the constructed PLS regression model is plotted with respect to the number of selected wavelength variables. It can be seen from Figure 7 that the RMSECV is relatively large when the number of wavelength variables is small, and the RMSECV drops sharply as the number of selected variables increases. This is because an overly small number of wavelength variables excludes useful information, and the prediction accuracy of the model therefore improves as an increasing amount of useful information is incorporated into the model. A minimum RMSECV of 0.0289 is obtained when the number of selected wavelength variables is 106, and the RMSECV increases again when the number of variables exceeds 106. This increase results from the impact of selecting an increasing number of uninformative variables on the prediction accuracy of the model. We also note that the RMSECV changes very little when the number of wavelength variables exceeds 300. Thus, the MC-UVE algorithm eliminates a large number of wavelengths that are not related to the oil concentration of corn, where the final number of selected wavelength variables is just 15.1% of the full-spectrum value of 700.
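The threshold choice described above amounts to scanning candidate subset sizes and keeping the one with the lowest RMSECV. The sketch below assumes variables are ranked by absolute stability and that some cross-validation routine is supplied by the caller; `rmsecv_for` is a hypothetical callback standing in for a PLS cross-validation, and the toy error function is invented purely for illustration.

```python
def choose_variable_count(stabilities, rmsecv_for):
    """Rank variables by |stability| and keep the top-k subset with lowest RMSECV.

    stabilities: one stability value per wavelength variable.
    rmsecv_for:  callable taking a list of variable indices and returning
                 the cross-validation error of a model built on them.
    """
    ranked = sorted(range(len(stabilities)),
                    key=lambda j: abs(stabilities[j]), reverse=True)
    best_subset, best_error = None, float("inf")
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        error = rmsecv_for(subset)
        if error < best_error:
            best_subset, best_error = subset, error
    return sorted(best_subset), best_error


def toy_rmsecv(subset):
    """Invented error: missing informative variables cost more than extras."""
    informative = {0, 2}
    missing = len(informative - set(subset))
    extra = len(set(subset) - informative)
    return 0.03 + 0.1 * missing + 0.01 * extra


subset, error = choose_variable_count([5.0, 0.2, -3.0, 0.1], toy_rmsecv)
print(subset, error)
```

The toy error reproduces the qualitative shape described for Figure 7: the error is high when too few variables are kept, reaches a minimum, and rises slowly as uninformative variables are added.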

The 106 wavelength variables selected by MC-UVE are then used as the inputs of the SPA, which iteratively generates wavelength variable combinations using each wavelength as a starting point and applies them to construct an MLR model. The wavelength combination corresponding to the minimum RMSECV of the MLR model is then taken as the optimal wavelength combination. The relationship between the number of selected wavelength variables and the RMSECV of the MLR model constructed from variables selected by the MC-UVE-SPA is shown in Figure 8, where we note that the minimum RMSECV is obtained when the number of selected variables is 37. Thus, the SPA further reduces the number of informative wavelengths, mainly by eliminating collinear variables in the MLR model, where the final number of selected wavelength variables is reduced to just 5.3% of the full-spectrum value of 700.

In the original spectrum, each optimal wavelength point selected by the MC-UVE-SPA is used as the starting point or center of a moving window of width w = 2 (Left), 2 (Right), or 3. The results of the PLS or MLR models constructed using the wavelength variables selected by the different algorithms are shown in Figure 9, and the details are listed in Table 3. In Table 3, the optimal number of PLS principal components was 10. As shown in Table 3, 37 characteristic wavelengths were selected by the MC-UVE-SPA, accounting for only 5.3% of the total number of wavelengths, and the accuracy of the resulting model is better than that of the MC-UVE model, which is attributed to the elimination of wavelength collinearity. The wavelengths selected by the MC-UVE-SPA are then extended by the MC-UVE-SPA-MW algorithm proposed in this paper. When the window width is w = 2 (Left) or w = 3, the model accuracy of the MC-UVE-SPA-MW algorithm is higher than that of the MC-UVE-SPA. When w = 2 (Left), the MC-UVE-SPA-MW expands the 37 wavelength variables selected by the MC-UVE-SPA to 64. At this point, the RMSEP is 0.0381, the r value is 0.9713, the RPD value is 16.3666, and the model is optimal. Although the MC-UVE-SPA-MW provides improved continuity by increasing the number of wavelength variables from those obtained by the MC-UVE-SPA, the final number is still just 9.1% of the full-spectrum value of 700.
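For readers unfamiliar with the evaluation metrics quoted here, the following sketch shows one common way to compute them. It assumes RMSEP is the root mean square difference between measured and predicted values, r is the Pearson correlation coefficient, and RPD is the sample standard deviation of the measured values divided by RMSEP; these are widely used definitions rather than formulas confirmed by this section, and the sample values are invented.

```python
from math import sqrt
from statistics import mean, stdev


def rmsep(measured, predicted):
    """Root mean square error of prediction."""
    return sqrt(mean([(m - p) ** 2 for m, p in zip(measured, predicted)]))


def pearson_r(measured, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    mx, my = mean(measured), mean(predicted)
    cov = sum((a - mx) * (b - my) for a, b in zip(measured, predicted))
    sx = sqrt(sum((a - mx) ** 2 for a in measured))
    sy = sqrt(sum((b - my) ** 2 for b in predicted))
    return cov / (sx * sy)


def rpd(measured, predicted):
    """Ratio of the standard deviation of measured values to RMSEP (assumed definition)."""
    return stdev(measured) / rmsep(measured, predicted)


# Invented values on the scale of the corn oil plots (roughly 3 to 4).
measured = [3.0, 3.5, 4.0]
predicted = [3.1, 3.4, 4.1]
print(rmsep(measured, predicted), pearson_r(measured, predicted), rpd(measured, predicted))
```

A lower RMSEP together with higher r and RPD indicates a better model, which is the sense in which the w = 2 (Left) model is described as optimal above.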

The wavelength variables selected from the NIR spectral absorbance data of a single sample by the MC-UVE, MC-UVE-SPA, and proposed MC-UVE-SPA-MW algorithms are compared in Figure 10. The results in Figure 10 can be understood from the fact that oil is a complex organic molecule whose infrared and NIR spectral absorption occupies a wide wavenumber band ranging over 3900~12000 cm⁻¹ (833~2564 nm). This is mainly caused by the frequency doubling and frequency combinations of the stretching and vibrational energy level transitions of hydrogen-containing groups. From the results of Figure 10, we note that the wavelength variables selected by the MC-UVE, MC-UVE-SPA, and proposed MC-UVE-SPA-MW algorithms are mainly distributed in the ranges 1662~1790, 2222~2268, 2288~2316, 2390~2428, and 2476~2498 nm, which is exactly the range of the spectral absorption peaks generated by the first and second frequency doubling of the -C-H stretching vibrations of the -CH2, -CH3, and -CH-CH- functional groups of oil [31].

We note from Figure 10 that the moving window employed by the MC-UVE-SPA-MW algorithm expands the wavelength variables selected by the MC-UVE-SPA, resulting in a greater number of wavelength variables than that obtained by the MC-UVE-SPA, and the improved continuity of the wavelength variables selected by the MC-UVE-SPA-MW algorithm is very apparent in Figure 10 compared with the wavelength variables selected by the MC-UVE-SPA. We can also note from Table 3 that the full-spectrum model was relatively complicated and its prediction accuracy was the worst of all models considered, due to the impact of the large number of uninformative wavelength variables included within the model. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. We also note from the table that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 2L) algorithm provided the model with the greatest prediction accuracy.

Figure 5: Processing flow of the proposed moving-window-improved MC-UVE-SPA (MC-UVE-SPA-MW) wavelength selection algorithm based on the Monte Carlo uninformative variable elimination (MC-UVE) algorithm cascaded with the successive projections algorithm (SPA). Flowchart steps: Begin, Preprocessing, MC-UVE, SPA, moving-window improvement, End. Inset, extending the wavelength point outward with the window width w: w = 3: 0 0 0 1 0 0 0 0 0 1 1 1 0 0; w = 2L: 0 0 0 1 0 0 0 0 0 1 1 0 0 0; w = 2R: 0 0 0 1 0 0 0 0 0 0 1 1 0 0.

Figure 6: Wavelength variable stability distribution map of the partial least squares (PLS) regression model reflecting the oil concentration using the MC-UVE algorithm (x-axis: wavelength variable; y-axis: stability index).

3.2. Diesel Spectral Data Experiments. The numbers of wavelength variables selected from the NIR spectral data of diesel fuel reflecting the boiling point by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were 262, 30, 83, 58, and 59, respectively, as shown in Table 4. These respectively represent 65.3%, 7.5%, 20.7%, 14.5%, and 14.7% of the 401 wavelength variables included in the full spectrum. The prediction results of the PLS or MLR models constructed from the selected wavelength variables are shown in Figure 11, and the details are listed in Table 4 along with the results obtained for a full-spectrum PLS model. We note from Table 4 and Figure 11 that the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are greatly simplified compared with the full-spectrum model. MC-UVE retains 262 wavelength points, and its prediction accuracy is the worst of all the models considered, which may be due to the existence of wavelength collinearity. When the SPA is used to further screen the wavelength points selected by MC-UVE, only 30 wavelength points are retained, while the prediction accuracy of the model is greatly improved: the RMSEP is reduced to 8.8676, the r value is increased to 0.9341, and the RPD value is increased to 2.4650. We note from Figure 12 that the moving window employed by the MC-UVE-SPA-MW expands the wavelength variables selected by the MC-UVE-SPA and improves the continuity of the selected wavelength variables. When the window width is w = 2 (Left), 2 (Right), or 3, the accuracy of all three models obtained by the MC-UVE-SPA-MW is improved. When w = 3, the MC-UVE-SPA-MW expands the 30 wavelength variables selected by the MC-UVE-SPA to 83. At this point, the RMSEP is reduced to 5.9694, the r value is increased to 0.9752, the RPD value is increased to 3.9994, and the model is optimal. We can also note from Table 4 and Figure 11 that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.

3.3. Ethylene Gas Spectral Data Experiments. The numbers of wavelength variables selected from the spectral data reflecting the C2H4 concentration by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were 214, 17, 48, 34, and 34, respectively, as shown in Figure 13. These respectively represent 4.7%, 0.37%, 1.0%, 0.74%, and 0.74% of the 4601 wavelength variables included in the full spectrum. It can be determined from Figure 13 that more than half of the selected wavelength variables fall within the strong absorption regions in the wavenumber ranges 794~1105 cm⁻¹ and 2917~3242 cm⁻¹. These results can be explained according to the description given on the HITRAN web page, which states that the absorption spectral band of C2H4 gas is in the range of 614~3242 cm⁻¹ and that the two isotopes H2¹²C¹²CH2 and H2¹²C¹³CH2 of C2H4 present strong absorption bands in the wavenumber ranges of 794~1105 cm⁻¹ and 2917~3242 cm⁻¹, respectively. From Figure 4, it can be seen that in some areas that are not C2H4 absorption bands, the spectral intensity has a significant linear relationship with the C2H4 content, which may be due to interference caused by the background spectrum as the C2H4 concentration changes; wavelength points are therefore also selected in some areas outside the C2H4 absorption bands.

The details regarding the prediction results of the PLS or MLR models constructed from the selected wavelength variables are listed in Table 5, along with the results obtained for a full-spectrum PLS model. We again note from the table that the full-spectrum model is more complicated and its prediction accuracy was the worst of all models considered. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. Of the five wavelength selection algorithms, we again note that the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.

Figure 7: Relationship between the root mean square error (RMSE) of the corn oil concentration predicted by the PLS regression model and the number of selected wavelength variables (x-axis: number of wavelength variables; y-axis: RMSECV; marked minimum: 0.028931).

Figure 8: Relationship between the RMSE of the corn oil concentration predicted by the multiple linear regression (MLR) model and the number of selected wavelength variables using the MC-UVE-SPA (x-axis: number of variables included in the model; y-axis: RMSE; final number of selected variables: 37, RMSE = 0.011356).

Figure 9: Comparison of corn oil concentration predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3) (x-axis: measured concentration, %; y-axis: predicted concentration, %).

Table 3: Number of selected wavelengths and corn oil concentration prediction results.

Wavelength selection algorithm | Number of wavelengths | Model type | RMSECV | RMSEP | r | RPD
Full spectrum | 700 | PLS | 0.1431 | 0.1345 | 0.9257 | 1.0550
MC-UVE | 102 | PLS | 0.0289 | 0.1447 | 0.9024 | 0.7948
MC-UVE-SPA | 37 | MLR | 0.0114 | 0.1225 | 0.9267 | 2.0225
MC-UVE-SPA-MW (w = 2L) | 64 | MLR | 0.0035 | 0.0381 | 0.9713 | 16.3666
MC-UVE-SPA-MW (w = 2R) | 64 | MLR | 0.0101 | 0.1009 | 0.8062 | 1.3703
MC-UVE-SPA-MW (w = 3) | 82 | MLR | 0.0089 | 0.0532 | 0.9335 | 6.1627

Figure 10: Wavelengths selected from the NIR spectral data reflecting corn oil concentration by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms (x-axis: wavelength, nm; y-axis: absorbance).

3.4. Summary of the Experimental Results. It can be noted from the above experimental results that the prediction accuracy of the models established using the wavelength selection algorithms is higher than that of the full-spectrum model. The MC-UVE-SPA selects the fewest characteristic wavelengths and eliminates the collinearity between variables. The prediction accuracy of the model established by the MC-UVE-SPA is higher than that of the model established by MC-UVE. The MC-UVE-SPA-MW finally selects more characteristic wavelengths than the MC-UVE-SPA but achieves better prediction accuracy.

Table 4: Number of selected wavelengths and boiling point prediction results.

Wavelength selection algorithm | Number of wavelengths | Model type | RMSECV | RMSEP | r | RPD
Full spectrum | 401 | PLS | 4.6350 | 8.3841 | 0.9103 | 2.3704
MC-UVE | 262 | PLS | 4.3184 | 10.0221 | 0.8311 | 1.7710
MC-UVE-SPA | 30 | MLR | 3.5503 | 8.8676 | 0.9341 | 2.4650
MC-UVE-SPA-MW (w = 2L) | 58 | MLR | 3.9921 | 7.0442 | 0.9534 | 2.8212
MC-UVE-SPA-MW (w = 2R) | 59 | MLR | 4.0024 | 6.8573 | 0.9556 | 2.8982
MC-UVE-SPA-MW (w = 3) | 83 | MLR | 2.6478 | 5.9694 | 0.9752 | 3.9994

The maximum number of PCs was again set to 10 for both PLS models.

Figure 11: Comparison of diesel fuel boiling point predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3) (x-axis: measured boiling point, °C; y-axis: predicted boiling point, °C).

Figure 12: Wavelengths selected from the NIR spectral data reflecting the boiling point of diesel fuel by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms (x-axis: wavelength, nm; y-axis: absorbance).

4. Conclusions

The present study addressed the sparsity of wavelength variables selected by the cascaded MC-UVE-SPA through the application of a moving window, which improved the continuity of the selected wavelength variables and thereby better exploited the advantages of the MC-UVE algorithm and the SPA to obtain regression models with high prediction accuracy. The advantages of the proposed MC-UVE-SPA-MW were demonstrated by applying the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms to the selection of wavelength variables from the NIR spectral absorbance data of corn, diesel fuel, and ethylene; PLS and MLR models reflecting the oil content of corn, the boiling point of diesel fuel, and the ethylene concentration were thereby established and tested. The experimental results demonstrated that the progressive elimination of uncorrelated and collinear variables generated increasingly simplified partial-spectrum models with greater prediction accuracy than the full-spectrum model. Among the three wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelength variables, and the proposed MC-UVE-SPA-MW algorithm provided models with the greatest prediction accuracy.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by grants from the Major National Science and Technology Special Project of China (JZ2015KJZZ0254) and the Key Projects of Natural Science Research in Universities in Anhui, China (KJ2018A0544).

Figure 13: Measured C2H4 gas absorbance spectra and wavelength variables selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms (x-axis: wavenumber, cm⁻¹; y-axis: absorbance, ×10⁴; legend: Original, MC-UVE, MC-UVE-SPA, MC-UVE-SPA-MW).

Table 5: Number of selected wavelengths and C2H4 gas concentration prediction results.

Wavelength selection algorithm | Number of wavelengths | Model type | RMSECV | RMSEP | r
Full spectrum | 4601 | PLS | 0.1431 | 0.1342 | 0.9801
MC-UVE | 214 | PLS | 0.1009 | 0.1212 | 0.9887
MC-UVE-SPA | 17 | MLR | 0.1375 | 0.0537 | 0.9913
MC-UVE-SPA-MW (w = 2L) | 34 | MLR | 0.0989 | 0.0495 | 0.9988
MC-UVE-SPA-MW (w = 2R) | 34 | MLR | 0.0988 | 0.0486 | 0.9990
MC-UVE-SPA-MW (w = 3) | 48 | MLR | 0.1005 | 0.0434 | 0.9999

The maximum number of PCs was again set to 10 for both PLS models.

References

[1] C. Pasquini, "Near infrared spectroscopy: fundamentals, practical aspects, and analytical applications," Journal of the Brazilian Chemical Society, vol. 14, no. 2, pp. 198–219, 2003.

[2] X. Sun, M. Zhou, and Y. Sun, "Spectroscopy quantitative analysis cotton content of blend fabrics," International Journal of Clothing Science and Technology, vol. 28, no. 1, pp. 65–76, 2016.

[3] M. Schwanninger, J. C. Rodrigues, and K. Fackler, "A review of band assignments in near infrared spectra of wood and wood components," Journal of Near Infrared Spectroscopy, vol. 19, no. 5, pp. 287–308, 2011.

[4] Y.-H. Yun, Y.-C. Wei, X.-B. Zhao, W.-J. Wu, Y.-Z. Liang, and H.-M. Lu, "A green method for the quantification of polysaccharides in Dendrobium officinale," RSC Advances, vol. 5, no. 127, pp. 105057–105065, 2015.

[5] C. K. Vance, D. R. Tolleson, K. Kinoshita, J. Rodriguez, and W. J. Foley, "Near infrared spectroscopy in wildlife and biodiversity," Journal of Near Infrared Spectroscopy, vol. 24, no. 1, pp. 1–25, 2016.

[6] H. Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, vol. 24, no. 6, pp. 417–441, 1933.

[7] P. Geladi and B. R. Kowalski, "Partial least-squares regression: a tutorial," Analytica Chimica Acta, vol. 185, no. 1, pp. 1–17, 1986.

[8] A. A. Kardamakis and N. Pasadakis, "Autoregressive modeling of near-IR spectra and MLR to predict RON values of gasolines," Fuel, vol. 89, no. 1, pp. 158–161, 2010.

[9] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[10] B. Lu, N. Liu, H. Li et al., "Quantitative determination and characteristic wavelength selection of available nitrogen in coco-peat by NIR spectroscopy," Soil and Tillage Research, vol. 191, pp. 266–274, 2019.

[11] M. J. Anzanello and F. S. Fogliatto, "A review of recent variable selection methods in industrial and chemometrics applications," European Journal of Industrial Engineering, vol. 8, no. 5, p. 619, 2014.

[12] Y.-H. Yun, H.-D. Li, B.-C. Deng, and D.-S. Cao, "An overview of variable selection methods in multivariate analysis of near-infrared spectra," TrAC Trends in Analytical Chemistry, vol. 113, pp. 102–115, 2019.

[13] B. Nadler and R. R. Coifman, "The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration," Journal of Chemometrics, vol. 19, no. 2, pp. 107–118, 2005.

[14] T. Mehmood, K. H. Liland, L. Snipen, and S. Sæbø, "A review of variable selection methods in Partial Least Squares Regression," Chemometrics and Intelligent Laboratory Systems, vol. 118, no. 8, pp. 62–69, 2012.

[15] V. Centner, D.-L. Massart, O. E. de Noord, S. de Jong, B. M. Vandeginste, and C. Sterna, "Elimination of uninformative variables for multivariate calibration," Analytical Chemistry, vol. 68, no. 21, pp. 3851–3858, 1996.

[16] G. Tang, Y. Huang, K. Tian et al., "A new spectral variable selection pattern using competitive adaptive reweighted sampling combined with successive projections algorithm," The Analyst, vol. 139, no. 19, pp. 4894–4902, 2014.

[17] Y. Li, Y. Guo, C. Liu et al., "SPA combined with swarm intelligence optimization algorithms for wavelength variable selection to rapidly discriminate the adulteration of apple juice," Food Analytical Methods, vol. 10, no. 6, pp. 1965–1971, 2017.

[18] J.-B. Li, C.-J. Zhao, W.-Q. Huang et al., "A combination algorithm for variable selection to determine soluble solid content and firmness of pears," Analytical Methods, vol. 6, no. 7, pp. 2170–2180, 2014.

[19] Z. Xiaobo, Z. Jiewen, M. Hanpin, S. Jiyong, Y. Xiaopin, and L. Yanxiao, "Genetic algorithm interval partial least squares regression combined successive projections algorithm for variable selection in near-infrared quantitative analysis of pigment in cucumber leaves," Applied Spectroscopy, vol. 64, no. 7, pp. 786–794, 2010.

[20] S. Ye, D. Wang, and S. Min, "Successive projections algorithm combined with uninformative variable elimination for spectral variable selection," Chemometrics and Intelligent Laboratory Systems, vol. 91, no. 2, pp. 194–199, 2008.

[21] B.-C. Deng, Y.-H. Yun, P. Ma, C.-C. Lin, D.-B. Ren, and Y.-Z. Liang, "A new method for wavelength interval selection that intelligently optimizes the locations, widths and combinations of the intervals," The Analyst, vol. 140, no. 6, pp. 1876–1885, 2015.

[22] W. Fan, Y.-Y. Li, Y.-K. Peng et al., "Nondestructive determination of lycopene content based on visible/near infrared transmission spectrum," Chinese Journal of Analytical Chemistry, vol. 46, no. 9, pp. 1424–1431, 2018.

[23] Z. Sun, J. Fan, J. Wang et al., "Assessment of the human albumin in acid precipitation process using NIRS and multi-variable selection methods combined with SPA," Journal of Molecular Structure, vol. 1199, p. 126942, 2020.

[24] H.-D. Li, Q.-S. Xu, and Y.-Z. Liang, "libPLS: an integrated library for partial least squares regression and linear discriminant analysis," Chemometrics and Intelligent Laboratory Systems, vol. 176, pp. 34–43, 2018.

[25] R. Zhang, Y.-Y. Chen, Z.-B. Wang, and L. Kewu, "A novel ensemble L1 regularization based variable selection framework with an application in near infrared spectroscopy," Chemometrics and Intelligent Laboratory Systems, vol. 163, pp. 7–15, 2017.

[26] W. Cai, Y. Li, and X. Shao, "A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra," Chemometrics and Intelligent Laboratory Systems, vol. 90, no. 2, pp. 188–194, 2008.

[27] L. M. Bregman, "Finding the common point of convex sets by the method of successive projection," Proceedings of the USSR Academy of Sciences, vol. 162, no. 3, pp. 487–490, 1965.

[28] X. Peng, T. Shi, A. Song, Y. Chen, and W. Gao, "Estimating soil organic carbon using VIS/NIR spectroscopy with SVMR and SPA methods," Remote Sensing, vol. 6, no. 4, pp. 2699–2717, 2014.

[29] Y.-H. Liu, Q.-Q. Wang, X.-W. Gao, and A.-G. Xie, "Total phenolic content prediction in Flos Lonicerae using hyperspectral imaging combined with wavelengths selection methods," Journal of Food Process Engineering, vol. 42, no. 6, Article ID e13224, 2019.

[30] J. Li, H. Zhang, B. Zhan, Y. Zhang, R. Li, and J. Li, "Non-destructive firmness measurement of the multiple cultivars of pears by Vis-NIR spectroscopy coupled with multivariate calibration analysis and MC-UVE-SPA method," Infrared Physics & Technology, vol. 104, Article ID 103154, 2020.

[31] P. Hourant, V. Baeten, M. T. Morales, M. Meurens, and R. Aparicio, "Oil and fat classification by selected bands of near-infrared spectroscopy," Applied Spectroscopy, vol. 54, no. 8, pp. 1168–1174, 2000.

12 Journal of Spectroscopy

Page 5: Moving-Window-ImprovedMonteCarloUninformativeVariable ... · 2020. 8. 3. · to the NIR spectroscopic analysis of nicotine in tobacco lamina and active pharmaceutical ingredients

e MC-UVE-SPA method is a combination method ofMC-UVE and SPA Jiangbo Li et al proved that the com-bination (MC-UVE-SPA) of both Monte Carlo uninfor-mative variable elimination (MC-UVE) and successiveprojections algorithm (SPA) was more effective than MC-UVE or SPA alone [30] Although the effect of UVE-SPA isbetter than that of using UVE or SPA alone there is stillsomething to be improved In this paper the UVE-SPA isimproved by using the wavelength effective continuity andits effectiveness is verified by experiments

24 Proposed MC-UVE-SPA-MW Wavelength SelectionAlgorithm e proposed wavelength selection algorithmfirst applies MC-UVE to the calibration data set to construct

a PLS regression model e threshold of the MC-UVEprocess is set to provide a number of wavelength variablesthat minimize the RMSECV of the constructed PLS re-gressionmodele largest number of principal components(PCs) was set to 10 and the optimal number of PCs wasdetermined based on the minimum RMSECV value Sub-sequently the wavelength variables retained by the MC-UVE algorithm are applied as the input of the SPA Here anMLRmodel is constructed based on the wavelength variablesselected by the SPA for conducting cross-validation analysiswhere the number of selected wavelength variables is de-termined according to the minimum of the RMSECV of theconstructed MLR model In order to reduce the number ofisolated wavelength variables and maintain the continuity ofadjacent information wavelength points of near-infrared

500 1000 1500 2000 2500 3000 3500 4000 4500 5000Wavenumber (cmndash1)

05

1

15

2

25

3

35

Inte

nsity

times104

Figure 3 Measured background NIR spectrum with the cell filled with nitrogen gas (N2)

5000

05

1

15

25

35

2

3

4

1000 1500 2000 3000 4000 50002500 3500 4500

Inte

nsity

Wavenumber (cmndash1)

times104

Figure 4 Measured C2H4 gas absorbance spectrum

Journal of Spectroscopy 5

spectrum it extends outward from the best wavelength pointselected by UVE-SPA In the original spectrum the optimalwavelength point selected by the MC-UVE-SPA is used asthe starting point of a moving window of width w ewavelengths in the moving window are used as the selectedwavelengths and the number of wavelengths finally selectedvaries with the window width Set the window width w 2(Left) 2 (Right) or 3 Here 2 (Left) means to extend theselected wavelength point to the left to 2 wavelength points2 (Right) means to extend the selected wavelength point tothe right to 2 wavelength points and 3 means to extend theselected wavelength point to 3 wavelength points using theselected wavelength point as the center point e optimalwindow width is determined by the minimum RMSECV ofthe MLR model e processing flow of the proposed MC-UVE-SPA-MW algorithm and extending the wavelengthpoint outward are illustrated in Figure 5

3 Results and Discussion

31 Corn Spectral Data Experiments e wavelength vari-able stability distribution map of the PLS regression modelreflecting the oil concentration in corn constructed forcalibration set using the MC-UVE algorithm is presented inFigure 6 Here all wavelengths greater than the thresholdvalue shown by the horizontal red line in the figure areselected for use in the model is threshold was selected toprovide the number of wavelength variables correspondingto theminimumRMSECV of the constructed PLS regressionmodel is is illustrated in Figure 7 where the RMSECV ofthe constructed PLS regression model is plotted with respectto the number of selected wavelength variables It can beseen from Figure 7 that the RMSECV is relatively large whenthe number of wavelength variables is small and theRMSECV drops sharply as the number of selected variablesincreases is is because an overly small number ofwavelength variables exclude useful information and theprediction accuracy of the model is therefore improved as anincreasing amount of useful information is incorporatedinto the model A minimum value of RMSECV 00289 isobtained when the number of selected wavelength variablesis 106 and the RMSECV increases again when the number ofvariables exceeds 106 is increase results from the impactof selecting an increasing number of uninformative variableson the prediction accuracy of the model We also note thatthe RMSECV changes very little when the number ofwavelength variables exceeds 300 us the MC-UVE al-gorithm eliminates a large number of wavelengths that arenot related to the oil concentration of corn where the finalnumber of selected wavelength variables is just 151 of thefull-spectrum value of 700

The 106 wavelength variables selected by MC-UVE are then used as the inputs of the SPA, which iteratively generates wavelength variable combinations using each wavelength as a starting point and applies them to construct an MLR model. The wavelength combination corresponding to the minimum RMSECV of the MLR model is then taken as the optimal wavelength combination. The relationship between the number of selected wavelength variables and the RMSECV of the MLR model constructed from variables selected by the MC-UVE-SPA is shown in Figure 8, where the minimum RMSECV is obtained when the number of selected variables is 37. Thus, the SPA further reduces the number of informative wavelengths, mainly by eliminating collinear variables in the MLR model, and the final number of selected wavelength variables is reduced to just 5.3% of the full-spectrum value of 700.
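How the SPA avoids collinear variables can be illustrated with a small sketch. It assumes the standard successive-projections rule (project the remaining columns onto the subspace orthogonal to the last selected column, then pick the column with the largest remaining norm); the function name `spa_chain` and the toy matrix are hypothetical. Only one chain from one starting variable is built here, whereas the procedure described above builds a chain from every starting wavelength and keeps the one whose MLR model has the lowest RMSECV.

```python
def spa_chain(X, start, n_select):
    """Return indices of n_select columns of X chosen by successive projections.

    X is a list of rows (samples x variables); start is the first column index.
    """
    n_vars = len(X[0])
    # Copy the columns so that the projections do not modify X.
    cols = [[row[j] for row in X] for j in range(n_vars)]
    selected = [start]
    while len(selected) < n_select:
        ref = cols[selected[-1]]
        ref_norm2 = sum(v * v for v in ref)
        if ref_norm2 == 0.0:
            break  # nothing left to project onto
        best_j, best_norm2 = None, -1.0
        for j in range(n_vars):
            if j in selected:
                continue
            # Remove from column j its component along the last selected column.
            scale = sum(a * b for a, b in zip(cols[j], ref)) / ref_norm2
            cols[j] = [a - scale * b for a, b in zip(cols[j], ref)]
            norm2 = sum(v * v for v in cols[j])
            if norm2 > best_norm2:
                best_j, best_norm2 = j, norm2
        if best_j is None:
            break
        selected.append(best_j)
    return selected

# Toy matrix: column 1 is exactly twice column 0, so it is fully collinear.
X = [
    [1.0, 2.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.5],
]
chain = spa_chain(X, start=0, n_select=3)
print(chain)
```

The collinear column is projected to zero after the first step and is therefore never chosen, which is the mechanism behind the reduction from 106 to 37 variables reported above.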

In the original spectrum, each optimal wavelength point selected by the MC-UVE-SPA is used as the starting point or center of a moving window of width w = 2 (Left), 2 (Right), or 3. The results of the PLS or MLR models constructed using the wavelength variables selected by the different algorithms are shown in Figure 9, and the details are listed in Table 3 along with the results obtained for the different models. In Table 3, the optimal number of PLS principal components was 10. As shown in Table 3, the MC-UVE-SPA selected 37 characteristic wavelengths, accounting for only 5.3% of the total number of wavelengths, and its accuracy is better than that of the MC-UVE algorithm, which is due to the elimination of wavelength collinearity. The MC-UVE-SPA-MW algorithm proposed in this paper extends the wavelengths selected by the MC-UVE-SPA. When the window width is w = 2 (Left) or w = 3, the model accuracy of the MC-UVE-SPA-MW algorithm is higher than that of the MC-UVE-SPA. When w = 2 (Left), the MC-UVE-SPA-MW expands the 37 wavelength variables selected by the MC-UVE-SPA to 64. At this point, the RMSEP is 0.0381, the r value is 0.9713, the RPD value is 16.3666, and the model is optimal. Although the MC-UVE-SPA-MW improves continuity by increasing the number of wavelength variables beyond those obtained by the MC-UVE-SPA, the final number is still just 9.1% of the full-spectrum value of 700.
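The window expansion itself is simple enough to sketch. The example below assumes the following reading of the three settings, which matches the example masks in Figure 5: "2L" keeps each selected point plus its left neighbor, "2R" keeps it plus its right neighbor, and "3" keeps it plus both neighbors. The function name `expand_selection` and the mode strings are hypothetical.

```python
def expand_selection(selected, n_vars, mode):
    """Expand each selected index by a moving window.

    mode: "2L" (point + left neighbor), "2R" (point + right neighbor),
          "3" (point centered in a three-point window).
    """
    offsets = {"2L": (-1, 0), "2R": (0, 1), "3": (-1, 0, 1)}[mode]
    expanded = set()
    for idx in selected:
        for off in offsets:
            j = idx + off
            if 0 <= j < n_vars:  # stay inside the spectrum
                expanded.add(j)
    return sorted(expanded)

# Two sparse points in a 14-variable toy spectrum.
points = [3, 10]
for mode in ("2L", "2R", "3"):
    print(mode, expand_selection(points, 14, mode))
```

Because windows of neighboring points can overlap or run into the edge of the spectrum, the expanded count is at most, not exactly, the window width times the number of selected points; this is consistent with 37 points growing to 64 rather than 74 for w = 2 (Left).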

The wavelength variables selected from the NIR spectral absorbance data of a single sample by the MC-UVE, MC-UVE-SPA, and proposed MC-UVE-SPA-MW algorithms are compared in Figure 10. The results in Figure 10 follow from the fact that oil is a complex organic molecule whose infrared and NIR absorption occupies a wide wavenumber band of 3900∼12000 cm⁻¹ (833∼2564 nm). This absorption is mainly caused by the frequency doubling and frequency combinations of the stretching and vibrational energy level transitions of hydrogen-containing groups. Figure 10 shows that the wavelength variables selected by the MC-UVE, MC-UVE-SPA, and proposed MC-UVE-SPA-MW algorithms are mainly distributed within 1662∼1790, 2222∼2268, 2288∼2316, 2390∼2428, and 2476∼2498 nm, which is exactly the range of the spectral absorption peaks generated by the first and second frequency doubling of the -C-H stretching vibrations of the -CH₂, -CH₃, and -CH=CH- functional groups of oil [31].
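The two ways of quoting the absorption band are related by a fixed conversion, since 1 cm equals 10⁷ nm. The short check below reproduces the wavelength limits from the wavenumber limits; the function name `wavenumber_to_nm` is illustrative only.

```python
def wavenumber_to_nm(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in nm (1 cm = 1e7 nm)."""
    return 1e7 / wavenumber_cm1

# The high-wavenumber limit maps to the short-wavelength limit and vice versa.
for nu in (12000, 3900):
    print(nu, "cm^-1 ->", round(wavenumber_to_nm(nu)), "nm")
```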

Figure 10 shows that the moving window employed by the MC-UVE-SPA-MW algorithm expands the wavelength variables selected by the MC-UVE-SPA, resulting in a greater number of wavelength variables than that obtained by the MC-UVE-SPA, and the improved continuity of the wavelength variables selected by the MC-UVE-SPA-MW algorithm is very apparent when compared with those selected by the MC-UVE-SPA. We can also note from Table 3 that the full-spectrum model was relatively complicated and its prediction accuracy was the worst of all models considered, owing to the large number of uninformative wavelength variables included within the model. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. We also note from the table that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths, and the MC-UVE-SPA-MW (w = 2L) algorithm provided the model with the greatest prediction accuracy.

Figure 5: Processing flow of the proposed moving-window-improved MC-UVE-SPA (MC-UVE-SPA-MW) wavelength selection algorithm, based on the Monte Carlo uninformative variable elimination (MC-UVE) algorithm cascaded with the successive projections algorithm (SPA). Flow: begin, preprocessing, MC-UVE, SPA, moving-window improvement (extending the wavelength point outward with the window width w), end. Example selection masks: w = 3: 0 0 0 1 0 0 0 0 0 1 1 1 0 0; w = 2L: 0 0 0 1 0 0 0 0 0 1 1 0 0 0; w = 2R: 0 0 0 1 0 0 0 0 0 0 1 1 0 0.

Figure 6: Wavelength variable stability distribution map of the partial least squares (PLS) regression model reflecting the oil concentration using the MC-UVE algorithm (x-axis: wavelength variable; y-axis: stability index).

3.2. Diesel Spectral Data Experiments. The numbers of wavelength variables selected from the NIR spectral data of diesel fuel reflecting the boiling point by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were 262, 30, 83, 58, and 59, respectively, as shown in Table 4. These represent 65.3%, 7.5%, 20.7%, 14.5%, and 14.7%, respectively, of the 401 wavelength variables included in the full spectrum. The prediction results of the PLS or MLR models constructed from the selected wavelength variables are shown in Figure 11, and the details are listed in Table 4 along with the results obtained for a full-spectrum PLS model. We note from Table 4 and Figure 11 that the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are greatly simplified compared with the full-spectrum model. MC-UVE retains 262 wavelength points, and its prediction accuracy is the worst of all the models considered, which may be due to wavelength collinearity. When the SPA is used to further screen the wavelength points selected by MC-UVE, only 30 wavelength points are retained, while the prediction accuracy of the model is greatly improved: the RMSEP is reduced to 8.8676, the r value is increased to 0.9341, and the RPD value is increased to 2.4650. We note from Figure 12 that the moving window employed by the MC-UVE-SPA-MW expands the wavelength variables selected by the MC-UVE-SPA and improves the continuity of the selected wavelength variables. For window widths w = 2 (Left), 2 (Right), and 3, the accuracy of all three models obtained by the MC-UVE-SPA-MW is improved. When w = 3, the MC-UVE-SPA-MW expands the 30 wavelength variables selected by the MC-UVE-SPA to 83. At this point, the RMSEP is reduced to 5.9694, the r value is increased to 0.9752, the RPD value is increased to 3.9994, and the model is optimal. We can also note from Table 4 and Figure 11 that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths, and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.
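The three figures of merit quoted for the diesel models can be computed as sketched below. The definitions are the conventional ones and are assumptions here, since this section does not restate them: RMSEP is the root mean square error on the prediction set, r is the Pearson correlation between measured and predicted values, and RPD is taken as the sample standard deviation of the measured values divided by the RMSEP. The function names and the four toy boiling points are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def rmsep(measured, predicted):
    """Root mean square error of prediction."""
    return sqrt(mean([(m - p) ** 2 for m, p in zip(measured, predicted)]))

def pearson_r(measured, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    mm, mp = mean(measured), mean(predicted)
    cov = sum((m - mm) * (p - mp) for m, p in zip(measured, predicted))
    var_m = sum((m - mm) ** 2 for m in measured)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / sqrt(var_m * var_p)

def rpd(measured, predicted):
    """Assumed definition: standard deviation of the reference values / RMSEP."""
    return stdev(measured) / rmsep(measured, predicted)

# Toy boiling points in degrees Celsius (illustrative values only).
measured = [200.0, 220.0, 250.0, 280.0]
predicted = [205.0, 215.0, 255.0, 275.0]
print(rmsep(measured, predicted), pearson_r(measured, predicted), rpd(measured, predicted))
```

A lower RMSEP together with higher r and RPD values indicates a better model, which is the sense in which the w = 3 model is called optimal above.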

3.3. Ethylene Gas Spectral Data Experiments. The numbers of wavelength variables selected from the spectral data reflecting the C₂H₄ concentration by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were 214, 17, 48, 34, and 34, respectively, as shown in Figure 13. These represent 4.7%, 0.37%, 1.0%, 0.74%, and 0.74%, respectively, of the 4601 wavelength variables included in the full spectrum. Figure 13 shows that more than half of the selected wavelength variables fall within the strong absorption regions in the wavenumber ranges 794∼1105 cm⁻¹ and 2917∼3242 cm⁻¹. These results can be explained by the description given on the HITRAN web page, which states that the absorption band of C₂H₄ gas lies in the range 614∼3242 cm⁻¹ and that the two isotopologues H₂¹²C¹²CH₂ and H₂¹²C¹³CH₂ of C₂H₄ present strong absorption bands in the wavenumber ranges 794∼1105 cm⁻¹ and 2917∼3242 cm⁻¹, respectively. Figure 4 shows that, in some regions outside the C₂H₄ absorption bands, the spectral intensity has a significant linear relationship with the C₂H₄ content, which may be due to interference from the background spectrum as the C₂H₄ concentration changes; wavelength points are therefore also selected in some regions that are not C₂H₄ absorption bands.
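The proportions quoted at the start of this subsection are simple ratios of the selected counts to the 4601 full-spectrum variables. The sketch below recomputes them; the helper name `percent_of_spectrum` is hypothetical, and the number of decimals per entry is chosen to mirror the precision used in the text.

```python
def percent_of_spectrum(n_selected, n_total, digits):
    """Share of the full spectrum retained, in percent."""
    return round(100.0 * n_selected / n_total, digits)

N_TOTAL = 4601  # full-spectrum variables for the ethylene data
# (label, number of selected variables, decimals used in the text)
cases = [
    ("MC-UVE", 214, 1),
    ("MC-UVE-SPA", 17, 2),
    ("MC-UVE-SPA-MW (w = 3)", 48, 1),
    ("MC-UVE-SPA-MW (w = 2L)", 34, 2),
    ("MC-UVE-SPA-MW (w = 2R)", 34, 2),
]
for label, n_selected, digits in cases:
    print(label, percent_of_spectrum(n_selected, N_TOTAL, digits))
```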

The details of the prediction results of the PLS or MLR models constructed from the selected wavelength variables are listed in Table 5, along with the results obtained for a full-spectrum PLS model. We again note from the table that the full-spectrum model is more complicated and that its prediction accuracy was the worst of all models considered. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. Of the five wavelength selection algorithms, we again note that the MC-UVE-SPA selected the fewest wavelengths, and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.

Figure 7: Relationship between the root mean square error (RMSE) of the corn oil concentration predicted by the PLS regression model and the number of selected wavelength variables (x-axis: number of wavelength variables; y-axis: RMSECV; marked minimum: 0.028931).

Figure 8: Relationship between the RMSE of the corn oil concentration predicted by the multiple linear regression (MLR) model and the number of selected wavelength variables using the MC-UVE-SPA (x-axis: number of variables included in the model; y-axis: RMSE; final number of selected variables: 37, RMSE = 0.011356).

3.4. Summary of the Experimental Results. The above experimental results show that the prediction accuracy of the models established by the wavelength selection algorithms is higher than that of the full-spectrum model. The MC-UVE-SPA selects the fewest characteristic wavelengths and eliminates the collinearity between variables. The prediction accuracy of the model established by the MC-UVE-SPA is higher than that established by MC-UVE. The MC-UVE-SPA-MW finally selects more characteristic wavelengths than the MC-UVE-SPA but achieves better prediction accuracy.

Figure 9: Comparison of corn oil concentration predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3) models (each panel: predicted concentration (%) versus measured concentration (%)).

Table 3: Number of selected wavelengths and corn oil concentration prediction results.

Wavelength selection algorithm | Number of wavelengths | Model type | RMSECV | RMSEP | r | RPD
Full spectrum | 700 | PLS | 0.1431 | 0.1345 | 0.9257 | 1.0550
MC-UVE | 102 | PLS | 0.0289 | 0.1447 | 0.9024 | 0.7948
MC-UVE-SPA | 37 | MLR | 0.0114 | 0.1225 | 0.9267 | 2.0225
MC-UVE-SPA-MW (w = 2L) | 64 | MLR | 0.0035 | 0.0381 | 0.9713 | 16.3666
MC-UVE-SPA-MW (w = 2R) | 64 | MLR | 0.0101 | 0.1009 | 0.8062 | 1.3703
MC-UVE-SPA-MW (w = 3) | 82 | MLR | 0.0089 | 0.0532 | 0.9335 | 6.1627

Figure 10: Wavelengths selected from the NIR spectral data reflecting corn oil concentration by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms (each panel: absorbance versus wavelength in nm).

Table 4: Number of selected wavelengths and boiling point prediction results.

Wavelength selection algorithm | Number of wavelengths | Model type | RMSECV | RMSEP | r | RPD
Full spectrum | 401 | PLS | 4.6350 | 8.3841 | 0.9103 | 2.3704
MC-UVE | 262 | PLS | 4.3184 | 10.0221 | 0.8311 | 1.7710
MC-UVE-SPA | 30 | MLR | 3.5503 | 8.8676 | 0.9341 | 2.4650
MC-UVE-SPA-MW (w = 2L) | 58 | MLR | 3.9921 | 7.0442 | 0.9534 | 2.8212
MC-UVE-SPA-MW (w = 2R) | 59 | MLR | 4.0024 | 6.8573 | 0.9556 | 2.8982
MC-UVE-SPA-MW (w = 3) | 83 | MLR | 2.6478 | 5.9694 | 0.9752 | 3.9994

The maximum number of PCs was again set to 10 for both PLS models.

Figure 11: Comparison of diesel fuel boiling point predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3) models (each panel: predicted boiling point (°C) versus measured boiling point (°C)).

Figure 12: Wavelengths selected from the NIR spectral data reflecting the boiling point of diesel fuel by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms (each panel: absorbance versus wavelength in nm).

4. Conclusions

The present study addressed the sparsity of the wavelength variables selected by the cascaded MC-UVE-SPA through the application of a moving window, which improved the continuity of the selected wavelength variables and thereby better exploited the advantages of the MC-UVE algorithm and the SPA to obtain regression models with high prediction accuracy. The advantages of the proposed MC-UVE-SPA-MW were demonstrated by applying the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms to the selection of wavelength variables from the NIR spectral absorbance data of corn, diesel fuel, and ethylene; PLS and MLR models reflecting the oil content of corn, the boiling point of diesel fuel, and the ethylene concentration were thereby established and tested. The experimental results demonstrated that the progressive elimination of uncorrelated and collinear variables generated increasingly simplified partial-spectrum models with greater prediction accuracy than the full-spectrum model. Among the three wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelength variables, and the proposed MC-UVE-SPA-MW algorithm provided models with the greatest prediction accuracy.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by grants from the Major National Science and Technology Special Project of China (JZ2015KJZZ0254) and the Key Projects of Natural Science Research in Universities in Anhui, China (KJ2018A0544).

Figure 13: Measured C₂H₄ gas absorbance spectra and wavelength variables selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms (x-axis: wavenumber in cm⁻¹; y-axis: absorbance; legend: original, MC-UVE, MC-UVE-SPA, MC-UVE-SPA-MW).

Table 5: Number of selected wavelengths and C₂H₄ gas concentration prediction results.

Wavelength selection algorithm | Number of wavelengths | Model type | RMSECV | RMSEP | r
Full spectrum | 4601 | PLS | 0.1431 | 0.1342 | 0.9801
MC-UVE | 214 | PLS | 0.1009 | 0.1212 | 0.9887
MC-UVE-SPA | 17 | MLR | 0.1375 | 0.0537 | 0.9913
MC-UVE-SPA-MW (w = 2L) | 34 | MLR | 0.0989 | 0.0495 | 0.9988
MC-UVE-SPA-MW (w = 2R) | 34 | MLR | 0.0988 | 0.0486 | 0.9990
MC-UVE-SPA-MW (w = 3) | 48 | MLR | 0.1005 | 0.0434 | 0.9999

The maximum number of PCs was again set to 10 for both PLS models.

References

[1] C. Pasquini, "Near infrared spectroscopy: fundamentals, practical aspects and analytical applications," Journal of the Brazilian Chemical Society, vol. 14, no. 2, pp. 198–219, 2003.

[2] X. Sun, M. Zhou, and Y. Sun, "Spectroscopy quantitative analysis cotton content of blend fabrics," International Journal of Clothing Science and Technology, vol. 28, no. 1, pp. 65–76, 2016.

[3] M. Schwanninger, J. C. Rodrigues, and K. Fackler, "A review of band assignments in near infrared spectra of wood and wood components," Journal of Near Infrared Spectroscopy, vol. 19, no. 5, pp. 287–308, 2011.

[4] Y.-H. Yun, Y.-C. Wei, X.-B. Zhao, W.-J. Wu, Y.-Z. Liang, and H.-M. Lu, "A green method for the quantification of polysaccharides in Dendrobium officinale," RSC Advances, vol. 5, no. 127, pp. 105057–105065, 2015.

[5] C. K. Vance, D. R. Tolleson, K. Kinoshita, J. Rodriguez, and W. J. Foley, "Near infrared spectroscopy in wildlife and biodiversity," Journal of Near Infrared Spectroscopy, vol. 24, no. 1, pp. 1–25, 2016.

[6] H. Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, vol. 24, no. 6, pp. 417–441, 1933.

[7] P. Geladi and B. R. Kowalski, "Partial least-squares regression: a tutorial," Analytica Chimica Acta, vol. 185, no. 1, pp. 1–17, 1986.

[8] A. A. Kardamakis and N. Pasadakis, "Autoregressive modeling of near-IR spectra and MLR to predict RON values of gasolines," Fuel, vol. 89, no. 1, pp. 158–161, 2010.

[9] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[10] B. Lu, N. Liu, H. Li et al., "Quantitative determination and characteristic wavelength selection of available nitrogen in coco-peat by NIR spectroscopy," Soil and Tillage Research, vol. 191, pp. 266–274, 2019.

[11] M. J. Anzanello and F. S. Fogliatto, "A review of recent variable selection methods in industrial and chemometrics applications," European Journal of Industrial Engineering, vol. 8, no. 5, p. 619, 2014.

[12] Y.-H. Yun, H.-D. Li, B.-C. Deng, and D.-S. Cao, "An overview of variable selection methods in multivariate analysis of near-infrared spectra," TrAC Trends in Analytical Chemistry, vol. 113, pp. 102–115, 2019.

[13] B. Nadler and R. R. Coifman, "The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration," Journal of Chemometrics, vol. 19, no. 2, pp. 107–118, 2005.

[14] T. Mehmood, K. H. Liland, L. Snipen, and S. Sæbø, "A review of variable selection methods in Partial Least Squares Regression," Chemometrics and Intelligent Laboratory Systems, vol. 118, no. 8, pp. 62–69, 2012.

[15] V. Centner, D.-L. Massart, O. E. de Noord, S. de Jong, B. M. Vandeginste, and C. Sterna, "Elimination of uninformative variables for multivariate calibration," Analytical Chemistry, vol. 68, no. 21, pp. 3851–3858, 1996.

[16] G. Tang, Y. Huang, K. Tian et al., "A new spectral variable selection pattern using competitive adaptive reweighted sampling combined with successive projections algorithm," The Analyst, vol. 139, no. 19, pp. 4894–4902, 2014.

[17] Y. Li, Y. Guo, C. Liu et al., "SPA combined with swarm intelligence optimization algorithms for wavelength variable selection to rapidly discriminate the adulteration of apple juice," Food Analytical Methods, vol. 10, no. 6, pp. 1965–1971, 2017.

[18] J.-B. Li, C.-J. Zhao, W.-Q. Huang et al., "A combination algorithm for variable selection to determine soluble solid content and firmness of pears," Analytical Methods, vol. 6, no. 7, pp. 2170–2180, 2014.

[19] Z. Xiaobo, Z. Jiewen, M. Hanpin, S. Jiyong, Y. Xiaopin, and L. Yanxiao, "Genetic algorithm interval partial least squares regression combined successive projections algorithm for variable selection in near-infrared quantitative analysis of pigment in cucumber leaves," Applied Spectroscopy, vol. 64, no. 7, pp. 786–794, 2010.

[20] S. Ye, D. Wang, and S. Min, "Successive projections algorithm combined with uninformative variable elimination for spectral variable selection," Chemometrics and Intelligent Laboratory Systems, vol. 91, no. 2, pp. 194–199, 2008.

[21] B.-C. Deng, Y.-H. Yun, P. Ma, C.-C. Lin, D.-B. Ren, and Y.-Z. Liang, "A new method for wavelength interval selection that intelligently optimizes the locations, widths and combinations of the intervals," The Analyst, vol. 140, no. 6, pp. 1876–1885, 2015.

[22] W. Fan, Y.-Y. Li, Y.-K. Peng et al., "Nondestructive determination of lycopene content based on visible/near infrared transmission spectrum," Chinese Journal of Analytical Chemistry, vol. 46, no. 9, pp. 1424–1431, 2018.

[23] Z. Sun, J. Fan, J. Wang et al., "Assessment of the human albumin in acid precipitation process using NIRS and multi-variable selection methods combined with SPA," Journal of Molecular Structure, vol. 1199, p. 126942, 2020.

[24] H.-D. Li, Q.-S. Xu, and Y.-Z. Liang, "libPLS: an integrated library for partial least squares regression and linear discriminant analysis," Chemometrics and Intelligent Laboratory Systems, vol. 176, pp. 34–43, 2018.

[25] R. Zhang, Y.-Y. Chen, Z.-B. Wang, and L. Kewu, "A novel ensemble L1 regularization based variable selection framework with an application in near infrared spectroscopy," Chemometrics and Intelligent Laboratory Systems, vol. 163, pp. 7–15, 2017.

[26] W. Cai, Y. Li, and X. Shao, "A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra," Chemometrics and Intelligent Laboratory Systems, vol. 90, no. 2, pp. 188–194, 2008.

[27] L. M. Bregman, "Finding the common point of convex sets by the method of successive projection," Proceedings of the USSR Academy of Sciences, vol. 162, no. 3, pp. 487–490, 1965.

[28] X. Peng, T. Shi, A. Song, Y. Chen, and W. Gao, "Estimating soil organic carbon using VIS/NIR spectroscopy with SVMR and SPA methods," Remote Sensing, vol. 6, no. 4, pp. 2699–2717, 2014.

[29] Y.-H. Liu, Q.-Q. Wang, X.-W. Gao, and A.-G. Xie, "Total phenolic content prediction in Flos Lonicerae using hyperspectral imaging combined with wavelengths selection methods," Journal of Food Process Engineering, vol. 42, no. 6, Article ID e13224, 2019.

[30] J. Li, H. Zhang, B. Zhan, Y. Zhang, R. Li, and J. Li, "Nondestructive firmness measurement of the multiple cultivars of pears by Vis-NIR spectroscopy coupled with multivariate calibration analysis and MC-UVE-SPA method," Infrared Physics & Technology, vol. 104, Article ID 103154, 2020.

[31] P. Hourant, V. Baeten, M. T. Morales, M. Meurens, and R. Aparicio, "Oil and fat classification by selected bands of near-infrared spectroscopy," Applied Spectroscopy, vol. 54, no. 8, pp. 1168–1174, 2000.

Page 6: Moving-Window-ImprovedMonteCarloUninformativeVariable ... · 2020. 8. 3. · to the NIR spectroscopic analysis of nicotine in tobacco lamina and active pharmaceutical ingredients

spectrum it extends outward from the best wavelength pointselected by UVE-SPA In the original spectrum the optimalwavelength point selected by the MC-UVE-SPA is used asthe starting point of a moving window of width w ewavelengths in the moving window are used as the selectedwavelengths and the number of wavelengths finally selectedvaries with the window width Set the window width w 2(Left) 2 (Right) or 3 Here 2 (Left) means to extend theselected wavelength point to the left to 2 wavelength points2 (Right) means to extend the selected wavelength point tothe right to 2 wavelength points and 3 means to extend theselected wavelength point to 3 wavelength points using theselected wavelength point as the center point e optimalwindow width is determined by the minimum RMSECV ofthe MLR model e processing flow of the proposed MC-UVE-SPA-MW algorithm and extending the wavelengthpoint outward are illustrated in Figure 5

3 Results and Discussion

31 Corn Spectral Data Experiments e wavelength vari-able stability distribution map of the PLS regression modelreflecting the oil concentration in corn constructed forcalibration set using the MC-UVE algorithm is presented inFigure 6 Here all wavelengths greater than the thresholdvalue shown by the horizontal red line in the figure areselected for use in the model is threshold was selected toprovide the number of wavelength variables correspondingto theminimumRMSECV of the constructed PLS regressionmodel is is illustrated in Figure 7 where the RMSECV ofthe constructed PLS regression model is plotted with respectto the number of selected wavelength variables It can beseen from Figure 7 that the RMSECV is relatively large whenthe number of wavelength variables is small and theRMSECV drops sharply as the number of selected variablesincreases is is because an overly small number ofwavelength variables exclude useful information and theprediction accuracy of the model is therefore improved as anincreasing amount of useful information is incorporatedinto the model A minimum value of RMSECV 00289 isobtained when the number of selected wavelength variablesis 106 and the RMSECV increases again when the number ofvariables exceeds 106 is increase results from the impactof selecting an increasing number of uninformative variableson the prediction accuracy of the model We also note thatthe RMSECV changes very little when the number ofwavelength variables exceeds 300 us the MC-UVE al-gorithm eliminates a large number of wavelengths that arenot related to the oil concentration of corn where the finalnumber of selected wavelength variables is just 151 of thefull-spectrum value of 700

e optimal number of 106 wavelength variables selectedby MC-UVE is then used as the inputs of the SPA whichiteratively generates wavelength variable combinations usingeach wavelength as a starting point and applies them forconstructing an MLR model e wavelength combinationcorresponding to the minimumRMSECV of theMLRmodelis then taken as the optimal wavelength combination erelationship between the number of selected wavelength

variables and the RMSECV of the MLR model constructedfrom variables selected by the MC-UVE-SPA is shown inFigure 8 where we note that the minimum RMSECV isobtained when the number of selected variables is 37 usthe SPA further reduces the number of informative wave-lengths mainly by eliminating collinear variables in the MLRmodel where the final number of selected wavelengthvariables is reduced to just 53 of the full-spectrum value of700

In the original spectrum the optimal wavelength pointselected by the MC-UVE-SPA is used as the starting point orcenter of a moving window of width w 2 (Left) 2 (Right)or 3 e results of the PLS or MLR model constructed usingthe wavelength variables selected by different algorithms areshown in Figure 9 and the details are listed in Table 3 alongwith the results obtained for different models In Table 3 theoptimal number of PLS principal components was 10 Asshown in Table 3 there were 37 characteristic wavelengthsselected by the MC-UVE-SPA accounting for only 53 ofthe total number of wavelengths and the accuracy of thealgorithm is better than that of MC-UVE algorithm which isdue to the elimination of wavelength collinearity ewavelengths selected by the MC-UVE-SPA-MW are ex-tended by the algorithm proposed in this paper When thewindow width w 2 (Left) and w 3 the model accuracy ofthe MC-UVE-SPA-MW algorithm is higher than that of theMC-UVE-SPA When w 2 (Left) the MC-UVE-SPA-MWexpands 37 wavelength variables selected by the MC-UVE-SPA to 64 At this point RMSEP is 00381 r value is 09713RPD value is 163666 and the model is optimal Althoughthe MC-UVE-SPA-MW provides improved continuity byincreasing the number of wavelength variables from thoseobtained by the MC-UVE-SPA the final number is still just91 of the full-spectrum value of 700

e wavelength variables selected from the NIR spectralabsorbance data of a single sample by the MC-UVE MC-UVE-SPA and proposed MC-UVE-SPA-MW algorithmsare compared in Figure 10 e results in Figure 10 arederived from the fact that oil is a complex organic moleculewith infrared and NIR spectral absorption that occupies awide wavenumber band ranging 3900sim12000 cmminus1

(833sim2564 nm) is is mainly caused by the frequencydoubling and frequency combinations of the stretching andvibrational energy level transitions of hydrogen-containinggroups From the results of Figure 10 we note that thewavelength variables selected by the MC-UVE MC-UVE-SPA and proposed MC-UVE-SPA-MW algorithms aremainly distributed between 1662sim1790 2222sim22682288sim2316 2390sim2428 and 2476sim2498 nm which is exactlythe range of the spectral absorption peaks generated by thefirst and second frequency doubling of the -C-H stretchingvibrations of the -CH2 -CH3 and -CH-CH- functionalgroups of oil [31]

We note from Figure 10 that the moving windowemployed by theMC-UVE-SPA-MW algorithm expands thewavelength variables selected by the MC-UVE-SPAresulting in a greater number of wavelength variables thanthat obtained by the MC-UVE-SPA and the improvedcontinuity of the wavelength variables selected by the MC-

6 Journal of Spectroscopy

UVE-SPA-MW algorithm is very apparent in Figure 10compared with the wavelength variables selected by theMC-UVE-SPA We can also note from Table 3 that the full-spectrum model was relatively complicated and its pre-diction accuracy was the worst of all models considered dueto the impact of the large number of uninformative wave-length variables included within the model In comparisonthe models established with spectral data selected by theMC-UVEMC-UVE-SPA andMC-UVE-SPA-MW (w 2L2R 3) algorithms are all greatly simplified and better modelprediction accuracies are uniformly obtained We also notefrom the table that of the five wavelength selection

algorithms the MC-UVE-SPA selected the least number ofwavelengths and the MC-UVE-SPA-MW (w 2L) algo-rithm provided a model with the greatest predictionaccuracy

32 Diesel Spectral Data Experiments e number ofwavelength variables selected from the NIR spectral data ofdiesel fuel reflecting the boiling point by the MC-UVE MC-UVE-SPA and MC-UVE-SPA-MW (w 3 2L 2R) algo-rithms were respectively 262 30 83 58 and 59 as shown inTable 4 ese respectively represent 653 75 207

Begin

Preprocessing

MCUVE

Moving window improved

End

SPA

w = 3

w = 2 L

w = 2 R

Extending the wavelength point outwardwith the window width w

0 0 0 1 0 0 0 0 0 1 1 1 0 0

0 0 0 1 0 0 0 0 0 1 1 0 0 0

0 0 0 1 0 0 0 0 0 0 1 1 0 0

Figure 5 Processing flow of the proposed moving-window-improved MC-UVE-SPA (MC-UVE-SPA-MW) wavelength selection algo-rithm based on the Monte Carlo uninformative variable elimination (MC-UVE) algorithm cascaded with the successive projectionsalgorithm (SPA)

0 100 200 300 400 500 600 700Wavelength variable

0

1

2

3

4

5

6

7

Stab

ility

inde

x

Figure 6 Wavelength variable stability distribution map of the partial least squares (PLS) regression model reflecting the oil concentrationusing the MC-UVE algorithm

Journal of Spectroscopy 7

145 and 147 of the 401 wavelength variables includedin the full spectrum e prediction results of the PLS orMLR models constructed from the selected wavelengthvariables are shown in Figure 11 and the details are listedin Table 4 along with the results obtained for a full-spectrum PLS model We note from Table 4 and Figure 11that the models established with spectral data selected byMC-UVE MC-UVE-SPA and MC-UVE-SPA-MW(w 2L 2R 3) algorithms are greatly simplified comparedwith the full-spectrum model MC-UVE retains 262wavelength points and the prediction accuracy is theworst of all the models considered which may be due tothe existence of wavelength collinearity When SPA al-gorithm is used to further screen the wavelength pointsselected by MC-UVE only 30 wavelength points areretained while the prediction accuracy of the model is

greatly improved RMSEP is reduced to 88676 r value isincreased to 09341 and RPD value is increased to 24650We note from Figure 12 that the moving windowemployed by the MC-UVE-SPA-MW expands the wave-length variables selected by the MC-UVE-SPA and im-proves the continuity of the wavelength variables selectedby the MC-UVE-SPA-MWWhen the window width w 2(Left) 2 (Right) and 3 the accuracy of the three modelsobtained by the MC-UVE-SPA-MW are all improvedWhen w 3 the MC-UVE-SPA-MW expands 30 wave-length variables selected by the MC-UVE-SPA to 83 Atthis point RMSEP is reduced to 59694 R value is in-creased to 09752 RPD value is increased to 39994 andthe model is optimal We can also note from Table 4 andFigure 11 that of the five wavelength selection algorithmsthe MC-UVE-SPA selected the least number of wave-lengths and the MC-UVE-SPA-MW (w 3) algorithmprovided a model with the greatest prediction accuracy

3.3. Ethylene Gas Spectral Data Experiments. The numbers of wavelength variables selected from the spectral data reflecting the C₂H₄ concentration by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were, respectively, 214, 17, 48, 34, and 34, as shown in Figure 13. These respectively represent 4.7%, 0.37%, 1.0%, 0.74%, and 0.74% of the 4601 wavelength variables included in the full spectrum. It can be determined from Figure 13 that more than half of the selected wavelength variables fall within the strong absorption regions in the wavenumber ranges 794∼1105 cm⁻¹ and 2917∼3242 cm⁻¹. These results can be explained by the description given on the HITRAN web page, which states that the absorption spectral band of C₂H₄ gas lies in the range 614∼3242 cm⁻¹ and that the two isotopes H₂¹²C¹²CH₂ and H₂¹²C¹³CH₂ of C₂H₄ present strong absorption bands in the wavenumber ranges 794∼1105 cm⁻¹ and 2917∼3242 cm⁻¹, respectively. From Figure 4, it can be seen that, in some regions outside the C₂H₄ absorption bands, the spectral intensity has a significant linear relationship with the C₂H₄ content, which may be due to interference from the background spectrum as the C₂H₄ concentration changes. Wavelength points are therefore also selected in some regions that are not C₂H₄ absorption bands.
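The share of the full spectrum retained by each algorithm follows directly from the counts quoted above. The short check below reproduces that arithmetic; the counts and the total of 4601 variables come from the text, while the dictionary layout and rounding are only illustrative choices.

```python
TOTAL_VARIABLES = 4601  # full-spectrum size stated in the text

# Number of wavelength variables selected by each algorithm (from the text).
selected_counts = {
    "MC-UVE": 214,
    "MC-UVE-SPA": 17,
    "MC-UVE-SPA-MW (w = 3)": 48,
    "MC-UVE-SPA-MW (w = 2L)": 34,
    "MC-UVE-SPA-MW (w = 2R)": 34,
}

# Percentage of the full spectrum retained by each algorithm.
percentages = {
    name: 100.0 * count / TOTAL_VARIABLES
    for name, count in selected_counts.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}% of the full spectrum")
```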

The details regarding the prediction results of the PLS or MLR models constructed from the selected wavelength variables are listed in Table 5, along with the results obtained for a full-spectrum PLS model. We again note from the table that the full-spectrum model is more complicated and that its prediction accuracy was the worst of all models considered. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. Of the five wavelength selection algorithms, we again note that the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.

Figure 7: Relationship between the root mean square error (RMSE) of the corn oil concentration predicted by the PLS regression model and the number of selected wavelength variables. (Plot omitted; x-axis: number of wavelength variables, 0 to 700; y-axis: RMSEC, 0.02 to 0.18; annotated value: 0.028931.)

Figure 8: Relationship between the RMSE of the corn oil concentration predicted by the multiple linear regression (MLR) model and the number of selected wavelength variables using the MC-UVE-SPA. (Plot omitted; x-axis: number of variables included in the model, 0 to 50; y-axis: RMSE, 0 to 0.18; annotation: final number of selected variables 37, RMSE = 0.011356.)


Figure 9: Comparison of corn oil concentration predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3). (Plots omitted; each panel shows predicted concentration against measured concentration, both from 3 to 4.)

Table 3: Number of selected wavelengths and corn oil concentration prediction results.

Wavelength selection algorithm   Number of wavelengths   Model type   RMSECV   RMSEP    r        RPD
Full spectrum                    700                     PLS          0.1431   0.1345   0.9257   1.0550
MC-UVE                           102                     PLS          0.0289   0.1447   0.9024   0.7948
MC-UVE-SPA                       37                      MLR          0.0114   0.1225   0.9267   2.0225
MC-UVE-SPA-MW (w = 2L)           64                      MLR          0.0035   0.0381   0.9713   16.3666
MC-UVE-SPA-MW (w = 2R)           64                      MLR          0.0101   0.1009   0.8062   1.3703
MC-UVE-SPA-MW (w = 3)            82                      MLR          0.0089   0.0532   0.9335   6.1627

Figure 10: Wavelengths selected from the NIR spectral data reflecting corn oil concentration by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms. (Plots omitted; each panel shows absorbance against wavelength (nm), 1200 to 2400.)

3.4. Summary of the Experimental Results. It can be noted from the above experimental results that the prediction accuracy of the models established by the wavelength selection algorithms is higher than that of the full-spectrum model. The MC-UVE-SPA selects the fewest characteristic wavelengths and eliminates the collinearity between variables. The prediction accuracy of the model established by the MC-UVE-SPA is higher than that established by MC-UVE. The number of characteristic wavelengths finally selected by the MC-UVE-SPA-MW is greater than that of the MC-UVE-SPA, but with better prediction accuracy.
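To make the collinearity argument concrete, the following is a minimal sketch of the projection step at the core of a successive projections algorithm, written in plain Python. It is not the authors' implementation: the function name `spa_select`, the fixed starting variable, and the toy data are assumptions, and the usual outer loop that chooses the starting variable and the number of variables by validation RMSE (as in Figure 8) is omitted.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def spa_select(columns, start, n_select):
    """Pick n_select variables with minimal collinearity.

    columns: one list of sample values per variable.
    start: index of the first variable in the chain (assumed given).
    """
    work = [list(col) for col in columns]  # projected working copies
    selected = [start]
    while len(selected) < n_select:
        ref = work[selected[-1]]
        ref_norm_sq = dot(ref, ref)
        if ref_norm_sq == 0:
            break  # nothing left that is independent of the selection
        best, best_norm_sq = None, -1.0
        for j, col in enumerate(work):
            if j in selected:
                continue
            # Remove the component of variable j along the last selected one.
            scale = dot(col, ref) / ref_norm_sq
            work[j] = [c - scale * r for c, r in zip(col, ref)]
            norm_sq = dot(work[j], work[j])
            if norm_sq > best_norm_sq:
                best, best_norm_sq = j, norm_sq
        selected.append(best)
    return selected


# Toy data: variables 1 and 3 are nearly collinear with variable 0.
columns = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [3.0, -1.0, 0.5],
    [1.1, 2.1, 3.0],
]
chosen = spa_select(columns, start=0, n_select=2)
```

Starting from variable 0, the near-copies of it lose almost all of their norm after projection, so the step picks variable 2, which carries the most new information. This also shows why SPA selections tend to be sparse: the direct neighbours of a chosen wavelength are usually highly correlated with it.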

Table 4: Number of selected wavelengths and boiling point prediction results.

Wavelength selection algorithm   Number of wavelengths   Model type   RMSECV   RMSEP     r        RPD
Full spectrum                    401                     PLS          4.6350   8.3841    0.9103   2.3704
MC-UVE                           262                     PLS          4.3184   10.0221   0.8311   1.7710
MC-UVE-SPA                       30                      MLR          3.5503   8.8676    0.9341   2.4650
MC-UVE-SPA-MW (w = 2L)           58                      MLR          3.9921   7.0442    0.9534   2.8212
MC-UVE-SPA-MW (w = 2R)           59                      MLR          4.0024   6.8573    0.9556   2.8982
MC-UVE-SPA-MW (w = 3)            83                      MLR          2.6478   5.9694    0.9752   3.9994

The maximum number of PCs was again set to 10 for both PLS models.

Figure 11: Comparison of diesel fuel boiling point predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3). (Plots omitted; each panel shows predicted boiling point (°C) against measured boiling point (°C), both from 150 to 300.)

Figure 12: Wavelengths selected from the NIR spectral data reflecting the boiling point of diesel fuel by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms. (Plots omitted; each panel shows absorbance, −1 to 2, against wavelength (nm), 800 to 1500.)


4. Conclusions

The present study addressed the sparsity of the wavelength variables selected by the cascaded MC-UVE-SPA through the application of a moving window, which improved the continuity of the selected wavelength variables and thereby better exploited the advantages of the MC-UVE algorithm and the SPA to obtain regression models with high prediction accuracy. The advantages of the proposed MC-UVE-SPA-MW were demonstrated by applying the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms to the selection of wavelength variables from the NIR spectral absorbance data of corn, diesel fuel, and ethylene. PLS and MLR models reflecting the oil content of corn, the boiling point of diesel fuel, and the ethylene concentration were thereby established and tested. The experimental results demonstrated that the progressive elimination of uncorrelated and collinear variables generated increasingly simplified partial-spectrum models with greater prediction accuracy than the full-spectrum model. Among the three wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelength variables, and the proposed MC-UVE-SPA-MW algorithm provided the models with the greatest prediction accuracy.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by grants from the Major National Science and Technology Special Project of China (JZ2015KJZZ0254) and the Key Projects of Natural Science Research in Universities in Anhui, China (KJ2018A0544).


Figure 13: Measured C₂H₄ gas absorbance spectra and wavelength variables selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms. (Plot omitted; x-axis: wavenumber (cm⁻¹), 500 to 4500; y-axis: absorbance; a ×10⁴ scale factor appears on the plot; legend: original, MC-UVE, MC-UVE-SPA, MC-UVE-SPA-MW.)

Table 5: Number of selected wavelengths and C₂H₄ gas concentration prediction results.

Wavelength selection algorithm   Number of wavelengths   Model type   RMSECV   RMSEP    r
Full spectrum                    4601                    PLS          0.1431   0.1342   0.9801
MC-UVE                           214                     PLS          0.1009   0.1212   0.9887
MC-UVE-SPA                       17                      MLR          0.1375   0.0537   0.9913
MC-UVE-SPA-MW (w = 2L)           34                      MLR          0.0989   0.0495   0.9988
MC-UVE-SPA-MW (w = 2R)           34                      MLR          0.0988   0.0486   0.9990
MC-UVE-SPA-MW (w = 3)            48                      MLR          0.1005   0.0434   0.9999

The maximum number of PCs was again set to 10 for both PLS models.



UVE-SPA-MW algorithm is very apparent in Figure 10 compared with the wavelength variables selected by the MC-UVE-SPA. We can also note from Table 3 that the full-spectrum model was relatively complicated, and its prediction accuracy was the worst of all models considered, due to the impact of the large number of uninformative wavelength variables included within the model. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. We also note from the table that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 2L) algorithm provided the model with the greatest prediction accuracy.

3.2. Diesel Spectral Data Experiments. The numbers of wavelength variables selected from the NIR spectral data of diesel fuel reflecting the boiling point by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were, respectively, 262, 30, 83, 58, and 59, as shown in Table 4. These respectively represent 65.3%, 7.5%, 20.7%, 14.5%, and 14.7% of the 401 wavelength variables included in the full spectrum.

Figure 5: Processing flow of the proposed moving-window-improved MC-UVE-SPA (MC-UVE-SPA-MW) wavelength selection algorithm, based on the Monte Carlo uninformative variable elimination (MC-UVE) algorithm cascaded with the successive projections algorithm (SPA). (Flowchart omitted; steps: Begin, Preprocessing, MC-UVE, SPA, Moving window improved, End. The moving-window step extends each wavelength point outward with the window width w, illustrated by three selection masks. For w = 3: 0 0 0 1 0 0 0 0 0 1 1 1 0 0. For w = 2L: 0 0 0 1 0 0 0 0 0 1 1 0 0 0. For w = 2R: 0 0 0 1 0 0 0 0 0 0 1 1 0 0.)
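The moving-window step in Figure 5 can be sketched as a widening of each selected index to its immediate neighbours. The code below is one possible reading rather than the authors' implementation: the function names `expand_selection` and `to_mask`, the merging of overlapping windows, and the dropping of out-of-range indices are all assumptions, since the source does not spell out how edges and overlaps are handled.

```python
def expand_selection(selected, n_variables, window="3"):
    """Widen each selected index to its neighbours.

    window "3"  -> keep index - 1, index, and index + 1
    window "2L" -> keep index - 1 and index
    window "2R" -> keep index and index + 1
    """
    offsets = {"3": (-1, 0, 1), "2L": (-1, 0), "2R": (0, 1)}[window]
    expanded = set()
    for idx in selected:
        for off in offsets:
            neighbour = idx + off
            if 0 <= neighbour < n_variables:  # drop indices outside the spectrum
                expanded.add(neighbour)
    return sorted(expanded)  # overlapping windows are merged by the set


def to_mask(indices, n_variables):
    """Turn a list of selected indices into a 0/1 selection mask."""
    chosen = set(indices)
    return [1 if i in chosen else 0 for i in range(n_variables)]


# One SPA-selected point at index 10 of a 14-variable toy spectrum.
spa_points = [10]
for window in ("3", "2L", "2R"):
    mask = to_mask(expand_selection(spa_points, 14, window), 14)
    print(f"w = {window}: {' '.join(str(bit) for bit in mask)}")
```

Under this reading, merged overlaps and clipped edges would explain why the expanded counts reported in the tables are slightly below two or three times the number of SPA-selected points.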


[28] X Peng T Shi A Song Y Chen and W Gao ldquoEstimatingsoil organic carbon using VISNIR spectroscopy with SVMRand SPA methodsrdquo Remote Sensing vol 6 no 4 pp 2699ndash2717 2014

[29] Y-H Liu Q-Q Wang X-W Gao and A-G Xie ldquoTotalphenolic content prediction in Flos Lonicerae using hyper-spectral imaging combined with wavelengths selectionmethodsrdquo Journal of Food Process Engineering vol 42 no 6Article ID e13224 2019

[30] J Li H Zhang B Zhan Y Zhang R Li and J Li ldquoNon-destructive firmness measurement of the multiple cultivars ofpears by Vis-NIR spectroscopy coupled with multivariatecalibration analysis and MC-UVE-SPA methodrdquo InfraredPhysics amp Technology vol 104 Article ID 103154 2020

[31] P Hourant V Baeten M T Morales M Meurens andR Aparicio ldquoOil and fat classification by selected bands ofnear-infrared spectroscopyrdquo Applied Spectroscopy vol 54no 8 pp 1168ndash1174 2000

12 Journal of Spectroscopy


14.5% and 14.7% of the 401 wavelength variables included in the full spectrum. The prediction results of the PLS or MLR models constructed from the selected wavelength variables are shown in Figure 11, and the details are listed in Table 4 along with the results obtained for a full-spectrum PLS model. We note from Table 4 and Figure 11 that the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are greatly simplified compared with the full-spectrum model. MC-UVE retains 262 wavelength points, and its prediction accuracy is the worst of all the models considered, which may be due to collinearity among the retained wavelengths. When the SPA is used to further screen the wavelength points selected by MC-UVE, only 30 wavelength points are retained, while the prediction accuracy of the model is greatly improved: RMSEP is reduced to 8.8676, the r value is increased to 0.9341, and the RPD value is increased to 2.4650. We note from Figure 12 that the moving window employed by the MC-UVE-SPA-MW expands the wavelength variables selected by the MC-UVE-SPA and improves their continuity. For window widths w = 2 (left), 2 (right), and 3, the accuracy of all three MC-UVE-SPA-MW models is improved. When w = 3, the MC-UVE-SPA-MW expands the 30 wavelength variables selected by the MC-UVE-SPA to 83. At this point, RMSEP is reduced to 5.9694, the r value is increased to 0.9752, the RPD value is increased to 3.9994, and the model is optimal. We can also note from Table 4 and Figure 11 that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.
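The window expansion described here can be illustrated with a minimal Python sketch. It rests on an assumption that this section does not spell out: that w = 2L keeps each SPA-selected variable together with its left neighbour, w = 2R keeps it with its right neighbour, and w = 3 keeps both neighbours. The function name `expand_with_window` and the example indices are hypothetical and are not taken from the paper.

```python
def expand_with_window(selected, n_vars, mode):
    """Expand each selected variable index into a small window of neighbours.

    Assumed window semantics (an illustration, not the paper's definition):
      "2L": index i and its left neighbour i - 1
      "2R": index i and its right neighbour i + 1
      "3" : indices i - 1, i and i + 1
    Indices outside 0..n_vars-1 are dropped and overlaps are merged.
    """
    offsets = {"2L": (-1, 0), "2R": (0, 1), "3": (-1, 0, 1)}[mode]
    expanded = set()
    for i in selected:
        for d in offsets:
            j = i + d
            if 0 <= j < n_vars:
                expanded.add(j)
    return sorted(expanded)


# Hypothetical SPA-selected indices on a 401-variable spectrum.
spa_selected = [5, 6, 20, 22, 400]
for mode in ("2L", "2R", "3"):
    window = expand_with_window(spa_selected, 401, mode)
    print(mode, len(window), window)
```

Because neighbouring windows overlap and windows are clipped at the spectrum edges, the expanded set is usually smaller than the window width times the number of selected variables. Under this assumption, that would be consistent with 30 selected variables growing to 83 rather than 90 for w = 3.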

3.3. Ethylene Gas Spectral Data Experiments. The numbers of wavelength variables selected from the spectral data reflecting the C₂H₄ concentration by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were 214, 17, 48, 34, and 34, respectively, as shown in Figure 13. These respectively represent 4.7%, 0.37%, 1.0%, 0.74%, and 0.74% of the 4601 wavelength variables included in the full spectrum. It can be determined from Figure 13 that more than half of the selected wavelength variables fall within the strong absorption regions in the wavenumber ranges 794 to 1105 cm⁻¹ and 2917 to 3242 cm⁻¹. These results can be explained according to the description given on the HITRAN web page, which states that the absorption spectral band of C₂H₄ gas is in the range of 614 to 3242 cm⁻¹ and that the two isotopologues H₂¹²C¹²CH₂ and H₂¹²C¹³CH₂ of C₂H₄ present strong absorption bands in the wavenumber ranges of 794 to 1105 cm⁻¹ and 2917 to 3242 cm⁻¹, respectively. From Figure 4, it can be seen that in some regions outside the C₂H₄ absorption bands, the spectral intensity has a significant linear relationship with the C₂H₄ content, which may be due to interference from the background spectrum as the C₂H₄ concentration changes; wavelength points are therefore also selected in some of these regions.
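The proportions reported for the ethylene data follow directly from the selected-variable counts and the 4601 variables of the full spectrum, as the short Python sketch below reproduces. The helper `fraction_in_bands` then shows how one might check what share of a selection falls inside the two strong absorption regions. The wavenumbers passed to it are hypothetical placeholders, since the selected positions themselves are not listed in this section.

```python
N_FULL = 4601  # wavelength variables in the full ethylene spectrum

# Selected-variable counts reported in the text for each algorithm.
counts = {
    "MC-UVE": 214,
    "MC-UVE-SPA": 17,
    "MC-UVE-SPA-MW (w = 3)": 48,
    "MC-UVE-SPA-MW (w = 2L)": 34,
    "MC-UVE-SPA-MW (w = 2R)": 34,
}

# Strong C2H4 absorption regions quoted in the text, in cm^-1.
BANDS = [(794.0, 1105.0), (2917.0, 3242.0)]


def percent_of_full(count, n_full=N_FULL):
    """Share of the full spectrum retained, in percent."""
    return 100.0 * count / n_full


def fraction_in_bands(wavenumbers, bands=BANDS):
    """Fraction of the given wavenumbers lying inside any of the bands."""
    if not wavenumbers:
        return 0.0
    hits = sum(any(lo <= w <= hi for lo, hi in bands) for w in wavenumbers)
    return hits / len(wavenumbers)


for name, count in counts.items():
    print(f"{name}: {percent_of_full(count):.2f}% of {N_FULL}")

# Hypothetical selected wavenumbers, for illustration only.
print(fraction_in_bands([800.0, 950.0, 3000.0, 2000.0]))
```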

The details regarding the prediction results of the PLS or MLR models constructed from the selected wavelength variables are listed in Table 5 along with the results obtained for a full-spectrum PLS model. We again note from the table that the full-spectrum model is the most complicated and that its prediction accuracy is the worst of all models considered. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms are all greatly simplified, and better prediction accuracies are uniformly obtained. Of the five wavelength selection algorithms, we again note that the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 3) algorithm provided the model with the greatest prediction accuracy.
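For readers who wish to recompute the figures of merit used in these comparisons, the following Python sketch gives the usual definitions of RMSEP, the correlation coefficient r, and RPD. It assumes that RPD is the sample standard deviation of the reference values divided by RMSEP, which is the common chemometrics convention and may differ in detail from the definition used by the authors. The function names and the numbers in the example are hypothetical.

```python
import math
from statistics import mean, stdev


def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def pearson_r(y_true, y_pred):
    """Pearson correlation between reference and predicted values."""
    mt, mp = mean(y_true), mean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def rpd(y_true, y_pred):
    """Assumed definition: SD of the reference values divided by RMSEP."""
    return stdev(y_true) / rmsep(y_true, y_pred)


# Hypothetical reference and predicted values, for illustration only.
y_ref = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(rmsep(y_ref, y_hat), pearson_r(y_ref, y_hat), rpd(y_ref, y_hat))
```

A lower RMSEP with a higher r and RPD indicates a better model, which is the sense in which the models are ranked here.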

Figure 7: Relationship between the root mean square error (RMSE) of the corn oil concentration predicted by the PLS regression model and the number of selected wavelength variables. The plot shows RMSE against the number of wavelength variables (0 to 700), with an annotated value of 0.028931.

Figure 8: Relationship between the RMSE of the corn oil concentration predicted by the multiple linear regression (MLR) model and the number of selected wavelength variables using the MC-UVE-SPA. The plot shows RMSE against the number of variables included in the model (0 to 50); the final number of selected variables is 37 (RMSE = 0.011356).


3.4. Summary of the Experimental Results. It can be noted from the above experimental results that the prediction accuracy of the models established with the wavelength selection algorithms is generally higher than that of the full-spectrum model. The MC-UVE-SPA selects the fewest characteristic wavelengths and eliminates the collinearity

Figure 9: Comparison of corn oil concentration predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3) models. Each panel plots the predicted concentration against the measured concentration.

Table 3: Number of selected wavelengths and corn oil concentration prediction results.

Wavelength selection algorithm   Number of wavelengths   Model type   RMSECV   RMSEP    r        RPD
Full spectrum                    700                     PLS          0.1431   0.1345   0.9257   1.0550
MC-UVE                           102                     PLS          0.0289   0.1447   0.9024   0.7948
MC-UVE-SPA                       37                      MLR          0.0114   0.1225   0.9267   2.0225
MC-UVE-SPA-MW (w = 2L)           64                      MLR          0.0035   0.0381   0.9713   16.3666
MC-UVE-SPA-MW (w = 2R)           64                      MLR          0.0101   0.1009   0.8062   1.3703
MC-UVE-SPA-MW (w = 3)            82                      MLR          0.0089   0.0532   0.9335   6.1627

Figure 10: Wavelengths selected from the NIR spectral data reflecting corn oil concentration by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms. Each panel plots absorbance against wavelength (nm).


between variables. The prediction accuracy of the model established by the MC-UVE-SPA is higher than that of the model established by the MC-UVE. The number of characteristic wavelengths finally selected by the MC-UVE-SPA-MW is larger than that selected by the MC-UVE-SPA, but the prediction accuracy is better.
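The way in which a successive-projections step suppresses collinear variables can be sketched as follows. This is a simplified, hedged illustration of the projection idea commonly associated with the SPA, not the authors' implementation: starting from one column, every remaining column is projected onto the subspace orthogonal to the most recently selected column, and the column with the largest remaining norm is selected next. The function name `spa_select`, the starting index, and the toy matrix are hypothetical.

```python
def spa_select(X, start, n_select):
    """Select up to n_select column indices of X by successive projections.

    X is a list of rows (samples x variables). Columns that are collinear
    with an already selected column shrink to zero norm after projection,
    so they are not chosen while informative columns remain.
    """
    n_vars = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_vars)]  # column vectors
    selected = [start]
    while len(selected) < n_select:
        ref = cols[selected[-1]]
        ref_norm2 = sum(v * v for v in ref)
        if ref_norm2 == 0.0:
            break  # nothing left to project onto
        best, best_norm2 = None, -1.0
        for j in range(n_vars):
            if j in selected:
                continue
            # Remove the component of column j along the last selected column.
            coef = sum(a * b for a, b in zip(cols[j], ref)) / ref_norm2
            cols[j] = [a - coef * b for a, b in zip(cols[j], ref)]
            norm2 = sum(v * v for v in cols[j])
            if norm2 > best_norm2:
                best, best_norm2 = j, norm2
        if best is None:
            break  # every column has already been selected
        selected.append(best)
    return selected


# Toy data: column 1 is an exact multiple of column 0 (fully collinear).
X = [
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
]
print(spa_select(X, start=0, n_select=3))
```

In the toy matrix, the collinear column is never selected, which mirrors the role attributed to the SPA in this summary.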

Table 4: Number of selected wavelengths and boiling point prediction results.

Wavelength selection algorithm   Number of wavelengths   Model type   RMSECV   RMSEP     r        RPD
Full spectrum                    401                     PLS          4.6350   8.3841    0.9103   2.3704
MC-UVE                           262                     PLS          4.3184   10.0221   0.8311   1.7710
MC-UVE-SPA                       30                      MLR          3.5503   8.8676    0.9341   2.4650
MC-UVE-SPA-MW (w = 2L)           58                      MLR          3.9921   7.0442    0.9534   2.8212
MC-UVE-SPA-MW (w = 2R)           59                      MLR          4.0024   6.8573    0.9556   2.8982
MC-UVE-SPA-MW (w = 3)            83                      MLR          2.6478   5.9694    0.9752   3.9994

The maximum number of PCs was again set to 10 for both PLS models.

Figure 11: Comparison of diesel fuel boiling point predictions from the prediction data set using the (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3) models. Each panel plots the predicted boiling point (°C) against the measured boiling point (°C).

Figure 12: Wavelengths selected from the NIR spectral data reflecting the boiling point of diesel fuel by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms. Each panel plots absorbance against wavelength (nm).


4. Conclusions

The present study addressed the sparsity of the wavelength variables selected by the cascaded MC-UVE-SPA through the application of a moving window, which improved the continuity of the selected wavelength variables and thereby better exploited the advantages of the MC-UVE algorithm and the SPA to obtain regression models with high prediction accuracy. The advantages of the proposed MC-UVE-SPA-MW were demonstrated by applying the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms to the selection of wavelength variables from the NIR spectral absorbance data of corn, diesel fuel, and ethylene; PLS and MLR models reflecting the oil content of corn, the boiling point of diesel fuel, and the ethylene concentration were thereby established and tested. The experimental results demonstrated that the progressive elimination of uncorrelated and collinear variables generated increasingly simplified partial-spectrum models with greater prediction accuracy than the full-spectrum model. Among the three wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelength variables, and the proposed MC-UVE-SPA-MW algorithm provided models with the greatest prediction accuracy.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by grants from the Major National Science and Technology Special Project of China (JZ2015KJZZ0254) and the Key Projects of Natural Science Research in Universities in Anhui, China (KJ2018A0544).

Figure 13: Measured C₂H₄ gas absorbance spectra and wavelength variables selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms. The plot shows absorbance against wavenumber (cm⁻¹), with legend entries Original, MCUVE, MCUVE-SPA, and MCUVE-SPA-MW.

Table 5: Number of selected wavelengths and C₂H₄ gas concentration prediction results.

Wavelength selection algorithm   Number of wavelengths   Model type   RMSECV   RMSEP    r
Full spectrum                    4601                    PLS          0.1431   0.1342   0.9801
MC-UVE                           214                     PLS          0.1009   0.1212   0.9887
MC-UVE-SPA                       17                      MLR          0.1375   0.0537   0.9913
MC-UVE-SPA-MW (w = 2L)           34                      MLR          0.0989   0.0495   0.9988
MC-UVE-SPA-MW (w = 2R)           34                      MLR          0.0988   0.0486   0.9990
MC-UVE-SPA-MW (w = 3)            48                      MLR          0.1005   0.0434   0.9999

The maximum number of PCs was again set to 10 for both PLS models.

Page 9: Moving-Window-ImprovedMonteCarloUninformativeVariable ... · 2020. 8. 3. · to the NIR spectroscopic analysis of nicotine in tobacco lamina and active pharmaceutical ingredients

34 Summary of the Experimental Results It can be notedfrom the above experimental results that the predictionaccuracy of the models established by the wavelength

selection algorithm are higher than that of the full-spectrum model e MC-UVE-SPA selects the leastcharacteristic wavelengths and eliminates the collinearity

3 35 4Measured concentration ()

3

32

34

36

38

4

Pred

icte

d co

ncen

trat

ion

()

(a)

Pred

icte

d co

ncen

trat

ion

()

3 35 4Measured concentration ()

3

32

34

36

38

4

(b)

Pred

icte

d co

ncen

trat

ion

()

3 35 4Measured concentration ()

3

32

34

36

38

4

(c)

Pred

icte

d co

ncen

trat

ion

()

3 35 4Measured concentration ()

3

32

34

36

38

4

(d)Pr

edic

ted

conc

entr

atio

n (

)

3 35 4Measured concentration ()

3

32

34

36

38

4

(e)

Figure 9 Comparison of corn oil concentration predictions from the prediction data set using the (a) MC-UVE-PLS (b) MC-UVE-SPA-MLR (c) MC-UVE-SPA-MW-MLR (w 2L) (d) MC-UVE-SPA-MW-MLR (w 2R) and (e) MC-UVE-SPA-MW-MLR (w 3)

Table 3 Number of selected wavelengths and corn oil concentration prediction results

Wavelength selection algorithm Number of wavelengths Model type RMSECV RMSEP r RPDFull spectrum 700 PLS 01431 01345 09257 10550MC-UVE 102 PLS 00289 01447 09024 07948MC-UVE-SPA 37 MLR 00114 01225 09267 20225MC-UVE-SPA-MW(w 2L) 64 MLR 00035 00381 09713 163666MC-UVE-SPA-MW(w 2R) 64 MLR 00101 01009 08062 13703MC-UVE-SPA-MW(w 3) 82 MLR 00089 00532 09335 61627

1200 1400 1600 1800 2000 2200 2400Wavelength (nm)

0

05

1

Abs

orba

nce

(a)

1200 1400 1600 1800 2000 2200 2400Wavelength (nm)

0

05

1

Abs

orba

nce

(b)

1200 1400 1600 1800 2000 2200 2400Wavelength (nm)

0

05

Abs

orba

nce

(c)

Figure 10 Wavelengths selected from the NIR spectral data reflecting corn oil concentration by the (a) MC-UVE (b) MC-UVE-SPA and(c) MC-UVE-SPA-MW algorithms

Journal of Spectroscopy 9

between variables e prediction accuracy of the modelestablished by the MC-UVE-SPA is higher than thatestablished by MC-UVE e number of characteristic

wavelengths finally selected by the MC-UVE-SPA-MW ismore than that of MC-UVE-SPA but with better pre-diction accuracy

Table 4 Number of selected wavelengths and boiling point prediction results

Wavelength selection algorithm Number of wavelengths Model type RMSECV RMSEP r RPDFull spectrum 401 PLS 46350 83841 09103 23704MC-UVE 262 PLS 43184 100221 08311 17710MC-UVE-SPA 30 MLR 35503 88676 09341 24650MC-UVE-SPA-MW(w 2L) 58 MLR 39921 70442 09534 28212MC-UVE-SPA-MW(w 2R) 59 MLR 40024 68573 09556 28982MC-UVE-SPA-MW(w 3) 83 MLR 26478 59694 09752 39994e maximum number of PCs was again set to 10 for both PLS models

Pred

icte

d bo

iling

poi

nt (deg

C)

150 200 250 300Measured boiling point (degC)

150

200

250

300

(a)

Pred

icte

d bo

iling

poi

nt (deg

C)

Measured boiling point (degC)150 200 250 300

150

200

250

300

(b)

Pred

icte

d bo

iling

poi

nt (deg

C)

Measured boiling point (degC)150 200 250 300

150

200

250

300

(c)

Pred

icte

d bo

iling

poi

nt (deg

C)

Measured boiling point (degC)150 200 250 300

150

200

250

300

(d)

Pred

icte

d bo

iling

poi

nt (deg

C)

Measured boiling point (degC)150 200 250 300

150

200

250

300

(e)

Figure 11 Comparison of diesel fuel boiling point predictions from the prediction data set using (a) the MC-UVE-PLS (b) MC-UVE-SPA-MLR (c) MC-UVE-SPA-MW-MLR (w 2L) (d) MC-UVE-SPA-MW-MLR (w 2R) and (e) MC-UVE-SPA-MW-MLR (w 3)

800 900 1000 1100 1200 1300 1400 1500Wavelength (nm)

ndash1012

Abs

orba

nce

(a)

800 900 1000 1100 1200 1300 1400 1500Wavelength (nm)

ndash1012

Abs

orba

nce

(b)

800 900 1000 1100 1200 1300 1400 1500Wavelength (nm)

ndash1012

Abs

orba

nce

(c)

Figure 12 Wavelengths selected from the NIR spectral data reflecting the boiling point of diesel fuel by the (a) MC-UVE (b) MC-UVE-SPA and (c) MC-UVE-SPA-MW algorithms

10 Journal of Spectroscopy

4 Conclusions

e present study addressed the sparsity of wavelengthvariables selected by the cascaded MC-UVE-SPA throughthe application of a moving window which improved thecontinuity of the selected wavelength variables andthereby better exploited the advantages of the MC-UVEalgorithm and the SPA to obtain regression models withhigh prediction accuracy e advantages of the proposedMC-UVE-SPA-MW were demonstrated by applying theMC-UVE MC-UVE-SPA and MC-UVE-SPA-MW al-gorithms to the selection of wavelength variables from theNIR spectral absorbance data of corn diesel fuel andethylene and PLS and MLR models reflecting the oilcontent of corn the boiling point of diesel fuel and theethylene concentration were thereby established andtested e experimental results demonstrated that theprogressive elimination of uncorrelated and collinearvariables generated increasingly simplified partial-spec-trum models with greater prediction accuracy than thefull-spectrum model Among the three wavelength se-lection algorithms the MC-UVE-SPA selected the leastnumber of wavelength variables and the proposed MC-UVE-SPA-MW algorithm provided models with thegreatest prediction accuracy

Data Availability

e data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

e authors declare that they have no conflicts of interest

Acknowledgments

is study was supported by grants from the Major NationalScience and Technology Special Project of China(JZ2015KJZZ0254) and the Key Projects of Natural ScienceResearch in Universities in Anhui China (KJ2018A0544)

References

[1] C Pasquini ldquoNear infrared spectroscopy fundamentalspractical aspects and analytical applicationsrdquo Journal of theBrazilian Chemical Society vol 14 no 2 pp 198ndash219 2003

[2] X Sun M Zhou and Y Sun ldquoSpectroscopy quantitativeanalysis cotton content of blend fabricsrdquo InternationalJournal of Clothing Science and Technology vol 28 no 1pp 65ndash76 2016

500 1000 1500 2000 2500 3000 3500 4000 45000

05

1

15

2

25

3

35

4

Abs

orba

nce

OriginalMCUVE

MCUVE-SPAMCUVE-SPA-MW

times104

Wavenumber (cmndash1)

Figure 13 Measured C2H4 gas absorbance spectra and wavelength variables selected by the MC-UVE MC-UVE-SPA and MC-UVE-SPA-MW algorithms

Table 5 Number of selected wavelengths and C2H4 gas concentration prediction results

Wavelength selection algorithm Number of wavelengths Model type RMSECV RMSEP rFull spectrum 4601 PLS 01431 01342 09801MC-UVE 214 PLS 01009 01212 09887MC-UVE-SPA 17 MLR 01375 00537 09913MC-UVE-SPA-MW(w 2L) 34 MLR 00989 00495 09988MC-UVE-SPA-MW(w 2R) 34 MLR 00988 00486 09990MC-UVE-SPA-MW(w 3) 48 MLR 01005 00434 09999e maximum number of PCs was again set to 10 for both PLS models

Journal of Spectroscopy 11

[3] M Schwanninger J C Rodrigues and K Fackler ldquoA reviewof band assignments in near infrared spectra of wood andwood Componentsrdquo Journal of Near Infrared Spectroscopyvol 19 no 5 pp 287ndash308 2011

[4] Y-H Yun Y-C Wei X-B Zhao W-J Wu Y-Z Liang andH-M Lu ldquoA green method for the quantification of poly-saccharides in Dendrobium officinalerdquo RSC Advances vol 5no 127 pp 105057ndash105065 2015

[5] C K Vance D R Tolleson K Kinoshita J Rodriguez andW J Foley ldquoNear infrared spectroscopy in wildlife andbiodiversityrdquo Journal of Near Infrared Spectroscopy vol 24no 1 pp 1ndash25 2016

[6] H Hotelling ldquoAnalysis of a complex of statistical variablesinto principal componentsrdquo Journal of Educational Psychol-ogy vol 24 no 6 pp 417ndash441 1933

[7] P Geladi and B R Kowalski ldquoPartial least-squares regressiona tutorialrdquo Analytica Chimica Acta vol 185 no 1 pp 1ndash171986

[8] A A Kardamakis and N Pasadakis ldquoAutoregressive mod-eling of near-IR spectra and MLR to predict RON values ofgasolinesrdquo Fuel vol 89 no 1 pp 158ndash161 2010

[9] I Guyon and A Elisseeff ldquoAn introduction to variable andfeature selectionrdquo Journal of Machine Learning Researchvol 3 pp 1157ndash1182 3003

[10] B Lu N Liu H Li et al ldquoQuantitative determination andcharacteristic wavelength selection of available nitrogen incoco-peat by NIR spectroscopyrdquo Soil and Tillage Researchvol 191 pp 266ndash274 2019

[11] M J Anzanello and F S Fogliatto ldquoA review of recentvariable selection methods in industrial and chemometricsapplicationsrdquo European Journal of Industrial Engineeringvol 8 no 5 p 619 2014

[12] Y-H Yun H-D Li B-C Deng and D-S Cao ldquoAn overviewof variable selection methods in multivariate analysis of near-infrared spectrardquo TrAC Trends in Analytical Chemistryvol 113 pp 102ndash115 2019

[13] B Nadler and R R Coifman ldquoe prediction error in CLSand PLS the importance of feature selection prior to mul-tivariate calibrationrdquo Journal of Chemometrics vol 19 no 2pp 107ndash118 2005

[14] T Mehmood K H Liland L Snipen and S Saeligboslash ldquoA reviewof variable selection methods in Partial Least Squares Re-gressionrdquo Chemometrics and Intelligent Laboratory Systemsvol 118 no 8 pp 62ndash69 2012

[15] V Centner D-L Massart O E de Noord S de JongB M Vandeginste and C Sterna ldquoElimination of uninfor-mative variables for multivariate calibrationrdquo AnalyticalChemistry vol 68 no 21 pp 3851ndash3858 1996



between variables. The prediction accuracy of the model established by the MC-UVE-SPA is higher than that of the model established by MC-UVE. The MC-UVE-SPA-MW ultimately selects more characteristic wavelengths than the MC-UVE-SPA, but the resulting model has better prediction accuracy.

Table 4: Number of selected wavelengths and boiling point prediction results.

Wavelength selection algorithm   Number of wavelengths   Model type   RMSECV (°C)   RMSEP (°C)   r        RPD
Full spectrum                    401                     PLS          4.6350        8.3841       0.9103   2.3704
MC-UVE                           262                     PLS          4.3184        10.0221      0.8311   1.7710
MC-UVE-SPA                       30                      MLR          3.5503        8.8676       0.9341   2.4650
MC-UVE-SPA-MW (w = 2L)           58                      MLR          3.9921        7.0442       0.9534   2.8212
MC-UVE-SPA-MW (w = 2R)           59                      MLR          4.0024        6.8573       0.9556   2.8982
MC-UVE-SPA-MW (w = 3)            83                      MLR          2.6478        5.9694       0.9752   3.9994

The maximum number of PCs was again set to 10 for both PLS models.
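The metrics in the table can be computed from paired measured and predicted values. The sketch below is illustrative only. It assumes that RMSEP is the root-mean-square error on the prediction set, that r is the Pearson correlation coefficient, and that RPD is the sample standard deviation of the measured values divided by RMSEP (a common definition; the definitions behind the table are not restated in this section). The function names and the sample numbers are hypothetical.

```python
import math
import statistics


def rmse(measured, predicted):
    """Root-mean-square error between reference and predicted values."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def pearson_r(measured, predicted):
    """Pearson correlation coefficient between reference and predicted values."""
    mean_m = statistics.fmean(measured)
    mean_p = statistics.fmean(predicted)
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_m * var_p)


def rpd(measured, predicted):
    """Residual predictive deviation: SD of the reference values over RMSEP.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return statistics.stdev(measured) / rmse(measured, predicted)


# Made-up boiling points (degC), for illustration only; not data from the paper.
measured = [180.0, 205.0, 230.0, 255.0, 280.0]
predicted = [184.0, 201.0, 233.0, 252.0, 282.0]

print(f"RMSEP = {rmse(measured, predicted):.4f}")
print(f"r     = {pearson_r(measured, predicted):.4f}")
print(f"RPD   = {rpd(measured, predicted):.4f}")
```

RMSECV would be obtained in the same way from the cross-validated predictions on the calibration set.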

Figure 11: Comparison of diesel fuel boiling point predictions from the prediction data set using (a) MC-UVE-PLS, (b) MC-UVE-SPA-MLR, (c) MC-UVE-SPA-MW-MLR (w = 2L), (d) MC-UVE-SPA-MW-MLR (w = 2R), and (e) MC-UVE-SPA-MW-MLR (w = 3). Each panel plots predicted boiling point (°C) against measured boiling point (°C), with both axes running from 150 to 300 °C.

Figure 12: Wavelengths selected from the NIR spectral data reflecting the boiling point of diesel fuel by the (a) MC-UVE, (b) MC-UVE-SPA, and (c) MC-UVE-SPA-MW algorithms. Each panel plots absorbance against wavelength (nm), with the wavelength axis running from 800 to 1500 nm.


4. Conclusions

The present study addressed the sparsity of wavelength variables selected by the cascaded MC-UVE-SPA through the application of a moving window, which improved the continuity of the selected wavelength variables and thereby better exploited the advantages of the MC-UVE algorithm and the SPA to obtain regression models with high prediction accuracy. The advantages of the proposed MC-UVE-SPA-MW were demonstrated by applying the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms to the selection of wavelength variables from the NIR spectral absorbance data of corn, diesel fuel, and ethylene. PLS and MLR models reflecting the oil content of corn, the boiling point of diesel fuel, and the ethylene concentration were thereby established and tested. The experimental results demonstrated that the progressive elimination of uncorrelated and collinear variables generated increasingly simplified partial-spectrum models with greater prediction accuracy than the full-spectrum model. Among the three wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelength variables, and the proposed MC-UVE-SPA-MW algorithm provided models with the greatest prediction accuracy.
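As a rough illustration of the moving-window step, the sketch below expands each SPA-selected index with its immediate spectral neighbours. It assumes that w = 2L adds the left neighbour, w = 2R adds the right neighbour, and w = 3 adds both. This reading is an assumption, although it is consistent with the wavelength counts reported in Tables 4 and 5 (for example, 30 points growing to 58, 59, and 83 once overlaps and spectrum edges are merged). The function name and the example indices are hypothetical.

```python
def expand_with_window(selected, n_variables, mode="3"):
    """Expand each selected index with its immediate spectral neighbours.

    Assumed window settings (this interpretation is not confirmed by the text):
      "2L": keep the selected point and its left neighbour
      "2R": keep the selected point and its right neighbour
      "3" : keep the selected point and both neighbours
    """
    offsets = {"2L": (-1, 0), "2R": (0, 1), "3": (-1, 0, 1)}[mode]
    expanded = set()
    for idx in selected:
        for off in offsets:
            j = idx + off
            if 0 <= j < n_variables:  # ignore neighbours outside the spectrum
                expanded.add(j)
    return sorted(expanded)  # duplicates from overlapping windows are merged


# Hypothetical SPA output: sparse indices on a 401-point wavelength axis.
spa_selected = [0, 57, 58, 200, 400]
for mode in ("2L", "2R", "3"):
    print(mode, expand_with_window(spa_selected, 401, mode))
```

Because overlapping windows and out-of-range neighbours are merged, the expanded set can contain fewer than two or three times the original number of points, which matches the pattern seen in the reported counts.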

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by grants from the Major National Science and Technology Special Project of China (JZ2015KJZZ0254) and the Key Projects of Natural Science Research in Universities in Anhui, China (KJ2018A0544).


Figure 13: Measured C2H4 gas absorbance spectra and wavelength variables selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms. The plot shows absorbance against wavenumber (cm⁻¹), with legend entries Original, MCUVE, MCUVE-SPA, and MCUVE-SPA-MW.

Table 5: Number of selected wavelengths and C2H4 gas concentration prediction results.

Wavelength selection algorithm   Number of wavelengths   Model type   RMSECV   RMSEP    r
Full spectrum                    4601                    PLS          0.1431   0.1342   0.9801
MC-UVE                           214                     PLS          0.1009   0.1212   0.9887
MC-UVE-SPA                       17                      MLR          0.1375   0.0537   0.9913
MC-UVE-SPA-MW (w = 2L)           34                      MLR          0.0989   0.0495   0.9988
MC-UVE-SPA-MW (w = 2R)           34                      MLR          0.0988   0.0486   0.9990
MC-UVE-SPA-MW (w = 3)            48                      MLR          0.1005   0.0434   0.9999

The maximum number of PCs was again set to 10 for both PLS models.




