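About the author: Liu Lingyu (born 1987), female, assistant researcher at the Institute of Automation, Qilu University of Technology (Shandong Academy of Sciences). Research interests: terahertz technology and applications. E-mail: llytime@163.com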
Chinese-language editor: Fang Yuan; English-language editor: Qian Mo
1) Institute of Automation, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250014, Shandong Province, P.R. China
2) College of Instrumentation and Electrical Engineering, Jilin University, Changchun 130061, Jilin Province, P.R. China
spectroscopy; terahertz time-domain spectroscopy; principal component analysis; linear discriminant analysis; American ginseng; identification
DOI: 10.3724/SP.J.1249.2019.02207
Terahertz time-domain spectroscopy is a new spectroscopic measurement technique that has been widely applied in material detection because it penetrates most non-conducting and non-polar materials and is intrinsically safe. In this work, terahertz time-domain spectroscopy combined with principal component analysis and linear discriminant analysis was used to establish a non-destructive identification model for American ginseng after extraction and authentic American ginseng. Within the terahertz range studied, the absorbance spectra of extracted and authentic American ginseng are highly similar. The leave-one-out approach was used to evaluate the classification performance of the principal component analysis and linear discriminant analysis model. The results show that the cumulative variance contribution rate of the first three principal components exceeds 98.1%, that the recognition rates of the model are 100% for extracted American ginseng and 96.7% for authentic American ginseng, and that the total recognition rate is 98.3%. The study shows that combining terahertz time-domain spectroscopy with principal component analysis and linear discriminant analysis can accurately and reliably distinguish extracted American ginseng from authentic, unextracted American ginseng.
American ginseng is a herbaceous perennial plant in the ivy family and is one of the most important Chinese herbal medicines[1]. Although it is native to northeastern America, it is also cultivated in China and has a prominent position on the list of best-selling herbal products in the world. Its chemical composition is a complex mixture that includes saponins, naphtha-class compounds, carbohydrates and starch[2]. The pharmacological activity of American ginseng is due to the combination of its constituents and not the presence of a single compound[3]. The major bioactive ingredients in American ginseng are ginsenosides, of which there are more than thirty. The major ginsenosides Rb1, Rb2, Rc, Rd, Re and Rg1 account for 90% of the total ginsenosides and are used as markers of American ginseng quality[1,4-5]. American ginseng is reported to have many beneficial biological and pharmaceutical properties, e.g. anti-anxiety, anti-tumor, and antioxidant properties, and is generally considered helpful for a healthy immune system and central nervous system, as well as in slowing the aging process[6-7].
In some cases, after the major bioactive ingredients are extracted, the residue of the American ginseng is sold as authentic ginseng. Because the residue is morphologically similar to authentic American ginseng, it is difficult to detect by morphological and histological means. New methods are therefore needed for distinguishing American ginseng after extraction from authentic American ginseng.
Current herbal identification methods include microscopic analysis, thin-layer chromatography (TLC), high-performance liquid chromatography (HPLC), and infrared spectroscopy (IR)[2, 8-9]. Microscopic analysis is subject to misinterpretation. The operating procedures of TLC and HPLC are complicated; the instruments required for HPLC are expensive and the solvents must be strictly purified. The IR method is susceptible to the dispersion of the samples and to interference from various heat sources in the environment, which decreases its accuracy.
Terahertz time-domain spectroscopy (THz-TDS) is a new spectroscopic measurement technique that has been widely applied in material analysis. Terahertz usually refers to electromagnetic waves with frequencies from 0.1 to 10 THz, or a wavelength range of 3 mm to 30 μm. Located in the transition region between electronics and photonics, terahertz waves occupy a very important position in the electromagnetic spectrum and are of great value in both theory and practice[10-11]. This frequency region is attractive because many chemical substances have vibrational and rotational modes therein. Because the energy of terahertz radiation is low (about 4 meV per photon at 1 THz), it does not cause ionization damage. The molecular response of biological material to terahertz radiation is of particular interest in numerous fields[12-14]. Terahertz spectroscopy has been used for quantitative and qualitative analysis of various pharmaceutical and environmental samples[15-18].
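The band limits quoted for terahertz radiation can be checked with a short conversion. The sketch below is only an illustration; the function names `wavelength_um` and `photon_energy_mev` are hypothetical, and the constants are the standard values of the speed of light and the Planck constant.

```python
# Convert terahertz frequencies to free-space wavelength and photon energy.
C = 299_792_458.0          # speed of light in vacuum, m/s
H_EV = 4.135667696e-15     # Planck constant, eV*s


def wavelength_um(freq_thz: float) -> float:
    """Free-space wavelength in micrometres for a frequency given in THz."""
    return C / (freq_thz * 1e12) * 1e6


def photon_energy_mev(freq_thz: float) -> float:
    """Photon energy in meV for a frequency given in THz."""
    return H_EV * freq_thz * 1e12 * 1e3


for f in (0.1, 1.0, 10.0):
    print(f"{f:5.1f} THz  {wavelength_um(f):8.1f} um  {photon_energy_mev(f):6.2f} meV")
```

The two band edges come out at roughly 3 mm and 30 μm, and the photon energy stays in the millielectronvolt range, several orders of magnitude below typical ionization energies.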
Since American ginseng after extraction and authentic American ginseng do not have obvious characteristic absorption peaks in the terahertz band, they cannot be identified by fingerprint analysis. Principal component analysis (PCA) was therefore adopted to reduce the redundancy of the spectra, and principal component analysis combined with linear discriminant analysis (PCA-LDA) was used to establish a qualitative model that distinguishes American ginseng after extraction from authentic American ginseng. Three principal component features were extracted and fed as inputs into a linear discriminant analysis (LDA) model. The leave-one-out (LOO) approach was used to evaluate the classification performance of the PCA-LDA model.
Experimental spectra were obtained with a THz-TDS system developed by Zomega Terahertz Corporation, which provides a usable frequency range of 0-3 THz. The time-domain range is 0-110 ps, with a time resolution of 0.05 ps and a spectral resolution of 9 GHz. The THz-TDS system has two modes, transmission and reflection; the transmission mode was used in this study. The schematic diagram and photograph of the THz-TDS system are shown in Fig.1 and Fig.2. To avoid absorption by water vapor during the measurements, the system was placed in an enclosed chamber and purged with dry air. The relative humidity was maintained at approximately 3%, and the measurements were made at room temperature.
Several extraction technologies exist for American ginseng. In this work, the water-extracting method was used. Water extraction was performed by solid-liquid heating using water as the solvent; this method is considered an environmentally friendly and efficient technology for extracting bioactive components from American ginseng. The water and American ginseng mixture had a volume ratio of 8:1. The mixture was decocted twice, for 60 minutes each time[19]. In what follows, the American ginseng processed with this water-extracting treatment is referred to as treated American ginseng, while authentic American ginseng is referred to as untreated American ginseng.
The treated and untreated American ginseng were dried and ground into powder. They were then pressed into sample tablets with a thickness of approximately 1 mm and a diameter of 13 mm, much larger than the spot size of the terahertz beam (about 1 mm). Thirty samples were prepared for each of the two groups. Before the test, the samples were dried again to remove the influence of water.
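The quoted scan settings of the THz-TDS system are mutually consistent, which a few lines of arithmetic can confirm. This is a minimal sketch that uses only the 110 ps window and 0.05 ps step stated in the text; the variable names are illustrative, and the standard relations (frequency spacing equals the reciprocal of the window length, Nyquist limit equals half the sampling rate) are assumed to apply to this instrument.

```python
# Relate the quoted time-domain settings to the frequency-domain grid.
window_ps = 110.0   # scan length from the text, ps
step_ps = 0.05      # time step from the text, ps

n_points = round(window_ps / step_ps)        # samples per waveform
resolution_ghz = 1.0 / window_ps * 1e3       # 1/ps is THz, so x1000 gives GHz
nyquist_thz = 1.0 / (2.0 * step_ps)          # highest unaliased frequency, THz

print(n_points, round(resolution_ghz, 2), nyquist_thz)
```

The frequency spacing of about 9.09 GHz matches the stated 9 GHz spectral resolution, and the Nyquist limit lies well above the usable 0-3 THz range.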
Fig.1 Schematic diagram of the THz-TDS system
Fig.2 Photograph of the THz-TDS system
The time-domain sample signal E_trans(t) and the reference signal E_0(t) were collected by focusing the terahertz beam with and without the sample in the optical path, respectively. After applying the fast Fourier transform (FFT), we obtained the reference and sample signals in the frequency domain, E_0(ω) and E_trans(ω), respectively, which were used to calculate the absorbance spectrum. The absorbance A is a dimensionless relative quantity and is calculated as in Eq.(1)[20]
$$
A = -\frac{1}{d}\,\ln\frac{E_{\mathrm{trans}}^{2}(\omega)}{E_{0}^{2}(\omega)} \tag{1}
$$
where ω is the frequency of the terahertz wave and d is the thickness of the sample.
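The calculation in Eq.(1) can be sketched in a few lines. The example below is an illustration on synthetic data, not the authors' processing code: the helper names `dft` and `absorbance` are hypothetical, the squared spectra are taken as squared magnitudes of the complex spectra (an assumption), and a naive discrete Fourier transform stands in for a library FFT so that the block has no dependencies.

```python
import cmath
import math


def dft(signal):
    """Naive DFT returning the non-negative frequency bins only."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def absorbance(e_ref, e_sample, thickness):
    """Eq.(1): A = -(1/d) * ln(|E_trans|^2 / |E_0|^2), one value per bin."""
    spec_ref, spec_sample = dft(e_ref), dft(e_sample)
    return [
        -math.log(abs(s) ** 2 / abs(r) ** 2) / thickness
        for r, s in zip(spec_ref, spec_sample)
    ]


# Synthetic stand-in data (not measurements): a short pulse as the reference,
# and a copy delayed by 5 samples and halved in amplitude as the "sample".
n, dt_ps = 128, 0.05
e0 = [math.exp(-((i - 20) / 1.5) ** 2) for i in range(n)]
etrans = [0.5 * e0[(i - 5) % n] for i in range(n)]

a = absorbance(e0, etrans, thickness=1.0)              # d = 1 (e.g. 1 mm)
freqs_thz = [k / (n * dt_ps) for k in range(len(a))]   # frequency axis, THz
print(freqs_thz[1], a[1])
```

Because the synthetic sample signal is exactly half the reference amplitude at every frequency, the absorbance is flat at ln 4 for d = 1. With measured data the ratio is only meaningful in bins where the reference carries enough power, which is why an effective frequency range has to be chosen.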
PCA is a statistical extraction method which reduces the dimensionality of a data set through mathematical transformation. The comprehensive variables derived from the transformation are the principal components which summarize the features of the data set. All the principal components are uncorrelated and ordered. Each principal component is a linear combination of the original variables. When the cumulative variance contribution rate of the first k principal components is large enough(typically P≥85%), the original data set can be replaced approximately with the first k principal components.
PCA projects n-dimensional features onto k dimensions (k<n). The greatest variance lies along the first coordinate, called the first principal component, the second greatest variance lies along the second coordinate, and so on. Consider a data set of p variables, X=(X_1,X_2,…,X_p)^T. The principal components are defined as
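$$
\begin{aligned}
PC_1 &= a_1^{\top}X = a_{11}X_1 + a_{21}X_2 + \cdots + a_{p1}X_p \\
PC_2 &= a_2^{\top}X = a_{12}X_1 + a_{22}X_2 + \cdots + a_{p2}X_p \\
&\;\;\vdots \\
PC_p &= a_p^{\top}X = a_{1p}X_1 + a_{2p}X_2 + \cdots + a_{pp}X_p
\end{aligned}
$$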
After the transformation, the original data of p variables X1,X2,…,Xp are replaced by the variables PC1, PC2, …, PCp in the new coordinate system which successively inherit the maximum possible variance from X[21].
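The link between these linear combinations and the cumulative variance contribution rate P can be written out explicitly. The relations below assume, as is standard for PCA, that the coefficient vectors a_i are the unit-length eigenvectors of the covariance matrix of X; the symbols Σ (covariance matrix), λ_i (its eigenvalues in decreasing order) and P_k (cumulative rate of the first k components) are introduced here for illustration and do not appear in the original text.

```latex
\operatorname{Var}(PC_i) = a_i^{\top}\Sigma\,a_i = \lambda_i, \qquad
\operatorname{Cov}(PC_i, PC_j) = a_i^{\top}\Sigma\,a_j = 0 \quad (i \neq j), \qquad
P_k = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{j=1}^{p}\lambda_j}
```

Under these assumptions each eigenvalue is the variance carried by one principal component, so the contribution rate of a component is its eigenvalue divided by the sum of all eigenvalues.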
Before PCA is carried out, the original absorbance spectra should be standardized so that the variables are comparable. After standardization, the original data become dimensionless. Standard scores are also called z-values, z-scores, normal scores, and standardized variables. The computation of the z-score is based on the mean and standard deviation of the original data.
The standard score was calculated using the sample mean and sample standard deviation as estimates of the population values[22]. In this case, the z-score is
$$
z = \frac{x - \bar{x}}{S} \tag{2}
$$
where $\bar{x}$ is the mean of the sample, and S is the standard deviation of the sample.
The absolute value of z represents the distance between the raw score and the sample mean in units of the standard deviation. z is negative when the raw score is below the mean, positive when above.
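Eq.(2) translates directly into code. The sketch below assumes that standardization is applied separately to each spectral variable (one frequency point across all samples); the function name `z_scores` and the numbers are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize values with the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (n - 1 denominator)
    return [(x - mean) / s for x in values]


# Hypothetical absorbance values at one frequency point across five samples.
column = [2.0, 4.0, 4.0, 6.0, 9.0]
z = z_scores(column)
print([round(v, 3) for v in z])
```

After the transformation the values have zero mean and unit sample standard deviation, and the sign of each z-score indicates whether the raw value lies below or above the mean.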
The LDA classifier was applied for identification after the PCA data projection. Multivariate analysis (PCA and LDA) combined with terahertz spectroscopy has good potential for the identification and classification of biological materials. LDA is a classical linear method for dimensionality reduction and classification that makes use of prior knowledge of the categories. Both LDA and PCA are commonly used for dimensionality reduction. PCA is an unsupervised algorithm that ignores class labels. In contrast, LDA is a supervised algorithm that computes the directions (axes) that maximize the separation between classes. The two are often used in combination: PCA for dimensionality reduction, followed by LDA for classification.
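The supervised step can be illustrated with a two-class Fisher discriminant. This is a simplified sketch rather than the model used in the study: it works on two features instead of three principal component scores so that the matrix inverse can be written out by hand, it assumes equal class priors, and all function names and data points are hypothetical.

```python
def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def mean_vec(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, m):
    """2x2 scatter matrix of the rows around their class mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_lda(class_a, class_b):
    """Return (w, threshold) with w = Sw^-1 (m_b - m_a) and a midpoint threshold."""
    ma, mb = mean_vec(class_a), mean_vec(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [mb[0] - ma[0], mb[1] - ma[1]]
    w = [
        (sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
        (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det,
    ]
    return w, 0.5 * (dot(w, ma) + dot(w, mb))


def classify(x, w, threshold):
    """Label 1 for class_b (untreated), -1 for class_a (treated)."""
    return 1 if dot(w, x) > threshold else -1


# Hypothetical principal component scores, not values from the study.
treated = [(-2.0, 0.3), (-1.5, -0.2), (-1.8, 0.1), (-2.2, -0.4)]
untreated = [(1.6, 0.2), (2.1, -0.1), (1.9, 0.4), (1.4, -0.3)]
w, thr = fit_lda(treated, untreated)
print(classify((-1.0, 0.0), w, thr), classify((1.0, 0.0), w, thr))
```

The discriminant direction w is the one along which the class means are furthest apart relative to the within-class scatter, which is the sense in which LDA uses the class labels that PCA ignores.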
The classification performance was evaluated using the LOO approach[23]. LOO is commonly used when the data set is too small to provide separate, independent training and test sets. For a data set of N samples, each sample in turn is used alone as the test set while the remaining (N-1) samples are used as the training set. The category of the held-out sample is predicted and compared with its known category. This procedure is repeated N times (the total number of measured spectra), each time leaving out a different sample. LOO thus yields N models, and the average prediction error of the N models is used as the performance index.
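The LOO procedure is independent of the classifier being evaluated, as the following sketch shows. A nearest-class-mean rule is used here purely as a stand-in for the PCA-LDA model, and the function names and data are hypothetical.

```python
def nearest_mean_predict(train_x, train_y, x):
    """Stand-in classifier: assign x to the class whose training mean is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [p for p, y in zip(train_x, train_y) if y == label]
        centre = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centre))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(xs, ys, predict):
    """Hold out each sample once, train on the other N-1, return the hit rate."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


# Hypothetical 2-D feature vectors with labels -1 (treated) and 1 (untreated).
xs = [(-2.0, 0.3), (-1.5, -0.2), (-1.8, 0.1), (0.2, 0.0),
      (1.6, 0.2), (2.1, -0.1), (1.9, 0.4), (1.4, -0.3)]
ys = [-1, -1, -1, -1, 1, 1, 1, 1]
print(leave_one_out_accuracy(xs, ys, nearest_mean_predict))
```

In this toy data one treated point lies close to the untreated cluster and is misclassified when held out, so seven of the eight predictions are correct. Any preprocessing fitted to the data, such as standardization or PCA, would normally be refitted inside each fold so that the held-out sample does not influence the model.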
The terahertz time-domain transmission signals were measured for 30 treated American ginseng samples and 30 untreated American ginseng samples. The time-domain signals were then transformed to power density spectra by the fast Fourier transform, and the absorbance was calculated according to Eq.(1). The effective frequency range of the absorbance spectra, 0.2 to 1.4 THz, was determined from the power density spectra.
The absorbance spectra of the 30 treated American ginseng samples and the 30 untreated American ginseng samples are shown in Fig.3. Neither group shows significant characteristic peaks, and the extraction of the bioactive components has little influence on the absorbance spectra of American ginseng. Because the absorbance spectra of treated and untreated American ginseng are highly similar in this frequency range, it is difficult to distinguish them by fingerprint analysis.
Fig.3 Terahertz absorbance spectra of treated and untreated American ginseng
PCA was applied to reduce the dimensionality and extract features from the original absorbance spectra. Before carrying out PCA, the original absorbance spectra were standardized by Eq.(2). As seen in Table 1, the eigenvalues of the top three principal components extracted from the terahertz absorbance spectra of the 60 samples were 0.5159, 0.0529 and 0.0059, respectively. The variance contribution rates of the top three principal components of the absorbance spectra were 88.15%, 9.04% and 1.01%, respectively, so these components are representative of the spectral features of the original absorbance spectra. As shown in Table 1, the accumulated variance of the top three principal components reaches 98.10%. PCA effectively reduced the number of variables while maintaining the spectral features of the original absorbance spectra. In the spectra around 0.7 to 1.0 THz, the number of spectral variables was reduced from 70 to 3 by the PCA method.
Table 1 Total variance explained by each component
Fig.4 shows that the treated and untreated American ginseng samples are clearly separated in the score plot of their top three principal components. The treated and untreated American ginseng can therefore be distinguished by combining the terahertz technique with the PCA method. Furthermore, this combination could be extended to other Chinese herbal medicines.
Fig.4 Principal component scores of the absorbance spectra of treated American ginseng and untreated American ginseng
These three principal component scores were input to the linear discriminant analysis (LDA) model for classification. The labels -1 and 1 denote treated and untreated American ginseng, respectively. The LOO approach was used to evaluate the classification performance of the LDA model. The results are shown in Table 2, where M_i is the number of samples identified as class i in the test. Fig.5 presents the posterior probabilities of the 60 samples under the PCA-LDA model. Table 2 shows that the PCA-LDA model can accurately distinguish treated from untreated American ginseng. For the spectra acquired with terahertz spectroscopy, the recognition rate of the PCA-LDA model is 100% for treated American ginseng and 96.7% for untreated American ginseng, and the total recognition rate is 98.3%.
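The three reported rates are consistent with a single misclassified untreated sample. The counts in the sketch below are inferred from the reported percentages and the 30 samples per class; they are an assumption, and Table 2 remains the authoritative source.

```python
# Per-class counts implied by the reported rates (assumed, see Table 2):
# 30 of 30 treated and 29 of 30 untreated samples classified correctly.
correct = {"treated": 30, "untreated": 29}
total = {"treated": 30, "untreated": 30}

rates = {k: 100.0 * correct[k] / total[k] for k in correct}
overall = 100.0 * sum(correct.values()) / sum(total.values())

print({k: round(v, 1) for k, v in rates.items()}, round(overall, 1))
```

With these counts the per-class rates round to 100% and 96.7%, and 59 correct predictions out of 60 give the total rate of 98.3%.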
Table 2 Model validation results
Fig.5 Posterior probabilities of the 60 samples identified by the PCA-LDA model
Terahertz spectroscopy is extending to new application fields. In this work, THz-TDS combined with PCA-LDA was applied to establish a non-destructive identification model for American ginseng after extraction and authentic American ginseng. The absorbance spectra of the two cannot be effectively distinguished by analyzing their spectral characteristics in the frequency range of 0.1 to 1.4 THz. An identification model based on PCA-LDA was therefore built and validated, and the leave-one-out approach was used to evaluate its classification performance. When PCA is applied for dimensionality reduction and feature extraction of the original absorbance spectra, the accumulated variance of the top three principal components reaches 98.1%. The recognition rate of the PCA-LDA model is 100% for treated American ginseng and 96.7% for untreated American ginseng, and the total recognition rate is 98.3%. Terahertz spectroscopy combined with the PCA-LDA model can thus accurately distinguish American ginseng after extraction from authentic American ginseng. Terahertz spectroscopy combined with chemometric methods could be a promising approach for certifying the quality of American ginseng, and the method could be readily extended to other Chinese herbal medicines.
JOURNAL OF SHENZHEN UNIVERSITY SCIENCE AND ENGINEERING