  The random forest (RF) method is adopted here to build the final classification model. RF is a powerful and practical classifier that trains an ensemble of decision trees and combines their predictions, and it has been extensively employed in chemometrics and bioinformatics. The RF method has two main advantages. First, it can use an out-of-bag set to monitor error, strength, and correlation. Second, it can measure variable importance through permutation. The RF method can handle high-dimensional data and approach the best predictor by further reducing the dimension of the feature space and identifying an appropriate number of features.

  In this work, we used the entire set of features to establish an RF classification model on the basis of 10-fold cross-validation. The importance of each feature, in terms of its association with the prediction target, was then assessed. Finally, we chose the model with the fewest top-ranking features whose prediction performance was similar to that obtained with the entire feature space. The feature selection process also helped us to locate the key features for predicting lung cancer. RF was run using the randomForest package in R.

  Both internal 10-fold cross-validation and external validation were adopted in this work to obtain a reliable predictor for lung cancer. The entire modeling process, including feature ranking, RF parameter adjustment, and final model selection, was performed on the training set only, using 10-fold cross-validation; the external validation set was not involved in any of these model-building steps, as emphasized in Smialowski’s work. In 10-fold cross-validation, the internal set is randomly divided into 10 non-overlapping parts, one of which is used as the test set while the rest are used as the training set. This process is repeated 10 times so that every sample is used in a test set exactly once. This rotation helps to establish a stable classification model for predicting lung cancer. The results averaged over the 10 runs of the rotation are reported as the final 10-fold cross-validation result.
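The fold rotation described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' R code: the function names `k_fold_splits` and `cross_validate`, the `seed` parameter, and the `fit_and_score` callback are all hypothetical, and the callback stands in for whatever model training and scoring is actually used.

```python
import random


def k_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs over non-overlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment of samples to folds
    # Fold i takes every n_folds-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, fit_and_score, n_folds=10, seed=0):
    """Average a score over the folds.

    fit_and_score(train_idx, test_idx) is an assumed callback that trains a
    model on train_idx and returns its score on test_idx.
    """
    scores = [
        fit_and_score(train_idx, test_idx)
        for train_idx, test_idx in k_fold_splits(n_samples, n_folds, seed)
    ]
    return sum(scores) / len(scores)
```

Each sample index appears in exactly one test fold, so every sample is held out once, and the returned value is the mean of the per-fold scores.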

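  Five frequently used indicators were adopted here to evaluate the final performance of the RBLC method: sensitivity (Sens), specificity (Spec), accuracy (ACC), the Matthews correlation coefficient (MCC), and the area under the ROC curve (AUC). The first four are computed from the confusion matrix using their standard definitions:

$$\mathrm{Sens} = \frac{TP}{TP + FN}$$

$$\mathrm{Spec} = \frac{TN}{TN + FP}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN stand for the numbers of true-positive, true-negative, false-positive, and false-negative predictions, respectively.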
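As a rough sketch, the four count-based indicators can be computed from TP, TN, FP, and FN using their standard definitions. The function name `classification_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration, not part of the original method.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return Sens, Spec, ACC, and MCC from confusion-matrix counts."""

    def safe_div(num, den):
        # Assumed convention: report 0.0 when the denominator is zero.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sens": safe_div(tp, tp + fn),            # true-positive rate
        "Spec": safe_div(tn, tn + fp),            # true-negative rate
        "ACC": safe_div(tp + tn, tp + tn + fp + fn),
        "MCC": safe_div(tp * tn - fp * fn, mcc_den),
    }


# Example with made-up counts: 40 TP, 45 TN, 5 FP, 10 FN.
example = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Unlike ACC, MCC uses all four counts in both numerator and denominator, which makes it less sensitive to class imbalance.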

  The receiver-operating characteristic (ROC) curve is a composite indicator of Sens and Spec, plotted with Sens on the y-axis and 1-Spec on the x-axis. A useful property of the ROC curve is that it remains essentially unchanged when the positive and negative samples in the test set are out of balance. AUC, the area under the ROC curve, is one of the main evaluation indices for a binary classifier system and ranges from 0 to 1. The closer the AUC is to 1, the better the model's performance in predicting lung cancer; a value of 0.5 corresponds to random guessing. Adjusting the decision threshold of the prediction model yields different false-positive and true-positive rates, and each resulting pair of rates gives one point on the ROC curve.

  The ROC curve is a graphical plot that illustrates the performance of a binary classifier system as Sens and Spec vary with the decision threshold. Each point on the ROC curve is created by plotting the true-positive rate against the false-positive rate at a particular decision threshold. AUC is the area under the ROC curve and provides a comprehensive evaluation of a binary classification method.
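The threshold sweep that produces the ROC curve, and the area under it, can be illustrated as follows. This is a minimal sketch in plain Python under stated assumptions: labels are coded as 1 (positive) and 0 (negative), both classes are present, a higher score means "more likely positive", and the function names `roc_points` and `auc_trapezoid` are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (false-positive rate, true-positive rate) points, one per distinct threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Sort by decreasing score: lowering the threshold admits samples in this order.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Integrate the ROC curve with the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up labels and scores for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc_trapezoid(roc_points(labels, scores))
```

For these example values, 8 of the 9 positive-negative pairs are ranked correctly, so the area equals 8/9, which matches the interpretation of AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one.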

Copyright © 2019 - Shuyan Li · All Rights Reserved