Blood-based methods using circulating tumor DNA (ctDNA) and cell-free DNA (cfDNA) are being developed for early, minimally invasive detection of lung cancer, but they have performed poorly in detecting cancers at the earliest stages (stages 0-II). To address this limitation, researchers have proposed a machine-learning approach that uses RNA gene expression profiles as surrogate indicators of the cancer disease phenotype. A previous study identified and validated 23 miRNA biomarkers for non-invasive diagnostic classification of lung adenocarcinoma, achieving a sensitivity of 97.7% and a specificity of 98.7% in blood samples from 383 clinical subjects. The objective of the present study was to evaluate this 23-miRNA signature for the early detection of lung cancer using a machine-learning algorithm trained on the 23 miRNA features.
A large and diverse clinical cohort was obtained from the National Institutes of Health (NIH) Gene Expression Omnibus (GEO) database under accession number GSE137140. The cohort comprised 3,744 participants: patients with pre-operative lung cancer (n=1,566) and non-cancer controls (n=2,178). miRNA was extracted from serum samples for analysis. The data were analyzed with a machine-learning classifier based on XGBoost, trained with the XGBoost R library (version 1.4.1.1) running under R v3.6.3.
Most patients with lung cancer had early-stage disease (87.7% stage I/II). The lung cancer cohort spanned histologic types, with adenocarcinoma in the majority of cases (77.8%) and non-adenocarcinoma in the remainder (22.2%). In addition, 37.9% of participants identified themselves as never smokers. In the held-out test set, the 23-miRNA signature achieved a sensitivity of 98% and a specificity of 89%.
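For readers less familiar with the two metrics reported above, the following is a minimal Python sketch of how sensitivity and specificity are computed from a classifier's predictions on a held-out set. It is an illustration only and not the study's pipeline, which used the XGBoost R library. The names `ConfusionCounts`, `confusion_from_labels`, `sensitivity`, and `specificity` are hypothetical helpers, and the counts in the demo are invented so the ratios are easy to check; they are not data from GSE137140.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing predicted labels with true labels (1 = cancer)."""
    tp: int  # cancer cases predicted as cancer
    fn: int  # cancer cases predicted as non-cancer
    tn: int  # controls predicted as non-cancer
    fp: int  # controls predicted as cancer


def confusion_from_labels(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionCounts:
    """Tally the four confusion-matrix cells for binary labels (0 or 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return ConfusionCounts(tp=tp, fn=fn, tn=tn, fp=fp)


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of true cancer cases that the classifier detects."""
    positives = c.tp + c.fn
    if positives == 0:
        raise ValueError("no positive cases; sensitivity is undefined")
    return c.tp / positives


def specificity(c: ConfusionCounts) -> float:
    """Fraction of non-cancer controls that the classifier correctly clears."""
    negatives = c.tn + c.fp
    if negatives == 0:
        raise ValueError("no negative cases; specificity is undefined")
    return c.tn / negatives


if __name__ == "__main__":
    # HYPOTHETICAL counts for illustration only; not the study's results.
    demo = ConfusionCounts(tp=45, fn=5, tn=80, fp=20)
    print(f"sensitivity = {sensitivity(demo):.2f}")  # 45 / 50  -> 0.90
    print(f"specificity = {specificity(demo):.2f}")  # 80 / 100 -> 0.80
```

Both ratios are computed within a single class, so neither depends on the proportion of cases to controls in the test set. This makes them a natural choice for a cohort such as this one, in which cases and controls are not equally represented.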
The machine-learning approach using RNA gene expression in patient serum achieved high sensitivity and specificity (98% and 89%, respectively, in the held-out test set) in a large cohort consisting predominantly of early-stage lung cancer cases. This result highlights the potential of a multi-analyte, multimodal strategy that combines machine-learning algorithms and RNA gene expression profiles with pertinent demographic and clinical risk factors to detect lung cancer at its earliest stages. Moreover, successful translation of this approach from microarray technology to PCR instrumentation enhances its practicality and feasibility in clinical settings. Efforts to further validate the effectiveness and reliability of this machine-learning method are ongoing.