For women in the U.S., breast cancer death rates are second only to those of lung cancer. It is estimated that approximately 30% of cancers diagnosed in women will be breast cancers.
To further complicate the issue, breast cancer is a highly heterogeneous disease with distinct molecular subtypes. About 5-10% of breast cancers can be linked to gene mutations that are inherited from one’s mother or father; these mutations cause the encoded protein to malfunction, disrupt normal development, and increase the risk of cancer. About 85% of breast cancers occur in women who have no family history of breast cancer. These arise from genetic mutations that happen as a result of the aging process and life in general, rather than from inherited mutations.
Precision medicine is an emerging approach to disease treatment and prevention that takes into account individual variability in genes, environment, and lifestyle to predict more accurately which treatment and prevention strategies for a particular disease will work in which groups of people. Given the complexity of breast cancer treatment, it is imperative that researchers apply precision medicine to breast cancer diagnosis and treatment; for that to happen, a clearer picture of the genomes of breast tumors is needed.
A few decades ago, breast cancer classification systems were based on tumor response to endocrine therapy. In the last decade, new technologies for analyzing the genomic profiles of human tumors have substantially improved our knowledge of the molecular classification of breast cancer. Currently, molecular classification of breast tumors is used along with classical prognostic factors to predict tumor evolution and behavior and to select specific treatments accordingly. This development improves diagnostic accuracy and enhances the ability to individualize therapy for breast cancer, with direct implications for patient management.
Unfortunately, computational studies, such as those that identify connections between genetic patterns and cancer characterization, have often proven difficult to reproduce, and work is ongoing to characterize and improve the reproducibility of computational research.
Thus, our first objective was to replicate the work conducted by Zhiyuan et al. Using the same data and procedures, we collected the intrinsic gene list upon which their single sample predictor was built. The rationale behind the intrinsic gene list is that the gene expression pattern used to classify tumors summarizes characteristics of samples taken from the same tumor and reflects the fundamental molecular differences among cancer subtypes. We obtained similar results in an independent study in which our procedures were as closely matched to the original experiment
It is now possible to capture and analyze biological information at the genetic level. Using this genomic data for disease prognosis and diagnosis is arguably one of the most important applications of this knowledge. However, several problems currently impede the ability of researchers to analyze such data sets effectively. One issue is the difficulty of obtaining sufficient sample data in this domain. Because collecting and processing biological samples is expensive in both money and time, most data sets consist of very few samples.
These few samples, however, typically contain many thousands of genetic features. This has led some researchers to explore methods for merging data sets from various studies into a single, cohesive set. Given the differing objectives of different research groups, variations in sampling protocols, and the lack of a universal standard for data curation, unifying these data sets is not trivial. As in the original work, to generate a validation set we merged four publicly available breast cancer expression data sets using Distance Weighted Discrimination (DWD). DWD is an analytical tool capable of merging disparate, multivariate data sets by making global adjustments designed to minimize biases inherent in those sets.
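To make the idea of a global adjustment concrete, the following is a minimal sketch and not the DWD procedure itself. It assumes the commonly described form of the adjustment, in which each data set is shifted along a separating direction by its mean projection onto that direction. Here the unit vector between the two batch centroids is swapped in for the true DWD direction, which would require solving an optimization problem. All function names and the toy values are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length expression vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def batch_direction(batch_a, batch_b):
    """Stand-in for the DWD direction (an assumption): the unit vector
    between the two batch centroids. Fails if the centroids coincide."""
    diff = [x - y for x, y in zip(mean_vector(batch_a), mean_vector(batch_b))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]

def shift_along(batch, direction):
    """Subtract the batch's mean projection onto `direction` from each sample."""
    mean_proj = sum(dot(s, direction) for s in batch) / len(batch)
    return [[x - mean_proj * d for x, d in zip(s, direction)] for s in batch]

def merge_batches(batch_a, batch_b):
    """Shift both batches along the separating direction, then concatenate."""
    w = batch_direction(batch_a, batch_b)
    return shift_along(batch_a, w) + shift_along(batch_b, w)

# Toy data: two batches (rows = samples, columns = genes) that differ
# mainly by a systematic offset in the first gene.
batch_a = [[1.0, 2.0], [1.2, 2.1], [0.8, 1.9]]
batch_b = [[3.0, 2.0], [3.1, 2.2], [2.9, 1.8]]
merged = merge_batches(batch_a, batch_b)
```

After the shift, both batches have zero mean projection along the separating direction, so the systematic offset between them is removed while variation orthogonal to that direction is left untouched. Merging more than two data sets would require applying such an adjustment repeatedly, which this sketch does not cover.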
Finally, we performed both hierarchical clustering and k-means clustering to seek interesting subpopulations of genes. Many previous analyses of gene expression data for classifying breast tumors used hierarchical clustering. We extend that work by trying different forms of hierarchical clustering, as well as k-means clustering, to discover the distinctive molecular portrait of each tumor. We show that we were able to achieve high reproducibility in identifying most of the subgroups previously identified in other studies.
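As an illustration only, the two clustering strategies can be sketched on toy data as follows. This is a minimal sketch under stated assumptions, not the pipeline used in this study: the expression values are invented, the hierarchical variant shown is average linkage with Euclidean distance, and k-means is seeded with a deterministic farthest-first initialization chosen here for reproducibility. All function names are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def farthest_first(samples, k):
    """Deterministic seeding (an assumption): start from the first sample,
    then repeatedly add the sample farthest from all chosen centers."""
    centers = [samples[0]]
    while len(centers) < k:
        centers.append(max(samples,
                           key=lambda s: min(euclidean(s, c) for c in centers)))
    return [list(c) for c in centers]

def kmeans(samples, k, n_iter=100):
    """Lloyd's algorithm; returns one cluster label per sample."""
    centers = farthest_first(samples, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: euclidean(s, centers[j]))
                      for s in samples]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = centroid(members)
    return labels

def average_linkage(samples, k):
    """Agglomerative clustering: merge the two closest clusters
    (mean pairwise distance) until only k clusters remain."""
    clusters = [[i] for i in range(len(samples))]

    def dist(c1, c2):
        total = sum(euclidean(samples[i], samples[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    labels = [0] * len(samples)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy expression profiles (rows = samples, columns = genes), invented here.
samples = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.2], [0.2, 0.0, 0.1],
           [5.0, 5.1, 4.9], [4.9, 5.2, 5.0], [5.1, 4.8, 5.2]]
km_labels = kmeans(samples, k=2)
hc_labels = average_linkage(samples, k=2)
```

On data this cleanly separated both methods recover the same two groups. On real expression data the partitions can differ, which is why comparing the two approaches is informative.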
Using Kaplan-Meier survival analysis and the Cox proportional-hazards model, we compare the differences in outcomes, and the association with other clinical parameters, between the groups. While hierarchical clustering yields some distinct subgroups of genes, we show that k-means clustering performs better at identifying distinct subgroups that are consistently predictive of a patient's clinical outcomes, as evidenced by the prediction of relapse-free survival and overall survival of each identified