5 ways big data disrupted research for mutation identification and causes

The rise of big data has become a significant turning point in biomedical research. Twenty years ago, it cost $100 million to sequence a single human genome. With the advent of high-throughput sequencing technologies, sequencing a human genome has become faster and far more cost-effective (average costs today are closer to $1,000), and this has opened up new doors for targeted diagnostics and therapies.

The mass of big data flowing in from all sources, from whole-genome and whole-exome sequencing to bulk and single-cell RNA-seq analysis, has changed the way academic researchers and biotechnology companies approach the study of diseases such as cancer and rare genetic disorders. Coupled with advances in artificial intelligence and machine learning, it has made the analysis of highly diverse biomedical datasets possible.

Big data and AI in the identification of driver mutations

AI has not only sped up the processing and analysis of biomedical data, but it also makes the analysis of highly complex sets of data possible on a level that could not be achieved with manual processing. One of the biggest challenges of integrating big data analytics into genomic analytics is the elimination of unnecessary noise.

For example, next-generation sequencing (NGS) technologies like whole-genome sequencing (WGS) and whole-exome sequencing (WES) often generate an extensive list of thousands of variants. DNA-seq and RNA-seq analysis generally reveal that a majority of these variants are benign, but any rare mutation needs to be treated as potentially pathogenic.

Available academic tools can winnow out benign variants on the basis of minor allele frequency, segregation, text-mining, genotype quality, dbSNP data, and predicted pathogenicity. However, none of these tools can pin down the mutations that cause a patient’s phenotype by elimination alone. The identification of driver mutations always demands additional investigation, including the use of external databases and the search for rare variants shared among patients with similar diseases.
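To make the filtering step concrete, here is a minimal sketch in Python. It is an illustration, not the method of any particular tool: the `Variant` fields, the `filter_candidates` helper, the thresholds, and the gene labels are all assumptions made up for this example, and the pathogenicity score stands in for whatever predictor a real pipeline would use.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    minor_allele_freq: float  # population frequency between 0 and 1
    genotype_quality: int     # Phred-scaled confidence in the call
    pathogenicity: float      # hypothetical predicted score between 0 and 1


def filter_candidates(variants, max_maf=0.01, min_gq=20, min_pathogenicity=0.5):
    """Keep rare, confidently called variants that are predicted to be damaging.

    The default thresholds are illustrative assumptions, not recommended values.
    """
    return [
        v for v in variants
        if v.minor_allele_freq <= max_maf
        and v.genotype_quality >= min_gq
        and v.pathogenicity >= min_pathogenicity
    ]


# Made-up example calls, not real patient data.
calls = [
    Variant("GENE_A", 0.20, 60, 0.10),    # common in the population: dropped
    Variant("GENE_B", 0.001, 10, 0.90),   # rare but poorly called: dropped
    Variant("GENE_C", 0.0005, 55, 0.85),  # rare, confident, predicted damaging: kept
]
candidates = filter_candidates(calls)
print([v.gene for v in candidates])
```

A filter like this only shortens the list. The variants that survive are candidates, and they still need the follow-up investigation described above.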

Big data in characterizing and categorizing rare and deadly diseases

Big data has influenced the way researchers categorize cancers. There was a time when “lung cancer” or “kidney cancer” were perfectly acceptable diagnoses. Today, scientists and oncologists understand that lung cancer or kidney cancer can refer to several different diseases, each of which arises from distinct mutations.

Identifying mutations in tumor cells is no longer difficult. Running a complete DNA-seq or RNA-seq analysis on the genetic material isolated from those cells is not challenging either. However, telling the disease-causing mutations apart from the non-driver mutations found in the same tumor cells can be a challenge.

Comparative analysis of biomedical datasets on de novo mutations, SNPs at the disease site, and CNVs promises new techniques for determining the driver mutations of different types of cancer. Distinguishing one kind of cancer from another based on its driver mutation or differences in molecular mechanism is opening new windows for personalized treatment, precision pharmacology, and targeted therapy.
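One simple form of comparative analysis is to count how many patients carry a mutation in each gene. The sketch below assumes a hypothetical `recurrent_genes` helper and a made-up cohort. Recurrence on its own does not prove that a gene is a driver, since long genes collect passenger mutations too, so treat the result as a ranked list of candidates.

```python
from collections import Counter


def recurrent_genes(patient_mutations, min_patients=2):
    """Rank genes by the number of patients whose tumor carries a mutation in them.

    patient_mutations maps a patient ID to the list of mutated genes.
    Each gene is counted at most once per patient.
    """
    counts = Counter()
    for genes in patient_mutations.values():
        counts.update(set(genes))
    hits = [(gene, n) for gene, n in counts.items() if n >= min_patients]
    # Most frequently mutated first, ties broken alphabetically.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Made-up cohort for illustration only.
cohort = {
    "P1": ["EGFR", "TP53", "TTN"],
    "P2": ["EGFR", "KRAS"],
    "P3": ["EGFR", "TP53", "TP53"],
}
ranked = recurrent_genes(cohort)
print(ranked)
```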

Cancer classification and increased survival rates

In the last couple of years, the big data approach to cancer research has changed how doctors describe non-small-cell lung carcinoma (NSCLC). It is now categorized by the predominant mutation found in NSCLC cells and not by the organ or tissue affected by the disease.

Using DNA-seq and RNA-seq analysis to categorize cancer by its driver mutations, rather than by the organs or tissues it affects, has enhanced the chances of survival and improved the prognosis for hundreds of people with cancers caused by rare genetic mutations. Treatments that target specific mutations in a single gene can help reduce the chances of treatment failure and of the side effects people associate with chemotherapy.

Big data analytics and discovery of driver mutations for other genetic diseases

An information-rich approach is not only helping in the diagnosis and treatment of cancer, but it is also helping the scientific community unravel the mystery of autism genetics. In a large study on autism, biomedical data from over 600 families were studied using RNA-seq analysis. Each family included a child diagnosed with autism and unaffected parents and siblings.

The study showed that there are hundreds of genes at play, but six significant candidates became the focus of several research groups working on the genetics of autism. In 2014, another similar study resulted in the discovery of 27 genes with rare de novo mutations in those diagnosed with autism.
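The idea behind a de novo mutation can be sketched in a few lines of Python: it is a variant seen in the child but in neither parent. The `de_novo_variants` helper and the coordinates below are made up for illustration. Real pipelines also have to account for sequencing depth and genotype quality, because a variant that was simply missed in a parent would look de novo here.

```python
def de_novo_variants(child, mother, father):
    """Return variants present in the child but absent from both parents.

    Each variant is a (chromosome, position, ref, alt) tuple and each
    argument is a set of such tuples.
    """
    return child - mother - father


# Made-up trio for illustration only.
mother = {("chr1", 1000, "A", "G"), ("chr2", 5000, "C", "T")}
father = {("chr1", 1000, "A", "G"), ("chr3", 7000, "G", "A")}
child = {
    ("chr1", 1000, "A", "G"),  # inherited, present in both parents
    ("chr3", 7000, "G", "A"),  # inherited from the father
    ("chr7", 4200, "T", "C"),  # seen in neither parent: de novo candidate
}
candidates = de_novo_variants(child, mother, father)
print(candidates)
```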

The year 2016 saw a breakthrough in the study of autism genetics when collaborative research combined data on de novo mutations with data on inherited mutations and CNV data. The Autism Genome Project played a significant role in the subsequent discovery of the 65 genes now linked with autism, and the 6 CNVs now considered the driving mutations. The study further went on to confidently identify 28 “autism genes” that will undoubtedly make the diagnosis of anyone with the charted mutations easier and faster in the near future.

What does big data analytics hold for the future of diagnostics and treatment of genetic diseases?

Whether it is autism genetics or cancer genetics, scientists are finally acquiring the sequencing tools, analytics algorithms, expansive datasets, and robust models necessary for searching beyond the exome. To date, most studies have focused on SNPs and mutations that occur within the exome, leaving around 98% of the genome unexplored. With big data analytics and AI, the scientific community is seeing new, powerful tools at its disposal that can aid in the identification, study, and targeting of disease-causing mutations.

Amit Sinha

Amit U Sinha, Ph.D. (Machine Learning and Genomics) is the founder and CEO of Basepair, an online NGS analysis platform. Amit is an expert in genomics and bioinformatics, with over a decade of experience in the field. Prior to founding Basepair, Amit worked as an investigator at Memorial Sloan Kettering Cancer Center. Additionally, he has held research faculty positions at the Dana Farber Cancer Institute and Harvard Medical School. Amit’s work focuses on leveraging technology to improve healthcare research by enabling scientists to make sense of big data quickly and accurately.