[Podcast Transcript] Is Genomical replacing Astronomical?
Land of Digital Opportunities is a podcast series for business leaders driven by innovation and productivity, and keen on growing an idea, a team, or a company! Every week, MediaAgility sits down with someone who has strong innovation acumen and has achieved a remarkable business presence in the digital industry, and together we unfold the ‘Why’.
Goldy Arora: Welcome to the 3rd episode of the Land of Digital Opportunities podcast by MediaAgility, a global digital consulting company. My perception has always been that the data social media generates is astronomical; YouTube alone receives 300 hours of media content every minute. But recently I was reading a journal that expects that by 2025 Genomics will need 20 times more storage than Twitter or YouTube, which I guess is more than astronomical. So today we are going to speak with Rajesh, Co-founder and CEO of MediaAgility, about: Is Genomical replacing Astronomical?
So Rajesh, welcome to the Land of Digital Opportunities podcast. Why don’t you tell us a bit about what Genomics is?
Rajesh Abhyankar: Sure, Goldy! A very interesting topic you have picked for this episode, and very timely as well. Right from childhood we know that the DNA molecule is a twisted pair of strands, the double helix as we see it in many places, with the A, T, G, C chemical units that make up that long helix molecule. Without going into all the chemical details, an organism’s complete set of DNA is called its ‘genome’, so it’s a complete map of the close to 3 billion DNA base pairs, or letters, that make up the human genome. Genomics is the area within the broader field of genetics that is concerned with how we sequence this entire genome and with the ability to analyze it, and we look at it as ideally 4 different stages. We first need to sequence it, and then there is a primary, secondary, and tertiary level of processing before we start doing some of the what-if analysis and looking at the clinical impact of that understanding. By definition, if you are trying to sequence 3 billion base pairs, you can imagine how big these data sets can grow and why Genomics has now become an IT problem, and that’s why we are in it.
Goldy Arora: You are listening to the Land of Digital Opportunities podcast by MediaAgility, a global digital consulting company, and now back to the conversation with Rajesh about: Is Genomical replacing Astronomical?
So Rajesh, whenever we talk about Genomics we usually talk about processing and analyzing huge data sets. Talk to us about the data sizes in Genomics: is it really replacing the word Astronomical to define something which is really big?
Rajesh Abhyankar: Given the rate at which the need for data storage and analysis in the Genomics domain is increasing, it certainly looks like, in the next 10 years, ‘Genomical’ might be the right name for describing something big.
The journal that you mentioned did some analysis and projected what the data sizes might be; I think they compared astronomy, Twitter, YouTube and Genomics. It was a pretty interesting study, and the projection there is that in the next 10 years, by around 2025, we would have a billion whole genome sequences produced. If each sequence is approximated at about 100 gigabytes, you can calculate what a billion whole genome sequences might look like: it runs into exabytes. A more aggressive estimate puts it at maybe 2 billion. If the 3rd generation technologies that are coming up, like nanopore sequencing and so on, really become affordable, it’s quite possible that the estimate of a billion whole genome sequences is actually conservative and it may be double that. So you are looking at anywhere between 2 and 40 exabytes of data being produced per year, which is way higher than anything else we have done in history. ‘Genomical’: I will not be surprised if that is the word that catches on soon.
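As a back-of-the-envelope check on the figures quoted above, the sketch below multiplies the number of genomes by the assumed size per genome. The 100 gigabytes per genome and the one or two billion genomes come from the conversation; the decimal units (1 GB = 10^9 bytes, 1 EB = 10^18 bytes) and the function name are assumptions made only for illustration and are not taken from the study.

```python
# Decimal storage units (an assumption; the conversation does not say which convention is used).
GB = 10**9
EB = 10**18


def raw_storage_exabytes(num_genomes, gb_per_genome=100):
    """Raw storage, in exabytes, for num_genomes whole genome sequences.

    gb_per_genome defaults to the ~100 GB approximation quoted in the conversation.
    """
    return num_genomes * gb_per_genome * GB / EB


# The two scenarios mentioned: one billion genomes, and the more aggressive two billion.
one_billion = raw_storage_exabytes(1_000_000_000)
two_billion = raw_storage_exabytes(2_000_000_000)

print(f"1 billion genomes: {one_billion:.0f} EB of raw sequence data")
print(f"2 billion genomes: {two_billion:.0f} EB of raw sequence data")
```

These totals are cumulative raw output at the quoted per-genome size, so they are not directly comparable with the per-year range mentioned above; they only show why the numbers land in exabyte territory.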
Goldy Arora: We are a digital consulting company helping organizations understand the digital revolution and respond to it, and we do it with an integrated approach of strategy, design thinking, technology expertise and agile implementation. With that, let’s go back to the conversation with Rajesh about: Is Genomical replacing Astronomical?
Thank you Rajesh, that was a good overview of what Genomics is and how it’s impacting our lives. Now let’s change gears a bit. A few years ago genome analysis was costing us a lot of time and money, and now I guess it is possible within a day and maybe for under a thousand dollars. It seems we are really witnessing the Genomics revolution, as you said, so what are the technological advancements, or some of the plans, that are causing this revolution?
Rajesh Abhyankar: The first time they sequenced the human genome it cost, I think, 3 billion dollars and took 15 years, and there was a point in those 15 years where it was thought to be almost impossible. But human curiosity and our inherent ability to explore kept us going. We are a species of explorers, and we have always thrived on venturing into new areas, understanding outer space and understanding what’s within, right down to the single molecule and, within that, what makes up our genome. In those early days of human genome sequencing, it could not have been predicted that something that took 15 years could be done within a day. One of the very interesting engagements we are doing, with a very large pharma company in the US, is to help them bring some of these cloud technologies into their efforts to build next generation nanopore sequencing technology that aims to bring the cost to under $100 and do it within hours rather than a whole day.
So we are looking at an exponential trajectory in the speed with which you can not only do the sequencing but also get it to the point where it is affordable. The challenge then is that if you can start producing whole human genomes at that rate, at 100s of gigabytes each and soon petabytes and exabytes in total, how do you store all that data, how do you process it, how do you analyze it and how do you share it? The whole scientific community thrives on sharing each other’s work and verifying each other’s work, and that’s how scientific progress happens. But just take the cancer genome data, which is in petabytes: even when the National Cancer Institute wanted to open it up, where does a group even download that kind of data, where do you store it, and how do you even prepare it for processing and analysis? The vast amount of compute required on premise is just ridiculous, and this is only going to get worse as more and more data gets generated. So the reason we think this is the right time for cloud, and for some of the advancements in computing, to be applied to the field of Genomics is that the ecosystem seems to be ready all the way from sequencing technology, which requires research in chemistry, in semiconductors, in cloud computing and in software innovation, through bringing that cost down below the $100 level, to being able to process that kind of data at scale.
There are also these anchor organizations like the Broad Institute or the Institute for Systems Biology, and then there are many government funded or semi-government funded agencies, even research hospitals. There are so many organizations in this ecosystem that are trying to work together and share data, obviously for qualified research. A casual browser may not have access to all the data, but if you are doing qualified research there has never been a better time: you have access to petabytes of data from TCGA, which is the cancer genomic data. Another data set which is just getting started is from Autism Speaks on Google Cloud, where the aim is to get 10,000 people around the autism condition, and even today there is a subset of that available. There are so many more public data sets available. Because all of these things are coming together in the ecosystem, cloud computing vendors like Google, Amazon and everyone else are making an extra effort to build something specific and adapt their cloud compute infrastructure to make it more suitable for solving this Genomics problem.
Goldy Arora: You are listening to the Land of Digital Opportunities podcast by MediaAgility, a global digital consulting company, and now back to the conversation with Rajesh about: Is Genomical replacing Astronomical?
I completely agree with you Rajesh, it’s great to see how technology is solving massive computational challenges and helping us answer complex biological questions. So as a researcher, bioinformatician or technology professional, what are some of the things I can do today and, more importantly, where and how should we start?
Rajesh Abhyankar: That’s a great question. Since we are a Google Cloud Partner, I would like to start there and give our listeners a good idea of why we partnered with Google and why we are doubling down on creating custom solutions that are specific to improving the productivity of scientists in the lab, which we call LabAgility. We are also working specifically on how to bridge that last mile gap where, rather than bringing the data to your applications, it’s now more appropriate to bring your applications to the data, because the data is so huge and is sitting in the cloud. But in terms of where you start, let’s talk a bit about why we think Google is well placed to handle a problem like this. You mentioned earlier that 300 hours of video get uploaded every minute to YouTube. The entire Google search index, as I understand it, is about 100 petabytes, and when you search that huge index you get results in milliseconds. So this is a company that understands taking in 300 hours’ worth of video per minute, processing it for the obvious checks, and then making it available and distributing it over a global network. When you try to see what these numbers mean in terms of whole genome sequences, the YouTube example is very close to six whole genome sequences per minute; that is the scale Google infrastructure is able to handle today. And the search index, I would say, is close to a million whole genome sequences that can be searched with an answer given in milliseconds. So a company that can already do compute and storage at that scale is, I think, well suited to handle the huge challenges that lie ahead of us and of the community that is trying to make the most of Genomics. To me the exciting part is what’s called the tip of the spear, which is that real patients will start seeing the impact of all this in their real lives. That matters to us, and that’s why we got interested in it.
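The two comparisons above can be reproduced with simple division. The 100 petabyte index size, the 300 hours of video per minute and the six genomes per minute are the figures from the conversation; the 100 GB per whole genome sequence is carried over from the earlier estimate, and the decimal units and function names are assumptions for illustration only.

```python
# Decimal units (an assumption; the conversation does not specify a convention).
GB = 10**9
PB = 10**15

# ~100 GB per whole genome sequence, as approximated earlier in the conversation.
GENOME_BYTES = 100 * GB


def genomes_equivalent(num_bytes):
    """How many whole genome sequences fit in num_bytes at ~100 GB each."""
    return num_bytes / GENOME_BYTES


def implied_gb_per_video_hour(genomes_per_minute=6, video_hours_per_minute=300):
    """Video size per hour implied by equating the upload rate with genomes per minute."""
    return genomes_per_minute * GENOME_BYTES / GB / video_hours_per_minute


# A 100 PB search index expressed in whole genome sequences.
index_genomes = genomes_equivalent(100 * PB)
print(f"100 PB index is roughly {index_genomes:,.0f} whole genome sequences")

# What "six genomes per minute" implies about the size of an hour of uploaded video.
print(f"Implied video size: {implied_gb_per_video_hour():.1f} GB per hour")
```

The first result matches the "close to a million" figure quoted. The second is only an inference from the quoted numbers (about 2 GB per hour of video), not a figure stated in the conversation.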
Coming from a compute background, we thought, why don’t we work with some of the experts in the industry? We started working with a startup called YouGenomics, and we recently published a case study in partnership with Google on what we did for them. Working with customers like those, or the other one that I mentioned, what we clearly see is a place to start: if you run a bunch of servers in your own data centres or in your own lab, see what the cost benefits are of moving all that infrastructure to the cloud, just the basic Compute Engine VMs and storage. It all really starts from there; the clusters you run on your own premises can be run on the cloud. So we have helped some of our customers with an assessment of their basic infrastructure and of how their pipelines and workflows can be optimised, done some custom workflow development for them, and built some custom applications that help them with their analysis and with exploring the data.
Google has been working with the Broad Institute, so if you use GATK you can now run it at scale on Google Cloud. Google is part of GA4GH, the Global Alliance for Genomics and Health, which is a standard way of ensuring interoperability, because you can imagine that when you have a very heterogeneous ecosystem you need some open standards so that genome data can be shared and exchanged fairly easily. There is an implementation of the GA4GH API called the Google Genomics API, and that’s available. Depending on what stage you are in, one way to look at it is this: there is the problem of storage, and you can start there if you want to start storing in the cloud; then there is the actual processing of your pipelines, where some things are available ready-made and, if you have a custom pipeline, you can build it on the cloud; then there is exploring these data sets with technology like BigQuery; and lastly there is sharing them in a secure way. So depending on where you currently are with your genomic problem, you can start with something as simple as running your own virtual machine and your own open source stack on Google Cloud, and as you get more advanced in your adoption you can start using some of the genome-specific features of Google Cloud.
Goldy Arora: Rajesh, I guess our listeners will find this advice valuable: as you suggested, start small, maybe by making their labs more organized by moving to cloud computing, and also take advantage of the public data sets which are already available, matching them up with their own data for faster research.
I hope today’s podcast was helpful for listeners, not just in understanding the Genomics revolution but also in seeing how it’s helping mankind. With that, Rajesh, I appreciate your time, and thank you so much for being on the Land of Digital Opportunities podcast.
Rajesh Abhyankar: Thank you. This is a very exciting field, and we are super excited to keep working with some of the initial customers that we have, to learn more about this domain and, as the compute infrastructure and technology improve, to work more on creating our own creative solutions. Our goal is pretty clear here. Platform providers like Google, Amazon and IBM are going to come up with advancements in their core technologies and platforms, and there is always a gap between all of that platform technology and making it work for solving real world problems. That’s where our expertise lies: providing the implementation of all of that, working with our customers. These are really early days, and we can only imagine the impact this can have on real people with autism or with cancer, and the promise this technology holds for transforming the way we think about healthcare. I am really excited about what the next decade brings. It’s going to be a roller coaster ride on a fast, exponentially growing trajectory, with exabytes of data being produced and the need to process, analyze, explore and make sense of it, all the way from the sequencing technology to the clinical applications. We are committed to this discipline, and this is just a beginning for us as well. I hope at some point we can say our Genomics practice is truly Genomical, in terms of the people and the kind of impact that we are having.