Adjunct & Associate #93
UBC Data Science Institute Vancouver, B.C., Canada
Postdoctoral Fellow Nov 2017 - Feb 2020
- DSI/CS/Botany multidisciplinary research. Point person in Rieseberg Lab (Botany) for large-scale genetic data analysis. Collaborator on student projects.
- Developed a reproducible data processing workflow on AWS which compresses core-years of compute into hours of wall-clock time, via parallel use of Docker containers and serverless computing. (Repo: https://github.com/rieseberglab/bunnies)
- Computed (on AWS) the largest genomic dataset in the world for sunflower. It captures the genetic diversity of thousands of individuals and is being studied to help counter the effects of climate change.
University of British Columbia Vancouver, B.C., Canada
Ph.D. Computer Science in the Systems group Sept 2010 - Nov 2017
- Published in Systems, Security, Data Privacy, Cryptography and Anonymity conferences.
- Trained and supervised approximately 14 undergraduate and M.Sc. computer science students.
Bunnies (with Rieseberg Lab): Reproducible Pipelines: https://github.com/rieseberglab/bunnies
(Link description: Bunnies is a Python API to write scalable and reproducible scientific workflows/pipelines. It shares many ideas with other data-driven pipeline frameworks such as Snakemake, Nextflow, and Luigi, but strives to achieve a far higher level of reproducibility. It is in early stages of development, but has so far been used successfully to run bioinformatics pipelines on AWS.)
Reproducibility of scientific experiments is required to obtain results with high fidelity to the environment that produced them, and to permit transparency in research. More specifically, in large bioinformatics pipelines, result datasets can be deterministically produced by a series of computing steps. Converting bioinformatics pipelines into scripts is often equated with reproducibility, but it is only one step towards a complete solution. For instance, in addition to command sequences, the input parameters and input datasets should also be tracked. Furthermore, data science is an iterative process, and pipelines can take multiple core-years to complete. It therefore becomes important to also track how parameters and inputs change over the long span of research projects (months or years).
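As a rough illustration of the tracking described above, the sketch below derives one fingerprint for a pipeline step from its command sequence, parameters, input dataset digests, and container image. It is not the Bunnies API: the function names, the record layout, and the example values are all assumptions made for this sketch.

```python
import hashlib
import json


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def step_fingerprint(command, params, input_digests, image):
    """Hash everything that determines a step's output (hypothetical layout).

    command:       list of command-line tokens for the step
    params:        dict of parameter names to JSON-serializable values
    input_digests: dict of logical input names to content digests
    image:         identifier of the container image the step runs in
    """
    record = {
        "command": list(command),
        "params": params,
        "inputs": input_digests,
        "image": image,
    }
    # Canonical JSON: sorted keys and fixed separators, so that dictionary
    # ordering does not change the fingerprint.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical step: tool name, parameters, and image tag are placeholders.
fingerprint = step_fingerprint(
    command=["align"],
    params={"threads": 8, "seed": 1},
    input_digests={"reads": "0" * 64},
    image="example/aligner:1.0",
)
```

Storing such a fingerprint next to each result dataset lets a later run detect whether any parameter, input, or environment has changed since the dataset was produced, which is one way to follow how they evolve over a long project.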
My research interests span computer science, data science, and bioinformatics: systems, cloud, data provenance, and scientific reproducibility of large bioinformatics experiments. I apply cloud technologies both to solve scaling problems and to build tools that tightly bind datasets to the code, parameters, and environment that generated them. I am interested in solutions that reduce the overall time needed to generate datasets, and that generate enough metadata for those datasets to be reused with high confidence by other researchers and in other experiments.
- Bioinformatics frameworks for data-driven pipelines, in particular as applied to short-variant calling.
- Computer Systems
- Cloud Storage
- Data Provenance
- Todesco, Owens, Bercovich, Legare, et al., "Massive haplotypes underlie ecotypic differentiation in sunflowers", Nature 584, 602–607, https://doi.org/10.1038/s41586-020-2467-6, July 2020.
- Several grants awarded during PostDoc: AWS Open Dataset Grant for "UBC Sunflower Genome" (Jan 2020), Compute-Canada Research Platforms and Portals (RPP) competition for "DivSeek Canada 2020" (March 2020), AWS Cloud Credits for Research Award 50K (Oct 2018).
- Jean-Sebastien Legare, "Enhancing user privacy in web services", PhD thesis, UBC Faculty of Graduate and Postdoctoral Studies, Vancouver, November 2017.
- Jean-Sebastien Legare, Robert Sumi, William Aiello, "Beeswax: a platform for private web apps", Privacy Enhancing Technology Symposium (PETS), Darmstadt, July 2016.
- Legare, Meyer, Spear, Totolici, Bainbridge, MacRow, Sumi, Jung, Tjandra, Williams-King, Aiello, and Warfield, "Tolerating Business Failures in Hosted Applications", ACM Symposium on Cloud Computing (SoCC), Santa Clara, Oct. 2013.