Oral Presentations

Exploring Hadoop-Based Data Lakes for Research

11:42 AM–12:00 PM Mar 15, 2018 (US - Pacific)



Abstract: Research data requirements often fall outside conventional analytic patterns. Researchers require broader and deeper data, in source format rather than in predefined analytic schemas; they have advanced analysis and programming skills; they prefer a “self-service” model; and their requirements evolve over time. Apache Hadoop®-based data lakes that implement lift-and-shift, late-binding, self-service architectures are a natural fit for research analytics. This presentation discusses our experience exploring and implementing data lake architectures for research at NYU Langone.

Learning Objective 1: Describe the unique nature of data requirements for research analytics and the challenges in satisfying such requirements.

Learning Objective 2 (Optional): Compare the different architectural alternatives for provisioning data for research analytics.

Learning Objective 3 (Optional): Formulate an approach to creating an enterprise data lake in an academic medical center and other healthcare settings.

Learning Objective 4 (Optional): Describe the features of the Apache Hadoop platform and explain how they support research analytics.


Rajan Chandras (Presenter)
NYU Langone Health

Michael Cantor, NYU Langone Health
Jeff Shein, NYU Langone Health
