Case Study
ETL Software App for a Leading North-American Medical Research Institution
The results
1. Advanced ETL software application able to process over a billion EMRs.
2. Web-based client application able to manage processed datasets.
The challenge.
A leading North-American medical research institution was looking for a solution that would allow them to automate their data analysis and processing.
This institution is engaged in medical outcome research, and the ability to comprehensively analyze big datasets to identify intricate correlations between different data parameters is crucial to all their research activities. In particular, they needed to be able to process over a billion electronic health records in order to establish possible correlations between different illnesses, find out whether a medicine is prone to produce side effects, and perform a wide variety of other advanced tasks associated with bulk processing of medical data.
Thus, Envion Software needed to develop an advanced ETL software application that would enable the client to automatically collect differently formatted data from multiple sources and process it based on a unified data model. In addition, we were asked to create a Web-based client application to help manage processed datasets.
This application was to provide the following functional capabilities:
- Providing information on the variety of data contained in a data set, for example, the age distribution of people with a particular medical condition
- Viewing specific data in the unified data schema
- Displaying information related to a specific data parameter using a set of predefined query options
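As an illustration of the first capability in the list above, the following minimal Python sketch computes an age distribution for one medical condition. The case study does not describe the actual implementation; the record layout (`patient_id`, `age`, `condition`), the function name `age_distribution`, and the sample data are all assumptions made for this example.

```python
from collections import Counter


def age_distribution(records, condition, bucket_size=10):
    """Count records with the given condition per age bucket (e.g. 30-39)."""
    counts = Counter()
    for record in records:
        if record["condition"] == condition:
            # Lower bound of the bucket this age falls into.
            low = (record["age"] // bucket_size) * bucket_size
            counts[low] += 1
    # Sort by the numeric lower bound, then render readable labels.
    return {f"{low}-{low + bucket_size - 1}": n for low, n in sorted(counts.items())}


# Hypothetical records already loaded into a unified schema.
sample_records = [
    {"patient_id": "p1", "age": 34, "condition": "asthma"},
    {"patient_id": "p2", "age": 38, "condition": "asthma"},
    {"patient_id": "p3", "age": 61, "condition": "asthma"},
    {"patient_id": "p4", "age": 45, "condition": "diabetes"},
]

print(age_distribution(sample_records, "asthma"))
```

At the data volumes described here, an aggregation like this would more plausibly run inside the data warehouse than in application code; the sketch only shows the shape of the query.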
The Approach and Solution.
Although the main challenges posed by the project were associated with its technological implementation, the need to handle all project-related issues through direct communication with one of the client’s senior business stakeholders required additional effort on the part of the project team. The technology-related challenges included the need to create a separate ETL mapping for each of the multiple data sources, as well as functionality for regularly adding new data sources and incorporating newly arrived data set updates.
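The idea of a separate ETL mapping per data source can be sketched as follows. This is a hypothetical illustration, not the project's actual code: the source names (`hospital_a`, `clinic_b`), the source and unified field names, and the date formats are all assumed. Each mapping pairs a unified field with the source field it comes from and a converter, so adding a new data source means adding one more mapping entry.

```python
from datetime import datetime

# Hypothetical mappings: unified field -> (source field, converter function).
MAPPINGS = {
    "hospital_a": {
        "patient_id": ("PatientID", str),
        "birth_date": ("DOB", lambda s: datetime.strptime(s, "%m/%d/%Y").date()),
        "diagnosis_code": ("Dx", str.upper),
    },
    "clinic_b": {
        "patient_id": ("pid", str),
        "birth_date": ("birth", lambda s: datetime.strptime(s, "%Y-%m-%d").date()),
        "diagnosis_code": ("icd", str.upper),
    },
}


def to_unified(source, raw):
    """Transform one raw record from `source` into the unified schema."""
    mapping = MAPPINGS[source]
    return {
        field: convert(raw[source_field])
        for field, (source_field, convert) in mapping.items()
    }


print(to_unified("hospital_a", {"PatientID": 17, "DOB": "03/05/1980", "Dx": "j45"}))
print(to_unified("clinic_b", {"pid": "b-9", "birth": "1975-11-30", "icd": "e11"}))
```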
In addition, due to the often poor quality of the source data, the vast amounts of output data, and the highly specific nature of the latter, it was not possible to apply a regular QA process. Thus, to ensure the quality of the delivered work, the project team had to employ two different ETL methods (MapReduce and classic SQL), and then collate the results produced by those methods.
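Collating the output of two independent ETL implementations can be sketched as a simple comparison of their aggregated results. The example below is a minimal illustration under assumed inputs: both pipelines are imagined to produce record counts per diagnosis code, and the function name `collate` and the sample figures are hypothetical.

```python
def collate(result_a, result_b):
    """Return every key whose value differs between two pipeline outputs.

    A key missing from one side is reported with None for that side.
    """
    mismatches = {}
    for key in sorted(result_a.keys() | result_b.keys()):
        a, b = result_a.get(key), result_b.get(key)
        if a != b:
            mismatches[key] = (a, b)
    return mismatches


# Hypothetical per-diagnosis record counts from each method.
mapreduce_counts = {"J45": 120, "E11": 98, "I10": 310}
sql_counts = {"J45": 120, "E11": 97, "I10": 310, "K21": 4}

print(collate(mapreduce_counts, sql_counts))
```

An empty result means the two methods agree on every key; any entry points to a record group worth inspecting in both pipelines.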
The main difficulty in implementing the second stage of the project was the huge volume of the data sets to be processed (a billion records and more), which made it hard to achieve the required processing speed.
The project team overcame this issue by applying Amazon EC2 cloud technologies, and delivered a Web-based solution fully compliant with the project requirements 3 months after the start of this stage of the project.
Business results.
Implementing the project has allowed the research institution to dramatically improve the quality of their research by adding whole new dimensions to it.
They are now able to analyze medical data at a much greater depth, and in much larger volumes.
Following the project’s delivery, the client decided to continue their cooperation with Envion Software, and asked us to develop a web portal that would allow their researchers to process large volumes of data in the cloud.
Technology Stack:
For unified data model and data-processing functionality:
- Amazon Web Services
- Apache Hadoop
- Python
- Amazon Redshift
- Oracle
- MS SQL Server
For Web-based client application:
- PHP
- MySQL, Oracle
- JS, jQuery, HTML5
About Envion Software.
Envion Software is a U.S.-based custom software vendor with several R&D centers around the world.
Since 1984, we have been a ‘one-stop shop’ for hundreds of companies and start-ups, meeting their R&D needs in application development, natural language processing, artificial intelligence and machine learning, eLearning, data engineering, and big data analytics. We deliver custom web, mobile, and desktop applications and handle the hardest design and UX/UI tasks.