How to Develop a Disease Classification Algorithm Using Electronic Medical Records

Speaker:  Uri Kartoun – Cambridge, MA, United States
Topic(s):  Information Systems, Search, Information Retrieval, Database Systems, Data Mining, Data Science


A patient may have a disease that is not clearly documented in the patient’s medical profile. Conversely, a mention of a disease does not establish that the patient has it. Often, a disease is mentioned because it was ruled out as a diagnosis, or because it appears in the patient’s family history. Even a frequently mentioned disease may carry no information if the mention is part of a template generated by the care system’s electronic health record system. In addition, a disease may be mentioned only in the context of a biopsy, in which case the patient may very likely not have the disease.
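The pitfalls above can be made concrete with a small sketch. It is not the method presented in the lecture; it is a minimal keyword-based illustration, assuming hypothetical cue lists (`NEGATION_CUES`, `FAMILY_CUES`, `PROCEDURE_CUES`) and a simple heuristic that treats any sentence repeated across a large fraction of notes as template text.

```python
import re
from collections import Counter

# Hypothetical cue lists, for illustration only; real clinical lexicons are far larger.
NEGATION_CUES = ("no evidence of", "ruled out", "rule out", "negative for", "denies")
FAMILY_CUES = ("family history", "mother", "father", "sibling", "brother", "sister")
PROCEDURE_CUES = ("biopsy",)


def split_sentences(note):
    """Naive sentence splitter: breaks on periods and newlines."""
    return [s.strip() for s in re.split(r"[.\n]+", note) if s.strip()]


def find_template_sentences(notes, min_fraction=0.5):
    """Return lowercased sentences that occur in at least min_fraction of the notes."""
    counts = Counter()
    for note in notes:
        # Use a set so a sentence is counted at most once per note.
        counts.update({s.lower() for s in split_sentences(note)})
    threshold = min_fraction * len(notes)
    return {s for s, n in counts.items() if n >= threshold}


def classify_mention(sentence, disease):
    """Label one sentence's mention of `disease` by the context it appears in."""
    text = sentence.lower()
    if disease.lower() not in text:
        return "absent"
    if any(cue in text for cue in NEGATION_CUES):
        return "negated"
    if any(cue in text for cue in FAMILY_CUES):
        return "family_history"
    if any(cue in text for cue in PROCEDURE_CUES):
        return "procedure_context"
    return "affirmed"


def label_mentions(note, disease, templates):
    """Label every sentence in `note` that mentions `disease`."""
    labels = []
    for sentence in split_sentences(note):
        if disease.lower() not in sentence.lower():
            continue
        if sentence.lower() in templates:
            labels.append("template")
        else:
            labels.append(classify_mention(sentence, disease))
    return labels
```

Only the mentions labeled `affirmed` would count as evidence for the disease in this sketch; a raw count of mentions would treat all of these contexts alike.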
Incorporating International Classification of Diseases (ICD) codes into text-processing methods to classify a disease could be severely problematic as well. For a variety of conditions, such as psychiatric and sleep disorders, such codes do not indicate whether the condition is present or absent. Furthermore, a code appearing on the patient’s problem list has a very different meaning from the same code appearing on that patient’s encounter diagnosis record.
As a research fellow at Massachusetts General Hospital (2013–2016), I had the opportunity to develop disease classification algorithms by applying machine-learning techniques using longitudinal medical databases combining structured and unstructured data elements. In my lecture, I will walk through the steps we took to develop our physician-documented insomnia classification algorithm, which led to the creation of one of the largest cohorts of patients with sleep disorders [1].
Furthermore, I will present in my lecture a new text-processing method we developed to extract concepts from clinical narrative notes. The method, text nailing (TN) [2], is highly accurate, and it outperforms traditional machine-learning algorithms in multiple scenarios, such as extracting family histories of coronary artery disease [3], classifying patients with sleep disorders [4, 1], and improving the accuracy of the Framingham risk score for patients with nonalcoholic fatty liver disease [5].
1. Kartoun, U., Aggarwal, R., Beam, A., Pai, J., Chatterjee, A., Fitzgerald, T., Kohane, I., and Shaw, S. Development of an algorithm to identify patients with physician-documented insomnia. Scientific Reports 8, 7862 (May 2018), 1–9. 
2. Kartoun, U. Text nailing: an efficient human-in-the-loop text-processing method. ACM Interactions 24, 6 (November 2017), 44–49.
3. Corey, K., Kartoun, U., Zheng, H., Chung, R., and Shaw, S. Using an electronic medical records database to identify nontraditional cardiovascular risk factors in nonalcoholic fatty liver disease. The American Journal of Gastroenterology 111, 5 (2016), 671–676.
4. Beam, A., Kartoun, U., Pai, J., Chatterjee, A., Fitzgerald, T., Shaw, S., and Kohane, I. Predictive modeling of physician-patient dynamics that influence sleep medication prescriptions and clinical decision-making. Scientific Reports 7, 42282 (Feb. 2017), 1–7.
5. Simon, T., Kartoun, U., Zheng, H., Chan, A., Chung, R., Shaw, S., and Corey, K. MELD-Na score predicts incident major cardiovascular events in patients with nonalcoholic fatty liver disease. Hepatology Communications 1, 5 (2017), 429–438.

About this Lecture

Number of Slides:  30
Duration:  45 minutes
Languages Available:  English
Last Updated: 
