Sheng Yu

Associate Professor

Research Areas: text-based artificial intelligence in medicine, including large language models, information retrieval, question answering, and knowledge graph construction.

Office: Room 209-B, Weiqing Building, Tsinghua University

Phone: +86-10-62783842

Email: syu@tsinghua.edu.cn


Background
  • Ph.D., George Washington University, Systems Engineering
  • Postdoctoral Research Fellow, Harvard University, 2012-2015
  • Research Fellow, Brigham and Women's Hospital, 2012-2015
  • Assistant Professor, Center for Statistical Science, Tsinghua University, 2015-2018
  • Associate Professor, Center for Statistical Science, Tsinghua University, 2018-
  • RONG Professor, Institute for Data Science, Tsinghua University, 2018-2021

Teaching
  • Introduction to Data Science
  • Introduction to Statistical Learning

Projects

  • Biomedical Search & QA - We are developing a dataset for biomedical information retrieval and automatic question answering. So far, we have collected over 1 million biomedical search results and commonly asked questions with their answers, using Google search as a surrogate (queried and parsed using SerpApi).

Publications
  1. Zhengyun Zhao, Yichen Tian, Zheng Yuan, Peng Zhao, Feng Xia*, and Sheng Yu*. A machine learning method for improving liver cancer staging. Journal of Biomedical Informatics (2023): 137:104266.
  2. Qiao Jin, Zheng Yuan, Guangzhi Xiong, Qianlan Yu, Huaiyuan Ying, Chuanqi Tan, Mosha Chen, Songfang Huang, Xiaozhong Liu, and Sheng Yu*. Biomedical Question Answering: A Survey of Approaches and Challenges. ACM Computing Surveys (2023), 55(2):1-36. DOI:10.1145/3490238.
  3. Huaiyuan Ying, Shengxuan Luo, Tiantian Dang, and Sheng Yu*. (2022). Label Refinement via Contrastive Learning for Distantly-Supervised Named Entity Recognition. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2656–2666, Seattle, United States. Association for Computational Linguistics.
  4. Hongyi Yuan, Zheng Yuan, and Sheng Yu*. (2022). Generative Biomedical Entity Linking via Knowledge Base-Guided Pre-training and Synonyms-Aware Fine-tuning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4038–4048, Seattle, United States. Association for Computational Linguistics.
  5. Sihang Zeng, Zheng Yuan, and Sheng Yu* (2022). Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations. In Proceedings of the 21st Workshop on Biomedical Language Processing, Dublin, Ireland. Association for Computational Linguistics: 91–96.
  6. Hongyi Yuan, Zheng Yuan, Ruyi Gan, Jiaxing Zhang, Yutao Xie, and Sheng Yu* (2022). BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model. In Proceedings of the 21st Workshop on Biomedical Language Processing, Dublin, Ireland. Association for Computational Linguistics: 97–109.
  7. Shengxuan Luo and Sheng Yu* (2022). An Accurate Unsupervised Method for Joint Entity Alignment and Dangling Entity Detection. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland. Association for Computational Linguistics: 2330–2339.
  8. Zheng Yuan, Zhengyun Zhao, Haixia Sun, Jiao Li, Fei Wang, and Sheng Yu*. CODER: Knowledge-infused cross-lingual medical term embedding for term normalization. Journal of Biomedical Informatics (2022): 103983.
  9. Jiaqi Guan, Runzhe Li, Sheng Yu, and Xuegong Zhang*. A Method for Generating Synthetic Electronic Medical Record Text. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2021),18(1):173-182. DOI: 10.1109/TCBB.2019.2948985
  10. Yuanren Tong#, Keming Lu#, Yingyun Yang, Ji Li, Yucong Lin, Dong Wu, Aiming Yang, Yue Li*, Sheng Yu*, Jiaming Qian. Can natural language processing help differentiate inflammatory intestinal diseases in China? Models applying random forest and convolutional neural network approaches. BMC Medical Informatics and Decision Making (2020). #contributed equally, *contributed equally.
  11. Zheng Yuan, Yuanhao Liu, Qiuyang Yin, Boyao Li, Xiaobin Feng, Guoming Zhang, and Sheng Yu*. Unsupervised Multi-granular Chinese Word Segmentation and Term Discovery via Graph Partition. Journal of Biomedical Informatics (2020), 110:103542.
  12. Yucong Lin, Yang Li, Keming Lu, Cheng Ma, Peng Zhao, Daiqi Gao, Zihao Fan, Zijie Cheng, Zheyu Wang, and Sheng Yu*. Long-distance disorder-disorder relation extraction with bootstrapped noisy data. Journal of Biomedical Informatics (2020), 109:103529.
  13. Lishan Yu and Sheng Yu*. Developing an automated mechanism to identify medical articles from Wikipedia for knowledge extraction. International Journal of Medical Informatics (2020), 141:104234.
  14. Jian Zhang, Anil Can, Pui Man Rosalind Lai, Srinivasan Mukundan Jr., Victor M. Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian S. Gainer, Nancy A. Shadick, Guergana Savova, Shawn N. Murphy, Tianxi Cai, Scott T. Weiss, and Rose Du*. Age and Morphology of Posterior Communicating Artery Aneurysms. Scientific Reports (2020), 10:11545.
  15. Aaron Sonabend#, Winston Cai#, Yuri Ahuja, Ashwin Ananthakrishnan, Zongqi Xia, Sheng Yu*, Chuan Hong*. Automated ICD Coding via Unsupervised Knowledge Integration (UNITE). International Journal of Medical Informatics (2020), 139:104135. #contributed equally, *contributed equally.
  16. Yichi Zhang*, Tianrun Cai*, Sheng Yu*, Kelly Cho, Chuan Hong, Jiehuan Sun, Jie Huang, Yuk-Lam Ho, Ashwin Ananthakrishnan, Zongqi Xia, Stanley Shaw, Vivian Gainer, Victor Castro, Nicholas Link, Jacqueline Honerlaw, Selena Huang, David Gagnon, Elizabeth Karlson, Robert Plenge, Peter Szolovits, Guergana Savova, Susanne Churchill, Christopher O'Donnell, Shawn Murphy, J Michael Gaziano, Isaac Kohane, Tianxi Cai*, and Katherine Liao*. Methods for High-throughput Phenotyping with Electronic Medical Record Data Using a Common Semi-supervised Approach (PheCAP). Nature Protocols (2019). *contributed equally.
  17. Katherine P. Liao#, Jiehuan Sun#, Tianrun A. Cai, Nicholas Link, Chuan Hong, Jie Huang, Jennifer E. Huffman, Jessica Gronsbell, Yichi Zhang, Yuk-Lam Ho, Victor Castro, Vivian Gainer, Shawn N. Murphy, Christopher J. O’Donnell, J. Michael Gaziano, Kelly Cho, Peter Szolovits, Isaac S. Kohane, Sheng Yu*, Tianxi Cai*. High-throughput Multimodal Automated Phenotyping (MAP) with Application to PheWAS. Journal of the American Medical Informatics Association (2019). #contributed equally, *contributed equally.
  18. Yucong Lin, Cheng Ma, Daiqi Gao, Zihao Fan, Zijie Cheng, Zheyu Wang, Sheng Yu*. Long distance entity relation extraction with article structure embedding and applied to mining medical knowledge. IEEE ICHI (2019).
  19. Anil Can, Pui Man Rosalind Lai, Victor Castro, Sheng Yu, Dmitriy Dligach, Sean Finan, Vivian Gainer, Nancy Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott Weiss, and Rose Du*. Decreased Total Iron Binding Capacity May Correlate with Ruptured Intracranial Aneurysms. Scientific Reports (2019).
  20. Wenxin Ning, Stephanie Chan, Andrew Beam, Ming Yu, Alon Geva, Katherine P Liao, Mary Mullen, Kenneth D Mandl, Isaac S Kohane, Tianxi Cai, Sheng Yu*. Feature Extraction for Phenotyping from Semantic and Knowledge Resources. Journal of Biomedical Informatics (2019), 91:103122.
  21. Jiaqi Guan, Runzhe Li, Sheng Yu, Xuegong Zhang*. Generation of Synthetic Electronic Medical Record Text. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 374-380. IEEE (2018).
  22. Jessica Gronsbell, Jessica Minnier, Sheng Yu, Katherine Liao, Tianxi Cai*. Automated Feature Selection of Predictors in Electronic Medical Records Data. Biometrics (2018).
  23. Anil Can, Victor Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian Gainer, Nancy Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott Weiss, and Rose Du*. Elevated International Normalized Ratio is Associated with Ruptured Aneurysms. Stroke (2018).
  24. Anil Can, Robert F. Rudy, Victor M. Castro, Sheng Yu, Dmitriy Dligach, Sean Finan, Vivian Gainer, Nancy A. Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott T. Weiss, Rose Du*. Association between Aspirin Dose and Subarachnoid Hemorrhage from Saccular Aneurysms: A Case-Control Study; Neurology (2018).
  25. Anil Can, Victor M. Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian Gainer, Nancy A. Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott T. Weiss, Rose Du*. Low Serum Calcium and Magnesium Levels and Rupture of Intracranial Aneurysms; Stroke (2018); Impact factor 6.032.
  26. Jennifer A. Sinnott*, Fiona Cai, Sheng Yu, Boris P. Hejblum, Chuan Hong, Isaac S. Kohane, Katherine P. Liao. PheProb: Probabilistic Phenotyping Using Diagnosis Codes to Improve Power for Genetic Association Studies. Journal of the American Medical Informatics Association (2018).
  27. Jian Zhang, Anil Can, Srinivasan Mukundan Jr., Michael Steigner, Victor M. Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian Gainer, Nancy A. Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Zhong Wang, Scott T. Weiss, Rose Du*. Morphological Variables Associated with Ruptured Middle Cerebral Artery Aneurysms. Neurosurgery (2018).
  28. Thomas H. McCoy#, Sheng Yu#, Kamber L. Hart, Victor M. Castro, Hannah E. Brown, James N. Rosenquist, Alysa E. Doyle, Pieter J. Vuijk, Tianxi Cai*, Roy H. Perlis*. High Throughput Phenotyping for Dimensional Psychopathology in Electronic Health Records. Biological Psychiatry (2018); DOI: 10.1016/j.biopsych.2018.01.011; #contributed equally. https://www.sciencedaily.com/releases/2018/02/180226103436.htm
  29. Thomas H. McCoy, Victor M. Castro, Kamber L. Hart, Amelia M. Pellegrini, Sheng Yu, Tianxi Cai, Roy H. Perlis*. Genome-wide Association Study of Dimensional Psychopathology Using Electronic Health Records. Biological Psychiatry (2018); DOI: 10.1016/j.biopsych.2017.12.004.
  30. Anil Can, Victor Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian Gainer, Nancy Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott Weiss, and Rose Du*. Lipid-Lowering Agents and High HDL are Inversely Associated with Intracranial Aneurysm Rupture; Stroke (2018).
  31. Anil Can, Victor M. Castro, Yildirim H. Ozdemir, Sarajune Dagen, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian Gainer, Nancy A. Shadick, Shawn Murphy, Tianxi Cai, Guergana Savova, Scott T. Weiss, Rose Du*; Alcohol Consumption and Aneurysmal Subarachnoid Hemorrhage; Translational Stroke Research (2018), 9(1):13-19.
  32. Anil Can, Victor M. Castro, Sheng Yu, Dmitriy Dligach, Sean Finan, Vivian Gainer, Nancy A. Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott T. Weiss, Rose Du*; Antihyperglycemic Agents are Inversely Associated with Intracranial Aneurysm Rupture; Stroke (2018), 49(1):34-39; doi: 10.1161/STROKEAHA.117.019249.
  33. Sheng Yu*, Yumeng Ma, Jessica Gronsbell, Tianrun Cai, Ashwin N. Ananthakrishnan, Vivian S. Gainer, Susanne E. Churchill, Peter Szolovits, Shawn N. Murphy, Isaac S. Kohane, Katherine P. Liao, Tianxi Cai. Enabling Phenotypic Big Data with PheNorm; Journal of the American Medical Informatics Association (2018), 25(1, Data Science Special Issue):54-60; doi: 10.1093/jamia/ocx111. Best Papers in "Knowledge Representation and Management", 2019 IMIA Yearbook.
  34. Anil Can, Victor M. Castro, Yildirim H. Ozdemir, Sarajune Dagen, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian Gainer, Nancy A. Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott T. Weiss, Rose Du*; Heroin Use is Associated with Ruptured Saccular Aneurysms; Translational Stroke Research (2017); doi: 10.1007/s12975-017-0582-y.
  35. Anil Can, Victor Castro, Yildirim H Ozdemir, Sarajune Dagen, Sheng Yu, Dmitriy Dligach, Sean Finan, Vivian S Gainer, Nancy Shadick, Shawn Murphy, Tianxi Cai, Guergana Savova, Ruben Dammers, Scott T Weiss, and Rose Du*; Association of Intracranial Aneurysm Rupture with Smoking Duration, Intensity, and Cessation; Neurology (2017), 10-1212.
  36. Sheng Yu*, Abhishek Chakrabortty, Katherine P. Liao, Tianrun Cai, Ashwin N. Ananthakrishnan, Vivian S. Gainer, Susanne E. Churchill, Peter Szolovits, Shawn N. Murphy, Isaac S. Kohane, Tianxi Cai. Surrogate-assisted Feature Extraction for High-throughput Phenotyping; Journal of the American Medical Informatics Association (2017), 24 (e1): e143-e149; doi: 10.1093/jamia/ocw135.
  37. Victor M. Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Anil Can, Muhammad Abd-El-Barr, Vivian Gainer, Nancy A. Shadick, Shawn Murphy, Tianxi Cai, Guergana Savova, Scott T. Weiss, Rose Du*; Large-scale identification of subjects with cerebral aneurysms using natural language processing; Neurology (2017): 88(2), 164-168.
  38. Florence H. Yong, Lu Tian*, Sheng Yu, Tianxi Cai and L.J. Wei. Optimal stratification in outcome prediction using baseline information; Biometrika, 103.4 (2016): 817-828.
  39. Tianrun Cai, Andreas A. Giannopoulos, Sheng Yu, Tatiana Kelil, Beth Ripley, Kanako K. Kumamaru, Frank J. Rybicki, and Dimitrios Mitsouras*. Natural Language Processing Technologies in Radiology Research and Clinical Applications. RadioGraphics (2016), 36, no. 1: 176-191.
  40. Sheng Yu*, Katherine P. Liao, Stanley Y. Shaw, Vivian S. Gainer, Susanne E. Churchill, Peter Szolovits, Shawn N. Murphy, Isaac Kohane, and Tianxi Cai. Toward High-throughput Phenotyping: Unbiased Automated Feature Extraction and Selection from Knowledge Sources; Journal of the American Medical Informatics Association (2015), 22(5):993-1000; doi: 10.1093/jamia/ocv034. EDITOR'S CHOICE.
  41. Victor M. Castro, Yuanyuan Shen, Sheng Yu, Sean Finan, Cindy Ta Pau, Vivian Gainer, Candace C. Keefe, Guergana Savova, Shawn N. Murphy, Tianxi Cai and Corrine K. Welt*. Identification of subjects with polycystic ovary syndrome using electronic health records. Reproductive Biology and Endocrinology (2014), 13(1), p.116.
  42. Sheng Yu*, Kanako K. Kumamaru, Elizabeth George, Ruth M. Dunne, Arash Bedayat, Matey Neykov, Andetta R. Hunsaker, Karin E. Dill, Tianxi Cai, and Frank J. Rybicki. Classification of CT Pulmonary Angiography Reports By Presence, Chronicity, and Location of Pulmonary Embolism with Natural Language Processing; Journal of Biomedical Informatics (2014), 52: 386-393.
  43. Vishesh Kumar*, Katherine Liao, Su-Chun Cheng, Sheng Yu, Uri Kartoun, Ari Brettman, Vivian Gainer, Andrew Cagan, Shawn Murphy, Guergana Savova, Pei Chen, Peter Szolovits, Zongqi Xia, Elizabeth Karlson, Robert Plenge, Ashwin Ananthakrishnan, Susanne Churchill, Tianxi Cai, Isaac Kohane, Stanley Shaw. Natural Language Processing Improves Phenotypic Accuracy in an Electronic Medical Record Cohort of Type 2 Diabetes and Cardiovascular Disease; Journal of the American College of Cardiology (2014), 63(12):A1359.
  44. Sheng Yu* and Enrique Campos-Náñez. Adaptive Convex Enveloping for Multidimensional Stochastic Dynamic Optimization; 62nd IIE Annual Conference and Expo. Proceedings. 2012. Best Paper of Operations Research.