
CheXpert Dataset

Aleco Kastanos edited this page May 19, 2020 · 1 revision

Overview

  • 224,316 chest x-rays
  • 65,240 patients
  • 14 observations from radiology reports
  • Validation set: 200 studies, annotated by 3 experts
  • Test set: 500 studies, annotated by 5 experts

Automated labeller

Extracts observations from free-text radiology reports in three stages:

  1. Mention extraction
  2. Mention classification: each mention is classified as negative (no evidence for the observation), uncertain, or positive (the observation is mentioned as present)
  3. Mention aggregation: the mentions of each observation are combined into one label per report. Positive (1) if at least one mention is positive; uncertain (-1) if there is no positive mention and at least one uncertain mention; negative (0) if the observation is only mentioned as not present; blank if the observation is never mentioned

There is also a "No Finding" column, which is assigned 1 if no observation is labelled positive or uncertain.
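The aggregation rules can be sketched in a few lines of Python. This is a minimal illustration, not the labeller's actual code: the function names `aggregate_mentions` and `label_report` are hypothetical, mention classes are assumed to be already encoded as 1, 0 and -1, and leaving "No Finding" blank when something is found is an assumption (the page only says when it is set to 1).

```python
from typing import Dict, List, Optional

POSITIVE, NEGATIVE, UNCERTAIN = 1, 0, -1


def aggregate_mentions(mentions: List[int]) -> Optional[int]:
    """Collapse the classified mentions of one observation into one label."""
    if not mentions:
        return None  # blank: the report never mentions the observation
    if POSITIVE in mentions:
        return POSITIVE  # at least one positive mention
    if UNCERTAIN in mentions:
        return UNCERTAIN  # no positive, at least one uncertain mention
    return NEGATIVE  # only negative mentions remain


def label_report(mentions_by_obs: Dict[str, List[int]]) -> Dict[str, Optional[int]]:
    """Label every observation of a report and derive the 'No Finding' column."""
    labels = {obs: aggregate_mentions(m) for obs, m in mentions_by_obs.items()}
    found = any(v in (POSITIVE, UNCERTAIN) for v in labels.values())
    # Assumption: 'No Finding' stays blank (None) when something was found.
    labels["No Finding"] = None if found else POSITIVE
    return labels
```

For example, `label_report({"Edema": [0, -1], "Pneumonia": []})` labels Edema as uncertain (-1), leaves Pneumonia blank, and does not set "No Finding".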

Schemes for dealing with unknowns

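U-ignore: Mask the unknown (uncertain, -1) labels so that they contribute nothing to the loss during training. This is analogous to dropping rows with missing values, except that only the uncertain label is ignored; the other labels of the same study are still used.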
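A minimal sketch of such a masked loss, in plain Python so that it runs without a deep learning framework. The function name `masked_bce` is hypothetical, `probs` are assumed to be per-observation sigmoid outputs for one study, and any label other than 0 or 1 is skipped. Skipping blank labels as well is an assumption made here; this page does not say how blanks are handled.

```python
import math
from typing import List, Optional


def masked_bce(probs: List[float], labels: List[Optional[int]]) -> float:
    """Sum binary cross-entropy over the observations labelled 0 or 1."""
    eps = 1e-7  # keeps log() away from 0
    total = 0.0
    for p, y in zip(probs, labels):
        if y not in (0, 1):
            continue  # uncertain (-1) or blank (None): no loss contribution
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total
```

With `probs = [0.9, 0.2, 0.5]` and `labels = [1, -1, None]`, only the first observation contributes, so the loss equals `-log(0.9)`.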

Binary mapping: Map all unknowns to 1 (U-Ones) or all unknowns to 0 (U-Zeros), then train as ordinary binary classification.
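As a small illustration, the mapping is a one-line substitution over the label vector. The helper name `map_uncertain` is hypothetical, and uncertain labels are assumed to be stored as -1; blank labels (None) are left untouched.

```python
from typing import List, Optional


def map_uncertain(labels: List[Optional[int]], fill: int) -> List[Optional[int]]:
    """Replace every uncertain label (-1) with `fill`, which must be 0 or 1."""
    if fill not in (0, 1):
        raise ValueError("fill must be 0 or 1")
    return [fill if y == -1 else y for y in labels]
```

Calling `map_uncertain(labels, 1)` treats every unknown as positive, and `map_uncertain(labels, 0)` treats every unknown as negative.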

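U-self-trained: First train a model with the U-ignore scheme. Then use that model to predict the unknown labels and replace them with its predictions, leaving the existing 1 and 0 labels unchanged. Either the predicted class or the soft (logit) prediction can be used as the new label. A second model is then trained on the relabelled data.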
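The relabelling step might look like the sketch below. It assumes that a model trained under U-ignore has already produced one score per label, and that the score has been passed through a sigmoid so it is a probability rather than a raw logit. The function name `relabel_uncertain` and the `hard` flag are hypothetical; training itself is not shown.

```python
from typing import List, Sequence


def relabel_uncertain(
    labels: Sequence[float], probs: Sequence[float], hard: bool = False
) -> List[float]:
    """Replace uncertain labels (-1) with model predictions; keep 0s and 1s."""
    relabelled = []
    for y, p in zip(labels, probs):
        if y == -1:
            # hard=True uses the predicted class, otherwise the soft score
            relabelled.append(float(p >= 0.5) if hard else float(p))
        else:
            relabelled.append(float(y))
    return relabelled
```

Soft targets keep the model's confidence in the new label, while `hard=True` gives a dataset that contains only 0s and 1s.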

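3-class classification: Train the model to predict one of three classes (1, 0, or u) for each observation. At prediction time, discard the u output and apply a softmax over the 1 and 0 outputs only. The result is the probability of the positive class, from which the "more likely" class can be taken.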
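The prediction step can be sketched as a softmax restricted to two of the three outputs. The function name `positive_probability` and the argument order are hypothetical; the three arguments are assumed to be the raw model outputs (logits) for one observation.

```python
import math


def positive_probability(logit_neg: float, logit_pos: float, logit_unc: float) -> float:
    """Softmax over the negative and positive logits; the u logit is ignored."""
    m = max(logit_neg, logit_pos)  # subtract the max for numerical stability
    e_neg = math.exp(logit_neg - m)
    e_pos = math.exp(logit_pos - m)
    return e_pos / (e_pos + e_neg)
```

A returned value above 0.5 means that class 1 is more likely than class 0, however large the u logit was.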
