Analyzing datasets from our clients’ real-world studies can be complicated. What sets these studies apart from randomized controlled trials (RCTs), namely the flexibility of digital data collection compared with the rigid experimental boundaries of trials, means that a data analysis question that seems simple at the outset can have different answers depending on the approach. The obstacles also vary, and how they are preempted or resolved can have a considerable impact on a study’s eventual success. We’ll be exploring two hypothetical examples in this post.

Example 1

We are running an observational study investigating a disease of interest, and there is a survey question that asks study participants whether they have been tested for the expression of a certain gene; this data could then be linked to diagnosis or healthcare resource use. Participants can either select one of the following responses or skip the question altogether.


Question 1: I have been tested for gene X
Answer options:
    • Yes
    • I don’t know
    • Prefer not to answer


Participants are then asked:


Question 2: I tested positive for gene X
Answer options:
    • Yes
    • I don’t know
    • Prefer not to answer

Assume we find out from Question 2 that 100 people in the study cohort are positive for gene X. To express this as a percentage, what number do we use as the denominator?

    • All participants who answered Question 1?
    • Only those who answered “Yes” to Question 1?
    • Only those who answered Question 2 (this could be a different number)?

On top of this, how should we address participants who skip the question or select “I don’t know” or “Prefer not to answer”?
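To make the effect of the denominator concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function name, the way responses are stored (a pair of answers per participant, with None meaning the question was skipped), and the cohort counts, which are invented and not taken from any study. It also assumes that selecting “I don’t know” or “Prefer not to answer” counts as having answered a question.

```python
from typing import Dict, List, Optional, Tuple

# Each participant is a pair: (answer to Question 1, answer to Question 2).
# None means the participant skipped that question.
Response = Tuple[Optional[str], Optional[str]]


def positive_rate_by_denominator(responses: List[Response]) -> Dict[str, Optional[float]]:
    """Share of gene X positives under three candidate denominators."""
    positives = sum(1 for _, q2 in responses if q2 == "Yes")
    denominators = {
        # Assumption: any selected option counts as "answered".
        "answered_q1": sum(1 for q1, _ in responses if q1 is not None),
        "yes_to_q1": sum(1 for q1, _ in responses if q1 == "Yes"),
        "answered_q2": sum(1 for _, q2 in responses if q2 is not None),
    }
    # Return None rather than dividing by zero for an empty denominator.
    return {name: (positives / n if n else None) for name, n in denominators.items()}


# Invented cohort, for illustration only.
cohort: List[Response] = (
    [("Yes", "Yes")] * 100                      # tested, report a positive result
    + [("Yes", "I don't know")] * 150           # tested, unsure of the result
    + [("Yes", None)] * 50                      # tested, skipped Question 2
    + [("I don't know", None)] * 120            # unsure whether tested
    + [("Prefer not to answer", None)] * 80     # declined Question 1
    + [(None, None)] * 40                       # skipped both questions
)

rates = positive_rate_by_denominator(cohort)
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

With this invented cohort, the same 100 positives come out at 20%, about 33%, or 40% depending on the denominator, and the 40 participants who skipped both questions do not appear in any of the three figures.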

These issues are not trivial. A client’s product could be intended only for those with a particular gene or antibody expression, and pricing could be impacted by these percentages. If we go out into the world, will we find the same percentage of people who express gene X as was shown in our analysis? Or will those not providing an answer impact the external validity of the results?

Example 2

Imagine running an engagement analysis across a digital registry to establish whether participants are contributing data in a consistent and complete manner. To get a value to represent engagement, we need to consider numerators and denominators for the calculation.

In terms of numerators, are we looking at:

    • How many surveys participants have completed overall?
    • How many questions, within a particular survey, participants have completed?
    • The same questions, but for participants who report experiencing a certain symptom or comorbidity?

In terms of the denominator, the natural answer in many cases is to use the number of surveys administered. However, we need to consider that different surveys will have different administration schedules, and participants will have been in the study for different periods of time – meaning that finding a denominator that works across large parts of the study population can be challenging. Further complicating the situation is attrition: participants may stop entering data in a study app without officially “leaving” the study, or they may have a rapidly progressing disease that affects their ability to interact with a study app.
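One possible way to handle differing schedules and enrollment dates is to give each participant their own denominator: the number of surveys they should have been offered so far. The Python sketch below illustrates that idea under stated assumptions. The function names, the fixed-interval schedule (one survey at enrollment and one every interval_days thereafter), and the grace period used to flag participants who may have silently stopped entering data are all hypothetical, not a description of any particular study design.

```python
from datetime import date
from typing import Optional


def expected_surveys(enrolled: date, as_of: date, interval_days: int) -> int:
    """Surveys offered so far, assuming one at enrollment and one per interval after."""
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    days_in_study = (as_of - enrolled).days
    if days_in_study < 0:
        return 0  # not yet enrolled on the analysis date
    return days_in_study // interval_days + 1


def engagement_rate(completed: int, enrolled: date, as_of: date,
                    interval_days: int) -> Optional[float]:
    """Completed surveys divided by the participant's own expected count."""
    expected = expected_surveys(enrolled, as_of, interval_days)
    return completed / expected if expected else None


def possibly_lapsed(last_entry: Optional[date], as_of: date, grace_days: int) -> bool:
    """Flag participants with no recent data entry; they may not have formally left."""
    if last_entry is None:
        return True
    return (as_of - last_entry).days > grace_days


as_of = date(2024, 3, 31)
# Enrolled three months ago, completed 3 of the 4 surveys offered.
early = engagement_rate(3, date(2024, 1, 1), as_of, interval_days=30)
# Enrolled one month ago, completed both surveys offered.
late = engagement_rate(2, date(2024, 3, 1), as_of, interval_days=30)
print(early, late)  # 0.75 1.0
print(possibly_lapsed(date(2024, 1, 15), as_of, grace_days=45))  # True
```

A per-participant denominator keeps recent enrollees from looking disengaged simply because they have had fewer opportunities to respond. Whether lapsed participants should stay in the denominator is a separate decision that belongs in the analysis plan.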

Handling obstacles in real-world data analytics

Real-world data analytics inevitably involves overcoming obstacles, and knowing how to deal with them comes with experience. Establishing a clear statistical analysis plan with the client, developing methods to track shifting numbers, arranging contingency plans for different circumstances, and learning what you shouldn’t overthink and what you can’t overthink enough are all part of the process.

Our Data team has substantial experience of adapting the analytics process to diverse digital real-world studies. To learn more about our capabilities, get in touch at


By Dr Casey Quinn and Amber Kudlac
