Distributional semantic methods for duplicate search

Duplicates in any pharmacovigilance system can create misleading signals and thereby affect safety monitoring and potential regulatory actions. Detecting and handling duplicates is therefore an important element of good case management for National Competent Authorities (NCAs), Marketing Authorisation Holders (MAHs) and Sponsors of clinical trials (Sponsors).

The Uppsala Monitoring Centre (UMC) has recently shared its journey in using natural language processing, which led to a new dimension in case report analysis. Below are the highlights from their article.

Current algorithms for identifying duplicate individual case safety reports (ICSRs) work by comparing important fields of two reports (for example, age, gender, country of origin, reported reactions, reported drugs) and computing a score that rewards matching information and penalises mismatches, weighted by how frequent each piece of information is in the first place. For the reported reactions, this approach felt too coarse considering how large MedDRA is: if two people independently review a given case, it is not at all certain that they will describe the observed reactions using exactly the same set of MedDRA terms. If one chooses “blood pressure increased” and the other chooses “hypertension”, the two reports are scored as a mismatch on that field, which makes the duplicate harder to identify.
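To make the scoring idea concrete, here is a minimal sketch in Python. It is not UMC's actual algorithm: the field names, the frequencies in FREQ, the flat MISMATCH_PENALTY and the use of a negative log-frequency as the reward are all assumptions chosen for illustration. It shows how an exact-match comparison penalises two reports that describe the same reaction with different terms.

```python
import math

# Hypothetical relative frequencies of field values in the database
# (illustrative numbers only, not taken from VigiBase).
FREQ = {
    ("gender", "F"): 0.6,
    ("country", "SE"): 0.02,
    ("reaction", "hypertension"): 0.01,
    ("reaction", "blood pressure increased"): 0.008,
}

FIELDS = ("gender", "country", "reaction")
MISMATCH_PENALTY = -2.0  # assumed flat penalty for a conflicting field


def field_score(field, a, b):
    """Reward a match more when the shared value is rare; penalise a mismatch."""
    if a is None or b is None:
        return 0.0  # missing information is neither rewarded nor penalised
    if a == b:
        return -math.log(FREQ[(field, a)])
    return MISMATCH_PENALTY


def pair_score(report_a, report_b):
    """Sum the per-field scores for a pair of reports."""
    return sum(field_score(f, report_a.get(f), report_b.get(f)) for f in FIELDS)


original = {"gender": "F", "country": "SE", "reaction": "hypertension"}
same_terms = {"gender": "F", "country": "SE", "reaction": "hypertension"}
other_terms = {"gender": "F", "country": "SE", "reaction": "blood pressure increased"}

print(pair_score(original, same_terms))   # all three fields match
print(pair_score(original, other_terms))  # reaction field counted as a mismatch
```

In this toy setup the second pair scores much lower than the first, even though a human reviewer would read the two reaction terms as describing the same event.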

Distributional semantic methods: The central idea is that if there are two words surrounded by similar contexts – that is, with a high degree of interchangeability – then these words are likely to carry a similar meaning. The approach gained great popularity in the early 2010s when a distributional semantic method called word2vec was published, leading to a great boost in performance of algorithms in many kinds of classical NLP tasks (such as part-of-speech tagging, machine translation, question-answering, and document classification).
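The idea of "similar contexts, similar meaning" can be illustrated without word2vec at all. The sketch below uses plain co-occurrence counts and cosine similarity, a simpler count-based technique swapped in for illustration. The toy reports and the placeholder names drug_a and drug_b are invented.

```python
import math
from collections import Counter

# Made-up reports, each a list of drug and reaction terms.
reports = [
    ["drug_a", "hypertension", "headache"],
    ["drug_a", "blood pressure increased", "headache"],
    ["drug_a", "hypertension", "dizziness"],
    ["drug_a", "blood pressure increased", "dizziness"],
    ["drug_b", "rash", "pruritus"],
    ["drug_b", "rash", "urticaria"],
]


def context_vector(term, reports):
    """Count the terms that appear in the same report as `term`."""
    counts = Counter()
    for report in reports:
        if term in report:
            counts.update(t for t in report if t != term)
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


hyp = context_vector("hypertension", reports)
bpi = context_vector("blood pressure increased", reports)
rash = context_vector("rash", reports)

print(cosine(hyp, bpi))   # close to 1: the two terms share the same contexts
print(cosine(hyp, rash))  # 0.0: no shared context terms
```

Here "hypertension" and "blood pressure increased" never appear in the same report, yet they end up with near-identical context vectors because they are interchangeable within otherwise similar reports.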

In essence, word2vec models every word as a vector (think simply of a list of numeric values) and learns each vector component with machine learning by trying to predict centre words in context windows, using massive amounts of text. The UMC team applied word2vec to more than 16 million VigiBase reports, creating a novel “space of meaning” for both drugs and events. Preliminary quantitative and qualitative analyses showed that the resulting vectors were indeed meaningful, as illustrated by the tables in the original article, which list the nearest neighbours in the space for different concepts of interest. They also note: “Evaluating the quality and soundness of the space is also a challenge because there is no clear truth to compare it to.”
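As a rough sketch of what "predicting centre words in context windows" means for the training data, the Python snippet below turns one sequence of terms into (context, centre) pairs. The helper name cbow_pairs, the window size and the toy report are assumptions. The summary does not say how UMC ordered the terms within a report or which window they used, and the model that learns vectors from such pairs is left out.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_terms, centre_term) pairs for every position in `tokens`."""
    pairs = []
    for i, centre in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` terms before the centre
        right = tokens[i + 1:i + 1 + window]   # up to `window` terms after the centre
        context = left + right
        if context:
            pairs.append((context, centre))
    return pairs


# A made-up report treated as a "sentence" of drug and reaction terms.
report = ["drug_a", "hypertension", "headache", "dizziness"]

for context, centre in cbow_pairs(report, window=1):
    print(context, "->", centre)
```

A model trained on such pairs is pushed to give similar vectors to terms that can fill the same slot, which is how terms used in similar contexts end up close together in the space.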

Read full article here: https://www.uppsalareports.org/articles/found-in-space-new-method-reveals-related-drug-and-reaction-terms/

