John Brownstein, PhD, co-founder of HealthMap, an online infectious disease monitoring and tracking website, has published an interesting article in the latest PLoS Medicine, together with software developer Clark Freifeld and others, that documents the latest efforts to make HealthMap technology more attuned to the language and chatter of the internet. We first reported on HealthMap back in October 2006. Since then the project has received a $450,000 grant from Google. Children’s Hospital Boston, where Brownstein et al. are from, is now reporting that HealthMap “has expanded its surveillance reach and now mines the Internet in English, Chinese, Spanish, Russian and French. Additional languages such as Hindi, Portuguese and Arabic are under development.”
More from the PLoS Medicine article:
The use of international news media for public health surveillance has a number of potential biases that merit consideration. While local news sources may report on incidents involving a few cases that would not be picked up at the national level, such sources may be less reliable, lacking resources and training, and may report stories without adequate confirmation. Furthermore, other biases may be intentionally introduced for political reasons through disinformation campaigns (false positives) or state censorship of information relating to outbreaks (false negatives). We have attempted to better understand some of these issues through ongoing analysis and evaluation research. We ran a 43-week evaluation of HealthMap data, covering the period of October 1, 2006 through July 18, 2007. We found that pathogen diversity was substantial across news sources, with 141 unique infectious disease categories reported through the Google News feed alone. We found the frequency of reports about particular pathogens to be related not to their associated morbidity or mortality impact, but rather to the direct or potential economic and social disruption caused by the outbreak.
The system characterizes disease outbreak reports by means of a series of text mining algorithms. Characterization stages include: (a) identifying disease and location; (b) determining relevance—namely, whether a given report refers to any current outbreak; and (c) grouping similar reports together while removing exact duplicates. Once the reports are automatically processed, curators correct the misclassifications of the system where necessary. Currently, only one analyst reviews and corrects the posts. However, additional resources would enable more detailed multilingual curation and annotation of collected reports.
Extracting location and disease names from text reports presents the most formidable challenge. HealthMap draws from a continually expanding dictionary of pathogens (human, plant, and animal diseases) and geographic names (country, province, state, and city) to classify outbreak alert information. However, disease and place names are often ambiguous, colloquial, and subject to change, and may have multiple spellings (e.g., diarrhea, common in the US, and diarrhoea, common in the UK). Thus, the expansion and editing of the database requires extensive manual data entry.
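To make the dictionary idea in the passage above concrete, here is a minimal sketch of how variant spellings might be mapped to one canonical name before matching. The dictionaries, the sample report, and the `find_terms` helper are our own illustrative assumptions; the article does not describe HealthMap's actual data structures or matching rules.

```python
import re

# Hypothetical mini-dictionaries (not HealthMap's real data):
# every variant spelling or colloquial name maps to one canonical name.
DISEASES = {
    "diarrhea": "diarrhea",        # US spelling
    "diarrhoea": "diarrhea",       # UK spelling
    "avian influenza": "avian influenza",
    "bird flu": "avian influenza",  # colloquial name
}
PLACES = {
    "jakarta": "Jakarta",
    "indonesia": "Indonesia",
}


def find_terms(text, dictionary):
    """Return the sorted canonical names whose variants occur in text."""
    lowered = text.lower()
    found = set()
    for variant, canonical in dictionary.items():
        # \b anchors the match to word boundaries so a variant
        # never matches inside a longer word.
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            found.add(canonical)
    return sorted(found)


# Hypothetical headline used only for illustration.
report = "Bird flu and diarrhoea cases reported in Jakarta, Indonesia"
diseases = find_terms(report, DISEASES)
places = find_terms(report, PLACES)
print(diseases, places)
```

Even this toy version shows why the article calls the dictionary work labor intensive: every new spelling, nickname, or renamed place needs its own hand-entered row.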
Once location and disease have been identified, articles are automatically tagged according to their relevance. Specifically, we identify whether a given report refers to a current outbreak (“breaking news”), as opposed to reporting on other infectious disease–related news, such as vaccination campaigns, scientific research, or public health policy. In this case, HealthMap makes use of a Bayesian machine learning algorithm, trained on manually characterized existing reports, to automatically tag and separate breaking news. Finally, duplicate reports are filtered, identified, and grouped based on the similarity of the article’s headline, body text, and disease and location categories. Using a similarity score threshold, the system groups related articles into clusters that provide the collective information on a given outbreak.
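The article does not say which Bayesian algorithm HealthMap uses, so the following is only a sketch of one common choice, a multinomial naive Bayes classifier with add-one smoothing, trained on a handful of made-up labeled headlines. The labels "breaking" and "other", the training sentences, and the function names are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real systems would do far more."""
    return text.lower().split()


def train(labeled_reports):
    """Count words per label and reports per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_reports:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior under naive Bayes."""
    word_counts, label_counts, vocab = model
    total_reports = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior: share of training reports carrying this label.
        score = math.log(label_counts[label] / total_reports)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, manually labeled training reports.
training = [
    ("cholera outbreak kills dozens in village", "breaking"),
    ("new cases of measles confirmed in outbreak", "breaking"),
    ("vaccination campaign launched for measles", "other"),
    ("researchers publish study on cholera vaccine", "other"),
]
model = train(training)
print(classify("outbreak of cholera cases confirmed", model))
print(classify("vaccine study published", model))
```

With only four training headlines the result is fragile, which is presumably why the system relies on a larger manually characterized set and a human curator to correct misclassifications.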
Knowledge integration and dissemination.
HealthMap is particularly focused on providing users with news of immediate interest and reducing information overload. Overwhelming public health officials with information on outbreaks of low public health impact may distract them from investigating outbreaks of greater priority that might receive reduced media attention. Thus, only articles classified as breaking news are posted to the site. Although they are filtered from the initial display, other article types and duplicate articles are shown in a related information window, providing a situational report on an ongoing outbreak as well as recent reports concerning either the same disease or location, and links for further research…
HealthMap also addresses the computational challenges of integrating multiple sources of unstructured information by generating meta-alerts of disease outbreaks. As false alarms can often be reduced by thorough aggregation and cross-validation of reported information, a composite activity score (or heat index) is calculated based on (a) the reliability of the data source (for instance, increased weight is given to WHO reports and reduced weight to local media reports); and (b) the number of unique data sources, with increased weight to multiple types of information (e.g., discussion sites and media reports on the same outbreak). This meta-alert derivation is based on the idea that multiple sources of information about an incident provide greater confidence in the reliability of the report than any one source alone.
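The article names the two ingredients of the heat index but gives no formula, so the sketch below is purely our own guess at one way to combine them: sum a reliability weight over the unique sources, then scale up when more than one type of source agrees. The weights, the source categories, and the `TYPE_BONUS` value are invented for illustration.

```python
# Hypothetical reliability weights. The article says only that WHO reports
# get more weight and local media less, not what the values are.
SOURCE_WEIGHT = {
    "WHO": 1.0,
    "national_news": 0.6,
    "discussion_site": 0.4,
    "local_news": 0.3,
}

# Hypothetical grouping of sources into types of information.
SOURCE_TYPE = {
    "WHO": "official",
    "national_news": "media",
    "local_news": "media",
    "discussion_site": "discussion",
}

# Assumed multiplier added for each extra type of source that agrees.
TYPE_BONUS = 0.25


def heat_index(report_sources):
    """Combine source reliability and source diversity into one score."""
    unique_sources = set(report_sources)  # repeats from one source count once
    if not unique_sources:
        return 0.0
    base = sum(SOURCE_WEIGHT[source] for source in unique_sources)
    n_types = len({SOURCE_TYPE[source] for source in unique_sources})
    return base * (1 + TYPE_BONUS * (n_types - 1))


single = heat_index(["local_news", "local_news"])
combined = heat_index(["WHO", "local_news", "discussion_site"])
print(single, combined)
```

Under these assumed numbers, two stories from the same local outlet score no higher than one, while an outbreak picked up by WHO, local media, and a discussion site scores several times higher, which matches the cross-validation intuition in the passage.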
Full article at PLoS Medicine: Surveillance Sans Frontières: Internet-Based Emerging Infectious Disease Intelligence and the HealthMap Project. PLoS Medicine Vol. 5, No. 7, e151. doi:10.1371/journal.pmed.0050151
Children’s Hospital Boston: Internet crawling: a new tool for tracking infectious disease…
HealthMap | Global disease alert map…
Flashbacks: HEALTHmap Global Disease Tracker; WhoIsSick.org: Hypochondriacs Welcome!