Text mining gold

February 16, 2016

Discovering hidden connections in the mass of scientific data on the web will become easier – thanks to Karin Verspoor.

Karin Verspoor, Associate Professor in the Department of Computing and Information Systems at the University of Melbourne and Deputy Director of the University of Melbourne Health and Bioinformatics Centre, describes her early fascination with computers and exposure to multiple languages as key drivers for her becoming a computational linguist.

“When I was nine years old my parents bought me a programmable games console, and I discovered that I really enjoyed getting computers to do things from my imagination – it appealed to my logic and creativity.”

Karin went on to study BASIC – a high-level computer programming language developed for non-scientists that was popularised in the 1980s when the home computer market exploded.

Karin was born in Senegal, on the west coast of Africa, to Dutch parents. Her formative experience with the games console led her to an undergraduate degree with a double major in Computer Science and Cognitive Sciences at Rice University in Houston, Texas. “I was drawn to the question of how to get computers to think and understand language,” Karin says.

“It was the perfect course because it combined computing, psychology, philosophy and linguistics.”

On completing her undergraduate studies, Karin swapped the heat of Texas for the cooler climate of Scotland, where she undertook a Master’s degree and PhD in Cognitive Science and Natural Languages at the University of Edinburgh. After the PhD she did a short stint as a research fellow at Macquarie University in Sydney, working on the Dynamic Document Delivery project, which looked at generating natural language texts on demand. Karin then left academe for a very different world: the business of start-ups.

“It was arguably the most exciting period of my career – I was involved in two start-ups with amazing ideas,” Karin says. “One of them was trying to build a thinking machine that was going to predict the stock market. It was crazy and so much fun, but it died after the dotcom bubble crash.”

Although the second start-up was much more successful, Karin missed the world of research and so took up a position at the prestigious Los Alamos National Laboratory in New Mexico, where she was able to leverage her business experience and pursue applied research in computational methods for the extraction and retrieval of knowledge from databases and information systems.

“Los Alamos was the home of the human genome project, and it was there I got into computational biology,” explains Karin. “I started working on text mining in the published molecular literature, which eventually led me to the University of Colorado and an opportunity to work exclusively in biomedical text mining.”

Text mining is the analysis of a natural language text – like English or French – by a computer. It’s used to discover and extract new information by linking together data from different written sources to generate new facts or hypotheses.
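As a rough illustration of what "linking together data from different written sources" can mean, here is a minimal Python sketch. It is not the system described in this article: the two sentences are invented toy data, the term list is hypothetical, and simple substring matching stands in for real language analysis. The idea is that if one text links term A to term B and another links B to C, a program can propose an A to C connection that neither text states.

```python
from collections import defaultdict
from itertools import combinations

# Toy "documents": invented sentences for illustration, not real literature.
DOCS = {
    "doc_a": "Fish oil reduces blood viscosity.",
    "doc_b": "High blood viscosity is seen in Raynaud syndrome.",
}

# Hypothetical list of terms of interest (an assumption for this sketch).
TERMS = ["fish oil", "blood viscosity", "raynaud syndrome"]


def find_terms(text, terms):
    """Return the set of known terms that appear in the text (case-insensitive)."""
    lowered = text.lower()
    return {term for term in terms if term in lowered}


def cooccurrences(docs, terms):
    """Map each term to the terms it shares at least one document with."""
    links = defaultdict(set)
    for text in docs.values():
        found = find_terms(text, terms)
        for a, b in combinations(sorted(found), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def hypotheses(links):
    """Return (a, c, via) triples: a and c never co-occur, but both co-occur with via."""
    results = set()
    for via, neighbours in links.items():
        for a, c in combinations(sorted(neighbours), 2):
            if c not in links[a]:
                results.add((a, c, via))
    return results


if __name__ == "__main__":
    for a, c, via in sorted(hypotheses(cooccurrences(DOCS, TERMS))):
        print(f"Possible link: {a} <-> {c} (via {via})")
```

A real biomedical text mining system would need far more than this, such as recognising synonyms and distinguishing the kind of relationship expressed, but the sketch shows how a candidate hypothesis can emerge from texts that never mention the two terms together.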

Karin’s current work at the University of Melbourne involves applying text mining to the field of biomedical research. “The rate of scientific publications is dramatically increasing in the biomedical space,” explains Karin. “The most important biomedical research repository called PubMed, hosted by the United States National Library of Medicine, has indexed over 25 million research publications.”

The multi-disciplinary nature of current biomedical research, combined with the huge amount of published material, means that scientists today must follow a much broader range of literature to stay up to date.

“We’re looking to develop an automated computer system that analyses words to discover the relationships between them – to provide researchers with a tool that allows them to ask more structured questions and receive more targeted information,” Karin says.

– Carl Williams
