Making sense of data

Think of the information associated with you: your shopping habits, personal information, reading interests, favorite sports teams, and digital photographs. Imagine how this data is collected, stored and processed. Now imagine similar volumes of data for nearly 7 billion people. No wonder the amount of data being produced worldwide is doubling every three years.

The challenge is making sense of it all. Many industry sectors, including security and surveillance, finance, medical imaging and diagnostics, and environmental monitoring, rely on controlling, processing, understanding and using vast quantities of information. Advances in computing power are unable to keep pace with the proliferation of data, and better algorithms for data processing are required. It is impossible for human analysts to examine even one tenth of one percent of the high volume of available data, which is often complex, incomplete, and even contradictory.

Dr. Yoshua Bengio, who heads the MITACS project Statistical Learning of Complex Data with Complex Distributions, describes the problem as similar to that of drinking from a fire hose. His team of six computer scientists and statisticians from Alberta to Nova Scotia has been working together since 1999 in the fields of statistical machine learning and data mining. 

Their goal is to extract patterns and rules from vast quantities of data, and develop better predictive and decision-making tools. This involves designing computer programs to process and classify existing data and then execute desired responses to future data. For example, by analyzing a set of medical records and associated patient diagnoses, a computer algorithm can deduce relationships between certain symptoms and diseases and then use this information to diagnose new patients.
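The medical-records example can be made concrete with a small sketch. The code below is only an illustration, not the team's method: it assumes a naive Bayes classifier (one common choice for this kind of task), and the records, symptom names, and the functions train and diagnose are all hypothetical. It counts how often each symptom appears with each diagnosis, then uses those counts to pick the most probable diagnosis for a new patient.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy records: (set of observed symptoms, diagnosis).
RECORDS = [
    ({"fever", "cough"}, "flu"),
    ({"fever", "cough", "fatigue"}, "flu"),
    ({"fever", "fatigue"}, "flu"),
    ({"sneezing", "itchy eyes"}, "allergy"),
    ({"sneezing", "cough"}, "allergy"),
    ({"itchy eyes"}, "allergy"),
]


def train(records):
    """Count how often each symptom co-occurs with each diagnosis."""
    diagnosis_counts = Counter()
    symptom_counts = defaultdict(Counter)
    vocabulary = set()
    for symptoms, diagnosis in records:
        diagnosis_counts[diagnosis] += 1
        for symptom in symptoms:
            symptom_counts[diagnosis][symptom] += 1
        vocabulary |= symptoms
    return diagnosis_counts, symptom_counts, vocabulary


def diagnose(symptoms, model):
    """Return the diagnosis with the highest naive Bayes log-score."""
    diagnosis_counts, symptom_counts, vocabulary = model
    total = sum(diagnosis_counts.values())
    best, best_score = None, -math.inf
    for diagnosis, n in diagnosis_counts.items():
        score = math.log(n / total)  # prior: how common the diagnosis is
        for symptom in vocabulary:
            # Laplace smoothing avoids zero probabilities for unseen pairs.
            p = (symptom_counts[diagnosis][symptom] + 1) / (n + 2)
            score += math.log(p if symptom in symptoms else 1 - p)
        if score > best_score:
            best, best_score = diagnosis, score
    return best


model = train(RECORDS)
print(diagnose({"fever", "cough"}, model))
```

Real systems work with far larger and messier records, where symptoms are rarely independent of one another, so this sketch shows only the shape of the idea: learn from labelled past cases, then apply what was learned to a new one.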

The team works with telecommunications, insurance, pharmaceutical and even music companies to help them understand the needs and predict the future behaviour of their customers. Bell Canada is now using name mining, a technology developed by the team, to predict long-distance phone patterns and target customers with individualized promotional materials.

MITACS ACCELERATE intern Shujie Li helped IT Interactive Services Inc. improve its personalized search for GenieKnows.com. Li used statistical learning and probabilistic modeling to study user activity and improve geographically constrained searches – for example, a query for coffee shops in New York City.

The project’s tools have also been of interest to Communications Security Establishment Canada (CSEC), the national cryptographic organization. “CSEC and Canada will benefit from research in statistical learning because of its potential to identify security risks and thus minimize disruption,” says Renaud Lévesque, Director General Core Systems for CSEC.

Not all of the project research involves solving a specific problem for a specific set of data. The team is also looking more broadly at how we learn. “The objective,” says Bengio, “is to progress in bold steps towards artificial intelligence.” This cannot be done by simply feeding vast amounts of information into a computer, due in part to the complex and often uncertain relationships between different data. Instead, methods need to be developed to enable agents to learn complex behaviours with minimal human intervention or prior knowledge.

Recently, the team has branched into the area of deep architectures, which tackles the processes involved in going from fine-level data to an abstract concept like a face. Suppose you are asked to look at an image and count the number of people in it. The data you are working from is just a collection of coloured pixels, yet your brain is able to see patterns (such as a cluster of skin-toned pixels) and identify a face.

In the deep architecture approach, there are intermediate layers of increasing complexity between the input (the pixels you see) and the output (the identification of a face). Each layer performs a processing task, for example, moving from images containing a single object to those containing two. Says Dr. Bengio, “The idea is inspired by the way children learn. They first learn simpler concepts and then build on them to learn more abstract ones. This process is more efficient in the presence of a teacher that carefully chooses the examples and the curriculum.”
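The layered structure can be sketched in a few lines. This is a toy illustration under stated assumptions, not the team's architecture: the four-pixel "image", the hand-set weights W1, W2, W3, and the functions layer and forward are all hypothetical, and in practice the weights would be learned from data rather than written by hand. Each layer turns the previous layer's output into slightly more abstract features before a final score is produced.

```python
import math


def relu(x):
    """Pass positive values through and clip negatives to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases, activation):
    """One layer: each unit is an activated weighted sum of the inputs."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical weights for a 2x2 image given row-major as 4 values in [0, 1].
# Layer 1 (low level): total brightness of the top row and of the bottom row.
W1 = [[1.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 1.0]]
B1 = [0.0, 0.0]
# Layer 2 (mid level): "top brighter than bottom" and "bottom brighter than top".
W2 = [[1.0, -1.0],
      [-1.0, 1.0]]
B2 = [0.0, 0.0]
# Output (high level): one score near 1 when the pattern is bright on top.
W3 = [[4.0, -4.0]]
B3 = [0.0]


def forward(pixels):
    """Run the pixels through the stacked layers, from input to output."""
    h1 = layer(pixels, W1, B1, relu)
    h2 = layer(h1, W2, B2, relu)
    return layer(h2, W3, B3, sigmoid)[0]


print(forward([1.0, 1.0, 0.0, 0.0]))  # bright top row: score close to 1
print(forward([0.0, 0.0, 1.0, 1.0]))  # bright bottom row: score close to 0
```

Recognizing a face would need many more layers and units, but the flow is the same: raw pixels enter at the bottom, each layer builds on the features of the one before it, and an abstract judgment comes out at the top.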

The learning opportunities have certainly been numerous for the team’s trainees. Postdoctoral alumnus Samy Bengio is now with Google, where he has access to some of the biggest problems in terms of both data and computing resources. His collaboration with the project remains strong. “Together we are using innovative machine learning approaches and Google databases to better understand the semantics of images and their associated textual descriptions,” says Samy.

“Yoshua has always been interested in solving real life problems arising from the availability of data using machine learning techniques,” notes Samy. “This actually convinced me that some of the most interesting problems occur only with very large amounts of data.”

Spinoff Success - ApSTAT Technologies

Founded in 2001 by a team of researchers from Université de Montréal as a MITACS spin-off, ApSTAT provides a family of products and services aimed at the technological and organizational deployment of data mining solutions. Ventures include risk estimation for the property and casualty insurance market, and models for commodity and foreign-exchange trading.

ApSTAT is taking advantage of the MITACS ACCELERATE internship program to continue to tap into expertise at the university. Intern Olivier Delalleau is working with one client to better classify audio signals. His doctoral research deals with the extraction of high-level features from raw data, but current algorithms are unlikely to work for this problem. “A major interest for me is finding ways to overcome the difficulties,” says Delalleau. “I believe this is a great opportunity to apply ideas in a realistic setting and gain a better understanding of the behavior of algorithms developed in our laboratory.”


