Introduction to the course David Barber


21 January Room 1.03 Malet Place Engineering Building

14:00 9 Laws of Data Mining

Tom Khabaza www.khabaza.com

Abstract:

In this lecture I will explain nine maxims or "laws" of data mining. The CRISP-DM methodology standard gives an accurate description of how data mining is performed, but says little about why the data mining process has this particular form. The "nine laws" are intended to be useful maxims, but also to begin a theory of why data mining is the way it is. The main body of this lecture will state, explain and evaluate the nine laws, but I will also add a postscript about data mining tools - work in progress on the comparison of the visual workbench approach embodied by Clementine and the programming approach embodied by R.

Bio:

Tom helps organisations improve their marketing and customer processes, and to improve their efficiency, risk analysis and fraud detection, through new knowledge and predictive capabilities extracted from their data.

Tom has worked in the field of data mining for 17 years, and is one of the authors of the world-leading Clementine data mining workbench, and of the CRISP-DM industry standard data mining methodology. Tom has undertaken data mining for organisations in a wide range of sectors including banking, telecommunications, retail, manufacturing, media, science and medicine, government and law enforcement, in addition to data mining R&D with a wide range of commercial and academic partners.

15:00 Last.fm: how to organize digital music in a large social network presentation

Norman Casagrande last.fm

Abstract:

In the last few years digital music has transformed the landscape of music experience and distribution. It is not uncommon to find personal music collections that exceed thousands of tracks, and thanks to the Internet, finding and accessing music has become simpler than ever. As a result, it is becoming increasingly hard to arrange and navigate the large databases of music that are available to the user.

Last.fm is an online music community that operates in that space. We have a large set of data which includes user listening habits and audio data. This data is processed through a series of tools which include statistical, combinatorial and audio analysis that allow us to provide recommendations, create playlists, measure user/item similarity, apply spam filtering, and more.

In this talk we will give an overview of some of the problems we are confronted with, and our approaches to solving them.

Bio:

Norman Casagrande joined Last.fm in 2006 as the head of music research. Since then he has been working on a wide range of problems, including collaborative filtering for user/item similarity and recommendation, dealing with scalability, dynamic playlist generation, user insight, audio and semantic analysis, fingerprinting, spam fighting, and many other related topics.



28 January Room 1.04 Malet Place Engineering Building

14:00 Compliance Architectures – The Implications For Machine Learning presentation

Michael Mainelli

Abstract:

Compliance is important and costly for larger listed and regulated organisations. Yet despite compliance’s increasing size and impact, the systems community doesn’t seem to understand how it can apply measures, structure and automation to this increasingly expensive area. To a degree, the rise in formal compliance has been rapid, following the Enron scandals, Sarbanes-Oxley, anti-money laundering and now recent responses to the credit crunch. In addition, compliance is a good example of a “diffuse system”, everywhere but nowhere. Professor Michael Mainelli will outline the advantages and limitations of current measure-manage-motivate approaches to operational risk and compliance, such as basic indicators, standards, total quality management (TQM), operational cost variance, value-at-risk and procedural fiat. Michael will explain how Z/Yen have incorporated “environmental consistency confidence” utilising these approaches within a wider enterprise risk/reward framework for a few leading institutions. Finally, Michael will explore new techniques that Z/Yen are trialling, backed by research, particularly prediction markets and dynamic anomaly & pattern response systems.

Bio:

Michael leads Z/Yen Group, the City of London’s leading commercial think tank promoting societal advance through better finance and technology. Michael co-founded Z/Yen in 1994 after a career as a research scientist in aerospace & cartography and then as an accountancy-firm partner. Michael’s financial clients include banks, exchanges and insurers. Michael has won a Smart Award for prediction systems and a Foresight Challenge award for financial research, has been named UK IT Director of the Year and has served on the board of Europe’s largest R&D organisation. Michael is Emeritus Professor and Fellow at Gresham College, as well as a visiting Professor at LSE. Michael created the Farsight Award for long-term investment research, created the $15M London Accord ‘open source’ research cooperative into climate change economics, and created the Global Financial Centres Index. Michael’s most recent book, The Road To Long Finance: A Systems View Of The Credit Scrunch, is about the credit crunch and is free online. Michael’s humorous risk/reward management novel, “Clean Business Cuisine: Now and Z/Yen”, written with Ian Harris, was published in 2000; it was a Sunday Times Book of the Week; Accountancy Age described it as “surprisingly funny considering it is written by a couple of accountants”.

15:30


4 February Room 1.03 Malet Place Engineering Building

14:00 Machine Learning applications in biomedicine presentation

Dimitrios Athanasakis National Institute for Medical Research

Abstract:

Modern biology, through the advent of mass throughput technologies, produces huge quantities of data on a daily basis. While Machine Learning techniques have been successfully applied to a number of prediction problems arising in the analysis of biological systems, technical and cultural barriers hinder their broader adoption in the field. I will use case studies, such as the successful discovery of accurate and practical diagnostics for Tuberculosis and a system for accelerated drug design for Malaria, to show how current machine learning research translates into applications for the biomedical domain. In doing so I will also illustrate the interplay between the different disciplines involved in the practical deployment of these methodologies.

15:00 Mining large data sets for sentiment

presentation

Toby Mostyn meaningmine.com

Abstract:

The talk will focus on the subject of gaining insights into large amounts of unstructured data. This will include an overview of the Meaningmine sentiment analysis product and a discussion of appropriate information retrieval and information extraction techniques.



11 February Room 1.03 Malet Place Engineering Building

14:00 A brief introduction to intellectual property rights presentation

Alexander Korenberg

Kilburn and Strode LLP



25 February Room 1.03 Malet Place Engineering Building

14:00 Machine Learning Applied to Security presentation

Steve Poulson Scansafe

Abstract:

Scansafe works on the prevention, detection and elimination of spyware, viruses, phishing, illicit use of third-party databases, and malware threats such as JavaScript, executable and non-executable files; on blocking unsolicited content and inappropriate web content such as hate pages; and on establishing a history and reputation for pages, i.e. the probability of a page being a data phishing or malware hosting URL.

Scansafe is the world leader in ‘Web security-as-a-service’, which means organizations of all sizes can be protected against web malware attacks and have safe, productive use of the Web, without hardware, upfront capital, or IT management costs.

Bio:

Steve is responsible for implementing advanced pattern matching technologies to spot malicious behaviour and to classify content. He has fifteen years’ experience in state-of-the-art Internet and pattern matching technologies and has worked for companies including Deloitte Consulting, Virgin, BBC and various defense contractors.



4 March Room 1.04 Malet Place Engineering Building

14:00 Large Biological Networks Analysis and Applications in Pharmaceuticals Research

presentation

Steven Barret

Glaxo Smith Kline



11 March Room 1.04 Malet Place Engineering Building

14:00 Machine Learning Methods on functional MRI Data pres1 pres2

Janaina Mourao-Miranda UCL/KCL/Siemens

Abstract:

Recently machine learning approaches (e.g. SVM) have been used to analyze fMRI data. In these applications fMRI scans are treated as spatial patterns and statistical learning methods are used to identify statistical properties of the data that discriminate between brain states (e.g. cognitive task 1 vs. cognitive task 2) or groups of subjects (e.g. patients vs. healthy controls). In this talk I will present the early stage of these developments made at the department of Neural Computation, Siemens AG, and more recent developments made at KCL and UCL.



18 March Room 1.04 Malet Place Engineering Building

14:00 Matchbox: Large Scale Online Bayesian Recommendations presentation

Thore Graepel

Microsoft Research

Abstract:

We present a probabilistic model for generating personalised recommendations of items to users of a web service. The system makes use of content information in the form of user and item meta data in combination with collaborative filtering information from previous user behaviour in order to predict the value of an item for a user. Users and items are represented by feature vectors which are mapped into a low-dimensional 'trait space' in which similarity is measured in terms of inner products. The model can be trained from different types of feedback in order to learn user-item preferences. Here we present three alternatives: direct observation of an absolute rating each user gives to some items, observation of a binary preference (like/don't like) and observation of a set of ordinal ratings on a user-specific scale. Efficient inference is achieved by approximate message passing involving a combination of Expectation Propagation (EP) and Variational Message Passing. We also include a dynamics model which allows an item's popularity, a user's taste or a user's personal rating scale to drift over time. By using Assumed-Density Filtering (ADF) for training, the model requires only a single pass through the training data. This is an on-line learning algorithm capable of incrementally taking account of new data so the system can immediately reflect the latest user preferences. We evaluate the performance of the algorithm on the MovieLens and Netflix data sets consisting of ~1,000,000 and ~100,000,000 ratings respectively. This demonstrates that training the model using the on-line ADF approach yields state-of-the-art performance with the option of improving performance further if computational resources are available by performing multiple EP passes over the training data.
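
As a rough illustration of the trait-space idea in the abstract above, the sketch below maps a user feature vector and an item feature vector into a shared low-dimensional space and scores the pair by an inner product. It is not the Matchbox implementation: the function names, the matrices U and V and all numbers are hypothetical, and the matrices are fixed by hand here, whereas the abstract describes inferring the model by message passing (EP, Variational Message Passing, ADF).

```python
# Minimal sketch of inner-product scoring in a low-dimensional trait space.
# All names and values are hypothetical; nothing here is learned from data.

def project(matrix, features):
    """Map a feature vector into trait space (one dot product per matrix row)."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]

def affinity(user_matrix, item_matrix, user_features, item_features):
    """Score a user-item pair as the inner product of their trait vectors."""
    s = project(user_matrix, user_features)
    t = project(item_matrix, item_features)
    return sum(a * b for a, b in zip(s, t))

# Toy setup: 3 user features and 2 item features mapped to 2 traits.
U = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
V = [[0.5, 0.0],
     [0.0, 2.0]]

user = [1.0, 0.0, 1.0]
item_a = [1.0, 0.0]
item_b = [0.0, 1.0]

score_a = affinity(U, V, user, item_a)
score_b = affinity(U, V, user, item_b)
print(score_a, score_b)
```

In this toy setup item_a scores higher than item_b for the given user, so it would be ranked first.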



25 March Student Presentations

Song Hayden O'Neill Alcantara Moni Korokithakis Agathocleous Roger