Machine learning techniques such as artificial neural networks and support vector machines are by now widely known within the tech world. In contrast, other means of data analysis like subgroup discovery and local modeling mostly remain niche methods. However, they actually have a high value proposition in diverse scenarios from business to science and thus deserve more attention. Subgroup discovery in particular is a very flexible framework that makes it easy to express practical analysis questions and to answer them effectively. In this post, I give an intuitive introduction to this framework, assuming only a mild familiarity with machine learning and data analysis. In particular, I'd like to approach the topic from a somewhat unconventional but hopefully insightful angle: the idea that in data analysis it is often surprisingly powerful to selectively say "I don't know".
Eirini Spyropoulou is one of the most uplifting and fun collaborators I have had the pleasure to work with: diligent, proficient, and yet pleasantly free of ego. Despite her young age, she has already gathered plenty of valuable experience on both sides of Data Science research, the academic as well as the industrial.
Before acquiring her PhD at the University of Bristol from 2009 to 2013, she worked as a professional software engineer for multiple companies. During her PhD she pushed the topic of subjectively interesting pattern discovery forward into the realm of multi-relational data [1, 2, 3]. After finishing, she put her freshly acquired scientific insights into practice by joining Toshiba Research as a research engineer. During this time she remained affiliated with her former research group in Bristol and in full touch with the developments of the academic community. Finally, in 2015, after another interlude of full-time research [4, 5], Eirini made the move to London to become a data scientist at Barclays (all statements and opinions expressed in this interview reflect only Eirini's personal point of view and are in no way endorsed by Barclays). In this interview, which was originally recorded via Skype in 2015, she shares her unique perspective on Data Science in industry and academia.
References
[1] E. Spyropoulou, T. De Bie, and M. Boley. Interesting pattern mining in multi-relational data. Data Mining and Knowledge Discovery, 28(3):808-849, 2014.
[2] E. Spyropoulou, T. De Bie, and M. Boley. Mining interesting patterns in multi-relational data with n-ary relationships. In International Conference on Discovery Science, pages 217-232. Springer, 2013.
[3] K.-N. Kontonasios, E. Spyropoulou, and T. De Bie. Knowledge discovery interestingness measures based on unexpectedness. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(5):386-399, 2012.
[4] J. Lijffijt, E. Spyropoulou, B. Kang, and T. De Bie. PN-RMiner: a generic framework for mining interesting structured relational patterns. International Journal of Data Science and Analytics, 1(1):61-76, 2016.
[5] M. van Leeuwen, T. De Bie, E. Spyropoulou, and C. Mesnage. Subjective interestingness of subgraph patterns. Machine Learning, pages 1-35, 2016.
A little more than one month ago, KDD 2015, the world's most renowned data mining conference, was held in Sydney. Part of it was the IDEA workshop on interactive and exploratory mining, in which I had the opportunity to present our first research paper on Creedo [1]. Altogether, there were a lot of interesting things happening there that are very relevant to the topic of this website. So it makes sense to have a post in which I review some of them, along with an extended discussion of a part of my talk. Here we go.
IDEA: The Knowledge Factory
The focus of IDEA ("Interactive Data Exploration and Analytics") is, as the name suggests, principles and practices of interactive data analysis systems. In particular, it deals with holistic analysis systems that are composed of different data mining and visual analytics techniques with the goal of offering a more meaningful and more effective user experience than each of their parts individually.
This was the third time I attended this workshop, after its initial event at KDD 2013 in Chicago and the second edition one year later in New York City. Big thanks to the organizers for keeping up this great line of events. For me, the series already resembles more or less a mini-conference of its own, with serious high-level contributions, a little poster session in the afternoon, and excellent and well-known keynote speakers. Regarding the latter, this incarnation featured Geoff Webb from Monash University and Jure Leskovec from Stanford.
Geoff's keynote in particular provided a perfect lead-in to my presentation; firstly, simply because his popularity made sure that not only was the last of the approximately 80-100 seats filled, but more crowd also gathered in the back of the room; secondly, because of the topic itself. He revisited "The Knowledge Factory" [2], an early interactive knowledge discovery tool that he developed during the 90s. More specifically, Geoff's web page refers to it as an "Expert System Development Environment". In a nutshell, it supports a process in which user and machine together construct an interpretable rule set for the description/prediction of some target variable. For that, Geoff emphasized two scientific challenges that were relevant back then as much as they still are today:
- Finding a common language in which user and machine can exchange information. To this end, The Knowledge Factory developed the paradigm of Case-based Communication. That is, the cases (examples, rows, data objects) serve as a common vocabulary: the system shows the user which cases are correctly/incorrectly covered and not covered by the current rule set, and the user can respond by specifying cases that she considers a high/low priority to be corrected in the next iteration (a minimal sketch of this loop follows after this list).
- Evaluating the interaction model (and its different underlying policies). The claim of the system is that algorithms and human users act effectively together. Thus, Geoff felt that the performance of the implementation had to be measured by a study with real users, which his group conducted successfully [3]. However, he went on to note that the amount of effort was very high, and, ultimately, this led his group to somewhat abandon interactive systems in favor of other topics that were less demanding to move forward.
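To make the case-based communication idea a bit more concrete, here is a minimal sketch of such an interaction loop on a toy dataset. All names and the crude rule representation are my own illustrative assumptions and are not taken from The Knowledge Factory.

```python
# Minimal sketch of case-based communication between user and system.
# A "case" is a dict of attribute values plus a binary label.
cases = [
    {"age": 25, "income": 20, "label": 0},
    {"age": 42, "income": 60, "label": 1},
    {"age": 37, "income": 55, "label": 1},
    {"age": 29, "income": 30, "label": 0},
]

# A rule is a predicate over a case; a rule set predicts 1 if any rule fires.
def predict(rule_set, case):
    return int(any(rule(case) for rule in rule_set))

def coverage_report(rule_set, cases):
    """Split cases by whether the current rule set classifies them correctly."""
    correct = [c for c in cases if predict(rule_set, c) == c["label"]]
    incorrect = [c for c in cases if predict(rule_set, c) != c["label"]]
    return correct, incorrect

# Iteration 1: the system proposes a (deliberately crude) rule set ...
rule_set = [lambda c: c["income"] > 25]
correct, incorrect = coverage_report(rule_set, cases)
print("correctly covered:", len(correct), "| misclassified cases shown to the user:", incorrect)

# ... the user replies in the same vocabulary, by marking cases whose
# treatment she considers a high priority to correct in the next iteration.
high_priority = [c for c in incorrect if c["label"] == 0]

# The system refines the rule set with these cases in mind, e.g., by adding
# a condition that excludes the flagged cases.
rule_set = [lambda c: c["income"] > 25 and c["age"] > 30]
correct, incorrect = coverage_report(rule_set, cases)
print("remaining misclassifications:", incorrect)
```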
I took up both of these themes in my talk in some form. The second is essentially the motivation for Creedo's central functionality of user-study automation. The first is a specific incarnation of what I refer to below as the abstraction gap between the sphere of the user and the formal sphere of the software system.
My Presentation: One System to Connect Two Worlds
The central motif of my talk (see slides here) was the gap that typically exists in data mining research papers between
- the motivation, which is based on the natural language description of some real-world data analysis problem, and
- the technical contribution, which is typically developed and tested against a set of simplifying formal criteria.
This transfer of the problem into a formal domain is an essential step in solution development, because without it we could not tap into the formal methodology and knowledge repository of our discipline. Also it does not necessarily pose a problem as long as this abstraction gap between the two worlds is not too large and/or the theoretical formalization is widely used and well accepted. It may, however, be problematic in relatively young research areas that deal with subjects that are inherently hard to formalize—such as the users of interactive data analysis systems.
Since the scientific process demands some form of evaluation in order to move forward (read: "to get a paper published"), one particular effect is that the community shifts its attention to performance metrics that can be formalized and measured more easily than what would actually be the prime measure of performance according to the application domain. An example of this is the focus on computational complexity in the early days of Pattern Discovery, which produced a relative overabundance of papers presenting algorithms to solve the "frequent itemset mining problem" (listing all conjunctive expressions in a database with an occurrence frequency exceeding some fixed threshold), which in itself has little to no value in terms of actual knowledge discovery.
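For readers who have not come across it, here is a minimal sketch of what the frequent itemset mining problem asks for; the brute-force enumeration below is purely illustrative and not one of the efficient algorithms those papers are about.

```python
from itertools import combinations

# Toy transaction database: each row is the set of items in one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
min_frequency = 0.5  # the fixed occurrence-frequency threshold

def frequency(itemset, transactions):
    """Fraction of transactions containing all items of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Brute-force enumeration of all conjunctive expressions (itemsets)
# whose frequency exceeds the threshold -- exponential, illustration only.
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        freq = frequency(set(c), transactions)
        if freq >= min_frequency:
            frequent[frozenset(c)] = freq

for itemset, freq in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(set(itemset), freq)
```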
With this problem in mind, I went ahead and presented how Creedo study designs are specifically built to bridge this abstraction gap and to meaningfully re-connect the formal domain to the natural-language/real-world domain of the applications. How do they achieve this? Mainly through the definition of Creedo Tasks, which are a representation of a real-world problem that provides suitable facets of it to the entities living on both sides of the abstraction gap: the participants of a study on the one side, and the analysis system(s) under investigation on the other. Namely, a Creedo Task consists of:
- a human-readable set of instructions on how and with what purpose to operate the data analysis system (capturing the natural-language problem statement of the application more or less literally),
- a formal definition of the states of the analysis system in which the task is considered solved (e.g., "three patterns have been found by the user"),
- a formal definition of what, in such a state, exactly counts as the tangible result of the task (e.g., "all three patterns considered jointly as a pattern set"), and
- formal metrics for the evaluation of these results, along with potentially human-readable instructions of how to apply them.
These tasks provide the context for running a study. Driven by the task instructions, participants operate a system until a solution state is reached. Creedo can then extract the results from these states and either evaluate them automatically or issue evaluation assignments to human evaluators based on the formal result metrics. Result evaluations can then be aggregated upwards into system performance metrics. This finally makes it possible to compare different systems according to simple numeric quantities, and ultimately to provide an extrinsic evaluation of a contribution as desired; that is, an evaluation that goes beyond the simplifying theoretical assumptions that were initially posed to make solution development tractable and reaches back to the original natural-language problem statement.
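To illustrate the structure of such a task, here is a rough sketch in Python. Note that Creedo itself is a Java framework; the names below (Task, is_solved, extract_result, and so on) are hypothetical illustrations of the four ingredients, not Creedo's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical sketch of the four task ingredients described above.
@dataclass
class Task:
    # 1. Human-readable instructions, close to the natural-language problem statement.
    instructions: str
    # 2. Predicate on system states deciding when the task counts as solved.
    is_solved: Callable[[dict], bool]
    # 3. Extraction of the tangible result from a solution state.
    extract_result: Callable[[dict], object]
    # 4. Metrics (automatic or issued to human evaluators) applied to the result.
    result_metrics: List[Callable[[object], float]] = field(default_factory=list)

task = Task(
    instructions="Use the system to find three patterns that describe high-risk customers.",
    is_solved=lambda state: len(state["result_patterns"]) >= 3,
    extract_result=lambda state: state["result_patterns"][:3],
    result_metrics=[lambda patterns: sum(p["quality"] for p in patterns) / len(patterns)],
)

# A study run: once a participant reaches a solution state, the framework
# extracts the result and aggregates metric values into system-level scores.
state = {"result_patterns": [{"quality": 0.8}, {"quality": 0.6}, {"quality": 0.7}]}
if task.is_solved(state):
    result = task.extract_result(state)
    scores = [metric(result) for metric in task.result_metrics]
    print(scores)  # per-result scores, to be aggregated over participants/systems
```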
Beyond IDEA: Rademacher Averages
Well, given my claim above about an arguably too strong focus on computational complexity in Pattern Discovery, perhaps now is the time to put that into perspective. Clearly, after you have established what you have to compute, it is of course a relevant question to ask how to do it. After all, no knowledge discovery technique could be turned into actual usable technology if there wasn't a software embodiment of it that is capable of processing input data relevant to a wide range of users. So beautiful algorithmic theory has a lot of purpose, especially if it captures real input requirements and is highly generic in terms of the number of specific computational problems it can be applied to.
So with this introduction, I would like to close this post by diving into a bit of algorithmic theory. Namely, my favorite paper of the KDD 2015 main conference is "Mining Frequent Itemsets through Progressive Sampling with Rademacher Averages" by Matteo Riondato and Eli Upfal [4]. The paper describes how Rademacher averages, a tool from computational learning theory, can be applied to pattern discovery; more precisely, to computing an approximate collection of frequent itemsets on a potentially very small subsample of the complete input dataset with controlled approximation guarantees (they also have a nice tutorial online).
Why is this important? First of all, you might end up with a much faster algorithm for computing frequent itemsets on big datasets, which you could then use to compute, e.g., association rules or any other knowledge discovery "end product". However, the results of the paper are far more general than that. In fact, the paper provides an efficient meta-algorithm for generating an (as small as possible) data sample that guarantees, with high probability, that the occurrence frequencies of all conjunctive pattern descriptors on the sample are approximately correct. This is huge, because a lot of utility functions in pattern discovery are actually functions of frequency counts of conjunctive expressions. For example, Matteo's and Eli's algorithm could be used just as well to compute the strength or lift of an association rule (no matter whether you obtained it via the traditional workflow, a more modern approach such as direct sampling, or even by hand as with MIME).
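To give a flavor of how such a sample-based approximation could be consumed, here is a toy sketch of estimating itemset frequencies, and a frequency-based utility such as lift, on a uniform subsample. The crucial contribution of the paper, the data-dependent sample size derived from an empirical Rademacher average, is only stubbed out with a placeholder here, so this is not Riondato and Upfal's actual algorithm.

```python
import random

random.seed(0)

# Synthetic transaction database: 100,000 random transactions over four items.
items = ["a", "b", "c", "d"]
dataset = [frozenset(i for i in items if random.random() < 0.4) for _ in range(100_000)]

def sample_size_via_rademacher(dataset, epsilon, delta):
    # Placeholder: the paper derives this size from an empirical Rademacher
    # average of the class of conjunctive patterns; a fixed number stands in.
    return min(len(dataset), 2000)

def approximate_frequencies(dataset, itemsets, epsilon=0.05, delta=0.1):
    """Estimate itemset frequencies on a single uniform subsample."""
    m = sample_size_via_rademacher(dataset, epsilon, delta)
    sample = random.sample(dataset, m)
    return {q: sum(q <= t for t in sample) / m for q in itemsets}

# Any utility that is a function of such frequencies, e.g., the lift of an
# association rule A -> B, can then be computed from the sample alone.
def lift(freqs, antecedent, consequent):
    return freqs[antecedent | consequent] / (freqs[antecedent] * freqs[consequent])

queries = [frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})]
freqs = approximate_frequencies(dataset, queries)
print(freqs)
print("lift(a -> b):", lift(freqs, frozenset({"a"}), frozenset({"b"})))
```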
Therefore, I would love to see this approach as a general building block in integrated pattern discovery solutions. In particular, I would love to see it in realKD as an encapsulated framework functionality that can be utilized by implementers of algorithms with a call as simple as measure.getApproximation(epsilon, delta). However, there are still some problems to be solved. For one, relatively strong bounds on frequency counts only induce relatively weak bounds on other functions, which in turn could lead to impractically large subsample sizes. So perhaps those functions could be targeted by a more direct application of learning theory. Also, there are important pattern languages and utility functions that cannot simply be expressed as a function of frequency counts (e.g., the applications' favorite: Subgroup Discovery). In any case, this is rather a question of how much further we can push the limits, because already in its current form this is something definitely worth including in any toolbox, and I hope we can do something about this as soon as possible.
This is it for this time. I hope there was an interesting thing or two in it for you. Any comments/questions, either in the comment section below or privately via e-mail, are of course highly welcome.
References
[1] M. Boley, M. Krause-Traudes, B. Kang, and B. Jacobs. Creedo - Scalable and repeatable extrinsic evaluation for pattern discovery systems by online user studies. In ACM SIGKDD Workshop on Interactive Data Exploration and Analytics (IDEA), pages 20-28, 2015.
[2] G. I. Webb. Integrating machine learning with knowledge acquisition through direct interaction with domain experts. Knowledge-Based Systems, 9(4):253-266, 1996.
[3] G. I. Webb, J. Wells, and Z. Zheng. An experimental evaluation of integrating machine learning with knowledge acquisition. Machine Learning, 35(1):5-23, 1999.
[4] M. Riondato and E. Upfal. Mining frequent itemsets through progressive sampling with Rademacher averages. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1005-1014, 2015.
The User-Study Conspiracy Theory
For most of my academic life I have worked on Pattern Discovery algorithms; perhaps too much, because nowadays I cannot help but count pattern occurrences everywhere I go. One of those patterns, which I want to call "The User Study Conspiracy Pattern" here, I have discovered in recent years during Data Mining research presentations. It roughly goes as follows.
Independent of whether I was at a large-scale presentation at an international conference such as KDD or ECMLPKDD or at a more intimate PhD defense, I often witnessed a variant of the following dialog during the Q&A part:
Q: This looks interesting, but is this really what users would want?
A: Well, I guess in order to really confirm that we would need to test this somehow with real users.
Q: Yes, I agree. Thank you.
Alright, so much for that. Next question. It almost feels like there is an unspoken agreement:
“Yes, for many of our contributions we would actually need to perform user-studies in order to evaluate them, but… we won’t (and won’t call each other out on this).”
Sounds like a conspiracy theory? I admit, between selective hearing and a general preference for attending talks that lend themselves to user-based evaluations, there are quite a few reasons not to jump to conclusions. Anyway, we might have found one (or two) interesting questions here:
"What is the fraction of Data Mining research that could have benefited from a user study, but where the authors decided not to perform one? And what are the reasons in these cases?"
So I joined forces with Thomas Gärtner and Bo Kang in order to shed some empirical light on these two questions. The idea was to perform an online poll, in which we would ask Data Mining authors directly whether they felt that user studies would be beneficial for their own work.
The Poll
The first question we had to answer when designing the poll was which authors to invite to participate. While an open call on a collection of known mailing lists (perhaps even with a request to re-share) would have maximized our reach, we decided that we wanted to target a more cleanly defined population. Since it was March and the ECMLPKDD deadline was approaching, we found that "the ECMLPKDD authors" were a good choice. To turn this idea into an operational definition, we further refined our target group to all authors of ECMLPKDD articles who provided a contact email address on the paper that was still valid on the day of our poll invitation, which was March 28.
Then, of course, we had to design the content. Since we wanted to maximize the response rate, we made it a priority not to ask for more than 5 minutes of the participants' time. Consequently, we crafted a minimalist questionnaire focused on the two core questions we wanted to answer: what is the fraction of people who skip potentially insightful user studies, and what are their reasons for that? One additional question that we inserted was about the authors' field of work. The rationale here was that, while the ECMLPKDD authors are a clearly defined population, they are a quite heterogeneous one, and we expected results to vary quite a bit, e.g., between Machine Learning researchers, who often work on clearly defined formal problems (that do not lend themselves naturally to user studies), and pattern discovery or visual analytics researchers, who often try to develop intuitive data-exploration tools for non-expert users. So we wanted to be able to see differentiated results between those groups.
The final list of three questions, quoted literally, was then as follows:
- Question 1: "In which of the following areas have you had original research contributions (published or unpublished) during the last two years?"
Possible answers (multiple possible):
  - "Machine Learning",
  - "Exploratory Data Analysis and Pattern Mining",
  - "Visual Analytics",
  - "Information Retrieval", and
  - "Other".
- Question 2: "During the last two years, was there a time when you had an original contribution that could have benefited from an evaluation with real users, but ultimately you decided not to conduct such a study?"
Possible answers (exclusive choice): "Yes" or "No".
- Question 3: "In case you answered "yes" to question 2, which of the problems listed below prevented you from conducting a user study?"
Possible answers (multiple possible):
  - "No added benefit of user study over automatized/formal evaluation",
  - "Unclear how to recruit suitable group of participants",
  - "High cost of developing convincing study design",
  - "High cost of embedding contribution in a system that is accessible to real users",
  - "High cost of organizing and conducting the actual study",
  - "High cost of evaluating study results",
  - "Insecurity of outcome and acceptance by peers", and
  - "Other".
The Results
Out of the 525 authors who were invited to participate, 125 responded already within the first 5 days after we opened the poll. We deliberately did not include a deadline in the invitation email and instead simply kept accepting answers until we had not received any more responses for more than 7 days (which was the case by the end of April). Ultimately, we ended up with 136 respondents, which corresponds to a very solid response rate of 25.9%.
So let's look at the results, starting with the central question: how much research is not backed up by a user study although it would make sense to conduct one? The following image contains the results for this question (differentiated by the sub-fields queried in the first question). Note that, for this summary, we filtered out 4 responses that did not give any of those 4 core areas as their area of work and thus do not really represent data mining research.
Throughout the whole set of respondents, the answer was "yes" in almost 50% of the cases. Restricted to authors who stated Machine Learning as one of the areas of their contributions (which is by far the dominating subject area), the "yes"-rate was, at 48.57%, slightly lower, but altogether surprisingly close to that of the Exploratory Data Analysis/Pattern Discovery group (52.08%). In the smaller groups of Visual Analytics and Information Retrieval, "yes" was the dominant answer with 75% and 70.59%, respectively. Note that the different subgroups overlap because multiple research topics could be stated.
With this substantial number of "yes"-answers across all the different sub-populations, let us take a look at the remaining question: what are the reasons for the reluctance of authors to do something that, according to their own assessment, could be beneficial for their work? In the following image we can see the distribution among the different reasons for the complete population of "yes"-responses as well as individually for the three sub-populations that had a "yes"-rate of more than 50%.
The dominant individual reasons were "cost of conducting the actual study" (61.45%) and "unclear how to recruit participants" (56.92%). There are some interesting shifts between the sub-populations. Apparently, Pattern Discovery and Visual Analytics researchers worry more about where to recruit study participants than Information Retrieval researchers do (which might be due to the fact that the first two tend to work more on systems for experts, which cannot be tested by a general audience). Also, the Visual Analytics group worries notably more about "the cost to evaluate the study results" (which might be due to the fact that performance metrics are harder to establish in this area).
Returning to the big picture, it should be noted that at least one of the "cost" items was chosen 98.5% of the time. That is, almost unanimously, authors who skipped a study opportunity gave a cost aspect as at least one of their reasons. Finally, an important observation is that the first reason, "no added benefit", was stated in only a small fraction of cases (6.15%).
The Bottom Line
So what to take from all this? I believe that even if we consider the rather defensive phrasing of Question 2 and acknowledge a certain gap between responding ECMLPKDD authors and “all of Data Mining research”, the data presented here allows a rather clear interpretation:
A substantial fraction of active Data Mining researchers would potentially be interested in adding an empirical layer to the support of their work if doing so had a lower cost/benefit ratio.
Now one could argue for increasing the benefit part through the negative incentive of penalizing those papers that would lend themselves to an extrinsic empirical evaluation but are lacking one. However, this would require a lot of players to change their behavior without having an individual incentive, so from a game-theoretic point of view there is some doubt that this approach is actually feasible. Even if it worked, it could lead to a somewhat unproductive division between empiricists and those who fight for their right to abstain (conspiratorially or openly).
Instead, I believe that it is a more promising approach to try to reduce the costs (and needless to say, I hope realKD and Creedo can make a contribution here). If it is more affordable to design, conduct, and evaluate studies with real users, then those authors who feel the most need to do so can go ahead and test their theoretical assumptions from a different angle. These theoretical assumptions, once properly evaluated, can then be used by other authors to develop improved methods within purely algorithmic papers. This way, we could establish a cooperative culture in which individual pioneers can afford to go the extra mile and successively create an empirical pillar of support for the joint research activities of the community as a whole.