The realKD library is a free, open-source Java library designed to help real users discover real knowledge from real data. Its main purpose is to be embedded in other Java applications that want to offer their users knowledge discovery functionality based on highly innovative algorithms from the academic data mining community. In addition, all major functions of the library can be used stand-alone from the command line. The source code is available on Bitbucket under the MIT License.
The realKD Data Model
The realKD data model (as of version 0.1.1) provides a unified input source for various pattern discovery algorithms, independent of whether they work on “natural” tabular data, binary data matrices, or time series. Attributes (columns) of an input table can be annotated to form semantic groups such as a time series or a hierarchy. On top of that, there is a large set of logical factories for different attribute types that can provide a propositional logic as a binary view on a data table. This is the lens through which most traditional pattern discovery algorithms have to see data. In realKD, however, the original attribute semantics are not lost, so expressive semantic constraints remain available in the binary view. Ultimately, this enables the discovery of more meaningful result patterns than simple prototypical implementations that are oblivious to the origin of the binary data.
Available Algorithms
As of version 0.1.1, realKD can be used to discover associations (itemset patterns), exceptional model patterns (subgroups), and subspace outliers from tabular data, using different deterministic and randomized algorithms. In particular, it contains the following highly innovative contributions from the research community: 2-Step Pattern Sampling algorithms [1] for datasets with very large pattern spaces, Diverse Subgroup Set Discovery [2] for subgroup discovery in domains with a lot of redundancy in potential findings, and the Cumulative Jensen-Shannon Divergence [3] as a deviation measure for user-friendly Subgroup Discovery that does not require users to know the distribution of their data.
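To give a rough feel for why sampling helps when the pattern space is too large to enumerate, the following is a minimal sketch of a two-step procedure that draws an itemset with probability proportional to its frequency. It is not taken from the realKD source, does not use the realKD API, and does not reproduce the algorithm of [1]; the class `TwoStepSamplingSketch` and its methods are hypothetical names chosen for illustration. Each row is assumed to be given in a binary view, that is, as the set of ids of the propositions that hold for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; not part of the realKD API. */
public class TwoStepSamplingSketch {

    /**
     * Draws one itemset with probability proportional to its frequency.
     * Each row is the set of item (proposition) ids that hold for that row.
     */
    public static Set<Integer> sample(List<Set<Integer>> rows, Random rng) {
        if (rows.isEmpty()) {
            throw new IllegalArgumentException("need at least one row");
        }
        // Step 1: pick a row with probability proportional to 2^|row|,
        // i.e. to the number of itemsets that this row supports.
        double total = 0.0;
        for (Set<Integer> row : rows) {
            total += Math.pow(2.0, row.size());
        }
        double threshold = rng.nextDouble() * total;
        Set<Integer> chosen = rows.get(rows.size() - 1);
        double cumulative = 0.0;
        for (Set<Integer> row : rows) {
            cumulative += Math.pow(2.0, row.size());
            if (threshold < cumulative) {
                chosen = row;
                break;
            }
        }
        // Step 2: pick a uniformly random subset of the chosen row by
        // keeping each item independently with probability 1/2.
        Set<Integer> pattern = new TreeSet<>();
        for (int item : chosen) {
            if (rng.nextBoolean()) {
                pattern.add(item);
            }
        }
        return pattern;
    }

    /** Counts the rows that contain every item of the pattern. */
    public static int support(List<Set<Integer>> rows, Set<Integer> pattern) {
        int count = 0;
        for (Set<Integer> row : rows) {
            if (row.containsAll(pattern)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Toy binary view: three rows over the proposition ids 0..3.
        List<Set<Integer>> rows = new ArrayList<>();
        rows.add(Set.of(0, 1, 2));
        rows.add(Set.of(0, 1));
        rows.add(Set.of(1, 3));
        Random rng = new Random();
        for (int i = 0; i < 5; i++) {
            Set<Integer> pattern = sample(rows, rng);
            System.out.println(pattern + " support=" + support(rows, pattern));
        }
    }
}
```

The reason this works: a fixed itemset F can only be produced from a row that contains it, and each such row contributes a probability of (2^|row| / Z) · (1 / 2^|row|) = 1 / Z, where Z is the sum of 2^|row| over all rows. Summing over the rows that contain F gives support(F) / Z, so frequent itemsets are drawn more often without ever listing the full pattern space. Note that the empty itemset can be drawn as well; a caller that does not want it can simply resample.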
[1]
@inproceedings{kdd/BoleyMG12,
  author = {Boley, Mario and Moens, Sandy and G{\"{a}}rtner, Thomas},
  title = {Linear space direct pattern sampling using coupling from the past},
  booktitle = {The 18th {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining, {KDD} '12, Beijing, China, August 12-16, 2012},
  pages = {69--77},
  year = {2012}
}
[2]
@article{van2012diverse,
  author = {van Leeuwen, Matthijs and Knobbe, Arno},
  title = {Diverse subgroup set discovery},
  journal = {Data Mining and Knowledge Discovery},
  volume = {25},
  number = {2},
  pages = {208--242},
  year = {2012},
  publisher = {Springer US}
}
[3]
@inproceedings{pkdd/nguyenV15,
  author = {Nguyen, Hoang-Vu and Vreeken, Jilles},
  title = {Non-Parametric Jensen-Shannon Divergence},
  booktitle = {ECMLPKDD},
  year = {2015}
}