DBMiner: Adaptive Machine Learning System
The industry's standard modeling tools do a great job of solving moderately complex problems, but they still require a significant amount of manual effort when the data involves varied complexities and relationships that are unstable and hard to locate. DBMiner was developed to supplement the dominant modeling tools, empowering our clients to locate latent business opportunities in the data.
The software was designed and optimized to provide a scalable, flexible, and embeddable solution to predictive modeling problems. Priority has been given to ensuring that all components are embeddable in a wide variety of production systems. The system is written in Java and provides SOAP and REST interfaces for non-Java architectures.
Main Ideas and Principles
- Complex feature creation: transforms raw data into descriptive numerical examples readable by machine learning systems, forcing the complexity of the model into the confines of the problem space.
- Sequential training: machine learning models that can be trained incrementally, handling extremely large data sets and data streams quickly and supporting large-scale training and high-throughput testing.
- Regularization: many different regularization techniques are available and can be mixed and matched, controlling the complexity of the model and preventing overfitting.
- Model selection: hyper-parameter tuning that finds the best model from the many candidate models.
- Evaluation: a rich set of evaluation metrics is generated, enabling the assessment of the performance of predictive systems.
- Configurability: machine learning systems can be designed by specifying the layout of the above components in a YAML file. This allows non-programmers to build production-quality systems.
Feature Creation
The most crucial component of any machine learning system is the way data is consumed, cleaned, filtered, and transformed into numerical arrays for input to the core algorithms. Our machine learning package employs functional programming to enable the seamless composition of transforms, filters, and projections, in addition to a flexible system that allows for various input channels. A number of common data transforms are currently implemented, allowing raw data to be transformed into complex, expressive "feature vectors". Additionally, filtering components are in place to remove data that does not satisfy user-defined rules. Specific types of values can also be removed according to automatic or user-defined criteria (for instance, values considered to be outliers). Feature selection may also be employed, selecting only those features believed to be most valuable. A wide variety of data projections are also available, transforming input from one space to another. This may reduce data to a more manageable size, remove correlated features, or create new features that offer additional predictive power. Data transformations can be performed in sequence or simultaneously in parallel processes, to be combined later, allowing multiple data strategies to be used together. Finally, it should be noted that the functional implementation is very efficient: DBMiner takes great care to ensure that object creation is minimized and that never more than the absolute minimum amount of data is kept in memory.
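The compositional style described above can be sketched with plain `java.util.function` types. This is an illustrative example, not the actual DBMiner API: the class and method names (`FeaturePipeline`, `transform`, `inRange`, `run`) and the log-space outlier rule are all assumptions.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch (not the DBMiner API) of functional composition:
// transforms chain with andThen, and filters drop records that fail
// user-defined rules, all evaluated lazily over a stream.
public class FeaturePipeline {
    // A composed transform: parse a raw string, then project into log space.
    public static Function<String, Double> transform() {
        Function<String, Double> parse = s -> Double.parseDouble(s.trim());
        Function<Double, Double> logScale = Math::log1p;
        return parse.andThen(logScale);
    }

    // A filter rule: treat anything outside [0, 10) in log space as an outlier.
    public static Predicate<Double> inRange() {
        return x -> x >= 0.0 && x < 10.0;
    }

    public static List<Double> run(List<String> raw) {
        return raw.stream().map(transform()).filter(inRange()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "1000000" maps to log1p ~ 13.8 and is filtered out as an outlier.
        System.out.println(run(List.of(" 1.0", "2.5", "1000000")).size()); // 2
    }
}
```

Because each stage is a plain function, new transforms compose with existing ones without modifying the pipeline code, which is the property the paragraph above emphasizes.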
Predictive Modeling
While a variety of machine learning techniques have been implemented, DBMiner puts special emphasis on techniques that allow for sequential training. While such techniques occasionally lack the predictive complexity of "batch only" models, sequential training facilitates training on arbitrarily large datasets. A number of efficient, well-tested models have been implemented, often with many configurable elements, potentially offering greater predictive power if properly set. Models can also be serialized in a number of ways for later use. Among the techniques implemented are regularized cost-sensitive logistic regression, maximum-margin linear methods such as support vector machines and confidence-weighted passive-aggressive models, online kernel methods, and a rich naive Bayes suite capable of modeling likelihoods with a variety of distributions as well as a mixture of models for likelihood estimates. We implement general optimization procedures to allow for the training of models that optimize user-defined, task-specific loss functions (rather than simply maximizing likelihood), offering a greater degree of flexibility than many other machine learning systems. Furthermore, in some cases, the objectives of several models can be combined into one meta-objective. Ensemble methods are also included, whereby several models are combined into a single aggregate model, potentially offering greater predictive power than would be possible using any single model. Using the DBMiner modeling substrate, a large set of machine learning models such as nearest-neighbor models, Gaussian processes, and radial basis function networks can be implemented with minimal effort.
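A minimal sketch of what sequential training looks like, assuming a simple logistic regression updated one example at a time with stochastic gradient descent. The class name, learning rate, and update rule here are illustrative assumptions, not DBMiner's actual implementation:

```java
// Hypothetical sketch of sequential (online) training: logistic regression
// updated one (x, y) pair at a time, so the full data set never needs to
// fit in memory. Not part of the DBMiner API.
public class OnlineLogistic {
    private final double[] w;
    private final double lr;

    public OnlineLogistic(int dim, double lr) { this.w = new double[dim]; this.lr = lr; }

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    public double predict(double[] x) {
        double z = 0.0;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return sigmoid(z);
    }

    // One SGD step on a single example; y is the label in {0, 1}.
    public void update(double[] x, int y) {
        double err = predict(x) - y;             // gradient of the log-loss w.r.t. z
        for (int i = 0; i < w.length; i++) w[i] -= lr * err * x[i];
    }

    public static void main(String[] args) {
        OnlineLogistic model = new OnlineLogistic(2, 0.5);
        // A stream of examples: positive first feature -> label 1, negative -> 0
        // (second feature is a constant bias term).
        double[][] xs = {{1, 1}, {2, 1}, {-1, 1}, {-2, 1}};
        int[] ys = {1, 1, 0, 0};
        for (int epoch = 0; epoch < 50; epoch++)
            for (int i = 0; i < xs.length; i++) model.update(xs[i], ys[i]);
        System.out.println(model.predict(new double[]{3, 1}) > 0.5); // true
    }
}
```

Since each `update` touches only one example, the same loop can consume a data stream of any length, which is the property that makes sequential training scale.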
Regularization
Modern data modeling increasingly employs regularization (controlling complexity by trading off data fit against model simplicity) as a means of reducing errors due to overfitting the training data. Our machine learning system provides a wide variety of regularization techniques: Lp norms including the popular L2 norm (ridge regression) and L1 norm (lasso), probabilistic priors including Gaussian and Cauchy priors, and model entropy. These techniques can be combined in something akin to a super Elastic Net and applied together, offering benefits that would not be achievable otherwise.
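The idea of mixing penalties can be illustrated with an elastic-net-style term that combines the L1 and L2 norms. The names and the specific combination below are assumptions for illustration, not the DBMiner API:

```java
// Illustrative sketch (not the DBMiner API): an elastic-net-style penalty
// mixing the L1 and L2 norms. The combined term
//   lambda1 * ||w||_1 + (lambda2 / 2) * ||w||_2^2
// is added to the training loss, and its subgradient shrinks the weights
// during optimization.
public class MixedPenalty {
    public static double penalty(double[] w, double lambda1, double lambda2) {
        double l1 = 0.0, l2 = 0.0;
        for (double wi : w) { l1 += Math.abs(wi); l2 += wi * wi; }
        return lambda1 * l1 + 0.5 * lambda2 * l2;
    }

    // Subgradient of the penalty, added to the loss gradient at each step.
    public static double[] subgradient(double[] w, double lambda1, double lambda2) {
        double[] g = new double[w.length];
        for (int i = 0; i < w.length; i++)
            g[i] = lambda1 * Math.signum(w[i]) + lambda2 * w[i];
        return g;
    }

    public static void main(String[] args) {
        double[] w = {1.0, -2.0, 0.0};
        // L1 part: 0.1 * 3 = 0.3; L2 part: 0.5 * 1.0 * 5 = 2.5
        System.out.println(penalty(w, 0.1, 1.0)); // 2.8
    }
}
```

The L1 term drives small weights exactly to zero (feature sparsity) while the L2 term shrinks large weights smoothly, which is why applying them together can offer benefits neither achieves alone.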
Model Selection
Oftentimes there are multiple candidate models for solving a problem. A given model may have many different parameter settings that can be tuned: boolean, integer, floating-point, categorical, and object-valued parameters that affect how the modeling procedure learns from data and how the model makes predictions. Deep Blue's DBMiner machine learning system has a parallel system for finding the best models and parameter settings from the space of viable candidates. This requires exploring the space of possibilities, often using complex cross-validation schemes to avoid making predictions on the training data, and keeping the choices with the greatest promise.
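The selection loop can be sketched as follows, with the validation scorer stubbed out as a caller-supplied function (in practice it would be, for example, the mean k-fold cross-validation error). All names here are hypothetical, not DBMiner's API:

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch of the selection loop described above: score each
// candidate hyper-parameter setting on held-out data and keep the most
// promising one. Not part of the DBMiner API.
public class GridSearch {
    public static double bestSetting(double[] candidates, DoubleUnaryOperator validationError) {
        double best = candidates[0];
        double bestErr = Double.POSITIVE_INFINITY;
        for (double c : candidates) {
            double err = validationError.applyAsDouble(c); // e.g. mean k-fold error
            if (err < bestErr) { bestErr = err; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy validation curve with its minimum at lambda = 0.1.
        double[] lambdas = {0.001, 0.01, 0.1, 1.0, 10.0};
        double chosen = bestSetting(lambdas, l -> Math.pow(Math.log10(l) + 1, 2));
        System.out.println(chosen); // 0.1
    }
}
```

Because each candidate is scored independently, this loop parallelizes naturally, matching the parallel search the paragraph above describes.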
Evaluation
Detailed knowledge of a system's behavior is crucial to any machine learning effort. DBMiner collects a wide variety of metrics to increase understanding of a system's and a model's performance, facilitating choices that yield improvements.
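As an illustration of the kind of metrics involved (not DBMiner's actual evaluation API), precision, recall, and F1 can be derived from a binary model's confusion counts:

```java
// Illustrative metric computations (names are assumptions, not the DBMiner
// API): precision, recall, and F1 from true-positive (tp), false-positive
// (fp), and false-negative (fn) counts of a binary classifier.
public class Metrics {
    public static double precision(int tp, int fp) { return tp / (double) (tp + fp); }

    public static double recall(int tp, int fn) { return tp / (double) (tp + fn); }

    // F1 is the harmonic mean of precision and recall.
    public static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp), r = recall(tp, fn);
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // 8 true positives, 2 false positives, 2 false negatives.
        System.out.println(f1(8, 2, 2)); // precision = recall = 0.8, so F1 = 0.8
    }
}
```

Reporting several such metrics side by side, rather than accuracy alone, is what makes it possible to see whether a change trades false positives for false negatives.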
Configurability
Entire machine learning systems can easily be configured using YAML. This choice allows even non-programmers to build and tune powerful machine learning systems once DBMiner has been integrated, and tweaks can easily be made through small changes to configuration files. More complex objects are serialized separately and loaded based on requests specified in the YAML file.
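A hypothetical configuration in this style might look like the following. Every key name here is an assumption for illustration; the actual DBMiner schema is not shown in this document:

```yaml
# Hypothetical layout (illustrative key names, not the real DBMiner schema):
# a feature pipeline feeding a sequentially trained, regularized model.
pipeline:
  transforms:
    - type: log-scale
      field: revenue
    - type: outlier-filter
      max: 1.0e6
model:
  type: logistic-regression
  training: sequential
  regularization:
    l1: 0.01
    l2: 0.1
evaluation:
  metrics: [accuracy, auc, f1]
selection:
  cross-validation-folds: 5
```

Laying the components out this way is what lets a non-programmer swap a transform or adjust a regularization weight without touching any code.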