Intelligence is about solving a chaotic problem; you can see the manifold (the butterfly's wings) but never know exactly when, or to which wing, the form will flip. Data is both the bottleneck and the gold mine when it comes to building such systems.

ML is commonly understood as supervised or unsupervised learning, in which a model classifies or predicts by iteratively reducing error.
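A minimal sketch of this error-reduction view (the synthetic dataset, linear model, and learning rate below are illustrative assumptions, not from the text): a supervised learner repeatedly nudges its parameters to shrink a measured error.

```python
import numpy as np

# Illustrative synthetic dataset: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = 3.0 * x[:, 0] + rng.normal(scale=0.5, size=200)

# Supervised learning as error reduction: gradient descent on squared error.
w, b = 0.0, 0.0
lr = 0.1
for _ in range(500):
    pred = w * x[:, 0] + b
    err = pred - y
    w -= lr * np.mean(err * x[:, 0])   # step against the error gradient
    b -= lr * np.mean(err)

print(f"learned w={w:.2f}, b={b:.2f}, "
      f"mse={np.mean((w * x[:, 0] + b - y) ** 2):.3f}")
```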

Figure 1 - Plain Vanilla View of ML

An ML system built on a data architecture that mirrors, or at least resembles, the data generating mechanism performs better, because it starts from the right questions: where is the data coming from? How should the data architecture be designed? Will the dataset lend itself well to the chosen ML process?
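A toy sketch of this idea (the group-based generating mechanism and feature choices below are invented for illustration): if the data is actually produced per group, a dataset whose architecture keeps the group identity lets the same ML process perform far better than one that discards it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generating mechanism: each group has its own baseline.
rng = np.random.default_rng(1)
groups = rng.integers(0, 5, size=1000)
baselines = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
x = rng.normal(size=1000)
y = baselines[groups] + 0.5 * x + rng.normal(scale=0.3, size=1000)

# Architecture that ignores the mechanism: only the raw signal x.
flat = LinearRegression().fit(x.reshape(-1, 1), y)

# Architecture that mirrors the mechanism: group identity kept as features.
group_onehot = np.eye(5)[groups]
structured = LinearRegression().fit(np.column_stack([x, group_onehot]), y)

print("R^2 ignoring mechanism :", round(flat.score(x.reshape(-1, 1), y), 3))
print("R^2 mirroring mechanism:",
      round(structured.score(np.column_stack([x, group_onehot]), y), 3))
```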

Figure 2 - Data Generating Mechanism View of ML

However, a data generating mechanism and a good architecture that assume stability and linearity in information can still be biased and deliver erroneous results. Specifying a linear model and expecting it to capture causality has led modern finance to a set of conflicting theories over the last 100 years, so expecting such an approach to solve the challenges of non-financial domains is naive.

Figure 3 - Causal Complex View of ML

And because causality eventually leads to chaos, a robust ML system should specify a model that assumes a varying, dynamic degree of influence among a set of causes, expecting some causes to fail and new causes to emerge and succeed. Such a system makes no assumption about the validity of its information and embraces both error reduction and error amplification as outputs.
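One hedged way to picture such a system (the drift pattern, window size, and rolling estimator below are assumptions for illustration): let the influence of each cause vary over time, with one cause dying out and another emerging, and estimate the influences on a rolling window instead of assuming they are fixed.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
causes = rng.normal(size=(T, 2))

# Time-varying influence: cause 0 fades out, cause 1 emerges halfway through.
w0 = np.linspace(1.0, 0.0, T)
w1 = np.where(np.arange(T) < T // 2, 0.0, 1.5)
y = w0 * causes[:, 0] + w1 * causes[:, 1] + rng.normal(scale=0.1, size=T)

# Rolling least-squares estimate of each cause's current influence.
window = 100
for t in (300, 600, 900):
    X = causes[t - window:t]
    est, *_ = np.linalg.lstsq(X, y[t - window:t], rcond=None)
    print(f"t={t}: estimated influences {est.round(2)} "
          f"(true ~[{w0[t]:.2f}, {w1[t]:.2f}])")
```

A model fitted once on the early window would keep asserting an influence that no longer exists; the rolling view lets a failing cause fade and an emerging one appear.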

Being data heavy or data light is not a characteristic of the ML process but of the data architecture. One can use a good data architecture to sample the data well for the ML process. A well-designed architecture can drive the ML process with a fraction of the data (e.g. less than 10%), refreshed periodically. It is therefore essential to ask: how much of my database do I really need for my ML process to run optimally? The more carefully we use data for training, the fewer biases we introduce into our ML processes and the less electricity we burn, a desirable objective in a computation-heavy world.
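As a rough sketch of architecture-driven sampling (the synthetic dataset, stratified sampler, and the 10% figure below are illustrative assumptions): a deliberately drawn 10% sample can train a model nearly as well as the full dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative "database": a synthetic classification dataset.
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train on everything vs. on a carefully sampled 10% (stratified by label).
full = LogisticRegression(max_iter=1000).fit(X_train, y_train)
X_small, _, y_small, _ = train_test_split(
    X_train, y_train, train_size=0.1, stratify=y_train, random_state=0
)
small = LogisticRegression(max_iter=1000).fit(X_small, y_small)

print("accuracy, full data :", round(full.score(X_test, y_test), 3))
print("accuracy, 10% sample:", round(small.score(X_test, y_test), 3))
```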

If we want to build ML systems that simulate real-world problems, we have to train them on datasets that evolve from an understanding of data generating mechanisms, appropriate data architectures, and causal complexity, which rests on the assumption that every cause has a certain probability of succeeding or failing to affect the outcome.