A watershed moment in the data space
Water, in all its forms, looms as the predominant metaphor in the data space. Snow, ice, flow, lakes — we seem to naturally understand data in these shapes. Some people mock this proclivity, as with any overburdened marketing-speak, but however played out I find it apt.
I think the central problem with these aquatic metaphors is not overuse, but lack of depth. We don't lean into them deeply enough. They remain surface-level marketing terms. If we expanded our conception to fully embrace the totality of the water cycle within natural ecosystems, we could gain not only a greater understanding of where we're at, but a clearer vision of what's developing.
That's because data is becoming more like water every day. Data is unfreezing, flowing, and pooling in new ways, and understanding water more deeply will help us flow with this change rather than fight it.
The best metaphor for data in the current and coming state of the space is not the dominant imaginal symbol of the snowy glacier, but a vibrant watershed.
A watershed is a complex system, where rainfall moves from high ground to large basins, collecting from rivulets into streams, streams to creeks, creeks to lakes and rivers — some in the process cycling back, evaporating into clouds and rainfall to start again, and some making its way to the ocean.
Data is much like this! Individual events gather like raindrops, collecting as they go along into ever larger, more purposeful arrangements. Analysis, ML, and Reverse ETL, like evaporation, cycle back some of the data and help generate more to fuel the process continually. Plants and animals are nourished by this ecosystem.
Until recently, in the 'ice age' of the past decade, we've exerted a lot of effort to carve straight pathways to forcibly collect and freeze the flows of our data. Without adequate tooling to meet data where it is or deal with it in natural flow, our only option was to turn towards irrigation and ice for our needs.
This is changing though. The rise of federated query engines, improved streaming, in-process OLAP databases, distributed data formats, and distributed storage platforms for those formats is creating a world where we need not mar the watershed with manmade constructs. We can map it, nurture it, become stewards of the ecosystem.
Networks of ponds and creeks can become as valuable as lakes and rivers. Ice can melt into fresh springs. We can trade in our arctic ocean liners for canoes.
When we have to go to great lengths to shift and centralize our data, we inherently introduce change, and at great cost. We mangle the ecosystem, and we pay dearly to do so. It has thus far been easier to carve out and map a single, rectangular block of ice, so we’ve had to be very careful and expend great effort to do this in an accurate and sustainable way.
We'll be much better served though, in both the immediate and long term, to let go of this effort and begin to nurture the data where it is. We have a new opportunity to work with data in its natural habitat, and I for one will be taking my canoe (dbt-duckdb) to the pond to explore! I hope we can discover new waterways between our ponds and camp out together.