5 Major Problems of Spreadsheets Solved with Parallel Dataflow Programming
Or: The Adventure Park of Data Analytics and Data Visualization
One of the big current challenges in the computer industry is how to make the power of modern multi-core CPUs available to a large audience. Many attempts center on improving and extending existing procedural programming languages, or on shifting towards paradigms that are friendlier to parallel execution, like functional or dataflow programming. Unfortunately, most of these approaches address only trained programmers and computer scientists, and they usually require a complete upfront design of the desired solution for the problem at hand.
So what about domain specialists in areas like advanced data analytics (e.g. data mining), data visualization or financial analysis (e.g. business intelligence and financial controlling), who require (and thus are often used to) more intuitive approaches and tools for solving their problems?
Even better: wouldn’t it be great if there were a way that enabled practically “everyone” to easily create solutions for their everyday computing problems, harnessing the full power of parallel programming, without writing a single line of code?
Interestingly, there is already a large group of people who write parallel dataflow programs every day without being aware of it: I am talking about the users of spreadsheet applications like Excel or Calc (part of OpenOffice). Unfortunately, spreadsheets are afflicted with 5 major problems, so this choice can turn out to be a dead end for many.
What are these problems, and is there a way to get out of this trap? Can we bring back the fun into mining information from our data?
The great thing about spreadsheets is the easy entry they provide into analyzing data and visualizing the results of this process. Once the source data has been imported, the solution can be developed by trial and error, without having to create a design first. A spreadsheet is a little bit like a simple playground for trying out ideas on how to process your data.
However, as any experienced user of spreadsheet applications can confirm, there is a certain limit to the complexity of solutions that can be implemented as a spreadsheet. I have identified the following 5 major problems:
Problem 1: Difficult inspection / lack of traceability: In complex spreadsheets it is very hard to maintain an overview of the formulas. Auditing the correctness of a spreadsheet can become a cumbersome task, especially in documents with several sheets that make cross-references all over the place.
Problem 2: Inflexibility: It is hard to rearrange the solution when it turns out that the trial-and-error development process went in the wrong direction and needs to be corrected.
Problem 3: Limited reuse: Applying a solution to slightly different source data sets is difficult. A spreadsheet created for one purpose usually cannot be reused for even a slightly different problem. Nor is it possible to turn subsets of a solution into libraries and easily rearrange or recombine them to tackle a new problem.
Problem 4: Limited scalability: Processing very large amounts of data in a spreadsheet causes problems, as the complete source data has to be present in memory at once – there is no simple way to “stream” large amounts of data through a spreadsheet solution.
Problem 5: Deployment: It is not easily possible to turn a spreadsheet into a standalone solution that can be deployed as a distributable product or be published as a web site.
Let’s not even start talking about having to resort to scripting (the big trap for non-programmers) when simple formulas no longer suffice, or about the complex filtering and preparation that data often needs after import.
How can all of these problems with spreadsheets be effectively addressed? That’s what my colleagues and I here at ANKHOR have been asking ourselves for a long time (the concept of the spreadsheet is more than 30 years old…yikes!). Finally, we came up with this:
Solution 1: Make the dataflow graph visible and graphically editable. Representing the formulas as operators in a dataflow graph, and allowing inspection of the intermediate values on the edges of the graph, provides the desired traceability of every processing step. Editing the graph visually with the mouse or gestures makes it easy to rearrange connections between operators, and thus provides the desired intuitive way of developing solutions by simply trying things out.
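To make the idea concrete, here is a minimal sketch of a dataflow graph in Python: operators are nodes, edges carry values, and every intermediate result stays inspectable after evaluation. This is purely illustrative (the `Node` class and its fields are made up for this post) and is not the ANKHOR FlowSheet API.

```python
from dataclasses import dataclass, field

# Hypothetical minimal dataflow node: an operator plus its input edges.
@dataclass
class Node:
    name: str
    op: callable
    inputs: list = field(default_factory=list)  # upstream Node objects
    cached: object = None                       # last value on the outgoing edge

    def evaluate(self):
        # Pull values from upstream nodes, apply the operator, and cache the
        # result so every intermediate step can be inspected afterwards.
        args = [n.evaluate() for n in self.inputs]
        self.cached = self.op(*args)
        return self.cached

# Build a tiny graph computing (a + b) * 2.
a = Node("a", lambda: 3)
b = Node("b", lambda: 4)
add = Node("add", lambda x, y: x + y, [a, b])
double = Node("double", lambda x: x * 2, [add])

print(double.evaluate())  # 14
print(add.cached)         # 7 -- the intermediate value, visible on the edge
```

Because the graph is explicit data rather than hidden cell references, rewiring it (the Solution 2 refactoring below) amounts to changing a node’s `inputs` list instead of hunting down formulas across sheets.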
Solution 2: Facilitate refactoring with simple operations like drag-and-drop and a context-sensitive menu. The quick reconnection and rearrangement of operators in the dataflow graph further increases the speed at which new ways of treating the data can be tried out “experimentally”.
Solution 3: Support macros and macro libraries. By combining parts of a dataflow graph into a sub-graph, it becomes possible to isolate functionality into reusable functional blocks called macros. Collecting macros of similar nature in libraries for reuse and recombination allows the construction of domain-specific toolkits that significantly reduce the effort required to build new solutions.
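In code terms, a macro behaves like a function that encapsulates a reusable sub-graph. The following sketch shows two hypothetical “macros” from an imagined statistics library being recombined for a new purpose; the names and implementations are invented for illustration only.

```python
# Hypothetical "macro": a reusable block that scales numbers into [0, 1].
def normalize_macro(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero on constant input
    return [(v - lo) / span for v in values]

# Another reusable block from the same imagined library: z-score standardization.
def zscore_macro(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = (var ** 0.5) or 1
    return [(v - mean) / std for v in values]

# Macros recombine freely, like blocks from a domain-specific toolkit:
data = [10, 20, 30, 40]
scaled = normalize_macro(data)          # [0.0, ..., 1.0]
standardized = zscore_macro(scaled)     # chained without rewriting either block
```

The point of the analogy: once functionality is isolated into a named block with defined inputs and outputs, it can be collected in a library and dropped into any new graph, which is exactly what a spreadsheet’s tangle of cell references does not allow.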
Solution 4: Support data streaming (aka “chunked” data processing) and compression of intermediate results. In many cases, the original data is read just once and then transformed into a new representation, e.g. by filtering. After that has happened, only the transformed data is required for further processing. Operators that perform this intermediate processing chunk by chunk, in a streaming fashion, make it possible to handle much larger data volumes. The data volume held in memory is reduced further by compressing intermediate results, e.g. by identifying identical data values in tables.
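The two ideas above can be sketched in a few lines of Python: a generator feeds the data through a filter chunk by chunk, and the surviving column is compressed by storing each distinct value only once. The chunk size and data are made up for illustration; this is one common way (dictionary encoding) to exploit identical values, not necessarily the scheme ANKHOR uses internally.

```python
# Yield the source data chunk by chunk instead of loading it all at once.
def read_chunks(rows, chunk_size=2):
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]

# Compress a column by storing each distinct value once plus small indices.
def dictionary_encode(values):
    table, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(table)
            table.append(v)
        codes.append(index[v])
    return table, codes

source = ["red", "red", "blue", "red", "blue", "blue"]
filtered = []
for chunk in read_chunks(source):
    # The filter runs per chunk; only its (smaller) output stays in memory.
    filtered.extend(v for v in chunk if v != "green")

table, codes = dictionary_encode(filtered)
print(table)  # ['red', 'blue']
print(codes)  # [0, 0, 1, 0, 1, 1]
```

With a real data source, `read_chunks` would wrap a file or database cursor, so memory use depends on the chunk size and the compressed intermediate, not on the total input size.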
Solution 5: Provide stand-alone and server runtime environments with HTML rendering. A dataflow graph implementing a solution can simply be taken “as is” and distributed as a self-installing executable, or installed on an HTML server that renders the output as HTML. The rendering capabilities must be strong enough to display professional dashboards or produce reports in PDF format.
All of the above concepts are anything but new. However, until last year, when we released version 1.0 of our product ANKHOR FlowSheet, there was no platform combining all of these approaches in one, without ever having to resort to tools outside the graphical dataflow editing environment. In December we started the beta program for the upcoming 1.1 release, which greatly improves the parallel execution of dataflow graphs. In contrast to other dataflow-based development environments, it exploits more than just the inherent parallelism of the dataflow graphs themselves.
So if you are using spreadsheet programs like Excel to “play with your data” today, you might want to give ANKHOR FlowSheet a try. The great thing is that you won’t lose any of the freedoms you have with spreadsheets, but you will never run up against any of their limits, even as your problems and demands keep growing. Thus it’s not just a playground for data miners and visualizers – it’s the adventure park!
In the coming weeks and months my colleagues and I will be publishing here on this blog about the experiences that our customers, and we ourselves, are having with using a graphical dataflow environment for data mining and for visualizing arbitrary data in all kinds of ways – without writing a single line of code. There are some real gems in our set of standard libraries covering statistical data mining and predictive analytics (e.g. methods like k-means clustering and Bayesian networks), MOLAP data cube processing, beautiful visualizations of the results, and much, much more that we would like you all to know about. Stay tuned!