Facing the Paradox of Software Visualization

The paper The Paradox of Software Visualization by Steven Reiss opens with a question:

Software visualization seems like such a logical and helpful concept with obvious benefits and advantages. But after decades of research and work, it has yet to be successful in any mainstream development environment. What is the reason for this paradox? Will software visualization ever be actually widely used?

It was published in 2005 but feels as relevant today as it was 20 years ago. The paper describes how research efforts are out of touch with reality and lays out several reasons for this failure, grouping them into the realities of Understanding, Software, and Developers.

The Reality of Understanding

The visualization systems that have been developed address “generic” understanding problems. They look at the program structure from the generic view of the class hierarchy or the call hierarchy.

However, the reality of software understanding is that programmers ask specific questions, not generic ones. For example, they want to know what state the threads of a system are in, not in terms of generic states, but rather in terms of logical states from the application's point of view. They want to understand abstractions of their specific internal structures or the resulting execution.

Generic solutions only work for generic problems, not for specific problems in specific programs. The reality of program understanding is that it involves dealing with specific problems that require program- and task-specialized solutions, and software visualization has not addressed these issues.

Realities of Software

Today’s systems are also structurally complex and heterogeneous. Tomorrow’s systems, the ones being built today to run in the future, are even more complex. These systems will use web services and outside components over which the programmer has no control nor detailed knowledge. These systems will be highly distributed, running unpredictably on grids of machines, sharing data using peer-to-peer facilities, and interacting at network speeds with other, possibly outside, systems.

Software visualization systems and solutions have generally addressed yesterday’s problems. They do not scale to handle today’s large systems (although they now do scale to handle what were large systems a decade ago). They do not address the heterogeneous nature of today’s software, instead concentrating on a single aspect or single portion of the system.

Realities of Developers

Visualization needs to be incorporated into environments where developers find it useful. The cost to developers of learning a new tool must not exceed its expected rewards. It must be easy to show that a tool provides real benefits.

Software visualization has generally failed on both counts. It is rare to find a software visualization tool that an uninformed programmer can take off the shelf and use on their particular system immediately. Most software visualization tools (many of my own included) require the programmer to do significant work before they can receive any benefits. Some tools require extensive configuration to get a program into an environment and get it understood by the environment. Some tools require recompilation with different arguments. Some require a long program analysis process with a large database. Some require that the user work with specific languages or subsets or convert portions of the system for compatibility.

Tackling the Challenge

To me, the state of software visualization tooling feels comparable to the state of IDEs in the era before the Language Server Protocol (LSP). To move forward as IDEs did and bridge the gap, we could try to apply the Narrow Waist pattern to software visualization.

To think about the challenges, we can divide the software visualization process roughly into two parts:

  • Ingestion - Collecting and processing the data from various sources. Here we deal mostly with the Realities of Software.
  • Visualization - Taking the data, querying the relevant information, and displaying it in a useful visual form. The challenges are usually about the Realities of Understanding and Realities of Developers.

Ingestion

There are several challenges regarding ingestion. We need to bring together multiple sources. Parts of a system can be written in different programming languages. A system can consist of many services distributed across a network. It can be split across multiple source control repositories.

There are also additional sources available that have traditionally been untapped. Thanks to the rising trend of Infrastructure as Code, infrastructure topology can be mined as an additional source of insight. To gain understanding and be able to provide additional insights, a software visualization system needs to connect these sources together.

There is also a technical challenge in how to store and query the data. To capture a connected model, I expect we need to go beyond relational tables, likely toward a graph model or a triple-based one like RDF.
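As a minimal sketch of what a triple-based store could look like, the snippet below records a few hand-written facts with rdflib; the `code:`-prefixed predicates and the entity names are made-up examples, not an established ontology.

```python
# Minimal sketch: code and infrastructure facts as RDF triples (rdflib).
# The predicates (code:calls, code:deployedOn, ...) are hypothetical examples.
from rdflib import Graph, Namespace

CODE = Namespace("http://example.org/code#")

g = Graph()
g.bind("code", CODE)

# Facts that could come from different ingesters: a source parser, an
# HTTP-client analysis, and an Infrastructure-as-Code manifest.
g.add((CODE["OrderService.place"], CODE.calls, CODE["PaymentClient.charge"]))
g.add((CODE["PaymentClient.charge"], CODE.callsService, CODE["payment-service"]))
g.add((CODE["payment-service"], CODE.deployedOn, CODE["cluster-eu-1"]))

# A single query can now cross the source-code / infrastructure boundary.
q = """
    SELECT ?fn ?cluster WHERE {
        ?fn code:calls ?callee .
        ?callee code:callsService ?svc .
        ?svc code:deployedOn ?cluster .
    }
"""
for fn, cluster in g.query(q, initNs={"code": CODE}):
    print(f"{fn} transitively depends on {cluster}")
```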

The other challenge is about schema and modeling. The presentation about Glean introduces a great insight regarding schema definition. In the past, the usual approaches were:

  • Least Common Denominator
    • This leads to omitted and missing information; some questions become impossible to answer.
  • Union of all Languages
    • This ends up including the nuances of every programming language and becomes complicated and difficult to work with.

In contrast, Glean does not enforce a single schema, because each language has its own nuances and different clients need different data. Instead, each language-specific source has its own schema that contains the full information, and commonality is then derived.
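The snippet below sketches that idea in plain Python rather than in Glean's own schema language: each language keeps a full-fidelity fact type, and a common definition view is derived from them. All type and field names are illustrative assumptions.

```python
# Sketch of per-language schemas with a derived common view (not Glean's
# actual schema language; all names here are illustrative).
from dataclasses import dataclass

@dataclass
class PythonFunctionFact:          # full fidelity for Python
    module: str
    name: str
    decorators: list[str]
    is_async: bool

@dataclass
class JavaMethodFact:              # full fidelity for Java
    package: str
    klass: str
    name: str
    annotations: list[str]
    visibility: str

@dataclass
class Definition:                  # derived, language-neutral view
    qualified_name: str
    language: str

def to_definition(fact) -> Definition:
    """Derive the common view; the full facts remain available for clients
    that need language-specific details."""
    if isinstance(fact, PythonFunctionFact):
        return Definition(f"{fact.module}.{fact.name}", "python")
    if isinstance(fact, JavaMethodFact):
        return Definition(f"{fact.package}.{fact.klass}.{fact.name}", "java")
    raise TypeError(f"no common view for {type(fact).__name__}")
```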

Another consideration is processing and understanding the execution flow. The dilemma is that execution traces in reality are the ultimate source of truth, but static models are easier to work with and can be reasoned about more easily.

Structure and dependencies can be derived and aggregated from execution traces. The visualization systems should not treat static models vs. dynamic execution as two separate things, but instead as two complementary perspectives about the same system. Showing just the static structure is the easier part and a good start, but is only half of the job.

This goes hand-in-hand with the observability industry moving from simple logs to structured traces. Full structured traces are another source to add and compare against a model of the system to gain additional insights.
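As a small sketch of that complementary perspective, assuming OpenTelemetry-style spans with a span id, a parent span id, and a service name, aggregating parent/child span pairs yields service-level call edges with observed call counts:

```python
# Sketch: derive service-level call edges from structured trace spans.
# The span shape is an assumption modelled loosely on OpenTelemetry data.
from collections import Counter

spans = [
    {"span_id": "a", "parent_id": None, "service": "web"},
    {"span_id": "b", "parent_id": "a",  "service": "orders"},
    {"span_id": "c", "parent_id": "b",  "service": "payments"},
    {"span_id": "d", "parent_id": "a",  "service": "orders"},
]

by_id = {s["span_id"]: s for s in spans}
edges = Counter()
for span in spans:
    parent = by_id.get(span["parent_id"])
    if parent and parent["service"] != span["service"]:
        edges[(parent["service"], span["service"])] += 1

for (caller, callee), count in edges.items():
    print(f"{caller} -> {callee} ({count} observed calls)")
# web -> orders (2 observed calls), orders -> payments (1 observed call)
```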

Visualization

To acknowledge the realities of understanding, we must embrace that there is no one-size-fits-all kind of visualization. Parsing and analyzing source code on the ingestion side is significant work, as is creating a good visualization with nuanced interactivity. A capable visualization system should be able to leverage the work put into existing research and tooling.

To be able to reuse visualization implementations, the visual model needs to be separated out and given a defined interface or schema. Mapping data from code models into different representations of defined visual models is then fairly straightforward glue code.
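As an illustration, the sketch below defines a deliberately small, renderer-agnostic visual model (nodes with size and color, plus edges) and the glue that fills it from a hypothetical code model; the field names are assumptions rather than an established interchange format.

```python
# Sketch of a renderer-agnostic visual model and the glue code that maps a
# (hypothetical) code model into it. All schemas here are illustrative.
from dataclasses import dataclass

@dataclass
class VisualNode:
    id: str
    label: str
    size: float      # e.g. lines of code
    color: str       # e.g. bucketed by age or churn

@dataclass
class VisualEdge:
    source: str
    target: str

@dataclass
class VisualModel:
    nodes: list[VisualNode]
    edges: list[VisualEdge]

def code_graph_to_visual(modules: dict[str, int],
                         imports: list[tuple[str, str]]) -> VisualModel:
    """modules: name -> lines of code; imports: (importer, imported) pairs."""
    nodes = [VisualNode(id=m, label=m, size=float(loc), color="#4c78a8")
             for m, loc in modules.items()]
    edges = [VisualEdge(source=a, target=b) for a, b in imports]
    return VisualModel(nodes=nodes, edges=edges)
```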

Another challenge is that platforms change. Functioning solutions become obsolete by not being able to keep up with the changing platform landscape.

Facing the reality of developers today, web-rendered tools are the most commonly used. Therefore any software visualization tool needs to be able to run on the web. This makes it possible to integrate visualization into web-rendered IDEs, but it is also useful in other places like documentation portals.

Although the web is currently dominant as a rendering target, other platforms might rise in the future. Additional care needs to be taken to separate concerns, extracting algorithms such as layout and metrics calculation into libraries. When a new platform becomes prominent, only the view layer needs to be reimplemented.
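One way to keep that separation, sketched below with a placeholder circular layout: the layout lives in a pure library function that maps node ids to coordinates, and any web, desktop, or future view layer only consumes the resulting positions.

```python
# Sketch: layout as a pure, renderer-independent library function.
# The circular layout is just a stand-in for a real layout algorithm.
import math

def circle_layout(node_ids: list[str],
                  radius: float = 100.0) -> dict[str, tuple[float, float]]:
    positions = {}
    for i, node_id in enumerate(node_ids):
        angle = 2 * math.pi * i / max(len(node_ids), 1)
        positions[node_id] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions

# A view layer on any platform only needs these coordinates to draw the graph.
print(circle_layout(["web", "orders", "payments"]))
```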

In summary, from the visualization side we need to bridge the realities of understanding and developers to make available different views and perspectives in different environments.

Substrates

A pattern I see is that software visualization researchers often build monolithic, standalone applications that end up being difficult to modify. These systems are often presented as complete solutions, with a fixed set of use cases designed to showcase the research.

Having a limited set of supported use cases makes sense for commercial vendors that find a market opportunity and deliver value to customers. However, when research systems are tightly integrated, it becomes harder to build upon them and expand the research in new directions.

It is useful for systems to be driven by specific real use cases, but it is desirable to achieve them using composable components. Then those components can be recombined to cater to a long tail of previously unimagined one-off tasks and questions. Composable systems can serve as a foundation for future work, and new ideas can be tested without needing to start from scratch.

Adapting software to a long tail of user tasks faces challenges that no-code/low-code tools are trying, and struggling, to solve. Intriguing ideas are discussed in Programming substrates by Tomas Petricek. The article suggests a different way of creating extendable systems: building substrates that make small changes easy while allowing progression to more complex tasks.

With the advent of generative AI, customizing software might get easier. However, if we create composable building blocks that are easier for human users to customize, models will also benefit and learn to leverage those building blocks.

Future Exploration

I try to apply the ideas above in the software visualization project Stratify. One principle is to build upon and extend existing components and research where possible.

On the ingestion side I try to leverage language-specific parsers as well as experiment with language-agnostic tools like SCIP, LSP, Stack Graphs, and Glean.

Besides loading just source code, there are experiments with loading other sources like architecture and infrastructure maps. One part of the future work is to explore merging the sources into a unified graph that can be queried to answer questions and serve as a foundation for flexible visualizations.

Once we parse and load the sources, the visualization side is about translating the data into representations for various existing renderers like DGML Hierarchical Graph, 3D Code City, or 3D Code Galaxy. Observing limitations of these can inform development of a custom renderer in the future.
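For example, a minimal sketch of that translation for a DGML target (the directed-graph XML format consumed by Visual Studio's graph viewer) could look like this; the input graph is a hard-coded toy example.

```python
# Sketch: translate a small dependency graph into a DGML document.
import xml.etree.ElementTree as ET

DGML_NS = "http://schemas.microsoft.com/vs/2009/dgml"

def to_dgml(nodes: list[str], links: list[tuple[str, str]]) -> str:
    root = ET.Element("DirectedGraph", xmlns=DGML_NS)
    nodes_el = ET.SubElement(root, "Nodes")
    for n in nodes:
        ET.SubElement(nodes_el, "Node", Id=n, Label=n)
    links_el = ET.SubElement(root, "Links")
    for src, dst in links:
        ET.SubElement(links_el, "Link", Source=src, Target=dst)
    return ET.tostring(root, encoding="unicode")

print(to_dgml(["web", "orders"], [("web", "orders")]))
```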

To explore end-user adaptability, the approach tries to avoid introducing a custom platform or application and instead leverages tools developers already use, like computational notebooks. Another interactive visual paradigm to explore further is REBL (Read-eval-browse loop), which expands the REPL (Read-eval-print loop) paradigm by browsing the results visually. REBL is an intriguing approach because it is closer to the idea of being a programming substrate.

It’s going to be an interesting journey to see where we are in another 20 years and if The Paradox of Software Visualization will still hold.