Gestalt Principles for Data Visualization

Proximity and Past Experience with Network Visualization

Introduction

Position of graphical elements in charts cannot be left to chance, even when position does not directly encode a dimension of the data. Contrast the scatterplot, where the XY position of a symbol is directly related to the two dimensions of data being analyzed, with the force-directed network, where the XY position of a symbol is based on a mechanical simulation that pushes disconnected nodes away from each other and pulls connected nodes toward each other. Network diagrams like the kind on the right epitomize the problem of proximity, but these issues are shared by any chart that lays out data based on principles other than direct representation of quantitative dimensions.

Force-directed networks are one of many methods of representing networks (or graphs). The circles represent some kind of actor or other object (typically referred to as a node) and the lines (referred to as links or edges) indicate explicit connections between them. As the overall tension of the network relaxes, the algorithm settles and stops laying out, leaving a representation that is stable but not necessarily the best possible one. You can drag the nodes and see how the structure of the network self-organizes based on that simulated push and pull.
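The push and pull described above can be sketched in a few lines. This is a toy simulation loosely in the style of Fruchterman-Reingold, not the code behind the network shown with this essay; the function name `force_layout`, the force formulas, and every default value (canvas size, iteration count, gravity strength) are assumptions chosen for illustration.

```python
import math
import random


def force_layout(nodes, edges, width=500.0, height=500.0,
                 iterations=300, gravity=1.0, seed=None):
    """Toy force-directed layout. Returns {node: (x, y)}.

    nodes is a list of hashable ids, edges a list of (a, b) pairs.
    All constants here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Start from random positions: this is the source of non-determinism.
    pos = {n: [rng.uniform(0, width), rng.uniform(0, height)] for n in nodes}
    k = math.sqrt(width * height / max(len(nodes), 1))  # preferred spacing
    max_step = width / 10
    for step in range(iterations):
        cooling = 1.0 - step / iterations  # shrink the steps as tension relaxes
        disp = {n: [0.0, 0.0] for n in nodes}
        # Every pair of nodes pushes apart.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-6
                push = k * k / d
                disp[a][0] += dx / d * push
                disp[a][1] += dy / d * push
                disp[b][0] -= dx / d * push
                disp[b][1] -= dy / d * push
        # Connected nodes pull toward each other.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-6
            pull = d * d / k
            disp[a][0] -= dx / d * pull
            disp[a][1] -= dy / d * pull
            disp[b][0] += dx / d * pull
            disp[b][1] += dy / d * pull
        # Canvas gravity pulls every node toward the center.
        for n in nodes:
            disp[n][0] += (width / 2 - pos[n][0]) * gravity
            disp[n][1] += (height / 2 - pos[n][1]) * gravity
        # Move each node along its net force, capped by the cooling schedule.
        for n in nodes:
            dx, dy = disp[n]
            d = math.hypot(dx, dy) or 1e-6
            move = min(d, max_step * cooling)
            pos[n][0] += dx / d * move
            pos[n][1] += dy / d * move
    return {n: (x, y) for n, (x, y) in pos.items()}
```

Calling `force_layout(["a", "b", "c"], [("a", "b")], seed=1)` twice gives the same positions, while leaving `seed` unset starts from different random positions on each run.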

Proximity

Position in charts is problematic first of all because proximity indicates similarity. The basic philosophy of the force-directed network embraces this fact, and attempts to keep related nodes (that is, connected nodes) near each other. But too many connections and too much complexity invariably lead to unrelated nodes being displayed visually closer to each other than they should be. In a network, all that matters is network distance: the number of steps from one node to another. Nodes that are entirely disconnected from each other, where no path can be drawn from one to the other, can be considered at an infinite distance as far as the network perspective is concerned. The only reason such disconnected parts of the network are kept on-screen is a canvas gravity built into the force-directed algorithm solely to counteract the repulsion that would otherwise push them apart.

This time, when the layout has finished running, some nodes will be highlighted in red. These red nodes are displayed on-screen in closer proximity than is warranted by their distance from each other on the network. Because this is a randomly generated network, and the force-directed algorithm can easily be perturbed to produce a slightly different final layout, the affected nodes and the distances between them will change from one run to the next.

You can try to drag nodes around and adjust the inaccuracy of the layout but, like pushing down an air bubble in a linoleum floor, this will not remove the spatial problem from this network. At best it will reduce the problem or move it to a different part of the network.

Once the network algorithm finishes, we can calculate the nodes that demonstrate this problem.
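One way to make that calculation concrete is sketched below. It compares network distance (hops, found by breadth-first search) with on-screen distance, and flags pairs that are not directly connected yet sit closer together than a typical link. The rule itself, the `link_length` parameter, and the function names are assumptions made for this sketch; the essay does not specify the exact test used to color nodes red.

```python
import math
from collections import deque
from itertools import combinations


def hop_distances(adjacency, source):
    """Breadth-first search: number of steps from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def misleading_pairs(positions, edges, link_length):
    """Return (a, b, hops) for pairs drawn closer than their network distance warrants.

    positions: {node: (x, y)} from a finished layout.
    link_length: assumed typical on-screen length of one link.
    Unreachable pairs get hops = math.inf.
    """
    adjacency = {n: set() for n in positions}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    hops_from = {n: hop_distances(adjacency, n) for n in positions}
    flagged = []
    for a, b in combinations(positions, 2):
        hops = hops_from[a].get(b, math.inf)
        screen = math.dist(positions[a], positions[b])
        # Assumed rule: two or more steps apart (or disconnected),
        # but drawn closer together than a single link.
        if hops > 1 and screen < link_length:
            flagged.append((a, b, hops))
    return flagged
```

With positions `{"a": (0, 0), "b": (10, 0), "c": (20, 0), "d": (1, 1)}`, edges a-b and b-c, and a `link_length` of 10, the disconnected node d is flagged against both a and b, since it is infinitely far from them on the network but drawn right beside them.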

Past Experience

Data visualization readers come to a chart with certain assumptions, and as they interact with a dynamic chart, they develop further assumptions. Force-directed networks, like any layout that attempts to efficiently use on-screen space, can be non-deterministic in their results and this rightly confuses readers. The network on the right has had all of its nodes moved to random points and then the force-directed layout is re-run. The results, from exactly the same settings using exactly the same algorithm, are invariably different.

The principle of past experience is once again on the side of the reader, who expects that an accurate chart will look the same if the same parameters and the same data are put into it. But even a reader familiar with the vagaries of force-directed layouts suffers when their experience with the very same network and the very same layout changes so much. Just how different can each layout of such a simple network actually be? The old network is kept in light green, and arrows are drawn to show the change in position from the last layout to the new one. A combination of mirroring, rotation, and randomization of where disconnected components are placed all lead to very different visual representations of the same system.

One way to begin to address this difficulty, as suggested by Katy Börner, is the use of a network base map. The force-directed layout algorithm is used in an initial phase to effectively represent a network and then it is frozen, and any later filters only affect color, size and the existence of links and nodes. The problem with this approach is that it also runs afoul of past experience, since the presence or absence of nodes and edges does not merely affect the measurements of the data represented by those elements; it also affects the forces that were used to lay out the entire network.

A better approach would be to develop deterministic qualities of network layouts that account for rotation, direction and the placement of disconnected components. A curated force layout like that could also optimize based on the original position of nodes to prefer to maintain similar visual structure.
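One possible ingredient of such an approach, sketched here as an assumption rather than a method proposed in this essay, is to post-process each new layout by rigidly aligning it to the previous one. The hypothetical `align_layout` below tries the new layout both as-is and mirrored, finds the rotation that best matches the old positions in a least-squares sense (a two-dimensional Procrustes fit), and recenters the result on the old layout. It does not address where disconnected components end up relative to each other.

```python
import math


def _centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def _best_rotation(ref, new):
    """Rotate centered points `new` to best match centered `ref`.

    Returns (squared_error, rotated_points).
    """
    s = sum(nx * ry - ny * rx for (rx, ry), (nx, ny) in zip(ref, new))
    c = sum(nx * rx + ny * ry for (rx, ry), (nx, ny) in zip(ref, new))
    theta = math.atan2(s, c)  # closed-form least-squares angle in 2D
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = [(nx * cos_t - ny * sin_t, nx * sin_t + ny * cos_t)
               for nx, ny in new]
    error = sum((px - rx) ** 2 + (py - ry) ** 2
                for (rx, ry), (px, py) in zip(ref, rotated))
    return error, rotated


def align_layout(previous, current):
    """Undo translation, rotation and mirroring between two layouts.

    previous, current: {node: (x, y)}. Only nodes present in both are used.
    Returns the current layout expressed as close to the previous one as a
    rigid motion (optionally mirrored) allows.
    """
    keys = [k for k in previous if k in current]
    ref = [previous[k] for k in keys]
    cur = [current[k] for k in keys]
    rcx, rcy = _centroid(ref)
    ccx, ccy = _centroid(cur)
    ref_centered = [(x - rcx, y - rcy) for x, y in ref]
    candidates = []
    for flip in (1, -1):  # -1 mirrors the new layout before rotating it
        cur_centered = [(flip * (x - ccx), y - ccy) for x, y in cur]
        candidates.append(_best_rotation(ref_centered, cur_centered))
    _, best = min(candidates, key=lambda item: item[0])
    return {k: (x + rcx, y + rcy) for k, (x, y) in zip(keys, best)}
```

If the new layout is only a mirrored, rotated and shifted copy of the old one, the aligned result coincides with the old positions, so any movement that remains after alignment reflects a real change in structure rather than an arbitrary orientation.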

Conclusion

It was easiest to demonstrate these principles with network visualization, but the same problems apply in hierarchical data visualization like circle-packing and dendrograms. Such layouts are designed to efficiently pack shapes and don't generally account for a fixed order or positioning. In circle packs, a reader will expect that circles are in the same order and in roughly the same position if they represent the same category of nested data. In Sankey diagrams, a node being near another node implies similarity. Accounting for the mixed signals of proximity and past experience (both in general and in particular when dealing with dynamic charts) is a distinct challenge in pushing for the adoption of complex data visualization.

Elijah Meeks - April 2015