Gephi: The Photoshop for Graphs

A Tool Born In A French Classroom

In two thousand seven, three computer science students at the University of Technology of Compiègne in northern France were taking a course on complex networks. The course was taught using a piece of software called GUESS, which was the academic standard at the time for visualizing graph data. GUESS was powerful, well-documented, and approximately as pleasant to use as filing a tax return in a foreign language. The students, Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy, decided to build something better.

They called it Gephi, which was originally a portmanteau of graph and exploration, although the full etymology has been told several different ways over the years. By two thousand eight, they had a working version. By two thousand nine, it was being used by researchers in social network analysis around the world. By two thousand sixteen, the International Consortium of Investigative Journalists used a commercial version of related technology to render the visual maps that accompanied the Panama Papers, at the time one of the largest journalism collaborations in history. The visual style of those maps is the Gephi aesthetic: soft spheres in subtle colors arranged by some invisible force, related entities clustering like galaxies. It has shaped how investigative data journalism looks for nearly a generation.

The remarkable thing about Gephi is that it is still free and open source, still desktop software you can install on a laptop, still run by a small group of volunteers, and still capable of producing maps that look like they belong in a national newspaper. The fact that one piece of software has held the aesthetic standard for graph journalism for over fifteen years is unusual. Usually the standard moves around. Gephi has stayed.

Why Network Graphs Are Strange

Most data visualization, when you stop to think about it, has a clear structure. A bar chart has categories along one axis and values along the other. A line chart has time along one axis and a variable along the other. A scatter plot has two variables. A map has latitude and longitude. The relationship between the data and the picture is fixed.

A network graph does not have this. A network graph has nodes, which are things, and edges, which are relationships between things. The data does not tell you where on the screen any particular node should appear. There is no axis. The data tells you only that node A is connected to node B and to node C, and that node D is connected to nobody. Where these nodes appear on the screen is a decision somebody has to make.
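
To make that concrete, here is a minimal sketch in Python of what the data actually contains. The node names and the helper function are illustrative assumptions, not anything Gephi itself defines. Notice that there is not a single coordinate anywhere in it.

```python
# Hypothetical data: the only facts are "these two things are connected".
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C")]  # D is connected to nobody


def build_adjacency(nodes, edges):
    """Return a dict mapping each node to the set of its neighbors."""
    adjacency = {node: set() for node in nodes}
    for source, target in edges:
        # Treat the relationship as undirected for this sketch.
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


adjacency = build_adjacency(nodes, edges)
```

Everything a layout algorithm knows about the graph is in that dictionary. Where A, B, C, and D end up on the screen still has to be decided.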

[calm]

This is the central problem of graph visualization, and it is the problem Gephi exists to solve. The tools it uses to solve it are called layout algorithms, and they are the entire game. Every interesting graph visualization you have ever seen was produced by a layout algorithm working on the data. Change the algorithm and the same data looks completely different. Choose the wrong algorithm and the data looks meaningless. Choose the right algorithm and the data tells a story you did not know was there.

How Force-Directed Layouts Actually Work

The most famous layout algorithm in Gephi is called ForceAtlas2. It was developed by Mathieu Jacomy, one of the original three founders, and published in a paper in twenty-fourteen. The algorithm works on a simple physical metaphor that is surprisingly accurate to what you see on screen.

Imagine every node in the graph is a small ball with the same electrical charge. Like-charged particles repel each other, so all the balls want to push away from all the other balls. Now imagine every edge in the graph is a spring connecting two balls. The spring wants to pull its two balls toward each other. If you let the system run, the balls will find a configuration where the pull of the springs balances the push of the charges, so that connected balls stay close and no ball is too close to any other ball. That configuration is the layout.

In practice the algorithm runs in a loop. It calculates the forces on each node from all the other nodes and all the connected edges. It moves each node a small distance in the direction of the net force on it. Then it recalculates. Then it moves again. After a few thousand iterations, the system settles into a stable configuration, and that configuration is your graph layout.
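
The loop just described can be sketched in a few dozen lines of Python. This is a toy version of the general charge-and-spring idea, not ForceAtlas2 itself. The force formulas, the parameter names and values, the cap on movement, and the example graph of two triangles joined by one edge are all assumptions made for illustration.

```python
import math
import random


def force_layout(nodes, edges, iterations=2000, repulsion=1.0,
                 attraction=0.1, step=0.05, max_move=0.5, seed=42):
    """Toy force-directed layout: all pairs repel, connected pairs attract."""
    rng = random.Random(seed)
    # Start from random positions; the data says nothing about coordinates.
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion: every pair pushes apart, more weakly at a distance.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulsion / dist
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Attraction: every edge pulls its two ends together, like a spring.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += attraction * dx
            force[a][1] += attraction * dy
            force[b][0] -= attraction * dx
            force[b][1] -= attraction * dy
        # Move each node a small, capped distance along its net force.
        for n in nodes:
            mx = step * force[n][0]
            my = step * force[n][1]
            length = math.hypot(mx, my)
            if length > max_move:
                mx, my = mx * max_move / length, my * max_move / length
            pos[n][0] += mx
            pos[n][1] += my
    return pos


# Hypothetical graph: two triangles joined by a single bridging edge.
nodes = ["a1", "a2", "a3", "b1", "b2", "b3"]
edges = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
         ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
         ("a1", "b1")]
layout = force_layout(nodes, edges)
```

The fixed seed is a design choice for this sketch: it makes the random starting positions repeatable, so running the function twice on the same input gives the same coordinates. Comparing every pair of nodes on every pass is fine for a handful of nodes but would be far too slow for a large graph, which is one of the problems real implementations have to work around.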

The thing that is genuinely magical about this is that you do not tell the algorithm where any node should go. You give it only the connections. The visual structure of the resulting picture, the clusters, the bridges, the outliers, all of it emerges from the physics. Communities of densely connected nodes find each other and pull into spheres. Loosely connected outliers drift to the edges. Nodes that bridge two communities sit between them like binary stars. The picture looks designed because the physics has done the designing.

The Specific Aesthetic That Took Over

The reason Gephi maps look the way Gephi maps look is that ForceAtlas2 has very specific behavior. It tends to produce roughly circular arrangements. It tends to space out densely connected clusters more aggressively than older algorithms. It tends to keep edge lengths roughly uniform, which makes the picture readable. And it tends to be reproducible in a way that some older layouts were not.

This last point is important. If you run ForceAtlas2 twice on the same data, you get roughly the same picture. Maybe rotated, maybe mirrored, but the same essential structure. This means a journalist can iterate. You can run the algorithm, look at it, adjust the parameters, run it again, and not feel like you are starting from scratch. You can build up a picture you trust.

The other thing that took over from Gephi is the color and size encoding. In a Gephi map, node size usually represents some quantitative property of the entity, like how many connections it has or how much money flows through it. Node color usually represents some categorical property, like what type of entity it is or which community a community detection algorithm has assigned it to. The thickness and color of the edges encode the strength or type of the relationship. These conventions are not enforced by the software. They are just what the early users settled on, and they have hardened into a visual language that readers now understand without needing to be told.
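
As a rough sketch of that convention, the Python below sizes each node by its number of connections and colors it by its type. The entity names, the two-color palette, and the sizing formula are assumptions chosen for illustration. In Gephi itself you would set these in the appearance settings rather than in code.

```python
from collections import Counter

# Hypothetical edges and node types.
edges = [
    ("Company A", "Officer X"),
    ("Company B", "Officer X"),
    ("Company A", "Parent P"),
]
node_type = {
    "Company A": "company",
    "Company B": "company",
    "Parent P": "company",
    "Officer X": "person",
}

# Quantitative property: how many connections each node has.
degree = Counter()
for source, target in edges:
    degree[source] += 1
    degree[target] += 1

# Categorical property: one color per entity type (palette is an assumption).
palette = {"company": "#8fb8de", "person": "#f2b5a0"}


def node_size(connections, base=10.0, per_connection=5.0):
    """Map a connection count to a display size; the formula is arbitrary."""
    return base + per_connection * connections


style = {
    name: {"size": node_size(degree[name]), "color": palette[kind]}
    for name, kind in node_type.items()
}
```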

[serious]

When you see a graph in the New York Times or the Süddeutsche Zeitung today, with soft pastel clusters of differently-sized nodes connected by gray lines, you are looking at something whose visual grammar was invented by three students in northern France in two thousand seven. That is a remarkable thing for a piece of academic software to have done.

The Cost Of The Tool

Gephi is free, which is wonderful, but Gephi has costs that are not financial. The software is desktop-only, which means there is no collaborative version, no cloud sync, no shared workspace. If two journalists want to work on the same graph, they have to send files back and forth. The performance is bound by your laptop's memory. A graph with one hundred thousand nodes is usually fine on a modern machine. A graph with one million nodes may well crash it.

The release cycle is slow. The team is small and made up of volunteers. The user interface has not been substantially updated in a long time, which means it looks like software from the early twenty-tens, because it largely is software from the early twenty-tens. The data import workflow is finicky. The export options are good but not great. There is no official mobile version, no web embed, no API integration with modern databases.

For investigative work, none of this matters very much. You sit down for an afternoon, you load your data, you run a layout, you style the nodes, you export an SVG, you call it done. The tool is shaped for the task. It is not trying to be a platform. It is trying to be a graph laboratory you spend half a day in and then leave.

What You Might Actually Do With It

Imagine you are looking at twenty-seven mineral exploration permits in a Swedish county. Each permit has a holder, which is a company. Each company has officers, which are people. Some officers appear on multiple companies. Some companies are owned by larger parent companies. There is a structure here you cannot see by looking at a table.

You build a spreadsheet with three columns. Source, target, relationship. You write down every connection. Company A has officer X. Company B has officer X. Company A is owned by parent company P. Parent company P is owned by ultimate parent U. You end up with a hundred-odd rows. You save it as a comma-separated values file.
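
If the connections already live somewhere other than a spreadsheet, the same file can be produced with a short script. The sketch below uses Python's standard csv module. The rows and the filename are hypothetical, and the assumption is that the importer picks up columns headed Source and Target as the two ends of each edge, with any further column carried along as an attribute.

```python
import csv
import io

# Hypothetical rows, following the example in the text.
rows = [
    ("Company A", "Officer X", "has officer"),
    ("Company B", "Officer X", "has officer"),
    ("Company A", "Parent P", "owned by"),
    ("Parent P", "Ultimate Parent U", "owned by"),
]


def edges_to_csv(rows):
    """Return the rows as CSV text with a Source, Target, Relationship header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Relationship"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = edges_to_csv(rows)

# To save it for import (the filename is an assumption):
# with open("permit_edges.csv", "w", newline="", encoding="utf-8") as f:
#     f.write(csv_text)
```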

You open Gephi. You import the spreadsheet. You see a confused blob of nodes and edges. You select ForceAtlas2 from the layout menu. You click run.

The blob unfolds. Companies cluster around their shared officers. The parent company sits at the center of its subsidiaries. One officer appears in the middle of two different clusters, bridging them. Until you ran the algorithm, you did not know those clusters were connected. Until you saw the picture, you would not have asked.
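
Once the picture has made you suspicious of a bridge, you can check it against the data. One simple way, sketched here in Python, is to count how many separate pieces the graph falls into with and without a given node. The names and the helper function are hypothetical, and this is a check you would run outside Gephi, not a feature of it.

```python
from collections import deque


def count_components(edges, removed=None):
    """Count connected pieces of the graph, optionally ignoring one node.

    Nodes left with no edges at all after the removal are not counted.
    """
    adjacency = {}
    for a, b in edges:
        if removed in (a, b):
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = set()
    count = 0
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search marks everything reachable from this node.
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return count


# Hypothetical data: Officer Y sits on the boards of both companies.
edges = [
    ("Company A", "Officer X"),
    ("Company A", "Officer Y"),
    ("Company B", "Officer Y"),
    ("Company B", "Officer Z"),
]
all_nodes = sorted({name for edge in edges for name in edge})
baseline = count_components(edges)

# A node is a bridge here if removing it splits the graph into more pieces.
bridging = [n for n in all_nodes if count_components(edges, removed=n) > baseline]
```

In this small example the only name in the bridging list is Officer Y, the one person whose removal leaves the two companies with nothing in common.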

This is the moment Gephi is built for. The moment when you see something in the picture that you did not know to look for in the data. It happens often enough to be reliable. It is not magic. It is just that human eyes are very good at recognizing structure in spatial arrangements, and very bad at recognizing the same structure in a table of numbers. The algorithm translates one to the other.

The Larger Story

There is something worth saying about the lineage here. Mathieu Jacomy went on to do academic work in network science at Aalborg University in Denmark, where he writes about how visualization shapes interpretation. Sebastien Heymann went on to co-found Linkurious, a commercial graph software company that builds tools used by banks and intelligence agencies. Mathieu Bastian went to work at LinkedIn. The original three are scattered, but Gephi continues, run by a different group of maintainers, releasing slowly, still free.

[calm]

This is the half-life pattern. The original creators do other things. The software continues to work because it solved its problem well enough that it does not need much new development. Bugs get fixed. New plugins get added. The world around it changes, but the core of it does not need to. There are tools that need constant feeding to stay alive, and there are tools that quietly keep working long after their authors have moved on. Gephi is the second kind.

For a working journalist, this is reassuring. You are not betting on a hot startup that might be acquired and gutted. You are betting on a piece of academic software that has held the standard for fifteen years and shows no sign of stopping. You can learn it once and use it for the rest of your career. The skill you build now will still be useful in twenty thirty-six.

Why The Algorithm Is The Real Story

When you sit down with a finished Gephi visualization, the image looks designed. Someone made choices, you assume. Someone decided where to put each company, which clusters to highlight, which edges to thicken. But this is mostly wrong. The algorithm made those choices, from the data, by simulating physics. The journalist's choice was earlier and smaller. The journalist decided what counted as a node, what counted as an edge, and what algorithm to run.

This is a deeper point than it sounds. Most journalism is about deciding what to include and what to leave out. Most journalism is about framing. Network journalism is also about framing, but the framing is mostly upstream of the visualization. By the time you see the picture, the choices have been made. The picture is honest about the data you gave it.

This is why Gephi visualizations have a certain quality of truthfulness that is hard to fake. You cannot really lie with a force-directed layout. You can choose what to include, but once you include it, the algorithm decides where it goes. The clusters you see are real clusters in the data. The bridges you see are real bridges. The outliers are real outliers. The journalism is honest in the way that a good map is honest. It shows you what is there. The rest is up to you.

That is the practical aesthetic of investigative network journalism. Be honest about what you put in. Trust the algorithm to be honest about the shape. Write the story around what the picture shows. The wall of red string in the detective movies was always a metaphor for understanding. Gephi is the wall of red string, but the strings have arranged themselves, and the result is a picture you can publish.