pStruct: The Social Life of Data
Stephen DiVerdi, Ethan Kaplan
Traditional web forums force pieces of data (posts) into a strict, predetermined organization, which often does not accurately reflect the natural flow and evolution of discourse. pStruct enables content to organize itself dynamically, based on similarities to other pieces of data, as well as users' interaction with the forum. The result is an unstructured graph that responds in life-like ways to the interaction of data and users.
pStruct is built on a multithreaded Java architecture designed to maintain system responsiveness when faced with hundreds of users and millions of pieces of content. Every post to the forum is stored in a database for archival purposes. A subset of the posts are kept in memory as 'live' content which users are presented with and can interact with. When a post is no longer live, it is saved to the database for later retrieval. Each live entity runs as a separate thread, maintaining connections to other entities (posts, users, etc.), responding to requests and seeking out new relationships. While pStruct is currently built to act as a web forum backend, the architecture is general enough to allow for management of any data storage and content retrieval system.
Organization of the graph occurs by the maintenance of connections between nodes. When a node is created (a user posts a reply, for instance), the new node immediately seeks out related nodes (eg. similar content) and establishes connections. Connections are dynamically updated in response to a variety of input. As time progresses, the strength of a connection degrades, so that old connections eventually disappear. As users traverse connections, by viewing one piece of data and then another, the connection is reinforced. Thus, connections that are meaningful to users navigating the graph will be made stronger and will last longer. As new nodes are entered into the system, existing nodes will examine them to see if they are potentially related, and establish additional connections.
Understanding the structure of a self-organizing graph is a difficult task. To aid in analysis, there is also a visualization component of pStruct. An interactive 3D representation of the live forum content is created, showing which nodes are live, what connections currently exist, what users are active, etc. The structure of the graph is exploited to cluster nodes into groups that are strongly related based on strength of connections. This replaces the threading functionality of traditional web forums, allowing nodes to move among different clusters dynamically, as users and content change around them.
The visualization is broken down into a series of steps. First, a 3D structure must be generated for an non-planer, undirected graph, that attempts to relate distance between nodes to their level of connectedness. To accomplish this, a spring embedding is computed, by treating edges in the graph as springs and calculating timesteps of a dynamic force simulation. The nodes start in an initial configuration and evolve into a shape based on the strength of connections. To determine clusters within the graph, algorithms such as normalized graph cut or k-means are used. Graph cut uses the edges in the graph to compute a globally optimal partition, recursively until a threshold is reached. K-means, on the other hand, uses the euclidian distance of the nodes after embedding to determine clusters.
Rendering is the final step of the visualization. Each node is rendered as a gaussian texture, with alpha blending for smooth grouping effects. The blending can be additive, simulating the effect of bright light sources, or more traditional transparency, making the nodes look like clouds of smoke. Edges are rendered on top with intensity proportional to strength. Colors are chosen randomly to represent cluster membership.
E. Kaplan, S. DiVerdi.
Data Organization and Visualization in a Web Community Environment.