# Evolution of a Plot: Better Data Visualization, One Step at a Time

My latest post on data.blog

The goal of data visualization is to transform numbers into insights. However, default data visualization output often disappoints. Sometimes, the graph shows irrelevant data or misses important aspects; sometimes, the graph lacks context; sometimes, it’s difficult to read. Often, data practitioners “feel” that something isn’t right with the graph, but cannot pinpoint the problem.

In this post, I’ll share the process of visualizing a complex issue using a simple plot. Despite the fact that the final plot looks elementary and straightforward, it took me several hours and trial-and-error attempts to achieve the result. By sharing this process, I hope to accomplish two goals: to offer my perspectives and approaches to data visualization and to learn from other options you suggest. You’ll find the code and the data used in this post here.

### Plotting power distribution in the Knesset

This post is devoted to a graph I created to explore…

View original post 1,738 more words

# 16-days work month — The joys of the Hebrew calendar

Tishrei is the seventh month of the Hebrew calendar that starts with Rosh-HaShana — the Hebrew New Year*. It is a 30 days month that usually occurs in September-October. One interesting feature of Tishrei is the fact that it is full of holidays: Rosh-HaShana (New Year), Yom Kippur (Day of Atonement), first and last days of Sukkot (Feast of Tabernacles) **. All these days are rest days in Israel. Every holiday eve is also a de facto rest day in many industries (high tech included). So now we have 8 resting days that add to the usual Friday/Saturday pairs, resulting in very sparse work weeks. But that’s not all: the period between the first and the last Sukkot days are mostly considered as half working days. Also, the children are at home since all the schools and kindergartens are on vacation, so we will treat those days as half working days in the following analysis.

I have counted the number of business days during this 31-day period (one day before the New Year plus the entire month of Tishrei) between 1993 and 2020 CE, and this is what we get:

Overall, this period consists of between 15 to 17 non-working days in a single month (31 days, mind you). This is how the working/not-working time during this month looks like this:

Now, having some vacation is nice, but this month is absolutely crazy. There is not a single full working week during this month. It is very similar to constantly interrupted work day, but at a different scale.

So, next time you wonder why your Israeli colleague, customer or partner barely works during September-October, recall this post.

(*) New Year starts in the seventh’s month? I know this is confusing. That’s because we number Nissan — the month of the Exodus from Egypt as the first month.
(**)If you are an observing Jew, you should add to this list Fast of Gedalia, but we will omit it from this discussion

# Interview with a WordPress.com data scientist

Two weeks ago, I gave an interview to Matthew Kaboomis Loomis from http://www.buildyourownblog.net. This was my first time, and I was pretty nervous. During the interview, Matthew and I talked about the recent findings that I have published in my previous post. Surprisingly, I really enjoyed the interview.

Click on the image below to see the interview on Matthews’ blog.

# A problem shared is a problem halved

## Social factors that promote persistent blogging

Socially active bloggers, and bloggers who write in teams tend to write a blog for longer times.

Writing a blog is a hard and demanding task. It requires creativity, dedication and persistence. Data shows that a large percentage of bloggers stop posting after a couple of months, and most of them don’t survive for more than a year. What is the force that drives the successful bloggers? What factors distinguish the persistent bloggers from the “quitters”? Is there anything a person can do to increase his or her chances to keep blogging and not to quit?

### Competing theories

The research that I will present here was performed in collaboration with Lior Zalmanson, a post-doc researcher at Stern School of Business New York University. In this study, we asked ourselves whether people can increase their chance to keep blogging by joining forces.

Whether people work better in groups or as individuals is an open and long-debated question in social psychology. On one hand, there is a phenomenon of “Social Loafing” — a phenomenon that was first described by a French agricultural engineer Maximilien Ringelmann in 1913. According to Ringelmann’s findings, having group members work together on a task results in significantly less effort than when individual members are acting alone [Wikipedia, Original paper].

On the other hand, Otto Köhler, a German psychologist, has found in 1926 that the weaker group members strive to keep up with the accomplishment of the other group members, which results in an overall performance improvement [Enc Britanica]. The effect of “Social Compensation” was suggested by Karau and Williams in 1991. According to their observations, a worker will work harder in groups, compensating for those who work less.

### The connected world of WordPress.com bloggers

Which of the two competing theories is more applicable towards the world of bloggers? To shed some light on this question, we studied the blogging patterns of WordPress.com users. WordPress.com is a platform that hosts more than 110,000,000 sites that belong to more than 102,000,000 registered users. To better understand the implications of social interactions on blogging activities, we analyze the links between WP.com users and blogs.
This analysis results in a mathematical structure called ‘graph’ that contains different types of interactions among various kinds of entities.

Let’s consider a simple example. Alice who wrote a post on her blog. The fact of publishing a post created a relationship between Alice and her blog. We will call this connection “IS_CONTRIBUTOR”. At some point, Bob joins Alice and writes a post to the same blog. Now, both Alice and Bob are contributors of that blog.

As time passes by, Bob continues submitting content to the blog and Alice doesn’t. To reflect this difference between the two, we define a “weight” of a link — the more content a user contributes to the blog, the higher is the weight. If a user, Alice in our case, doesn’t write new content to the blog her connection to the blog gradually disappears until at some time we will delete the link. We will no longer consider Alice as a contributor to B1.

At some point, Charlie reads Alice’s blog post and presses the “like” button. By clicking this button, a connection between Charlie and Alice is created. We will call this relationship “LIKES_AUTHOR”. We consider bringing a new audience to a blog as a contribution to that blog. Thus, when Charlie “likes” Alice’s post, he also increases the “IS_CONTRIBUTION” link between Alice and the blog.

For the sake of our discussion, let’s assume that Alice writes posts in another blog, we will call it B2. It turns out that Daphne and Eve are also authors in B2. Charlie, whom we already met, is also writing a blog, all by himself.

We want to know how collaboration affects authors’ persistence. In other words: does the number of collaborators an author has have an impact on the probability that that author will keep blogging for a longer time. In this toy example, Alice has three collaborators (Bob, Daphne and Eve), Daphne and Eve have both two collaborators; Bob has only one collaborator (Alice), and Charlie has no collaborators at all. In order not to upset writers who write alone, we will consider a person a partner of him- or her- selves. Thus, for the purpose of our analysis, Alice has four collaborators (including herself), Daphne and Eve have three, and Bob has two collaborators.

Author # of collaborators
Alice 4
Bob 2
Charlie 1
Daphne 3
Eve 3

What we see here is a small example. In reality, WordPress.com users form a large complex network of people and blogs. Since the connections between the nodes in this network can appear and disappear, this network is in constant change. One of the interesting things that we can do with this kind of dynamic systems is to discover user communities. I have already presented one such an analysis in the past (see this presentation) and will certainly show more of it in the future. Meanwhile, note that blogs don’t exist in a vacuum. Most of the time, people who write don’t write for themselves, they seek an audience. Writing inside a large platform of many interconnected communities brings the author closer to such an audience and promotes discussion and exchange of ideas.

### Collecting the data

There are many types of communities in the online world. There are also many types of collaboration between people. Let us now concentrate on a very specific kind of community — a community of writers. In this analysis, we treat a blog or a website as a gathering point and all the writers in that website as the community members. We can tackle the question raised at the beginning of this post: what makes some bloggers keep blogging?
To answer this question, we look at people who opened a WordPress.com account during the thirteen months from Jan 2013 and Feb 2014. WordPress.com is a home of large professional (VIP) customers such as NBC Sports, TED, CNN, Time and others (https://vip.wordpress.com/clients/). It would be unfair to include these professional writers in our analysis. Unfortunately, some people use WordPress.com for spamming, fraud and other non-legit activities; we have removed those people from the analysis. Many people open a WordPress.com account, write a test post and go away. We did not like to include such users in this study, so we only included those people who gave or received, at least, one “like”, as a substitute for a minimal level of quality and commitment. Last but not least, we excluded the 400+ Automattic employees and contractors from the analysis. In Automattic, we use WordPress.com for internal communication purposes and are a clear outlier in any analysis.

When we look at monthly snapshots of the dynamic networks mentioned above, we collect various descriptive statistics about the users who registered two to three months before the snapshot. We believe that this period is long enough for novice users to accommodate with the platform and to gain some popularity and social ties. Next, we look at the monthly snapshot made one year later and check whether a current user appears in the network as a contributor to at least one blog or not. We consider the users who appear in this graph as survivors. For the purpose of this analysis, we completely ignore what happens during the entire year — between the two snapshots.

In the end, we need to analyze the connection between two sets of information. On one hand we have the descriptive statistics collected at the data collection point. On the other side of the equation, we have the survival information. Let’s see the results.

### Silver bullet of blogging success?

Approximately 570,000 authors met the study inclusion criteria. We don’t analyze these authors alone but in a context of a network that contains about 5,000,000 users. What factors promote author survival? In the next paragraphs, I will show several similar graphs. In these graphs, the Y-axis will show the probability to continue blogging, as a function of the variable presented on the X-axis. On that axis, below the number that represents variable value, you will find the number of people for whom this true.

The more incoming likes authors receive, the higher is their survival probability

The figure above shows how the survival probability depends on the number of likes an author received two to three months after the registration. We can see that there were 172 thousand users who did not get any “like” at the data collection point. These authors had a 2% probability to stay active at the end of the study. Approximately 213 thousand users with one “like” had the probability to keep blogging of 4% — a two-fold increase. Overall, we see that the more “likes” a person receives, the higher is the survival probability. Users who received more than 32 likes (there were only 800 of them) are 43% likely to keep blogging (the shaded area represents the 95% confidence interval — how confident we are in the survival estimate).

Can this graph help an aspiring blogger? Not directly. It doesn’t surprise that an author who is capable of producing high-quality content that is also popular with a broad audience will also be likely to continue blogging in the future. Plus, authors can’t directly affect the number of likes they receive. Unlike receiving likes, pressing the “like” button on other’s content is under a person’s direct control. The following graph shows survival probability as a function of how many “likes” a user gave to others.

The more likes authors give, the higher is their probability to survive

Generally speaking, the more socially active the authors are, the more likely they are to keep blogging. It is interesting to note that there is such a thing as too much love: authors who pressed the “like” button more than 32 times during their first two to three months are less likely to survive, compared to the immediately preceding group.
At first glance, one might take the two graphs above as a magic recipe for blogging success. Doing this will be a mistake. It is correct that we have found a correlation between the number of incoming and outgoing likes and the survival probability. This connection does not tell us which of the two sides of the association is the cause and which one is the result. This uncertainty, which is often called the correlation-causation question, is very hard to solve. However, based on my general knowledge of human psychology, I claim that social interactions are partially responsible for the high correlation that we see in the data.

### A problem shared is a problem halved

Let’s get back to our original question: do authors who collaborate with others have a higher probability to keep on blogging? In the terms of our example network, does Alice, who has four collaborators, have greater chance to survive, compared to Charlie, who writes by himself? The answer is yes. According to the data we have, authors who write by themselves have only a 6% probability to keep blogging till the end of the study. Authors who write in pairs have a 9% probability to stay — a 50% increase. Interestingly, subsequent addition of collaborators has almost no effect, until we see teams of more than 16 authors.

Authors who write in groups are more likely to continue writing

### Does fairness matter?

This discussion started with a description of two competing theories that try to predict the outcome of a joined effort. Both theories talk about “weaker” and “stronger” participants. It turns out that we can measure the extent to which a person contributes to a blog and the extent to which blog’s authors provide equal contributions. Let’s get back to our example network. Suppose that, at some point we notice that Alice, Bob and Charlie collaborated in submitting content to a blog (B1). Alice wrote most of the posts and attracted most of the “likes”; Bob wrote from time to time; and Charlie contributed only one post, long time ago. If we measure link strength between each author and this particular blog, we see that Alice’s contribution is 10, Bob’s contribution is 0.5 and Charlie’s contribution is 0.01. On the other hand, Charlie and Daphne and Eve occasionally write a joined blog (B2) to which they contribute equally, with tie strength values of 0.1 each.

Will Alice feel abused by Charlie and Bob and cease her collaboration with them? Is it possible that Charlie will feel insecure about his weak cooperation with Alice and Bob, and will only write to B2? We can record how unbalanced a user’s input is in a given blog. We will measure this type of inequality such that authors who contribute less than the average will receive negative values; authors who provide the average contribution will have inequality value of zero and authors who perform most of the job will have a positive inequality score. We compute this score for each person-blog connection, which means that people who contribute to several blogs will have several inequality scores.

Blog Author Contribution Inequality score
B1 Alice 10 1
Bob 5 0
Charlie 0.1 -6
B2 Charlie 0.1 0
Daphne 0.1 0
Eve 0.1 0

What is the probability of a given author to continue contributing to a particular blog, given how equal (or unequal) person’s contribution to this blog is. Note, that a user who contributes to several blogs may stop writing to one blog, but continue adding content to another one. In our latest example, Charlie may decide that he doesn’t want to write to B1 anymore and will contribute only to B2.
Generally speaking, there may be four types of connection between the inequality score and the persistence probability as schematically depicted in the figure below:

One possible outcome is that the more a person contributes to a blog, the higher is the chance that he or she will continue writing to that blog (case A in the figure above). Another possibility is that the opposite is the truth: the more a person contributes to the blog, the smaller is the chance to keep writing (for example, due to the sense of unfairness; case B). Another possibility is that the average contributing people will have a higher chance to drop off, leaving only the super engaged and occasional contributors (case C). Finally, an opposite possibility exists, where average-contributing authors have the highest chance to persist.

According to our data, option A seems to describe better the reality. Contrarily to my initial belief, people don’t mind carrying the load. To some extent, authors who contribute more than the average to the joined effort are more likely to keep writing.

Authors who contribute more are more likely to survive

There is another level to analyze contribution inequality. If you go back to our latest example, you’ll notice that the contributions to B1 are very balanced, while the contributions to B1 are not. To quantify such a balance we use the Gini index. The Gini index is in extensive use in economics. It was developed to measure the extent to which the distribution of a limited resource among individuals deviates from a perfectly equal distribution. The Gini index ranges from zero to one. The value of zero means total equality, and one means the strongest inequality possible. Taken in the example above, blog B1 has the Gini index of 0.44 and B2 has the index of 0.
Having computed the Gini index for all the blogs in our study, we can analyze the connection between the Gini index and the probability of a blog to survive one year after the analysis. Similarly to the previous case, there are four possible relationships between the measured value (Gini index) and the probability:

I must admit that before the analysis, I expected to see a curve similar to the case B. I was very surprised (and super excited) to see that sites with extreme input inequality have much higher survival probability, compared to the overall equal blogs.

Blogs with highly dominant contributors are more likely to survive

### Conclusions

Let me go back to the questions I asked at the beginning of this post.

Q: What is the force that drives the successful bloggers?

Q: What factors distinguish the persistent bloggers from the “quitters”?

I have shown an empirical evidence that social interaction with other authors is strongly correlated to blogging success. Does social interaction bring blogging success, or are highly motivated authors also active in inter-personal activity? I don’t know for sure, but I assume that the reality is a combination of both. Thus, when choosing a blogging platform, make sure you can interact with your readers and with bloggers with similar interests. Make sure to connect with them. You can only gain from these connections.

Q: Is there anything a person can do to increase his or her chances to keep blogging and not to quit?

Team up! The evidence is clear: people who write in groups are more likely to keep writing. Every project needs a leader (recall the inequality graphs). If you are already motivated, don’t be afraid to lead. On the other hand, don’t be scared to be in the shadow of a dominant partner. Unbalanced contribution to a joint project is not about unfairness but is about leadership. Remember that 1% of fame is better than 100% of anonymity.

The information in this post was first presented at WordCamp Israel that took place in Jerusalem on March 28, 2016. This study was performed in tight collaboration with Lior Zalmanson, a post-doc researcher in the Stern School of Business, New York University.

# Comparing the incomparable — expanding the time map concept

We are surrounded by discrete events: posts and comments, purchases from the online store, page visits are all examples of discrete events. Some of these events happen periodically; others happen sporadically. Some happen once in a while; others are generated every microsecond. There are many ways to visualise such streams, the most naive one being a one-dimensional plot in which events are placed on a time axis, as demonstrated in the figure below for three simple cases.

But what happens if there are too many such events? How do we gain a global overview? Recently, I have learned about “Time maps” — a creative way to perform such a visualization, which I will describe below using examples from WordPress.com. I will expand this approach to accommodate for bringing cases with drastically different time scales to the same domain, for easier comparison and pattern discovery.

## Time maps

Instead of plotting the absolute time values, or time differences, Max Watson from District Data Labs suggests representing each data point in two dimensions — in terms of time passed before and after the events. Thus, all the data points from case A in the figure above will be represented by the pairs $(1; 1)$ as all of them occur at regular intervals of 1 second. Most of the dots in case B from the same figure will be represented by the same pair, $(1; 1)$. However, the pause between the event at time marks 3 and 6 seconds and not 1. Thus, the third point will be represented by $(1; 3)$ and the fourth one — by $(3; 1)$. Following are the time maps for the three toy cases from the figure above:

Note that all the dots in case A overlap. Now we can identify several interesting regions in these time maps. Events that happen at equal intervals lie on the diagonal line and overlap. Events that occur at approximately equal intervals will be distributed over the diagonal. The upper right portion of the map represent events that happen at a slow pace while the lower-left quadrant of the graph represents fast events. Long interruptions followed by recovery manifest as large deviations from the diagonal — the interruption itself (high $\Delta t after$) is shown as a dot above the diagonal and the event after the interruption (high $\Delta t before$) appears as a dot to the right of the diagonal.

Let us examine some real-life events. I’ll start with publishing posts or pages on some of the WordPress.com blogs.

Following is a time map of a particularly busy site with more than 7,700 posts. On the right, you may see post and pages time map in which each point is colored according to the GMT in which the post was published. Note the logarithmic scale of the plot that shows events at time intervals of several orders of magnitude — from seconds to weeks. As you may see, the vast majority of publishing events occur at intervals of several minutes to one hour. The lines of purple dots represent breaks of approximately eight hours in publishing new posts — something that is consistent with a medium size editorial team. To the right, I present a landscape representation of the same time map. The dark blue bulb in the graph on the left shows that the vast majority of the events occur at regular intervals between several minutes and one hour.

Time map of another busy site, with its 6511 posting events, shows a slightly different publishing pattern.

In this news site, most of the events occurred at one-minute intervals. The typical service break in this site is either one or two days long, as represented by the two distinct strips.
Using the time map approach is relatively easy to identify bots. For example, the owners of the following site have uploaded more than 1,000 posts such that most of them were published at a one-minute interval — a pattern that is unlikely to be performed by a human.

What about smaller blogs? This Norwegian blogger posted less than 150 posts with a typical time interval between a day and week.

I have analyzed time maps of different administrative tasks performed by  several random WordPress.com users.
This user, for example, is a moderately active blogger with a single blog. Their time map is characteristic of sporadic events with intervals of several seconds to one day. We may see that this user is very disciplined and does not leave the blog for more than a day.

On the other hand, another user, who has two blogs presents a completely different time map with thousands of events within the range of micro- and milliseconds. Notice that the two types of events (the super-fast ones and the “normally paced” ones) are well separated and that the super-fast events occur at very specific hours. Does this mean that that user’s computer is affected by a virus? Does that user use an aggressive script to monitor their site? This is a topic for another post.

## Bringing the maps to the common scale

You may have noticed the vast differences in time scales that we saw in the maps above. Sometimes, we are interested in map’s shape, so that we can compare similar patterns that occur at different time scales. To achieve this task, I first sort time difference values in each case (user, blog, log record, etc.). Next, I transform these values such that the median time difference (the one that is larger than half of the observed points) gets the value of 0. Time differences larger then the median receive positive values and time differences smaller than the median receive negative values. In technical terms: I first compute the empirical cumulative distribution of the time differences, and then perform logit transform.

The following figure shows the normalised maps for four blogs I have mentioned at the beginning of this document . We may see that the elliptical shape of the dot cloud represents a series of events that occur somewhat regularly at a relatively low frequency. Round-shaped graphs represent cases in which the typical variability in inter-event periods is comparable to that period. Bot-induced time maps are still easily detected in this representation

Let us now examine the normalised time maps of the two users mentioned above:

One interesting thing to note is the fact that the lower-left region in the second user’s graph is now significantly enlarged because it contains a much larger amount of data points. As with the case of the blog time maps, the approximately round shape of the first user’s time map indicates that the variability of the between-event interval is comparable in its size to the interval. Note, though, that when observing first user’s raw map, it was clear that that user’s events mostly occur not faster than once a second, and that there the maximal interval is never larger than one day. Such a notion, which may be valuable in many cases, is absent from the normalised graph.

## Conclusions

Time maps, as developed by Max Watson from District Data Labs provide a valuable insight in the analysis of event streams. It is evident that these maps reveal interesting activity patterns. Analysis of such patterns may be used in spam and fraud detection.

The proposed normalisation of time maps provides a tool for comparing the shapes of drastically different maps. The following figure, for example, presents time maps of Premium and Business purchases performed in Japanese Yens between June 1st and Dec 20th, 2015 :.

I selected the Yen as the least frequent currency used in WordPress.com. Had I analysed the same data for USD, we would see a completely different raw map, but comparable normalised one: