Elephant Factor Study on Libre Health, Mifos, Open Data Kit, and Hot Projects

Elephant Factor is defined as the minimum number of companies whose employees perform 50% of the total contributions to a project.

From the previous post, we know that the Bus Factor is “a measurement of the risk resulting from information and capabilities not being shared among team members”. Elephant factor is the same concept, applied to companies.
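The Elephant Factor definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dashboard's actual computation: the function name `elephant_factor`, the `threshold` parameter, and the sample data are hypothetical, and every affiliation label (including 'Unknown') is treated as an organization.

```python
from collections import Counter


def elephant_factor(contributions, threshold=0.5):
    """Return the minimum number of organizations whose combined
    contributions reach `threshold` of the total.

    `contributions` holds one organization name per contribution.
    """
    counts = Counter(contributions)
    total = sum(counts.values())
    if total == 0:
        return 0

    covered = 0
    # Walk organizations from the largest contributor to the smallest
    # and stop as soon as the threshold share is reached.
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= threshold * total:
            return rank
    return len(counts)


# Hypothetical data: one entry per contribution.
sample = ["Org A"] * 6 + ["Org B"] * 3 + ["Org C"] * 1
print(elephant_factor(sample))
```

A low value means the project depends on very few organizations; a value of 1 means a single organization accounts for at least half of the contributions.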

Organizational diversity does not guarantee more contributions, but it does make communities more resilient. The worst scenario would be, of course, depending on a single company that at some point leaves the project. In that case, the project could become difficult to maintain and, in the worst case, be discontinued. However, this is a typical scenario when a company creates and funds a project. Growing the community beyond that company is the actual challenge.

Continuing with the community layer that we started to analyze in the previous post about the Bus Factor, this post tries to shed some more light on the Organizational Diversity of the four projects we had the opportunity to analyze: Libre Health, Mifos, Open Data Kit, and Hot.

Before starting, it’s worth noting how important affiliations are for this analysis. The dashboard uses a basic domain matching algorithm to set contributor affiliations. In some cases, the system doesn’t know which company corresponds to a given domain, or people use personal e-mail addresses instead of corporate ones. The dashboard allows us to improve the affiliations data through a tool called Hatstall, where we can merge several profiles into one, add affiliations to profiles, mark profiles as bots, and perform other similar operations. Please don’t hesitate to contact us if you need help improving the current data to make results more reliable. To have a glance at current affiliations data, please visit the Affiliations Dashboard.
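To give an idea of what domain matching means and why it leaves gaps, here is a minimal sketch. It is an assumption about the general technique, not the dashboard's real code: the `affiliate` function, the `DOMAINS` mapping, and the addresses are all hypothetical.

```python
def affiliate(email, domain_to_org):
    """Map an e-mail address to an organization using only its domain.

    Unknown domains (including personal e-mail providers) fall back
    to 'Unknown'.
    """
    # Take everything after the last '@' as the domain.
    _, _, domain = email.rpartition("@")
    return domain_to_org.get(domain.lower(), "Unknown")


# Hypothetical mapping, for illustration only.
DOMAINS = {"example-corp.com": "Example Corp"}

print(affiliate("dev@example-corp.com", DOMAINS))
print(affiliate("dev@gmail.com", DOMAINS))
```

The second address shows the limitation described above: a contributor using a personal address ends up as 'Unknown' until someone fixes the profile by hand.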

Based on the data we have, there are some considerations to take into account:

  • Hot is the most reliable source: less than 20% of contributions belong to people affiliated to ‘Unknown’.

  • For Libre Health and Mifos, around 50% of contributions were made by people affiliated to ‘Unknown’, so their analysis could be biased. Still, the remaining 50% of contributions is a reasonable number to provide a view of the actual organizational diversity of the projects.

  • Open Data Kit is a reliable source. Even though most people are affiliated to ‘Unknown’, only around 30% of contributions were made by those people.

The following are the numbers (A) for the Elephant Factor, calculated on the 21st of November 2019 for the last year, while the next ones (B) were retrieved today, also for the last year, using the OSCaaS Dashboard. The two periods overlap to some extent, but comparing them helps to see how these numbers evolve.

It is especially interesting to see how Mifos changed in such a short period of time. Conflux Technologies and Mifos were the two organizations leading contributions in the first period. Currently, however, The Apache Software Foundation accounts for more than 50% of total contributions by itself. The rest of the projects, except Open Data Kit, are maintained mainly by a single organization. As we’ll see later on, this is not an isolated case among open source projects.

Nevertheless, the goal of any open source community should be to involve more and more people under its umbrella. The next step for a community and/or project manager would be to look at the data, find out which other organizations are contributing to the project, and plan specific actions to engage them and increase their commitment to the project.

As in the previous post, we ran another analysis to bring more context to the discussion (C). Wikimedia Foundation, OPNFV, GitLab, and Kata Containers were selected again to make the data comparable. As a reminder, there are huge projects such as Wikimedia and smaller ones such as Kata Containers. We show the data for the last year only, because there was no variation between the two periods we analyzed:

As you can see, having a single company or organization leading a project is common. That could mean that the company is very interested in the project, although most of the time it is probably because that organization created and/or funded the project. Along these lines, it would be great for an open-source community not to rely on a single company, and that’s why in certain cases big companies decide to invest in projects they are especially interested in.

Remember, if you are part of the OSC Hub and would like a similar analysis run for your project, or want your project included in further analyses, please reply to this post and let us know!

The definition of the Elephant Factor was ~~stolen~~ borrowed from the CHAOSS community :slight_smile:

Thanks @alpgarcia!

I really appreciate the format of sharing these analyses. The thoughtful, comprehensive, in-depth analysis is quite insightful.

I see bots on this page with substantial contributions (as would be expected).
Did you include or filter out the bot activity in this analysis?
What would you and others think about assigning bots an organization by themselves?


Two nitpicks:

stolen --> borrowed :wink:

I believe “©” should be (C), but Markdown is converting it to a symbol.

@GeorgLink, thanks for reading!

Yes, I did include the filters for bots and merge commits.

I’m afraid the affiliations need some work to mark those profiles as bots. I didn’t want to do it beforehand to let the community have a glance at the data. This way we can also show how to do it in a future post or during the office hours (next session will be APAC friendly and will take place on 2020-06-25T07:30:00Z).


Thanks @alpgarcia, I appreciate it.

I just realized there was a small bug in some donut charts, which were grouping authors by the Go to Hatstall string instead of using the underlying hash value. It resulted in green donuts, as seen at the bottom of the screenshot:

In short, the dashboard is now fixed by splitting data by author_name instead of author_uuid.

For those who want to know more

The values of the author_uuid field are long and ugly hashes. To avoid displaying them, we use a template, configured in the index pattern, to transform them into links to Hatstall (the web app to manage identity profiles, basically affiliating them to the right organizations or marking them as bots).

That template usually works well in other visualizations, for instance the Authors table on the Affiliations dashboard. You can see there how each row starts with Go to Hatstall, and it corresponds to a different contributor. That’s because:

  • Each row is split by the underlying author_uuid.
  • The template uses the value to build the link.
  • The template is configured to display a static text (meaning always the same string) for the link’s title.

If you hover over the text Go to Hatstall, two small magnifying glasses will appear next to it, allowing you to filter in and out based on the cell value. If you do that, a filter will appear on top of the dashboard. It will say author_uuid: "Go to Hatstall". If you then hover over the filter and click on the rightmost icon (the tooltip will say Edit filter), you can see that the filter is actually using the long and ugly hash under the hood, which is what we expect. To make things clear, Kibana uses author_uuid: "Go to Hatstall" as the filter title, but not for the real query it sends to Elasticsearch.

While I was there, I took advantage of the situation and marked some profiles as bots in Hatstall, following the rule of thumb of looking for bot within their names. Fixing affiliations is really easy and improves the data quality a lot.

However, for some reason probably related to the way Kibana pie charts compute the data, the same reasoning doesn’t apply to the donuts. Unfortunately, the text Go to Hatstall is used to group results under a single big group (the green donut in the screenshot above).

That’s why we decided to use author_name instead. This workaround could lead to errors in case two author profiles have exactly the same name, as we would count their contributions together as a single contributor. Nevertheless, we could always edit their SortingHat profiles through Hatstall and assign them slightly different names to be able to identify them.
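The difference between the three ways of grouping can be illustrated with a small sketch. It does not reproduce how Kibana aggregates data; the commit tuples, the short hashes, and the `label` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical commits as (author_uuid, author_name) pairs.
commits = [
    ("a1f3", "Alex Kim"),
    ("a1f3", "Alex Kim"),
    ("9c2e", "Sam Lee"),
    ("77b0", "Alex Kim"),  # a different person with the same name
]


def label(author_uuid):
    """Mimic a template that shows the same static text for every uuid."""
    return "Go to Hatstall"


# Grouping by the underlying hash keeps every profile separate.
by_uuid = Counter(uuid for uuid, _ in commits)

# Grouping by the displayed text collapses everyone into one slice.
by_label = Counter(label(uuid) for uuid, _ in commits)

# Grouping by name separates most authors but merges namesakes.
by_name = Counter(name for _, name in commits)

print(len(by_uuid), len(by_label), len(by_name))
```

In this made-up data there are three distinct profiles, the static label yields a single group, and grouping by name yields two groups because the two profiles named Alex Kim are counted together.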

That’s all on my side. I hope these details help you understand the dashboard data a bit better :slight_smile:

And now, what questions do you have? :phone:

Thanks for your time reading this post,