Notes from ClearlyDefined chat on 2020-02-21

On 2020-02-21 we had a preliminary brainstorming meeting via Jitsi about ClearlyDefined (CD) with @sconrad, @nribeka, Carol Smith (from CD), and me. The goal was to evaluate whether CD might be useful in the work our team is doing for DIAL’s ICT4SDG product registry and in other work cataloging open source public goods. Here are a few rough notes from that chat, for future reference as we move forward.

Purpose of ClearlyDefined

  • At its core, CD only deals with facts about projects, and to some extent the communities building them.
  • CD can translate data from other sources. This data can be stored in a ClearlyDefined service, as it currently is on their website, or it can point to external sources.
  • The long-term goal of CD is essentially “automated curation”. Much of this curation is done by humans today, but it is done in git repositories so the work can be tracked and understood over time.

How it works

  • CD has 3 core components:

    • Crawler - Works from coordinates (which identify a product): it gets the data and runs tools/scanners on it. Is this extensible? The crawler harvests the data and the service transforms it. All the components are on GitHub: https://github.com/clearlydefined
    • Service - Extracts and summarizes the data that people are actually interested in, producing the component definition.
    • Website - Humans curate the data and fix any issues, via PRs on the CD GitHub repo.
  • CD uses multiple open source tools to analyze packages / open source projects. It crawls projects for information and then stores the raw information in a clearly defined structure (the ‘raw’ data you see on the CD website).

  • Multiple crawl processes against the same coordinate will yield the same result. Coordinate components are: type (maven, git, npm) + provider (npm registry, github) + optional name for maven (I think this is because maven has both an artifact and a project name) + component name + version. The hope is to minimize the need for other folks to repeat the same process over and over, because the result is immutable. Raw data from CD is exposed in case other projects want to use it.

  • Curated data is introduced when needed. This curated data will be proposed to the component source location via a pull request. CD will then reprocess the component (hopefully with the curated data) and the curation will no longer be needed. The example given was a license incorrectly identified as MTI when it should be MIT. In such a case, after curation CD will propose that the component amend the license to MIT. Curations are also stored in CD data.
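
To make the coordinate and curation ideas above a bit more concrete, here is a rough Python sketch. It is only my reading of the notes, not CD's actual code or API: the `Coordinates` field names, the slash-separated path layout, the "-" placeholder for a missing optional name, and the `apply_curation` helper are all assumptions for illustration, as is the example package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Coordinates:
    """Identifies one version of one component (field names are assumed)."""
    type: str                  # e.g. "maven", "git", "npm"
    provider: str              # e.g. an npm registry or github
    namespace: Optional[str]   # the optional extra name, e.g. for maven
    name: str                  # component name
    version: str

    def path(self) -> str:
        # "-" stands in for a missing namespace (an assumption of this sketch).
        parts = [self.type, self.provider, self.namespace or "-", self.name, self.version]
        return "/".join(parts)


def parse(path: str) -> Coordinates:
    """Split a five-part slash-separated path back into Coordinates."""
    type_, provider, namespace, name, version = path.split("/")
    return Coordinates(type_, provider, None if namespace == "-" else namespace, name, version)


def apply_curation(harvested: dict, curation: dict) -> dict:
    """Return a new dict where curated fields override harvested ones."""
    return {**harvested, **curation}


# Hypothetical example: a scanner misreads the license and a human curation fixes it.
coords = parse("npm/npmjs/-/lodash/4.17.15")
harvested = {"license": "MTI"}
curations = {coords: {"license": "MIT"}}
definition = apply_curation(harvested, curations.get(coords, {}))
```

Because the coordinates are frozen (immutable and hashable), they can be used directly as the lookup key for both harvested data and curations, which matches the idea that the same coordinate always refers to the same result.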

Future directions and possibilities

  • CD has discussed expanding into “meta” tracking of security/accessibility data maintained in other sources, in addition to the license data that they have.
  • If we want to define new metrics, we can express the data in YAML format that people can curate.
  • The CD framework can provide the mapping between different data sources - we could have evaluation data sources from entities using CHAOSS metrics, or OSC evaluations, for example.