License Detection Updates

The Problem:

The goal was to reduce false positives in ScanCode license detection results, especially unknown-license-reference detections and approximate detections that report best-guess license_expressions. To tackle this, the following solution elements were discussed and implemented:

  1. Reporting the primary, declared license in a scan summary record

  2. Tagging mandatory portions in rules #2773

  3. Adding license detections by combining multiple license matches #2961

  4. Integrating the existing scancode-analyzer tool into SCTK to combine multiple matches based on statistics and heuristics #2961

  5. Reporting license clues when the matched license rule data is not sufficient to create a LicenseDetection #2961

  6. Adding a web app for efficient scan and review of a single license, to ease reporting license detection issues nexB/scancode.io#450

  7. Applying LicenseDetection to package license detections as well #2961

  8. Renaming resource and package license fields #2961

Some other elements are still work in progress; see issue #3300 for more details.

What is a LicenseDetection?

A LicenseDetection is a detection that groups one or more LicenseMatch objects and produces the license expression that is finally reported.

Properties:

  • A file can have multiple LicenseDetections (separated by non-legalese lines)

  • A LicenseDetection can come directly from a file or from a package.

  • We should be mostly certain of a proper license detection before reporting a LicenseDetection, i.e. false positives and wrong license matches should ideally have been removed or improved.

  • One LicenseDetection can have matches from different files, in the case of local license references.

  • We don’t remove any matches from a detection; we only add more matches, to rectify and correct the license_expression.

There are also two levels of reporting license detections:

  • File/package level License Detections

  • Codebase level unique License Detections (summarized from the file/package level detections)

Examples

A License Intro example:

Consider the following text:

/*********************************************************************
* Copyright (c) 2019 Red Hat, Inc.
*
* This program and the accompanying materials are made
* available under the terms of the Eclipse Public License 2.0
* which is available at https://www.eclipse.org/legal/epl-2.0/
*
* SPDX-License-Identifier: EPL-2.0
**********************************************************************/

The text:

"This program and the accompanying materials are made\n* available under the terms of the"

is detected as unknown-license-reference with is_license_intro set to True, and is followed by several epl-2.0 detections.

This can be considered a single LicenseDetection with epl-2.0 as its detected license expression. The matches of this detection still include the unknown-license-reference match, but that match does not appear in the final license_expression.
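The license intro handling described here can be illustrated with a minimal sketch. The Match class and the detection_expression function are hypothetical, simplified stand-ins and not the ScanCode API; joining the remaining expressions with AND is also an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical, simplified stand-in for a LicenseMatch.
    license_expression: str
    is_license_intro: bool = False


def detection_expression(matches):
    """Build one expression for a group of matches, skipping intro matches."""
    expressions = []
    for match in matches:
        # Intro matches stay in the detection but do not contribute
        # to the reported expression.
        if match.is_license_intro:
            continue
        if match.license_expression not in expressions:
            expressions.append(match.license_expression)
    # Assumption: distinct expressions are combined with AND.
    return " AND ".join(expressions)


matches = [
    Match("unknown-license-reference", is_license_intro=True),
    Match("epl-2.0"),
    Match("epl-2.0"),
]
print(detection_expression(matches))  # epl-2.0
```

All three matches remain part of the detection; only the expression leaves out the intro match.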

A License Reference example:

Consider the two following files:

file.py:

This is free software. See COPYING for details.

COPYING:

license: apache 2.0

Here an unknown-license-reference is detected in file.py, and it actually references the license detected in COPYING, which is apache-2.0.

This can be considered a single LicenseDetection containing the license matches from both files, with a concluded license_expression of apache-2.0 instead of unknown-license-reference.
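A rough sketch of how such a local reference could be followed is given here. The resolve_reference function and the matches_by_path mapping are hypothetical; only the referenced_filenames attribute name comes from the rule data shown in this document, and the real ScanCode logic may differ.

```python
UNKNOWN_REFERENCE = "unknown-license-reference"


def resolve_reference(matches_by_path, path):
    """Combine a file's matches with the matches of the files it references."""
    own_matches = matches_by_path[path]
    referenced = []
    for match in own_matches:
        for filename in match.get("referenced_filenames", []):
            referenced.extend(matches_by_path.get(filename, []))

    combined = own_matches + referenced
    expressions = [m["license_expression"] for m in combined]
    if referenced:
        # The reference was resolved, so the unknown reference is dropped
        # from the expression (the match itself is kept).
        expressions = [e for e in expressions if e != UNKNOWN_REFERENCE]
    unique = list(dict.fromkeys(expressions))
    return {"license_expression": " AND ".join(unique), "matches": combined}


matches_by_path = {
    "file.py": [
        {
            "license_expression": UNKNOWN_REFERENCE,
            "referenced_filenames": ["COPYING"],
        }
    ],
    "COPYING": [{"license_expression": "apache-2.0"}],
}
detection = resolve_reference(matches_by_path, "file.py")
print(detection["license_expression"])  # apache-2.0
```

If the referenced file is missing or has no matches, this sketch keeps unknown-license-reference in the expression.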

Changelog Summary

  • There is a new license_detections codebase level attribute with all the unique license detections in the whole scan, both in resources and packages.

  • The data structure of the JSON output has changed for licenses at resource level, with new attribute names: licenses -> license_detections and license_expressions -> detected_license_expression, the latter also with an SPDX version. Each license detection has the attributes license_expression, identifier and matches, plus a detection_log that is optionally present when the --license-diagnostics option is enabled.

  • License detections are now reported for packages as license_detections, and the data structure of the license attributes in package_data and in the codebase-level packages has also been updated: license_expression -> declared_license_expression (also with its SPDX version), declared_license -> extracted_license_statement, and secondary license detection data in other_license_expression and other_license_detections.

  • Instead of reporting one match for each license key of a matched license expression, we now report one single match for each matched license expression, avoiding data duplication. Inside each match, we also list the match and matched rule attributes directly, to avoid nesting.

  • License and rule reference data is no longer reported at match level in license detections. Instead, it is reported at codebase level, with the new CLI option --license-references, as two new attributes: license_references and license_rule_references, which list the unique detected licenses and license rules with their details.

Change in License Data format: Resource

The data structure of the JSON output has changed for licenses at file level:

  • The licenses attribute is deleted.

  • A new license_detections attribute contains the license detections in that file. Each detection has the attributes license_expression, identifier, detection_log and matches. matches is a list of license matches and is roughly the same as licenses in the previous version, with the additional structure changes detailed below.

  • A new attribute license_clues contains license matches with the same data structure as the matches attribute in license_detections. This contains license matches that are mere clues and were not considered to be a proper conclusive license detection.

  • The license_expressions list of license expressions is deleted and replaced by a detected_license_expression single expression. Similarly spdx_license_expressions was removed and replaced by detected_license_expression_spdx.

See the before/after results for a file to compare the changes.

Before:

{
  "licenses": [
    {
      "key": "apache-2.0",
      "score": 100.0,
      "name": "Apache License 2.0",
      "short_name": "Apache 2.0",
      "category": "Permissive",
      "is_exception": false,
      "is_unknown": false,
      "owner": "Apache Software Foundation",
      "homepage_url": "http://www.apache.org/licenses/",
      "text_url": "http://www.apache.org/licenses/LICENSE-2.0",
      "reference_url": "https://scancode-licensedb.aboutcode.org/apache-2.0",
      "scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/apache-2.0.LICENSE",
      "scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/apache-2.0.yml",
      "spdx_license_key": "Apache-2.0",
      "spdx_url": "https://spdx.org/licenses/Apache-2.0",
      "start_line": 1,
      "end_line": 1,
      "matched_rule": {
        "identifier": "apache-2.0_65.RULE",
        "license_expression": "apache-2.0",
        "licenses": [
          "apache-2.0"
        ],
        "referenced_filenames": [],
        "is_license_text": false,
        "is_license_notice": false,
        "is_license_reference": false,
        "is_license_tag": true,
        "is_license_intro": false,
        "has_unknown": false,
        "matcher": "1-hash",
        "rule_length": 4,
        "matched_length": 4,
        "match_coverage": 100.0,
        "rule_relevance": 100,
        "is_builtin": true
      },
      "matched_text": "License: Apache-2.0"
    }
  ],
  "license_expressions": [
    "apache-2.0"
  ]
}

After:

"detected_license_expression": "apache-2.0",
"detected_license_expression_spdx": "Apache-2.0",
"license_detections": [
  {
    "license_expression": "apache-2.0",
    "matches": [
      {
        "score": 100.0,
        "start_line": 1,
        "end_line": 1,
        "matched_length": 4,
        "match_coverage": 100.0,
        "matcher": "1-hash",
        "license_expression": "apache-2.0",
        "rule_identifier": "apache-2.0_65.RULE",
        "rule_relevance": 100,
        "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/apache-2.0_65.RULE",
        "matched_text": "license: apache 2.0"
      }
    ],
    "detection_log": [],
    "identifier": "apache_2_0-ec759ae0-ea5a-f138-793e-388520e080c0"
  }
],
"license_clues": [],
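To show how a consumer might read the new resource-level structure, here is a small sketch that flattens detections and clues into rows. The summarize_matches helper is hypothetical; the attribute names are the ones shown in the "After" data, which is abbreviated in the sample.

```python
import json

# Abbreviated copy of the "After" resource data.
resource = json.loads("""
{
  "detected_license_expression": "apache-2.0",
  "detected_license_expression_spdx": "Apache-2.0",
  "license_detections": [
    {
      "license_expression": "apache-2.0",
      "matches": [
        {
          "score": 100.0,
          "start_line": 1,
          "end_line": 1,
          "matcher": "1-hash",
          "license_expression": "apache-2.0",
          "rule_identifier": "apache-2.0_65.RULE"
        }
      ],
      "detection_log": [],
      "identifier": "apache_2_0-ec759ae0-ea5a-f138-793e-388520e080c0"
    }
  ],
  "license_clues": []
}
""")


def summarize_matches(resource):
    """Return one (kind, expression, rule, start, end) row per match."""
    rows = []
    for detection in resource.get("license_detections", []):
        for match in detection["matches"]:
            rows.append(("detection", match["license_expression"],
                         match["rule_identifier"],
                         match["start_line"], match["end_line"]))
    # Clues share the match data structure but are not part of any detection.
    for match in resource.get("license_clues", []):
        rows.append(("clue", match["license_expression"],
                     match["rule_identifier"],
                     match["start_line"], match["end_line"]))
    return rows


for row in summarize_matches(resource):
    print(row)
```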

Change in License Data format: Package

License data attributes have also changed in packages:

Before:

{
  "type": "cocoapods",
  "namespace": null,
  "name": "LoadingShimmer",
  "version": "1.0.3",
  "license_expression": "mit AND unknown",
  "declared_license": ":type = MIT, :file = LICENSE",
  "datasource_id": "cocoapods_podspec",
  "purl": "pkg:cocoapods/LoadingShimmer@1.0.3"
}

After:

"declared_license_expression": "mit",
"declared_license_expression_spdx": "MIT",
"license_detections": [
  {
    "license_expression": "mit",
    "matches": [
      {
        "score": 100.0,
        "start_line": 1,
        "end_line": 1,
        "matched_length": 4,
        "match_coverage": 100.0,
        "matcher": "1-hash",
        "license_expression": "mit",
        "rule_identifier": "mit_in_manifest.RULE",
        "rule_relevance": 100,
        "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/mit_in_manifest.RULE",
        "matched_text": ":type = MIT, :file = LICENSE"
      }
    ],
    "identifier": "mit-74f1df5b-f94d-2423-6bb8-3e4d809c26a5"
  }
],
"other_license_expression": null,
"other_license_expression_spdx": null,
"other_license_detections": [],
"extracted_license_statement": ":type = MIT, :file = LICENSE",

Previously, only the license_expression was present in package data, which made license detections very hard to debug. Now there is a license_detections field with the detections, in the same format as the resource license_detections, along with declared_license_expression and other_license_expression and their SPDX counterparts. The declared_license field has also been renamed to extracted_license_statement.
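Tools that must read both old and new scan results could branch on the renamed fields. The package_license_summary helper here is hypothetical and only maps field names; it does not re-run detection, so an old "mit AND unknown" value is returned as stored.

```python
def package_license_summary(package):
    """Read license fields from package data in either the old or new format."""
    if "declared_license_expression" in package:
        return {
            "format": "new",
            "expression": package["declared_license_expression"],
            "statement": package.get("extracted_license_statement"),
            "detections": package.get("license_detections", []),
        }
    return {
        "format": "old",
        "expression": package.get("license_expression"),
        "statement": package.get("declared_license"),
        # Old package data carried no detection details.
        "detections": [],
    }


old_package = {
    "name": "LoadingShimmer",
    "license_expression": "mit AND unknown",
    "declared_license": ":type = MIT, :file = LICENSE",
}
new_package = {
    "name": "LoadingShimmer",
    "declared_license_expression": "mit",
    "extracted_license_statement": ":type = MIT, :file = LICENSE",
    "license_detections": [{"license_expression": "mit", "matches": []}],
}
print(package_license_summary(old_package)["expression"])  # mit AND unknown
print(package_license_summary(new_package)["expression"])  # mit
```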

Codebase level Unique License Detection

We now have a new codebase-level attribute, license_detections, which lists the unique license detections across the codebase, in both packages and resources. The codebase-level and resource-level entries are linked by a common identifier attribute, which contains the license_expression and a UUID generated from the match content. The match-level data is present only at the resource level, where it can be consulted when details are needed.
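The shape of such an identifier can be sketched as follows. This is an illustration only: the detection_identifier function, the choice of SHA-1, and the way match content is serialized are assumptions, so the values produced will not match identifiers generated by ScanCode.

```python
import hashlib
import re
import uuid


def detection_identifier(license_expression, match_contents):
    """Build an identifier from an expression and a content-derived UUID."""
    # Make the expression safe for use in an identifier,
    # e.g. "epl-1.0" becomes "epl_1_0".
    safe = re.sub(r"[^A-Za-z0-9]+", "_", license_expression).strip("_").lower()
    # Assumption: hash the joined match content, then use the first
    # 32 hex digits of the digest as a UUID.
    digest = hashlib.sha1("\n".join(match_contents).encode("utf-8")).hexdigest()
    content_uuid = uuid.UUID(hex=digest[:32])
    return f"{safe}-{content_uuid}"


print(detection_identifier("epl-1.0", ["epl-1.0_3.RULE", "epl-1.0_7.RULE"]))
```

Because the UUID is derived from content, the same expression and matches always give the same identifier, which is what lets identical detections in different files collapse into one codebase-level entry.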

New codebase level attribute:

{
  "license_detections": [
    {
      "identifier": "epl_1_0-583490fb-0b3a-f445-a1b9-1b96423b9ec3",
      "license_expression": "epl-1.0",
      "detection_count": 2,
      "detection_log": []
    }
  ]
}

For the corresponding resource level license detection:

"license_detections": [
  {
    "license_expression": "epl-1.0",
    "matches": [
      {
        "score": 99.34,
        "start_line": 12,
        "end_line": 25,
        "matched_length": 150,
        "match_coverage": 99.34,
        "matcher": "3-seq",
        "license_expression": "epl-1.0",
        "rule_identifier": "epl-1.0_3.RULE",
        "rule_relevance": 100,
        "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/epl-1.0_3.RULE"
      },
      {
        "score": 100.0,
        "start_line": 17,
        "end_line": 17,
        "matched_length": 8,
        "match_coverage": 100.0,
        "matcher": "2-aho",
        "license_expression": "epl-1.0",
        "rule_identifier": "epl-1.0_7.RULE",
        "rule_relevance": 100,
        "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/epl-1.0_7.RULE"
      }
    ],
    "detection_log": [],
    "identifier": "epl_1_0-583490fb-0b3a-f445-a1b9-1b96423b9ec3"
  }
]
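The link between the two levels can be followed with a simple index keyed on identifier. In this sketch the index_detections helper and the file paths are hypothetical, the scan data is abbreviated, and package-level detections are ignored for brevity.

```python
from collections import defaultdict

EPL_ID = "epl_1_0-583490fb-0b3a-f445-a1b9-1b96423b9ec3"

# Abbreviated scan result; the paths are made up for illustration.
scan = {
    "license_detections": [
        {"identifier": EPL_ID, "license_expression": "epl-1.0",
         "detection_count": 2, "detection_log": []},
    ],
    "files": [
        {"path": "src/a.c", "license_detections": [
            {"identifier": EPL_ID, "license_expression": "epl-1.0", "matches": []}]},
        {"path": "src/b.c", "license_detections": [
            {"identifier": EPL_ID, "license_expression": "epl-1.0", "matches": []}]},
        {"path": "README", "license_detections": []},
    ],
}


def index_detections(scan):
    """Map each detection identifier to the resource paths where it occurs."""
    paths_by_identifier = defaultdict(list)
    for resource in scan.get("files", []):
        for detection in resource.get("license_detections", []):
            paths_by_identifier[detection["identifier"]].append(resource["path"])
    return dict(paths_by_identifier)


index = index_detections(scan)
for unique in scan["license_detections"]:
    paths = index.get(unique["identifier"], [])
    print(unique["license_expression"], unique["detection_count"], paths)
```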

LicenseMatch Result Data

Previously, LicenseMatch data was based on a license key instead of a license expression.

So if a gpl-2.0 AND patent-disclaimer license expression was detected from a single LicenseMatch, there were two entries in the licenses list for that resource, one for each license key (here gpl-2.0 and patent-disclaimer). This repeated the match details, since the two entries were identical except for the license key.

We now add only one entry per match (and therefore per rule), and the primary attribute is the license expression rather than the license key.

We also used to nest a mapping inside a mapping in these license details to refer to the license rule (and there were other inconsistencies in how this was reported). We now report a flat mapping, and the rule details are no longer present in the license match; they are available only as optional reference data.

See this before/after comparison to see how the license data in results has evolved.

Before:

"licenses": [
  {
    "key": "gpl-2.0",
    "score": 100.0,
    "name": "GNU General Public License 2.0",
    "short_name": "GPL 2.0",
    "category": "Copyleft",
    "is_exception": false,
    "is_unknown": false,
    "owner": "Free Software Foundation (FSF)",
    "homepage_url": "http://www.gnu.org/licenses/gpl-2.0.html",
    "text_url": "http://www.gnu.org/licenses/gpl-2.0.txt",
    "reference_url": "https://scancode-licensedb.aboutcode.org/gpl-2.0",
    "scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/gpl-2.0.LICENSE",
    "scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/gpl-2.0.yml",
    "spdx_license_key": "GPL-2.0-only",
    "spdx_url": "https://spdx.org/licenses/GPL-2.0-only",
    "start_line": 4,
    "end_line": 30,
    "matched_rule": {
      "identifier": "gpl-2.0_and_patent-disclaimer_3.RULE",
      "license_expression": "gpl-2.0 AND patent-disclaimer",
      "licenses": [
        "gpl-2.0",
        "patent-disclaimer"
      ],
      "referenced_filenames": [],
      "is_license_text": false,
      "is_license_notice": true,
      "is_license_reference": false,
      "is_license_tag": false,
      "is_license_intro": false,
      "has_unknown": false,
      "matcher": "2-aho",
      "rule_length": 185,
      "matched_length": 185,
      "match_coverage": 100.0,
      "rule_relevance": 100
    }
  },
  {
    "key": "patent-disclaimer",
    "score": 100.0,
    "name": "Generic patent disclaimer",
    "short_name": "Generic patent disclaimer",
    "category": "Permissive",
    "is_exception": false,
    "is_unknown": false,
    "owner": "Unspecified",
    "homepage_url": null,
    "text_url": "",
    "reference_url": "https://scancode-licensedb.aboutcode.org/patent-disclaimer",
    "scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/patent-disclaimer.LICENSE",
    "scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/patent-disclaimer.yml",
    "spdx_license_key": "LicenseRef-scancode-patent-disclaimer",
    "spdx_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/patent-disclaimer.LICENSE",
    "start_line": 4,
    "end_line": 30,
    "matched_rule": {
      "identifier": "gpl-2.0_and_patent-disclaimer_3.RULE",
      "license_expression": "gpl-2.0 AND patent-disclaimer",
      "licenses": [
        "gpl-2.0",
        "patent-disclaimer"
      ],
      "referenced_filenames": [],
      "is_license_text": false,
      "is_license_notice": true,
      "is_license_reference": false,
      "is_license_tag": false,
      "is_license_intro": false,
      "has_unknown": false,
      "matcher": "2-aho",
      "rule_length": 185,
      "matched_length": 185,
      "match_coverage": 100.0,
      "rule_relevance": 100
    }
  }
],
"license_expressions": [
  "gpl-2.0 AND patent-disclaimer"
],

After:

"license_detections": [
  {
    "license_expression": "gpl-2.0 AND patent-disclaimer",
    "matches": [
      {
        "score": 100.0,
        "start_line": 4,
        "end_line": 30,
        "matched_length": 185,
        "match_coverage": 100.0,
        "matcher": "2-aho",
        "license_expression": "gpl-2.0 AND patent-disclaimer",
        "rule_identifier": "gpl-2.0_and_patent-disclaimer_3.RULE",
        "rule_relevance": 100,
        "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gpl-2.0_and_patent-disclaimer_3.RULE"
      }
    ],
    "identifier": "gpl_2_0_and_patent_disclaimer-3bb2602f-86f5-b9da-9bf5-b52e6920c8d1"
  }
],
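The move from one entry per license key to one entry per match can be sketched as a conversion over old-format data. The flatten_old_licenses helper is hypothetical, and grouping entries by rule identifier and line range is an assumption about what identifies a single match; the old-format sample is abbreviated.

```python
def flatten_old_licenses(licenses):
    """Collapse old per-license-key entries into one flat entry per match."""
    flat = {}
    for entry in licenses:
        rule = entry["matched_rule"]
        # Assumption: entries sharing a rule and line range are one match.
        key = (rule["identifier"], entry["start_line"], entry["end_line"])
        if key in flat:
            continue
        flat[key] = {
            "score": entry["score"],
            "start_line": entry["start_line"],
            "end_line": entry["end_line"],
            "matched_length": rule["matched_length"],
            "match_coverage": rule["match_coverage"],
            "matcher": rule["matcher"],
            "license_expression": rule["license_expression"],
            "rule_identifier": rule["identifier"],
            "rule_relevance": rule["rule_relevance"],
        }
    return list(flat.values())


rule = {
    "identifier": "gpl-2.0_and_patent-disclaimer_3.RULE",
    "license_expression": "gpl-2.0 AND patent-disclaimer",
    "matcher": "2-aho",
    "matched_length": 185,
    "match_coverage": 100.0,
    "rule_relevance": 100,
}
old_licenses = [
    {"key": "gpl-2.0", "score": 100.0, "start_line": 4, "end_line": 30,
     "matched_rule": rule},
    {"key": "patent-disclaimer", "score": 100.0, "start_line": 4, "end_line": 30,
     "matched_rule": rule},
]
new_matches = flatten_old_licenses(old_licenses)
print(len(new_matches), new_matches[0]["license_expression"])
```

The two old entries collapse into a single flat match keyed on the license expression, mirroring the "After" data.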