Contributing to Code Development¶
Contributions come as bug reports, questions and issues, and as pull requests.
Source code and runtime data are in the /src/ directory.
Test code and test data are in the /tests/ directory.
Datasets (including licenses) and test data are in /data/ sub-directories.
We use DCO signoff in commit messages, like Linux does.
Porting ScanCode to other operating systems is possible (FreeBSD is already supported). Enter an issue for help.
See CONTRIBUTING.rst for details.
Code layout and conventions¶
Source code is in the
src/ directory, tests are in the
Miscellaneous scripts and configuration files are in the
There is one Python package for each major feature under src/ and a corresponding directory with the same name under tests. By design, this directory is not a package: it would not make sense to have a top-level “tests” package, since that name is too common.
Each test script is named test_XXXX. We prefer organizing tests in subclasses of TestCase from the standard library unittest module, but we also use plain functions
that are discovered nicely by
When source or tests need data files, we store these in a
This is used extensively in tests and also in source code for the reference
license texts and data and license detection rules files.
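The two test styles mentioned above can be sketched as follows. This is a minimal illustration only: normalize_name and the test names are hypothetical and do not come from the ScanCode code base.

```python
import unittest


def normalize_name(name):
    """Hypothetical function under test: strip whitespace and lowercase a name."""
    return name.strip().lower()


class TestNormalizeName(unittest.TestCase):
    # Preferred style: tests grouped in a unittest.TestCase subclass.
    def test_normalize_name_strips_and_lowercases(self):
        self.assertEqual(normalize_name("  MIT "), "mit")


def test_normalize_name_keeps_inner_spaces():
    # Plain function style: a module-level function whose name starts with test_.
    assert normalize_name("Apache 2.0") == "apache 2.0"
```

Both styles can live in the same test_XXXX script; pytest collects TestCase subclasses as well as plain test_ functions.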
We use PEP8 conventions with a relaxed line length that can be up to 90’ish characters long when needed to keep the code clear and readable.
We write tests, a lot of tests, thousands of tests. When finding bugs or adding new features, we add tests. See the existing test code for examples, which also form a good specification for the supported features.
The tests should pass on Linux 64 bits, Windows 64 bits and on macOS 10.14 and up. We maintain multiple CI loops with Azure (all OSes) at https://dev.azure.com/nexB/scancode-toolkit/_build and Appveyor (Windows) at https://ci.appveyor.com/project/nexB/scancode-toolkit .
Several tests are data-driven and use data files as test input and sometimes data files as test expectation (in this case using either JSON or YAML files); a large number of copyright, license and package manifest parsing tests are such data-driven tests.
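A data-driven test of this kind could look roughly like the sketch below. It assumes a hypothetical parse_manifest function and a made-up naming scheme where each input file has a matching .expected.json file; the actual ScanCode test helpers and file names may differ.

```python
import json
import tempfile
from pathlib import Path


def parse_manifest(text):
    """Hypothetical parser under test: turn 'key: value' lines into a dict."""
    pairs = (line.split(":", 1) for line in text.splitlines() if ":" in line)
    return {key.strip(): value.strip() for key, value in pairs}


def collect_cases(data_dir):
    """Pair each input file with its expectation, e.g. a.txt with a.txt.expected.json."""
    for input_file in sorted(Path(data_dir).glob("*.txt")):
        expected_file = input_file.with_name(input_file.name + ".expected.json")
        if expected_file.exists():
            yield input_file, expected_file


def check_case(input_file, expected_file):
    """Parse the input file and compare the result with the stored expectation."""
    result = parse_manifest(input_file.read_text())
    expected = json.loads(expected_file.read_text())
    assert result == expected, f"Unexpected result for {input_file.name}"


# Usage: build a tiny data directory and run every case found in it.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "demo.txt").write_text("name: demo\nlicense: mit\n")
    (data_dir / "demo.txt.expected.json").write_text('{"name": "demo", "license": "mit"}')
    cases = list(collect_cases(data_dir))
    for input_file, expected_file in cases:
        check_case(input_file, expected_file)
```

Adding a new case then only means dropping a new pair of files in the data directory, with no new test code.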
ScanCode comes with over 29,000 unit tests to ensure detection accuracy and stability across Linux, Windows and macOS: we kinda love tests, don't we?
We use pytest to run the tests: call the
pytest script to run the whole
test suite. This is installed with the
pytest package which is installed
when you run
If you are running from a fresh git clone and you run
./configure and then
source venv/bin/activate the
pytest command will be available in your path.
Alternatively, if you have already configured but are not in an activated virtual environment, the
pytest command is available under
<root of your checkout>/venv/bin/pytest
(Note: paths here are for POSIX, but mostly the same applies to Windows)
If you have a multiprocessor machine you might want to run the tests in parallel
(and faster). For instance:
pytest -n4 runs the tests on 4 CPUs. We
typically run the tests in verbose mode with
pytest -vvs -n4.
You can also run a subset of the test suite, as shown in the CI configs. For instance,
pytest -n 2 -vvs tests/scancode runs only the test scripts present in the
tests/scancode directory. (You can also give the path to a specific test script
file there.)
See also https://docs.pytest.org for details or use the
pytest -h command
to show the many other options available.
One useful option is to run a select subset of the test functions matching a
pattern with the
-k option, for instance:
pytest -vvs -k tcpdump would
only run test functions that contain the string “tcpdump” in their name or their
class name or module name.
Another useful option after a test run with some failures is to re-run only the
failed tests with the
--lf option, for instance:
pytest -vvs --lf would
only run the test functions that failed in the previous run.
Because we have a lot of tests (over 29,000), we organized these in test suites
using pytest markers that are defined in the
conftest.py pytest plugin.
These are enabled by adding a
--test-suite option to the pytest command.
--test-suite=standard is the default and runs a decent but basic test suite
standard test and adds a comprehensive test suite
all test and adds extensive data-driven and data validations (for package, copyright and license detection)
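A marker-based suite selection of this kind could be implemented in a conftest.py along the lines of the sketch below. This is not the actual ScanCode conftest.py: the suite names "standard" and "all" appear in the text above, while "validate" and the marker names "slow" and "validation" are hypothetical placeholders.

```python
# Suites ordered from smallest to largest; each suite includes the ones before it.
# "validate" is a hypothetical name for the largest suite.
SUITES = ("standard", "all", "validate")

# Hypothetical mapping: marker name -> smallest suite that runs tests with that marker.
MARKER_TO_SUITE = {"slow": "all", "validation": "validate"}


def required_suite(marker_names):
    """Return the smallest suite that includes a test carrying these markers."""
    levels = [SUITES.index(MARKER_TO_SUITE[m]) for m in marker_names if m in MARKER_TO_SUITE]
    return SUITES[max(levels, default=0)]


def is_selected(marker_names, selected_suite):
    """Return True if a test with these markers runs in the selected suite."""
    return SUITES.index(required_suite(marker_names)) <= SUITES.index(selected_suite)


def pytest_addoption(parser):
    """pytest hook: register the --test-suite command line option."""
    parser.addoption(
        "--test-suite",
        action="store",
        default="standard",
        choices=SUITES,
        help="Select which test suite to run.",
    )


def pytest_collection_modifyitems(config, items):
    """pytest hook: deselect collected tests that are not part of the selected suite."""
    suite = config.getoption("--test-suite")
    selected, deselected = [], []
    for item in items:
        names = [mark.name for mark in item.iter_markers()]
        (selected if is_selected(names, suite) else deselected).append(item)
    if deselected:
        config.hook.pytest_deselected(items=deselected)
        items[:] = selected
```

With this layout, unmarked tests always run, and a test is tagged once with a marker instead of being listed in a separate suite file.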
In some cases we need to regenerate test data when the expected behaviour or result data structures change, and we have an environment variable for this. SCANCODE_REGEN_TEST_FIXTURES is present in scancode_config and can be set to regenerate test data for specific tests like this:
SCANCODE_REGEN_TEST_FIXTURES=yes pytest -vvs tests/packagedcode/test_package_models.py
This command will regenerate test data only for the tests in test_package_models.py, and we can further narrow down the tests to regenerate by using more pytest options like --lf and -k test_instances.
If test data is regenerated, it is important to review the diff for test files and carefully go through all of it to make sure there are no unintended changes there, and then commit all the regenerated test data.
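The regeneration mechanism can be pictured with the following sketch. Only the SCANCODE_REGEN_TEST_FIXTURES variable name comes from the text above; the helper name check_json_expectation and the rule that any non-empty value turns regeneration on are assumptions made for illustration.

```python
import json
import os
from pathlib import Path

# Assumption: any non-empty value of the variable turns regeneration on.
REGEN = bool(os.environ.get("SCANCODE_REGEN_TEST_FIXTURES"))


def check_json_expectation(result, expected_file, regen=REGEN):
    """Compare result with the JSON stored in expected_file.

    When regen is true, first overwrite expected_file with the current result,
    so the comparison that follows passes by construction.
    """
    expected_file = Path(expected_file)
    if regen:
        expected_file.write_text(json.dumps(result, indent=2, sort_keys=True) + "\n")
    expected = json.loads(expected_file.read_text())
    assert result == expected
```

Because a regenerating run always passes, the review of the resulting diff is the only safeguard against committing a wrong expectation.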
To help debug in scancode, we use logging. There are different environment variables you need to set to turn on logging. In packagedcode:
SCANCODE_DEBUG_PACKAGE=yes pytest -vvs tests/packagedcode/ --lf
Or set the TRACE variable to True. This enables logging of variables and shows code execution paths by printing the logs in the terminal. When debugging full scans run through click, you also have to raise exceptions, in addition to setting TRACE, to enable logging.
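A module-level tracing switch of this kind could be written as in the sketch below. The SCANCODE_DEBUG_PACKAGE and TRACE names come from the text above; the logger_debug helper, the parse function, and the rule that any non-empty value enables tracing are assumptions for illustration.

```python
import logging
import os
import sys

# Assumption: any non-empty value of the variable turns tracing on.
TRACE = bool(os.environ.get("SCANCODE_DEBUG_PACKAGE"))


def logger_debug(*args):
    """Do nothing by default, so trace calls are cheap when tracing is off."""


if TRACE:
    logger = logging.getLogger(__name__)
    logging.basicConfig(stream=sys.stdout)
    logger.setLevel(logging.DEBUG)

    def logger_debug(*args):
        """Log all arguments on one line, using repr() for non-string values."""
        logger.debug(" ".join(a if isinstance(a, str) else repr(a) for a in args))


def parse(location):
    """Hypothetical function instrumented with a trace call."""
    if TRACE:
        logger_debug("parse: location:", location)
    return location.upper()
```

Guarding each call with "if TRACE:" also avoids building the log arguments at all when tracing is disabled.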
Thirdparty libraries and dependencies management¶
ScanCode uses the configure and configure.bat scripts to install a virtualenv and install the required packaged dependencies using setuptools, so that ScanCode can be installed in a repeatable and consistent manner on all OSes and Python versions.
For this we maintain a
setup.cfg with our direct dependencies with loose
minimum version constraints; and we keep pinned exact versions of these
dependencies in the
testing and development).
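The relationship between loose minimum constraints and exact pins can be checked with a small script such as the sketch below. The package names and version numbers are purely illustrative, and the version parsing only handles plain dotted numbers; a real check would use a proper version parser.

```python
def parse_version(text):
    """Parse a plain dotted version such as '8.0.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def check_pins(minimums, pins):
    """Return the names whose pin is missing or lower than the declared minimum."""
    problems = []
    for name, minimum in minimums.items():
        pinned = pins.get(name)
        if pinned is None or parse_version(pinned) < parse_version(minimum):
            problems.append(name)
    return problems


# Illustrative data only: loose minimums as declared in setup.cfg ...
minimums = {"click": "6.7", "attrs": "18.1"}
# ... and exact versions as kept in the pinned requirements.
pins = {"click": "8.0.3", "attrs": "17.4.0"}

print(check_pins(minimums, pins))
```

Here the second pin is flagged because it is older than the minimum that the loose constraint allows.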
Note: we also have a
setup-mini.cfg used to create a ScanCode PyPI package
with minimal dependencies (and limited features). This is mostly duplicated
To ensure that we all use well-known versions of the core virtualenv,
pip, setuptools and wheel libraries, we use the
zipapp from https://github.com/pypa/get-virtualenv/tree/main/public and
store it in the Git repo in the
We bundle pre-built native binaries as plugins, which are installed as wheels. These binaries are organized by OS and architecture; they ensure that ScanCode works out of the box, either from a checkout or a download, without needing a compiler and toolchain to be installed.
The corresponding source code and build scripts for all of the pre-built binaries are stored in a separate repository at https://github.com/nexB/scancode-plugins
ScanCode app archives should not require network access for installation or
configuration of its third-party libraries and dependencies. To enable this,
we store bundled thirdparty components and libraries in the
directory of released app archives; this is done at build time.
These dependencies are stored as pre-built wheels. These wheels are sometimes
built by us when there is no wheel available upstream on PyPI. We store all
these prebuilt wheels with corresponding .ABOUT and .LICENSE files in
https://github.com/nexB/thirdparty-packages/tree/main/pypi which is published
for download at https://thirdparty.aboutcode.org/pypi/
Because this is used by the configure script, all the thirdparty dependencies used in ScanCode MUST be available there first. Therefore adding a new dependency means requesting a merge/PR in https://github.com/nexB/thirdparty-packages/ first that contains all the recursive dependencies.
There are utility scripts in
etc/release that can help with the dependency
management process, in particular to build or update wheels with native code for
multiple OSes (Linux, macOS and Windows) and multiple Python versions (3.7+).
This is not a completely simple operation: it ultimately requires 12 wheels
and one source distribution to be published, as we support 3 OSes and 4 Python versions.
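The count of 12 wheels follows from the build matrix, which can be spelled out as below. The four Python version numbers are an assumption for illustration; the text only states "3.7+" and four versions.

```python
from itertools import product

OSES = ("linux", "macos", "windows")
# Assumption: four illustrative Python versions.
PYTHON_VERSIONS = ("3.7", "3.8", "3.9", "3.10")


def build_targets(oses=OSES, python_versions=PYTHON_VERSIONS):
    """Return one (os, python_version) pair for each wheel to build."""
    return list(product(oses, python_versions))


targets = build_targets()
artifacts = len(targets) + 1  # the wheels plus one source distribution
print(len(targets), "wheels + 1 sdist =", artifacts, "artifacts")
```

Each additional supported Python version adds three more wheels per package with native code, one per OS.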
Using ScanCode as a Python library¶
ScanCode can also be used as a Python library. It is available as a
Python wheel on PyPI and can be installed with
pip install scancode-toolkit or
pip install scancode-toolkit-mini.
We do not pin dependencies, in order to avoid dependency resolution conflicts for downstream users. As a result, issues can arise when a dependency silently changes an API or function that ScanCode uses.
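When investigating such an issue, it can help to record which versions are actually installed in the environment. The sketch below uses only the standard library importlib.metadata module (Python 3.8+); the installed_versions helper is a hypothetical name and is not part of ScanCode.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Usage: report the versions of ScanCode and one of its dependencies.
print(installed_versions(["scancode-toolkit", "click"]))
```

Including this output in a bug report makes it easier to tell whether a failure comes from ScanCode itself or from a changed dependency.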