Cybersecurity Archives - Software Heritage

Why we need better software identification

Nicole Martinelli — Thu, 31 Jul 2025 14:38:00 +0000

With cybersecurity breaches and new regulations making headlines, software supply chain security is now top of mind for many people. New laws like the European Union’s Cyber Resilience Act (CRA) and recent United States Executive Orders are pushing for more transparency in digital goods.

All this attention means we need a solid, trustworthy way to identify software. Here’s the problem: how we currently name software and point to it in repositories often falls short. These ways can be temporary, vague, or just not secure enough. That leads to messy situations like confusion, name clashes, and outdated links. These aren’t just minor annoyances; they’re open doors for attacks, like “dependency confusion,” where bad actors trick systems into using malicious code. Plus, software bits can just disappear or move, making it impossible to check them later.

Clearly, we need a permanent fix that guarantees we can always find and verify software. This post outlines key information from the preprint paper “Software Identification for Cybersecurity: Survey and Recommendations for Regulators,” authored by Olivier Barais, Roberto Di Cosmo, Ludovic Mé, Stefano Zacchiroli, and Olivier Zendra with support from the SWHSec project.

Existing ID approaches: The good and the bad

There are two main types of software identification:

External IDs: These rely on outside info, like product names, version numbers, or links to package managers.
- Pros: They’re usually easy for humans to read and work with existing lists like the National Vulnerability Database. Some examples: the SWID, Package URL (purl), and SPDXID.
- Cons: Their reliability depends on external lists or naming rules, which can change or even be reused. That causes conflicts and makes them unreliable for security checks.
Internal IDs: These come directly from the software’s actual content, usually using a cryptographic hash (like a digital fingerprint).
- Pros: They offer uniqueness and integrity without relying on a central authority. They’re great for spotting if something’s been tampered with, don’t rely much on outside dependencies, and are difficult to fake with good hashing. Simple SHA256 checksums and Software Hash IDentifiers (SWHIDs) are examples.
- Cons: They’re often not very human-readable, which can make searching or brand recognition tricky.

In the real world, effective software bills of materials (SBOMs) and supply-chain tools generally combine both external references (which help connect with existing databases, vulnerability feeds, or licensing tools) and internal references (for strong integrity checks and guaranteed uniqueness). This means the smart approach is often to publish both—say, a purl or SWID alongside a cryptographic hash or SWHID. That way, you ensure both discoverability and verifiability.
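One way to picture publishing both kinds of identifier is a small record that carries an external name next to a content hash. The sketch below is illustrative only: the `describe_artifact` and `matches` helpers, the field names, and the `pkg:pypi/example-package@1.0.0` purl are all hypothetical, and the record is not a real SBOM schema.

```python
import hashlib
import json


def describe_artifact(purl: str, data: bytes) -> dict:
    """Pair an external identifier (a purl) with an internal one (a SHA-256 digest)."""
    return {
        "purl": purl,  # external: readable, maps onto existing databases
        "sha256": hashlib.sha256(data).hexdigest(),  # internal: derived from the bytes
    }


def matches(record: dict, data: bytes) -> bool:
    """Check that the bytes in hand are the ones the record describes."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]


# Hypothetical package name and contents, for illustration only.
record = describe_artifact("pkg:pypi/example-package@1.0.0", b"print('hello')\n")
print(json.dumps(record, indent=2))
print(matches(record, b"print('hello')\n"))  # True: same bytes
print(matches(record, b"print('hullo')\n"))  # False: contents differ
```

The purl is what a person or a vulnerability feed would search for; the digest is what a tool would check before trusting the download.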


Inside the SWHID

SWHIDs are based on content, they’re permanent, and they can’t be tampered with easily. In 2025, they became an international standard (ISO/IEC 18670), making them globally recognized.

SWHIDs essentially package up both the data and its context using a clever Merkle DAG structure. This means each ID is directly tied to the exact piece of software it refers to.
They follow a simple pattern:
swh:<schema_version>:<object_type>:<object_id>

Key types include:

Content (cnt): Identifies a single file based on its raw contents:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
Directory (dir): Points to a directory’s layout and what’s inside it, including IDs of its contents:
swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505
Revision (rev): Like a “commit” in version control, holding details like who did it, when, and the message:
swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d
Release (rel): Similar to a “tag,” pointing to a specific revision and maybe including a version name or signature:
swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f
Snapshot (snp): Captures everything in a whole version control system (all branches) at one specific moment:
swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453
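To make the content-based idea concrete, here is a minimal sketch of how the two simplest identifiers can be derived. It assumes, as we understand the version 1 scheme, that content and directory IDs are computed the same way Git hashes blobs and trees (SHA-1 over a short header plus the object body). The function names are our own, and the directory helper is deliberately simplified to a flat folder of regular files.

```python
import hashlib


def _git_object_id(kind: bytes, body: bytes) -> bytes:
    """SHA-1 over a Git-style header (kind, space, byte length, NUL) plus the body."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).digest()


def content_swhid(data: bytes) -> str:
    """Identifier for a single file, derived only from its raw bytes."""
    return "swh:1:cnt:" + _git_object_id(b"blob", data).hex()


def flat_directory_swhid(files: dict[str, bytes]) -> str:
    """Identifier for a folder that holds only regular, non-executable files.

    Simplification: real directories also contain subdirectories, symlinks and
    executables, which need other mode values and extra sorting rules.
    """
    body = b""
    for name, data in sorted((n.encode("utf-8"), d) for n, d in files.items()):
        # Each entry embeds the ID of the file it points to (the Merkle link).
        body += b"100644 " + name + b"\0" + _git_object_id(b"blob", data)
    return "swh:1:dir:" + _git_object_id(b"tree", body).hex()


project = {"README": b"hello world\n", "LICENSE": b"example license text\n"}
print(content_swhid(project["README"]))
print(flat_directory_swhid(project))

# Changing one byte in one file changes that file's ID and the directory's ID.
project["README"] = b"hello world!\n"
print(flat_directory_swhid(project))
```

Because every directory entry carries the ID of what it points to, a change anywhere below a node ripples up into that node's own identifier.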

SWHIDs also allow for optional qualifiers to add more context. You can specify:

Lines qualifier (lines=…): To point to specific lines in a file:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=112-116
Origin qualifier (origin=…): To say where the software was first seen:
swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d;origin=https://github.com/example/repo
Path, anchor, and context qualifiers: These help pinpoint subdirectories, specific parts, or other key info for super-precise references:
swh:1:dir:d198bc9d…;path=/docs;anchor=readme-section
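The pattern and qualifiers above are regular enough to split apart with a few lines of code. The following is a rough sketch of a parser, not a full validator: the `Swhid` class and `parse_swhid` function are hypothetical names, it only handles schema version 1, and it keeps qualifier values as plain strings without checking them.

```python
from dataclasses import dataclass, field

OBJECT_TYPES = {"cnt", "dir", "rev", "rel", "snp"}
HEX_DIGITS = set("0123456789abcdef")


@dataclass
class Swhid:
    version: int
    object_type: str
    object_id: str
    qualifiers: dict = field(default_factory=dict)


def parse_swhid(text: str) -> Swhid:
    """Split a SWHID into its core fields and its optional qualifiers."""
    # Qualifiers follow the core identifier and are separated by semicolons.
    core, *raw_qualifiers = text.strip().split(";")
    parts = core.split(":")
    if len(parts) != 4 or parts[0] != "swh":
        raise ValueError(f"not a SWHID: {text!r}")
    _, version, object_type, object_id = parts
    if version != "1":
        raise ValueError(f"unsupported schema version: {version!r}")
    if object_type not in OBJECT_TYPES:
        raise ValueError(f"unknown object type: {object_type!r}")
    if len(object_id) != 40 or not set(object_id) <= HEX_DIGITS:
        raise ValueError(f"object id must be 40 lowercase hex digits: {object_id!r}")

    qualifiers = {}
    for item in raw_qualifiers:
        key, separator, value = item.partition("=")
        if not key or not separator:
            raise ValueError(f"malformed qualifier: {item!r}")
        qualifiers[key] = value
    return Swhid(int(version), object_type, object_id, qualifiers)


parsed = parse_swhid(
    "swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d;"
    "origin=https://github.com/example/repo"
)
print(parsed.object_type, parsed.qualifiers)
```

Splitting on the semicolon first matters: the origin URL contains colons of its own, so the core identifier has to be isolated before it is broken into fields.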

This way, SWHIDs combine the best of both internal and external identification methods into one stable system.

SWHIDs + The Software Heritage Archive

SWHIDs get even more robust when you link them with the Software Heritage Archive. Software Heritage is a non-profit project that saves publicly available source code and its entire history, and once code is in, it’s never deleted. It’s the biggest public archive of source code, holding 400 million projects, over 25 billion unique source code files, and more than five billion unique commits. The archive stores everything in a cryptographically secure way, which saves space by not duplicating things and makes sure everything is truly what it claims to be.

The combination of SWHIDs and the Software Heritage archive offers real advantages for meeting today’s legal requirements:

Guaranteed integrity: If the code changes even a little, the SWHID changes. This makes tampering immediately detectable.
Always there: SWHIDs don’t rely on outside services or websites, so they stay valid no matter where the code is hosted or if the original platform goes down. This solves the problem of code just vanishing.
Trackable history: SWHIDs identify parts of the Software Heritage structure, letting you trace a project’s development history, see where code came from, and check how different parts are related. Those extra qualifiers let you even track tiny code snippets.
Plays nice with rules: This combined approach directly helps meet the strict requirements for Software Bills of Materials (SBOMs), open-source security, and vulnerability management that the CRA and US Executive Orders demand.
Works everywhere: SWHIDs work consistently across all sorts of version control systems and software ecosystems.
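The integrity point can be checked by hand. Below is a small sketch of verifying a file on disk against a published content SWHID, again assuming the identifier is a Git-style blob hash. `file_content_swhid` and `verify_file` are hypothetical helper names, and the temporary file stands in for a real download.

```python
import hashlib
import os
import tempfile


def file_content_swhid(path: str) -> str:
    """Compute a content SWHID for a file, reading it in chunks."""
    size = os.path.getsize(path)
    sha1 = hashlib.sha1(b"blob " + str(size).encode("ascii") + b"\0")
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            sha1.update(chunk)
    return "swh:1:cnt:" + sha1.hexdigest()


def verify_file(path: str, expected_swhid: str) -> bool:
    """Compare a file against the core part of a SWHID, ignoring qualifiers."""
    core = expected_swhid.split(";", 1)[0]
    return file_content_swhid(path) == core


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, "wb") as handle:
        handle.write(b"hello world\n")
    published = file_content_swhid(path)  # stands in for an ID from a release note

    print(verify_file(path, published))  # True: file is untouched

    with open(path, "ab") as handle:
        handle.write(b"!")  # simulate tampering with a single extra byte
    print(verify_file(path, published))  # False: the ID no longer matches
```

Nothing in the check depends on where the file came from, which is why the same identifier stays usable if the code moves to another host.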

The authors recommend SWHIDs, paired with the Software Heritage Archive, as the standard approach for referencing software, especially concerning the CRA and relevant US Executive Orders.
Here are some specifics for stakeholders:

Policy makers: Should mention SWHIDs (ISO/IEC 18670) in their rules and encourage their use in government purchases and funding programs.
Software companies: Should start making SWHIDs a part of their development process (CI/CD pipelines) to get stable IDs for their releases and patches.
Open source communities: Should publish official releases with their SWHIDs, ensure their code and history are archived by Software Heritage, and adopt best practices for referencing any outside software they use via SWHIDs in docs and SBOMs.

To wrap it up, using content-based, permanent software identifiers—specifically SWHIDs linked with the Software Heritage archive—is a strong and reliable answer to today’s cybersecurity and regulatory challenges. This approach builds trust and transparency, keeps us aligned with regulations, and even helps with innovation and saving money by simplifying compliance checks and cutting down on supply chain risks.

For more details and recommendations for implementation, check out the paper (preprint).

See our Publications section for more research from the Software Heritage Archive.

The post Why we need better software identification appeared first on Software Heritage.

Takeaways from the Software Heritage Symposium 2025

Nicole Martinelli — Wed, 26 Feb 2025 10:36:00 +0000

PARIS — Software is more than just code; it’s culture, science, and the engine of our economy. Our sixth annual Symposium & Summit at UNESCO Headquarters in Paris brought together thought leaders from science, technology, and policy to explore the future of software and its impact on society, tackling some of the complex issues surrounding code’s pervasive influence.

In the morning, members, sponsors and partners of Software Heritage held their annual Summit, a private meeting where key aspects of the evolution of Software Heritage are discussed in depth.

After lunch it was time for the Software Heritage Symposium. Co-organized by Software Heritage, the event saw UNESCO play host to a crucial conversation about software’s impact on everything from artificial intelligence to our cultural memory. With a packed lineup of panels, keynotes, and technical talks, the symposium reinforced the critical role of open, transparent, and secure digital infrastructures in today’s rapidly evolving technological landscape.

Why it matters

Software is the foundation of our digital world. From scientific research to AI models, from critical infrastructure to everyday applications, software plays a vital role in shaping our present and future. However, its preservation, security, and transparency remain major challenges.

In the opening address, Tawfik Jelassi (Assistant Director-General for Communication and Information, UNESCO) and Bruno Sportisse (CEO, Inria) joined Software Heritage co-founder Roberto Di Cosmo in expressing their strong and continued support for our crucial infrastructure for ensuring long-term accessibility, integrity, and trust in digital knowledge, at a time when AI adoption is accelerating, cybersecurity threats are evolving, and scientific reproducibility is under scrutiny.

“Nothing functions today in our daily life without software, being embedded or integrated in. This makes it essential to pass the knowledge and skills to develop, maintain, and preserve software on to future generations,” — Tawfik Jelassi

Watch the full 41-minute opening remarks on YouTube.
View the slides.

Key themes and discussions

The symposium focused on four essential topics, each shaping the future of digital preservation and innovation.

Cybersecurity and software supply chain resilience

The increasing regulation of open-source software ecosystems—exemplified by the EU’s Cyber Resilience Act (CRA)—was a major point of discussion, addressed by Carolina Lavatelli (CTO & Founder, Internet of Trust), Mike Milinkovich (Executive Director, Eclipse Foundation) and Olivier Zendra (Tenured Researcher, Inria, HiPEAC), in a panel moderated by Simon Phipps (Director, AlmaLinux OS Foundation).

Left to right: Phipps, Milinkovich, Lavatelli and Zendra.
© Inria / Photo B. Fourrier

“We’re in for an unprecedented change – and I’m not talking about AI….What we’re going to have here is what happens when irresistible force meets irremovable objects…and that will be reflected in how things happen. Otherwise, you’re not going to be able to sell products in the European Union, which will be a big problem for anybody trying to make a commercial product on the planet…” said Mike Milinkovich.

Panelists highlighted how Software Heritage’s open infrastructure can address many of the issues discussed, providing software traceability, security, and compliance without stifling innovation.

Watch the session on YouTube.

SWHSec: How Software Heritage can improve cybersecurity

Next up came Stefano Zacchiroli (Chief Scientific Officer and Software Heritage co-founder) with a practical example about the Software Heritage Security Initiative (SWHSec).

With 96% of products now based on open-source software, he notes, this widespread adoption has drawn increased scrutiny and targeting from cyber attackers. Attackers go after ‘leaf packages’: dependencies in your project’s dependency tree that you don’t use directly, Zacchiroli explains. These indirect dependencies are often several layers removed from your own code and therefore far from your immediate attention. “They often target these under-maintained packages, perhaps maintained by volunteers without corporate backing, and try to inject malicious code.” The usual targets are packages maintained ‘by a few random volunteers.’
These attacks can have significant financial consequences, he notes, pointing to examples that resulted in multiple billions of dollars of damage.
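To see how far from view such packages can sit, here is a toy sketch that walks a dependency graph and lists everything a project pulls in without declaring it. The package names and the graph are invented, and `dependency_depths` and `indirect_dependencies` are hypothetical helpers, not part of any SWHSec tooling.

```python
from collections import deque


def dependency_depths(graph: dict, root: str) -> dict:
    """Breadth-first walk: shortest distance from the root to each dependency."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        package = queue.popleft()
        for dependency in graph.get(package, []):
            if dependency not in depths:
                depths[dependency] = depths[package] + 1
                queue.append(dependency)
    del depths[root]
    return depths


def indirect_dependencies(graph: dict, root: str) -> dict:
    """Dependencies the project never declares itself (depth 2 or more)."""
    return {pkg: d for pkg, d in dependency_depths(graph, root).items() if d >= 2}


# Invented example graph: each key lists the packages it depends on directly.
graph = {
    "app": ["web-framework", "http-client"],
    "web-framework": ["template-engine", "http-client"],
    "template-engine": ["string-utils"],
    "http-client": ["url-parser"],
    "url-parser": ["string-utils"],
}

for package, depth in sorted(indirect_dependencies(graph, "app").items()):
    print(f"{package}: {depth} levels away from app")
```

In this made-up graph the project declares two dependencies but ends up running code from five packages, and three of them never appear in its own manifest.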

Where does Software Heritage come in? By providing a universal, open knowledge base of facts about open-source software, which can have important applications in cybersecurity.

“It’s important to say here that it’s also data we made available openly as Software Heritage so that announcing security is not something that only big companies can do. Anyone—a researcher, a startup, anyone—can do this and help others in securing their software.”

Watch the 10-minute overview on YouTube.
View slides.

AI transparency and open models

As AI systems become increasingly central to decision-making, transparency and accountability are paramount. Aurélie Simard (Executive Director, Paris Center of Expertise for International Cooperation on AI) moderated a panel with Gaspard Demur (Deputy Head of Unit, EU AI Office), Agata Ferretti (AI Alliance Europe, IBM), Stefano Maffulli (Executive Director, Open Source Initiative), Fabio Porto (Senior Researcher, Laboratório Nacional de Computação Científica, Brazil), Nayat Sanchez Pi (Director, Inria Chile and French Chilean binational research center on AI), and Abhishek Singh (Additional Secretary, Ministry of Electronics and Information Technology, India). Together they explored how open AI models and datasets, built on preserved, accessible code, can drive responsible AI development, and pinpointed the issues that need to be overcome to get there.

“AI is something too big and too important for us as humans to be left in the hands of the very few,” Agata Ferretti

Software Heritage is instrumental in addressing some of these issues. Roberto Di Cosmo gave an overview of CodeCommons, a major initiative funded by the French government’s France 2030 program through the BPI to build high-quality, transparent datasets for responsible AI training in collaboration with Inria, CEA, and many other academic partners. The initiative launched the day before the event; read more about the projects its teams are already working on.

Watch the AI panel on YouTube.

Open science and reproducibility

Open science: it’s the best-kept secret in research, even though it shouldn’t be. While some scientists cling to the shadows, a recent panel illuminated how open data and collaboration aren’t just good practice—they’re the bedrock of trust between researchers, especially when a pandemic hits. From speeding up vaccine development to navigating the murky world of software in research—the “black, grey, and white” areas where reproducibility gets tricky—the conversation underscored how initiatives like Software Heritage are building the infrastructure for a more trustworthy, and ultimately more ethical, scientific future.

The panel, moderated by Morane Gruenpeter (Head of Open Science, Software Heritage), addressed a broad spectrum of issues via the viewpoints of Kazutsuna Yamaji (Director, RCOS, NII) on the Japanese policy on Open Science, Micha Moskovic (Product Manager, CERN) on CERN’s Open Source Program Office, Sarah Cohen Boulakia (Deputy Director DATAIA, Université Paris Saclay) and Lorena Barba (Director, George Washington University OSPO) on computational reproducibility, and Nicolas Fressengeas (French Ministry of Higher Education and Research) on research software monitoring.

“Open source is not just about licenses, it’s a development model, a way of collaborating. And so open source is an opportunity because it enables these rich networks of connections between the artifacts that you’re creating together in distribution, distributed teams, and the ways of coordinating that team to provide, you know, using established open source practices to provide trust overall on the product. And so in science, the reason we care about reproducibility, not for reproducibility in itself, but become prepared for predictability also is a way of ensuring trust,” — Lorena Barba, Director, Open Source Program Office, The George Washington University

Watch the panel session on YouTube.

Software Heritage plays a major role in preserving research code as a first-class research output and supporting computational reproducibility. Two tech talks drove this home: a presentation by Violaine Louvet (CNRS) on the French national catalog for research software, and a presentation by Petr Knoth (Director, CORE) on SOFAIR, a project dedicated to building trusted connections between research articles and software.

Software for cultural heritage and software as cultural heritage

Software is instrumental today in preserving and understanding cultural heritage, as Anthea Seles (Director of Archives and Records Management, McGill University) and Charles Henry (President, Council on Library & Information Resources) showed in the panel moderated by Fackson Banda (Chief, Documentary Heritage Unit, UNESCO).

But software is more than just code: it’s a testament to human creativity, collaboration, and technological progress. David C. Brock (Director of Curatorial Affairs, Computer History Museum) highlighted efforts to preserve historically significant software and ensure its recognition as part of our collective digital heritage.

“It’s estimated every day 5.2 billion selfies are taken. That’s a lot of snaps. All we’re asking is just a really small percentage of that population to turn that fun and focus on their culture and by doing that get a different sense of self and purpose,” Charles Henry.

You can catch up on all the videos from the Symposium on our YouTube channel.
