Software Heritage https://www.softwareheritage.org/?lang=es Thu, 04 Sep 2025 12:09:02 +0000

Why the space age’s most epic code still matters https://www.softwareheritage.org/2025/09/04/why-the-space-ages-most-epic-code-still-matters/?lang=es Thu, 04 Sep 2025 20:42:00 +0000 https://www.softwareheritage.org/?p=46796 Ron Burkey tells the story behind the Virtual AGC project and how it ended up at the Software Heritage Archive.

The post Why the space age’s most epic code still matters appeared first on Software Heritage.

It’s 2003. You’re watching «Apollo 13,» and you’re struck by a wild, almost insane idea: what if you could run the actual flight software that guided humanity to the moon? That was Ron Burkey. His singular obsession launched The Virtual AGC Project, a digital archaeology mission to rescue the Apollo Guidance Computer (AGC) code from the brink of obscurity. Fast-forward two decades, and his tireless work—painstakingly retyping code from a tower of faded printouts—has given us a time machine.

This isn’t just about nostalgia. It’s about a 60,000-line codebase that is, by every measure, a masterpiece of engineering under impossible constraints. As Burkey noted, the AGC was «very slow and had a very small memory,» yet it ran «real-time multitasking fault-tolerant executive software… quite sophisticated given the limited Hardware resources.» This code is a direct line to the minds of the people who worked for Margaret Hamilton, the visionary who led the software team. 

Margaret Hamilton and the AGC source code
Margaret Hamilton, who coined the term ‘software engineering’, standing beside the AGC source code

From paper to pixels: A digital dig 

The journey of this code is a story in itself. It didn’t just exist on a hard drive somewhere; it was a physical artifact: «a stack of 11-inch by 14-inch fan-fold paper a couple of inches thick.» The challenge wasn’t just to scan it—it was to make it usable.

Burkey’s team discovered early on that modern OCR (Optical Character Recognition) was useless. As he famously put it, «OCR is no CR.» The faded, low-quality printouts from the 1960s were a non-starter. Instead, a heroic team of volunteers manually transcribed every line of code. They’d then use modern assemblers to re-create the executable and compare it to the original, correcting any discrepancies until the transcription was perfect. This meticulous process even used clever visual tricks, overlaying colorized text on scans to spot potential errors.

This grueling work isn’t just about accuracy; it’s a testament to the code’s inherent value. It’s why non-profits like Software Heritage are so dedicated to its preservation. Burkey’s insights were part of a talk at Software Heritage Acquisition Process (SWHAP) Days in Paris.

The universal archive and the value of saved code 

So what’s the ultimate destination for this salvaged code? It’s not just gathering dust in a closet or on a few random GitHub repos. Software Heritage houses all of it in a global, non-profit archive. Think of it as the Library of Congress for all publicly available source code. The archive isn’t just storing files, but creating persistent intrinsic identifiers (SWHIDs) for every piece of code, right down to specific lines. This means that a researcher can cite a single line from the Apollo code today, and that link will still be valid decades from now.
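For illustration, a SWHID pins content by a cryptographic hash computed from the content itself, and qualifiers can narrow it down to a line range. The identifier below sketches the qualified form; the hash is a content identifier commonly used as an example in the SWHID documentation, and the line range is illustrative:

```
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=9-15
```

Because the hash is derived from the bytes of the file, anyone can recompute it and verify that the cited code is exactly what the author referenced.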

You can browse the entire Apollo codebase, a truly immersive experience that allows you to explore the very software that navigated the mission.

This raises a crucial question: beyond the historical thrill, why does this matter?

The code itself is a primary source document, a window into the developer culture of the 1960s. The comments, often a mix of technical notes and subtle human personality, provide context that textbooks can’t. Take, for example, the program called «Burn, Baby, Burn,» which was tasked with igniting the lunar module’s descent engine. The name traces back to the Los Angeles riots of 1965, inspired by the phrase used by disc jockey extraordinaire Magnificent Montague when spinning the hottest new records. It’s a testament to the code’s ability to capture not just technical notes, but the cultural zeitgeist of the era. The codebase is also a master class in efficiency, a reminder of how elegant and resourceful software can be when constrained by minimal hardware.

Apollo Guidance Computer and its DSKY
The Apollo Guidance Computer and its display keyboard (DSKY)

Finally, the very existence of this archive is a win for the concept of open knowledge. The fact that software developed with federal funding is in the public domain in the United States is a «simple and great idea» that still needs to be adopted globally. The preservation of the AGC code is a powerful argument for why we should treat software not just as a tool, but as a vital part of our cultural and intellectual heritage.

The Apollo code is just one example of this vital heritage. To build a comprehensive record of all software, we need everyone’s help.

Check the Archive to see if your code is already preserved for posterity; if not, add it with our Save Code Now feature.

Université Paris Saclay https://www.softwareheritage.org/2025/09/04/universite-paris-saclay/?lang=es Thu, 04 Sep 2025 11:44:16 +0000 https://www.softwareheritage.org/?p=46815 «Open-source software and codes are at the core of academic research. Building on its partnership with Software Heritage, Université Paris-Saclay will further enhance its quest to safeguard software longevity and...

The post Université Paris Saclay appeared first on Software Heritage.


«Open-source software and codes are at the core of academic research. Building on its partnership with Software Heritage, Université Paris-Saclay will further enhance its quest to safeguard software longevity and archival for the common societal good.»

— Etienne Augé, Vice-President for Open Science at Université Paris-Saclay

Bridging tech, activism, and the future of software archival https://www.softwareheritage.org/2025/08/26/neha_oudin_ambassador/?lang=es Tue, 26 Aug 2025 09:38:57 +0000 https://www.softwareheritage.org/?p=46699 Meet our new Ambassador, Neha Oudin, a data platform engineer, privacy advocate, and free software contributor.

The post Bridging tech, activism, and the future of software archival appeared first on Software Heritage.

Forget polite chitchat about historic buildings or the perfect waffle. When our new Ambassador, Neha Oudin, met team member Nicolas Dandrimont in Belgium, their conversation took a hard turn. They had known each other for years, and, as usual, they went deep: for almost an hour they dissected the nitty-gritty of building efficient, rock-solid hash tables. She had attended a keynote about Software Heritage back when it was a brand-new project and wanted to contribute more. Now, Oudin joins the ambassador community, eager to raise awareness about preserving technical and scientific knowledge.

She’s a Data Platform Engineer at Canonical, the company behind commercial support and services for Ubuntu and related projects. Fluent in Python, Rust, and Zsh, she also has knowledge of several other languages. Her technical interests include software engineering, security, free software, backend development, and scalable database deployment.

Oudin is also a privacy advocate. She’s a member of “La Quadrature du Net”, a French non-governmental organization dedicated to promoting and defending fundamental freedoms in the digital world. Within the activist community, it operates at the intersection of two influential forces: the free (libre) software activist movement, fueled by the emancipatory spirit of hackers and early internet pioneers, and various human rights associations, both French and international.

Oudin is also involved in free software projects like Tor, a free overlay network for anonymous communication. She contributes to BorgBackup, too, a deduplicating backup program that efficiently and securely backs up data, optionally supporting compression and authenticated encryption. She also presents at the Chaos Communication Congress and FOSDEM each year.

While Oudin’s background makes her right at home talking tech, she’s also a big believer in software archival as vital for many communities. She’s convinced that the earlier children start learning about the societal role of software, the better, and suggests weaving the topic into school curricula.

She also views software archiving as a crucial element on which society is built, and one where tech discussions quickly lead to broader topics. Her perspective brings an important, diverse view while actively protecting the rights of minorities and transgender individuals; software, in her view, is a vital component of our cultural heritage.

You can find more about her projects and contact information on her Ambassador profile.
We’re also seeking passionate individuals and organizations to volunteer as Ambassadors and help grow the Software Heritage community. If you’re interested in becoming an Ambassador, please share a bit about yourself and your connection to the Software Heritage mission.

Episciences links article code through Software Heritage https://www.softwareheritage.org/2025/08/21/episciences-links-article-code-software-heritage/?lang=es Thu, 21 Aug 2025 14:35:00 +0000 https://www.softwareheritage.org/?p=46663 Episciences enables linking publications to source code archived in Software Heritage, enhancing research reproducibility.

The post Episciences links article code through Software Heritage appeared first on Software Heritage.

Software Heritage, the universal source code archive, preserves and provides access to source code as vital digital heritage. Researchers can directly link their published scholarly articles to the software that powers them. This new capability enhances research reproducibility and transparency by connecting findings to specific software versions.

Software Heritage partnered with the Center for Direct Scientific Communication (CCSD) to make this happen. That collaboration first enabled software deposits on the open archive HAL in 2018, laying the groundwork for this new capability. Episciences, an overlay journal platform that hosts articles from open repositories like arXiv, Zenodo, and bioRxiv, now builds on that work: authors and journals using Episciences can link their articles with supplementary software via Software Heritage, using a SoftWare Hash IDentifier (SWHID) or a HAL-ID.

There are three basic steps:

  • Submit software to HAL. Depositing software via HAL ensures its sustainable archiving in Software Heritage. The complete deposit procedure is detailed in the HAL documentation: Deposit software source code.
  • Retrieve the identifier. Once the deposit is archived, it carries both a HAL-ID and a SoftWare Hash IDentifier (SWHID).
  • Add the link in Episciences. When submitting an article, provide the SWHID or HAL-ID so the publication points to the archived software.

Building on this ability to link articles and software, Episciences actively works to meet the evolving needs of researchers. Episciences is emerging as a new model in academic publishing, improving the visibility and accessibility of research articles that have already been peer-reviewed and published in conference proceedings. Instead of building a new library (a traditional journal), an overlay journal acts as a highly knowledgeable curator who goes through existing open shelves (repositories), selects the best books, writes introductions for them, and creates a guide (the journal) pointing readers to those excellent, freely available books. This approach allows researchers to submit their conference papers to Episciences for additional scrutiny and broader dissemination, potentially increasing the impact and reach of their work.

“One fundamental aspect of the openness of science is the close link between scientific publications and associated research data. This link is essential for the transparency, reproducibility, and the overall progress of science. Episciences responds to this dynamic by inviting authors to supplement the submission of their document with a link to the dataset and/or software used in their work.”

— Agnès Magron, CCSD

Beyond enabling these connections, Episciences actively contributes to the wider open science movement. The API and connector for Episciences were developed as part of the European Union-funded FAIRCORE4EOSC project. Episciences is also a member of the SCOSS Family. This commitment underscores why enabling this link via Episciences and the Software Heritage integration with HAL is paramount for research reproducibility, transparency, and accountability. 

The next step is leveraging the COAR Notify protocol (developed by the Confederation of Open Access Repositories, COAR) to share links between different research object types. 

The established partnership with CCSD, the successful HAL integration, and its utilization by Episciences provide a practical way for researchers to ensure their essential software is archived by Software Heritage and discoverable with their publications. Researchers and journals using platforms integrated with open repositories like HAL are encouraged to leverage this capability to link their software to their scholarly articles using Software Heritage. The partnership is about building a clearer, more open scientific story, where the findings and the code that powers them are part of the same picture.

IBM https://www.softwareheritage.org/2025/08/19/ibm/?lang=es Tue, 19 Aug 2025 15:31:47 +0000 https://www.softwareheritage.org/?p=46425 «IBM and Red Hat share a decades-long commitment to open source, from co-founding the Eclipse and Apache foundations to advancing Linux and donating key AI projects to the Linux AI...

The post IBM appeared first on Software Heritage.


«IBM and Red Hat share a decades-long commitment to open source, from co-founding the Eclipse and Apache foundations to advancing Linux and donating key AI projects to the Linux AI and Data Foundation. Open source has long been the foundation of innovation across software technologies, and today’s breakthroughs in AI are no exception.

We are proud to support Software Heritage as a vital steward of open innovation, helping to preserve and advance software for the benefit of society and the global community. We look forward to collaborating to strengthen impact in AI and beyond.»
— Nirmi Desai, Director, Data and Tools for AI Models, IBM Research

Preserving legacy code with Software Heritage: A tutorial https://www.softwareheritage.org/2025/08/13/preserving-legacy-code-software-heritage-tutorial/?lang=es Wed, 13 Aug 2025 12:08:00 +0000 https://www.softwareheritage.org/?p=46567 This tutorial shows how to use a structured approach to prepare your legacy software for preservation in the Software Heritage archive.

The post Preserving legacy code with Software Heritage: A tutorial appeared first on Software Heritage.

This post will walk you through the Software Heritage Acquisition Process (SWHAP), a step-by-step method for properly archiving your legacy source code into the Software Heritage Archive. You can also follow along with the 32-minute YouTube video or use the guide on GitHub prepared by team member Mathilde Fichen. If you’re looking for more help, check out the SWHAP guide or join our mailing list to share information with other rescue and curation teams.

Setting up your local working environment

Let’s get your local workspace set up. First, you’ll use a GitHub template to create a new repository, then clone it to your computer. This creates a local copy, making it easy to manage your files.

Start by creating your own GitHub repository using the provided template. Name it after your software, adding «workbench» to the end (e.g., «MySoftware_Workbench»), and indicate that it’s a private, temporary workspace. After you create it, you can update the README with details about your software.

Now, let’s create a local copy of this environment. Click the «code» button, copy the SSH link, and then use the git clone command in your Linux terminal to clone the repository to your computer.

Uploading raw material

Once your local workbench is set up, the next crucial step is to upload all your initial materials into the raw materials folder. This includes your original source code material, such as scanned paper listings, compressed files, or any initial digital versions. It’s also vital to upload any relevant documentation that explains the source of the code, such as emails from the historical author, provided the author consents.

«Zenith Z-19 Terminal» by ajmexico is licensed under CC BY 2.0

Next, you’ll upload the machine-readable version of your source code into the source code folder. If your code is in a non-digital format (like on paper), you’ll need to transcribe it first.

For better organization, especially if your software has multiple files, it’s a good idea to create subfolders. Just be sure to use the correct file extensions for your programming language (e.g., .for for Fortran or .py for Python).

To wrap things up, you’ll need to fill out the metadata folder. This folder contains several important elements that you should complete as thoroughly as possible:

  • Catalog: This file references the initial elements you uploaded into the raw materials folder. You should include details like the item’s name (e.g., «listing from 1971»), its origin (e.g., «author’s personal archives»), where the original is stored, the author’s name, approximate dates, and who collected it, along with any relevant descriptions or notes.
  • License: If you know the software’s license, fill it in. For private code that you own, you can specify any license you wish. If there’s no license, but you have explicit permission to archive and use the code (for academic or educational purposes, for example), be sure to state that.
  • version_history.csv: This CSV file is designed to register data for each version of your software. It’s useful for automating the reconstruction of your software’s development history if you have multiple versions. Remember to fill in details such as the directory where each version is stored, author names and emails, creation dates, release tags (official version numbers if available), and a commit message for each version.
  • codemeta.json: This file, in JSON format, is not meant for human reading but is crucial for search engines to easily find and identify your code and its linked data once archived. While you can update your codemeta.json file manually, we recommend using the CodeMeta generator website, which allows you to enter your software data in a user-friendly interface and then generates the necessary JSON snippet to paste into your codemeta.json file.
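As a starting point, a minimal codemeta.json might look like the sketch below, written here via a shell heredoc. Every field value is illustrative (the author name echoes the tutorial’s later example), not prescribed by the CodeMeta standard:

```shell
# Write a minimal codemeta.json into a scratch directory.
# All values below are example placeholders for your own software.
cd "$(mktemp -d)"
cat > codemeta.json <<'EOF'
{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "name": "MySoftware",
  "author": [{"@type": "Person", "familyName": "Colmerauer"}],
  "dateCreated": "1972-05-01",
  "programmingLanguage": "Fortran",
  "license": "https://spdx.org/licenses/MIT"
}
EOF
```

The CodeMeta generator produces the same kind of snippet interactively, which you can then paste into the file in your metadata folder.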

Syncing with GitHub

Once you’ve added all your materials and metadata locally, the next step is to synchronize these changes with your online GitHub repository, so that all your local work is backed up. Here’s how you do it:

First, navigate to your workbench directory using your command line. From there, you’ll run the same three Git commands for each of the three folders:

Raw materials:

  • Add your raw materials: git add "raw materials" (quote the folder name because it contains a space)
  • Commit these changes: git commit -m "Your small message here"
  • Push the changes to GitHub: git push

Source code:

  • Add your source code: git add "source code" (again, quote the folder name)
  • Commit these changes: git commit -m "Your small message here"
  • Push the changes to GitHub: git push

Metadata:

  • Add your metadata: git add metadata
  • Commit these changes: git commit -m "Your small message here"
  • Push the changes to GitHub: git push

Finally, check your GitHub repository to confirm that all your documents, like your raw materials, are visible. With that, you’ve now completed the first major step of getting your initial materials uploaded and synced to your workbench.

Reconstructing development history

This is a crucial phase, especially if your software has multiple versions. Your goal is to rebuild the development timeline of your source code on a new, dedicated GitHub branch.

1. Create an orphan branch: From your workbench, you first create a new orphan branch called SourceCode (with git checkout --orphan SourceCode). This branch is completely detached and doesn’t carry any previous commit history from your master branch.

2. Clean the branch: After creating the SourceCode branch, you’ll clear out any existing files within it by running git rm -r . and then committing the change. This prepares the branch for you to add each version of your source code one by one.
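Steps 1 and 2 can be sketched end to end in a scratch repository; all names and paths below are illustrative stand-ins for your workbench:

```shell
# Create a throwaway repo with one commit, then carve out an orphan branch.
set -e
cd "$(mktemp -d)"
git init -q .
git config user.email "you@example.com" && git config user.name "You"
echo demo > file.txt && git add -A && git commit -q -m "workbench content"
# Step 1: the orphan branch starts with no commit history at all
git checkout -q --orphan SourceCode
# Step 2: clear the files the new branch inherited in the index
git rm -r -f -q .
```

At this point the SourceCode branch is empty and ready to receive the first historical version.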

3. Copy and commit versions: Next, copy the first version of your software’s source code into this new branch.

Copy the source contents into the branch:

git checkout master -- source_code/v1/*
mv source_code/v1/* .
rm -rf source_code

Then use the following template to manually create an individual commit/release:

export GIT_COMMITTER_DATE="YYYY-MM-DD HH:MM:SS"
export GIT_COMMITTER_NAME="Committer Name"
export GIT_COMMITTER_EMAIL="email@address"
export GIT_AUTHOR_DATE="YYYY-MM-DD HH:MM:SS"
export GIT_AUTHOR_NAME="Author Name"
export GIT_AUTHOR_EMAIL="email@address"
git add -A
git commit -m "Commit Message Here"

Mind the metadata

When you import source code and commit it, Git will, by default, use your current user information and the present date. This means you would appear as both the committer and the author of the code, and the timestamp would be today’s date—not the historical date from when the code was originally created.

That’s not what we want. To get the commit history right—so it shows the code’s real origin—you have to change the commit’s metadata manually. The template in the guide allows you to explicitly set the author, committer, and dates for the commit, preserving the historical information of the source code. Finally, add a Git tag (for example, v1) to mark this as an official version.

export GIT_COMMITTER_DATE="2024-05-01 00:00:00"
export GIT_COMMITTER_NAME="Math Fichen"
export GIT_COMMITTER_EMAIL="mathfichen@monadresse.com"
export GIT_AUTHOR_DATE="1972-05-01 00:00:00"
export GIT_AUTHOR_NAME="Colmerauer et al."
export GIT_AUTHOR_EMAIL="<>"
git add -A
git commit -m "V1 of MySoftware"

4. Repeat for subsequent versions: If you have multiple versions, repeat the process. You’ll clean the repository again, copy the next version of the source code, and commit it with its respective historical metadata and a new tag (e.g., «v2»).

5. Push the branch: Finally, you’ll push this new source code branch (with its reconstructed history) to your GitHub repository.
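The push step can be sketched as follows; a local bare repository stands in here for your GitHub remote, and the branch, tag, and commit names follow the guide’s examples:

```shell
# Simulate pushing the reconstructed SourceCode branch and its tags.
set -e
base=$(mktemp -d)
git init -q --bare "$base/origin.git"          # stand-in for GitHub
git clone -q "$base/origin.git" "$base/workbench"
cd "$base/workbench"
git config user.email "you@example.com" && git config user.name "You"
git commit -q --allow-empty -m "initial"
git checkout -q --orphan SourceCode
git commit -q --allow-empty -m "V1 of MySoftware"
git tag v1                                     # official version marker from step 3
git push -q -u origin SourceCode               # publish the reconstructed branch
git push -q origin --tags                      # and its version tags
```

In practice, origin is the SSH URL of your workbench repository on GitHub.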

Pro-tip: automate the process

If you have many software versions, you can automate updating the commit metadata with a small tool called DT2SG. It uses the data you entered in the version_history.csv file to apply the correct historical metadata automatically.

Run the following command:

dotnet ./DT2SG/DT2SG_app.dll -r mathfichen/MySoftware_Workbench/source_code/ -m mathfichen/MySoftware_Workbench/metadata/version_history.csv

Creating the final public repo

Once the development history is reconstructed in your workbench, you’re ready to create the final public repository on GitHub. This is the repository that will be shared and ultimately archived by Software Heritage.

Go to GitHub and create a new repository. Name your repository after your software and make it public so Software Heritage can harvest it.

Copy the URL of this new, public repository.

Using specific Git commands in your Linux command line, you will transfer all the work you’ve done in your private «workbench» repository into this new public repository. This essentially pushes all branches and their content (master branch with raw materials and metadata, and the source code branch with its development history) to the public repository.
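The transfer can be sketched like this; a local bare repository stands in for the public GitHub URL, and the repository and remote names are illustrative:

```shell
# Simulate pushing every workbench branch and tag to the public repository.
set -e
base=$(mktemp -d)
git init -q --bare "$base/MySoftware.git"      # stand-in for the public repo
git init -q "$base/workbench"
cd "$base/workbench"
git config user.email "you@example.com" && git config user.name "You"
git commit -q --allow-empty -m "master content" && git tag v1
git remote add public "$base/MySoftware.git"
git push -q public --all                       # every branch, master and SourceCode
git push -q public --tags                      # and the version tags
```

With a real repository, the remote URL is the one you copied from GitHub, and `--all` plus `--tags` ensures both branches and the reconstructed version history arrive intact.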

As a final touch, it’s a good idea to add topics to your GitHub repository, such as software-heritage, legacy-code, archive, and swhap. This makes the repository easier to find when people search.

Triggering Software Heritage archival

The last step is to trigger the Software Heritage acquisition process itself.

  • Navigate to the Software Heritage «Save Code Now» page.
  • Enter the URL of your final, public GitHub repository into the designated section.
  • Submit the URL. Software Heritage will then process and archive your code. After a few minutes, you should be able to search for your software on the Software Heritage archive and find it archived.
  • As a final touch, you can generate «badges» for your archived software. This generates a code snippet (typically Markdown) that you can copy into your public GitHub repository’s README, displaying a badge confirming your software’s successful archival in Software Heritage.

And just like that, your legacy software is preserved in the Software Heritage archive.

Infosys https://www.softwareheritage.org/2025/08/11/infosys/?lang=es Mon, 11 Aug 2025 07:08:32 +0000 https://www.softwareheritage.org/?p=46298 Open source is an important part of our technology strategy, and we are excited to partner and collaborate with Software Heritage to curate the worldwide knowledge embedded in code and...

The post Infosys appeared first on Software Heritage.


«Open source is an important part of our technology strategy, and we are excited to partner and collaborate with Software Heritage to curate the worldwide knowledge embedded in code and use it to innovate for the future.»

— Rafee Tarafdar, CTO, Infosys

The post Infosys appeared first on Software Heritage.

How to preserve legacy code with Software Heritage https://www.softwareheritage.org/2025/08/06/how-to-preserve-legacy-code-software-heritage/?lang=es Wed, 06 Aug 2025 14:44:00 +0000 https://www.softwareheritage.org/?p=46532 Code is history. Discover SWHAP, the process designed by Software Heritage to
preserve legacy software.

The post How to preserve legacy code with Software Heritage appeared first on Software Heritage.

The Software Heritage Acquisition Process, or SWHAP, is a method developed by the Software Heritage team and its partners for saving and archiving older source code. This post and the companion 10-minute YouTube video offer an overview of what SWHAP is all about.

Understanding Software Heritage

First, a quick refresher on Software Heritage. It’s a non-profit dedicated to building a universal, open archive of source code. Usually, Software Heritage works by automatically collecting and saving public code already hosted on forges – online platforms like GitHub or GitLab where devs keep their projects. This automated system is massive, having already archived 400 million projects and over 25 billion unique source code files to date.

The challenge with legacy code

But what if your code isn’t sitting on one of these public platforms? That’s the core issue with «legacy» source code. This is code that isn’t easily accessible online – maybe it’s printed on paper, stuck on an old floppy disk, or just living on your hard drive. Getting this kind of code properly archived for the future is where things get complicated.

Why preserving source code matters

You might wonder why we bother preserving old, seemingly outdated code. Beyond its immediate function, source code is an invaluable record of technological history and human ingenuity. It safeguards intellectual heritage, allowing future generations to learn from past solutions and understand the evolution of software that underpins our world. Preserving these digital artifacts provides crucial context for researchers, historians, and developers to trace ideas and comprehend the thought processes behind their creation.

Margaret Hamilton and the AGC source code
Margaret Hamilton standing beside the Apollo Guidance Computer (AGC) source code, now archived at Software Heritage.

“Programs must be written for people to read, and only incidentally for machines to execute.”
― Harold Abelson, Structure and Interpretation of Computer Programs

This perspective highlights code’s role as a human document, not just machine instructions. If you have valuable source code you want to preserve but aren’t sure how, the SWHAP process is designed to help.

How to SWHAP: The basics

The SWHAP process involves two primary steps:

  1. Get your legacy source code onto a forge. In most cases, GitHub is the preferred platform, simply because it’s so widely used.
  2. Once it’s on GitHub, we can then trigger Software Heritage’s automated system. This ensures your code is securely pulled into the Software Heritage Archive.

What you’ll need

SWHAP requires a few specific tools and some prep work. First, if your code isn’t already in a digital format – say, if it’s a printout – you’ll need to transcribe it into an electronic file.
After that, you’ll need:

  • A GitHub account
  • A Linux command-line interface
  • Git installed on your computer
  • A secure SSH key configured for your GitHub account.
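If you don’t have an SSH key yet, generating one can be sketched as follows; the key path and email comment below are placeholders (in practice you would write to ~/.ssh and paste the public key into your GitHub account settings):

```shell
# Generate an Ed25519 SSH key pair in a scratch directory.
set -e
keydir=$(mktemp -d)
ssh-keygen -q -t ed25519 -C "you@example.com" -f "$keydir/id_ed25519" -N ""
# The public half is what you register with GitHub:
cat "$keydir/id_ed25519.pub"
```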

If these technical requirements seem daunting, our detailed SWHAP guide provides comprehensive setup assistance. You can also join our mailing list to share information with other rescue and curation teams.

The pitfalls of GitHub uploads

You might be tempted to skip these steps and just manually upload your code directly to GitHub. But that approach can cause significant problems. Here’s a real example of what SWHAP aims to prevent: a public GitHub repository for C-Prolog. While this is historically important code, an early Prolog interpreter from 1982, a glance at the repository reveals that a GitHub user uploaded it in 2017.

A casual visitor might assume the code is much newer than it is, and that the GitHub user, not the actual creator, wrote it. Worse, if you try to verify the code’s accuracy or origin, the only information is that it was «found somewhere on the net.» That offers no way to confirm its true source or authenticity. This is why SWHAP matters: it makes sure your code lands on GitHub with the correct history and vital information, preventing misunderstandings for anyone looking at it in the future.

Setting up your GitHub repository for SWHAP

Before diving into the precise steps, let’s go over what your GitHub repository should look like for SWHAP.

In this example, the code is called «MySoftware,» and the repository bears that name. It has two main sections, or branches:

  1. The master branch: This holds all the initial information for preservation, including metadata and the code’s origin. Typically, there are three key folders:
    • raw materials: For any original documents related to the code you’re preserving (e.g., a scanned paper listing).
    • source code: This is where the machine-readable version of your code goes.
    • metadata: As the name suggests, this folder holds all the descriptive information about your software.
  2. The source code branch: This becomes crucial if your software has multiple versions. For instance, if you have 10 different iterations, a future user might not want to sift through each one. However, seeing the code’s development over time is still very valuable. In this branch, we’ll recreate the software’s development timeline, adding each version sequentially using Git’s commit feature. This provides a practical way for anyone viewing the repository in the future to track how the source code evolved.
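
The timeline-recreation step can be sketched in a few lines of Python. This is an illustrative sketch, not the official SWHAP tooling: it assumes versions arrive oldest-first as (label, date) pairs, and the branch name `source-code` is a placeholder for whatever your workflow uses.

```python
def plan_version_commits(versions):
    """Build the git commands that replay each software version as a
    sequential commit on a dedicated branch, oldest version first.

    versions: list of (label, iso_date) tuples, e.g. [("1.0", "1982-03-01")].
    Returns a list of argv lists you could pass to subprocess.run().
    """
    commands = [["git", "checkout", "-b", "source-code"]]
    for label, date in versions:
        # At this point you would copy version `label`'s files into the
        # working tree, then stage and commit them with the historical date.
        commands.append(["git", "add", "-A"])
        commands.append(
            ["git", "commit", "-m", f"Import version {label}", "--date", date]
        )
    return commands
```

Each version then lands as one commit, so anyone browsing the repository later can read the log as the software’s development timeline.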

That’s it for the overview. Check out part two, which has a more detailed, step-by-step explanation of the SWHAP process.

The post How to preserve legacy code with Software Heritage appeared first on Software Heritage.

Why we need better software identification https://www.softwareheritage.org/2025/07/31/why-we-need-better-software-identification/?lang=es Thu, 31 Jul 2025 14:38:00 +0000 https://www.softwareheritage.org/?p=46509 Driven by cyberattacks and new regulations, software supply chain security is a top concern that requires robust software identification.

The post Why we need better software identification appeared first on Software Heritage.

With cybersecurity breaches and new regulations making headlines, software supply chain security is now top of mind for many people. New laws like the European Union’s Cyber Resilience Act (CRA) and recent United States Executive Orders are pushing for more transparency in digital goods.

All this attention means we need a solid, trustworthy way to identify software. Here’s the problem: the ways we currently name software and point to it in repositories often fall short. These identifiers can be ephemeral, ambiguous, or simply not secure enough, which leads to messy situations: confusion, name clashes, and dead links. These aren’t just minor annoyances; they’re open doors for attacks like «dependency confusion,» where bad actors trick systems into pulling malicious code. Worse, software artifacts can simply disappear or move, making it impossible to verify them later.

Clearly, we need a permanent fix that guarantees we can always find and verify software. This post outlines key information from the preprint paper «Software Identification for Cybersecurity: Survey and Recommendations for Regulators,» authored by Olivier Barais, Roberto Di Cosmo, Ludovic Mé, Stefano Zacchiroli, and Olivier Zendra with support from the SWHSec project.

Existing ID approaches: The good and the bad

There are two main types of software identification:

  • External IDs: These rely on outside info, like product names, version numbers, or links to package managers.
    • Pros: They’re usually easy for humans to read and they interoperate with existing catalogs like the National Vulnerability Database. Examples include SWID tags, Package URL (purl), and SPDX identifiers (SPDXID).
    • Cons: Their reliability depends on external lists or naming rules, which can change or even be reused. That causes conflicts and makes them unreliable for security checks.
  • Internal IDs: These come directly from the software’s actual content, usually using a cryptographic hash (like a digital fingerprint).
    • Pros: They offer uniqueness and integrity without relying on a central authority. They’re great for detecting tampering, have few external dependencies, and are hard to forge when a strong hash function is used. Simple SHA-256 checksums and Software Hash IDentifiers (SWHIDs) are examples.
    • Cons: They’re often not very human-readable, which can make searching or brand recognition tricky.
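
To make the internal-ID idea concrete, here is a minimal Python sketch (the `content_fingerprint` helper is a name invented for this example): an artifact’s identifier is just a cryptographic hash of its bytes, so any two parties can recompute and compare it without consulting any registry.

```python
import hashlib

def content_fingerprint(data: bytes) -> str:
    """Return the SHA-256 checksum of the raw bytes: a simple internal ID."""
    return hashlib.sha256(data).hexdigest()

# The same bytes always yield the same fingerprint; a single changed
# byte yields a completely different one, which is what makes tampering
# detectable.
release = b"example release tarball bytes"
print(content_fingerprint(release))
```

The trade-off mentioned above is visible immediately: the output is unique and verifiable, but tells a human nothing about the product name or version.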

In the real world, effective software bills of materials (SBOMs) and supply-chain tools generally combine both external references (which help connect with existing databases, vulnerability feeds, or licensing tools) and internal references (for strong integrity checks and guaranteed uniqueness). This means the smart approach is often to publish both—say, a purl or SWID alongside a cryptographic hash or SWHID. That way, you ensure both discoverability and verifiability.

Inside the SWHID

SWHIDs are content-based, permanent, and tamper-resistant. In 2025, they became an international standard (ISO/IEC 18670), making them globally recognized.

SWHIDs essentially package up both the data and its context using a clever Merkle DAG structure. This means each ID is directly tied to the exact piece of software it refers to.
They follow a simple pattern:
swh:<scheme_version>:<object_type>:<object_id>

Key types include:

  • Content (cnt): Identifies a single file based on its raw contents:
    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
  • Directory (dir): Points to a directory’s layout and what’s inside it, including IDs of its contents:
    swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505
  • Revision (rev): Like a «commit» in version control, holding details like who did it, when, and the message:
    swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d
  • Release (rel): Similar to a «tag,» pointing to a specific revision and maybe including a version name or signature:
    swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f
  • Snapshot (snp): Captures everything in a whole version control system (all branches) at one specific moment:
    swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453
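
For a taste of how these identifiers are derived: a cnt SWHID is the SHA-1 of the file’s bytes prefixed with a Git-style `blob <length>\0` header, which is why the same file archived from any origin always gets the same ID. A minimal sketch:

```python
import hashlib

def content_swhid(data: bytes) -> str:
    """Compute a swh:1:cnt: identifier (Git-style blob hash of the bytes)."""
    header = f"blob {len(data)}\0".encode()
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

# An empty file always maps to the same well-known identifier.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Directory, revision, release, and snapshot IDs are built the same way, but over serialized manifests that embed the IDs of their contents, which is what gives the archive its Merkle DAG structure.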

SWHIDs also allow for optional qualifiers to add more context. You can specify:

  • Lines qualifier (lines=…): To point to specific lines in a file: swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=112-116
  • Origin qualifier (origin=…): To say where the software was first seen:
    swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d;origin=https://github.com/example/repo
  • Path, anchor, and context qualifiers: These help pinpoint subdirectories, specific parts, or other key info for super-precise references:
    swh:1:dir:d198bc9d…;path=/docs;anchor=readme-section
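
Because the syntax is so regular, a SWHID with qualifiers can be pulled apart mechanically. The parser below is an illustrative sketch, not the official implementation, and performs only minimal validation:

```python
def parse_swhid(swhid: str) -> dict:
    """Split a SWHID into its core fields and any trailing qualifiers."""
    core, *qualifier_parts = swhid.split(";")
    scheme, version, object_type, object_id = core.split(":")
    if scheme != "swh":
        raise ValueError(f"not a SWHID: {swhid}")
    # Each qualifier has the form key=value; values may themselves
    # contain '=' (e.g. in URLs), so split only on the first one.
    qualifiers = dict(part.split("=", 1) for part in qualifier_parts)
    return {
        "scheme_version": int(version),
        "object_type": object_type,
        "object_id": object_id,
        "qualifiers": qualifiers,
    }
```

For example, feeding it the lines-qualified identifier above yields object type `cnt` with a `lines` qualifier of `112-116`.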

This way, SWHIDs combine the best of both internal and external identification methods into one stable system.

SWHIDs + The Software Heritage Archive

SWHIDs get even more robust when you link them with the Software Heritage Archive. Software Heritage is a non-profit project that preserves publicly available source code and its entire development history; once code is in, it’s never deleted. It’s the largest public archive of source code, holding more than 400 million projects, over 25 billion unique source code files, and more than five billion unique commits. The archive stores everything in a cryptographically secure structure, which avoids duplication to save space and guarantees that every artifact is exactly what it claims to be.

The combination of SWHIDs and the Software Heritage archive offers real advantages for meeting today’s legal requirements:

  • Guaranteed integrity: If the code changes even a little, the SWHID changes. This makes tampering immediately detectable.
  • Always there: SWHIDs don’t rely on outside services or websites, so they stay valid no matter where the code is hosted or if the original platform goes down. This solves the problem of code just vanishing.
  • Trackable history: SWHIDs identify parts of the Software Heritage structure, letting you trace a project’s development history, see where code came from, and check how different parts are related. Those extra qualifiers let you even track tiny code snippets.
  • Plays nice with rules: This combined approach directly helps meet the strict requirements for Software Bills of Materials (SBOMs), open-source security, and vulnerability management that the CRA and US Executive Orders demand.
  • Works everywhere: SWHIDs work consistently across all sorts of version control systems and software ecosystems.

The authors recommend SWHIDs, paired with the Software Heritage Archive, as the standard approach for referencing software, especially concerning the CRA and relevant US Executive Orders.
Here are some specifics for stakeholders:

  • Policy makers: Should mention SWHIDs (ISO/IEC 18670) in their rules and encourage their use in government purchases and funding programs.
  • Software companies: Should start making SWHIDs a part of their development process (CI/CD pipelines) to get stable IDs for their releases and patches.
  • Open source communities: Should publish official releases with their SWHIDs, ensure their code and history are archived by Software Heritage, and adopt best practices for referencing any outside software they use via SWHIDs in docs and SBOMs.

To wrap it up, using content-based, permanent software identifiers—specifically SWHIDs linked with the Software Heritage archive—is a strong and reliable answer to today’s cybersecurity and regulatory challenges. This approach builds trust and transparency, keeps us aligned with regulations, and even helps with innovation and saving money by simplifying compliance checks and cutting down on supply chain risks.

For more details and recommendations for implementation, check out the paper (preprint).

See our Publications section for more research from the Software Heritage Archive.

Sec4AI4Sec https://www.softwareheritage.org/2025/07/24/sec4ai4sec/?lang=es Thu, 24 Jul 2025 10:02:50 +0000 https://www.softwareheritage.org/?p=46458 This EU-funded project will create open-source tools and methods for designing secure AI-enhanced systems and AI-enhanced systems for security.

The post Sec4AI4Sec appeared first on Software Heritage.

Sec4AI4Sec is a research project funded by the European Union’s Horizon Program that aims to create a range of cutting-edge technologies, open-source tools, and methodologies for designing secure AI-enhanced systems and AI-enhanced systems for security.

Additionally, Sec4AI4Sec will provide reference benchmarks that can be used to standardize the evaluation of research outcomes in the secure software research community. In the context of Sec4AI4Sec, Software Heritage will leverage the content of its Archive to augment, publish, and curate open datasets about vulnerabilities affecting open-source projects, crucial in improving both security technologies and the security of open-source software. Funded by the European Union’s Horizon Europe programme (grant agreement No 101120393), Sec4AI4Sec aims to develop security-by-design testing and assurance technology for AI-enhanced systems, software, and assets.
