Why the space age’s most epic code still matters
https://www.softwareheritage.org/2025/09/04/why-the-space-ages-most-epic-code-still-matters/
Thu, 04 Sep 2025 20:42:00 +0000

Ron Burkey tells the story behind the Virtual AGC project and how it ended up at the Software Heritage Archive.

It’s 2003. You’re watching “Apollo 13,” and you’re struck by a wild, almost insane idea: what if you could run the actual flight software that guided humanity to the moon? That was Ron Burkey. His singular obsession launched The Virtual AGC Project, a digital archaeology mission to rescue the Apollo Guidance Computer (AGC) code from the brink of obscurity. Fast-forward two decades, and his tireless work—painstakingly retyping code from a tower of faded printouts—has given us a time machine.

This isn’t just about nostalgia. It’s about a 60,000-line codebase that is, by every measure, a masterpiece of engineering under impossible constraints. As Burkey noted, the AGC was “very slow and had a very small memory,” yet it ran “real-time multitasking fault-tolerant executive software… quite sophisticated given the limited Hardware resources.” This code is a direct line to the minds of the people who worked for Margaret Hamilton, the visionary who led the software team. 

Margaret Hamilton, who coined the term ‘software engineering’, standing beside the AGC source code

From paper to pixels: A digital dig 

The journey of this code is a story in itself. It didn’t just exist on a hard drive somewhere; it was a physical artifact: “a stack of 11-inch by 14-inch fan-fold paper a couple of inches thick.” The challenge wasn’t just to scan it—it was to make it usable.

Burkey’s team discovered early on that modern OCR (Optical Character Recognition) was useless. As he famously put it, “OCR is no CR.” The faded, low-quality printouts from the 1960s were a non-starter. Instead, a heroic team of volunteers manually transcribed every line of code. They’d then use modern assemblers to re-create the executable and compare it to the original, correcting any discrepancies until the transcription was perfect. This meticulous process even used clever visual tricks, overlaying colorized text on scans to spot potential errors.

This grueling work isn’t just about accuracy; it’s a testament to the code’s inherent value. It’s why non-profits like Software Heritage are so dedicated to its preservation. Burkey’s insights were part of a talk at the Software Heritage Acquisition Process (SWHAP) Days in Paris.

The universal archive and the value of saved code 

So what’s the ultimate destination for this salvaged code? It’s not just gathering dust in a closet or on a few random GitHub repos. Software Heritage houses all of it in a global, non-profit archive. Think of it as the Library of Congress for all publicly available source code. The archive doesn’t just store files; it creates persistent intrinsic identifiers (SWHIDs) for every piece of code, right down to specific lines. This means a researcher can cite a single line from the Apollo code today, and that link will still be valid decades from now.

You can browse the entire Apollo codebase, a truly immersive experience that allows you to explore the very software that navigated the mission.

This raises a crucial question: beyond the historical thrill, why does this matter?

The code itself is a primary source document, a window into the developer culture of the 1960s. The comments, often a mix of technical notes and subtle human personality, provide context that textbooks can’t. Take, for example, the program called “Burn, Baby, Burn,” which was tasked with igniting the lunar module’s descent engine. The name traces back to the Los Angeles riots of 1965, inspired by the phrase used by disc jockey extraordinaire Magnificent Montague when spinning the hottest new records. It’s a testament to the code’s ability to capture not just technical notes, but the cultural zeitgeist of the era. The codebase is also a master class in efficiency, a reminder of how elegant and resourceful software can be when constrained by minimal hardware.

The Apollo Guidance Computer and its display keyboard (DSKY)

Finally, the very existence of this archive is a win for the concept of open knowledge. The fact that software developed with federal funding is in the public domain in the United States is a “simple and great idea” that still needs to be adopted globally. The preservation of the AGC code is a powerful argument for why we should treat software not just as a tool, but as a vital part of our cultural and intellectual heritage.

The Apollo code is just one example of this vital heritage. To build a comprehensive record of all software, we need everyone’s help.

Check the Archive to see if your code is already preserved for posterity; if not, add it with our Save Code Now feature.

Bridging tech, activism, and the future of software archival
https://www.softwareheritage.org/2025/08/26/neha_oudin_ambassador/
Tue, 26 Aug 2025 09:38:57 +0000

Meet our new Ambassador, Neha Oudin, a data platform engineer, privacy advocate, and free software contributor.

Forget polite chit-chat about historic buildings or the perfect waffle. When our new Ambassador, Neha Oudin, met team member Nicolas Dandrimont in Belgium, their conversation took a hard turn. They had known each other for years, and, as usual, they went deep, spending almost an hour dissecting the nitty-gritty of building efficient, rock-solid hash tables. She had attended a keynote about Software Heritage back when it was a brand-new project and wanted to contribute more. Now, Oudin joins the ambassador community, eager to raise awareness about preserving technical and scientific knowledge.

She’s a Data Platform Engineer at Canonical, the company founded to provide commercial support and related services for Ubuntu and related projects. Fluent in Python, Rust, and Zsh, she also has knowledge of several other languages. Her technical interests include software engineering, security, free software, backend development, and scalable database deployment.

Oudin is also a privacy advocate. She’s a member of “La Quadrature du Net”, a French non-governmental organization dedicated to promoting and defending fundamental freedoms in the digital world. Within the activist community, it operates at the intersection of two influential forces: the free (libre) software activist movement, fueled by the emancipatory spirit of hackers and early internet pioneers, and various human rights associations, both French and international.

Oudin is also involved in free software projects like Tor, a free overlay network for anonymous communication. She contributes to BorgBackup, too, a deduplicating backup program that efficiently and securely backs up data, optionally supporting compression and authenticated encryption. She also presents at the Chaos Communication Congress and FOSDEM each year.

While Oudin’s background makes her right at home talking tech, she’s also a strong advocate for software archival as a vital need across many communities. She’s convinced that the earlier children start learning about the societal role of software, the better, and suggests weaving the topic into school curricula.

She views software archiving as a crucial element on which society is built, and since tech discussions quickly lead to broader topics, she also sees it through the lens of protecting the rights of minorities and transgender individuals. Her perspective brings an important, diverse view to the community. Software, after all, is a vital component of our cultural heritage.

You can find more about her projects and contact information on her Ambassador profile.
We’re also seeking passionate individuals and organizations to volunteer as Ambassadors and help grow the Software Heritage community. If you’re interested in becoming an Ambassador, please share a bit about yourself and your connection to the Software Heritage mission.

Episciences links article code through Software Heritage
https://www.softwareheritage.org/2025/08/21/episciences-links-article-code-software-heritage/
Thu, 21 Aug 2025 14:35:00 +0000

Episciences enables linking publications to source code archived in Software Heritage, enhancing research reproducibility.

Software Heritage, the universal source code archive, preserves and provides access to source code as vital digital heritage. Researchers can now directly link their published scholarly articles to the software that powers them, a capability that enhances research reproducibility and transparency by connecting findings to specific software versions.

Software Heritage partnered with the Center for Direct Scientific Communication (CCSD) to make this happen. The collaboration first enabled software deposits on the open archive HAL in 2018, laying the groundwork for this new capability. Building on that integration, Episciences, an overlay journal platform that hosts articles from open repositories like arXiv, Zenodo, and bioRxiv, now lets authors and journals link their articles with supplementary software archived in Software Heritage, using a SoftWare Hash IDentifier (SWHID) or a HAL-ID.

There are three basic steps:

  • Submit software to HAL. Depositing software via HAL ensures its sustainable archiving in Software Heritage; the complete deposit procedure is detailed in the HAL documentation: Deposit software source code.
  • Retrieve the identifier. The deposit receives a HAL-ID, and the archived code a SWHID, either of which points to the exact archived version.
  • Link it from the article. Provide that identifier when submitting to the Episciences journal so the publication is connected to the archived code.

Building on this ability to link articles and software, Episciences actively works to meet the evolving needs of researchers. Episciences is emerging as a new model in academic publishing, improving the visibility and accessibility of research articles that have already been peer-reviewed and published in conference proceedings. Instead of building a new library (a traditional journal), an overlay journal acts as a highly knowledgeable curator who goes through existing open shelves (repositories), selects the best books, writes introductions for them, and creates a guide (the journal) pointing readers to those excellent, freely available books. This approach allows researchers to submit their conference papers to Episciences for additional scrutiny and broader dissemination, potentially increasing the impact and reach of their work.

“One fundamental aspect of the openness of science is the close link between scientific publications and associated research data. This link is essential for the transparency, reproducibility, and the overall progress of science. Episciences responds to this dynamic by inviting authors to supplement the submission of their document with a link to the dataset and/or software used in their work.” — Agnès Magron, CCSD

Beyond enabling these connections, Episciences actively contributes to the wider open science movement. The API and connector for Episciences were developed as part of the European Union-funded FAIRCORE4EOSC project. Episciences is also a member of the SCOSS Family. This commitment underscores why enabling this link via Episciences and the Software Heritage integration with HAL is paramount for research reproducibility, transparency, and accountability. 

The next step is leveraging the COAR Notify protocol (developed by the Confederation of Open Access Repositories, COAR) to share links between different research object types. 

The established partnership with CCSD, the successful HAL integration, and its utilization by Episciences provide a practical way for researchers to ensure their essential software is archived by Software Heritage and discoverable with their publications. Researchers and journals using platforms integrated with open repositories like HAL are encouraged to leverage this capability to link their software to their scholarly articles using Software Heritage. The partnership is about building a clearer, more open scientific story, where the findings and the code that powers them are part of the same picture.

Preserving legacy code with Software Heritage: A tutorial
https://www.softwareheritage.org/2025/08/13/preserving-legacy-code-software-heritage-tutorial/
Wed, 13 Aug 2025 12:08:00 +0000

This tutorial shows how to use a structured approach to prepare your legacy software for preservation in the Software Heritage archive.

This post will walk you through the Software Heritage Acquisition Process (SWHAP), a step-by-step method for properly archiving your legacy source code into the Software Heritage Archive. You can also follow along with the 32-minute YouTube video or use the guide on GitHub prepared by team member Mathilde Fichen. If you’re looking for more help, check out the SWHAP guide or join our mailing list to share information with other rescue and curation teams.

Setting up your local working environment

Let’s get your local workspace set up. First, you’ll use a GitHub template to create a new repository, then clone it to your computer. This creates a local copy, making it easy to manage your files.

Start by creating your own GitHub repository using the provided template. Name it after your software, adding “workbench” to the end (e.g., “MySoftware_Workbench”; GitHub repository names can’t contain spaces) and indicate that it’s a private, temporary workspace. After you create it, you can update the README with details about your software.

Now, let’s create a local copy of this environment. Click the “code” button, copy the SSH link, and then use the git clone command in your Linux terminal to clone the repository to your computer.
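For example, assuming your GitHub account is your-user and you kept the name MySoftware_Workbench (both placeholders):

# clone the workbench over SSH and step into it
git clone git@github.com:your-user/MySoftware_Workbench.git
cd MySoftware_Workbench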

Uploading raw material

Once your local workbench is set up, the next crucial step is to upload all your initial materials into the raw materials folder. This includes your original source code material, such as scanned paper listings, compressed files, or any initial digital versions. It’s also vital to upload any relevant documentation that explains the source of the code, such as emails from the historical author, provided the author consents.

Zenith Z-19 Terminal” by ajmexico is licensed under CC BY 2.0

Next, you’ll upload the machine-readable version of your source code into the source code folder. If your code is in a non-digital format (like on paper), you’ll need to transcribe it first.

For better organization, especially if your software has multiple files, it’s a good idea to create subfolders. Just be sure to use the correct file extensions for your programming language (e.g., .for for Fortran or .py for Python).

To wrap things up, you’ll need to fill out the metadata folder. This folder contains several important elements that you should complete as thoroughly as possible:

  • Catalog: This file references the initial elements you uploaded into the raw materials folder. You should include details like the item’s name (e.g., “listing from 1971”), its origin (e.g., “author’s personal archives”), where the original is stored, the author’s name, approximate dates, and who collected it, along with any relevant descriptions or notes.
  • License: If you know the software’s license, fill it in. For private code that you own, you can specify any license you wish. If there’s no license, but you have explicit permission to archive and use the code (for academic or educational purposes, for example), be sure to state that.
  • Version history.csv: This CSV file is designed to register data for each version of your software. It’s useful for automating the reconstruction of your software’s development history if you have multiple versions. Remember to fill in details such as the directory where each version is stored, author names and emails, creation dates, release tags (official version numbers if available), and a commit message for each version.
  • Codemeta.json: This file, in JSON format, is not meant for human reading but is crucial for search engines to easily find and identify your code and its linked data once archived. While you can update your codemeta.json file manually, we recommend using the CodeMeta generator website, which allows you to enter your software data in a user-friendly interface and then generates the necessary JSON snippet to paste into your codemeta.json file.
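For illustration, here’s a minimal codemeta.json sketch (the field values are placeholders; the CodeMeta generator will produce a complete version for you):

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "name": "MySoftware",
  "description": "A short description of the software",
  "dateCreated": "1972-05-01",
  "author": [{"@type": "Person", "givenName": "Alain", "familyName": "Colmerauer"}],
  "programmingLanguage": "Fortran"
}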

Syncing with GitHub

Once you’ve added all your materials and metadata locally, the next step is to synchronize these changes with your online GitHub repository, so that all your local work is backed up there. Using your Linux terminal, navigate to your workbench directory; from there, you’ll use Git commands to add, commit, and push each folder in three main parts:

Raw materials:

  • Add your raw materials: git add raw_materials
  • Commit these changes: git commit -m "Your small message here"
  • Push the changes to GitHub: git push

Source code:

  • Add your source code: git add source_code
  • Commit these changes: git commit -m "Your small message here"
  • Push the changes to GitHub: git push

Metadata:

  • Add your metadata: git add metadata
  • Commit these changes: git commit -m "Your small message here"
  • Push the changes to GitHub: git push

Finally, check your GitHub repository to confirm that all your documents, like your raw materials, are visible. With that, you’ve now completed the first major step of getting your initial materials uploaded and synced to your workbench.

Reconstructing development history

This is a crucial phase, especially if your software has multiple versions. Your goal is to rebuild the development timeline of your source code on a new, dedicated GitHub branch.

1. Create an orphan branch: From your workbench, you first create a new branch called SourceCode. This orphan branch is completely detached and doesn’t carry any previous commit history from your master branch.

2. Clean the branch: After creating the SourceCode branch, you’ll clear out any files it inherited by running git rm -r . and then committing the change. This prepares the branch for you to add each version of your source code one by one.
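Concretely, steps 1 and 2 might look like this (a sketch; pass -f to git rm if it complains about staged files):

# create a detached branch with no prior history
git checkout --orphan SourceCode
# remove the files inherited from the working tree
git rm -r .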

3. Copy and commit versions: Next, copy-paste the first version of your software’s source code into this new branch.

Copy the source contents into our branch:

# bring version 1 over from the master branch
git checkout master -- source_code/v1/*
# move the files to the branch root and drop the now-empty folder
mv source_code/v1/* .
rm -rf source_code

Then use the following template to manually create an individual commit/release:

export GIT_COMMITTER_DATE="YYYY-MM-DD HH:MM:SS"
export GIT_COMMITTER_NAME="Committer Name"
export GIT_COMMITTER_EMAIL="email@address"
export GIT_AUTHOR_DATE="YYYY-MM-DD HH:MM:SS"
export GIT_AUTHOR_NAME="Author Name"
export GIT_AUTHOR_EMAIL="email@address"
git add -A
git commit -m "Commit Message Here"

Mind the metadata

When you import source code and commit it, Git will, by default, use your current user information and the present date. This means you would appear as both the committer and the author of the code, and the timestamp would be today’s date—not the historical date from when the code was originally created.

That’s not what we want. To get the commit history right—so it shows the code’s real origin—you have to change the commit’s metadata manually. The template in the guide allows you to explicitly set the author, committer, and dates for the commit, preserving the historical information of the source code. Finally, add a Git tag (for example, v1) to mark this as an official version.

export GIT_COMMITTER_DATE="2024-05-01 00:00:00"
export GIT_COMMITTER_NAME="Math Fichen"
export GIT_COMMITTER_EMAIL="mathfichen@monadresse.com"
export GIT_AUTHOR_DATE="1972-05-01 00:00:00"
export GIT_AUTHOR_NAME="Colmerauer et al."
export GIT_AUTHOR_EMAIL="<>"
git add -A
git commit -m "V1 of MySoftware"

4. Repeat for subsequent versions: If you have multiple versions, repeat the process. You’ll clean the repository again, copy the next version of the source code, and commit it with its respective historical metadata and a new tag (e.g., “v2”).

5. Push the branch: Finally, you’ll push this new source code branch (with its reconstructed history) to your GitHub repository.

Pro-tip: automate the process

If you have many software versions, you can automate the process of updating the commit metadata with a small tool called DT2SG, which applies the historical metadata you entered in the version_history.csv file automatically.

Run the following command:

dotnet ./DT2SG/DT2SG_app.dll -r mathfichen/MySoftware_Workbench/source_code/ -m mathfichen/MySoftware_Workbench/metadata/version_history.csv

Creating the final public repo

Once the development history is reconstructed in your workbench, you’re ready to create the final public repository on GitHub. This is the repository that will be shared and ultimately archived by Software Heritage.

Go to GitHub and create a new repository. Name your repository after your software and make it public so Software Heritage can harvest it.

Copy the URL of this new, public repository.

Using specific Git commands in your Linux command line, you will transfer all the work you’ve done in your private “workbench” repository into this new public repository. This essentially pushes all branches and their content (master branch with raw materials and metadata, and the source code branch with its development history) to the public repository.
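A sketch of that transfer (the remote name public and the URL are placeholders; use the URL you just copied):

# register the public repository as a second remote of your workbench
git remote add public git@github.com:your-user/MySoftware.git
# push every branch and every tag to it
git push public --all
git push public --tags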

As a final touch, it’s a good idea to add topics to your GitHub repository, such as software-heritage, legacy-code, archive, and swhap. This makes the repository easier to find when people search.

Triggering Software Heritage archival

The last step is to trigger the Software Heritage acquisition process itself.

  • Navigate to the Software Heritage “Save Code Now” page.
  • Enter the URL of your final, public GitHub repository into the designated section.
  • Submit the URL. Software Heritage will then process and archive your code. After a few minutes, you should be able to search for your software on the Software Heritage archive and find it archived.
  • As a final touch, you can generate “badges” for your archived software. This generates a code snippet (typically Markdown) that you can copy into your public GitHub repository’s README, displaying a badge confirming your software’s successful archival in Software Heritage.
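For a repository archived from GitHub, such a badge snippet looks roughly like this (a sketch; substitute your repository’s URL for the placeholder):

[![Software Heritage](https://archive.softwareheritage.org/badge/origin/https://github.com/your-user/MySoftware/)](https://archive.softwareheritage.org/browse/origin/?origin_url=https://github.com/your-user/MySoftware)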

And just like that, your legacy software is preserved in the Software Heritage archive.

How to preserve legacy code with Software Heritage
https://www.softwareheritage.org/2025/08/06/how-to-preserve-legacy-code-software-heritage/
Wed, 06 Aug 2025 14:44:00 +0000

Code is history. Discover SWHAP, the process designed by Software Heritage to preserve legacy software.

The Software Heritage Acquisition Process, or SWHAP, is a method developed by the Software Heritage team and its partners for saving and archiving older source code. This post and the companion 10-minute YouTube video offer an overview of what SWHAP is all about.

Understanding Software Heritage

First, a quick refresher on Software Heritage. It’s a non-profit dedicated to building a universal, open archive of source code. Usually, Software Heritage works by automatically collecting and saving public code already hosted on forges – online platforms like GitHub or GitLab where devs keep their projects. This automated system is massive, having already archived 400 million projects and over 25 billion unique source code files to date.

The challenge with legacy code

But what if your code isn’t sitting on one of these public platforms? That’s the core issue with “legacy” source code. This is code that isn’t easily accessible online – maybe it’s printed on paper, stuck on an old floppy disk, or just living on your hard drive. Getting this kind of code properly archived for the future is where things get complicated.

Why preserving source code matters

You might wonder why we bother preserving old, seemingly outdated code. Beyond its immediate function, source code is an invaluable record of technological history and human ingenuity. It safeguards intellectual heritage, allowing future generations to learn from past solutions and understand the evolution of software that underpins our world. Preserving these digital artifacts provides crucial context for researchers, historians, and developers to trace ideas and comprehend the thought processes behind their creation.

Margaret Hamilton standing beside the Apollo Guidance Computer (AGC) source code, now archived at Software Heritage.

“Programs must be written for people to read, and only incidentally for machines to execute.”
― Harold Abelson, Structure and Interpretation of Computer Programs

This perspective highlights code’s role as a human document, not just machine instructions. If you have valuable source code you want to preserve but aren’t sure how, the SWHAP process is designed to help.

How to SWHAP: The basics

The SWHAP process involves two primary steps:

  1. Get your legacy source code onto a forge. In most cases, GitHub is the preferred platform, simply because it’s so widely used.
  2. Once it’s on GitHub, we can then trigger Software Heritage’s automated system. This ensures your code is securely pulled into the Software Heritage Archive.

What you’ll need

SWHAP requires a few specific tools and some prep work. First, if your code isn’t already in a digital format – say, if it’s a printout – you’ll need to transcribe it into an electronic file.
After that, you’ll need:

  • A GitHub account
  • A Linux command-line interface
  • Git installed on your computer
  • A secure SSH key configured for your GitHub account.

If these technical requirements seem daunting, our detailed SWHAP guide provides comprehensive setup assistance. You can also join our mailing list to share information with other rescue and curation teams.

The pitfalls of GitHub uploads

You might be tempted to skip these steps and just manually upload your code directly to GitHub. But that approach can cause significant problems. Here’s a real example of what SWHAP aims to prevent: a public GitHub repository for C-Prolog. While this is historically important code—an early interpreter from 1982—a glance at the repository shows that a GitHub user uploaded it in 2017.

A casual visitor might assume the code is much newer than it is, and that the GitHub user, not the actual creator, wrote it. Worse, if you try to verify the code’s accuracy or origin, the only information is that it was “found somewhere on the net.” That offers no way to confirm its true source or authenticity. This is why SWHAP matters: it makes sure your code lands on GitHub with the correct history and vital information, preventing misunderstandings for anyone looking at it in the future.

Setting up your GitHub repository for SWHAP

Before diving into the precise steps, let’s go over what your GitHub repository should look like for SWHAP.

In this example, the code is called “MySoftware,” and the repository bears that name. It has two main sections, or branches:

  1. The master branch: This holds all the initial information for preservation, including metadata and the code’s origin. Typically, there are three key folders:
    • raw materials: For any original documents related to the code you’re preserving (e.g., a scanned paper listing).
    • source code: This is where the machine-readable version of your code goes.
    • metadata: As the name suggests, this folder holds all the descriptive information about your software.
  2. The source code branch: This becomes crucial if your software has multiple versions. For instance, if you have 10 different iterations, a future user might not want to sift through each one. However, seeing the code’s development over time is still very valuable. In this branch, we’ll recreate the software’s development timeline, adding each version sequentially using Git’s commit feature. This provides a practical way for anyone viewing the repository in the future to track how the source code evolved.
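Putting it together, the master branch of the workbench sketched in this guide looks something like this (folder and file names follow the tutorial’s conventions and are indicative):

MySoftware_Workbench/
├── raw_materials/     (scans, listings, emails documenting provenance)
├── source_code/       (the machine-readable code, one subfolder per version)
└── metadata/
    ├── catalog
    ├── license
    ├── version_history.csv
    └── codemeta.json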

That’s it for the overview. Check out part two, which has a more detailed, step-by-step explanation of the SWHAP process.

Why we need better software identification
https://www.softwareheritage.org/2025/07/31/why-we-need-better-software-identification/
Thu, 31 Jul 2025 14:38:00 +0000

Driven by cyberattacks and new regulations, software supply chain security is a top concern that requires robust software identification.

With cybersecurity breaches and new regulations making headlines, software supply chain security is now top of mind for many people. New laws like the European Union’s Cyber Resilience Act (CRA) and recent United States Executive Orders are pushing for more transparency in digital goods.

All this attention means we need a solid, trustworthy way to identify software. Here’s the problem: the ways we currently name software and point to it in repositories often fall short. These identifiers can be temporary, vague, or simply not secure enough, which leads to messy situations: confusion, name clashes, and outdated links. These aren’t just minor annoyances; they’re open doors for attacks, like “dependency confusion,” where bad actors trick systems into using malicious code. Plus, software artifacts can simply disappear or move, making it impossible to check them later.

Clearly, we need a permanent fix that guarantees we can always find and verify software. This post outlines key information from the preprint paper “Software Identification for Cybersecurity: Survey and Recommendations for Regulators,” authored by Olivier Barais, Roberto Di Cosmo, Ludovic Mé, Stefano Zacchiroli, and Olivier Zendra with support from the SWHSec project.

Existing ID approaches: The good and the bad

There are two main types of software identification:

  • External IDs: These rely on outside info, like product names, version numbers, or links to package managers.
    • Pros: They’re usually easy for humans to read and work with existing lists like the National Vulnerability Database. Some examples: the SWID, Package URL (purl), and SPDXID.
    • Cons: Their reliability depends on external lists or naming rules, which can change or even be reused. That causes conflicts and makes them unreliable for security checks.
  • Internal IDs: These come directly from the software’s actual content, usually using a cryptographic hash (like a digital fingerprint).
    • Pros: They offer uniqueness and integrity without relying on a central authority. They’re great for spotting if something’s been tampered with, don’t rely much on outside dependencies, and are difficult to fake with good hashing. Simple SHA256 checksums and Software Hash IDentifiers (SWHIDs) are examples.
    • Cons: They’re often not very human-readable, which can make searching or brand recognition tricky.

In the real world, effective software bills of materials (SBOMs) and supply-chain tools generally combine both external references (which help connect with existing databases, vulnerability feeds, or licensing tools) and internal references (for strong integrity checks and guaranteed uniqueness). This means the smart approach is often to publish both—say, a purl or SWID alongside a cryptographic hash or SWHID. That way, you ensure both discoverability and verifiability.
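As an illustration (not any particular SBOM schema), a single component entry carrying both identifier styles might look like this; all values are placeholders except the SWHID format itself, which reuses the release example shown later in this post:

{
  "name": "example-lib",
  "version": "1.2.3",
  "identifiers": {
    "purl": "pkg:pypi/example-lib@1.2.3",
    "swhid": "swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f",
    "sha256": "<checksum of the release artifact>"
  }
}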

Inside the SWHID

SWHIDs are content-based, permanent, and hard to tamper with. In 2025, they became an international standard (ISO/IEC 18670), making them globally recognized.

SWHIDs essentially package up both the data and its context using a clever Merkle DAG structure. This means each ID is directly tied to the exact piece of software it refers to.
They follow a simple pattern:
swh:<schema_version>:<object_type>:<object_id>

Key types include:

  • Content (cnt): Identifies a single file based on its raw contents:
    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
  • Directory (dir): Points to a directory’s layout and what’s inside it, including IDs of its contents:
    swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505
  • Revision (rev): Like a “commit” in version control, holding details like who did it, when, and the message:
    swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d
  • Release (rel): Similar to a “tag,” pointing to a specific revision and maybe including a version name or signature:
    swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f
  • Snapshot (snp): Captures everything in a whole version control system (all branches) at one specific moment:
    swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453

SWHIDs also allow for optional qualifiers to add more context. You can specify:

  • Lines qualifier (lines=…): To point to specific lines in a file:
    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=112-116
  • Origin qualifier (origin=…): To say where the software was first seen:
    swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d;origin=https://github.com/example/repo
  • Path, anchor, and context qualifiers: These help pinpoint subdirectories, specific parts, or other key info for super-precise references:
    swh:1:dir:d198bc9d…;path=/docs;anchor=readme-section

This way, SWHIDs combine the best of both internal and external identification methods into one stable system.

SWHIDs + The Software Heritage Archive

SWHIDs get even more robust when you link them with the Software Heritage Archive. Software Heritage is a non-profit project that saves publicly available source code and its entire history, and once code is in, it’s never deleted. It’s the biggest public archive of source code, holding 400 million projects, over 25 billion unique source code files, and more than five billion unique commits. The archive stores everything in a cryptographically secure way, which saves space through deduplication and ensures everything is truly what it claims to be.

The combination of SWHIDs and the Software Heritage archive offers real advantages for meeting today’s legal requirements:

  • Guaranteed integrity: If the code changes even a little, the SWHID changes. This makes tampering immediately detectable.
  • Always there: SWHIDs don’t rely on outside services or websites, so they stay valid no matter where the code is hosted or if the original platform goes down. This solves the problem of code just vanishing.
  • Trackable history: SWHIDs identify parts of the Software Heritage structure, letting you trace a project’s development history, see where code came from, and check how different parts are related. Those extra qualifiers let you even track tiny code snippets.
  • Plays nice with rules: This combined approach directly helps meet the strict requirements for Software Bills of Materials (SBOMs), open-source security, and vulnerability management that the CRA and US Executive Orders demand.
  • Works everywhere: SWHIDs work consistently across all sorts of version control systems and software ecosystems.

The authors recommend SWHIDs, paired with the Software Heritage Archive, as the standard approach for referencing software, especially concerning the CRA and relevant US Executive Orders.
Here are some specifics for stakeholders:

  • Policy makers: Should mention SWHIDs (ISO/IEC 18670) in their rules and encourage their use in government purchases and funding programs.
  • Software companies: Should start making SWHIDs a part of their development process (CI/CD pipelines) to get stable IDs for their releases and patches.
  • Open source communities: Should publish official releases with their SWHIDs, ensure their code and history are archived by Software Heritage, and adopt best practices for referencing any outside software they use via SWHIDs in docs and SBOMs.

To wrap it up, using content-based, permanent software identifiers—specifically SWHIDs linked with the Software Heritage archive—is a strong and reliable answer to today’s cybersecurity and regulatory challenges. This approach builds trust and transparency, keeps us aligned with regulations, and even helps with innovation and saving money by simplifying compliance checks and cutting down on supply chain risks.

For more details and recommendations for implementation, check out the paper (preprint).

See our Publications section for more research from the Software Heritage Archive.

A new era of software engineering, cybersecurity, & AI
https://www.softwareheritage.org/2025/07/23/software-heritage-oplss-lecture-software-engineering-cybersecurity-ai/
Wed, 23 Jul 2025 14:57:00 +0000

A recent talk by Director Roberto Di Cosmo highlights how 10 years in, Software Heritage aims its ‘large telescope’ at the future of code.

Software powers virtually everything we do, yet much of its history remains elusive. It’s more than just a tool; it’s a ‘concentrate of knowledge’—both human-readable and executable. Despite its critical role, current approaches to managing, preserving, and analyzing software in academia and industry are, frankly, far from satisfactory. Reproducibility suffers, documentation vanishes, and vital source code becomes impossible to find years later. That’s exactly why Software Heritage was born, says director Roberto Di Cosmo, speaking recently to students at the Oregon Programming Languages Summer School (OPLSS).

The big picture: A “telescope for software”

Imagine a project dedicated to going out, collecting, preserving, organizing, and sharing all publicly available software source code. That’s the core mission of Software Heritage. You can think of it as a reference catalog or a digital archive. But even more interestingly, it’s designed to be a real research infrastructure – like a “gigantic telescope that allows us… to do massive analysis of the galaxy of software development today.” Just as the James Webb Space Telescope explores the universe, Software Heritage aims to explore the universe of code. This monumental effort isn’t driven by venture capital but by a long-term, nonprofit, multistakeholder initiative.

“In academia, software management is poor; in industry, it’s a looming disaster. A dedicated infrastructure is urgently needed to tackle these complex issues,” Roberto Di Cosmo

The scale is staggering: it holds almost 400 million projects, over 25 billion unique source code files, and more than five billion unique commits. Software Heritage tirelessly crawls more than 5,000 different code hosting and distribution platforms every single day.

Building the beast: The unified cryptographic graph

So, how do you even begin to handle such an immense and diverse amount of data? The internet has standards like HTTP and URLs that make web crawling relatively straightforward. In the world of software, however, every platform (GitHub, GitLab, npm, PyPI) has its own feed, API, and interface language. On top of that, developers use many different version control systems (Subversion, Git, Bazaar, Mercurial). To manage this vast, diverse data, Software Heritage developed adapters and a specialized tool for converting information from various version control systems. The goal is to consolidate all software development history into a single, large ‘cryptographic graph.’ This graph works like a blockchain, but without needing a distributed consensus algorithm.

It’s made up of six fundamental object types:

  • Contents
  • Directory
  • Revision
  • Release
  • Snapshot
  • Origins

Each object is identified by a unique, permanent cryptographic identifier called the Software Hash IDentifier (SWHID), officially ISO/IEC international standard 18670; origins are identified by their URL. A significant feature of this structure is deduplication: if a file appears in 1,000 projects, it’s stored only once, preventing the storage of thousands of copies of the same content. This systematic collection and organization into a “single unified graph with a simple format” effectively performs a massive data cleaning and normalization sweep, which is often “the most annoying, time-consuming, not interesting part” of big data analysis.
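To make the intrinsic-identifier idea concrete: the hash inside a content (cnt) SWHID is computed the same way git hashes a blob, so you can derive a file’s identifier locally with nothing but git (a sketch):

# prints a 40-character SHA-1, computed over "blob <size>\0<contents>"
git hash-object path/to/some/file
# prepend the scheme and object type to form the identifier:
# swh:1:cnt:<the hash printed above>

Because the identifier depends only on the file’s bytes, two identical files anywhere in the archive get the same identifier, which is exactly how deduplication falls out of the design.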

“Nobody ever writes a piece of software from scratch today; we all re(use) directly or indirectly hundreds of components off the shelf,” Roberto Di Cosmo

Beyond traditional databases: Graph power 

Traditional SQL databases are effective for tabular data but are “inherently limited when it comes to querying graph structures or performing ‘transitive closure’,” Di Cosmo notes. SQL struggles with operations like tracing all the content within a directory’s full structure, or following a file back to its original source.

Software Heritage addresses this with a specialized framework called WebGraph, designed for compressing and traversing large graphs. This means the entire graph can fit into under half a terabyte of memory, supporting complex graph traversals that would be unfeasible with SQL.

Applications and insights from the graph 

The power of this graph structure enables various analyses. Here are just a few examples:

  • Tracing Android app origins: SQL can easily find specific files, like an Android manifest. But tracing that file’s entire history or its original source is like following a very intricate family tree. SQL isn’t designed to follow those deep connections, which is why the full graph is needed.
  • Mapping global development activity: By analyzing email addresses linked to commits and their timestamps within the graph, researchers can map the geographical distribution of software development. This global view, which was previously impossible, revealed that over a third of public software originates in the US (see “Geographic diversity in public code contributions: an exploratory large-scale study over 50 years”).
  • Understanding programming language evolution: By identifying the approximate programming language of each file and using commit timestamps from the full graph, Software Heritage can illustrate the rise and fall of languages like Java and the explosive growth of source code modifications over time.
  • Addressing cybersecurity (‘one-day vulnerabilities’): Traditional vulnerability tracking often focuses on individual projects. But the Software Heritage graph provides a global overview of all software, allowing researchers to trace connections across the entire codebase. This global perspective helps them identify, for instance, two million forks potentially vulnerable to ‘one-day vulnerabilities’ (known but unfixed issues)—a level of insight impossible without such a universally connected view.
  • Measuring institutional impact: The graph helps analyze contributions from academic institutions. By linking email addresses in commits to specific universities and traversing the graph, researchers can measure these contributions. For instance, one university alone has contributed almost 30,000 individuals and 2 million commits to public software since 1970.

Future work with AI: Building trust through transparency

Software Heritage champions openness, traceability, and respect for code authorship among the AI researchers and companies training large language models (LLMs) on public source code. At the heart of this effort lies CodeCommons, a new initiative designed to create a shared, trustworthy, and open foundation for AI models built on code.

Three principles underpin this vision:

  • Open foundational models: If a foundational model is trained on public code, it “must be given back,” at least through open or accessible licensing. AI should not privatize the commons that were built openly.
  • Transparency: There must be full transparency regarding which code files from the archive are used in training. Thanks to the Software Heritage archive and its intrinsic identifiers (SWHIDs), each source file can be precisely referenced, ensuring traceable datasets.
  • Opt-out mechanism: Code owners must retain agency. A robust opt-out mechanism should allow maintainers to request that their code be excluded from future model training.

This approach has already borne fruit in collaborative projects like The Stack v2, a public dataset of source code used for training code LLMs, built using GitHub repositories archived in Software Heritage as a core source of truth. Likewise, the StarCoder2 family of models demonstrates how high-performance generative models can be trained transparently on openly documented datasets.

CodeCommons continues this momentum, building the technical infrastructure, metadata, and ethical scaffolding needed to develop AI responsibly. It plans to integrate metadata from multiple sources (e.g., event feeds, research articles, CVEs), ensure deduplication at petabyte scale, and enable detailed attribution of training data. This is essential not only for compliance with emerging regulations like the European Union Artificial Intelligence Act (AI Act) and Cyber Resilience Act, but for maintaining trust between developers, researchers, and the public.

In short, the goal is clear: to enable powerful, open, and accountable AI models without compromising the integrity of our shared digital commons.

Learn more about CodeCommons: https://codecommons.org

Preserving our software commons: A shared responsibility

The mission of Software Heritage is no less than building a “modern Library of Alexandria for software” as a universal infrastructure that serves science, industry, culture, and society at large. This initiative, built with a long-term, nonprofit vision, is already delivering tangible results: enabling software citation, supporting reproducibility and traceability via ISO-standard SWHIDs, enabling AI training with transparency and openness through projects like CodeCommons, and uncovering millions of one-day vulnerabilities across forks to secure the open-source software supply chain.

But the scale of this mission calls for collective effort: if we care about the long-term integrity of science, the transparency of AI, and the resilience of our digital infrastructure, we need to invest in the foundations. Software Heritage is one such foundation – and it’s open to all.

Get involved

Let’s build together the reference infrastructure that our software-powered world deserves.

Check out the full lecture on YouTube.

An engineer’s path from data loss to software preservation
https://www.softwareheritage.org/2025/07/16/simon_delamare_ambassador/
Wed, 16 Jul 2025 02:23:00 +0000

Simon Delamare, our newest Ambassador, champions software sustainability. After experiencing irreversible digital decay, he hopes to ensure that code is never lost again.

When a discontinued forge led to the loss of several research software programs, the message was stark: never again. This irreversible digital decay hit Simon Delamare hard, highlighting a critical threat to scientific progress. Broken links, data loss, and loads of anxiety: Delamare and his coworkers experienced the “hard road to reproducibility.” Later, a FOSDEM 2017 talk about Software Heritage clicked for Delamare: he’d found a solution for software sustainability.

What truly cemented his involvement was a major step forward in 2023: the École Normale Supérieure de Lyon’s open science roadmap, which named Software Heritage as a key infrastructure for preserving and sharing source code. His institution officially acknowledged software as a first-class academic output. Delamare knows this kind of cultural shift needs a big outreach push. That’s why our newest ambassador plans to team up with the academic library to reach a wide range of communities beyond computer science. Delamare is eager to promote the use of Software Heritage in the academic community through presentations and practical tutorials, and by documenting case studies illustrating the possibilities this project offers.

He also views depositing software in HAL as a perfect starting point, as many researchers across various fields already use it to share their outputs. HAL offers a crucial way to track software output, a task that’s challenging without appropriate tooling. While policies like the UNESCO recommendations, the DORA declaration, and the French National Plan for Open Science increasingly recognize software source code as a legitimate research output, its unique complexity, layered dependencies, and dynamic nature pose significant challenges for preservation, curation, attribution, and discoverability.

Delamare is a research engineer at the LIP (Laboratoire de l’informatique du parallélisme), a computer science laboratory based at the École Normale Supérieure de Lyon, where he has worked since 2012. He holds a PhD in computer science in distributed systems and networks from Telecom ParisTech. He contributes to the lab’s research in high-performance and distributed computing and artificial intelligence, including the development of related infrastructure.

He was also involved in Grid’5000, a large-scale and flexible testbed for experiment-driven research in all areas of computer science, with a focus on parallel and distributed computing, including cloud, high-performance computing (HPC), big data, and artificial intelligence. Grid’5000 is the precursor of the Scientific Large Scale Infrastructure for Computing/Communication Experimental Studies – France (Slices-FR), and in 2021 and 2022, Delamare served as its technical director.

Historically, research in parallel and distributed systems long focused predominantly on PC clusters. The term “grid” then gained significant traction, leading to Michel Cosnard’s launch of the ACI Grid initiative in 2001. The vision behind ACI Grid was to create algorithms and software prototypes that would allow the globalization of computing and data resources, making them accessible at regional, national, and even international scales. The first funded projects quickly highlighted a critical need for large-scale experimentation to validate theoretical and technical advancements. In France, the objective was to run experiments across multiple geographically dispersed sites. Yet the available resources at the time presented an insurmountable hurdle: it was impossible to ensure the control, measurement, and reproducibility crucial to the scientific method and familiar to the parallel and distributed computing community. In 2003, Grid’5000 was conceived to overcome this exact challenge.

Now, Delamare contributes to the development of the Slices-FR platform, which is designed to support large-scale, experimental research focused on networking protocols, radio technologies, services, data collection, parallel and distributed computing, and, in particular, cloud and edge-based computing architectures and services. 

Software Heritage Ambassadors are volunteers who offer expert advice in various sectors and languages on how to use our services. Here’s more information on how to book one for a free consultation.

If you’d like to connect with Simon Delamare, please reach out using this link: https://graal.ens-lyon.fr/~sdelamar

We’re also seeking passionate individuals and organizations to volunteer as ambassadors and help grow the Software Heritage community. If you’re interested in becoming an Ambassador, please share a bit about yourself and your connection to the Software Heritage mission.

Share how your code defines our world for UNESCO Exhibit
https://www.softwareheritage.org/2025/07/07/code-exhibit-unesco-cfp/
Mon, 07 Jul 2025 14:03:02 +0000

Tell your code’s story at our UNESCO exhibit. Submission deadline: September 8, 2025.

From the lines that landed Apollo 11 to the algorithms that built the web, source code is the silent architecture of our civilization. It’s more than logic; it’s a profound human artifact, revealing history, culture, and artistry. For the 10th anniversary of Software Heritage in 2026, we’re putting code’s untold stories in the spotlight at the UNESCO Headquarters in Paris.

We need your help to make it happen – we’re looking for contributors to share their own code stories for this exhibit, which will highlight how crucial it is to save our digital history.

What makes this exhibit different? We’re taking an unusual approach: treating source code as the main event, not just background material. It goes back to something wise Harold Abelson said in 1984: “Programs must be written for people to read, and only incidentally for machines to execute.” That idea inspires us, because we see source code as so much more than its technical function. It’s a rich tapestry of meaning—revealing the creator’s intent, reflecting its historical, social, and cultural moment, and even showing off personal style, creativity, or aesthetic choices.

Here’s where you come in: each poster will feature a different contributor, sharing their take on a source code snippet they’ve chosen. We’re looking for all sorts of voices—from computer scientists and humanities scholars (think historians, philosophers, linguists) to activists, artists, and beyond. This variety is key to showing just how many ways source code can be understood and appreciated.

Photo by Hitesh Choudhary on Unsplash

What stories does code tell?

Source code as historical testimony

Think of code as a witness to history. We’re looking for contributions that spotlight code as a historical artifact—whether it’s a monumental piece like the Apollo-11 lunar landing equations or those amazing, grassroots creations found in old hobbyist journals that truly capture the DIY spirit of early computing.

Source code as a mirror of society

Code isn’t created in a bubble. It’s deeply embedded in our social structures, carrying all the biases, norms, and assumptions of its time. We’d love to see how source code reveals the values, politics, and even the inequalities of the societies that produced it.

Source code as a cultural artifact

Just like a great book, a piece of music, or a painting, source code can be appreciated for its sheer expressive and creative power. This theme is all about exploring code as a medium of human expression—diving into its aesthetics, rhythms, structures, and the unique “voice” of its author.

Ready to contribute?

Our exhibition committee (you’ll find them listed below) will review proposals, looking at the originality of your ideas, how they connect with our main themes, and the overall quality of what you share.

This exhibit aims to reflect the diversity that shapes the world of code. So, we especially encourage ideas from anyone in an underrepresented group in tech or culture. We want to showcase the varied individuals and perspectives behind source code, past and present.

Submit your proposal to sourcecode-exhibit@inria.fr by September 8, 2025.

“Binary Art” by Dade Freeman is licensed under CC BY-NC-ND 2.0

What to submit

You’ll start by proposing a title for your submission. Don’t worry, we can always refine it together during the editing phase.

  1. A source code snippet

Submit the source code excerpt you wish to feature in the exhibit. It may be your code or someone else’s, as long as it’s covered by an appropriate license (see license section below).

  • Your source code snippet can be submitted as an image file (PNG or JPG), a PDF file, or a text file. An image or PDF is most appropriate if you wish to preserve the execution environment or the physical medium on which the code was printed: for example, a screenshot of code running on a specific computer or development interface, or a scan of source code printed in a magazine.
  • Alternatively, if your primary interest is in the code itself, it is preferable to submit it as a text file (.txt). In that case, make sure that the code indentation is preserved.

If you’re submitting your source code as an image, please follow these guidelines to ensure adequate image quality:

  • Resolution: Ideally, the image should be 300 DPI (dots per inch) or higher, with a minimum size of about 2300 × 3500 pixels (roughly 20 × 30 cm or 8 × 12 inches at 300 DPI).
  • Format: Use PNG or high-quality JPG (PNG is preferred for text clarity).
  • Compression: Do not compress the image before sending it. Use a file transfer service to preserve quality.
  • Framing: Avoid excessive margins or framing in the image.

If you’re unsure about the quality of your image, reach out. In some cases, we may be able to work with lower-quality images or suggest alternatives.
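If you’d like to check programmatically, here’s a minimal sketch using the Pillow imaging library (the file name is a placeholder; install Pillow first if needed):

    # Quick pre-submission check of an image's pixel size and DPI metadata.
    # Assumes Pillow is installed (pip install Pillow); the file name is a placeholder.
    from PIL import Image

    with Image.open("my_code_snippet.png") as img:
        width, height = img.size
        dpi = img.info.get("dpi")  # not all files carry DPI metadata

    print(f"Dimensions: {width} x {height} pixels, DPI metadata: {dpi}")
    if width >= 2300 and height >= 3500:
        print("Meets the suggested ~2300 x 3500 pixel minimum.")
    else:
        print("Below the suggested minimum; consider rescanning at 300 DPI or higher.")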

  2. A caption

Submit a caption for your source code, indicating its source, author, creation date, and license.

  3. Your code’s story

You’ll also include your take (around 300 words). This is where you explain why your chosen code snippet matters—whether it’s for its technical side, historical impact, social context, aesthetic appeal, or even personal relevance.

Right now, we’re more interested in your ideas than perfect phrasing, so feel free to send in an initial draft. If your proposal is selected, our team will work with you during an editing phase to polish your contribution, making sure it’s clear and engaging for everyone, even those new to programming.

  4. A short biography (max 100 words)
  • A brief paragraph describing your background and your connection to source code in your professional or creative work.

Please note that the final exhibition panel will be presented in both English and French. You may submit your contribution in either language—or both. Our team will handle the translation.

Don’t hesitate to reach out to sourcecode-exhibit@inria.fr if you have any questions. 

License

The source code you submit must meet one of the following conditions:

  • Released under a free/libre open source license (see indicative list below), or
  • Authored by you, or submitted with permission from the original author, who has granted you the right to do so.

Possible licenses for code files

  • MIT License 
  • BSD 2-Clause / 3-Clause  
  • Apache License 2.0 
  • GPL (v2 or v3)
  • LGPL
  • CC0 (Creative Commons Zero) 

Possible licenses for images/PDFs

  • CC BY 
  • CC BY-SA 
  • CC0 
  • CC BY-ND

The final exhibition panels will be produced under a Creative Commons Attribution (CC BY 4.0) license. This means they may be reused to extend the exhibition beyond UNESCO’s premises. If needed, contributors can request a more restrictive license, subject to approval.

Key dates

September 8 — Submission deadline

September 30 — Selection

  • Selected contributors will be notified at the end of September.

Editorial review

  • Our exhibition committee will work with contributors to fine-tune texts and bios, ensuring everything is consistent and clear.

Organizers and exhibition committee

This project kicked off with Software Heritage, a non-profit dedicated to collecting and preserving source code, with support from Inria (the French National Institute for Research in Digital Science and Technology). It’s funded by both Software Heritage and Inria, and proudly backed by the UNESCO Memory of the World Programme.

Exhibit organizers:

Exhibition committee, responsible for reviewing proposals and curating the exhibition, includes:

The full call for contributions

EN: Source Code Exhibit – Call for proposal.pdf
FR: Expo Code Source – Appel-contribution.pdf

Featured image: ASCII clock by Yusuke Endoh. Winner 2020: Most head-turning category, The International Obfuscated C Code Contest, licensed by Landon Curt Noll, CC BY-SA 4.0

The post Share how your code defines our world for UNESCO Exhibit appeared first on Software Heritage.

]]>
Code at the core: Software Heritage at UN Headquarters https://www.softwareheritage.org/2025/07/02/un_open_source_week_2025/ Wed, 02 Jul 2025 14:51:00 +0000 https://www.softwareheritage.org/?p=46272 Software content doubles every 2-3 years. We're capturing this immense growth, securing code that fuels global science & innovation.

The post Code at the core: Software Heritage at UN Headquarters appeared first on Software Heritage.

]]>
Our digital world isn’t just built on code; it is code. Recognizing this fundamental truth, UN Open Source Week 2025 wasn’t just another conference. This global gathering of practitioners convened right in the heart of international diplomacy: the iconic United Nations Headquarters in New York. From June 16-20, beneath the towering, glass-curtained Secretariat Building, leaders, innovators, and policy-makers from around the world met to tackle big questions about open source, digital public infrastructure (DPI), and how technology can be used to solve humanity’s most pressing challenges.

Software Heritage co-founder Roberto Di Cosmo returned to the event for a second year, this time delivering a presentation that opened the session on Navigating Digital Cooperation Across Layers of Governance. (His talk starts around the 2-hour-30-minute mark.) Morane Gruenpeter, Head of Open Science, also participated in events that week; catch the videos from DPI Day for more.

Di Cosmo didn’t just give a talk, however. He offered a clear-eyed look through the “Software Heritage looking glass” at open source as a global undertaking. His message was clear and urgent: preserving software source code isn’t a niche academic pursuit; it’s critical for the future of science and society.

The critical importance of preserving software

Di Cosmo laid out exactly how deeply scientific progress relies on software. He cited eye-opening data from the French Open Science Monitor, which revealed a staggering reliance on open-source code across disciplines. Think about this: a remarkable 61% of fundamental biology papers (out of 15.4k total) and 58% of computer and information sciences papers (out of 4.1k total) explicitly mention using code. Even fields like medical research (32% from 23.5k total), humanities (21% from 4.5k total), and mathematics (23% from 3.1k total) show a significant dependence. This isn’t just a trend; it’s a fundamental shift, proving that collective scientific advancement hinges on having software available and—crucially—kept safe for the long haul. And it’s not just a local phenomenon; a study tracking public code contributions over five decades (1971-2020) shows that code is a truly global collaboration, with contributions pouring in from every corner of the world.

The great library of source code

Then there’s the sheer scale of the problem: software is exploding. Di Cosmo underscored this exponential growth. The Archive shows “revisions” expanding by 27.30% annually and “contents” by an astonishing 40.25% every year. What does that mean in human terms? The amount of software being created doubles roughly every two to three years. Managing this ever-expanding digital universe is why Software Heritage exists. Launched in 2016 to build “The Great Library of Source Code,” its mission is simple but vital: to collect, save, and universally share every piece of software source code, thereby protecting cultural heritage and supercharging both software development and scientific discovery. This isn’t just a giant hard drive; it’s a comprehensive reference point, a rock-solid archive, and a powerful tool for global research. Such an effort is critical because software faces endless threats: corruption, disasters, malicious attacks, obsolescence, accidental deletion, and maddening format issues.
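As a quick back-of-the-envelope check (our own sketch, not a figure from the talk), the doubling time for a quantity growing at an annual rate r is log 2 / log(1 + r):

    import math

    def doubling_time(annual_rate: float) -> float:
        """Years needed for a quantity growing at `annual_rate` per year to double."""
        return math.log(2) / math.log(1 + annual_rate)

    print(f"revisions, +27.30%/yr: {doubling_time(0.2730):.1f} years")  # ~2.9 years
    print(f"contents,  +40.25%/yr: {doubling_time(0.4025):.1f} years")  # ~2.0 years

Plugging in the two growth rates gives roughly 2.9 and 2.0 years, which is exactly the two-to-three-year doubling described above.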

A recognized global infrastructure

Software Heritage isn’t just an idea; it’s a global infrastructure. It operates as a unique, non-profit, open, and shared organization, backed by a powerhouse alliance of sponsors. From tech titans like IBM and Microsoft to leading research institutions and universities, a diverse group is championing its cause. The numbers speak for themselves: the archive is colossal, holding over 24 billion source files, 5 billion commits, and 372 million projects.

Software Heritage Archive at the time of publication, July 2025

It’s the biggest archive of its kind, ever. And its influence is reaching deep into global policy, exemplified by the ISO/IEC 18670:2025 standard for Software Hash Identifier (SWHID), recognized earlier this year, which provides a standardized way to reference software components universally.
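To make that concrete: a SWHID has the form swh:1:&lt;type&gt;:&lt;hash&gt;, computed from the object itself rather than assigned by any registry. For file contents, the hash coincides with Git’s blob hash, so a minimal illustrative sketch (ours, not an official implementation) looks like this:

    import hashlib

    def swhid_for_content(data: bytes) -> str:
        # Content SWHIDs reuse Git's blob hashing: SHA-1 over the
        # header "blob <length>\0" followed by the raw bytes.
        header = f"blob {len(data)}\0".encode()
        return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

    print(swhid_for_content(b"hello\n"))
    # swh:1:cnt:ce013625030ba8dba906f756967f9e9ca394464a

Because the identifier is derived from the bytes themselves, anyone can independently recompute and verify it, which is what makes such references durable.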

The impact ripples further. The French government recently recognized Software Heritage as an essential national initiative. Software Heritage also plays a crucial role in vital areas like cybersecurity and building transparent artificial intelligence systems, with projects like SWHSec (February 2024) and CodeCommons (November 2024) actively contributing to these efforts.

Even the United Nations itself is on board with open source: an analysis shows various UN entities like un.org (320 contributions), worldbank.org (162 contributions), and fao.org (158 contributions) are deeply engaged in open-source projects. UN Open Source Week specifically aims to connect these open-source communities across UN Member States, pushing forward on AI policy, digital governance, and new innovations—areas where Software Heritage’s work fits into the bigger picture.

Code for cooperation

UN Open Source Week 2025 galvanized the push for global digital cooperation. Software Heritage functions as the open code infrastructure foundational to this collective effort. By diligently collecting, preserving, and making accessible the vast, ever-growing ocean of source code, the initiative ensures that the digital foundations of scientific knowledge, cultural history, and future innovations remain intact and open for everyone, for generations to come. It’s a testament to the power of open science and a vivid demonstration of how global teamwork can truly build a digital future that’s more robust, transparent, and resilient.

“There are many things we still need to do,” Di Cosmo said in closing. “In the last year, we spent a lot of time traveling around the globe establishing connections. Every single country confronts the same problems around Open Science, transparency in AI, cybersecurity, reuse of software, knowledge sharing, code sharing. We need to think about how we can work together, not each country reinventing the wheel, but by building a shared infrastructure at the service of everyone.”

The post Code at the core: Software Heritage at UN Headquarters appeared first on Software Heritage.

]]>