SWHID Archives - Software Heritage

Episciences links article code through Software Heritage

Nicole Martinelli — Thu, 21 Aug 2025 14:35:00 +0000

Software Heritage, the universal source code archive, preserves and provides access to source code as vital digital heritage. Researchers can directly link their published scholarly articles to the software that powers them. This new capability enhances research reproducibility and transparency by connecting findings to specific software versions.

Software Heritage partnered with the Center for Direct Scientific Communication (CCSD) to make this happen. The collaboration first enabled software deposits on the open archive HAL in 2018, laying the groundwork for this new capability. Episciences, an overlay journal platform that hosts articles from open repositories like arXiv, Zenodo, and bioRxiv, now builds on that work, allowing authors to link to software archived there. Authors and journals using Episciences can link their articles with supplementary software via Software Heritage, using a SoftWare Hash IDentifier (SWHID) or a HAL-ID.

There are three basic steps:

Submit software to HAL

Depositing software via HAL ensures its sustainable archiving in Software Heritage. The complete deposit procedure is detailed in the HAL documentation: Deposit software source code.

Or, you can deposit software directly into Software Heritage by making a “Save Code Now” request with a GitHub URL. For more details, see this post: Save and reference research software.
Link the software to the Episciences publication by adding a SWHID or a HAL-ID to your publication. For more, check out the Episciences documentation or this YouTube walkthrough.

Building on this ability to link articles and software, Episciences actively works to meet the evolving needs of researchers. Episciences is emerging as a new model in academic publishing, improving the visibility and accessibility of research articles that have already been peer-reviewed and published in conference proceedings. Instead of building a new library (traditional journal), overlay journals act as a highly knowledgeable curator who goes through existing open shelves (repositories), selects the best books, writes introductions for them, and creates a guide (the journal) pointing readers to those excellent, freely available books. This approach allows researchers to submit their conference papers to Episciences for additional scrutiny and broader dissemination, potentially increasing the impact and reach of their work.

“One fundamental aspect of the openness of science is the close link between scientific publications and associated research data. This link is essential for the transparency, reproducibility, and the overall progress of science. Episciences responds to this dynamic by inviting authors to supplement the submission of their document with a link to the dataset and/or software used in their work,” says Agnès Magron of CCSD.

Beyond enabling these connections, Episciences actively contributes to the wider open science movement. The API and connector for Episciences were developed as part of the European Union-funded FAIRCORE4EOSC project. Episciences is also a member of the SCOSS Family. This commitment underscores why enabling this link via Episciences and the Software Heritage integration with HAL is paramount for research reproducibility, transparency, and accountability.

The next step is leveraging the COAR Notify protocol (developed by the Confederation of Open Access Repositories, COAR) to share links between different research object types.

The established partnership with CCSD, the successful HAL integration, and its utilization by Episciences provide a practical way for researchers to ensure their essential software is archived by Software Heritage and discoverable with their publications. Researchers and journals using platforms integrated with open repositories like HAL are encouraged to leverage this capability to link their software to their scholarly articles using Software Heritage. The partnership is about building a clearer, more open scientific story, where the findings and the code that powers them are part of the same picture.

The post Episciences links article code through Software Heritage appeared first on Software Heritage.

Why we need better software identification

Nicole Martinelli — Thu, 31 Jul 2025 14:38:00 +0000

With cybersecurity breaches and new regulations making headlines, software supply chain security is now top of mind for many people. New laws like the European Union’s Cyber Resilience Act (CRA) and recent United States Executive Orders are pushing for more transparency in digital goods.

All this attention means we need a solid, trustworthy way to identify software. Here’s the problem: how we currently name software and point to it in repositories often falls short. These ways can be temporary, vague, or just not secure enough. That leads to messy situations like confusion, name clashes, and outdated links. These aren’t just minor annoyances; they’re open doors for attacks, like “dependency confusion,” where bad actors trick systems into using malicious code. Plus, software bits can just disappear or move, making it impossible to check them later.

Clearly, we need a permanent fix that guarantees we can always find and verify software. This post outlines key information from the preprint paper “Software Identification for Cybersecurity: Survey and Recommendations for Regulators,” authored by Olivier Barais, Roberto Di Cosmo, Ludovic Mé, Stefano Zacchiroli, and Olivier Zendra with support from the SWHSec project.

Existing ID approaches: The good and the bad

There are two main types of software identification:

External IDs: These rely on outside info, like product names, version numbers, or links to package managers.
- Pros: They’re usually easy for humans to read and work with existing lists like the National Vulnerability Database. Some examples: the SWID, Package URL (purl), and SPDXID.
- Cons: Their reliability depends on external lists or naming rules, which can change or even be reused. That causes conflicts and makes them unreliable for security checks.
Internal IDs: These come directly from the software’s actual content, usually using a cryptographic hash (like a digital fingerprint).
- Pros: They offer uniqueness and integrity without relying on a central authority. They’re great for spotting if something’s been tampered with, don’t rely much on outside dependencies, and are difficult to fake with good hashing. Simple SHA256 checksums and Software Hash IDentifiers (SWHIDs) are examples.
- Cons: They’re often not very human-readable, which can make searching or brand recognition tricky.

In the real world, effective software bills of materials (SBOMs) and supply-chain tools generally combine both external references (which help connect with existing databases, vulnerability feeds, or licensing tools) and internal references (for strong integrity checks and guaranteed uniqueness). This means the smart approach is often to publish both—say, a purl or SWID alongside a cryptographic hash or SWHID. That way, you ensure both discoverability and verifiability.
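As a rough illustration of publishing both kinds of identifier side by side, here is a minimal Python sketch. The `ComponentRef` record, its field names, and the `example-lib` package are hypothetical and not taken from any SBOM standard; the only thing borrowed from real formats is the general shape of a purl (`pkg:` prefix) and of a core SWHID.

```python
import re
from dataclasses import dataclass

# Core SWHID shape: swh:1:<cnt|dir|rev|rel|snp>:<40 lowercase hex digits>
SWHID_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}$")


@dataclass(frozen=True)
class ComponentRef:
    """Hypothetical SBOM entry pairing an external and an internal ID."""

    name: str
    purl: str   # external ID: helps match registries and vulnerability feeds
    swhid: str  # internal ID: lets anyone verify the exact source content

    def __post_init__(self):
        if not self.purl.startswith("pkg:"):
            raise ValueError(f"not a Package URL: {self.purl!r}")
        if not SWHID_RE.match(self.swhid):
            raise ValueError(f"not a core SWHID: {self.swhid!r}")


component = ComponentRef(
    name="example-lib",                 # hypothetical package
    purl="pkg:pypi/example-lib@1.0.0",  # hypothetical purl
    swhid="swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505",
)
print(component)
```

The purl answers "which package is this?" for tools that speak registry names, while the SWHID answers "is this exactly the code I expected?" without trusting any registry.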


Inside the SWHID

SWHIDs are based on content, they’re permanent, and they can’t be tampered with easily. In 2025, they became an international standard (ISO/IEC 18670), making them globally recognized.

SWHIDs essentially package up both the data and its context using a clever Merkle DAG structure. This means each ID is directly tied to the exact piece of software it refers to.
They follow a simple pattern:
swh:<schema_version>:<object_type>:<object_id>

Key types include:

Content (cnt): Identifies a single file based on its raw contents:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
Directory (dir): Points to a directory’s layout and what’s inside it, including IDs of its contents:
swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505
Revision (rev): Like a “commit” in version control, holding details like who did it, when, and the message:
swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d
Release (rel): Similar to a “tag,” pointing to a specific revision and maybe including a version name or signature:
swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f
Snapshot (snp): Captures everything in a whole version control system (all branches) at one specific moment:
swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453

SWHIDs also allow for optional qualifiers to add more context. You can specify:

Lines qualifier (lines=…): To point to specific lines in a file:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=112-116
Origin qualifier (origin=…): To say where the software was first seen:
swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d;origin=https://github.com/example/repo
Path, anchor, and context qualifiers. These help pinpoint subdirectories, specific parts, or other key info for super-precise references:
swh:1:dir:d198bc9d…;path=/docs;anchor=readme-section

This way, SWHIDs combine the best of both internal and external identification methods into one stable system.
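To make the pattern concrete, here is a small Python sketch that splits a SWHID into its core parts and its qualifiers. It is a simplified reading of the format described above, not the reference parser: the `ParsedSWHID` class and `parse_swhid` function are names made up for this example, and it assumes schema version 1 with qualifiers separated by semicolons.

```python
from dataclasses import dataclass, field
from urllib.parse import unquote

OBJECT_TYPES = {"cnt", "dir", "rev", "rel", "snp"}
HEX_DIGITS = set("0123456789abcdef")


@dataclass
class ParsedSWHID:
    schema_version: int
    object_type: str
    object_id: str
    qualifiers: dict = field(default_factory=dict)


def parse_swhid(text: str) -> ParsedSWHID:
    """Split 'swh:1:<type>:<hash>[;key=value...]' into its parts."""
    # Everything before the first ';' is the core identifier.
    core, *raw_qualifiers = text.strip().split(";")
    parts = core.split(":")
    if len(parts) != 4 or parts[0] != "swh":
        raise ValueError(f"not a SWHID: {text!r}")
    _, version, object_type, object_id = parts
    if version != "1":
        raise ValueError(f"unsupported schema version: {version!r}")
    if object_type not in OBJECT_TYPES:
        raise ValueError(f"unknown object type: {object_type!r}")
    if len(object_id) != 40 or not set(object_id) <= HEX_DIGITS:
        raise ValueError(f"expected 40 lowercase hex digits: {object_id!r}")

    qualifiers = {}
    for item in raw_qualifiers:
        key, sep, value = item.partition("=")
        if not key or not sep:
            raise ValueError(f"malformed qualifier: {item!r}")
        qualifiers[key] = unquote(value)  # qualifier values may be percent-encoded
    return ParsedSWHID(int(version), object_type, object_id, qualifiers)


parsed = parse_swhid(
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=112-116"
)
print(parsed.object_type, parsed.qualifiers)
```

Note that the core identifier alone is enough to verify content; the qualifiers only add context about where and how it was found.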

SWHIDs + The Software Heritage Archive

SWHIDs get even more robust when you link them with the Software Heritage Archive. Software Heritage is a non-profit project that saves publicly available source code and its entire history, and once code is in, it’s never deleted. It’s the biggest public archive of source code, with 400 million projects, over 25 billion unique source code files, and more than five billion unique commits. The archive stores everything in a cryptographically secure way, which saves space by deduplicating content and ensures that everything is truly what it claims to be.

The combination of SWHIDs and the Software Heritage archive offers real advantages for meeting today’s legal requirements:

Guaranteed integrity: If the code changes even a little, the SWHID changes. This makes tampering immediately detectable.
Always there: SWHIDs don’t rely on outside services or websites, so they stay valid no matter where the code is hosted or if the original platform goes down. This solves the problem of code just vanishing.
Trackable history: SWHIDs identify parts of the Software Heritage structure, letting you trace a project’s development history, see where code came from, and check how different parts are related. Those extra qualifiers let you even track tiny code snippets.
Plays nice with rules: This combined approach directly helps meet the strict requirements for Software Bills of Materials (SBOMs), open-source security, and vulnerability management that the CRA and US Executive Orders demand.
Works everywhere: SWHIDs work consistently across all sorts of version control systems and software ecosystems.
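The integrity guarantee in the list above is easy to see in a few lines of Python. This sketch assumes the version 1 content identifier, which is computed like a Git blob hash (SHA-1 over a short header plus the raw bytes); the `content_swhid` and `matches` helpers are names invented for the example.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Version 1 content ID: SHA-1 over a Git-style blob header plus the bytes."""
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def matches(data: bytes, expected_swhid: str) -> bool:
    """True if the bytes in hand are exactly what the identifier refers to."""
    return content_swhid(data) == expected_swhid


original = b"hello world\n"
tampered = b"hello world!\n"  # a single extra character

recorded = content_swhid(original)  # what you would store in an SBOM
print(recorded)
print(matches(original, recorded))  # the untouched file still verifies
print(matches(tampered, recorded))  # the modified file does not
```

Because the check needs nothing but the bytes and the identifier, anyone can run it offline, with no registry or central authority involved.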

The authors recommend SWHIDs, paired with the Software Heritage Archive, as the standard approach for referencing software, especially concerning the CRA and relevant US Executive Orders.
Here are some specifics for stakeholders:

Policy makers: Should mention SWHIDs (ISO/IEC 18670) in their rules and encourage their use in government purchases and funding programs.
Software companies: Should start making SWHIDs a part of their development process (CI/CD pipelines) to get stable IDs for their releases and patches.
Open source communities: Should publish official releases with their SWHIDs, ensure their code and history are archived by Software Heritage, and adopt best practices for referencing any outside software they use via SWHIDs in docs and SBOMs.

To wrap it up, using content-based, permanent software identifiers—specifically SWHIDs linked with the Software Heritage archive—is a strong and reliable answer to today’s cybersecurity and regulatory challenges. This approach builds trust and transparency, keeps us aligned with regulations, and even helps with innovation and saving money by simplifying compliance checks and cutting down on supply chain risks.

For more details and recommendations for implementation, check out the paper (preprint).

See our Publications section for more research from the Software Heritage Archive.


Using the SoftWare Hash Identifier (SWHID): A tutorial

Nicole Martinelli — Fri, 13 Jun 2025 09:25:00 +0000

Software identification is crucial for ensuring the long-term traceability of scholarly outputs. However, identifying software can be complex, resembling an investigation requiring tailored solutions. The Software Hash Identifier (SWHID) is an intrinsic identifier designed for software, acting like a unique fingerprint or DNA sequence intrinsically bound to the software’s content. It complements extrinsic identifiers like DOIs, which typically identify metadata records or broader projects. The SWHID provides actionable solutions for researchers, repository managers, and others involved in the scholarly ecosystem.

This tutorial provides a guide for research support staff, designed to answer the question: “What does an end-user from my institution need to understand about software identification?”

We’ll explain why common identifiers like DOIs aren’t always sufficient for software, highlighting the specific concerns of unique software identification. Most importantly, we’ll introduce a straightforward, “plug-and-play” solution that your community can use, emphasizing the crucial role you’ll play in helping them implement it. This post derives from a two-hour live session by the Software Heritage Open Science team, Morane Gruenpeter and Sabrina Granger, as part of the FAIR implementation workshops. The slides are also available.

Understand what SWHID Identifies

SWHIDs identify specific software artifacts at different levels of granularity. They identify the source code content itself, rather than the project or its metadata. The different types of objects identifiable by a SWHID include:

CNT (Content): Identifies the content of a single file.
DIR (Directory): Identifies a directory, including its contents and the names of the files within it. This SWHID type is recommended for academic use – it’s self-contained and doesn’t depend on external services like Software Heritage to work.
REV (Revision): Identifies a commit in a development history sequence.
REL (Release): Identifies a tagged release, similar to a revision but specifically marked as a release.
SNP (Snapshot): Identifies a point in time, recording all entry points (like branches and releases) found in a software origin and where they pointed at that time.

These intrinsic identifiers correspond to granularity levels from the bottom of the software identification pyramid (Level 10: Code Fragment, Level 9: File, Level 8: Directory, Level 7: Commit, Level 6: Release, Level 5: Snapshot), where the number of items increases as you go down the pyramid.

How to generate a SWHID

A key feature of SWHID is that any end-user can generate one. You do not need an account on Software Heritage or need to be the software author. SWHIDs are free. For digital resources that are frequently created or modified, especially in large volumes, charging a per-identifier fee just doesn’t work.

You can find the SWHID for software artifacts already archived in Software Heritage in the permalinks box on the artifact’s page.
You can also compute a SWHID locally on your own machine using a command-line tool. For the same content, the SWHID computed locally will be the same as the one computed by Software Heritage, as long as the computational method (schema version) is the same.
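To give a feel for what local computation involves, here is a simplified Python sketch for a directory identifier. It assumes the version 1 scheme, where directories are hashed like Git trees, and it only handles regular, non-executable files and nested directories held in memory (no symlinks, no executable bits). The function names are made up for this example; for real work, use a dedicated SWHID tool rather than this sketch.

```python
import hashlib


def _git_hash(kind: bytes, body: bytes) -> bytes:
    """SHA-1 over '<kind> <length>\\0' followed by the body, as Git does."""
    return hashlib.sha1(kind + b" %d\0" % len(body) + body).digest()


def _tree_digest(tree: dict) -> bytes:
    """Hash a {name: bytes | dict} mapping as a Git-style tree object."""
    entries = []
    for name, node in tree.items():
        raw_name = name.encode("utf-8")
        if isinstance(node, dict):
            # Subdirectories sort as if their name ended with '/'.
            entries.append((raw_name + b"/", b"40000", raw_name, _tree_digest(node)))
        else:
            entries.append((raw_name, b"100644", raw_name, _git_hash(b"blob", node)))
    entries.sort(key=lambda entry: entry[0])
    body = b"".join(
        mode + b" " + raw_name + b"\0" + digest
        for _, mode, raw_name, digest in entries
    )
    return _git_hash(b"tree", body)


def directory_swhid(tree: dict) -> str:
    return "swh:1:dir:" + _tree_digest(tree).hex()


project = {
    "README.md": b"# Example\n",
    "src": {"main.py": b"print('hello')\n"},
}
print(directory_swhid(project))
```

Each directory's hash is built from the hashes of its entries, so a change to any file, or even a rename, changes the identifier of every directory above it.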

Deconstruct the SWHID structure

A SWHID is a structured identifier with several parts:

Prefix: Always starts with swh.
Schema Version: Indicates the hash computation method used (currently 1 for SHA-1). This can evolve if needed, with older hashes remaining valid.
Object Type: Indicates the type of software artifact being identified (cnt, dir, rev, rel, or snp).
Hash: The hash value computed for the specific content or object.
Context Parameters (Optional): Provide additional information about where or when the artifact was found or its position within a larger structure. These parameters can include:
- Origin: The URL from which the software originated (e.g., a GitHub or GitLab repository). This parameter differentiates SWHIDs for identical content found in different locations.
- Visit: For artifacts lower in the graph (Content, Directory, Revision, Release), this refers to the snapshot in which the artifact was seen.
- Anchor: For artifacts lower than a snapshot, this is a Revision item from the graph that provides a specific point of reference.
- Path: The path to the artifact within a directory or revision.
- Lines: For content fragments, specifies the lines of code being identified.

Context parameters explain variations in seemingly identical SWHIDs: the core content hash is the same, but the context (e.g., path, origin) differs.

How to use SWHIDs

SWHIDs have several important use cases, primarily related to referencing, reproducibility, and citation of software source code:

Referencing specific code: SWHIDs allow you to point directly to specific versions or parts of software code (files, directories, revisions, etc.). This is different from DOIs, which often point to a metadata record about the software.
Ensuring reproducibility: Because SWHIDs are based on the intrinsic content, they enable reproducibility. If you have the SWHID, you can potentially regenerate or verify the exact content it refers to, even if the original infrastructure where it was found is no longer available.
Citing software: SWHIDs are designed to be used in software citations. The recommended way to facilitate this is to include metadata files like codemeta.json or CITATION.cff alongside your code. Software Heritage can use these files to generate a citation that includes the SWHID of the corresponding artifact (e.g., the directory SWHID is often recommended for academia).
IMPORTANT CITATION RULE: Never include the SWHID itself within the source code files. Adding the SWHID changes the file contents, resulting in a new SWHID for the changed file, which breaks the link to the original content. Instead, include metadata files that allow platforms to generate citations, including the SWHID.
Resolving SWHIDs: SWHIDs can be resolved to access the corresponding software artifact, for example, on the Software Heritage archive (archive.softwareheritage.org) or its operational mirror networks.
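For the resolving use case, a short Python sketch is enough to turn an identifier into a link. It assumes that the archive at https://archive.softwareheritage.org/ resolves a SWHID appended directly to its base URL; the `resolver_url` helper is a name invented for this example, and no network request is made.

```python
ARCHIVE_BASE = "https://archive.softwareheritage.org/"  # assumed resolver base URL


def resolver_url(swhid: str) -> str:
    """Build a browsable link for a SWHID, with or without qualifiers."""
    if not swhid.startswith("swh:1:"):
        raise ValueError(f"not a version 1 SWHID: {swhid!r}")
    return ARCHIVE_BASE + swhid


print(resolver_url("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=112-116"))
```

A link built this way can go in a paper, a README, or a citation, while the SWHID itself stays out of the source files it identifies.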

What the SWHID is not for

Data Sets: SWHIDs are designed specifically for software source code. While data might be stored alongside code in repositories and thus archived by Software Heritage, SWHIDs are not the recommended identifier for data sets. Other identifier types are more appropriate for data.
AI-Generated Code: Currently, SWHIDs cannot distinguish code generated by AI tools from human-generated code, nor do they provide functionality to specifically track the origin of AI-generated code.

By understanding these steps, you can leverage SWHIDs for robust and reproducible identification, referencing, and citation of software artifacts.

A toolbox

For further info:
https://www.softwareheritage.org/faq/#3_Referencing_and_identification
https://www.softwareheritage.org/how-to-archive-reference-code
https://www.softwareheritage.org/software-hash-identifier-swhid
https://www.swhid.org


How Software Heritage ensures reliable Guix deployment

Nicole Martinelli — Wed, 21 May 2025 14:12:40 +0000

Reproducibility in research is a growing challenge. After researchers publish their findings, the underlying software — the source code that drives their results — often disappears or becomes unusable over time. This makes verifying and building upon past work incredibly difficult.

For Simon Tournier, a research engineer at Université Paris Cité, the complexity of bioinformatics software was a constant battle. As a member of the Saint Louis Core Facilities team, he deals with various specialized workflows. Struggling to manage the intricate web of dependencies, from laptop to cluster, Tournier discovered GNU Guix, which helps in composing these computational environments. Now, he considers Guix indispensable for reproducible deployments, a critical component of reproducible research and Open Science.

His talk at FOSDEM offered a practical perspective, exploring five years of work on this critical problem. It’s the output of a joint effort between Guix contributors and the Software Heritage team. (He kicked off the talk by thanking Antoine R. Dumont, Antoine Eiche, Antoine Lambert, Ludovic Courtès, Stefano Zacchiroli, and Timothy Sample.)

The dual nature of software creates a challenge: humans read source code, while machines execute binaries. The tricky part is how binaries are produced from source code. For example, compilers or interpreters transform source code into something that is hard to decipher just by looking at the binary. Since reproducible research is built on full transparency, reproducible deployments require both the source code and the transformation. Software Heritage provides the means to audit and verify original source code, and Guix offers the capability to audit and verify the process of transforming that code into binaries.

Credit: Simon Tournier

The challenge of reproducibility

Imagine this: Alice publishes her research in 2022, sharing her code, made with version 0.9 of the software. A few years later, Blake tries to reproduce Alice’s results. Version 1.2 is readily available, but the outcomes differ. Even after finding and installing version 0.9, when Blake tries to recreate the original results, things don’t match up. Why? Because Blake doesn’t fully control all the variables in the original computing environment. To address this, it’s crucial to identify these variables: the specific code used, the tools required to build the source code, the tools needed to run the compiled binary, and all the dependencies; i.e., there are four identifications for each tool of the workflow and recursively for each dependency.

Package managers and Guix

The problem isn’t new; that’s the job of package managers, e.g., Conda, APT, Brew, etc. Deploying a computational environment essentially involves deploying a dependency graph, and package managers are the tools that manage this graph. Each node of the graph represents source code and all the various parameters to configure, compile, and build the source code, while edges describe the dependencies. For instance, installing a package like “harmony” for processing biological sequencing data using the R language can involve a graph of approximately 500 nodes.

Credit: Simon Tournier

Guix can be seen as a package manager. The main difference between Guix, which has a software deployment model based on the same principles as Nix, and the other package managers is how this graph is described and how the graph of dependencies is handled. Contrary to other package managers, where the graph is dynamically resolved from the constraints of the version specifications, Guix provides revisions that specify one unambiguous graph. A Guix revision pins a complete collection of packages and Guix itself.

When Alice says “I used this tool at version 0.9”, it implicitly means Alice used a whole specific graph of dependencies. That means that Blake needs this same dependency graph to reproduce, audit, or verify the environment. Using Guix, Alice provides the Guix revision (e.g., EB34FF1), which encapsulates the tool at version 0.9 and all the dependencies, including the options for configuring, compiling, and building. Now Blake can use this revision to redeploy the same computational environment, bit for bit, ensuring transparency and verifiability.

The deployment model works under two assumptions: deterministic builds and publicly available source code. First, every node of the graph must build deterministically. That’s more challenging than it might appear at first, and we are very grateful for the effort led by the Reproducible Builds project. Second, to reproduce or audit a deployment, all necessary source code (remember, e.g., 500 packages) must be accessible.

Enter Software Heritage

Source code disappearing from the internet is a real problem. This link rot is a significant challenge, which is why the first mission of Software Heritage is to collect and preserve source code. Today, it’s the largest publicly available archive of software source code.

A substantial percentage of source code is missing from its original location. Taking Guix as an example, by 2024 around 3.6 percent of the source code from 2022 was already missing, and the situation gets worse further back in time: about 8 percent of source code packaged by Guix five years ago is now unreachable from its original location. Moreover, the loss of a single package can cascade through dependencies: the disappearance of one package, like OpenJDK, can result in the loss of hundreds of dependent packages.
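The cascading effect can be illustrated with a toy dependency graph in Python. The package names and edges below are invented for the example (only OpenJDK echoes the text), and `affected_by_loss` is a hypothetical helper, not part of Guix.

```python
from collections import deque


def affected_by_loss(depends_on: dict, lost: str) -> set:
    """Return every package that directly or transitively depends on `lost`."""
    # Invert the edges: for each package, who depends on it?
    dependents = {}
    for package, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(package)

    affected = set()
    queue = deque([lost])
    while queue:
        current = queue.popleft()
        for package in dependents.get(current, ()):
            if package not in affected:
                affected.add(package)
                queue.append(package)
    return affected


# Toy graph: each package maps to the packages it needs to build.
graph = {
    "app-a": {"lib-x"},
    "app-b": {"lib-y"},
    "lib-x": {"openjdk"},
    "lib-y": {"zlib"},
    "openjdk": set(),
    "zlib": set(),
}
print(sorted(affected_by_loss(graph, "openjdk")))  # ['app-a', 'lib-x']
```

In a real graph of several hundred nodes, a package near the bottom can sit under a large share of everything else, which is why a single missing source can block so many rebuilds.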

“Science is building on sand. Research projects are created, papers are published, and the source code is just disappearing from the internet; that’s why we need Software Heritage,”
Simon Tournier

A story about content-addressed identifiers

Software Heritage uses intrinsic identifiers, much like checksums, to identify source code. This method guarantees that the identifier points directly to the content, allowing for precise referencing of individual files, snapshots, releases, revisions, directories, and other elements. The Software Hash IDentifier (SWHID), a universal content-addressed identifier for software pioneered by Software Heritage, is now officially the international standard ISO/IEC 18670, with specifications available at swhid.org.

On the other hand, Guix also employs an intrinsic identifier to identify source code; the format is Normalized ARchive (NAR), inherited from Nix. Once packaged by Guix, the source code is “essentially” content-addressed. Specifically, if a source code location becomes stale, e.g., the URL is no longer available or does not serve the exact same source code, then the user can manually provide the new URL, and Guix proceeds. Or Guix can automatically check and download a copy if one is available from alternative locations like those provided by the Guix project, the Nix project, or Software Heritage. The automation relies on a bridge provided by Software Heritage between the different content-addressed identifiers: SWHID for one, NAR for the other.

Hold on, a lot of packages depend on compressed tarballs, and Software Heritage archives software as source code, doesn’t it? Indeed, compressed tarballs pose a unique challenge because, in addition to source code, they require metadata to be rebuilt bit-for-bit. For example, different compression levels produce different checksums, which undermines a content-addressed system.

This is where the Disarchive tool comes in. Timothy Sample designed it to disassemble the compressed tarball and extract all the metadata, such as compression level, timestamps, etc. These metadata are stored in a dedicated Disarchive database hosted by the Guix project, while the data itself (plain source code) is archived in Software Heritage. On request, Disarchive can assemble the metadata and data and then output the bit-identical compressed tarball.
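The tarball problem is easy to reproduce with Python's standard gzip module. This sketch compresses the same bytes with different metadata (the timestamp stored in the gzip header, and the compression level) and compares checksums. It illustrates the problem only; it does not show how Disarchive itself works.

```python
import gzip
import hashlib

source = b"int main(void) { return 0; }\n" * 100  # stand-in for source code


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Same data, same level, different timestamp recorded in the gzip header.
archive_a = gzip.compress(source, compresslevel=9, mtime=0)
archive_b = gzip.compress(source, compresslevel=9, mtime=1)
# Same data, same timestamp, different compression level.
archive_c = gzip.compress(source, compresslevel=1, mtime=0)

print(sha256(archive_a) == sha256(archive_b))  # False: the timestamps differ
print(sha256(archive_a) == sha256(archive_c))  # typically False as well
print(gzip.decompress(archive_a) == gzip.decompress(archive_b) == source)  # True
```

The decompressed content is identical in every case, which is why storing the plain source once and the compression metadata separately is enough to rebuild the original archive.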

The architecture is twofold. On one hand, the Guix project feeds Software Heritage and the Disarchive database. The source code origins of all the packages in the latest Guix revision are continuously listed and provided to Software Heritage, which ingests and archives them. At the same time, all the compressed tarballs are disassembled, and the metadata is saved. On the other hand, when Guix attempts to rebuild a computational environment with missing source code, the success of the rebuild hinges on how Guix originally packaged that code.

When the Guix package relies on the version-control-system (VCS) source code, Guix queries Software Heritage using the NAR identifier and gets back an SWHID. Then Guix asks the Software Heritage Vault to “cook” the files to fetch them. When a Guix package relies on a compressed tarball, Guix proceeds as follows: it queries the Disarchive database with the NAR identifier to obtain the SWHID, then requests the data from the Vault, and finally assembles a bit-identical compressed tarball using Disarchive metadata.
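The VCS case described above can be sketched as a simple fallback chain in Python. Everything here is hypothetical: `recover_source` and its three callables stand in for the real network services, and plain dictionaries play the roles of the original host, the NAR-to-SWHID bridge, and the Vault. It shows the order of the steps, not Guix's actual implementation.

```python
from typing import Callable, Optional, Tuple

Lookup = Callable[[str], Optional[bytes]]


def recover_source(
    origin_url: str,
    nar_hash: str,
    fetch_origin: Lookup,
    nar_to_swhid: Callable[[str], Optional[str]],
    cook_from_vault: Lookup,
) -> Tuple[bytes, str]:
    """Try the original location first, then fall back to the archive."""
    data = fetch_origin(origin_url)
    if data is not None:
        return data, "origin"

    # The origin is stale: translate the NAR identifier into a SWHID ...
    swhid = nar_to_swhid(nar_hash)
    if swhid is None:
        raise LookupError(f"no SWHID known for {nar_hash!r}")

    # ... and ask the archive to assemble ("cook") the corresponding files.
    data = cook_from_vault(swhid)
    if data is None:
        raise LookupError(f"{swhid} is not available from the archive")
    return data, "software-heritage"


# Dictionary-backed stand-ins for the three services (all values invented).
origins = {}  # the original URL no longer serves anything
bridge = {"nar-hash-of-tool-0.9": "swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505"}
vault = {"swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505": b"archived source tree"}

data, where = recover_source(
    "https://example.org/tool-0.9.git",
    "nar-hash-of-tool-0.9",
    origins.get,
    bridge.get,
    vault.get,
)
print(where)
```

The tarball case adds two things to this chain: the SWHID lookup goes through the Disarchive database, and the recovered files are reassembled with the saved metadata into the original compressed tarball.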

Credit: Simon Tournier

Connecting Guix with Software Heritage makes Guix the first free software distribution and tool backed by a stable source code archive.

What’s next

Work is still ongoing. On the Guix side, the machinery that exploits the Software Heritage fallback needs improvement. For instance, running a 2019 Guix revision now triggers the Software Heritage recovery mechanism as it was in 2019, in Guix’s early days. Although the recovery mechanism is continuously improving, it would be ideal if past revisions relied on current techniques for source code recovery. A further step towards improvement is ensuring complete source code coverage for all Guix revisions. However, this isn’t currently achieved, as Disarchive’s support for archive formats like lzip and zip needs additional development.

On the Software Heritage side, Guix provides various test-cases to challenge the “cooking” system of the Vault. Another direction is about the Disarchive database. The recovery architecture depends on the availability of this database, and today it’s only backed up by the Guix project. Incorporating this database into the larger Software Heritage framework would make the entire system more robust.

The integration of Guix and Software Heritage paves the way for more transparency and verification of the whole computational environment involved in scientific research. Scientific production should be robust to external service failures; for example, being able to audit or reuse scientific findings should not depend on the availability of platforms hosting source code. That’s why backing package managers with Software Heritage appears vitally important.

Check out the full 23-minute talk, download the slides, and read the paper “Source Code Archiving to the Rescue of Reproducible Deployment.”

Many thanks to Simon Tournier for reviewing and contributing to this post.


ISO Standard for SWHIDs: Robust software identification

Nicole Martinelli — Wed, 14 May 2025 14:19:00 +0000

Imagine trying to track the evolution of a complex machine where every part could be subtly altered without a trace. This is the challenge we face in the digital realm, especially with software, the invisible engine powering our modern world. Ensuring the integrity and long-term traceability of these digital artifacts is paramount, and that’s where the significance of international standards like ISO comes into play, particularly for innovations like the SoftWare Hash IDentifier (SWHID).

Achieving an ISO standard for the SWHID (ISO/IEC 18670 or free public specification) marks a major milestone in establishing a globally recognized framework for uniquely and permanently identifying software. This isn’t just about assigning names. It’s about creating a robust system, based on cryptographic principles, to ensure that a specific piece of software – be it a single file, a directory structure, a particular release, or even the complete state of a version control repository – can be definitively identified and its integrity verified, regardless of where it’s stored or who is accessing it.

Why is ISO important for software?

Getting an ISO standard for the SWHID matters because it brings several crucial benefits to the world of software:

Global recognition and trust: An ISO standard means the SWHID specification is recognized and respected internationally. This fosters trust among developers, researchers, legal teams, and anyone who relies on software. It signals that SWHID is not just a niche idea but a robust and well-vetted approach.
Interoperability and compatibility: ISO standards promote consistency. With a standardized way to identify software artifacts, different tools, platforms, and organizations can work together. Imagine different libraries using the same cataloging system – it makes finding and using information much easier. The ISO standard for SWHID helps different systems understand and utilize these unique identifiers.
Long-term preservation and reliability: Software is constantly evolving, but its history is important. Cryptographic hashes enable SWHIDs to permanently and reliably identify specific software versions. An ISO standard reinforces the long-term viability and trustworthiness of this identification method, making it crucial for archival and referencing over time.
Increased adoption and support: An ISO stamp of approval can encourage wider adoption of SWHID by tools, platforms, and research initiatives. It provides a solid foundation for building infrastructure and services around these identifiers.


Okay, so ISO is good. But what exactly is this SWHID thing?

Think of a SWHID as a digital fingerprint for a piece of software (or even parts of it, like a specific file or directory). It’s a unique and unchangeable identifier that’s calculated directly from the software’s content.

Here’s a breakdown:

It’s like a super-secure barcode for software: Just like a barcode uniquely identifies a product, a SWHID uniquely identifies a specific version of a software artifact.
It understands how software is built: SWHID isn’t just for individual files. It understands the structure of modern software development, including folders, different versions (like revisions and releases), and even the entire history tracked by systems like Git.
It’s based on strong cryptography: This means that if even a tiny little bit of the software changes, the SWHID will be completely different. This makes it reliable for verifying if a piece of software is exactly what it claims to be.
It is forward-looking: the SWHID standard incorporates a version number and can be easily updated with stronger cryptographic algorithms in the future if needed.
Anyone can calculate and verify it: You don’t need to ask a central authority to get or check a SWHID. If you have the software, you can calculate its SWHID yourself and be confident that the corresponding source code hasn’t been tampered with. This decentralized nature is a powerful feature for trust and transparency.
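
To make the "anyone can calculate it" point concrete, here is a minimal sketch of computing the SWHID of a single file's content using only the Python standard library. It assumes the content identifier follows the same recipe as a Git blob hash (SHA-1 over a `blob <length>\0` header plus the raw bytes); the function name `swhid_for_content` and the sample byte strings are illustrative, not part of any official tool. Directories, revisions, releases, and snapshots use more involved recipes that are not shown here.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Return the content SWHID (swh:1:cnt:...) for raw file bytes.

    Hypothetical helper for illustration only.
    """
    # Same recipe as a Git blob: SHA-1 over "blob <length>\0" + raw bytes.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


original = b"hello world\n"
tampered = b"hello world!\n"  # a single added character

print(swhid_for_content(original))
print(swhid_for_content(tampered))
# Any change to the bytes yields a different identifier.
print(swhid_for_content(original) == swhid_for_content(tampered))
```

Running the same function on a file you downloaded and comparing the result with a published SWHID is enough to check that the bytes match, with no central authority involved.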

Why should you care about the SWHID?

Even if you’re not a software developer, SWHID has implications for you:

Trusting the software you use: Imagine being able to trust with a high degree of confidence that the software you’re downloading is the version the developers intended, without any hidden changes. SWHID makes this possible.
Reliable research and citation: For researchers who rely on software, SWHID provides a stable and verifiable way to cite specific versions of code, ensuring reproducibility of results.
Long-term access to digital heritage: Software is becoming crucial to our cultural and scientific heritage. SWHID helps ensure these digital artifacts can be reliably identified and preserved for future generations.

What’s next

Getting ISO recognition for the SoftWare Hash IDentifier (SWHID) is a significant step forward for the software world. It brings the weight of international standards to a powerful technology that provides a reliable and transparent way to identify software. Just like standardized ISBNs for books or DOIs for research papers, SWHIDs, backed by ISO, can help bring order, trust, and long-term reliability to the vast and ever-evolving software landscape. And that’s something we can all benefit from.

The post ISO Standard for SWHIDs: Robust software identification appeared first on Software Heritage.

Why software development history is worth preserving

Nicole Martinelli — Tue, 25 Mar 2025 14:29:00 +0000

By Alex Khrustalev, Software Heritage Ambassador

We know a lot about ancient Mesopotamia. Some events can be dated precisely to the year despite having occurred thousands of years ago. Why? The Sumerians, the main inhabitants of the area, wrote a lot, and luckily for us, they wrote on clay tablets. A clay tablet is a very durable medium, much more durable than a papyrus. While papyrus quickly turns to ashes in a fire, clay tablets instead become even more resilient when exposed to heat and survive for thousands of years.

Of course, we no longer use clay tablets; they’re not exactly practical. But we can learn a lesson from the Sumerians: preserving information is just as crucial as producing it. The way we store data matters. The medium has evolved: we are moving away from paper and other analog formats toward digital storage. This shift reduces some of the risks attached to earlier forms of storage but presents a set of new challenges. Fires are no longer such a significant risk, because copying digital media is far cheaper and easier than copying a book. But there is another risk: what if someone tampers with the data? How do we ensure its integrity?

The mission of Software Heritage is to solve these kinds of problems related to preserving software source code. If you’re not a developer, you might not be familiar with how software is created, stored, and shared. Let me break it down for you.

Software source code consists of folders containing files, each written according to the rules of the programming language used. Software developers can modify the source code by creating new files, updating content, reorganizing folders, or deleting unnecessary files. That means source code is always evolving; it’s never in a static state.

Each software developer working on a project has a local copy of the entire source code and makes changes independently. Here’s the problem: How do you combine all of the local changes from different developers and apply them to a final version? For that, a special kind of software was created, called a version control system (VCS). When source code is placed in a VCS, it becomes a repository. In modern VCS tools such as Git (the most widely used), there’s usually a main repository (often called “upstream”) that serves as the central source of truth. A local copy of the same repository is called “downstream.” The upstream repository is often hosted on a platform such as GitHub or GitLab. When a developer changes a file or a set of files, they record this change with the VCS, a process known as making a commit. The commit is then pushed from downstream to upstream, making the changes available to other developers.

Developers find this incredibly convenient for reviewing changes over time. By examining a commit, I can understand which changes were made and, most importantly, who made them, so I can reach out to the author if I have any questions or need clarification about the changes.

This is all cool, but it works only within a specific repository. What happens if that repository is deleted? This is a pretty common problem. The most notable incident, which shook the industry, was the removal of the “left-pad” package from npm. Although npm Inc. quickly restored the package, its removal caused significant disruptions in service, affecting large companies like Facebook and Netflix.

Software Heritage is developing a system to address this. Going back to the concept of a commit, each commit has a unique identifier that developers can use to reference a specific point in a repository’s history. Software Heritage has its own unique identifier, called the SoftWare Hash IDentifier (SWHID), which goes beyond commits. SWHIDs can identify a wide variety of software artifacts: files, directories, revisions (aka commits), and more. Unlike the commit identifiers of a traditional version control system, SWHIDs are not tied to a specific repository. Because the archive collects repositories from multiple sources (GitHub, GitLab, and npm, to name a few), it offers a unified archive with persistent identifiers across different platforms. It’s possible to adapt any kind of software source, even if it’s not currently stored in a VCS. To see the full list of origins, go to the archive.

Here’s an explanation of how Software Heritage differs from something like Git.

SWHID has a wider scope. The first fundamental difference is that a Git commit hash identifies a specific commit in a single repository. In contrast, the SWHID identifies a wide range of software artifacts (not only commits) including files and directories.
These SWHID examples demonstrate its capabilities across various artifacts.

Directory:

swh:1:dir:717248067ccd951a4dd64d63353ad491fcb7b7eb – a SWHID that identifies a directory in the Git repository of the Elixir Ecto project, at path /lib/ecto/adapter/

File Content:

swh:1:cnt:7590da26689a6269ef5a6dbe75dbecf6531c7d8f – a SWHID that identifies the content of a file in the Git repository of the Elixir Ecto project, at path /lib/ecto/adapter/storage.ex.

Commit (Revision):

swh:1:rev:6c612ca358b567242a13fee1fcc3fceb2edce6a6 – a SWHID that identifies a commit (revision) in the Git repository of the Elixir Ecto project, authored by José Valim on 26 February 2025 at 11:34:18 UTC with the message “Update changeset.ex”.

There are other artifacts such as origins, projects, releases, snapshots, and visits.
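
The examples above all share one shape: the `swh` prefix, a scheme version, a short object-type tag, and a 40-character hexadecimal hash. As a rough sketch, the following Python splits a core SWHID into its parts. It assumes the five type tags `cnt`, `dir`, `rev`, `rel`, and `snp` and deliberately ignores the optional qualifiers that can follow the core identifier; `parse_swhid` and the `Swhid` tuple are hypothetical names chosen for this illustration.

```python
import re
from typing import NamedTuple

# Object-type tags assumed for this sketch.
OBJECT_TYPES = {
    "cnt": "content",
    "dir": "directory",
    "rev": "revision",
    "rel": "release",
    "snp": "snapshot",
}

# Core identifier only: swh:1:<type>:<40 lowercase hex digits>.
_CORE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})")


class Swhid(NamedTuple):
    object_type: str  # human-readable type, e.g. "directory"
    object_id: str    # 40-character hexadecimal hash


def parse_swhid(text: str) -> Swhid:
    """Split a core SWHID into its type and hash (hypothetical helper)."""
    match = _CORE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {text!r}")
    return Swhid(OBJECT_TYPES[match.group(1)], match.group(2))


for example in (
    "swh:1:dir:717248067ccd951a4dd64d63353ad491fcb7b7eb",
    "swh:1:cnt:7590da26689a6269ef5a6dbe75dbecf6531c7d8f",
    "swh:1:rev:6c612ca358b567242a13fee1fcc3fceb2edce6a6",
):
    print(parse_swhid(example))
```

Nothing in the identifier names a hosting platform or a repository, which is what lets the same SWHID be used wherever the artifact is found.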

SWHID is platform agnostic. Software Heritage crawls source code from different origins (e.g. GitHub, GitLab, Bitbucket) across many different VCSs (Git, Mercurial, Subversion, etc.), and it’s possible to adapt any kind of software source, even if it’s not currently stored in a VCS.

SWHID is designed to be persistent indefinitely. Even if the original repository disappears, the Software Heritage archive will preserve its contents intact.

Hopefully, this article has given you insight into how development history is preserved, why it’s important, the unique role of Software Heritage and how it adds value to traditional VCS. By archiving source code on a global scale, Software Heritage stores knowledge of software development for current and future generations.

About the author

Alex Khrustalev is a full-stack developer with experience delivering small to large-scale web applications across various domains. His main interests include functional programming, UI/UX development, and distributed systems. A huge open-source software enthusiast, he’s been using it for years at Prosapient. Although Khrustalev is adept at working with both server-side and client-side codebases, his true passion lies in crafting user interfaces and delving into the latest trends in web development. He has a blog at Hackernoon where he writes on different software topics related to web development and programming in general.

You can book a free consultation with him or an ambassador in your field to learn more about Software Heritage by sending an email to: ambassadorprogramATsoftwareheritage.org

The post Why software development history is worth preserving appeared first on Software Heritage.