security Archives - Software Heritage

What’s next in research for Software Heritage

Nicole Martinelli — Wed, 09 Apr 2025 08:02:00 +0000

After nearly a decade of Software Heritage, changes are afoot. Co-founder Stefano Zacchiroli is shifting focus to become the Chief Scientific Officer (CSO), while Thomas Aynaud joins the team as the new Chief Technical Officer, taking over Zacchiroli’s previous responsibilities.

The move to CSO allows Zacchiroli to focus on his research interests: digital commons, open-source software engineering, computer security, and the software supply chain. He’s a full professor of computer science at Télécom Paris, Polytechnic Institute of Paris. A Debian developer since 2001, he served as Debian project leader from 2010 to 2013.

In this interview, he talks about getting back to research full-time, how Software Heritage helps make open-source more secure for everyone, and why keeping the hobbyist ethos alive is important.

Stefano Zacchiroli. © Inria / Photo B. Fourrier

You’re the co-founder of Software Heritage. What was the pivot point for changing roles?

In the beginning, we were a fairly small team, so we had to distribute the roles among ourselves. I was technically inclined, since I’d been doing free software technical work for decades, so I picked up the CTO role. I was happy to lay the technical foundations of Software Heritage, which are still fundamental for the archive today. But my real life is as a researcher – I’m a computer science professor, and I’d been doing research for most of my career. Over the past 10 years, all my research has been built upon what Software Heritage enables: very large-scale empirical analyses of the software commons. Eventually, my vocation pulled me in that direction more and more, and I was delighted to find someone to step into the CTO role so I can go back to full-time research work.

What’s a Chief Scientific Officer (CSO)?

Doing research and enabling research has been part of the mission of Software Heritage from the very beginning. We have a lot of research work conducted by team members themselves (like me), members of the technical team helping researchers conduct their work, as well as outside researchers from universities and research labs mining the Software Heritage archive. The CSO role is about coordinating all of that. It doesn’t mean running every possible research project: we want people to be able to do research on Software Heritage on their own. But it does mean keeping an eye on how we interact with researchers, to make their work easier.

“Research used to be a side activity of Software Heritage, not part of our main strategy. Now, it’s a key focus and we’re making it visible.”

What are you most excited about in your new role?

There’s a lot of research going on around Software Heritage, from all kinds of angles. I’m the principal investigator for many of these projects, especially around security use cases. On that front, the question is: how can we effectively leverage all the knowledge collected by Software Heritage, and use it so that free software developers can create more secure software for all of us? The same general principle of efficiently leveraging Software Heritage knowledge to enable the public good can be applied to other fields. I’m actively pursuing two: first, how Software Heritage can enable reproducible research when it comes to software, and second, how it can enable build reproducibility of open-source software. Build reproducibility lets people using open-source software check where it comes from and have a solid way to trust the link between the code and what’s running on their devices.

Finally, another angle I’ve been working on extensively is the human aspect of software engineering, specifically the global collaboration involved in creating open-source software. We’ve previously studied diversity in terms of gender and geographic origin. Software Heritage offers a unique perspective on how these trends evolve worldwide.

Software Heritage recently held a partner kickoff for CodeCommons. What’s your involvement in that project?

So as you can imagine, a lot of people have been interested in trying to use the Software Heritage Archive for large language models (LLMs) for code. We’re not providers of models, but we want to understand our role in helping create ethical datasets, where ethical means, first and foremost, keeping track of where the code comes from when it is used to train LLMs. We want people to be able to know if their code has been used in LLM training. We want to be able to produce open datasets for that, and we want to provide all the relevant information that enables LLM producers to respect the license of the code used in the model and have information about the provenance of the code that has been used.

What are you most looking forward to leaving behind?

The technical work to maintain and develop the Software Heritage archive has really changed in scale over time. We have a larger code database, of course, because that’s ever-growing, and we have a larger team. We have about a dozen engineers now. We have changed technologies, so we’re using more large-scale technologies than we were using in the very beginning. I’m very glad to hand that off to the capable hands of Thomas Aynaud, who’s far more skilled than I am on that front and is taking over the leadership of that team, too.

How will this transition impact the scientific direction of SWH, and what changes can we expect short and long term?

Software Heritage as an initiative is primarily an enabler of many use cases, and large-scale research on the software commons is one of them. I don’t expect the focus of Software Heritage to change as part of my role change. What will change is that Software Heritage will support more independent research teams doing this kind of large-scale research. I expect more collaborations to materialize and more researchers to use Software Heritage data, leading to more research results, both directly and indirectly.

Do you see your contributions changing much, or is it a question of quantity or focus?

No, it’s more about peace of mind. I’ve been primarily doing research for years. This is simply aligning my title with my actual work, not changing what I’m doing. Research used to be a side activity of Software Heritage, not part of our main strategy. Now, it’s a key focus and we’re making it visible.

Follow-up question from the recent Symposium: You mentioned that SWHSec empowers anyone, not just large corporations, to contribute to software security.

SWHSec is a large and diverse research project, with many teams looking at how Software Heritage can help secure open source for everyone. My presentation was about one specific R&D task: figuring out if a specific software version in the archive is vulnerable. This isn’t a new problem, and there are industry solutions for it. But Software Heritage has something unique: deep source code visibility. We can track vulnerabilities down to individual commits and see if even a random fork on a little-known development forge is affected by a specific vulnerability or not. This is a kind of hidden problem that industry usually ignores. For example, if you forked a project two years ago when it was vulnerable, and never merged back a fix, we can tell you it’s still vulnerable. Sure, there’s not a lot of money in this, but we can give developers data and tools to fix these issues. We’ve found real vulnerabilities that maintainers didn’t know about, and helped fix them. Since we can do this, it’s part of our social responsibility to do so.
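To make the fork example concrete, here is a minimal sketch of one way such a check could work. It is not the SWHSec method, only an illustration under simplifying assumptions: a commit is treated as vulnerable when the commit that introduced the flaw is in its history and the fixing commit is not. The commit graph, the commit ids, and the function names are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy history: each commit id maps to its parent commit ids.
PARENTS: Dict[str, List[str]] = {
    "c1": [],
    "c2": ["c1"],        # assumed to introduce the vulnerability
    "c3": ["c2"],
    "c4": ["c3"],        # assumed to fix the vulnerability upstream
    "f1": ["c3"],        # fork created before the fix landed
    "f2": ["f1"],        # fork head that never merged the fix
    "m1": ["f2", "c4"],  # fork head after merging the upstream fix
}


def ancestors(head: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return head plus every commit reachable from it via parent links."""
    seen: Set[str] = set()
    stack = [head]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def is_vulnerable(head: str, introduced: str, fixed: str,
                  parents: Dict[str, List[str]]) -> bool:
    """Simplified rule: the flaw is in the history and the fix is not."""
    history = ancestors(head, parents)
    return introduced in history and fixed not in history


if __name__ == "__main__":
    for head in ("f2", "m1", "c1"):
        print(head, is_vulnerable(head, "c2", "c4", PARENTS))
```

A real analysis would also have to handle fixes that were cherry-picked or rewritten under a different commit id, which pure ancestry checks miss.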

How do you see Software Heritage evolving in terms of its impact on open-source sustainability?

Open-source sustainability is a huge topic, and people are trying all sorts of things to improve the status quo. A common approach is to convince maintainers to become independent businesses, so they get paid for their work. Another is to look at how civil servants build software for governments. Software Heritage helps by showing all the ways people contribute to keeping open source going, and how funding works. For instance, we’ve recently used Software Heritage to show that researchers are really important for maintaining data and machine learning open-source software, which is super important these days. The role of civil servants and researchers is often overlooked in sustainability discussions, but you can see it in our data – by showing who’s really contributing to important software, we can help shape how funding is used. We can also spot software that’s important but isn’t getting enough attention – it needs more maintenance, more contributions, or more funding – so it’s at risk. We’re the only ones who can do this on such a large scale, and see all the different kinds of contributions you can’t see just by looking at GitHub.

You’ve been a Debian developer for a long time. How has your perspective on open-source development changed over the years?

Well, back in the early days, it was all volunteers, right? People were doing this just for fun or to “scratch an itch,” as the saying goes. Free software made collaboration possible, and then the industry took notice. Now there’s a lot of intermingling between paid and volunteer contributions. There are very relevant projects like the Linux kernel that are almost entirely maintained by paid contributions. But I think it’s important to keep the door open for hobbyists, both volunteer maintainers and drive-by contributors, to participate. We’re risking a strict separation between commercial open source, paid for by large companies, and hobbyist open source, which gives the ability for everyone to contribute and understand the code they run.


Joining forces for a secure open source software supply chain

Nicole Martinelli — Tue, 01 Oct 2024 07:00:47 +0000

The digital landscape is evolving, and with it, the responsibilities that come with creating, maintaining, and securing software. Landmark regulations like the European Cyber Resilience Act (CRA) are reshaping the way open-source software is used and governed. As these regulations set new standards, organizations must adapt to ensure compliance and security.

At Software Heritage, we believe that these changes present not only challenges but also opportunities to create a safer, more transparent open-source ecosystem. As the largest public archive of source code, supported by French research institute Inria and UNESCO, we’re a founding member of the Eclipse Foundation’s newly formed Open Regulatory Compliance Working Group (ORCWG). This group is dedicated to helping open-source projects navigate emerging regulatory requirements while ensuring that innovation and collaboration continue to thrive.

Why now: A new era of regulation

Over the past few years, regulations such as the Digital Services Act, Digital Markets Act, and the General Data Protection Regulation (GDPR) have introduced sweeping changes to the tech landscape.

The CRA, set to be in full force by the end of 2027, will also impact all software put on the market in Europe. It will require organizations to trace their software’s origins, manage vulnerabilities, and ensure that critical software components are properly documented. This highlights the need for strong tools and infrastructure to meet these requirements, which Software Heritage is equipped to provide.

Traceability and security with Software Heritage

Modern software development has often been compared to a Jenga tower. Each block is a component, and if one wobbles, the whole thing could crumble. Today, most software stacks are built on a foundation of external components, often open source. To ensure a secure system, it’s crucial to know exactly what those blocks are and where they came from.

Enter ‘Know Your Software’ (KYSW). Just as banks must identify their customers, developers need to understand their software’s components. To achieve complete traceability, every piece of software, from binary to source code, must be created, shared, validated, and tracked.

That’s where Software Heritage comes in. We’ve secured over 50 billion software artifacts through the Software Hash Identifier (SWHID) specification, guaranteeing long-term availability, ensuring integrity, and enabling traceability across the entire software ecosystem.
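As an illustration of how such an identifier ties a name to exact bytes, the sketch below computes a SWHID for a single file’s content. It assumes, as we understand the specification, that content identifiers take the form `swh:1:cnt:<hex>` where the hex digest is the same SHA-1 that Git computes for a blob. The function name `content_swhid` is made up for this example, and directories, revisions, and other object types are not covered.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute a content SWHID, assuming the Git blob hashing scheme."""
    # Git hashes a blob as: "blob <length in bytes>\0" followed by the raw bytes.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + digest


if __name__ == "__main__":
    print(content_swhid(b"hello world\n"))
```

Because the identifier is derived only from the bytes, anyone holding a copy of the file can recompute it and confirm that it matches, without having to trust the party that supplied the file.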

With new regulations come basic needs that become best practices: making source code publicly available, identifying precisely which versions are affected by a given known vulnerability, tracing the origin of software components, finding a reference place to store qualified metadata, and more.

Contributing to the future of open-source security

Joining the ORCWG is just the next step in our mission to make software safer and more open. We’ve been actively engaged in discussions about securing the software supply chain for years, and the SWHID is part of the SPDX 2.2 specification and included in the 2021 report of the working group on Software Bill of Materials (SBOM) that NTIA launched in 2018.

ORCWG just launched but is already gearing up for a major challenge: building a blueprint for cybersecurity that aligns with CRA. ORCWG’s mission? To deliver a clear roadmap for open-source projects, helping them navigate the new security landscape.

Get involved

We’re in good company: key players from foundations and corporations are joining forces in this new working group, organized by the Eclipse Foundation. At launch time, members included Apache Software Foundation (ASF), Blender Foundation, Robert Bosch GmbH, CodeDay, The Document Foundation, FreeBSD Foundation, Matrix.org Foundation, NLnet Labs, Open Elements, OpenForum Europe, OpenInfra Foundation, Open Source Initiative (OSI), Open Source Robotics Foundation (OSRF), OWASP, Payara Services, The PHP Foundation, Python Software Foundation, Rust Foundation, SCANOSS and Siemens.

If you’d like to join us, there are plenty of ways to get involved, from a mailing list to a Matrix chat, weekly office hours, webinars and repos. You can also apply to become a member.
