Making Open Data work for Open Source Software
Open source is stronger today than it has ever been, but watching Google I/O last month and Apple's WWDC keynote earlier this evening ignites a frisson of concern in me for its future. Whilst open source is excellent at delivering operating system components, web browsers, compilers, runtimes and other astonishingly intricate technical products, its competition on the desktop is rolling out increasingly intelligent personal assistants, semantic search, real-time translation and transcription, and all manner of other incredibly powerful and transformative technologies (as well as, in Apple's case, an application that literally reminds its users to breathe in and out).

Take Shotwell, a perfectly competent photo manager that's available in Ubuntu 14.04 and many other Linux distributions. You can group photos into events, manage them in folders, tag and sort, but that's all stuff that iPhoto 7 did very competently back in 2007. Subsequent iPhoto versions added face detection and location clustering, and its newly i-less successor will soon even be able to identify objects and richly combine the best-looking photos and videos into shareable mini-movies, all (apparently) without compromising privacy. Without some help, Shotwell and its sister applications risk being left bereft of the features that modern users expect, and that's a problem.
Of course, open source software also offers other less tangible benefits, like deep configurability and privacy (as well as costing genuinely nothing), but the bottom line is that every missing feature is a potential barrier stopping people from trying out open source programs. Figuring out how to marry these strengths with increasing machine intelligence is, I think, the defining problem that open source faces in this decade.
Of course, just because a program is free to an end-user doesn't mean that organisations can't fund its development by configuring it, deploying it, securing it, and adapting it. But data-heavy programs upset this model, because not only do the programmers need to pay rent, but so do the people who create, curate and annotate the raw material. The natural solution is to drive down the cost of data acquisition by sourcing data from users who contribute for no monetary fee. But open source projects face a scaling problem in adopting this model: only a small percentage of users will provide any useful data, so you either need to control enough Internet property to gather the information at scale, or you need generous donor support to fund the creation of a proprietary dataset.
So perhaps a solution might be to create a new community organisation, a kind of Apache Foundation for data, whose sole purpose is to figure out the data-intensive direction of prominent software projects and incubate them: generating innovative new data-acquisition strategies, figuring out funding, and making the resulting datasets free for anyone to use. As an example, think of what such an organisation could do if it managed to convince a small number of sites to switch from Google's reCAPTCHA to something that contributed to an open dataset instead. It would be free to pursue innovative funding models, perhaps charging for access to particular verified datasets and using the proceeds to pay site owners based on the validity of their users' results.

However it's tackled, someone has to figure out how to make what works for open source software work for open data as well. If nothing is done, we risk ceding the functional high ground, and control over essential categories of modern computer software, to an ever smaller and more powerful number of elite organisations with the resources to amass the huge datasets that drive the modern world, and that's not a future I want to live in.
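To make the reCAPTCHA-style idea concrete, here is a minimal sketch of how such a data-contributing challenge could work, following the pairing scheme the original reCAPTCHA used: each challenge combines one item with a trusted label (to verify the user is human) with one unlabelled item, whose answer becomes a candidate annotation for the open dataset once enough users agree. All names, data and thresholds here are purely illustrative assumptions, not a description of any existing system.

```python
# Hypothetical sketch of a captcha that feeds an open dataset.
# One known item checks the human; one unknown item gets labelled.
import random
from collections import Counter

KNOWN = {"img_cat.jpg": "cat", "img_dog.jpg": "dog"}  # trusted labels
UNKNOWN = {"img_0042.jpg": []}  # unlabelled items and votes collected so far
CONSENSUS = 3  # independent matching answers needed to accept a label

def make_challenge():
    """Return one known and one unknown item to show the user together."""
    return random.choice(list(KNOWN)), random.choice(list(UNKNOWN))

def submit(known_item, unknown_item, answers):
    """Check the human via the known item; if they pass, record their
    answer for the unknown item as a vote towards an open-data label."""
    if answers[known_item] != KNOWN[known_item]:
        return False  # failed the human check, so discard both answers
    UNKNOWN[unknown_item].append(answers[unknown_item])
    label, count = Counter(UNKNOWN[unknown_item]).most_common(1)[0]
    if count >= CONSENSUS:
        KNOWN[unknown_item] = label  # promote into the trusted dataset
        del UNKNOWN[unknown_item]
    return True
```

The consensus threshold is the important design choice: a single user's answer proves nothing, but several independent agreeing answers from users who each passed the human check can be trusted enough to publish, which is also how such an organisation could grade the "validity" of a site's results for payment purposes.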