NoML Open Letter

A specification for those who want content searchable on search engines, but not used for machine learning.

Publishers need improved ways to indicate how they want content to be used in search and machine learning. Using robots.txt does not cover all use cases, and so a complementary approach is needed as proposed here. It is one which can be applied to individual webpages as desired, and can be preserved as such in datasets of web content.

Sign the open letter via Github, or if you'd prefer to, you can sign the letter via email.

NoML Proposal

Four cases

Blocking a user agent in robots.txt (such as the search spider BingBot) for ML/AI products may also block potential visibility in associated search results. So if you don’t want your content to be used in products which leverage machine learning, you may also lose the benefits of being listed in search results. This is an unacceptable situation where content creators and publishers may be required to make a choice between “all or nothing”. We therefore need an approach which addresses four basic use cases as follows:

1. Do use content for search
Do use content for machine learning
2. Do use content for search
Do not use content for machine learning
3. Do not use content for search
Do use content for machine learning
4. Do not use content for search
Do not use content for machine learning

Several suggestions for how to address this issue have been put forward. Here we propose something different, simpler and we think fairer. It is a simple adaptation of how meta and X-Robots tags are already used as we explain below in detail.

Explanation and Examples

Since there are many companies scraping/crawling webpages in order to collect data, and often without identifying themselves, a way which addresses more than just search engine crawler bots and using robots.txt is needed. The proposal here is to add a new ‘noml’ value to the already-existing meta and X-Robots tag. Meta robots tags are used already for search engine crawlers, so companies like Google that crawl for their search engine already follow requests made in this way. For example the noindex and nofollow values are used to instruct search engines on how to handle a webpage.

Also meta data is stored in Common Crawl whose crawled data is a very common, and often the biggest, source of data used in machine learning and to train AI models. The noindex attribute is used to tell search engines not to include a page in their search results, even though they can crawl the page. This tag is used when a webpage is not intended to be indexed by search engines. The nofollow attribute is used to tell search engines not to follow the links on a webpage. This attribute is used when a webpage that contains links should not be considered as endorsements or recommendations. Similarly the noml could be used to instruct any service (e.g. a search engine or AI builder) that any data from that page should not be used in machine learning.

This can be simply expressed for HTML pages using:

<meta name="robots" content="noml">

and for non-HTML using:

X-Robots-Tag: noml

Just as for the meta tag, where the name “robots” refers to all user-agent tokens, you can also identify individual user-agents. For example, you can currently request that Google does not include a page in the search results:

<meta name="googlebot" content="noindex">

and similarly you could request that Google does include a page in their search index, but does not use the data for machine learning with:

<meta name="googlebot" content="noml">

and request that Microsoft does not include a page in their Bing search index and does not use the page for training, with:

<meta name="bingbot" content="noindex, noml">

you might also request that OpenAI, for example, does not use the page for machine learning with:

<meta name="gptbot" content="noml">

however in this case, since OpenAI is not operating a search engine, you could (also) block them from crawling in the robots.txt file using:

User-agent: gptbot
Disallow: /

Further examples of are shown below, for HTML and non-HTML:

Include page in search Follow links Use in machine learning
<meta name="robots">
X-Robots-Tag:
✓ ✓ ✓
<meta name="robots" content="noindex">
X-Robots-Tag: noindex
✕ ✓ ✓
<meta name="robots" content="nofollow">
X-Robots-Tag: nofollow
✓ ✕ ✓
<meta name="robots" content="noml">
X-Robots-Tag: noml
✓ ✓ ✕
<meta name="robots" content="noindex, nofollow">
X-Robots-Tag: noindex, nofollow
✕ ✕ ✓
<meta name="robots" content="noindex, nofollow, noml">
X-Robots-Tag: noindex, nofollow, noml
✕ ✕ ✕

Search Engine API Usage

Crawler based search engines such as Mojeek, Bing and Google, also provide their results to other search engines and services via APIs. We propose that these search engines, as Mojeek would do, include the ‘noml’ directive within their API response. Search API providers should make it part of their terms of API service that results labelled as such are not used for machine learning purposes by API end users.

Conclusion

This proposed solution achieves the desired goals, by simply adding one additional value to already existing methods, and without necessarily requiring more user-agents to be used.

With this proposal creators and publishers can indicate separately whether they want content to be findable in search engines and/or used for machine learning.

Share the open letter

an icon representing twitter

an icon representing reddit

an icon representing mastodon

an icon representing facebook

an icon representing linkedin

Dear AI companies, Crawlers, Search Engines, ML Projects, Scrapers etc.

We, the undersigned, support the adoption of this proposal, which enables creators, publishers, and all other content contributors on the web to indicate whether their content can be utilized for machine learning training and applications.

Signatures from organisations and companies

Signatures from individuals

  • Reinout van Schouwen
    Yoink B.V.
  • Pelle Wessman
    Individual
  • Hugh Maaskant
    Individual
  • Colin Hayhurst
    Mojeek
  • Joshua Long
    Mojeek
  • Edan Osborne
    Individual
  • Petr Strnad
    Individual
  • Ahmed Ayman
    Individual
  • Roman Láncoš
    JojoYou
  • Innocent Tumwesigye
    Mubende Miners
  • Louise Winters
    Individual
  • Mike Bugembe
    Together For Africa
  • Jacob Forget
    Resulti
  • Peter Beackley
    Playful Technology Ltd
  • Matt Vestengen-Cox
    Individual
  • Phil Höfer
    SUMA-EV/MetaGer
  • Nigel Jacklin
    Accord.me.uk
  • Matias Lavik
    Individual
  • Rohan Kumar
    Individual
  • André Jaenisch
    André Jaenisch WD & C
  • N. Jansen
    Individual
  • Joe Devon
    Yuda Digital
  • Maciej Pędzich
    Individual
  • Bobby Hiltz
    Individual
  • Jack Yan, CEO
    Jack Yan & Associates
  • Ryan Brown
    Individual
  • Nikolaos Dimitropoulos
    Individual
  • Casey Johnston
    Individual
  • Miklo
    Citizen4
  • Rich Lott
    Individual
  • Tommi
    Individual
  • Erik
    Individual
  • Andreas Schamanek
    Individual
  • Ros Jackson
    Individual
  • Pug
    Individual
  • Bryce A. Lynch
    Virtual Adept Networks
  • girlbossceo (June)
    Individual
  • Mocking Blerd
    Individual
  • Rhodes Atukwase
    Child Network Org
  • Jakob Senkl
    Individual
  • Przemysław Gasiński
    Individual
  • Matthew Alp
    x-Shopify
  • Lily 'MinekPo1' A.N
    Individual
  • Randell Brian Knight
    Individual
  • Alfonso Hernandez
    CEO Astian, Inc
  • Niklas Poslovski
    Individual
  • James North
    Individual
  • rav3ndust
    Ibara Software
  • Elmar Dott
    Software Consultation
  • James Nock
    Individual
  • Candlewax
    Individual
  • JacksonChen666
    Individual
  • Milosz Galazka
    Individual
  • Timo Tijhof
    individual
  • Bruno Manach
    individual
  • mirabilos
    Individual
  • Raphaël Maurille
    Student
  • Klaus Alexander Seistrup
    Individual
  • Stefan Judis
    free
  • jessie
    Individual
  • Reza Almanda
    Individual
  • Jacob Trebil
    Andi Search
  • Gianmarco Gargiulo
    Individual
  • Alban LECOCQ
    Individual
  • Richard Lees
    Mojeek
  • Anthony D
    Individual
  • Chris Heinz
    Individual
  • Robin Berjon
    individual
  • Haelwenn Monnier
    Individual
  • Eric Cassidy
    Individual
Sign on GitHub Sign via Email Blog Post Press
an icon representing twitter

an icon representing reddit

an icon representing mastodon

an icon representing facebook

an icon representing linkedin