NoML Open Letter

A specification for those who want content searchable on search engines, but not used for machine learning.

Publishers need improved ways to indicate how they want content to be used in search and machine learning. Using robots.txt does not cover all use cases, and so a complementary approach is needed as proposed here. It is one which can be applied to individual webpages as desired, and can be preserved as such in datasets of web content.

Sign the open letter via Github, or if you'd prefer to, you can sign the letter via email.

NoML Proposal

Four cases

Blocking a user agent in robots.txt (such as the search spider BingBot) for ML/AI products may also block potential visibility in associated search results. So if you don’t want your content to be used in products which leverage machine learning, you may also lose the benefits of being listed in search results. This is an unacceptable situation where content creators and publishers may be required to make a choice between “all or nothing”. We therefore need an approach which addresses four basic use cases as follows:

1. Do use content for search
Do use content for machine learning
2. Do use content for search
Do not use content for machine learning
3. Do not use content for search
Do use content for machine learning
4. Do not use content for search
Do not use content for machine learning

Several suggestions for how to address this issue have been put forward. Here we propose something different, simpler and we think fairer. It is a simple adaptation of how meta and X-Robots tags are already used as we explain below in detail.

Explanation and Examples

Since there are many companies scraping/crawling webpages in order to collect data, and often without identifying themselves, a way which addresses more than just search engine crawler bots and using robots.txt is needed. The proposal here is to add a new ‘noml’ value to the already-existing meta and X-Robots tag. Meta robots tags are used already for search engine crawlers, so companies like Google that crawl for their search engine already follow requests made in this way. For example the noindex and nofollow values are used to instruct search engines on how to handle a webpage.

Also meta data is stored in Common Crawl whose crawled data is a very common, and often the biggest, source of data used in machine learning and to train AI models. The noindex attribute is used to tell search engines not to include a page in their search results, even though they can crawl the page. This tag is used when a webpage is not intended to be indexed by search engines. The nofollow attribute is used to tell search engines not to follow the links on a webpage. This attribute is used when a webpage that contains links should not be considered as endorsements or recommendations. Similarly the noml could be used to instruct any service (e.g. a search engine or AI builder) that any data from that page should not be used in machine learning.

This can be simply expressed for HTML pages using:

<meta name="robots" content="noml">

and for non-HTML using:

X-Robots-Tag: noml

Just as for the meta tag, where the name “robots” refers to all user-agent tokens, you can also identify individual user-agents. For example, you can currently request that Google does not include a page in the search results:

<meta name="googlebot" content="noindex">

and similarly you could request that Google does include a page in their search index, but does not use the data for machine learning with:

<meta name="googlebot" content="noml">

and request that Microsoft does not include a page in their Bing search index and does not use the page for training, with:

<meta name="bingbot" content="noindex, noml">

you might also request that OpenAI, for example, does not use the page for machine learning with:

<meta name="gptbot" content="noml">

however in this case, since OpenAI is not operating a search engine, you could (also) block them from crawling in the robots.txt file using:

User-agent: gptbot
Disallow: /

Further examples of are shown below, for HTML and non-HTML:

Include page in search Follow links Use in machine learning
<meta name="robots">
X-Robots-Tag:
✓ ✓ ✓
<meta name="robots" content="noindex">
X-Robots-Tag: noindex
✕ ✓ ✓
<meta name="robots" content="nofollow">
X-Robots-Tag: nofollow
✓ ✕ ✓
<meta name="robots" content="noml">
X-Robots-Tag: noml
✓ ✓ ✕
<meta name="robots" content="noindex, nofollow">
X-Robots-Tag: noindex, nofollow
✕ ✕ ✓
<meta name="robots" content="noindex, nofollow, noml">
X-Robots-Tag: noindex, nofollow, noml
✕ ✕ ✕

Search Engine API Usage

Crawler based search engines such as Mojeek, Bing and Google, also provide their results to other search engines and services via APIs. We propose that these search engines, as Mojeek would do, include the ‘noml’ directive within their API response. Search API providers should make it part of their terms of API service that results labelled as such are not used for machine learning purposes by API end users.

Conclusion

This proposed solution achieves the desired goals, by simply adding one additional value to already existing methods, and without necessarily requiring more user-agents to be used.

With this proposal creators and publishers can indicate separately whether they want content to be findable in search engines and/or used for machine learning.

Share the open letter

an icon representing twitter

an icon representing reddit

an icon representing mastodon

an icon representing facebook

an icon representing linkedin

Dear AI companies, Crawlers, Search Engines, ML Projects, Scrapers etc.

We, the undersigned, support the adoption of this proposal, which enables creators, publishers, and all other content contributors on the web to indicate whether their content can be utilized for machine learning training and applications.

Signatures from organisations and companies

Signatures from individuals

  • Elmar Dott
    Software Consultation
  • Mike Bugembe
    Together For Africa
  • Milosz Galazka
    Individual
  • Klaus Alexander Seistrup
    Individual
  • Alban LECOCQ
    Individual
  • Tommi
    Individual
  • Raphaël Maurille
    Student
  • Louise Winters
    Individual
  • Jacob Forget
    Resulti
  • Candlewax
    Individual
  • Robin Berjon
    individual
  • Lily 'MinekPo1' A.N
    Individual
  • Niklas Poslovski
    Individual
  • André Jaenisch
    André Jaenisch WD & C
  • N. Jansen
    Individual
  • Rich Lott
    Individual
  • Petr Strnad
    Individual
  • Stefan Judis
    free
  • Bruno Manach
    individual
  • Roman Láncoš
    JojoYou
  • Innocent Tumwesigye
    Mubende Miners
  • Rhodes Atukwase
    Child Network Org
  • Joe Devon
    Yuda Digital
  • jessie
    Individual
  • Alfonso Hernandez
    CEO Astian, Inc
  • girlbossceo (June)
    Individual
  • Miklo
    Citizen4
  • Nikolaos Dimitropoulos
    Individual
  • Matthew Alp
    x-Shopify
  • JacksonChen666
    Individual
  • Przemysław Gasiński
    Individual
  • Pug
    Individual
  • Reza Almanda
    Individual
  • Matias Lavik
    Individual
  • Colin Hayhurst
    Mojeek
  • Peter Beackley
    Playful Technology Ltd
  • Chris Heinz
    Individual
  • Matt Vestengen-Cox
    Individual
  • Andreas Schamanek
    Individual
  • Anthony D
    Individual
  • Hugh Maaskant
    Individual
  • Eric Cassidy
    Individual
  • Pelle Wessman
    Individual
  • Haelwenn Monnier
    Individual
  • Ros Jackson
    Individual
  • Erik
    Individual
  • Maciej Pędzich
    Individual
  • James Nock
    Individual
  • Ryan Brown
    Individual
  • Timo Tijhof
    individual
  • Joshua Long
    Mojeek
  • Jakob Senkl
    Individual
  • mirabilos
    Individual
  • Randell Brian Knight
    Individual
  • Rohan Kumar
    Individual
  • Bryce A. Lynch
    Virtual Adept Networks
  • Gianmarco Gargiulo
    Individual
  • Richard Lees
    Mojeek
  • Jack Yan, CEO
    Jack Yan & Associates
  • Phil Höfer
    SUMA-EV/MetaGer
  • Bobby Hiltz
    Individual
  • Nigel Jacklin
    Accord.me.uk
  • James North
    Individual
  • Reinout van Schouwen
    Yoink B.V.
  • Mocking Blerd
    Individual
  • Edan Osborne
    Individual
  • Jacob Trebil
    Andi Search
  • Casey Johnston
    Individual
  • rav3ndust
    Ibara Software
  • Ahmed Ayman
    Individual
Sign on GitHub Sign via Email Blog Post Press
an icon representing twitter

an icon representing reddit

an icon representing mastodon

an icon representing facebook

an icon representing linkedin