Wikipedia talk:Large language models/Archive 5

This is an archive of past discussions on Wikipedia:Large language models. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Archive 1

←

Archive 3

AI Generated Content:(Ban, Limit or Allow)?

(Moving this from Village Pump's policy board)

Prompted by some responses in the Wikimedia Community Discord to a query I had about using AI tools (such as LLM's or Alpha to generate content for wikis.

I present three choices of wording of a policy/guideline on the inclusion use of AI Generated content, such as for example that from LLM's :-

1. " Wikipedia is a entirely work of collaborative human authorship, Use or contribution of material generated wholly or mostly in part from non human sources (such as LLM based generation) is prohibited."

2. "Wikipedia is a primarily a work of human authorship, Use of content generated from AI's s (such as LLM based generation) should be used sparingly and content generated with it's assistance should be clearly identifiable as such, with full attribution of the tools or models used."

3. "Wikipedia is a collaborative work, and users may make use of appropriate tools such as LLM's (with appropriate attribution), in order to further this aim."

This of course assumes that the generated content meets all other considerations for content that would apply irrespective of human vs AI generation.

I'm not going to argue for any specific position, but my concerns about AI generated content, are the lack of clarity and transparency about usage rights under compatible license, and the possibility of copyright material 'leaking' into an otherwise 'freely' licensed wiki.

English Wikipedia should have a clearly documented policy, on what 'AI/machine-generated' content can or cannot be included.

I also appreciate that there are plenty of passive bots on Wikipedia that assist skilled users in performing tasks that would be time consuming to do manually. ShakespeareFan00 (talk) 12:29, 11 April 2023 (UTC)

It's not merely a matter of copyright "leaking" (i.e., chunks of material from the model's training set coming out in the material it generated.) At least some of the providers of AI tools are claiming copyright in the output. While there is much question as to the legal viability of such a move, to the best of my not-a-lawyerly knowledge this has not yet been settled in court. As such, we may discover at a later date that query results that have been integrated into the text are actually the copyright of an AI-generating firm that did not consent to its inclusion, even if they are not currently claiming that right. One could in theory have an AI system trained only on public domain texts and issuing its results with a Creative Commons license, but that is not the default, if it exists at all. As such, it may behoove us to wait until there is better established law that clears these concerns. --Nat Gertler (talk) 14:25, 11 April 2023 (UTC)

I like the "work of collaborative human authorship" language. Another option would be to treat LLMs as bots and require them to go through the bot approval process. –dlthewave ☎ 18:15, 11 April 2023 (UTC)

How would that work with indvidual (human authors) using LLM drafted content? A Policy that to use LLM derived content you have to have a bot flag for those edits perhaps? ShakespeareFan00 (talk) 19:28, 11 April 2023 (UTC)

I also like work of collaborative human authorship. The bots that operate at all in the mainspace are maintenace-focused at best, rescuing sources and auto-reverting the most blatant vandalism. I don't even like the thought of whitelisting LLMs for use because of how it could spiral out of control, given their lack of human constraints.

While I have a great gratitude for editors arguing the copyright side of AI-generated content, my focus lies in the logistics of maintaining veracity and due-attention in the content of articles itself. Large language models are by their very nature and development, made to be believed, not to be true. They are grown to replicate human language and be able to pass for the work of a person, not to furnish truth or to adhere to good research practices. If you developed an LLM on content or writings exclusively written in British English, the algorithm would swear up and down the garden path that the word color can only be spelled colour. The algorithm cannot discern, and that is a problem. They are primarily concerned with responding to prompts with an answer that is both plausible and delivered like a human because these are what makes them convincing.
TL;DR Even in the hands of experienced editors, LLMs will deliver misinformation to its users and will, by their very design, be as convincing with these falsehoods as possible. GabberFlasted (talk) 18:59, 11 April 2023 (UTC)

Another concern. LLM's cannot necessarily infer bad intention of the users of them. If you ask a "wrong" question, the right way, or a "right" question the wrong way, you might get something unexpected. ShakespeareFan00 (talk) 19:37, 11 April 2023 (UTC)

+1 - you wrote exactly what I was going to say. LLMs are designed to the very best at producing plausible content to fool humans, and care about truth only to the extent that truth can help fool humans. If there isn't already an essay explaining this there should be one. I think the only tenable option for the survival of Wikipedia as a reliable work is banning LLMs. Galobtter (talk) 19:58, 11 April 2023 (UTC)

I agree with an approach based on how programs are used, rather than the specific underlying technologies, which are evolving and becoming increasingly embedded into many programs. Ultimately, though, if there is a significant increase in poorly written submissions, the community needs to figure out a way to handle them. A policy alone isn't going to prevent them. isaacl (talk) 21:41, 11 April 2023 (UTC)

becoming increasingly embedded into many programs

Indeed, when LLMs are by default completing everything you write in MS Word and Google Docs, a policy isn't going to do anything anymore. PopoDameron ⁠talk 23:05, 11 April 2023 (UTC)

I would be opposed to a complete ban on several grounds; it's impractical and doesn't recognize the wide range of ways LLMs might be used. The key thing is that we have to make sure everything added adheres to our existing core content policies; enforcing those properly will make most of the problems go away (aside from maybe the more hand-wavy copyright issues, which I feel are speculative and would oppose writing policy around today - as opposed to when LLMs spit out concrete copyvios, of course, which fall under current policy.) The one thing I feel we might want to consider is tighter rules on automated or semi-automated mass-creations, which LLMs might enable and which we should probably require clear prior approval for on a case-by-case basis. --Aquillion (talk) 16:49, 12 April 2023 (UTC)
If there is a particular way LLMs are useful, then that use can be allowed, but I don't see any issue with banning use until that is shown. My issue with allowing use is that it makes creating content much much quicker than verifying it, especially since it seems LLMs are very adept at creating fake references. It also allows for good-faith users to create lots of false, unverifiable information inadvertently - this is something that is much harder without LLMs. Even if mass creation is not done, this still is a big issue, as we rely on the fact that good faith users generally add verifiable content.

Enforcing this policy is going to be hard, but my goal is that at least good-faith editors aren't mislead into thinking LLMs are a useful or endorsed way to write articles. Galobtter (talk) 08:57, 13 April 2023 (UTC)
Some other concerns I had thought of from reading the article are WP:SYNTH and WP:REFLOOP, if an LLM's been trained on wiki based sites, and that without significant oversight, I'm also concerned that "hallucinated" content about a BLP, could be inserted. (Aside: I am reminded that one A.P. Herbert's 'Misleading Cases' was about a suing a computer for defamation.) ShakespeareFan00 (talk) 09:36, 13 April 2023 (UTC)
I don't think there's an LLM whose dataset doesn't include Wikipedia, so yeah I think almost by definition LLM additions violate our core content policies. Galobtter (talk) 21:02, 13 April 2023 (UTC)
I assume that what you mean here is circular citation. While this is certainly true, I don't think this requires anything close to a complete prohibition: previous discussion here (and the several demo pages linked to from here) have given a litany of constructive uses. I agree that citing content directly to the language model as a source is unbelievably dumb and bad, and that blindly pasting output from the model into Wikipedia is also dumb and bad (which is an opinion shared by many, hence its inclusion on most if not all guidance pages that have been written thus far). jp×g 01:34, 14 April 2023 (UTC)
There are many reasons not to use LLM-generated prose, but I don't think citogenesis would be an issue as long as everything is cited and verified. If an LLM happened to output a direct copy of a Wikipedia article, for example, it would be no different from Copying within Wikipedia. The only thing to worry about would be proper attribution. –dlthewave ☎ 15:51, 15 April 2023 (UTC)

@Galobtter: There are examples of their use in the linked transcripts here (to wit: User:JPxG/LLM demonstration, User:JPxG/LLM demonstration 2, User:Fuzheado/ChatGPT, User:DraconicDark/ChatGPT, and Wikipedia:Using neural network language models on Wikipedia/Transcripts). The issues with fake references are indeed bad. Any time a person types "Write a Wikipedia article about XYZ" into ChatGPT and pastes the output straight into mainspace, it is trash and should be deleted (and I think WP:G3 should be expanded to this effect), but there are mane other ways to use these models. jp×g 01:39, 14 April 2023 (UTC)
For the cases where it is actually used for creating content in Wikipedia, which is what I care about, what I'm seeing is mostly evaluations of the like "almost all incorrect" etc. Stuff like giving suggestions is not really what I care about, and the use cases shouldn't be conflated. Galobtter (talk) 03:24, 14 April 2023 (UTC)
It seems to me that every few weeks, a new discussion is opened about this subject is started, and more or less the same points are raised as during the previous ones. In this case, the specific issues of whether LLM output should be covered under WP:BOTS (i.e. require BAG approval) and whether it inherently violated copyright was discussed at great length here and at the village pump in January. The product of these discussions, more or less, exists at WP:LLM and WP:LLMCOPY. In my opinion, it would be quite useful (and perhaps necessary) to work towards adopting or rejecting an existing proposal, or at least toward addressing whether or not existing proposals are good, versus having the same discussions a priori each time. jp×g 01:31, 14 April 2023 (UTC)
- I think we should start speeding things up toward initiating the policy proposal in the next few days. —Alalch E. 14:03, 14 April 2023 (UTC)
- Maybe start a pre-VPP RfC on this talk page with the options: A—this draft is finished enough to be proposed as the 'Large language models policy', B—this draft is not finished enough to be proposed as the 'Large language models policy', C—there should not be a new policy about this. It would be purely consultative in nature, and would not actually prevent someone from proposing it as a policy, if, for example, C gets the most support. —Alalch E. 14:21, 14 April 2023 (UTC)
  This discussion was originally at VPP. I was asked to read the draft policy, and the thread seemed better here.
  
  I think the VPP RfC should be of the Ban, Limit, Allow style of debate.
  
  ShakespeareFan00 (talk) 14:30, 14 April 2023 (UTC)
  The VPP RfC is going to be adopt / don't adopt new policy (this is a draft for that policy). —Alalch E. 14:44, 14 April 2023 (UTC)
  It may be better to get a view first on what general bounds the community agrees upon for use of writing assistant tools, and then revise the draft policy accordingly. This would increase the likelihood of the policy lining up with community consensus. isaacl (talk) 15:06, 14 April 2023 (UTC)
  isaacl, I strongly agree that we should get consensus on 3 or 4 "big questions" before presenting the entire page for adoption. We don't want one unpopular point to derail the whole thing. –dlthewave ☎ 16:57, 14 April 2023 (UTC)
  I also agree that we should start with an RfC with the overarching questions, like "Should LLMs be banned?" and such. PopoDameron ⁠talk 18:19, 14 April 2023 (UTC)
  Such an RfC was already held at VPP, essentially: Wikipedia:Village pump (policy)/Archive 179#Crystallize chatbot discussions into a policy?. At the very least, there was no consensus for a blanket ban. While more editors have become aware of the problem due to the ANI threads that have appeared in the meantime, little changed—everything that has been happening was predicted at some point, so there's not much new information that can be expected to influence someone to change their mind.
  What are the other 2-3 "big questions"? —Alalch E. 10:15, 15 April 2023 (UTC)
  Since there doesn't seem to be appetite for an outright ban, questions like "Should LLM use be allowed for minor edits but banned from use in content creation?" and "Should LLM users be required to obtain bot approval per WP:MEATBOT?" would help establish limits around the most potentially disruptive uses. –dlthewave ☎ 13:00, 15 April 2023 (UTC)
  Would the first question refer to WP:MINOR edits, or does it refer to non-major edits, or just edits that don't add new content? The second question addresses a novel idea, that can maybe be discussed here and worked out within the draft, before proposing to the wider community. —Alalch E. 13:16, 15 April 2023 (UTC)\
  Edits that don't add new content. I'm not quite sure how to define it, but the idea is to allow straightforward tasks like changing the color scheme of a table while restricting anything with the potential for "hallucinations". –dlthewave ☎ 15:37, 15 April 2023 (UTC)
  So maybe a three option RfC: A—blanket ban; B—ban on all use except for (ennumerated?) uses that don't involve the risk of adding hallucinations (straightforward tasks like changing the color scheme of a table); C—no blanket ban and not B either?—Alalch E. 16:36, 15 April 2023 (UTC)
  Yes, that's what I had in mind, thanks. Another option I had in mind would be something like "consensus required", where an editor wishing to use a LLM for a certain task would have to demonstrate its reliability and gain approval through either an RfC or Bot Approvals Group. Would this add too much complication? –dlthewave ☎ 12:48, 16 April 2023 (UTC)
  Personally I think it might trigger a lot of discussion that might be better directed at changing the general guidance rather than dealing with an individual exception. Do you have some examples in mind to help illustrate the kinds of tasks you are thinking of? isaacl (talk) 17:11, 16 April 2023 (UTC)
  If we agree the first option (RfC as described above) is a good idea, we might as well stick with that first option. Another thing to do before asking if the draft (or... a draft) on this page is the something that should be proposed as the policy is getting a view on what general bounds the community agrees upon for use of writing assistant tools, which would be a separate RfC (or perhaps not an actual RfC). So there are approximately four things to do before VPP: (1) the "blanket ban or something close to it RfC", (2) the "writing assistant discussion", (3) revising the draft, (4) the "ready to go?" pre-VPP RfC. Is that about right? —Alalch E. 20:01, 16 April 2023 (UTC)
  I'm not sure if we have the same understanding about what you called the "writing assistant" discussion. I think if there is support for defining specific uses of tools (regardless of underlying technology), then there will have to be a discussion to agree upon those uses. (Roughly speaking, working out the details for a position between "ban all uses" and "allow all uses".) Is this what you have in mind? I don't think a formal RfC to establish if the draft is ready to proceed to an RfC is needed, but certainly soliciting more viewpoints from editors other than those who participated in creating the draft can help. isaacl (talk) 21:02, 16 April 2023 (UTC)
  When you had mentioned writing assistant tools, I interpreted that in the context of how people have been bringing up that LLM-powered applications are becoming ubiquitous: from Microsoft Editor to grammar checkers (some of which have even branched off into full-fledged text-generating apps like GrammarlyGo). So, as in the question of: What is even meant when a policy would refer to an "LLM"? But that's not what you meant, I get it.—Alalch E. 21:19, 16 April 2023 (UTC)
  It's related to what I meant. As someone mentioned previously, I don't think anyone is concerned about the technology underlying grammar checkers. Rather than have a policy about technology, I think having a policy about the extent to which tools can be used to help with writing versus doing the writing (even if based on some direction from humans) would better align with the concerns of the community. isaacl (talk) 21:37, 16 April 2023 (UTC)
  So something like "Should editors be prohibited from using writing assistant tools (ranging across grammar checkers, tools that offer writing suggestions, all the way up to the tools that generate new text or source code) to make their contributions in whole or in part, and if no (not prohibited), to which extent / in which cases / under which circumstances/modalities should the use of said tools be allowed?" —Alalch E. 22:18, 16 April 2023 (UTC)
  Continuing discussion below at § Focus on types of uses... isaacl (talk) 22:11, 17 April 2023 (UTC)
  
  The discussion was not a formal RfC and branched off into many directions, and thus it's not clear to me that there was sufficient focused discussion to be considered a consensus viewpoint of the community. isaacl (talk) 15:39, 15 April 2023 (UTC)
  In a section dedicated to this matter at a project-wide venue that is VPP, approximately 27 editors made formatted comments ("support/oppose blanket ban" along with a few non-explicitly-advocating "comment" comments) about the blanket ban idea. Consensus for a blanket ban was obviously not reached. Should someone really start a blanket ban RfC now hoping for a productive outcome? Seems like a probable waste of time. I might be wrong.—Alalch E. 16:25, 15 April 2023 (UTC)
  Discussions with a single question get better focus, and I suspect having formal RfC notifications will generate a larger sampling of interested parties. Given that the small handful of editors on this talk page haven't really come to any consensus, as the page goes through cycles of expansion and trimming, I think there's a reasonable probability that there's a significant divergence from community consensus, and it might be better to get a more definitive view on high-level consensus, to smooth the way for a policy to be approved. isaacl (talk) 02:48, 16 April 2023 (UTC)
As I've said before, I think an outright ban is not possible because all variety of autocompletion, grammar checking and suggestion, optical character recognition, and such (the list is probably much longer) may use some form of language model and be considered "AI" in the broadest sense. The scan of a historical book into text that one uses may have been accomplished with some form of AI... This will increase in the future. —DIYeditor (talk) 10:22, 15 April 2023 (UTC)
Rewording one of the positions I gave initially, "Wikipedia is overwhelmingly a work of human authorship, Use of content generated from AI's s (such as LLM based generation) should be used only to meet specific defined goals or narrow technical requirements, under close scrutiny of the Wikipedia community, and subject to compliance with any existing policy, guideline or customary practice by the contributor using the AI tool concerned." ShakespeareFan00 (talk) 10:34, 15 April 2023 (UTC)
Have you read the draft (Wikipedia:Large language models)? —Alalch E. 10:38, 15 April 2023 (UTC)
Out of curiosity, I told GPT-4 to write an article, with citations, about the Belleclaire Hotel in NYC (which doesn't have an article yet). While the AI got a lot of things right, there were a few things I noticed immediately:

The AI wrote an article that is a little promotional in tone. For example, The hotel's location on the Upper West Side of Manhattan makes it a convenient destination for visitors to the city. It is located just steps away from Central Park and many popular museums and attractions, including the American Museum of Natural History and the Metropolitan Museum of Art. If I were a new editor and I submitted this draft to AFC, it would be declined.
While we're at it, the hotel is actually two blocks (not "steps") away from Central Park, and it's across the park from the Metropolitan Museum of Art, so that's also factually wrong.
The AI cited Tripadvisor as a source. Again, this is not really ideal if I were a new editor submitting a draft.

On the whole, the AI got most of the facts correct, but these facts are presented in such a way that the article would need significant revisions. I do not think that a total ban is warranted, but, at the very least, we would have to be very judicious with the use of AI. – Epicgenius (talk) 19:01, 20 April 2023 (UTC)

How do you feel about the sources? Do they directly support what was written by ChatGPT? –dlthewave ☎ 01:55, 21 April 2023 (UTC)
Nope, not a single one of them actually support the text. The NY Times source doesn't even mention the article's subject at all. Given how this is the most reliable of the four sources that GPT gave, it's definitely a red flag. The Historic Hotels of America source doesn't even exist anymore. The NYC Architecture source is about a different building entirely (the Century (apartment building)). And I have no idea what to even say about this Tripadvisor source about a hotel in West Virginia.
Funnily, there are sources that support the text, like this. However, I can definitely say that ChatGPT creates fictitious references, so it's of no use if you're trying to find actual sources. – Epicgenius (talk) 02:21, 21 April 2023 (UTC)

In response to some plans by the EU, to much more tightly regulate Generative AI's and foundation models, potentially in ways that make it far harder for smaller and open source implementers, I'm changing my viewpoint to:-

Total Ban on the use of Generative AI and LLM derived content on all Wikimedia sites, until the regulatory framework is certain, and individual providers are completely transparent about what their models can and cannot do, and what mitigation measures they have taken to ensure appropriate compliance with regulatory requirements. ShakespeareFan00 (talk) 23:08, 20 April 2023 (UTC)

@ShakespeareFan00, how would you implement it? As for me, I see a great potential for AI, as it would be possible to run it through unsourced or expandable content. I personally am on your side on the use of AI content, but my experience is that it will be rather difficult to find consensus to stop or even regulate semiautomated editing. For the regulation WP:MEATBOT or WP:BOTUSE already exist but so far no-one could show me an editor who applied for permission at the BRFA as mentioned at MEATBOT. Paradise Chronicle (talk) 06:42, 21 April 2023 (UTC)

Could you be more specific what are "some plans by EU"? -- Zache (talk) 08:15, 21 April 2023 (UTC)

https://www.theregister.com/2023/04/18/eu_lawmakers_want_ai_regulation/ ShakespeareFan00 (talk) 08:26, 21 April 2023 (UTC)

https://www.reuters.com/technology/eu-lawmakers-call-political-attention-powerful-ai-2023-04-17/ ShakespeareFan00 (talk) 08:27, 21 April 2023 (UTC)

Total ban for me. Even if an LLM is adept, it's not perfect. Limiting its usage to require checking all citations means that, mostly, just as much work is required to do fact-checking by both the editor using the LLM, and other editors who have to be highly suspicious as to the assisted edits. We can and should put more trust in human edits to be accurate over LLMs. I would at minimum require more scrutiny, but that takes effort. A flat ban is how I see the best option being. (I am not watching this page, so please ping me if you want my attention.) SWinxy (talk) 03:27, 3 May 2023 (UTC)

I see LLMs as being a big accessibility tool for allowing users to create better prose than they may be able to otherwise. For this reason I support the use of LLMs. They can't be trusted for facts but they are useful for creating readable text. Immanuelle ❤️💚💙 (talk to the cutest Wikipedian) 16:20, 3 May 2023 (UTC)

Oppose ban, per Aquillion and Immanuelle. The case for a ban is far from weak, but I think LLMs can be judiciously used in ways that benefit the encyclopaedia. The main potential harms are mass LLM edits and article creations, which should (be clarified to) fall under the meatbot policy. I wish this had been kept at VPP and reworded, because there's no way we can reach a local consensus on such a big question. At least, with a central well-attended RfC, we would have a formal close we could point to and say: "there's consensus to ban LLM", or "no consensus", or "consensus not to ban". But there, I think Alalch E. is right that it might be pointless to start a "ban/no ban" RfC since I doubt it would go differently from the last WP:VPP discussion. It just feels like we're headed nowhere right now. DFlhb (talk) 16:36, 3 May 2023 (UTC)
Oppose the ban and would like to quote colleague User:DFlhb from a month ago:

we've likely overreacted and thrown everything but the kitchen sink into that draft; ChatGPT's been out of months, and the LLMpocalypse hasn't happened. And after checking the long WP:VPP discussion on LLMs, I'm not even sure where we got that "mandatory disclosure" idea from, because I'm not seeing any community consensus for it.

I think I couldn't agree more with that. Ain92 (talk) 18:34, 15 May 2023 (UTC)

Focus on types of uses

Continuing the discussion on types of uses: I suggest not having one combined question as in this comment, but separately asking about different categories of tools. For example, there could be analysis tools (such as spell checkers, grammar checkers, reading level analyzers), text generators (such as tools generating text from human prompts), and conversion tools (such as voice-to-text tools, optical character recognition tools, translators). Alternatively, since the text generator category is of most interest, the question could just be about that category: do not use versus use with restrictions, with a non-exhaustive list of potential restrictions. isaacl (talk) 22:11, 17 April 2023 (UTC)

We already allow use of machine translation as long as people fix the issues, so I don't think there's anything new to discuss there. There's been a lot of discussion conflating the various use cases, but they are very distinct. I'm not sure why people are bringing up grammar checkers and voice-to-text tools in a discussion that's primarily and obviously about text generation. None of the use cases mentioned create material that is "wholly or mostly in part from non human sources" except for text generation. Text generation is the only use case that's dramatically changed with the new LLMs, and that's the one that matters and should be discussed (and relatedly is use of images created through e.g. DALL-E though that should be a separate RfC).

There is a bit of a fuzzy line in terms of LLM autocomplete tools and such, but I think that falls into text generation. Galobtter (talk) 04:03, 18 April 2023 (UTC)

Yes, I'm aware of the current guidance for translation. People bring up other uses for technology X because the discussion has been framed as a discussion about technology X (witness the name of this page). Personally I agree on focusing on text generation. I'm not sure what types of autocompletion tools you are considering; if it's more akin to a thesaurus then I'd see it as an analysis tool. I think there may be a divergence in community views for code generation, as I think there are many who see it as a way to extend their ability to write code. isaacl (talk) 07:22, 18 April 2023 (UTC)

About translations, I made translation from PinePhone Pro to Finnish article fi:PinePhone Pro and there is prompts used in the talk page. Substantial difference between translating pages using ChatGPT style software compared to Google translate for example is that translator can do "translate + summarise + restore references + convert links and referenes to local wiki" instead of direct translation which is more useful. Note: The original article was also mainly written by me. -- Zache (talk) 08:14, 18 April 2023 (UTC)

"Micro-hallucinations"

Something I posted at Wikipedia talk:Using neural network language models on Wikipedia, but perhaps better said here:

There are a lot of stories of rather large-scale "hallucinations" (lying/fiction) on the part of ChatGPT, etc., but it's become clear to me just experimenting a bit that every single alleged fact, no matter how trivial or plausible-looking, has to be independently verified. I asked the current version of ChatGPT at https://chat.openai.com/ to simply generate a timeline of Irish history (i.e., do nothing but a rote summarization of well-established facts it can get from a zillion reliable sources) and it produced the following line item:

1014 CE: Battle of Clontarf in Ireland, in which High King Brian Boru defeats a Viking army and secures his rule over Ireland

That's patent nonsense. Brian Boru and his heirs died in the Battle of Clontarf, though his army was victorious. In the larger timeline, it also got several dates wrong (off by as much as 5 or so years).

We need to be really clear that nothing an AI chatbot says can be relied upon without independent human fact-checking. — SMcCandlish ☏ ¢ 😼 06:55, 6 May 2023 (UTC)

The hallucinations are pretty common in the existing chatbots from what I have seen. Sometimes it'll be coherent and seem to stick to facts, but at any point it could wander off into fantasy/lies/fiction. —DIYeditor (talk) 07:31, 6 May 2023 (UTC)

This kind of thing is why the language in the draft like Large language models can be used to copy edit or expand existing text, to generate ideas for new or existing articles, or to create new content is being too kind. Permission shouldn't come first, followed by qualifications and caveats. The hazards come first.

Too much of this draft policy is analogous to saying, "You can go ahead and do Original Research or play journalist in Biographies of Living Persons, as long as you pinkie-promise to be extra careful." XOR'easter (talk) 20:23, 16 May 2023 (UTC)

I feel like many of these concerns presuppose that AI text is riddled with errors (which it often is) while human text is largely correct (which it is not always). I fear that lambasting generative tools as blatant OR while also ignoring the fact that human editors paraphrase/pick-and-choose which facts to report when they edit (more subtle forms of OR, imho) will result in a skewed understanding of this issue.--Gen. Quon_[Talk] 18:55, 17 May 2023 (UTC)

Well, we have WP:CIR for human editors. The big difference between humans and LLMs with regard to this is that human errors usually follow a visible pattern of incompetent behavior (or else are accidents that can be quickly recognized by the person when pointed out). An LLM is not able to actually recognize, identify and correct past errors, even if it can mimic the process of apology fairly well. There's humans that do that too; for the most part, they're indefinitely blocked. _signed,Rosguill ^talk 21:41, 17 May 2023 (UTC)

That sentence is not a blanket permission. The meaning of "can" is as follows: "It is a given—arising from the current state of technology—that large language models can /objectively/ be used to copy edit or expand existing text, to generate ideas for new or existing articles, or to create new content"—Alalch E. 21:32, 17 May 2023 (UTC)

I think this has come up before, I too was initially confused by this sentence until it was explained to me. The intent is good but the wording could use improvement. It's a stretch even to say that LLMs are technically capable of writing article content, it would be more accurate to say that they can generate text that has the appearance of a Wikipedia article.

A different approach would be to open with a detailed description of how LLMs work and explain in the same paragraph that although they have the appearance of intelligence, they don't actually "understand" what they're writing and often produce false information that's difficult to distinguish from fact (AKA "vaguely plausible bullshit"). Most editors on this page take this for granted, but we need to write for folks who have heard amazing things about AI or played around with ChatGPT and are eager to use it on Wikipedia. –dlthewave ☎ 02:09, 18 May 2023 (UTC)

I don't think that's even an accurate statement of what the current technology makes possible. I mean, one might also try "to generate ideas for new or existing articles" using a Ouija board and a bottle of tequila, but to say that the combination "can objectively be used" for the purpose stretches the word can beyond the point of meaningfulness. At any rate, if the sentence needs this much explication, it's not a good line to put in a policy. XOR'easter (talk) 22:41, 18 May 2023 (UTC)

Circling back to getting this into a presentable state

Pinging the major contributors to the current page per Sigma's tool. @DFlhb, Alalch E., and Phlsph7: I think we should return to DFlhb's suggestion in early April and just trim the damn thing down to the absolute minimum so that we can have an RfC and there can be something. I'm going to look at all the LLM- and AI-related policy pages and see if I can come up with some scheme that makes sense (there are about a million of them and they all overlap in bizarre ways). jp×g 09:56, 18 May 2023 (UTC)

~~I support trimming back to something similar to DFlhb's trim.~~ Actually I'm unsure. Maybe start that RfC that was discussed, independent from this as a potential proposal. That's what isaacl suggested and I think DFlhb agreed.—Alalch E. 10:15, 18 May 2023 (UTC)
I still stand by that diff. We need an "allow vs ban" RfC to give us the community's assent. I'd also rather we hold that RfC now & take advantage of there being no LLM-related controversy at WP:ANI, so the !votes are representative and not a heat-of-the-moment under/overreaction. Afterwards we can ask WP:VPI what they think of the current draft, my trim, jpxg's potential rework, or anything else. DFlhb (talk) 21:06, 18 May 2023 (UTC)
Please start the RfC. —Alalch E. 21:40, 18 May 2023 (UTC)
XOR'easter, you said you're on the Ban side. Want to take a stab at it? Feels unfair for me to start an RfC that I'd (likely) oppose. DFlhb (talk) 22:17, 18 May 2023 (UTC)
No. I am, frankly, exhausted by trying to deal with one disaster after another after another while trying to keep an eye on discussions that should have been closed months ago. To be blunt, the last thing I want is to be the whipping-person for an RfC that will have half the people here wanting to run me out on a rail for Luddism. XOR'easter (talk) 22:54, 18 May 2023 (UTC)
To sum up what I wrote earlier, personally I think the RfC should ask if the use of programs to generate text for inclusion in an article is supported. This would include text generated using existing content as an input. I think this is the key area of concern for most people, so I think it would be better to tailor the question to the specific category of tools that is a potential problem. isaacl (talk) 22:32, 18 May 2023 (UTC)
I agree, people including AI-generated texts directly in articles is probably the central issue for most editors. Phlsph7 (talk) 07:50, 19 May 2023 (UTC)
There are so many potential variations and aspects that I don't think this would be well-suited for an RfC. We couldn't even agree on how to phrase it in the last 2 discussions.

If this is where we're at, then let's just trim & go straight to WP:VPI, which is where these nuances are best discussed. If a majority favours a ban, they'll let us know anyway. I now think I was wrong to believe an RfC would help. DFlhb (talk) 10:47, 19 May 2023 (UTC)
To illustrate how the trim makes this dispute easier to resolve, I've tweaked the trim's first bullet point. Only one sentence (in italics) would be in contention, instead of many paragraphs:
LLMs can make things up, produce biased outputs, include copyrighted content, or generate fake or unreliable citations. Never paste LLM outputs directly into Wikipedia. You must rigorously scrutinize all your LLM-assisted edits before hitting "Publish", and you are responsible for ensuring that these edits comply with all policies.

Without the trim, we can hold RfCs on abstract questions, but we'll still struggle to figure out how to turn the RfC results into specific wording. With the trim, anyone who wants to tweak the policy can just put an alternative sentence to an RfC. Removes one layer of abstraction.

Another benefit is that it doesn't explicitly condone any use cases, unlike the current draft. Saves us the need to gauge consensus on which use cases are okay by putting the locus on policy-compliance rather than use cases. I think that makes sense, since I doubt there's such a thing as a "low risk" use case. DFlhb (talk) 11:48, 19 May 2023 (UTC)

The issue is that there hasn't been agreement on having a minimal proposed draft, in part because different editors want to clarify many different aspects based on what they assume the community wants. If community consensus is reached on disallowing text generated by programs, for example, then the resulting policy will be very short with respect to this aspect. isaacl (talk) 15:38, 19 May 2023 (UTC)

There has been very little progress in terms of consensus on how to proceed in the last weeks/months. Trimming it down wouldn't be my first choice. But it's better than keeping the draft in an indefinite state of limbo. As I see it, the important point would be to find a path forward, one way or the other. Phlsph7 (talk) 10:34, 18 May 2023 (UTC)

Taken literally, "allow" vs. "ban" are the two extremes (total green light and total ban) and IMO not good choices for an RFC. What's happening here is a lot of work to refine a proposal for an RFC where I'd guess the RFC would be whether or not to make it a guideline. IMO it should be a separate later question on whether to upgrade it from a guideline to a policy. Sincerely, North8000 (talk) 15:58, 19 May 2023 (UTC)

I suggest a third way forward beside the trim and allow-ban-RfC. Does the current draft have parts with which several editors strongly disagree? If yes, we can consider an alternative version of each controversial part, have a short discussion on it, and then an RfC. We do that until all strong disagreements are sorted out. Then we can propose the draft at WP:VPI. This probably works best if each discussed alternative focuses only on one core issue, for example, by taking a couple of sentences and suggesting how they should be changed.

Compared to the trim, it has the advantage that it does not involve a radical change to what we were already working on all these months. It has the disadvantage that the result will probably be longer and more difficult to manage. Compared to the allow-ban RfC, it has the advantage that we have clearly defined alternatives for the RfCs and therefore a good chance to get a consensus one way or the other. It has the disadvantage that the people participating in the RfCs have less influence since each one affects only a small portion. Phlsph7 (talk) 20:05, 20 May 2023 (UTC)

We do that until all strong disagreements are sorted out After the trim, I was invited to do that, but refused, for reasons later expressed by XOR'easter better than I could have. So far, we've been ineffective at defining problems precisely, and have not just disagreed on the best options, but on what these options even meant. I suggest that we make the "next step" WP:VPI, not further discussions here. Though maybe there are downsides to "rushing over there" that I'm not seeing, so please scrutinise my reasoning. DFlhb (talk) 22:41, 22 May 2023 (UTC)

Note that trimming isn't required before going to VPI. Let's just see what they think; many of the things I disliked have been addressed. DFlhb (talk) 23:41, 22 May 2023 (UTC)

Personally, I'm with you. I just had another look at the draft. It's not perfect but it seems to me to do a decent job at describing how LLMs should not be used, how they can be used, and what dangers this involves. If the proposal fails we may still learn important things on how to improve it and either repropose it or change it into a guideline. Phlsph7 (talk) 09:04, 23 May 2023 (UTC)

I suggest we take a few weeks to finalize a draft that has pretty strong support here, put a notice / invitation to participate at the pump that we are doing that. Then we should do an prominent RFC to accept it as a guideline. (leave the policy idea for later) Also indicate that the RFC is just to put it in, not lock it all in...that further evolution is planned. If possible, folks here should support the result even if it is only 90% how you want it. If the RFC changes into a brainstorming page with a zillion versions, it will die under it's own weight. What do y'all think about that idea? North8000 (talk) 21:35, 22 May 2023 (UTC)

I agree that compromises are necessary if we want to move forward. I'm not sure about the issue of policy vs guideline. So far, the draft was treated as a draft of a policy. We could try to get it accepted as a policy and use the guideline approach as a plan B in case that fails. Phlsph7 (talk) 09:13, 23 May 2023 (UTC)

I have no strong opinions, I was just trying to crystalize something. Also, in the current "crowd source" architecture of Wikipedia, with the usual approaches, something of this scale has about a 2% chance of getting accomplished. I think that the sequence of a lot of work and input to crystalize a single proposal, and then the drafters agreeing to support the outcome even if you only like 90% of it is a way of raising those 2% odds to 90%. North8000 (talk) 13:19, 24 May 2023 (UTC)

Add a header to this page

Should we add "This article is about the use of Large Language Models in Wikipedia. For the article about Large Language Models (general), see https://en.wikipedia.org/wiki/Large_language_model instead?" It would help clear up confusion, especially for people who Google "LLMs" and find this page. Thegamebegins25 (talk) 23:36, 22 May 2023 (UTC)

The idea makes sense. But I checked a few other policies and guidelines, like Wikipedia:Copyrights, Wikipedia:Plagiarism, and Wikipedia:Public domain: they don't have a disambiguation link to the corresponding mainspace articles. Phlsph7 (talk) 09:19, 23 May 2023 (UTC)

Balance

The nutshell summarizes the draft adequately: Use of large language models (LLMs) to aid editing is never recommended over unassisted editing, and is prohibited in some areas and some scenarios. Because of the many risks and pitfalls associated with LLMs, their output must be rigorously scrutinized, and only editors with substantial prior experience in the intended task are trusted to use them constructively. Repeated LLM misuse is a form of disruptive editing.

This view is not balanced. It takes aim at the risks and pitfalls but does not balance it with the opportunity to get work done. One of the complaints is that LLMs may be unbalanced. In view of this it is worrying that this trial at a normative text is so unbalanced. It is worrying but not unexpected. My experience is that GPT4 is much more balanced than the normal Wikipedian. Ettrig (talk) 08:37, 25 May 2023 (UTC)

What makes it "much more balanced" than normal Wikipedians?Harryhenry1 (talk) 04:38, 27 May 2023 (UTC)

GPT4 has been trained to "guess" what the next word is in a really large amount of text. I think Wikipedia is about 4%. It doesn't have any views of its own, neither any affiliations. It has just been trained to emulate what other writers would have done. Not any particular writer. Just those that seem to best suit the context.--Ettrig (talk) 14:19, 27 May 2023 (UTC)

What other changes are needed to make the draft balanced?—Alalch E. 16:04, 27 May 2023 (UTC)

There is absolutely no trial at seeing positive effects of using LLMs. There is no awareness that the LLMs are being improved. There is a consistent effort of forbidding things. Not a few problematic places. Here is one: You may use LLMs for copyediting, summarization, and paraphrasing, but note that they may not properly detect grammatical errors or keep key information intact. Use due diligence and heavily edit the response. This reflects an incorrect view of the current state of LLMs. The reality is that ChatGPT based on GPT4 is much better at summarizing than the average Wikipedian. So what is the motivation for noting that they may not properly detect grammatical errors or keep key information intact? What is the reason to heavily edit the response? I have put about 500 summaries by ChatGPT in Wikipedia, lightly edited. The reaction by the Wikipedia community shows that these are good texts. So far, I have seen no responses to the effect that the summary is factually incorrect. The grammatical changes that I have seen, I do not consider corrections.--Ettrig (talk) 19:07, 31 May 2023 (UTC)

Thanks for that response. So how would you change the draft, according to that reasoning? —Alalch E. 21:30, 31 May 2023 (UTC)

This is an answer to the question but not an actual suggestion: The best LLMs may be used fruitfully for copyediting, summarization, and paraphrasing. Probably many more uses will be found in the future The general principles for contributions to Wikipedia of course hold for users of LLMs. This includes that you can be banned if you cause more harm than good (however that is measured). The special problem with LLMs is that it is possible to produce a lot of text with small effort. If you have a process to do that. It should be used at small scale at first. Then it can be upscaled gradually, if the tests are positive. --Ettrig (talk) 08:43, 1 June 2023 (UTC)

Your answer is pretty much our third basic guideline: 3. You may use LLMs for copyediting, summarization, and paraphrasing.... It doesn't contain the reference to "many more uses ... found in the future". I'm not sure if that should be included already now. The policy could be adjusted in the future once we know what these uses are. Phlsph7 (talk) 11:54, 1 June 2023 (UTC)

Tangential, but AI-generated images are being inserted

It's tangential to this proposal as it currently stands, but AI-generated images are being inserted into articles as illustration. Really speaks to me of the need to work on a Wikipedia:Computer-generated content along side this one, if they are to remain separate. Not to beat a dead horse... —DIYeditor (talk) 16:02, 30 May 2023 (UTC)

Maybe we should merge/ move this to Wikipedia:Computer-generated content. LLM might be too specialized anyway. ? North8000 (talk) 20:45, 30 May 2023 (UTC)

Verifying content

Has anyone else considered using LLMs as a way to speed up verification checks? I do a lot of COI clean up and it is very common for references to be present which don't actually support the content. It would be helpful if statements could be checked more quickly and in a more automated fashion. I just did a test using chatgpt to ask whether this was supported by the reference and it correctly identified that it was not. I suspect that there have probably been similar attempts to create a tool to do this, but I can't think of any specifics. SmartSE (talk) 09:05, 24 May 2023 (UTC)

This is actually a good test of how ChatGPT responds to questions that are beyond its capabilities.

User input:

Please tell me if the following sentence is supported by the reference provided: Prior, she was a consultant to the Arlington Hospital, Virginia (1995-1996) as well as a consultant to the VNA Lancaster County, Pennsylvania (1994-1995). She was also a bioethics consultant to the Virginia Bioethics Network at the University of Virginia (1993-1994).^[1]

ChatGPT output:

I apologize, but I cannot access the content of the reference provided as my training data only goes up until September 2021. Therefore, I cannot verify whether the sentence is supported by the given reference. However, based on the sentence you provided, it states that the person in question was a bioethics consultant to the Virginia Bioethics Network at the University of Virginia from 1993 to 1994, which is not directly related to the Arlington Hospital or VNA Lancaster County."

The first two sentences are correct: ChatGPT cannot verify sources like this. The last sentence happens upon the correct answer by chance but uses completely faulty reasoning, since there is no reason that the two entities that she consulted for would need to be be directly related. (If anyone else has a better way of writing the prompt that produces different results, please share.)

This reminds me of the story about the Texas A&M instructor who used ChatGPT to accuse an entire class of using AI to write their papers. These models will readily respond to questions that they have no way of answering and absolutely should not be relied on for this sort of thing. This is an inherent part of how they work, and they only refuse to do so if the programmers have hard-coded a "guardrail" for that specific scenario. –dlthewave ☎ 13:14, 24 May 2023 (UTC)

@Smartse: please see [1], in particular the second of two scripts there. I let that languish for too long and should have time to turn it into a proper tool soon. Sandizer (talk) 07:57, 1 June 2023 (UTC)

I think such a tool shouldn't just give a yes/no answer; it should present the relevant sections of both texts side by side (our article, and the source). That's what Facebook's Side AI does (see demo). It's open source so you may find inspiration there. DFlhb (talk) 11:57, 3 June 2023 (UTC)

That's a great idea, but the Facebook system doesn't have an article parser, just a giant dataset where someone (crowdworkers?) already decided which article text segment applies to any given citation,[2] which is proving to be a very difficult and crucial problem here. Sandizer (talk) 17:30, 3 June 2023 (UTC)

References

^ "The Rutgers Journal of Bioethics" (PDF). 2019.

Great work

Just wanted to drop a note here and say thanks to everyone who's been working on this - back when I looked at it in March/April I honestly found it hard to work out what it was saying, but now with the upfront 'basic guidance', it seems much clearer, and much more likely that people will be able to read it and take it on board. Look forward to seeing this in an RFC soon!

A couple of thoughts -

For the nutshell, points #1 and #8 from the basic guidance (do not generate article content, do not generate talkpage comments) could reasonably be added to the nutshell in some way - I think these are the really key points to get across to a user who's casually wondering, hey, can I do this, more so than the point about disruptiveness.
Disclosure - perhaps an example of an edit summary would be helpful here - it's a bit odd to say "must disclose" without saying how. (Something like "Section copyedited with ChatGPT"?).

Andrew Gray (talk) 22:56, 5 June 2023 (UTC)

"Do not use LLMs to write your talk page or edit summary comments."

Do not use LLMs to write your talk page or edit summary comments.

What was the reason for adding that? I don't understand. Schierbecker (talk) 02:18, 18 May 2023 (UTC)

My guess is that someone was concerned that some malicious editor could weaponize LLMs to generate walls of text to WP:BLUDGEON the process. Well-intentioned rule, I think, but I think prohibiting all LLM-generated content on talk pages goes too far. Schierbecker (talk) 02:27, 18 May 2023 (UTC)

Also a guess, but allowing either might prevent inter-peer assessment of WP:COMPETENCE. Those items used to be considerably fewer paragraphs, and it was in the one that began "You are responsible for ensuring your use of LLMs does not disrupt Wikipedia," if I remember correctly. Sandizer (talk) 02:31, 18 May 2023 (UTC)

I'm deleting that line until someone can explain their reasoning for why LLM content in edit summaries is worse than the same in article space, Schierbecker (talk) 22:34, 18 May 2023 (UTC)

Because an automated sealioning machine that is able to spit out endless paragraphs of "civil" POV pushing is the last thing that we need? Heaven knows we have enough problems already with debates being "decided" or "resolved" because one faction was the last one standing. LLM-generated content on talk pages is discussion poison. XOR'easter (talk) 22:57, 18 May 2023 (UTC)

And ultimately if someone cannot competently express their view without an AI to do it... they really shouldn't be on Wikipedia at all. Der Wohltemperierte Fuchs ^talk 23:31, 18 May 2023 (UTC)

I could imagine someone who doesn't write in English very well using LLM to clean up their grammar. One hardly wants to say "Sorry, subject of this article, but if you want to tell us about the error in this article, you need to spend the next couple of years studying English first, because otherwise you're incompetent to express your view."

There's an editor I'm kind of wishing would use something like LLM right now. So far, the editor keeps asking if I've ever been uploaded to a website. Since Mind uploading isn't a thing, the answer is no, but I believe that's not really what the editor wants to know. WhatamIdoing (talk) 02:29, 14 June 2023 (UTC)

An automated system for condensing endless paragraphs into a short summary might be helpful. WhatamIdoing (talk) 02:30, 14 June 2023 (UTC)

@Fuzheado was experimenting on something like this. Schierbecker (talk) 03:15, 14 June 2023 (UTC)

So, the admonition forbidding using LLMs to generate talk page comments has been removed, even though the obvious consensus above was against doing so. I'm restoring it. Sandizer (talk) 15:06, 31 May 2023 (UTC)

"Editors should have enough familiarity with the subject matter to recognize when an LLM is providing false information"

This is essentially an impossible bar. Already in baby versions of LLM, the vast majority of Wikipedia editors would not be able to recognize false information from an LLM in any subject,^[a] and it might even be challenging for all but post-docs to spot the false information in the topic of their Ph.D. thesis. As the software gets better, even Ph.D.'s may have to comb through sources to be certain about false information in the topic of their specialization. This sentence appears in § Specific competence is required, and I'm not sure what to do about it, but imho as written, it excludes all editors from using LLM in any subject—which maybe is okay, but in that case it should be stated categorically. Mathglot (talk) 14:11, 13 June 2023 (UTC)

The idea is just that there is an increased likelihood of detection of false information when an editor is familiar with the subject matter. The intended meaning is not that "enough familiarity" guarantees that one will be able to recognize etc. It should be reworded.—Alalch E. 14:27, 13 June 2023 (UTC)

To ensure verifiability, editors would need to comb through sources to confirm that every fact is supported by the cited source which would uncover any false information in the process. This is more dependent on Wikipedia experience than subject matter knowledge, since even a PhD likely wouldn't be able to tell you whether a specific fact appears in a specific source. –dlthewave ☎ 15:38, 13 June 2023 (UTC)

I've removed the paragraph. But what about this alternative (didn't put much effort in it, but my thinking goes something like this):

If editors use an LLM to paraphrase source material or existing article content, they should have some familiarity with the topic to be able to identify whether the meaning has changed along with the wording. Noticing subtle (but possibly quite significant) unintended changes to sourced content could be beyond the ability of an editor, even an experienced one, with no deeper understanding of the topic, despite their best efforts to recheck the claims against the sources and see if any deviations have appeared.

—Alalch E. 15:59, 13 June 2023 (UTC)

The required familiarity includes knowledge gained by reading the sources in the added text. isaacl (talk) 16:48, 13 June 2023 (UTC)

Citations are important to editors, but verifiability isn't contigent on them. The verifiability of statements doesn't depend on whether the material is cited, or if it is, whether the cited source is a good one. Consider:

Smoking cigarettes increases the risk of lung cancer.
Smoking cigarettes increases the risk of lung cancer.<fake ref>
Smoking cigarettes increases the risk of lung cancer.<weak ref>
Smoking cigarettes increases the risk of lung cancer.<high-quality ref>

The claim is verifiable every single time, because I am able to verify that this claim appears in at least one reliable source (e.g., by spending a minute with a web search engine).

When you cite a claim, you're making it easier for other editors to figure out that the claim is able to be verified, but it is still verifiable whether you make it easy for them or not. (Making it relatively easy is often required by policy.) When another editor determines that the material in the Wikipedia article matches the material in the cited source, that makes it "verified", not "verifiable". WhatamIdoing (talk) 02:10, 14 June 2023 (UTC)

When editors add text without the assistance of a program, they must understand the topic well enough to know that what they are writing is accurate and to be able to provide appropriate citations. The same remains true when a program is used. isaacl (talk) 16:48, 13 June 2023 (UTC)

I agree, but there are layers of complexity here. I can tell that some things are unverifiable at a glance: HIV really has been scientifically proven to cause AIDS, measles vaccines really do not cause autism, horse de-wormer really has not been scientifically proven to help COVID-19 patients (unless they have intestinal parasites as well, I guess), etc., so claims to the contrary are not verifiable. I don't have to do a detailed review of sources to figure out whether smoking cigarettes increases the risk of lung cancer.

There are other things that I don't know off hand, but that I would expect to be able to find a source for, and there are claims that could be true but different sources have different information. Last I checked, Cancer gave two different numbers for the percentage of cancer deaths caused by tobacco. They can't both be right, but they are both verifiable, they are both cited, and they are both directly supported by high-quality reliable sources.

I think it's helpful to know the subject area, but it's also helpful to understand your own limits. WhatamIdoing (talk) 02:20, 14 June 2023 (UTC)

The point is there isn't a special exception because you used a program to help you write the text. For any submission you make, you're responsible for ensuring the content is verifiable. If you include a citation to a source, you have to read it and understand it sufficiently to know that your content is adequately backed up by the source. isaacl (talk) 02:30, 14 June 2023 (UTC)

I would agree with this. Mathglot (talk) 02:08, 15 June 2023 (UTC)

Yes, I agree with this, too. In Germany, many stores post signs that say "Eltern haften für ihre Kinder" ("Parents are liable for their children"). I think the fundamental rule of a wiki is "Editors are liable for their edits". Whatever method you use to make the edit, you have to stand behind it. There is no get-out-of-responsibility-free card for any tool – script, bot, LLM, or anything else. WhatamIdoing (talk) 03:08, 15 June 2023 (UTC)

"There isn't a special exception because you used a program to help you write the text" I think this is the best way to get the point across. Maybe I'm reading too far into it but the "familiarity with the subject matter" idea seems like it might lead to editors claiming that otherwise-competent editors can't use LLMs because they don't meet some arbitrary knowledge threshold or, conversely, that subject matter experts have free rein to do so. The lead already includes "As with all their edits, an editor is fully responsible for their LLM-assisted edits" which is in keeping with our current standards that folks are already familiar with and encompasses the necessary level of familiarity without coming out and saying it outright. –dlthewave ☎ 03:58, 15 June 2023 (UTC)

As noted above, the paragraph in question has been removed by Alalch E.. Looking at the nutshell at the top, it appears to me to retain a trace of the removed text in summary form; unless I'm reading it wrong or it was intended to summarize some other part of the page, a portion of the nutshell should be removed as well, or reworded to more clearly represent what it's trying to convey. Mathglot (talk) 02:04, 15 June 2023 (UTC)

There was indeed a trace of that sort in the lead, and I've removed it, but I'm not seeing it in the nutshell. That part of the nutshell has always referred only to the first paragraph of "Specific competence is required".—Alalch E. 15:55, 15 June 2023 (UTC)

notes

^ Anticipating huffy responses on the order of, "Nonsense; I've found dozens of examples, it's easy!" I would just say, "Sure, me too", but also that anyone editing this page is a very highly self-selected group, and a tiny, tiny minority of the 120,097 active users. And also that LLM is in its infancy, and our hubris will soon be challenged by future versions. Remember 1997.

Interweaving of things that should be policy and things that should be guidelines

This page for the most part would fall into being an editing guideline, maybe a content guideline. But these bits should probably be in a policy section(s):

Required disclosure in edit summary
Do not use to write talk page comments or edit summaries
Perhaps the bit that articles written solely by an LLM with no useful content are candidates for deletion, but first we need to clarify what "candidate for deletion" means here.

Snowmanonahoe (talk · contribs · typos) 00:23, 13 June 2023 (UTC)

I think you'll find that the difference between policies, guidelines and essays is more subtle and obscure than that.

(Fun fact: The policy/guideline page that uses words like must and do not the most is a guideline.) WhatamIdoing (talk) 02:22, 14 June 2023 (UTC)

That essay helpfully lists a bunch of things that are not the answer to the question in the title, without answering it. The interpretation of the difference I have reached, while probably imperfect, is "while both policies and guidelines are subject to common sense, policies are far less likely for such exceptions to be necessary". There are many potential reasons why you may not want to follow the exact prescripted article structure of Wikipedia:Manual of Style/Video games. On the other hand, there are very few potential reasons you would not want to be civil. I think that applies here. We should make it clear that disclosure of LLM use in an edit summary, and using it only for content and not discussion, are both bright-line, absolute requirements. Snowmanonahoe (talk · contribs · typos) 22:25, 14 June 2023 (UTC)

I'm concerned that both of those are unenforceable in practice.

On edit summaries:

How would you know if I were posting LLM-generated content? How could you prove it if I denied it? How would a new editor discover that this rule exists, so that they could choose to comply with it?

What would you do with the information in the edit summary? How do you imagine people using that information in actual practice? What if I decided to engage in Malicious compliance and used a script to include "This edit may or may not have used LLM" in every edit summary? Or that I claimed all my edits involved LLMs, since you propose no rule against lying about that? What if I use an LLM but then heavily modified it, so it's mostly my own work? Should I claim the untruth and pretend it's all LLM-generated content, with ensuing copyright complications? Explain in detail?

On talk pages:

All of the above about how you could know, but this injunction seems to be driven by a fear of verbosity. Have you heard what Pascal wrote, "Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte" ("I have made this longer than usual because I have not had time to make it shorter")? What if people used LLMs to make their points more clearly and concisely? I believe that a look through my own contributions would prove to any impartial observer that verbosity is not the exclusive domain of LLMs. What are you trying to achieve? WhatamIdoing (talk) 03:03, 15 June 2023 (UTC)

They will be unenforceable, but for now at least, unedited LLM output is pretty easy to recognize. I disagree that policies have to consider every potential edge case and rules-lawyering tactics that could be used on it—part of why we have IAR and dislike the word "rule" is so that we do not have to consider those things. The issue is not copyright, as the US doesn't recognize LLM output as copyrightable (see Commons:Template:PD-algorithm). The issue is that LLM edits need to be scrutinized. Very few people will listen to this, just like very few people read the guide to appealing blocks. That doesn't make it not useful. Rather, it points out editors that do make an attempt to listen.

As for your second paragraph, You could have a point there. I wasn't the only who originally wrote that into the draft for the record. Snowmanonahoe (talk · contribs · typos) 14:50, 19 June 2023 (UTC)

@Snowmanonahoe, there is a copyright problem with me creating copyrighted material myself and then telling you it's PD-algorithm.

How can you recognize unedited LLM output? If I posted 10 sentences here, and asked you to tell me which ones were generated by LLM, how many do you think you'd correctly identify? WhatamIdoing (talk) 18:29, 1 July 2023 (UTC)

There are tools that try to detect gpt (see Wikipedia talk:Large language models/Archive 3#Detection tools, for a list). I'm not sure the performance is so great, but its worth noting that these models often work by predicting how likely a world is given previous words so the act of generating sentences and identifying output are in some sense linked (see perplexity) Talpedia 19:34, 1 July 2023 (UTC)

So the more typical your writing style – and for LLMs trained on the English Wikipedia, the more appropriate and Wikipedia-ish your writing style – the more likely you are to be accused of using an LLM.

Have you ever tried running your own contributions through those detectors? WhatamIdoing (talk) 08:19, 2 July 2023 (UTC)

Indeed. I have not played with these tools at the moment. Talpedia 08:26, 2 July 2023 (UTC)

I tried one of them just now, on five paragraphs I've written recently. (It wouldn't take a single sentence.) In two cases, involving weird/niche subjects, it said that it was 100% human written. It gave one a middling score (~60% GPT) and two were declared to be 100% GPT.

Flipping a coin would have given the same results. WhatamIdoing (talk) 08:46, 2 July 2023 (UTC)

It's more obvious in large sections of text. This rewrite of the Spades, Indiana article by CreatrixInspirata (contributions) was almost certainly using LLM output: https://en.wikipedia.org/w/index.php?title=Spades%2C_Indiana&diff=1148265516&oldid=943913673 Rjjiii (talk) 20:35, 1 July 2023 (UTC)

@Rjjiii, what makes you think that it's LLM-generated? The editor who posted that also says "It's discouraging putting the time only to have your effort changed back to a stub article". That doesn't sound like a likely response from someone using an LLM. WhatamIdoing (talk) 08:18, 2 July 2023 (UTC)

@WhatamIdoing: A few things. I came across this at the Teahouse and suggested they reach out to the user who reverted and explained some issues with their sourcing (https://en.wikipedia.org/w/index.php?title=Wikipedia:Teahouse&diff=prev&oldid=1148272457). The bit about the church is, in my opinion, the kind of hallucinations that LLM will generate with limited data. I tried to follow up the dead link a few ways (thinking I could help a new editor with sourcing and research) but couldn't find any evidence that the church ever existed. I pulled up Google Maps and located the churches in this area. None quite match that description. The closest actually has a plaque on the building visible in "street view" from when (going by memory so I may be wrong) it was previously a school. The church really stood out to me but that kind of hallucination is present in some of the other sources as well. I think what happens when you attempt to point an LLM at a niche topic is that it has to fuse known tropes and expectations. I also don't think this person was malicious. They didn't announced their usage of an LLM, but they also didn't try to hide what they were doing. I think this is a good faith attempt to create an article using AI; it just didn't work. Rjjiii (talk) 08:33, 2 July 2023 (UTC)

Trying to independently verify the alleged facts sounds like it will be a more promising detection approach than language-oriented models. (That Teahouse link doesn't lead to the Teahouse; I think you might be missing a digit at the end.) WhatamIdoing (talk) 08:49, 2 July 2023 (UTC)

I fixed it, but you were too quick for me (https://en.wikipedia.org/w/index.php?title=Wikipedia:Teahouse&diff=prev&oldid=1148272457). Yeah, the writing is sufficient enough in terms of grammar and syntax. It's only the conceptual stuff that seems like such a giveaway. Another example from that diff is the Ripley County Historical Society which has the nonsense URL: https://www.rchslib.org/ The diff includes a source from ripleycountyhistory.org which would make enough sense for their address, but it's another fabrication. Rjjiii (talk) 08:58, 2 July 2023 (UTC)

This suggests that defending against LLM-generated content is going to require extra time and effort on the part of RecentChanges reviewers. As speed is something that group values, this is not going to be popular with them.

This makes me think we're on the wrong track entirely with this page. We don't need a guideline that nobody will follow; that's just security theater. We need tools that can "read" contributions, flag broken URLs/bad domain names, and run a few web searches to see whether the claims are plausible. Imagine something that marries the text-chunking abilities of our existing copyvio checks with a web search. WhatamIdoing (talk) 09:24, 2 July 2023 (UTC)

Unblock requests

CAT:RFU (unblock request) patrollers such as myself are more frequently seeing unblock requests generated by large language models. These are uniformly terrible, basically content-free and failing to address the reason for the block. The basic guidance described on the draft would already help (assuming people followed it... ahem), but it might be helpful to expand on point 8 ("Do not use LLMs to write your talk page or edit summary comments."), perhaps to "Do not use LLMs to write your talk page or edit summary comments or make unblock requests." This is a polite suggestion and I will not be at all put out if you think this isn't a great idea. --Yamla (talk) 19:00, 16 June 2023 (UTC)

I've seen a fair share of LLM-generated unblock requests as well and I think making the unblock part explicit is a good idea so that there is no confusion on that point when people read the page. That it can be pointed to and make clear that it's not merely the admin's opinion that the unblock request shouldn't be generated in that way, but that the community consensus is that you should not use LLMs to generate an unblock request. - Aoidh (talk) 19:06, 16 June 2023 (UTC)

If the requests are poor quality, it doesn't matter whether or not a tool was used to help with their creation. I don't think it's a good idea to try to list all the ways poor quality posts can be used: the list is neverending, and it may give the impression that the list is all-inclusive. isaacl (talk) 20:19, 16 June 2023 (UTC)

While it doesn't need to be an exhaustive list and it's not an attempt to include all scenarios, it should include the more common scenarios. Using LLMs to write unblock requests is becoming very common. In terms of non-article LLM usage it is by far the most common usage I've encountered. - Aoidh (talk) 20:50, 16 June 2023 (UTC)

People have always been writing content-free appeals. We should focus on rejecting them quickly, regardless of how they were written. We shouldn't have to create meta-instructions saying please follow the instructions for X, Y, and Z. We already have guidance on how to write appropriate appeals, talk page comments, edit summaries, and so forth. isaacl (talk) 20:58, 16 June 2023 (UTC)

As these requests are made on Talk pages this seems to already fall under the current language in this draft: "Do not use LLMs to write your talk page or edit summary comments."

With that said, I think it might be helpful to more explicitly state that the guidance is about communicating with other editors. For example, we could write: "Do not use LLMs to communicate with other editors e.g., write edit summaries, make requests or suggestions on Talk pages." ElKevbo (talk) 21:17, 16 June 2023 (UTC)

People who make LLM-generated unblock requests won't read this page anyway. This might merit inclusion on Wikipedia:Guide to appealing blocks though. Snowmanonahoe (talk · contribs · typos) 22:10, 16 June 2023 (UTC)

WP:BEANS applies. Schierbecker (talk) 00:10, 17 June 2023 (UTC)

It's only fair to let editors know they need to acknowledge why they were blocked and commit to change in their own words, so they don't just keep plugging new prompts into ChatGPT until it comes up with something convincing.

But there are also bigger-picture considerations: If they're using a LLM to write unblock requests, there's a high likelihood that they're also using it to edit articles. This would be a good opportunity to inform them of our policy and also check their edits for factual accuracy. –dlthewave ☎ 02:59, 17 June 2023 (UTC)

Hate to sound like a dick, but I want the kind of people who would use LLMs for unblock requests to actually do so, and to get rejected for it. It's a great way to detect if a person has poor judgment and shouldn't be trusted to edit articles. Aoidh's argument is better, but I think it's better to keep this unwritten; the current wording is general enough to give legitimacy to these admin actions, without needing to spell it out. WP:BEANS applies, because if we spell it out, those people will still use LLMs, and just paraphrase it to make it less recognizable. We would lose a valuable way to detect WP:CIR and nip it in the bud. Unwritten rules are rarely appropriate, but I just can't imagine any valuable editors doing this, so we might as well preserve this "tell". DFlhb (talk) 12:07, 17 June 2023 (UTC)

I have to say, this is a pretty compelling argument. Also, it made me laugh. Thanks, DFlhb. :) --Yamla (talk) 18:12, 17 June 2023 (UTC)

+1 Snowmanonahoe (talk · contribs · typos) 18:59, 17 June 2023 (UTC)

I have to agree that this is a persuasive argument and completely valid on all points. - Aoidh (talk) 02:35, 18 June 2023 (UTC)

On procedural grounds, one could argue that using CHATGPT in such circummstances is a de facto demonstration of Lack of WP:COMPETENCE. {The poster formerly known as 87.81.23.195} 46.65.228.117 (talk) 04:39, 18 June 2023 (UTC)

Agree MM (Give me info.) ^(Victories) 18:51, 25 June 2023 (UTC)

[1] "The Rutgers Journal of Bioethics" (PDF). 2019.

[2] Anticipating huffy responses on the order of, "Nonsense; I've found dozens of examples, it's easy!" I would just say, "Sure, me too", but also that anyone editing this page is a very highly self-selected group, and a tiny, tiny minority of the 120,097 active users. And also that LLM is in its infancy, and our hubris will soon be challenged by future versions. Remember 1997.

[1]

[a]