
Wikipedia:Bots/Requests for approval/KiranBOT 12



Operator: Usernamekiran (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 15:59, Tuesday, September 24, 2024 (UTC)

Function overview: update Accelerated Mobile Pages (AMP) links to their normal (canonical) equivalents

Automatic, Supervised, or Manual: automatic

Programming language(s): Python (pywikibot)

Source code available: github repo

Links to relevant discussions (where appropriate): requested at BOTREQ around 1.5 years ago: Wikipedia:Bot requests/Archive 84#Accelerated Mobile Pages link eradicator needed, and village pump: Wikipedia:Village_pump_(technical)/Archive_202#Accelerated_Mobile_Pages_links, recently requested at BOTREQ a few days ago: special:permalink/1247505851.

Edit period(s): either weekly or monthly

Requested edit rate: 1 edit per 50 seconds.

Estimated number of pages affected: around 8,000 for now, though that is a rough upper estimate (on the order of thousands of pages); more will be handled later as new AMP links come in.

Namespace(s): main/article

Exclusion compliant (Yes/No): yes (for now); if required, that can be changed later

Function details: using extensive regex patterns, the bot looks for AMP links, avoiding false matches with incidental "amp" strings in domains, e.g. yamaha-amplifiers.com. After finding a link and building its updated form, the bot checks whether the new/updated link is working; if it gets a 200 response code, the bot updates the link in the article. Otherwise, the bot adds the article title and the (non-updated) link to a log file (this can be saved to a log page as well). —usernamekiran (talk) 15:59, 24 September 2024 (UTC)[reply]
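
For illustration, a minimal sketch of the detection and verification steps described above; the regex patterns and helper names below are examples for this discussion, not the bot's actual code:

import re
import requests

# Match common AMP markers as path/query components rather than as part of a
# domain word, so that e.g. "yamaha-amplifiers.com" is not a false positive.
AMP_PATTERN = re.compile(
    r"(?:^https?://amp\.)"                # amp. subdomain
    r"|(?:/amp(?:/|$))"                   # /amp/ path segment
    r"|(?:[?&]amp(?:=true|=1)?(?:&|$))",  # ?amp=true style query parameter
    re.IGNORECASE,
)

def looks_like_amp(url: str) -> bool:
    return bool(AMP_PATTERN.search(url))

def resolves_ok(url: str) -> bool:
    # treat a 200 response to a HEAD request as a working link
    try:
        resp = requests.head(url, allow_redirects=True, timeout=10)
        return resp.status_code == 200
    except requests.RequestException:
        return False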

  • addendum: I should have included this already, but I forgot. In the BOTREQ and other discussions, an open-source bot, AmputatorBot (on GitHub), was discussed. That bot has many functions that are irrelevant for Wikipedia; the only relevant feature is removing AMP links. For this, AmputatorBot relies on a database storing a list of ~400k ~200k AMP links, and another list of the canonical links for those AMP links. Maintaining that database, and the never-ending list of links for Wikipedia, is not feasible. The program I created uses comprehensive regex patterns, and it also handles archived links gracefully. —usernamekiran (talk) 17:50, 28 September 2024 (UTC)[reply]

Discussion

  • "Maintaining this database, and the never-ending list of links for Wikipedia is not feasible": but you wouldn't have to maintain this database yourself, right, if the authors of that GitHub repo already do, or have made it available?
  • "The program I created utilises comprehensive regex patterns. It also handles the archived links gracefully." Would you mind providing those patterns here for evaluation?

Aside from that, happy for this to go to trial. @GreenC: any comments on this, and does this fall into the scope of your bot? ProcrastinatingReader (talk) 10:40, 29 September 2024 (UTC)[reply]

  • I will soon post the link to github, and reasoning for avoiding the database method. —usernamekiran (talk) 13:21, 29 September 2024 (UTC)[reply]
    @ProcrastinatingReader: Hi. Yes, the author on GitHub has made it available, but I think the database has not been updated in four years (I am not sure, though), and I could not find the database itself. If we used the database, the bot would not process "unknown" AMP links that are not in it, in which case we would have to fall back on the method we are currently using anyway. The overall process would also be more resource-intensive, I think: 1) search for AMP links in articles; 2) if an AMP link is found, look it up in the database; 3) find the corresponding canonical link; 4) replace it in the article. Even if the database is being maintained, we would have to keep it updated and add our new findings to it. I think this simpler approach would be better. KiranBOT at github, AmputatorBot readme at github. Kindly let me know what you think. —usernamekiran (talk) 19:50, 29 September 2024 (UTC)[reply]
    PS: I notified GreenC on their talk page. Also, in the script I added more comments than I usually do, and the script was written over several days/in parts, so the commenting might feel a little odd. —usernamekiran (talk) 19:54, 29 September 2024 (UTC)[reply]
    This sounds like a good idea. I ran into AMP URLs with the Times of India domains and made many conversions; it seemed site-specific. For example, m.timesofindia.com became timesofindia.indiatimes.com, and "(amp_articleshow|amp_videoshow|amp_etphotostory|amp_ottmoviereview|amp_etc..)" had the "amp_" part removed. Anyway, I'll watchlist this page; feel free to ping me for input once test edits are made. -- GreenC 23:42, 29 September 2024 (UTC)[reply]
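
    Purely for illustration, a site-specific rule of the kind GreenC describes might look like the hypothetical helper below; the exact per-site mapping would have to be confirmed before use:

import re

def fix_toi_amp(url: str) -> str:
    # m.timesofindia.com -> timesofindia.indiatimes.com (per the example above)
    url = url.replace("m.timesofindia.com", "timesofindia.indiatimes.com")
    # drop the "amp_" prefix from path segments such as amp_articleshow
    return re.sub(r"/amp_(articleshow|videoshow|etphotostory|ottmoviereview)", r"/\1", url)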
  • @ProcrastinatingReader: if there are no further questions/doubts, is a trial in order? I am sure about one issue related to https, but I think we should discuss it after the trial. —usernamekiran (talk) 15:16, 2 October 2024 (UTC)[reply]
  • {{BAG assistance needed}} —usernamekiran (talk) 08:42, 5 October 2024 (UTC)[reply]
    Reviewing the code, you're applying a set of rules (amp.domain.tld → www.domain.tld, /amp/ → /, ?amp=true&... → ?...) and then checking the URL responds with 200 to a HEAD request. That seems good for most cases, but there are going to be some instances where the site uses an unusual AMP URL mapping and responds with 200 to all/most/some invalid requests, especially considering we are following redirects (but not updating the URL to the followed redirect). It also will not work for the example edit from the BOTREQ? I don't know how to solve this issue without some way of checking that the redirected page actually contains some of the content we are looking for, or access to a database of checked mappings. Maybe the frequency of mistakes will be low enough for this to not be a problem? I am unsure. Any thoughts from others? — The Earwig (talk) 16:10, 5 October 2024 (UTC)[reply]
    These are good points. Soft-404s and soft-redirects are the biggest (but not only) issues with URL changes. With soft-404s, you first process the links without committing changes, log redirect URLs, see which redirect URLs are repeating, manually inspect them to see if they are a soft-404; then process the links again with a trap added to treat the identified soft-404s as a dead link. Not all repeating redirects are soft-404s but many will be, you have to do the discovery work. For soft-redirects, it requires foreknowledge based on manual inspections, like the Times of India example above. URL changes are difficult for these reasons, and others mentioned in WP:LINKROT#Glossary. -- GreenC 17:53, 5 October 2024 (UTC)[reply]
    @GreenC any suggestions on logic/algorithm? I will try to implement them. I don't mind further work to perfect the program. —usernamekiran (talk) 20:32, 6 October 2024 (UTC)[reply]
  • @GreenC, ProcrastinatingReader, and The Earwig: I updated the code, and tested it on a few types of links (that I could think of), as listed in this version of the page, diff of the fix. Kindly suggest more types/formats of AMP links, and any updates to the code. —usernamekiran (talk) 02:49, 31 October 2024 (UTC)[reply]
    1. I see you log failed cases. If not already, also log successes (old url -> new url), in case you need to reverse some later (new url -> old url).
    2. One way to avoid the problems noted by The Earwig is simply skip URLs with 301/302 headers. Most soft-404s are redirect URLs. With the exception of http->https, those are OK. You can always go back and revisit them later. One way to do this is log the URL "sink" (the final URL in the redirect chain), then script the logs to see if any sinks are repeating.
    -- GreenC 04:19, 31 October 2024 (UTC)[reply]
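
    A rough sketch of how suggestions 1 and 2 could fit together, assuming the requests library is used for the HTTP checks; the names and logging below are illustrative, not the bot's actual code:

import collections
import requests

success_log = []                      # (old_url, new_url) pairs, so edits can be reversed later
sink_counts = collections.Counter()   # final URL ("sink") of each redirect chain

def check_cleaned(old_url: str, new_url: str) -> bool:
    resp = requests.head(new_url, allow_redirects=True, timeout=10)
    final_url = resp.url
    sink_counts[final_url] += 1
    if resp.history:  # the cleaned URL redirected somewhere
        http_to_https = (new_url.startswith("http://")
                         and final_url == "https://" + new_url[len("http://"):])
        if not http_to_https:
            return False              # skip plain redirects for now; revisit later
    if resp.status_code == 200:
        success_log.append((old_url, new_url))
        return True
    return False

# afterwards, repeating sinks are candidates for manual soft-404 inspection:
# for sink, count in sink_counts.most_common(20):
#     print(count, sink)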
    okay, I will try that. —usernamekiran (talk) 17:41, 11 November 2024 (UTC)[reply]
  • {{BAG assistance needed}} I made a few changes/additions to the program. In summary: 1) if the original URL works but the cleaned URL fails, saving is skipped; 2) if the AMP URL and the cleaned URL both return non-200, the cleaned URL is saved; 3) if the cleaned URL results in a redirect (301 or 302), and the final URL after redirection differs from the original AMP URL's final destination, saving is skipped. All events are logged accordingly. I think we are good for a 50-edit trial. courtesy ping @GreenC: —usernamekiran (talk) 05:51, 16 November 2024 (UTC)[reply]
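    For reference, the three rules summarised above condense to roughly the following sketch; status() and final_destination() are hypothetical helpers standing in for HEAD requests, not functions from the bot's code:

def decide(amp_url: str, cleaned_url: str) -> str:
    amp_status = status(amp_url)            # hypothetical: HTTP status of the original AMP URL
    new_status = status(cleaned_url)        # hypothetical: HTTP status of the cleaned URL
    if amp_status == 200 and new_status != 200:
        return "skip"   # 1) don't replace a working link with a broken one
    if amp_status != 200 and new_status != 200:
        return "save"   # 2) both dead: prefer the cleaned form
    if new_status in (301, 302) and final_destination(cleaned_url) != final_destination(amp_url):
        return "skip"   # 3) the redirect lands somewhere other than the original's destination
    return "save"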
    Just noting this has been seen; I'll give GreenC a few days to respond but otherwise I'll chuck this to trial if there is no response (or a favourable response). Primefac (talk) 20:39, 17 November 2024 (UTC)[reply]
    Hi. Given the large number of pages affected, and the potential of breaking references (essentially breaking WP:V) in case there is some issue, I don't want to take any chances. So no hurry on my side either. —usernamekiran (talk) 13:23, 20 November 2024 (UTC)[reply]
    I think it would be easier to error-check if you were able to make 10 edits on live pages. If those go well, then 10 more, and so on, going through the results, manually verifying, and refactoring edge cases as they arise, before moving to the next set. We should know by 50 edits total how things are. In that sense, approval for 50 trial edits would make sense, User:Primefac. -- GreenC 17:11, 20 November 2024 (UTC)[reply]
    Yes, I was thinking the same. I tested the program on Charles III, and a few other pages, but I'm still doubtful about various possibilities. Even if approved, I plan to go very slowly for the first few runs, and only after thorough scrutiny will I run it normally, at 1 edit per 5 seconds. —usernamekiran (talk) 10:22, 21 November 2024 (UTC)[reply]
    Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Please follow the time frame set out by GreenC - you do not necessarily have to tag this with {{BotTrialComplete}} after each grouping of 10 (that would get a little silly), but post the results of each group here so that others may review. For the sake of expanded viewing, please do not mark the edits as minor. Primefac (talk) 11:36, 21 November 2024 (UTC)[reply]
  • Trial complete. 54 edits. I apologise; I somehow missed the "don't mark edits as minor" instruction, but I manually checked each edit soon after saving the page, and reverted the problematic edits immediately. I also miscalculated my previous edit count and thought I had 15 edits left (when only 10 were left), so I accidentally performed almost 55 edits. In the earlier edits there were a few minor issues, which I resolved. In the final run, marked as BRFA 12.7, there was only one issue: when a web.archive.org URL was in question, the bot was sending HEAD requests to the bare/non-archive URLs. That resulted in two incorrect updates: 1, and 2. I have resolved this issue. In this edit, the old/AMP URL is not functional, but the updated/cleaned URL works. Requesting another trial for 50 edits. courtesy ping @GreenC, ProcrastinatingReader, and The Earwig:. Also, should we create log pages on Wikipedia to document failures/skips, and sinks (on separate pages)? —usernamekiran (talk) 18:36, 13 December 2024 (UTC)[reply]
    I checked every edit. Observations:
    1. In Islamic State line 480, there is a mangling problem; oddly, the URL still works in this case, but it should not happen.
    2. In Afghanistan in first edit, broken archive URL.
    3. In Oasis (band) in first edit, removed only some of the AMP links
    4. In Kamala Harris in first edit, broken archive URL
    5. In Islamic State in first edit, broken archive URL
    6. In Argentina in first edit, broken archive URL
    7. In FC Barcelona in diff, a couple broken archive URLs
    8. In FC Barcelona in diff, another broken archive URL
    9. In Syria in diff, added extraneous curly brackets to each citation
    10. In Charles III first edit, broke the primary URL
    11. In Islamic State in diff, broken archive URL
    12. In Anime in diff, broken archive URL(s)
    13. In Bill Clinton in diff, broken archive URL
    14. In Kanye West in diff, broken primary and archive URLs
    15. In Lil Wayne in first edit, both the new and old primary URL are 404. There is no way to safely migrate the URL in that scenario.
    16. In Lebanon in line #198, the primary and archive URL are mangled
    17. In Nancy Pelosi in diff, broken archive URL
    18. In Charles III in diff, mangled URLs
    Suggestions:
    1. Before anything further, please manually repair the diffs listed above. Please confirm.
    2. When using insource search it will tend to sort the biggest articles first. This means the bot's early edits, the most error prone, will also be in the highest profile articles, often with the most changes. For this reason I always shuf the list first, to randomize the order of processing, mixing big and small articles randomly.
    3. Skip all archive URLs. They are complex and error-prone. When modifying an archive URL, the WaybackMachine returns a completely different snapshot; it might not exist at all, or contain different content. Without manually verifying the new archive URL, or access to the WM APIs and tools, you will be lucky to get a working archive URL. There is no reason to remove AMP data from archive URLs; it does not matter. (A minimal skip check is sketched after this list.)
    4. Manually verify every newly modified URL is working, during the testing period.
    -- GreenC 19:56, 13 December 2024 (UTC)[reply]
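
    As mentioned in suggestion 3 above, a minimal skip check for archive URLs could look like the sketch below; the host list is an assumption and could be extended:

ARCHIVE_HOSTS = ("web.archive.org",)   # other archive hosts could be added here

def is_archive_url(url: str) -> bool:
    return any(host in url for host in ARCHIVE_HOSTS)

# in the main loop:
# if is_archive_url(url):
#     continue   # leave AMP markers inside archived snapshot URLs untouched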
    Thanks for doing the work here, and agree with these suggestions. This is too high of an error rate to proceed without changes. I'm particularly confused/concerned about what happened on Syria with the extra curly braces. — The Earwig (talk) 21:52, 13 December 2024 (UTC)[reply]
    @GreenC and The Earwig: I have addressed most (almost all) of the issues that arose before the "12.7" run of the trial, including the extra curly brackets issue that Earwig pointed out; that has been taken care of. The WaybackMachine/archive handling is difficult. Regarding Lil Wayne, I had specifically coded the program to update the URL if both URLs end up as 404. I am not sure what you meant by Lebanon/line 198; I could not find any difference around line 198 or nearby. Even after the approval/trial period, I will set a cap on the maximum number of edits, and I will be checking every edit until I am fully confident that it is okay to run unsupervised. I should have mentioned this when I posted "trial finished": I have included one more functionality (in the edits with summary including 12.7): when the program finds AMP characteristics in a URL, it fetches the HTML of that page and looks for AMP attributes; only if they are present is the URL repaired. I have also added functionality to look for the canonical/non-AMP URL on the page itself. Only if it is not found does the program try to repair the URL manually and then test the repaired URL. Should I update the code to skip updating a URL if both the old and new URLs are 404? I can keep working on/improving the program with dry runs if you'd like. —usernamekiran (talk) 17:16, 14 December 2024 (UTC)[reply]
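
    A sketch of the "check the page itself" step described above, assuming the HTML is fetched with requests and parsed with BeautifulSoup (the bot's actual implementation may differ); the amp attribute on the html tag and the rel="canonical" link come from the AMP page format:

import requests
from bs4 import BeautifulSoup

def canonical_from_page(url):
    # Return the page's declared canonical URL if it is an AMP page, else None.
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    html_tag = soup.find("html")
    is_amp = html_tag is not None and (html_tag.has_attr("amp") or html_tag.has_attr("⚡"))
    if not is_amp:
        return None                      # no AMP attribute on <html>, nothing to repair
    link = soup.find("link", rel="canonical")
    if link and link.get("href"):
        return link["href"]              # the page names its own non-AMP version
    return None                          # fall back to repairing the URL manually, then HEAD-test it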
    Can you confirm when you repair the errors listed above? That would mean manually editing each of those articles and fixing the errors the bot introduced during the trial edit. -- GreenC 20:11, 14 December 2024 (UTC)[reply]
    Since you are using Pywikibot and this is a complex task, you can make things more controlled by using pywikibot.showDiff for trials. This way you can review the diffs before saving any changes. Additionally, if this trial is extended, you could use the input() function to create an AWB-like experience. This allows you to confirm whether to save changes, which helps prevent mistakes during actual edits. While a dry run is usually the best approach, I prefer this method for similar tasks.
if changes_made:
    print(f"Changes made to page: {page.title()}")
    # showDiff() prints the colour-coded diff itself, so no extra print() is needed
    pywikibot.showDiff(original_text, updated_text)
    response = input("Save? (y/n): ")
    if response.lower() == "y":
        page.text = updated_text
        page.save(summary="removed AMP tracking from URLs [[Wikipedia:Bots/Requests for approval/KiranBOT 12|BRFA 12.1]]", minor=True, bot=True)
        # your code...
    else:
        print(f"Skipping {page.title()}")
        # your code...
Also, since the botflag argument is deprecated, you should use bot=True to mark the edit as a bot edit. – DreamRimmer (talk) 14:47, 16 December 2024 (UTC)[reply]
@GreenC: Hi. I was under the impression that I had checked all the diffs and repaired them. Today I fixed a few of them, and I will fix the remaining ones after 30ish hours. During the next runs, I will mostly save the updated page text to my computer and manually test it via "Show changes" in the browser; this gives better control/understanding. When performing actual edits, I will add a delay of five minutes between each edit, so that I can test the URLs in real time. @DreamRimmer: thanks, but commenting out the page-save operation and saving the updated text to a file is a better option; you can see the relevant code from line 199 to 209. It's very old code though; the current program is drastically different from that one. —usernamekiran (talk) 17:52, 17 December 2024 (UTC)[reply]
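
A minimal sketch of the dry-run-to-file approach mentioned above; file names and paths are illustrative only:

import pathlib

def dry_run_save(title: str, updated_text: str) -> None:
    out_dir = pathlib.Path("dry_run")
    out_dir.mkdir(exist_ok=True)
    safe_name = title.replace("/", "_")
    (out_dir / f"{safe_name}.wiki").write_text(updated_text, encoding="utf-8")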