Wikipedia:Bots/Requests for approval/KiranBOT 12
Operator: Usernamekiran (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 15:59, Tuesday, September 24, 2024 (UTC)
Function overview: update Accelerated Mobile Pages/AMP links to normal links
Automatic, Supervised, or Manual: automatic
Programming language(s): Python (pywikibot)
Source code available: github repo
Links to relevant discussions (where appropriate): requested at BOTREQ around 1.5 years ago: Wikipedia:Bot requests/Archive 84#Accelerated Mobile Pages link eradicator needed, and village pump: Wikipedia:Village_pump_(technical)/Archive_202#Accelerated_Mobile_Pages_links, recently requested at BOTREQ a few days ago: special:permalink/1247505851.
Edit period(s): either weekly or monthly
Requested edit rate: 1 edit per 50 seconds.
Estimated number of pages affected: around 8,000 for now, though this estimate is on the high side; at any rate, thousands of pages. After that, new pages as they come in.
Namespace(s): main/article
Exclusion compliant (Yes/No): yes (for now), if required, that can be changed later
Function details: using extensive regex patterns, the bot looks for AMP links. It avoids false matches on ordinary "amp" strings in domains, e.g. yamaha-amplifiers.com. After finding and updating a link, the bot checks whether the updated link works: if it gets a 200 response code, the bot updates the link in the article. Otherwise, the bot adds the article title and the (non-updated) link to a log file (this can be saved to a log page as well). —usernamekiran (talk) 15:59, 24 September 2024 (UTC)
- addendum: I should have included this already, but I forgot. In the BOTREQ and other discussions, an open source "amputatorbot" on GitHub was discussed. That bot has a lot of functions irrelevant to Wikipedia; the only relevant feature is removing AMP links. For this, however, amputatorbot uses a database storing a list of ~200k AMP links (struck: ~400k), and another list of the canonical links for those AMP links. Maintaining this database, and the never-ending list of links for Wikipedia is not feasible. The program I created utilises comprehensive regex patterns. It also handles the archived links gracefully. —usernamekiran (talk) 17:50, 28 September 2024 (UTC)
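A minimal sketch of the kind of cleanup described above, for readers who want something concrete to evaluate. It is not the bot's actual code: the function name `clean_amp_url` is hypothetical, and it uses `urllib.parse` rather than the regex patterns the operator mentions, so that "amp" is only matched as a whole host label, path segment, or query key. The three rules are assumptions drawn from the rule summary given later in the discussion.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def clean_amp_url(url: str) -> str:
    """Return url with common AMP markers removed (illustrative rules only)."""
    parts = urlsplit(url)

    # Rule 1: "amp." as a whole leading host label becomes "www.".
    # A host like yamaha-amplifiers.com does not start with "amp." and is kept.
    host = parts.netloc
    if host.lower().startswith("amp."):
        host = "www." + host[4:]

    # Rule 2: drop "amp" only when it is an entire path segment,
    # so /amps/ or /amplifiers/ are left alone.
    segments = parts.path.split("/")
    path = "/".join(s for s in segments if s.lower() != "amp")

    # Rule 3: drop an "amp" query parameter, keep every other parameter.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if k.lower() != "amp"]
    # Only re-encode the query when something was actually removed.
    query = urlencode(kept) if len(kept) != len(pairs) else parts.query

    return urlunsplit((parts.scheme, host, path, query, parts.fragment))


if __name__ == "__main__":
    for example in (
        "https://amp.example.com/news/story",
        "https://www.example.com/news/amp/story?amp=true&id=7",
        "https://yamaha-amplifiers.com/amps/guitar",
    ):
        print(example, "->", clean_amp_url(example))
```

Matching on parsed URL components is one way to avoid the false positives mentioned in the function details; a regex-only approach would need explicit boundaries around "amp" to get the same effect.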
Discussion
"Maintaining this database, and the never-ending list of links for Wikipedia is not feasible"
But you wouldn't have to maintain this database, right, if the authors of that GitHub repo already do, or have made it available?
"The program I created utilises comprehensive regex patterns. It also handles the archived links gracefully."
Would you mind providing those patterns here for evaluation?
Aside from that, happy for this to go to trial. @GreenC: any comments on this, and does this fall into the scope of your bot? ProcrastinatingReader (talk) 10:40, 29 September 2024 (UTC)
- I will soon post the link to github, and reasoning for avoiding the database method. —usernamekiran (talk) 13:21, 29 September 2024 (UTC)
- @ProcrastinatingReader: Hi. Yes, the author at github has made it available, but I think the database has not been updated in 4 years, I am not sure though. I also could not find the database itself. If we utilise the database, the bot would not process the "unknown" amp links that are not in the database. In that case we will have to use the method that we are currently using. Also, the general process would be more resource intensive I think, ie: 1) search for the amp links in articles; 2) if an amp link is found in an article, look for it in the database; 3) find the corresponding canonical link; 4) replace it in the article. Even if the database is being maintained, we will have to keep it updated, and we will have to add our new findings to the database. I think this simpler approach would be better. KiranBOT at github, AmputatorBot readme at github. Kindly let me know what you think. —usernamekiran (talk) 19:50, 29 September 2024 (UTC)
- PS: I notified GreenC on their talkpage. Also, in the script, I added more comments than I usually do, and the script was created over the days/in parts, so the commenting might feel a little odd. —usernamekiran (talk) 19:54, 29 September 2024 (UTC)
- This sounds like a good idea. I ran into AMP URLs with the Times of India domains, and made many conversions. It seemed site specific. Like m.timesofindia.com became timesofindia.indiatimes.com and "(amp_articleshow|amp_videoshow|amp_etphotostory|amp_ottmoviereview|amp_etc..)" had the "amp_" part removed. Anyway, I'll watchlist this page and feel free to ping me for input once test edits are made. -- GreenC 23:42, 29 September 2024 (UTC)
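The site-specific conversion described in the comment above could be sketched roughly as follows. This is an illustration only, not code from either bot: `clean_toi_url` is a hypothetical name, the marker list covers only the four names spelled out in the comment (the original list is open-ended), and the sample URL in the usage lines is made up.

```python
import re

# Mobile host named in the comment; mapped to the desktop host.
TOI_MOBILE_HOST = re.compile(r"^(https?://)m\.timesofindia\.com/", re.IGNORECASE)

# Path markers named in the comment. The source list ends in "amp_etc..",
# so a real rule would need more entries than these four.
TOI_AMP_MARKER = re.compile(
    r"/amp_(articleshow|videoshow|etphotostory|ottmoviereview)(?=/|$)"
)


def clean_toi_url(url: str) -> str:
    """Apply the Times of India rules only to Times of India URLs."""
    # Crude site check (assumption): skip anything that is not a TOI link.
    if "timesofindia" not in url.lower():
        return url
    url = TOI_MOBILE_HOST.sub(r"\1timesofindia.indiatimes.com/", url)
    # Remove only the "amp_" prefix, keeping the rest of the segment.
    return TOI_AMP_MARKER.sub(r"/\1", url)


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(clean_toi_url("https://m.timesofindia.com/city/example/amp_articleshow/12345.cms"))
```

Keeping rules like this per site, instead of folding them into the generic patterns, limits the damage if one site's mapping turns out to be wrong.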
- @ProcrastinatingReader: if there are no further questions/doubts, is a trial in order? I am sure about one issue related to https, but I think we should discuss it after the trial. —usernamekiran (talk) 15:16, 2 October 2024 (UTC)
- {{BAG assistance needed}} —usernamekiran (talk) 08:42, 5 October 2024 (UTC)
- Reviewing the code, you're applying a set of rules (amp.domain.tld → www.domain.tld, /amp/ → /, ?amp=true&... → ?...) and then checking the URL responds with 200 to a HEAD request. That seems good for most cases, but there are going to be some instances where the site uses an unusual AMP URL mapping and responds with 200 to all/most/some invalid requests, especially considering we are following redirects (but not updating the URL to the followed redirect). It also will not work for the example edit from the BOTREQ? I don't know how to solve this issue without some way of checking the redirected page actually contains some of the content we are looking for, or access to a database of checked mappings. Maybe the frequency of mistakes will be low enough for this to not be a problem? I am unsure. Any thoughts from others? — The Earwig (talk) 16:10, 5 October 2024 (UTC)
- These are good points. Soft-404s and soft-redirects are the biggest (but not only) issues with URL changes. With soft-404s, you first process the links without committing changes, log redirect URLs, see which redirect URLs are repeating, manually inspect them to see if they are a soft-404; then process the links again with a trap added to treat the identified soft-404s as a dead link. Not all repeating redirects are soft-404s but many will be, you have to do the discovery work. For soft-redirects, it requires foreknowledge based on manual inspections, like the Times of India example above. URL changes are difficult for these reasons, and others mentioned in WP:LINKROT#Glossary. -- GreenC 17:53, 5 October 2024 (UTC)
- @GreenC any suggestions on logic/algorithm? I will try to implement them. I don't mind further work to perfect the program —usernamekiran (talk) 20:32, 6 October 2024 (UTC)
- @GreenC, ProcrastinatingReader, and The Earwig: I updated the code, and tested it on a few types of links (that I could think of), as listed in this version of the page, diff of the fix. Kindly suggest more types/formats of AMP links, and any updates to the code. —usernamekiran (talk) 02:49, 31 October 2024 (UTC)
- I see you log failed cases. If not already, also log successes (old url -> new url), in case you need to reverse some later (new url -> old url).
- One way to avoid the problems noted by The Earwig is simply skip URLs with 301/302 headers. Most soft-404s are redirect URLs. With the exception of http->https, those are OK. You can always go back and revisit them later. One way to do this is log the URL "sink" (the final URL in the redirect chain), then script the logs to see if any sinks are repeating.
- -- GreenC 04:19, 31 October 2024 (UTC)
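The "script the logs to see if any sinks are repeating" step suggested above might look something like this sketch. The log layout (old URL, new URL, sink) and the name `repeating_sinks` are assumptions made for illustration, and the sample rows are invented; the bot's real log format is not shown in this discussion.

```python
from collections import Counter


def repeating_sinks(rows, min_count=2):
    """Return (sink, count) pairs for sinks seen at least min_count times.

    rows is an iterable of (old_url, new_url, sink_url) tuples, where the
    sink is the final URL at the end of the redirect chain.
    """
    counts = Counter(sink for _old, _new, sink in rows)
    # most_common() sorts by count, highest first.
    return [(sink, n) for sink, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Invented log rows, for illustration only.
    rows = [
        ("https://amp.example.com/a", "https://www.example.com/a", "https://www.example.com/a"),
        ("https://amp.example.com/b", "https://www.example.com/b", "https://www.example.com/"),
        ("https://amp.example.com/c", "https://www.example.com/c", "https://www.example.com/"),
    ]
    for sink, n in repeating_sinks(rows):
        print(n, sink)
```

A sink that many different links resolve to (a homepage, for instance) is a candidate soft-404 for manual inspection, not proof of one. Logging the old and new URL in the same row also covers the earlier suggestion about being able to reverse changes later.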
- okay, I will try that. —usernamekiran (talk) 17:41, 11 November 2024 (UTC)
- {{BAG assistance needed}} I made a few changes/additions to the program. In summary: 1) if original URL works, but cleaned url fails, saving is skipped 2) if AMP url, and cleaned url both return non-200, cleaned url is saved 3) if the cleaned url results in a redirect (301, or 302), and the final url after redirection differs from the original AMP url's final destination, saving is skipped. All the events are logged accordingly. I think we are good for a 50 edit trial. courtesy ping @GreenC: —usernamekiran (talk) 05:51, 16 November 2024 (UTC)
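One possible reading of the three rules summarised above, written as a pure decision function so it can be checked without network access. The `Probe` type, the `should_save` name, and the order in which the rules are applied are all assumptions; the bot's actual implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Result of requesting one URL (hypothetical structure)."""
    status: int       # final HTTP status after following redirects
    redirected: bool  # True if a 301 or 302 was seen along the way
    final_url: str    # URL at the end of the redirect chain


def should_save(amp: Probe, cleaned: Probe) -> bool:
    """Decide whether the cleaned URL should replace the AMP URL."""
    amp_ok = amp.status == 200
    cleaned_ok = cleaned.status == 200

    # Rule 1: the original works but the cleaned URL fails, so skip.
    if amp_ok and not cleaned_ok:
        return False
    # Rule 2: both fail, so the cleaned URL is saved anyway.
    if not amp_ok and not cleaned_ok:
        return True
    # Rule 3: the cleaned URL redirects somewhere other than where the
    # AMP URL ends up, so skip.
    if cleaned.redirected and cleaned.final_url != amp.final_url:
        return False
    # Otherwise the cleaned URL responds with 200 and nothing looks off.
    return True


if __name__ == "__main__":
    amp = Probe(200, False, "https://www.example.com/story")
    cleaned = Probe(200, True, "https://www.example.com/")
    print(should_save(amp, cleaned))
```

Separating the decision from the HTTP requests makes each rule easy to test against edge cases found during a trial.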
- Just noting this has been seen; I'll give GreenC a few days to respond but otherwise I'll chuck this to trial if there is no response (or a favourable response). Primefac (talk) 20:39, 17 November 2024 (UTC)
- Hi. Given the large number of pages affected, any issue could break references, essentially breaking WP:V, so I don't want to take any chances. No hurry on my side either. —usernamekiran (talk) 13:23, 20 November 2024 (UTC)
- I think it would be easier to error check if you were able to make 10 edits on live pages. If those go well, then 10 more. And so on, going through the results manually verifying, and refactoring edge cases as they arise, before moving to the next set. We should know by 50 edits total how things are. In that sense, if you were approved for 50 trial edits. User:Primefac. -- GreenC 17:11, 20 November 2024 (UTC)
- yes, I was thinking the same. I tested the program on Charles III, and a few other pages, but I'm still doubtful about various possibilities. Even if approved, I'm thinking of going very slow for the first few runs, and only after thorough scrutiny I will run it normally, with 1 edit per 5 seconds. —usernamekiran (talk) 10:22, 21 November 2024 (UTC)
- Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Please follow the time frame set out by GreenC - you do not necessarily have to tag this with {{BotTrialComplete}} after each grouping of 10 (that would get a little silly) but post the results of each group here so that others may review. For the sake of expanded viewing, please do not mark the edits as minor. Primefac (talk) 11:36, 21 November 2024 (UTC)