Wikipedia:Request a query/Archive 3
This is an archive of past discussions on Wikipedia:Request a query. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current main page. |
Archive 1 | Archive 2 | Archive 3 | Archive 4 | Archive 5 |
Category Queries
Hi, I am trying to work out how useful categories are 1. I was trying to find out a count by month of # of new categories
- wikimedia:Manual:category table - MediaWiki doesn't have dates
[[1]]
- the dates restrictions on PetScan don't seem to work for categories
- I found this SQL, but after choosing dB = 'wiki" it can't find the file select count(*) from page where page_namespace = 14;
but I "Can't connect to MySQL server on 'wiki.analytics.db.svc.wikimedia.cloud' ([Errno -2] Name or service not known)"
- There does seem to be a revision table wikimedia:Wikibase/Schema/wb changes - MediaWiki
2. Is it possible to find out how may times a reader clicks on a category on a page?
I asked a similar question to the help desk asking if there were reports Wakelamp d[@-@]b (talk) 04:25, 20 August 2022 (UTC)
- The database for the English Wikipedia is "enwiki_p", not "wiki".You could try to get historical counts of pages in the category namespace by looking for creation times in revision and deletions in logging, but it's going to be slow and error-prone.
I'd advise using the database dumps instead. all-titles.gz will probably be easier to work with than category.sql.gz.There's a database report including the count. Just walk its history.View counts are most easily seen at https://pageviews.wmcloud.org/. The number of times categories (or other pages or links) are clicked on from a specific page aren't publicly available. —Cryptic 10:02, 20 August 2022 (UTC)- Clickstream reveals click counts from one page to another but only within mainspace, not to categories. Certes (talk) 14:46, 20 August 2022 (UTC)
- Thank=you. I am not certain what you mean by "Just Walk it's history Have WMF been willing to provide query results from their web server in the past?
- My interest is because of this thread (I was OP) Wakelamp d[@-@]b (talk) 04:55, 1 September 2022 (UTC)
- Go to the history of that page and you can get what the count is for every week going back for years. —Cryptic 11:17, 1 September 2022 (UTC)
RfA by number of participants?
Could someone run a query on requests for adminship by number of participants (support/neutral/oppose) please? Information from it may be featured in an article in The Signpost. ––FormalDude (talk) 21:38, 25 September 2022 (UTC)
- This is something that a query can only approximate (if you just want the total number of participants), or can't do at all (if you want them separated by s/n/o). Fortunately, the data's already been collected at Wikipedia:Requests for adminship by year. —Cryptic 00:41, 26 September 2022 (UTC)
- I'm just looking for total number of participants per RfA. Don't really want to have to add up the support/neutral/opposes for each one. ––FormalDude (talk) 01:28, 26 September 2022 (UTC)
- quarry:query/67678. WP:RFY will be (a lot) more accurate. I suppose I can put together a script to add up the columns from that, too, though it'll still only go back to 2006. —Cryptic 02:26, 26 September 2022 (UTC)
- If it's not too much trouble to make that script it would be much appreciated! No problem that it only goes back to 06'. ––FormalDude (talk) 02:30, 26 September 2022 (UTC)
- User:FormalDude/RFAs. Results codes are described at Template:Recent RfX/doc. —Cryptic 03:08, 26 September 2022 (UTC)
- Thanks! ––FormalDude (talk) 03:14, 26 September 2022 (UTC)
- User:FormalDude/RFAs. Results codes are described at Template:Recent RfX/doc. —Cryptic 03:08, 26 September 2022 (UTC)
- If it's not too much trouble to make that script it would be much appreciated! No problem that it only goes back to 06'. ––FormalDude (talk) 02:30, 26 September 2022 (UTC)
- quarry:query/67678. WP:RFY will be (a lot) more accurate. I suppose I can put together a script to add up the columns from that, too, though it'll still only go back to 2006. —Cryptic 02:26, 26 September 2022 (UTC)
- I'm just looking for total number of participants per RfA. Don't really want to have to add up the support/neutral/opposes for each one. ––FormalDude (talk) 01:28, 26 September 2022 (UTC)
Number of refs in the median article
I haven't found this anywhere, but perhaps it could be answered with a fairly simple query? I'd like to know the number of inline citations (using <ref>
tags) in the median article. Ideally, this would exclude disambiguation pages and other non-articles, but indisputable evidence is not required. My guess is that it's more than one, but probably not more than five. WhatamIdoing (talk) 22:56, 3 October 2022 (UTC)
- SQL queries (what we specialize in here) can look at almost all public database information except for article text. Could possibly try to look at templatelinks for things like Cite book, Cite news, etc. but that might be too slow. Could possibly do it over a small range (last 30 days - recentchanges table?) That might be a decent approach. –Novem Linguae (talk) 23:19, 3 October 2022 (UTC)
- That won't work either; a page that transcludes {{cite news}} four hundred times looks exactly the same in the database as one that transcludes it just once.This can be extracted from the database dumps, but I'm really not the best person to do it - there just aren't any reasonably good wikitext parsers in the languages I'm most fluent in. I could put together something in an hour or two that just counts the number of times /<ref\b/ appears in the current revisions of all articles, or spend a week fiddling around with figuring out how bzip2 extraction and xml parsing works in python so I can use MWParserFromHell and only count things that actually are refs. (And even that's going to miss ref tags added by templates.) —Cryptic 23:33, 3 October 2022 (UTC)
- WhatamIdoing, are you also interested in other types of inline refs, like those using {{sfn}}? – Jonesey95 (talk) 03:04, 5 October 2022 (UTC)
- I'm interested in any method that lets me make a basically accurate statement like "Most articles have no more than n inline citations" or "If your new article has n citations, then it is already above average" and possibly even "Why are you complaining about there 'only' being 2n inline citations in this article, when fully half of Wikipedia's articles have n or fewer?". If you prefer the non-grumpy version of this, Help:Your first article could get a statement like "Most Wikipedia articles cite at least n-1 sources, and while it's not actually required to have at least n-1, it is an excellent goal" (unless n turns out to be two or three, because I really wouldn't want to position a one- or two-source article as an aspirational goal). If it were easy enough to get a distribution count (e.g., 5% are unsourced, 5% have one, 10% have two, etc., up to some reasonable number like 10 or 12) then Wikipedia:Inline citation#Ref tags could have a graph.
- Since {{sfn}} is used in less than two percent of articles, and some of those also use ref tags, I doubt that it will make much difference to the overall calculation. (In the previous sentence, some probably means "most" and might mean "almost all". I checked six pages in Special:WhatLinksHere, and all six had ref tags in addition to sfn templates.) WhatamIdoing (talk) 04:27, 5 October 2022 (UTC)
- WhatamIdoing, are you also interested in other types of inline refs, like those using {{sfn}}? – Jonesey95 (talk) 03:04, 5 October 2022 (UTC)
- That won't work either; a page that transcludes {{cite news}} four hundred times looks exactly the same in the database as one that transcludes it just once.This can be extracted from the database dumps, but I'm really not the best person to do it - there just aren't any reasonably good wikitext parsers in the languages I'm most fluent in. I could put together something in an hour or two that just counts the number of times /<ref\b/ appears in the current revisions of all articles, or spend a week fiddling around with figuring out how bzip2 extraction and xml parsing works in python so I can use MWParserFromHell and only count things that actually are refs. (And even that's going to miss ref tags added by templates.) —Cryptic 23:33, 3 October 2022 (UTC)
pagetriage_page_tags
The pagetriage_page_tags table has around 19 rows per ptrpt_page_id. Each row corresponds to a unique page tag (ptrpt_tag_id). The lookup is stored in the table pagetriage_tags, which maps ptrt_tag_id to ptrt_tag_name, such as 'copyright', 'user_name', etc.
I'm trying to find the page ids for any pages in pagetriage_page_tags that only have 1 entry instead of 19.
The way I chose to do this was with a query that matched the bug report. The bug report had only a copyright tag, and no other tags, so I picked a random tag (user_name) to check in a subquery to see if it also existed.
Rough draft query here but I messed something up. Sample data with 3 pages x 19 tags. Help appreciated :) –Novem Linguae (talk) 03:39, 21 October 2022 (UTC)
- Here you go. Counting by tag type (quarry:query/68192) seems hinky at a glance, but I've never explored this part of this db. —Cryptic 03:54, 21 October 2022 (UTC)
- quarry:query/68194 has a fixed version of your query. GROUP BY...HAVING is the usual pattern for "which values have exactly n rows containing them", though. —Cryptic 04:06, 21 October 2022 (UTC)
Active long-term users
Category:Members of the Fifteen Year Society of Wikipedia editors has 562 members. It it possible to extract a list of those who are still active (those who have made an edit in the last 12 months)? Kudpung กุดผึ้ง (talk) 00:21, 19 October 2022 (UTC)
- There's only 539 members due to transclusions (e.g. all of User:Valjean, User:Valjean/Leftcolumn, and User:Valjean/Userboxes are in the category). quarry:query/68150 has the 517 with an edit in the last year. There's plenty of users who registered 15 or more years ago who aren't in the category, of course, such as me. —Cryptic 02:22, 19 October 2022 (UTC)
- As a starting point, the result is very encouraging. Thanks. Kudpung กุดผึ้ง (talk) 00:36, 25 October 2022 (UTC)
Proportion of biographies
I'd like to know, if possible, the proportion of biographies where the subject is:
- American
- a sportsperson (if there isn't an existing definition of this, then their article is in a subcategory under Category:People in sports works for me)
- An American sportsperson
Thanks. Thryduulf (talk) 10:28, 12 October 2022 (UTC)
- The category tree on enwiki is useless for this sort of question - go deep enough into, say, Category:Births to get to all the individual years' articles, and you start finding ones like Category:Sulaiman Mountains. Category:People in sports and Category:American people are just as bad. You might have better luck at d:WD:RAQ. —Cryptic 11:03, 12 October 2022 (UTC)
- On the other hand, if this is for WT:ITN#Is there an editorial policy on Main Page Recent Deaths entries?, then you're probably interested in the number of biographies of living people. That's easy, since CAT:LP is non-diffusing. And while counting the number of biographies of living Americans and of living sportspeople is still hard, there's a special case of the latter that's also feasible: essentially all footballers transclude {{Infobox football biography}}. quarry:query/68028: there's 1048809 blps, and 168886 footballer blps (1 out of 6.2). This is actually slightly worse than when I first ran these queries in early 2019 (quarry:query/33441, quarry:query/33442) and got 1 out of 6.4. —Cryptic 20:32, 12 October 2022 (UTC)
- If Wikidata results will do, then have a look at https://w.wiki/5pBZ I don't know how to limit it to entries associated with an English Wikipedia article, but I believe that is possible. WhatamIdoing (talk) 02:23, 14 October 2022 (UTC)
- @Thryduulf@WhatamIdoing This is two weeks late, I'm afraid, but in case it is still of interest - per wikidata, English WP has -
- 389,053 Americans ("person whose country of citizenship is the US");
- 631,736 sportspeople ("person whose occupation is a subclass of athlete");
- and then 97,178 American athletes (both together)
- The WHGI tool has pre-crunched the data and reports, as of mid-October, 1,908,734 enwiki biographies; this would make our pool of biographies 20.4% Americans, 33.1% athletes (!), and 5.1% American athletes, give or take. Strikingly, those numbers imply that American bios are noticeably less likely to be athletes (~1/4) than the overall average on enwiki (~1/3), which surprised me a bit.
- Both WD and WP approaches are quite generous in their counting - for example George W. Bush is in a rugby players category and has the occupation listed on WD, so will be in the count by either approach. Granted, he did play for Yale, but it is not quite what we might expect to see! I don't think there is a practical way to work out just 'professional athletes' from either source.
- Lastly, baselining this against @Cryptic's figures for footballers, I think there are currently 198,214 mainspace pages transcluding {{Infobox football biography}}, and WD reports 204,265 "occupation: subclass of association football player + on enwiki". So the two approaches don't line up precisely, but are easily within a couple of percent of each other, which is reassuring. Andrew Gray (talk) 12:21, 26 October 2022 (UTC)
- On second thoughts, we can sort of proxy for "professional athlete": of our 97,178 American athletes, a total of 11,308 have a second occupation listed which is not a subclass of athlete, and across all athletes it is 56,340, ~11% in both cases.
- This is imperfect in both directions (a lot of "primarily athletes" will also have some other secondary occupation tag, especially for post-retirement careers) but I think it can give us some confidence that we're not getting an excessively high rate of false positives from the generous counting. Andrew Gray (talk) 12:41, 26 October 2022 (UTC)
- @Thryduulf@WhatamIdoing This is two weeks late, I'm afraid, but in case it is still of interest - per wikidata, English WP has -
- Blast from the past: User talk:BrownHairedGirl/Archive/Archive 065#Systemic bias * Pppery * it has begun... 04:16, 14 October 2022 (UTC)
Frequency of certain phrases in revision edit histories by page type
Looking to find out how often some profanity, curse words, are said in revision history. I have done the same thing manually for article, article talk, user, user talk. I suspect it is much higher on revision history
Wakelamp d[@-@]b (talk) 15:15, 1 November 2022 (UTC)
- Got any additional details? What word(s) should be searched for? What namespaces? etc. –Novem Linguae (talk) 15:51, 1 November 2022 (UTC)
- Excuse the language, but what about
fuck off
,cunt
, andfuckwit
for just one day on the revision history for article? Wakelamp d[@-@]b (talk) 13:51, 4 November 2022 (UTC)- One day? Yesterday had only this one edit. The whole of October had 28, half "cunts" (with plenty of false positives), half "fuck offs", and no "fuckwits": quarry:query/68602. —Cryptic 14:23, 4 November 2022 (UTC)
- Thank-you. Scunthorpe is an awesome false positive :-) But overall it's far less than I expected.... Maybe people don't attack each other on summary as the other editor won't see it unless they look, or are pinged? Or maybe other words are being used? Also The WMF/ Google- JIGSAW paper, found that 80 % of the abuse was from Ip editors, or very new editors,...
- So may I have another query that doesn't have a filter on the words, has editor name, and their edit count
- I will then download the file, import it into Excel to analyze (Excel has a limit of 1.2 m rows, and 2 GB size)
- The file should be within the limits - en WP has 43K revisions per day? (24 hr/day* 60 min/hr *30 edits per day) which I think is 17 MBish (your 28 records saved as 12 K in Excel so 400 bytes
- Wakelamp d[@-@]b (talk) 07:15, 5 November 2022 (UTC)
- On November 4 UTC, there's currently 177248 non-deleted revisions, of which 99916 are currently in mainspace. So your estimate for rowcount is off by about a factor of either 2 or 4; I don't care to guess what the true average size for edit summaries will be.You want, what? Username/ip, editcount, edit summary? (Edit count for ips isn't easily available the way it is for registered users.) Do you need the page name, a timestamp, and/or a rev_id (for Special:Diff/)? And just in mainspace again? —Cryptic 09:30, 5 November 2022 (UTC)
- See also Scunthorpe problem. Certes (talk) 11:44, 5 November 2022 (UTC)
- I am always happy within a factor of 10 - do not sit on my IKEA furniture :-). - and i expect the distribution of size to strongly skewed left so I agree on not guessing) My plan is to use Excel and keep on introducing filters until I can find 80 %ish. The Scunthorpe problem is a big issue.
- Edit histories for Article/amin, article talk, user talk as separate files. I am a bit concerned how my laptop will cope as it ha16 b of ram, an the excel files take up more in ram than in storage.
- Page name (so I can remove comments fuckwit on page fuckwit)
- Username Edit summary
- Edit count (which you have advised will not include Ips)
- minor edit ( just interested to see)
- Hypothesis I am trying to check -
- 1 . Was the WMF research correct that IP editors, and editors with less than 5 comments (I am using edits instead) are responsible for 80 % of abuse?
- 2. Was the WMF research correct that User talk has 3 times more abusive than artic
- 3. Was the WMF research correct that a few users create 10 % of the abuse? Wakelamp d[@-@]b (talk) 14:58, 6 November 2022 (UTC)
- Mainspace, Talk:, User talk:. All are for November 5; the first is large, so may take a while to open and sort. —Cryptic 20:04, 6 November 2022 (UTC)
- @Cryptic Thank-you for your help. I am finding some dupes ie,g, , https://en.wikipedia.org/w/index.php?title=Talk:1885_in_Norway&action=history , Wakelamp d[@-@]b (talk) 13:47, 8 November 2022 (UTC)
- I don't understand what you mean? There's only one result for Talk:1885 in Norway (ascending sort by pagename, and it's on page 8), and there's exactly one edit from November 5 in its history. Yes, Nythar made many edits to similarly-named pages that day with the same edit summary, but you said you wanted no filter. —Cryptic 03:08, 9 November 2022 (UTC)
- @Cryptic Thank-you for your help. I am finding some dupes ie,g, , https://en.wikipedia.org/w/index.php?title=Talk:1885_in_Norway&action=history , Wakelamp d[@-@]b (talk) 13:47, 8 November 2022 (UTC)
- Mainspace, Talk:, User talk:. All are for November 5; the first is large, so may take a while to open and sort. —Cryptic 20:04, 6 November 2022 (UTC)
- I am always happy within a factor of 10 - do not sit on my IKEA furniture :-). - and i expect the distribution of size to strongly skewed left so I agree on not guessing) My plan is to use Excel and keep on introducing filters until I can find 80 %ish. The Scunthorpe problem is a big issue.
- Thank-you. Scunthorpe is an awesome false positive :-) But overall it's far less than I expected.... Maybe people don't attack each other on summary as the other editor won't see it unless they look, or are pinged? Or maybe other words are being used? Also The WMF/ Google- JIGSAW paper, found that 80 % of the abuse was from Ip editors, or very new editors,...
- One day? Yesterday had only this one edit. The whole of October had 28, half "cunts" (with plenty of false positives), half "fuck offs", and no "fuckwits": quarry:query/68602. —Cryptic 14:23, 4 November 2022 (UTC)
- Excuse the language, but what about
Displaying all categories in [Category:Container categories] which also have mainspace pages in them.
This would make finding and fixing these container categories much easier. DemonStalker (talk) 01:58, 23 November 2022 (UTC)
Living people over some age
How can I get a list of all the individuals currently alive in the English Wikipedia over the age of say 95?
Thank You Dleigthful (talk) 06:18, 21 November 2022 (UTC)
- @Dleigthful: This is easily done with PetScan instead of a database query. Here's a list of people in Category:1904 births through Category:1927 births (pagepile 46206) that are also in Category:Living people: pagepile 46207 (PetScan query). – Joe (talk) 06:42, 21 November 2022 (UTC)
- @Joe RoeI tried going to this link that you sent here and I can't seem to get it to load. I tried going to your PetScan query above but I can't get it to load for some reason. Am I doing something wrong by calling it by clicking on the link, or is there something else I have to do to get your output. Dleigthful (talk) 06:50, 21 November 2022 (UTC)
- It takes a while to run the query. Click this link to just see the results. – Joe (talk) 07:41, 21 November 2022 (UTC)
- @Dleightful @Joe Roe would you be able to go over how you got this answer? I'm new to PetScan. For instance, how would I change the parameter to say everyone older than 93 years instead. Dleigthful (talk) 01:38, 23 November 2022 (UTC)
- @Dleigthful: So you have to do it in two stages. First enter the list of birth-year categories under categories and select "union" to get a list of all the articles in any of those categories. Under output you can get this list as a pagepile, which you then use as the input of the second query (under other sources) along with Category:Living people. – Joe (talk) 07:01, 29 November 2022 (UTC)
- @Dleightful @Joe Roe would you be able to go over how you got this answer? I'm new to PetScan. For instance, how would I change the parameter to say everyone older than 93 years instead. Dleigthful (talk) 01:38, 23 November 2022 (UTC)
- It takes a while to run the query. Click this link to just see the results. – Joe (talk) 07:41, 21 November 2022 (UTC)
- I'm not familiar with pagepile, but if your description of what it's querying is accurate, it's insufficient - the birth year categories only have articles where the exact year is known. Russ Witte, for example, is in only in Category:1910s births, since it's unknown whether he was born in 1916 or 1917. (It also omits Lucile Randon from Category:1904 births.) quarry:query/69125 has pages in Category:Living people and at least one of Category:1900s births, Category:1910s births, or Category:1900 births–Category:1927 births. —Cryptic 07:43, 21 November 2022 (UTC)
- Oh yes, I went from 1904 births because of Randon too, that was a typo. Good point with the uncertain year births, although comparing our results that apparently that only includes nine people :) – Joe (talk) 07:50, 21 November 2022 (UTC)
- @Joe RoeI tried going to this link that you sent here and I can't seem to get it to load. I tried going to your PetScan query above but I can't get it to load for some reason. Am I doing something wrong by calling it by clicking on the link, or is there something else I have to do to get your output. Dleigthful (talk) 06:50, 21 November 2022 (UTC)
- A Wikidata query in SPARQL (tutorial) could also select people with a suitable DoB, no death date and an enwp article, though some of them may have died without bothering to inform Wikidata. Certes (talk) 13:14, 21 November 2022 (UTC)
NPP stats
Hi Cryptic
For a meeting with the WMF this week we need
- Average number of daily new articles (not redirects)
- Average daily number that are passed by auto patrollers
- Average daily number that are passed by NPPers
- Average daily number that are passed but tagged for improvement/issues
- Average daily number that are listed for the AfD, CSD, PROD deletion systems (together will do)
- Average daily number that are moved to draft
Sample period would need to be at least over the last 6 months to make any sense.
Is this something you can do for us? Thanks in advance. Kudpung กุดผึ้ง (talk) 06:45, 6 December 2022 (UTC)
Articles with duplicative sections
Would it be possible to construct a query that returns all articles that have two or more sections with the same name? I just merged one such instance at Concord, Massachusetts (which had two "Transportation" sections) and I suspect there may be many others. {{u|Sdkb}} talk 00:08, 9 December 2022 (UTC)
- I don't think that's possible in Quarry. Article text, including section headers, is not available. The nearest I can think of is flagging redirects to multiple sections, e.g. Foobar→Foo#Bar and Fooqux→Foo#Qux, which would give a very incomplete list (and hit some non-heading anchors which we couldn't filter out). You could suggest a new database report if it's important enough. Certes (talk) 10:02, 9 December 2022 (UTC)
Small tweak to existing query
Hi! Could someone please tweak this query to only list existing talk pages? Thanks! 🐶 EpicPupper (he/him | talk) 02:58, 12 December 2022 (UTC)
- @EpicPupper: I've tweaked that as quarry:query/69705. 400 found. Certes (talk) 10:15, 12 December 2022 (UTC)
Fully-protected high-use templates
Hello! I'd appreciate it if a query could be created to list all fully-protected templates and their number of transclusions. Additionally, it'd be great if the least-transcluded template, most-transcluded template, and median and average number of transclusions could also be calculated some way. Much thanks! 🐶 EpicPupper (he/him | talk) 04:42, 17 December 2022 (UTC)
- @EpicPupper: I've created quarry:query/69843, which may do what you need. Certes (talk) 13:12, 17 December 2022 (UTC)
Stats for WMF/NPP meeting
Could we please have these for a meeting with the WMF next week:
- Average number of daily new articles (not redirects)
- Average daily number that are passed by auto patrollers
- Average daily number that are passed by NPPers
- Average daily number that are passed but tagged for improvement/issues
- Average daily number that are listed for the AfD, CSD, PROD deletion systems (together will do)
- Average daily number that are moved to draft
Sample period would need to be at least over the last 6 months to make any sense. Thanks. Kudpung กุดผึ้ง (talk) 18:46, 4 January 2023 (UTC)
Shortdesc query
I tried fiddling around in PetScan, and couldn't see a way to do this: I want to see all articles without short descriptions, sorted by pageviews (high to low). — PerfectSoundWhatever (t; c) 02:19, 28 December 2022 (UTC)
- @PerfectSoundWhatever: I went ahead and pulled this data for you: User:Uhai/Pages without short descriptions by view count. Note there's over a million non-redirect and non-disambiguation mainspace pages that do not transclude the short description template, so this report only shows the top 1000 by view count. View counts are the monthly aggregate from November 2022 per page so pages created or moved after November 2022 will not appear in this list. Also note view counts may be influenced by events occurring in this timeframe, e.g. World Cup or pages making the front page of Reddit.
- The report I linked contains more information and a link to the full dataset. Uhai (talk) 18:34, 4 January 2023 (UTC)
- Know I wasn't the original requestor, but that's amazing, thanks!! ~ Eejit43 (talk) 18:43, 4 January 2023 (UTC)
- @Uhai: This is really cool, thanks a lot! — PerfectSoundWhatever (t; c) 00:22, 5 January 2023 (UTC)
- I absolutely love this query. How easy is it to run it again? Ideally, we'd be able to run it every month or two to get the most viewed pages done at the time. — PerfectSoundWhatever (t; c) 00:56, 5 January 2023 (UTC)
- @PerfectSoundWhatever: It's easy to run but slow and not a super great solution as I had to hit the Wikimedia pageviews API endpoint >1 million times (basically once per article) to get the data. The total run time, not counting crashes and subsequent bug fixing I had to do, was nearly thirty hours I think. If I don't receive any angry e-mails from Wikimedia Foundation about hitting the API so many times, I could run it again in two months. A better solution would be to parse the pageview data dumps instead of using the API, but this would require significantly more programming effort. I'll see what I can do in my free time, though. Uhai (talk) 01:14, 5 January 2023 (UTC)
- I see. My programming knowledge is fairly basic but I might give the dumps option a try. Thanks a lot for your effort! — PerfectSoundWhatever (t; c) 02:13, 5 January 2023 (UTC)
- User:Certes/Reports/Top 1000 pages lacking short descriptions runs each week. It uses User:HostBot/Top 1000 report, which lists only the top 2200 or so most viewed pages, so produces very few results. If there's a longer list of most-viewed pages on Wikipedia, that would provide an easy way to extend the query. Certes (talk) 12:59, 5 January 2023 (UTC)
- Another option is to feed topviews into PetScan as a Manual List or pagepile, filtering by a suitable template/category of pages having/lacking short description. The topviews UI requires hammering the "Show more" button repeatedly to get a decent amount of results, but it may be possible to analyse the underlying GET requests (or whatever) and simulate them without wearing your mouse out. Certes (talk) 13:07, 5 January 2023 (UTC)
- You can download that data and that gives you the top 1000, similar to HostBot's report ~ Eejit43 (talk) 14:23, 5 January 2023 (UTC)
- Good point. That gave me the first 989. All have SDs. Certes (talk) 16:08, 5 January 2023 (UTC)
- You can download that data and that gives you the top 1000, similar to HostBot's report ~ Eejit43 (talk) 14:23, 5 January 2023 (UTC)
- I see. My programming knowledge is fairly basic but I might give the dumps option a try. Thanks a lot for your effort! — PerfectSoundWhatever (t; c) 02:13, 5 January 2023 (UTC)
- @PerfectSoundWhatever: It's easy to run but slow and not a super great solution as I had to hit the Wikimedia pageviews API endpoint >1 million times (basically once per article) to get the data. The total run time, not counting crashes and subsequent bug fixing I had to do, was nearly thirty hours I think. If I don't receive any angry e-mails from Wikimedia Foundation about hitting the API so many times, I could run it again in two months. A better solution would be to parse the pageview data dumps instead of using the API, but this would require significantly more programming effort. I'll see what I can do in my free time, though. Uhai (talk) 01:14, 5 January 2023 (UTC)
- I absolutely love this query. How easy is it to run it again? Ideally, we'd be able to run it every month or two to get the most viewed pages done at the time. — PerfectSoundWhatever (t; c) 00:56, 5 January 2023 (UTC)
Number of non-locked registered users across Wikimedia
Hi there, I'd really appreciate it if a query for the number of registered users (movement-wide) that aren't globally locked could be created. Thanks! EpicPupper (talk) 01:59, 31 January 2023 (UTC)
- Hey @EpicPupper. I think quarry:query/71506 should do the trick. The answer appears to be 71,244,967. Hope that helps. –Novem Linguae (talk) 22:20, 19 February 2023 (UTC)
Subpages of Wikipedia talk:Articles for creation
Hello! Would it be possible to get a quarry on new pages created as a subpage of Wikipedia talk:Articles for creation, organized by date of creation? Thank you so much! HouseBlastertalk 21:37, 19 February 2023 (UTC)
- Hey @HouseBlaster. I think quarry:query/71505 should do the trick. Hopefully the query isn't buggy... it's actually harder than it looks to get the creation date of a series of pages in SQL because we have to LEFT JOIN the revision table (and then GROUP BY and ORDER BY), which I find unintuitive.
- By the way, what process is creating all these AFC drafts in the wrong place? Perhaps we should investigate that and try to fix it. –Novem Linguae (talk) 22:07, 19 February 2023 (UTC)
- @Novem Linguae: it was the pre-draftspace draftspace, taking advantage of the fact that IPs could create talk pages. However, it is now apparently on the titleblacklist, so we should be all good. Thank you so much for your help! HouseBlastertalk 22:43, 19 February 2023 (UTC)
- Very interesting. I've learned some history today. By the way, my query has a bug. rev_timestamp is the newest revision, not the oldest revision. Let me try some stuff and I'll fix it shortly. –Novem Linguae (talk) 22:55, 19 February 2023 (UTC)
- Ok, query fixed. With thanks to @Robertsky for the idea to use rev_parent_id = 0. –Novem Linguae (talk) 01:43, 20 February 2023 (UTC)
- Very interesting. I've learned some history today. By the way, my query has a bug. rev_timestamp is the newest revision, not the oldest revision. Let me try some stuff and I'll fix it shortly. –Novem Linguae (talk) 22:55, 19 February 2023 (UTC)
- @Novem Linguae: it was the pre-draftspace draftspace, taking advantage of the fact that IPs could create talk pages. However, it is now apparently on the titleblacklist, so we should be all good. Thank you so much for your help! HouseBlastertalk 22:43, 19 February 2023 (UTC)
- They are nearly all redirects. Only 48 are real pages, and most of those are administrative pages such as WT:Articles_for_creation/2006-05-22. Certes (talk) 22:12, 19 February 2023 (UTC)
Statistics on speedy deletion
Hi. I'm looking for:
- Number of G13 deletions by year. E.g. How many pages were deleted under CSD G13 in 2016? How many in 2019 and in 2022? This query would involve searching the deletion log for the string "G13".
- Number of article deletions by year. E.g. how many (mainspace) articles were deleted in 2016?
I suspect to see a shift over the years from deleting articles to deleting drafts. I'm very curious to see how big the shift has been. If results could be broken down by month that would be even better. Thanks in advance! Clayoquot (talk | contribs) 20:02, 21 February 2023 (UTC)
- @Clayoquot: I've had a go in quarry:query/71599. Deletion counts are for all namespace 0 pages because, after deletion, the database no longer indicates whether each page was an article, a redirect or a dab. G13 counts are for all non-talk namespaces but will apply mainly to Draft:. Certes (talk) 00:41, 22 February 2023 (UTC)
- Beautiful, thanks Certes! Clayoquot (talk | contribs) 01:01, 22 February 2023 (UTC)
How to exclude pages that are populating a particular category?
See quarry:query/69410. There are about 98 pages Category:Redirects_for_discussion_with_talk_page_redirects that I want excluded from my query results. It's easy to qualify by limiting to only pages populating a category, but seems not so much for the reverse. Many pages populating that category are populating at least one other category as well, and if they are they are still listed in the report when I don't want them there. wbm1058 (talk) 21:10, 28 February 2023 (UTC)
- Could try something like...
- –Novem Linguae (talk) 21:15, 28 February 2023 (UTC)
-- your regular query goes here AND p.page_title NOT IN ( -- the quarry:query/69410 query, without the concat part )
- Thanks Novem Linguae! You pointed me in the right direction and I figured it out after I searched for "NOT IN" help and found SQL NOT IN with Subqueries – wbm1058 (talk) 22:24, 28 February 2023 (UTC)
- Awesome. I'm glad you got it working. –Novem Linguae (talk) 22:31, 28 February 2023 (UTC)
- Thanks Novem Linguae! You pointed me in the right direction and I figured it out after I searched for "NOT IN" help and found SQL NOT IN with Subqueries – wbm1058 (talk) 22:24, 28 February 2023 (UTC)
searchindex
Is it possible to access this table? The presence of example queries suggests it is, but it doesn't appear to exist on enwiki_p. BilledMammal (talk) 20:29, 6 March 2023 (UTC)
- From searchindex table:
This table is populated when edits are saved on pages, unless $wgDisableSearchUpdate is set to true, or an extension that implements its own search index (like CirrusSearch extension) is installed.
We have CirrusSearch, so sadly the table may not be available. Certes (talk) 20:34, 6 March 2023 (UTC)- Unfortunate, but thank you. BilledMammal (talk) 20:41, 6 March 2023 (UTC)
Most used red links
At the Tea house, a user has asked for "the most used red links (ignoring templates and similar)". There seems to be an ambiguity here, where one reading is, 1. "unique red links with the most user clicks", and the other is, 2. "unique red links which appear a) in the greatest number overall", or, b) "...in the greatest number of articles" (i.e., ignoring dupes in the same article, which if they don't overlink, should be similar to 2a). I'm assuming they meant 2a. Can someone help with this request? Thanks, Mathglot (talk) 08:08, 3 March 2023 (UTC)
- I think 2a is covered by Special:WantedPages. There's a lot of noise in that report though, since the most linked red links are links to deleted pages, or pages moved without leaving a redirect. Hope it helps. –Novem Linguae (talk) 09:15, 3 March 2023 (UTC)
- @Mathglot@Novem Linguae For #1, this used to be in the pageviews data - but it is not currently in either the clickstream files or the web pageviews tool. I don't think it's made available in the raw public pageviews data either but I don't have a sample to check.
- meta:Research:Wikipedia clickstream mentions it being added to the clickstream data in Feb 2015. Checking in the data for that month shows about 3.5m logged hits to 114k redlinks; only redlinks with >10 hits from a given source page are shown, and over 100k of them had only one source page generating significant traffic, which is interesting - I suspect that pattern still holds today. The busiest was getting a little under 250 hits/day.
- I suspect if we needed this data WMF would be able to generate it (I am sure it is recorded by them, just not published with the main traffic data), but I also wonder if the fact it was no longer published might indicate some issues with the data that makes it less useful. Andrew Gray (talk) 23:28, 3 March 2023 (UTC)
- Thanks, Novem, and Andrew! User:Samwalton9 (WMF), is this in your bailiwick, or could you ping someone who might comment further on this? Thanks, Mathglot (talk) 23:53, 3 March 2023 (UTC)
- @Mathglot I don't know anything about this but I'll see what I can find out! Samwalton9 (WMF) (talk) 11:01, 6 March 2023 (UTC)
- Sam, thanks. The OP user has been notified, so at this point answers are for our own edification and curiosity. I'm still curious, but this isn't a "need-to-know" kind of thing, in case that helps you prioritize . Mathglot (talk) 11:09, 6 March 2023 (UTC)
- @Mathglot @Andrew Gray I asked our data analytics team and it seems like we don't have this data directly. There are a couple of adjacent methods we could use to get partial data on this but I don't think they would be all that helpful. We really don't collect much data on users at all unless we have a specific need for it! Samwalton9 (WMF) (talk) 10:48, 7 March 2023 (UTC)
- Sam, thanks. The OP user has been notified, so at this point answers are for our own edification and curiosity. I'm still curious, but this isn't a "need-to-know" kind of thing, in case that helps you prioritize . Mathglot (talk) 11:09, 6 March 2023 (UTC)
- @Mathglot I don't know anything about this but I'll see what I can find out! Samwalton9 (WMF) (talk) 11:01, 6 March 2023 (UTC)
- Thanks, Novem, and Andrew! User:Samwalton9 (WMF), is this in your bailiwick, or could you ping someone who might comment further on this? Thanks, Mathglot (talk) 23:53, 3 March 2023 (UTC)
- I've had a go at 2b. in quarry:query/71997. It only covers targets beginning with digits and A at the moment. The query could easily be changed to other initials, but goes off the rails on attempting the complete alphabet. Many of the results are linked from one or two widely used templates rather than directly in articles, e.g. Andesite Mountain Bushveld appears via {{Biodiversity of South Africa}}, which consists mainly of 400+ vegetation types lacking articles. Certes (talk) 23:18, 4 March 2023 (UTC)
- That's useful, and I've notified the Teahouse user; thanks! Mathglot (talk) 11:09, 6 March 2023 (UTC)
- @Certes One option I was wondering about is if we could count redlinks that have no inbound template links? It would of course lose some we are interested in, but it would also mean that the effects of heavily-used inbound templates can be avoided. I couldn't quite work out how to do it in Quarry, though.
- Investigating this also turned up another unexpected weirdness. A lot of the items here are US election articles, but while they seem to be linked from pages using {{United States elections}}, plus the yearly templates like {{1958 United States elections}}, at first glance, those templates don't have any redlinks. It turns out there is some complex trickery with #ifexist to make the links go to category pages if there is not an article, which somehow means it shows up as a link to the page despite not appearing as such in the article. Mysterious! and admittedly kind of tangential, but interesting to note. Andrew Gray (talk) 18:30, 6 March 2023 (UTC)
- Yes, I'm working on ignoring links via templates but the query will run like treacle, especially now they have unhelpfully removed the two useful columns (tl_namespace and tl_title) from the templatelinks table. Using #ifexist records a wikilink, despite requests to change that behaviour. (I wrote a sneaky {{Linkless exists}} template to check without recording a link, but it exploits an undocumented information leak and may stop working.) Certes (talk) 18:53, 6 March 2023 (UTC)
- I don't think it's possible in SQL. Excluding only the templates linked in the article runs s l o w l y: just the prefix Ab takes 18 minutes. Excluding all templates runs faster. However, neither actually works. For the reason, consider Abkhaz art. It has 75 incoming links via
{{Asian topic||art}}
and similar, but {{Asian topic}} itself doesn't link there (it's not about art) and so that link can't be detected as template-made. Someone would have to search a wikitext dump for [[Abkhaz art]] and variants, and repeat that search for thousands of other redlinks. Another complication is that we may want to count links via some templates: if someone deliberately links to{{stn|Narnia}}
then that makes Narnia station a wanted article. Certes (talk) 19:23, 6 March 2023 (UTC)- I'm dying to create Category:Queries that run like treacle. Someone please hold me back, or I'm gonna get blocked for violating some weird no-cat rule. Mathglot (talk) 19:41, 6 March 2023 (UTC)
- Also relevant: phab:T14019. Certes (talk) 19:54, 6 March 2023 (UTC)
- That ticket belongs in related Category:Phab links that run like treacle. It'll be old enough to get its driver's license (or sweet sixteen party) later this year. Mathglot (talk) 22:09, 6 March 2023 (UTC)
- Also relevant: phab:T14019. Certes (talk) 19:54, 6 March 2023 (UTC)
- Oh argh - I had forgotten the complication of the "generate a topic" navigation boxes. Yes, those would definitely mess things up. Amazing how this is one of those questions that feels like it ought to be simple to answer, but in practice is more or less impossible! Andrew Gray (talk) 22:52, 6 March 2023 (UTC)
- It's possible. It just needs someone to download the entire wikitext of article namespace, extract the link targets (bonus point for finding links in file captions), sort and count, eliminate the blue links by matching to a list of pages (including redirects, etc.), then sort by count descending. It's quite simple processing, but on a large dataset. There may be some platform somewhere that both allows software to run and has a wikitext dump, but I've yet to find one, and I expect there would be a ton of bureaucracy to wade through before running the ten-line program. Certes (talk) 23:41, 6 March 2023 (UTC)
- I'm dying to create Category:Queries that run like treacle. Someone please hold me back, or I'm gonna get blocked for violating some weird no-cat rule. Mathglot (talk) 19:41, 6 March 2023 (UTC)
Unsourced BLPs from autopatrollers
I revoked autopatrolled the other day from someone who had been creating (among other things) unsourced BLPs, and that left me wondering whether it was just an isolated incident or if there was a broader problem. Would it be possible to get a list of articles that 1) were created by someone with the autopatrol right and 2) were deleted with a log summary mentioning "BLPPROD" 3) within the last year? I'd certainly appreciate it. Thanks in advance! Extraordinary Writ (talk) 20:53, 16 March 2023 (UTC)
- Does the database retain creation details for deleted pages? They are available via the API, but probably not in SQL. Certes (talk) 22:16, 16 March 2023 (UTC)
- I'm definitely not an expert on Quarry, but the database would have all the entries from Special:Log/create, right? That's not perfect (it doesn't include pages moved to mainspace from draft/userspace, for instance), but it'd at least be better than nothing. Extraordinary Writ (talk) 22:35, 16 March 2023 (UTC)
- I've created quarry:query/72365 but it shows nothing in the last year. At least one of the articles was created by a very respected editor, temporarily undeleting someone else's page rather than writing an new BLP. Certes (talk) 21:12, 18 March 2023 (UTC)
- Great, thanks. Happy to see it's not a major issue! Extraordinary Writ (talk) 21:40, 18 March 2023 (UTC)
- I've created quarry:query/72365 but it shows nothing in the last year. At least one of the articles was created by a very respected editor, temporarily undeleting someone else's page rather than writing an new BLP. Certes (talk) 21:12, 18 March 2023 (UTC)
- I'm definitely not an expert on Quarry, but the database would have all the entries from Special:Log/create, right? That's not perfect (it doesn't include pages moved to mainspace from draft/userspace, for instance), but it'd at least be better than nothing. Extraordinary Writ (talk) 22:35, 16 March 2023 (UTC)
Redirects to disambiguation pages that are not dab pages
I've been working underground, cleaning up a "trainwreck", which lead me to find pages like Buffalo Boys (disambiguation) and Chris Breen (disambiguation) transcluding {{R to disambiguation page}} and thus populating Category:Redirects to disambiguation pages. But these aren't redirects to disambiguation, they are redirects to articles, and there is no active mechanism flagging them as errors needing correction, so I don't know how many other pages like these are lurking on Wikipedia.
I've been aware of this issue since at least April 2017, when I put it on my to-do list:
- To do: Check these 192,000+ pages tagged "R to disambiguation page" to make sure they actually do redirect to disambiguation pages.
We're now up over 254,000 such redirects. I guess solving the issue hasn't gotten any easier over the past six years.
My first thought was to try to track the problem in real-time using template magic. I could modify {{R to disambiguation page}} to examine the {{target of}} the page transcluding the template to confirm that it populates Category:All disambiguation pages, and report all redirect targets that don't populate that category as errors. I found {{if in category}} which might theoretically do that, if it didn't only detect categories directly on a page, not those added indirectly through templates. Looking at how that "very expensive" template works using {{find page text}} I suppose it might be possible to search for the templates indirectly populating the category, including all their redirect-variant names. Ick.
So my second thought was to create a database query to find them. This query pulls up the first 200 pages in the redirect table, but as soon as I add:
- INNER JOIN categorylinks c ON p.page_id = c.cl_from
- WHERE c.cl_to = "Redirects_to_disambiguation pages"
it finds nothing. Is there a way to make this work? Or is it not possible because such a query is too inefficient and resource-intensive?
If a database query can't be made to work, I guess my third thought is that I need to write a PHP bot that likely will take a few hours to manually walk through all 254,000+ redirects and check them one-at-a-time and generate a list of the bad ones like Buffalo Boys (disambiguation) and Chris Breen (disambiguation). wbm1058 (talk) 17:36, 20 March 2023 (UTC)
- @Wbm1058: First the easy bit: you need an underscore (not a space) before "pages" on line 11. Some "Foo (disambiguation)" redirects target lists with a dab-like function. We'd need to eliminate those from our enquiries, as the denizens of RfD are not sympathetic to their deletion. Other than that, we should be able to solve this in SQL: I'll have a go. Certes (talk) 19:08, 20 March 2023 (UTC)
- I've had a go in quarry:query/72459. I found 439 where the target is not a dab, list, SIA or name page. This includes a few like Big Butt Mountain (disambiguation) where the target is dab-like but not marked as such in any machine-readable way. We may be able to refine the query further. Certes (talk) 19:39, 20 March 2023 (UTC)
- Thanks! As to that space that should have been an underscore, oops... uggh. I shouldn't try to work on more complex stuff like this when I'm tired and have basketball games running in the background. I see you used the fairly-new linktarget table – that's something I might not have found or thought to use until after I'd banged on this for days more! wbm1058 (talk) 19:54, 20 March 2023 (UTC)
- linktarget is just a nuisance. The page linked to used to be stored in templatelinks; now we have to do another join to get it, which slows everything down (though not by much in this case). categorylinks and pagelinks will suffer the same misfortune soon. My favourite redirect has to be Undisputed WWE Universal Championship (disambiguation); there really shouldn't be more than one of those. Certes (talk) 21:41, 20 March 2023 (UTC)
- @Certes: doh: User:RussBot/Non-disambiguation redirects/001. Funny I even listed that on my user page under §Redirects. But I had my to-do item listed on my user page under §Disambiguation. Now at User:Wbm1058/Redirects to disambiguation pages that are not dab pages. Heh, this way I can generate an updated list whenever I want, rather than wait for someone else's bot to run. Thanks for your help. – wbm1058 (talk) 20:23, 21 March 2023 (UTC)
- The two reports do slightly different things but either may work for your needs. How do you update RussBot's generated list? I've copied my query to User:Certes/Reports/Redirects to dab where target is not a dab, which has a link to re-run it on demand without forking it on Quarry. Certes (talk) 22:22, 21 March 2023 (UTC)
- I just drop a note on Russ talk when I notice that his bot hasn't run recently. Your report is cleaner than the one I put on my user subpage. I'd forgotten about the RussBot report and if I hadn't I wouldn't have asked for help here. I found it again when I saw via "what links here" it was linking to many of the same pages as your report. These reports are a symptom of our lack of competent admins. Problem caused by page-movers who swap-dump dab histories to off-topic pages because they don't have the power to delete them (which is the best solution). Page-movers' solution is an icky hack. Swaps should only be made with same-topic variants (other spelling or capitalization, synonym, longer/shorter name, etc. And the (disambiguation) redirect is like the musical chair that can't find a place to swap to, so it lingers. – wbm1058 (talk) 23:07, 21 March 2023 (UTC)
- The two reports do slightly different things but either may work for your needs. How do you update RussBot's generated list? I've copied my query to User:Certes/Reports/Redirects to dab where target is not a dab, which has a link to re-run it on demand without forking it on Quarry. Certes (talk) 22:22, 21 March 2023 (UTC)
- @Certes: doh: User:RussBot/Non-disambiguation redirects/001. Funny I even listed that on my user page under §Redirects. But I had my to-do item listed on my user page under §Disambiguation. Now at User:Wbm1058/Redirects to disambiguation pages that are not dab pages. Heh, this way I can generate an updated list whenever I want, rather than wait for someone else's bot to run. Thanks for your help. – wbm1058 (talk) 20:23, 21 March 2023 (UTC)
- linktarget is just a nuisance. The page linked to used to be stored in templatelinks; now we have to do another join to get it, which slows everything down (though not by much in this case). categorylinks and pagelinks will suffer the same misfortune soon. My favourite redirect has to be Undisputed WWE Universal Championship (disambiguation); there really shouldn't be more than one of those. Certes (talk) 21:41, 20 March 2023 (UTC)
- Thanks! As to that space that should have been an underscore, oops... uggh. I shouldn't try to work on more complex stuff like this when I'm tired and have basketball games running in the background. I see you used the fairly-new linktarget table – that's something I might not have found or thought to use until after I'd banged on this for days more! wbm1058 (talk) 19:54, 20 March 2023 (UTC)
Retention: edit activity of Wiki Education student editors after their course ends
For purposes of WP:EDITORRETENTION, I'm interested in gathering some data on the contribution history of Wikipedia:Education program student editors after their course ends. (Sample Wiki Ed course page here.) These Wiki Ed student editors typically contribute for several weeks during the second half of their semester course, and then disappear immediately after the course ends, no doubt busy with college courses, graduation, and real life. Afaik, there's almost a total lack of data about what happens afterward, so it's not even clear what query to formulate, and I suspect it might be a refinement process, with early results informing later queries. Maybe one might start with a scoping query to see what we're dealing with, perhaps this:
- List the userids of the top N Wiki Ed students who were or are still actively contributing after delta elapsed time after their course ends, sorted by total edits after the end date of their course.
Columns might be:
- Userid of student editor[a][b] (preferably wikilinked; otherwise I'll post-process)
- Date of the course[c] – the tricky part; this is inferrable, maybe, from the course name; see notes.
- total edits at Wikipedia
- total edits since course ended + delta (drop col if too hard; see notes).
Notes:
- a "Wiki Ed student" is any user whose UTP contains (at least one) {{dashboard.wikiedu.org talk course link}} template on it.
- The date of the course (the tricky part) – this may require some parsing of the course name. Most (but not all) of these courses have a course name containing the season and year, e.g.,
Fall 2019
, at the end of the course name, but not all do. If we recast 'Winter 2019' as '2019 Q1', 'Spring 2019' as '2019 Q2', and so on, it would give us an easily sortable value. (Or just '2019-1', -2, -3, -4 if better.) If parsing is hard, try any 4-digit sequence in (2015..2023) in the course name, especially if it's anchored at the end of the string. If year or season/quarter cannot be determined, probably we should throw out the row.[d]- multiple courses – it's possible that a user may have participated in more than one Wiki Ed course (corresponding to the number of templates on their Talk page). This is relatively infrequent, so we could just exclude students in more than one course if that's easier; alternatively "course date" would be the date of the most recent course listed on the Talk page. Or if easier, just use the first template on the page; it's rare enough it won't matter much in the results if it's not the most recent one.
- total edits since course end + delta : i.e., date of course + length of course (assume 4 months) + delta (another 6 mos.?)
- Sort by: total edits since course end+delta, then by userid. If 'total edits since course end+delta' is problematic, then 'total edits' is an acceptable proxy for it as a first approximation as far as sorting, since training edits + course edits is a fairly static number for most students.
This query is in support of this discussion at the Wikipedia:Education noticeboard. Thanks, Mathglot (talk) 00:31, 31 March 2023 (UTC)
- Hi @Mathglot:
- This took me literally half a day (Gods I'm rusty with SQL) but I think I've got it.
- Instead of trying to parse dates (which would have been ridiculously hard), I decided to instead look at when pages were created. Basically the idea that any professor would start their course page some given time before starting the actual course.
- It was a bit hard to figure out exactly what you wanted and meant, so I took my best judgement and whatever template interactions already existed... And made a few assumptions for that. They are
- - Every program has a page in the Wikipedia namespace (For example)
- - program page has Template:Dashboard.wikiedu.org_talk_course_link on it
- - program start date is date of creation of program page on wiki
- - students are all listed on program page (with a link to their username)
- - all students have Category:Dashboard.wikiedu.org_course_pages on their talk page
- I think those were all the assumptions. Taking those, I just made 3 possible options for what you might find useful. Hopefully all the columns and descriptions are self explanatory -
- - Assuming last edit on the Program page was "End of course", Edits by student since end of course - https://quarry.wmcloud.org/query/72697
- - Assuming first edit on Program page was "start of course", edits by student since 4 months after 'start of course' - https://quarry.wmcloud.org/query/72702
- - Assuming first edit on Program page was "start of course", edits by student since 10 months after 'start of course' - https://quarry.wmcloud.org/query/72703
- There's going to be a few outliers (courses that are still ongoing etc) but I think by and large you'll cover most editors. I believe anyone who has 0 edits since "end of course" is just skipped here instead of being listed with 0 edits. (Any student who is listed on other course page for any reason [say they're a coordinator for next year without being a re-enrolled] would also be an outlier)
- I think the query is written in a way that you can tweak it yourself if you understand basic SQL. But if not, feel free to ping me if you need some modifications on these queries/ideas. No promises this time though (I did spend multiple hours relearning my SQLs for this :) ) but overall I think this provided enough columns for you to most "basic" manipulation to be things you can do easily.
- Let me know if this is enough.
- I dont believe my query handles the usecase of users who participated in multiple courses. It chooses one of the courses, but I dont believe I specified which course anywhere. Hopefully that wont skew things too much Soni (talk) 11:49, 31 March 2023 (UTC)
- Wow! This is great stuff. I need some time to delve in and catch up with this; will probably export as table and tweak formatting for display at the WP:ENB discussion. (Used to be pretty fair at SQL eons ago, but I'm ignorant about the table structure, but as you've laid it all out, I should be able to clone/tweak as needed; if I hit a roadblock, I'll ask you.) Thanks again for this amazing effort! Mathglot (talk) 22:04, 31 March 2023 (UTC)
Notes
- ^ Here's a link to the first 500 of 80,000 user talk pages containing the template.
- ^ "Actively contributing" – had at least N edits after the end+delta date; N=1 if not too many results, else N=10, or whatever works well for the query time.
- ^ Here are 5,200 users with courses from 'Fall 2022'.
- ^ If this data collection idea flies at all, we might have to request Wiki Ed folks to automatically add parameters to that template which already exist (i.e.,
|end_date=
) but which are not currently added to the template by default. That will make this value much easier to determine, but we don't have that yet. Or maybe don't throw out the row, but consider year only, and ignore season.
ReferenceExpander
As described over at the admin noticeboard, the ReferenceExpander bot hasn't always been expanding references. Instead, it does things like following a URL that now redirects to a page which doesn't have the information being cited, and then replacing a manually formatted footnote that includes an archive URL with a {{cite web}} template that doesn't. It'll drop bylines from articles, eliminate quotes, ignore the second of two references put in the same footnote, etc. Is there a way to get a list of all the edits that include "ReferenceExpander" in the summary, maybe sorted by delta-article-size so that we can prioritize checking those cases where it made the article shrink? XOR'easter (talk) 21:52, 1 April 2023 (UTC)
- @XOR'easter At least for edits within the last 90 days, this is pretty easy: this is a query for those within the last few days, which can be expanded to go further back. Galobtter (pingó mió) 22:10, 1 April 2023 (UTC)
- I made the query order by the delta of the edit. Galobtter (pingó mió) 22:14, 1 April 2023 (UTC)
- Thanks! That's a great start. We'll probably have to go further back, but it's a very helpful beginning. XOR'easter (talk) 22:18, 1 April 2023 (UTC)
- What does it look like for the past 90 days? (And is it possible to get direct links to specific diffs?) XOR'easter (talk) 15:20, 2 April 2023 (UTC)
- There's a summary by month here; just under half of all edits mentioning ReferenceExpander were in 2023. I've been trying to find a date range just short enough that the query doesn't time out - searching by comment is slow, and much slower once you're past the start of the recentchanges table. —Cryptic 16:32, 2 April 2023 (UTC)
- Ah, good to know. This is not a side of the project that I've had much experience with at all! XOR'easter (talk) 16:40, 2 April 2023 (UTC)
- 2023 edits. You can get to a specific diff by plugging the rev_id into Special:Diff/ - for example, Philoserf's January 11 edit to Libertarianism with a -17031 delta is at Special:Diff/1132868351. —Cryptic 17:17, 2 April 2023 (UTC)
- Thanks! XOR'easter (talk) 19:35, 2 April 2023 (UTC)
- There's a summary by month here; just under half of all edits mentioning ReferenceExpander were in 2023. I've been trying to find a date range just short enough that the query doesn't time out - searching by comment is slow, and much slower once you're past the start of the recentchanges table. —Cryptic 16:32, 2 April 2023 (UTC)
- I made the query order by the delta of the edit. Galobtter (pingó mió) 22:14, 1 April 2023 (UTC)
Most thanked bots
Result set seems a bit small. What'd I mess up? https://quarry.wmcloud.org/query/72954 –Novem Linguae (talk) 01:53, 11 April 2023 (UTC)
- Looks like you can't thank usernames with the bot flag. That probably explains the small result size. Query is probably fine. You can disregard my question. Thanks. –Novem Linguae (talk) 02:06, 11 April 2023 (UTC)
# of edits each admin had at the time of their RFA
Requested on Discord by @Ixtal. They're interested in getting the # of edits each admin had at the time of their RFA.
I'm thinking this could possibly be done with a subquery, and feeding the SQL query some manual data. For example, if I had something like...
CREATE TEMPORARY TABLE list_of_admins (user_name varbinary(255), rfa_date binary(14));
INSERT INTO list_of_admins SET user_name = 'Test1', rfa_date = '20210101000000';
INSERT INTO list_of_admins SET user_name = 'Test2', rfa_date = '20220202000000';
SELECT user_name, rfa_date, edit_count AS (
SELECT COUNT(*)
FROM revision
JOIN actor ON rev_actor = actor_id
JOIN user ON actor_user = user_id
WHERE user.user_name = list_of_admins.user_name
AND rev_timestamp < rfa_date
)
FROM list_of_admins;
But quarry doesn't allow temporary tables. Any ideas? –Novem Linguae (talk) 01:55, 7 April 2023 (UTC)
- Quarry:query/72861 gets you part of the way there; it produces a list of 720 admins and the date of their first RfA. I'm not sure where the other 188 admins are, and I'm not certain yet how to get the date of the most recent RfA. BilledMammal (talk) 04:26, 7 April 2023 (UTC)
- Some of the 188 admins are going to be users who got renamed after their RfA. Galobtter (pingó mió) 04:48, 7 April 2023 (UTC)
- I assumed that most of those would have had redirects created from their new name to their former name's RfA?
- I alternatively looked at the logs, but those only go back to 2012, and I haven't considered how to exclude resysop's yet. BilledMammal (talk) 04:56, 7 April 2023 (UTC)
- Hmm, I guess redirects are probably done. I suppose there are some RfA by email people still around but I don't think that many. Galobtter (pingó mió) 05:26, 7 April 2023 (UTC)
- Indeed, I created redirects for basically all renamed admins' RfAs back in 2019, except for a few who were renamed for obvious privacy reasons. * Pppery * it has begun... 22:21, 7 April 2023 (UTC)
- Hmm, I guess redirects are probably done. I suppose there are some RfA by email people still around but I don't think that many. Galobtter (pingó mió) 05:26, 7 April 2023 (UTC)
- Some of the 188 admins are going to be users who got renamed after their RfA. Galobtter (pingó mió) 04:48, 7 April 2023 (UTC)
- Besides just the renames, those whose first RFA was unsuccessful, and the handful from before there was an RFA at all, there's a fairly large chunk of admins still around from before RFA had subpages. Building a query for this part is never going to be accurate enough.You can simulate your temp table with if you can squeeze it into the max query length. I don't know offhand how long a query Quarry supports. (Also, you're leaving out deleted edits.) —Cryptic 06:01, 7 April 2023 (UTC)
WITH list_of_admins (user_name, rfa_date) AS (SELECT 'Test1', '20210101000000' UNION SELECT 'Test2', '20220202000000' UNION SELECT 'Test3', '20230303000000' -- ... ) SELECT user_name, rfa_date, edit_count -- ...
- Amazing answer as always Cryptic. Very good to see you back :) –Novem Linguae (talk) 07:15, 7 April 2023 (UTC)
Quarry:query/72815 produces a list of 626 editors with the timestamp of their earliest (IIRC, although I'm not sure how the group by works here) granting of sysop rights. Doesn't check if their current ug includes sysop, tho. Also, I'm looking for the # of edits for each admin after their RFA. — Ixtal ( T / C ) ⁂ Non nobis solum. 08:41, 7 April 2023 (UTC)
- Quarry:query/72865 now does basically what I want to do (in this case manually selecting 28bytes rather than iterating through revisions for each admin). What I would do is if I can more or less accurately get the time of first adminship (the subquery adt) I could probably write some code that iterates through revision for each row in adt and appends the result to a temp table to be returned or something. — Ixtal ( T / C ) ⁂ Non nobis solum. 08:59, 7 April 2023 (UTC)
- I think we could make a query that combines your "get admins" query Quarry:query/72815 with my "get edit count at X time" query Quarry:query/72876 as a subquery. Then you can do it all in one go, without needing Toolforge or a programming language or loops or anything like that. How does Quarry:query/72877 look? –Novem Linguae (talk) 17:30, 7 April 2023 (UTC)
- Mine won't show up due to age and other factors, but my edit count was noted at the time of my RfA – a little under 33,000 when it started, and a little over 34,000 by the time it closed. I have since added ~2,120,000. BD2412 T 17:45, 7 April 2023 (UTC)
- Flipping the inequality sign, a fork of 72877 (Quarry:query/72878) seems to not be correct. A Train, for example, shows as 0 edits since adminship in 2016 but last edited this January. — Ixtal ( T / C ) ⁂ Non nobis solum. 17:54, 7 April 2023 (UTC)
- log_name is like page_name; it replaces spaces with underscores. user_name doesn't. There's no row in user with user_name = 'A_Train'.You should also be joining archive_userindex to get a count of deleted edits; at that point, you can just subtract the pre-RFA sum from user.user_editcount to get the post-RFA. —Cryptic 19:04, 7 April 2023 (UTC)
- Thanks, Cryptic. 72878 works well now without joining archive_userindex. The way I tried joining it is probably wrong, though, as it takes over 20 minutes without completing ([2]). — Ixtal ( T / C ) ⁂ Non nobis solum. 15:34, 8 April 2023 (UTC)
- I worded that very sloppily, sorry. Here you go. As I say in the comments there, I don't think this approach of pulling usernames/timestamps from the logs (or from RFA pages, either) is ever going to be good enough to answer the original request's intent. —Cryptic 20:16, 8 April 2023 (UTC)
- Wow, that's fantastic Cryptic! I'll fork the query to get some more granular data and get to writing on whatever interesting stuff I find over at User:Ixtal/AAA. — Ixtal ( T / C ) ⁂ Non nobis solum. 12:20, 9 April 2023 (UTC)
- Managed to remove bots and account for early admin grants with this query I think. — Ixtal ( T / C ) ⁂ Non nobis solum. 20:21, 17 April 2023 (UTC)
- Wow, that's fantastic Cryptic! I'll fork the query to get some more granular data and get to writing on whatever interesting stuff I find over at User:Ixtal/AAA. — Ixtal ( T / C ) ⁂ Non nobis solum. 12:20, 9 April 2023 (UTC)
- I worded that very sloppily, sorry. Here you go. As I say in the comments there, I don't think this approach of pulling usernames/timestamps from the logs (or from RFA pages, either) is ever going to be good enough to answer the original request's intent. —Cryptic 20:16, 8 April 2023 (UTC)
- Thanks, Cryptic. 72878 works well now without joining archive_userindex. The way I tried joining it is probably wrong, though, as it takes over 20 minutes without completing ([2]). — Ixtal ( T / C ) ⁂ Non nobis solum. 15:34, 8 April 2023 (UTC)
- log_name is like page_name; it replaces spaces with underscores. user_name doesn't. There's no row in user with user_name = 'A_Train'.You should also be joining archive_userindex to get a count of deleted edits; at that point, you can just subtract the pre-RFA sum from user.user_editcount to get the post-RFA. —Cryptic 19:04, 7 April 2023 (UTC)
- I think we could make a query that combines your "get admins" query Quarry:query/72815 with my "get edit count at X time" query Quarry:query/72876 as a subquery. Then you can do it all in one go, without needing Toolforge or a programming language or loops or anything like that. How does Quarry:query/72877 look? –Novem Linguae (talk) 17:30, 7 April 2023 (UTC)
Identifying Lepidoptera redirect/stub mismatches
I run from time to time into former Lepidoptera articles that have since been redirected to another article, but where the WikiProject tags on the talkpage weren't updated and still claim it to be a stub (or start/etc., but Lepidoptera articles in classes other than stub are relatively scarce and have low enough numbers checking manually is feasible. Manually checking the near-100k stub-rated Lepidoptera pages, on the other hand, not so much.)
Could anyone by any chance write me a query, or point me in the right direction of how to do it myself, to identify all mainspace redirects where the accompanying talkpage is in Category:Stub-Class Lepidoptera articles? My gut feeling is that there are at least several hundreds, and quite possibly a few thousand, such mismatches. Depending on the scale of the results, I might either tackle the issue myself or request a bot task for it. AddWittyNameHere 09:59, 21 April 2023 (UTC)
- quarry:query/73264. —Cryptic 10:20, 21 April 2023 (UTC)
- Wow, 3519. I've forked that as quarry:query/73265 to identify other suspicious stub categories. The top ten are:
- — Preceding unsigned comment added by Certes (talk • contribs)
- Thanks, Cryptic! And oof, Certes. Can't say I'm entirely surprised by the 3519 such mismatches in Lepidoptera, even if I'd hoped for fewer, but really didn't expect it to be such a major thing in other projects too. (Especially WikiProject Romania. That's a whopping 38% of their stub-rated articles being redirects instead. Makes Lepidoptera's ~3.5% look almost negligible in comparison.) Oh well, at least it makes my decision easy: off to WP:BOTREQ it is. ~47k pages just across the top 10 involved WikiProjects is very clearly bigger than is appropriate to fix through manual editing. AddWittyNameHere 12:01, 21 April 2023 (UTC)
- EnterpriseyBot just fixed Talk:Aaron Schatz for us, and may already be working on other cases. Certes (talk) 12:16, 21 April 2023 (UTC)
- Thanks for pointing that out, I'll look into that. Might be easier to just pop over to Enterprisey's talk page instead of BOTREQ, in that case. AddWittyNameHere 13:38, 21 April 2023 (UTC)
- EnterpriseyBot just fixed Talk:Aaron Schatz for us, and may already be working on other cases. Certes (talk) 12:16, 21 April 2023 (UTC)
- Thanks, Cryptic! And oof, Certes. Can't say I'm entirely surprised by the 3519 such mismatches in Lepidoptera, even if I'd hoped for fewer, but really didn't expect it to be such a major thing in other projects too. (Especially WikiProject Romania. That's a whopping 38% of their stub-rated articles being redirects instead. Makes Lepidoptera's ~3.5% look almost negligible in comparison.) Oh well, at least it makes my decision easy: off to WP:BOTREQ it is. ~47k pages just across the top 10 involved WikiProjects is very clearly bigger than is appropriate to fix through manual editing. AddWittyNameHere 12:01, 21 April 2023 (UTC)
- — Preceding unsigned comment added by Certes (talk • contribs)
Stats on new page patrol process
I am trying to track the change in new page patrol outcomes over time. The backlog is rather low now, so this seems like an opportune time. I was wondering if anyone might try to query, perhaps grouped on a monthly basis:
1) the number of pages reviewed
2) the outcome - "approved" (ie, no action taken), CSDed, PROD, Draftify, BLAR. I think that would cover most cases.
I suspect (2) will be non-trivial. Maury Markowitz (talk) 14:34, 22 April 2023 (UTC)
Articles defining exactly one of two specific parameters when calling a specific template
Hi! I'm tryna get a template-protected template edit made to disambiguate the targets of the display text of two parameters that currently link to the same target.
I'm trying to figure out the use cases of having only one of these two parameters defined, since I haven't been able to find any, and the existence of such a set of articles would be the strongest counterargument against disambiguating the wikilinks, and I want to make sure I understand the effects of the proposed disambiguation before I start a discussion about it.
Specifically, I'm looking for a query that will return the list of articles that both: call Template:Infobox royalty and: pass it values for exactly one of {family_name
,clan_name
}. I understand this will involve parsing the source of pages, and this template has over 21,000 transclusions, so I do have a backup plan involving a proposal to add more complicated parsing logic to the template source if this query is not feasible. Thanks, Folly Mox (talk) 00:06, 28 April 2023 (UTC)
- There are 61 uses for Chinese royalty with only family_name; everything else has both (or neither). search. There are also 128 where clan_name is provided but is empty: search. Certes (talk) 11:35, 28 April 2023 (UTC)
- Thank you! 🙏🏽 Folly Mox (talk) 15:31, 28 April 2023 (UTC)
Vital article talk transclusions
For this discussion, I'd like to have data on how many talk pages have {{Vital article}} but not {{Talk header}}, as well as how many have {{Vital article}} but not {{WikiProject banner shell}}. I tried using PetScan but either it's currently broken or I did something wrong, so it didn't work. Can anyone help? Thanks, {{u|Sdkb}} talk 21:28, 4 May 2023 (UTC)
- 31,466 and 17,425 according to search. Each count includes the same 14,148 pages which have {{Vital article}} but neither {{Talk header}} nor {{WikiProject banner shell}}. For completeness, 15,295 pages have all three templates. Certes (talk) 21:49, 4 May 2023 (UTC)
- Thanks! {{u|Sdkb}} talk 21:58, 4 May 2023 (UTC)
Unused templates created by users with more than 10000 edits, sort by the number of edits by the creator
Some users who edit a lot may have created some unused templates a long time ago, but they forgot about it. At this time, I can remind them through this query and let them delete them by themselves. Q𝟤𝟪 06:50, 22 May 2023 (UTC)
- All unused templates are already listed at Wikipedia:Database reports/Unused templates (filtered)/1. There are only 2,100 left. – Jonesey95 (talk) 13:39, 23 May 2023 (UTC)
- quarry:query/73899 has the creator and editcount of templates linked from that page. (Not that I think editcount's a terribly good metric, and plenty of those templates look like they shouldn't have transclusions anyway, but whatever.) —Cryptic 20:12, 23 May 2023 (UTC)
"make a list of users with a specific role, and filtering by date of last contribution"
"the wiki nowiki and the roles I'm interested in are patruljør (patroller) and autopatruljerte (autopatrolled)."
Requested on Discord. I started a query here but got stuck on the "date of last contribution" column. How do I pull the timestamp of a user's most recent contribution from the revision table? Thanks. –Novem Linguae (talk) 15:22, 27 May 2023 (UTC)
- @Novem Linguae: You'd have to do something like quarry:query/73964. Certes (talk) 20:25, 27 May 2023 (UTC)
Candidate edit count at time of RFA
Would someone like to take a stab at a list of candidate edit counts at the time of their RFAs? Suggested algorithm: query all the subpages of WP:RFA, do a join on revision
to figure out the page created date, isolate the candidate name using REPLACE()
, then subquery the revision
table and do a count for that user before the page created date. 2nd and later RFAs won't isolate the username easily, which is fine, doesn't have to be perfect. Thanks. –Novem Linguae (talk) 16:43, 14 June 2023 (UTC)
- First half done at quarry:query/74495. –Novem Linguae (talk) 17:16, 14 June 2023 (UTC)
- I don't think you're likely to get any better answers than the last time you asked. —Cryptic 17:35, 14 June 2023 (UTC)
- Oh. I asked on behalf of another user so I guess it didn't stay in my memory. Sorry for the double question. Looks like quarry:query/72911 is the best query from that thread. –Novem Linguae (talk) 17:41, 14 June 2023 (UTC)
- Generating the list of admins there is pretty well isolated; you should be able to replace the use of the rights log with your parse of RFA subpages. But with two months to think on it, I still think the only reasonable way to get that list of admins and promotion dates is manually. (Bonus: you can include unsuccessful rfas too if you do it like that.) If you can't fit the not-really-a-temp-table into the query with the WITH (SELECT...UNION...UNION...UNION) hack, you can create a fake userspace subpage with redlinks like 20230608130912 Novem_Linguae, pull the rows out of pagelinks, and then split pl_title to get the timestamp and admin name back. —Cryptic 17:51, 14 June 2023 (UTC)
- Oh, and Ixtal refined that query a bit more, and continued the discussion at WT:RFA#pre/post-RFA edits for admins; I think the final version is quarry:query/73123. —Cryptic 17:55, 14 June 2023 (UTC)
- Oh. I asked on behalf of another user so I guess it didn't stay in my memory. Sorry for the double question. Looks like quarry:query/72911 is the best query from that thread. –Novem Linguae (talk) 17:41, 14 June 2023 (UTC)
- I don't think you're likely to get any better answers than the last time you asked. —Cryptic 17:35, 14 June 2023 (UTC)
Find all simple cycles on category trees
I wrote a query to show short cycles (not longer than 3 for now) in paths column at quarry:query/74589. It simply queries category-links from categories recursively and keeps each path to find a cycle. The cycles are duplicate if they could be reached from different start points. That leads to inefficiency of this query, which will timeout if the parameter max_recursive_iterations is assigned larger than 3. I'm here to ask for a way to reduce duplicates and get each simple cycle only once during the query. NmWTfs85lXusaybq (talk) 14:24, 19 June 2023 (UTC)
Mixed script redirects
Could someone compile a list of mixed script redirects? For example, if Αlpha (Greek Alpha + latin lpha) → Alpha existed, it should be picked up. Or if Вaseball (Cyrillic B + latin aseball) → Baseball existed, also list it. Perhaps highlighting the non-dominant characters? E.g.
Mixed script | Target | Scripts |
---|---|---|
Αlpha | Alpha | Latin/Greek |
Вaseball | Baseball | Latin/Cyrillic |
Imagine the red links being blue, because I'm assuming they exist rather than be hypothetical. Headbomb {t · c · p · b} 14:24, 18 June 2023 (UTC)
- @Headbomb Thanks for requesting this query.
- Some examples of pages I've nominated for deletion (with Cyrillic characters hilighted) "Belgrade Мunicipalities" "Lеlаng Соmmаndеry 192.76.8.65 (talk) 14:40, 18 June 2023 (UTC)
In the above format, that would be
Mixed script | Target | Scripts |
---|---|---|
Belgrade Мunicipalities | Subdivisions of Belgrade | Latin/Cyrillic |
Lеlаng Соmmаndеry | Lelang Commandery | Latin/Cyrillic |
Headbomb {t · c · p · b} 14:47, 18 June 2023 (UTC)
- Also I don't see why the query should only include redirects? If an article is located at one of these titles it would need moving. 192.76.8.65 (talk) 14:45, 18 June 2023 (UTC)
- Yes, I suppose that's true. I just considered the possibility unlikely, but there ought to be corner cases around. Headbomb {t · c · p · b} 14:48, 18 June 2023 (UTC)
- Which has occurred with Chervoniy Gіrnik (which I have since moved). As a side note, should we keep the redirects around for articles that were at the mixed-script title for a long period of time to avoid breaking external links (like in this case)? Skarmory (talk • contribs) 09:30, 20 June 2023 (UTC)
- Not necessarily. α,β-Unsaturated carbonyl compound (Greek α,β; Latin rest) is there as the result of an WP:RM. It is the actual true name of the given compound. I also created Template:R from homoglyph (as a redirect to {{R from misspelling}}, check discussion linked at template history). This one was created for use on A,β-Unsaturated carbonyl compound (Latin A, Greek β; Latin rest), because A (Latin) is a homoglyph of Α (Greek), which appears as the url of the article, due to first character capitalisation on MediaWiki. —CX Zoom[he/him] (let's talk • {C•X}) 19:20, 20 June 2023 (UTC)
- Also I don't see why the query should only include redirects? If an article is located at one of these titles it would need moving. 192.76.8.65 (talk) 14:45, 18 June 2023 (UTC)
- I've checked for Greek and Cyrillic characters in quarry:query/74602. Other scripts are available, but that looks like enough cases for now. Most of them seem to be legitimate. Certes (talk) 20:04, 18 June 2023 (UTC)
- @Certes: Would there be a way to filter out chemicals and stars? Like Androst-1-ene-3β,17β-diol redirects to 1-Androstenediol, which has {{Infobox drug}} (and several chemical-related categories) on it? Likewise for Υ Leonis → Upsilon Leonis, which has {{starbox begin}}, Headbomb {t · c · p · b} 21:31, 18 June 2023 (UTC)
- @Certes also would it be possible to exclude pages containing "μSA"? They account for approx 1/16th the results and it's the terminology for a Micropolitan statistical area. Thank you very much for running this query! 192.76.8.65 (talk) 21:45, 18 June 2023 (UTC)
- I thought those were nonsense, good catch. Headbomb {t · c · p · b} 21:47, 18 June 2023 (UTC)
- Changed, but we've not eliminated much. Categories don't seem particularly useful here as they tend to be detailed and low-level, e.g. Category:Androgens and anabolic steroids, with no {{allincluded}} ancestors. Certes (talk) 22:17, 18 June 2023 (UTC)
- @Certes Thanks a bunch, Headbomb edited their comment to add {{starbox begin}} as another template worth excluding, would it be possible to add that to the list of templates?
- Thanks again! I can see a load of pages in there that could use cleanup! 192.76.8.65 (talk) 22:22, 18 June 2023 (UTC)
- Starbox added. I've also excluded common subunits μF, μm, μg and μs. Certes (talk) 22:31, 18 June 2023 (UTC)
- Are we mainly interested in titles having Latin and Greek/Cyrillic letters adjacent, as in "Мunicipalities" (with a Cyrillic M), at least as a first pass? That gives a much shorter list. Certes (talk) 22:24, 18 June 2023 (UTC)
- That query does seem to have a much higher true positive ratio, that looks like it will be a good starting point at least.
- Thank you, 192.76.8.65 (talk) 22:41, 18 June 2023 (UTC)
- First RFD batch at Wikipedia:Redirects_for_discussion/Log/2023_June_18#Mixed_Latin_and_Greek/Cyrillic_scripts,_A Headbomb {t · c · p · b} 23:59, 18 June 2023 (UTC)
- @Headbomb: A few of those had incoming links, which I'm fixing. Do we mind? You may also be interested in User:Mykhal/Articles with name in multiple scripts. Certes (talk) 20:38, 19 June 2023 (UTC)
- First RFD batch at Wikipedia:Redirects_for_discussion/Log/2023_June_18#Mixed_Latin_and_Greek/Cyrillic_scripts,_A Headbomb {t · c · p · b} 23:59, 18 June 2023 (UTC)
- Think I've cleaned up all the egregious mixed-script titles that were still actual article titles and not redirects, with the possible exception of Russian_locomotive_class_А, which I'm not sure about its status as a potentially acceptable mixed-script title? There's barely any google results for it, and the one citation in the article never mentions the article's title; I'm not sure if it's an official name or just another mixed-script title error. I would assume this should be moved, but I wanted to check first. (There's also a fair amount of links to 45th Air and Air Defenсe Forces Army and Rogožarski SIM-Х that need to be fixed, but I'll get around to that eventually unless there's a better way to do it than manually.) Skarmory (talk • contribs) 10:11, 20 June 2023 (UTC) edited 10:29, 20 June 2023 (UTC)
- AWB or JWB is your friend for repeating similar edits tens or hundreds of times with manual checking. They have a learning curve, but it's worth it. X here is a Roman numeral, so Х seems wrong, and why on earth would we spell Defence with a с? Certes (talk) 11:01, 20 June 2023 (UTC)
- In case anyone hasn't noticed, clicking the RedirectTarget heading in quarry:query/74617 sorts the results by redirect target, handily bringing pages which are articles rather than redirects to the top. Certes (talk) 11:05, 20 June 2023 (UTC)
- Yeah, that's how I sorted them; only reason I actually got through the articles. It seems like the rest are at least not misleading mixed-script titles, though I'm not sure where consensus stands on ones like Дeva, and it's always possible I glanced over a misleading one.
- As for AWB/JWB, I've never really thought about them since they needed a WP:PERM request, but I might as well make one now. (And yes, the Defence/SIM-X ones are bad titles and I've moved them as such; they have leftover links to the old title. The one I'm seeking input on is Russian locomotive class А.) Skarmory (talk • contribs) 11:46, 20 June 2023 (UTC)
- The place to request access is Wikipedia:Requests for permissions/AutoWikiBrowser, even if you only intend to use JWB. The loco class is a tough choice. Neither А nor A seems obviously wrong, and I can't find any English sources. List of Russian steam locomotive classes shows both alphabets, but similar titles such as Russian locomotive class Kh use Latin transliteration with no incoming redirect from Cyrillic. It's a "weak move" from me, if only for consistency. Certes (talk) 12:35, 20 June 2023 (UTC)
- JWB gotten, links fixed from mainspace. Non-mainspace links can stay as they're probably not actively hurting anything and given I think these should/probably will be kept around as {{R from move}}s to not break external links they don't need urgent fixing.
- As for the move, a "weak move" is sort of what I have too, so I think I'll probably take it to WP:RM. Can't hurt to get a few more opinions. Skarmory (talk • contribs) 13:26, 21 June 2023 (UTC)
- Either RM, or I don't think anyone will object to a bold move on grounds of consistency with the others in Category:Steam locomotives of the Russian Empire. Certes (talk) 18:55, 21 June 2023 (UTC)
- I think that category convinces me the bold move is the best route. Will move it, then. Skarmory (talk • contribs) 23:14, 21 June 2023 (UTC)
- Either RM, or I don't think anyone will object to a bold move on grounds of consistency with the others in Category:Steam locomotives of the Russian Empire. Certes (talk) 18:55, 21 June 2023 (UTC)
- The place to request access is Wikipedia:Requests for permissions/AutoWikiBrowser, even if you only intend to use JWB. The loco class is a tough choice. Neither А nor A seems obviously wrong, and I can't find any English sources. List of Russian steam locomotive classes shows both alphabets, but similar titles such as Russian locomotive class Kh use Latin transliteration with no incoming redirect from Cyrillic. It's a "weak move" from me, if only for consistency. Certes (talk) 12:35, 20 June 2023 (UTC)
Find all instances of journal=publisher
If I could have one query per parameter, that would be great. Specifically, in all citation templates ({{cite xxx}} and {{citation}}) find all instances of
|encyclopedia=Foobar
=|publisher=Foobar
|journal=Foobar
=|publisher=Foobar
|magazine=Foobar
=|publisher=Foobar
|newspaper=Foobar
=|publisher=Foobar
|series=Foobar
=|publisher=Foobar
|title=Foobar
=|publisher=Foobar
|website=Foobar
=|publisher=Foobar
|work=Foobar
=|publisher=Foobar
and also
|journal=Foobar
=|series=Foobar
The matches should be exact, save for whitespace. Headbomb {t · c · p · b} 12:31, 25 June 2023 (UTC)
- Quarry can't do that, because it can't access wikitext. Assuming Foobar means the same text must occur twice, Cirrus search can't do that either, because its limited regexp doesn't do backreferences. This may need a bot. Certes (talk) 13:51, 25 June 2023 (UTC)
- Same as the last time you asked. —Cryptic 14:43, 25 June 2023 (UTC)
- Is there a tool which can run any grep-like search (preferably PCRE) on a wikitext dump (article namespace would suffice)? If not then this seems like an easy win to solve this sort of problem. Runs could take a while, but then so do some Quarry queries. Certes (talk) 16:43, 25 June 2023 (UTC)
- AWB's offline database scanner can do this, I think. –Novem Linguae (talk) 16:49, 25 June 2023 (UTC)
- Yes, as can grep, perl, etc., but I'd need to download a multi-gigabyte dump file and keep it up to date. I was thinking of somewhere that I can send the search regexp and receive a results set measured in kB or low MB, either titles with captures or at worst the full wikitext of a small number of articles. Certes (talk) 17:00, 25 June 2023 (UTC)
- I download the dumps twice a month, but I'll need the regex. Headbomb {t · c · p · b} 17:03, 25 June 2023 (UTC)
- If you need this to be accurate, a regex won't suffice. You can't parse html (or wikitext) with regex. What you want to do is run the articletext through mwparserfromhell, which can give you a list of templates and each of their parameters. I'm not familiar enough with xml parsing (to get the articletext out of the dumps) in python to try this myself, though. —Cryptic 17:35, 25 June 2023 (UTC)
- I need this to be accurate ish. There can be some false positives, and tiny mismatches, or maybe the whitespace matching has to be exact instead of loose. The goal is to unleash WP:Citation bot on these articles to trigger cleanup, and tackle the rest semi-automatically. Headbomb {t · c · p · b} 17:47, 25 June 2023 (UTC)
- A very rough regexp which is far from bulletproof but may be good enough is
\{\{[Cc]it(?:ation|e )[^}]*\|\s*encyclopedia\s*=\s*((?:[^][|}]|\[\[[^]|]+(?:\|[^]]*)?\]\])+)\s*(?:\|[^}]*)?\|\s*publisher\s*=\s*\1\s*[|}]
. You may also want to repeat it with encyclopedia and publisher swapped. Certes (talk) 17:45, 25 June 2023 (UTC)- Whatever you're grepping with needs to be able to deal with newlines, which citation template invocations often have. Regex matching in most any programming language can do this, but it keeps you from just piping the raw dumps through bzip2 -cd | grep. —Cryptic 17:55, 25 June 2023 (UTC)
- I settled on
\{\{[Cc]it(?:ation|e )[^}]*\|\s*(encyclopedia|journal|magazine|newspaper|series|title|website|work|publisher)\s*=\s*((?:[^][|}]|\[\[[^]|]+(?:\|[^]]*)?\]\])+)\s*(?:\|[^}]*)?\|\s*(encyclopedia|journal|magazine|newspaper|series|title|website|work|publisher)\s*=\s*\1\s*[|}]
- this way it catches every permutation of x + publisher and publisher + x.
- Open to refinements if you have ideas. Headbomb {t · c · p · b} 04:06, 26 June 2023 (UTC)
- The backreference is pointing at the (encyclopedia|journal|etc|etc) group now instead of the repeated parameter value; change it to \2 or make the group non-capturing with (?:encyclopedia|and|so|on).This won't match if there's more than one unrelated parameter between the first encyclopedia/etc and the second. To fix that, change
(?:\|[^}]*)?
before the second encyclopedia|etc group to(?:\|[^}]*)*
.You're probably already aware this'll match {{citation|title=Foo|series=Foo}}, i.e. repetition of any two of your left-hand column, without necessarily having a matching publisher parameter.Be aware this won't match {{citation|newspaper=[[The New York Times|New York Times]]|publisher=New York Times}}, nor anything that uses another template (or single close-brace) in between the "citation|cite whatever" and the second encyclopedia/publisher/whatever parameter. Worse (though less likely to be encountered), if one of citation's params is a template invocation that itself has an "encyclopedia/whatever=" param in that, the regex will treat it like it was the outer citation template's. Proper parsing fixes the first problem cheaply, and the second and third for free. —Cryptic 07:14, 26 June 2023 (UTC)- Yes I meant /2, I just forgot to update that above. I'm aware of the limitations of this regex, but it's good enough to give a starting list for the bots. The real solution would likely be a CS1 error/maintenance message. Good tweak for
(?:\|[^}]*)?
→(?:\|[^}]*)*
though, I wouldn't have thought of that right away. Headbomb {t · c · p · b} 10:14, 26 June 2023 (UTC)- Either ? or * should work here:
(?:\|[^}]*)?
also matches "|this|that". I suggested ? as it looked marginally faster, though I didn't test that. Certes (talk) 11:33, 26 June 2023 (UTC)- You're right on both counts. I was reading it as (?:\|[^|}]*). Even though I retyped it twice. ... Don't look at me like that, it was 3 am here. —Cryptic 01:27, 27 June 2023 (UTC)
- Well, I just tested my regex with the database scanner and AWB just chokes on it, and gets stuck at 0.005% finding no hits. "(encyclopedia|journal|magazine|newspaper|series|title|website|work|publisher)" might be too much for it. Headbomb {t · c · p · b} 11:41, 26 June 2023 (UTC)
- That's less likely to be the problem than matching the arguments and skipping unrelated parameters, particularly with the pessimization I introduced. Rather than try to optimize the regex - which will be complex and error-prone, and still won't ever be capable of fully handling the task - if I take the time to do this semi-properly by muddling through a python script, will you be able to run it locally? —Cryptic 16:17, 27 June 2023 (UTC)
- The core idea is to get the list of articles. After that, it's just fed to Citation Bot which should parse things intelligently and do it's magic. Headbomb {t · c · p · b} 16:58, 27 June 2023 (UTC)
- That's... not what I asked. If I give you the script, will you be able to run it yourself? Or am I going to have to babysit it through 6,928,710 pages, less however many were created since the last dump, when it currently only goes through about 6200 per minute? (I can improve on that, but not, realistically, by more than about 2x.) —Cryptic 21:40, 28 June 2023 (UTC)
- Sorry I misread. Possibly? Probably? Do I just run it by compiling/double-clicking on it? Headbomb {t · c · p · b} 22:11, 30 June 2023 (UTC)
- Do you have an environment where you can run
python myprogram < mydumpfile > myoutput
or the equivalent in your operating system? Certes (talk) 23:14, 30 June 2023 (UTC) - If you're on linux or I think a mac, then you should be able to run it from a console. If on windows (version?), if you have to ask then no; but I'm certain I'll be able to build you an executable. Once I figure out how. I don't work with Python very often. —Cryptic 23:18, 30 June 2023 (UTC)
- I'm on windows. I use IDLE to edit/compile/run python stuff. But if you tell me how to run the script from some command line interface, I should be able to manage. Headbomb {t · c · p · b} 23:46, 30 June 2023 (UTC)
- Well then. User:Cryptic/dup-publisher.py (probably easiest to save it from raw mode, here). Run with
python3 dup-publisher.py --help
to see the options. You'll have to install mwparserfromhell if you haven't already; from a command line,python3 -m pip install mwparserfromhell
. —Cryptic 00:06, 1 July 2023 (UTC)
- Well then. User:Cryptic/dup-publisher.py (probably easiest to save it from raw mode, here). Run with
- I'm on windows. I use IDLE to edit/compile/run python stuff. But if you tell me how to run the script from some command line interface, I should be able to manage. Headbomb {t · c · p · b} 23:46, 30 June 2023 (UTC)
- Do you have an environment where you can run
- Sorry I misread. Possibly? Probably? Do I just run it by compiling/double-clicking on it? Headbomb {t · c · p · b} 22:11, 30 June 2023 (UTC)
- That's... not what I asked. If I give you the script, will you be able to run it yourself? Or am I going to have to babysit it through 6,928,710 pages, less however many were created since the last dump, when it currently only goes through about 6200 per minute? (I can improve on that, but not, realistically, by more than about 2x.) —Cryptic 21:40, 28 June 2023 (UTC)
- The core idea is to get the list of articles. After that, it's just fed to Citation Bot which should parse things intelligently and do it's magic. Headbomb {t · c · p · b} 16:58, 27 June 2023 (UTC)
- That's less likely to be the problem than matching the arguments and skipping unrelated parameters, particularly with the pessimization I introduced. Rather than try to optimize the regex - which will be complex and error-prone, and still won't ever be capable of fully handling the task - if I take the time to do this semi-properly by muddling through a python script, will you be able to run it locally? —Cryptic 16:17, 27 June 2023 (UTC)
- Either ? or * should work here:
- Yes I meant /2, I just forgot to update that above. I'm aware of the limitations of this regex, but it's good enough to give a starting list for the bots. The real solution would likely be a CS1 error/maintenance message. Good tweak for
- The backreference is pointing at the (encyclopedia|journal|etc|etc) group now instead of the repeated parameter value; change it to \2 or make the group non-capturing with (?:encyclopedia|and|so|on).This won't match if there's more than one unrelated parameter between the first encyclopedia/etc and the second. To fix that, change
- I settled on
- Whatever you're grepping with needs to be able to deal with newlines, which citation template invocations often have. Regex matching in most any programming language can do this, but it keeps you from just piping the raw dumps through bzip2 -cd | grep. —Cryptic 17:55, 25 June 2023 (UTC)
- If you need this to be accurate, a regex won't suffice. You can't parse html (or wikitext) with regex. What you want to do is run the articletext through mwparserfromhell, which can give you a list of templates and each of their parameters. I'm not familiar enough with xml parsing (to get the articletext out of the dumps) in python to try this myself, though. —Cryptic 17:35, 25 June 2023 (UTC)
- I download the dumps twice a month, but I'll need the regex. Headbomb {t · c · p · b} 17:03, 25 June 2023 (UTC)
- Yes, as can grep, perl, etc., but I'd need to download a multi-gigabyte dump file and keep it up to date. I was thinking of somewhere that I can send the search regexp and receive a results set measured in kB or low MB, either titles with captures or at worst the full wikitext of a small number of articles. Certes (talk) 17:00, 25 June 2023 (UTC)
- AWB's offline database scanner can do this, I think. –Novem Linguae (talk) 16:49, 25 June 2023 (UTC)
- Is there a tool which can run any grep-like search (preferably PCRE) on a wikitext dump (article namespace would suffice)? If not then this seems like an easy win to solve this sort of problem. Runs could take a while, but then so do some Quarry queries. Certes (talk) 16:43, 25 June 2023 (UTC)
sample output from enwiki-20230620-pages-articles16.xml-p20460153p20570392.bz2 - 25635 pages, 68 matches, 4:08 runtime
|
---|
J. M. G. Le Clézio bibliography {{cite web}}:publisher,title="THE NOBEL FOUNDATION 2008" Hot Issue (EP) {{cite web}}:publisher,work="Asianpopcorn" Tokyo Day Trip {{cite web}}:publisher,website="Nonesuch Records." Per- and polyfluoroalkyl substances {{cite web}}:publisher,website="National Collaborating Centre for Environmental Health" Jake Parker {{cite news}}:publisher,work="NPR" German University of Technology in Oman {{cite web}}:publisher,title="ACQUIN" Where the City Meets the Sea {{cite web}}:publisher,title="iTunes Store" Khurshed Makhmudov {{cite web}}:publisher,website="today.tj" Caritas Christi Health Care {{cite web}}:publisher,title="Caritas Christi Health Care" Audi R15 TDI {{cite web}}:publisher,title="audir15tdi.com" SS Automedon {{cite book}}:publisher,title="Lloyd's Register" Grace Hospital (Winnipeg) {{cite web}}:publisher,title="Grace Hospital" History of the Jews in Mumbai {{cite web}}:publisher,title="JTA" Hlaing Township {{cite web}}:publisher,website="International School of Myanmar" Glen Coffee {{Cite news}}:publisher,work="SaturdayDownSouth" Doug Yates {{cite web}}:publisher,website="motorsport.com" Maryse Joissains {{cite web}}:publisher,title="Maryse Joissains-Masini" List of parishes in the Diocese of Salford {{cite web}}:publisher,title="Holy Souls Parish" North Country Supervisory Union {{Cite news}}:publisher,work="The Chronicle" Irish Baroque Orchestra {{cite web}}:publisher,title="Temple Bar Cultural Trust" Riverfront Park (Spokane, Washington) {{cite web}}:publisher,title="North Idaho Centennial Trail Foundation" Rainbow Theatre {{cite web}}:publisher,website="Historic England" World War I in popular culture {{Cite web}}:publisher,title="H-Nationalism" Scremerston {{cite web}}:publisher,website="Archeological Research Services" Sport in Lima {{cite web}}:publisher,website="World Baseball Softball Confederation" Svend Bayer {{cite web}}:publisher,title="Pucker Gallery" Supreeme {{cite web}}:publisher,title="Record Collection" Mumford & Sons {{cite web}}:publisher,website="Never Enough Notes" Hecht Company Warehouse {{cite web}}:publisher,work="Douglas Development Corporation" God's Squad {{cite web}}:publisher,work="CathNews" LGBT rights in the 19th century {{cite web}}:publisher,title="Sodomylaws.org" John Najarian {{cite web}}:publisher,website="Mprnews" Tame Impala {{cite web}}:publisher,title="Dave Fridmann" Escot, Talaton {{cite news}}:publisher,work="The Telegraph" Yasser Khalil {{cite web}}:publisher,title="Yasser Khalil" American Artists' Congress {{cite web}}:publisher,website="The Annex Galleries" Castle (TV series) {{cite web}}:publisher,title="ABC Medianet" Agnete Saba {{cite web}}:publisher,website="vg.no" Mark Simmonds (saxophonist) {{cite web}}:publisher,work="Australian Rock Database" Frederikssund {{cite web}}:publisher,title="vikingespil.dk" Air sock {{cite web}}:publisher,website="Prihoda UK" Halfway Bush {{cite web}}:publisher,title="Halfway Bush School" Baháʼí Faith in Dominica {{cite web}}:publisher,work="thearda.com" Middle Cyclone {{cite web}}:publisher,work="Stereogum" William Wellington House {{cite web}}:publisher,work="Kirkbride Buildings" Alfred Sankoh {{cite web}}:publisher,website="dt.no" Hadleigh Farm {{cite web}}:publisher,title="Hadleigh Mountain Bike Club" Drumbeg, Sutherland {{cite web}}:publisher,website="The Gazetteer for Scotland" Tetra Tech {{cite web}}:publisher,website="Engineer News-Record" Fort Nonsense (Annapolis, Maryland) {{cite web}}:publisher,title="Maryland Historical Trust" Funhouse (song) {{cite web}}:publisher,title="Nielsen Music Poland" Denis Haines {{cite web}}:publisher,website="Classic Rock Radio" Government Office for Science {{cite news}}:publisher,title="Government Office for Science" Yazidis {{cite web}}:publisher,work="Statistics of Russia" Reactions to the September 11 attacks {{cite web}}:publisher,website="georgewbush-whitehouse.archives.gov" New Eyes for the Needy {{cite web}}:publisher,title="New Eyes for the Needy" Advertisements for Myself {{cite web}}:publisher,website="Pbs.org" Medal of Honor Aircraft {{cite web}}:publisher,website="Airport-Data.com" Jonathan Leavitt (publisher) {{cite journal}}:publisher,journal="[[American Society for Eighteenth-Century Studies]]" T-Mobile US {{cite web}}:publisher,title="BillShrink.com" Cannabis in New Zealand {{cite web}}:publisher,website="Say Nope To Dope" Nashville Charter Amendment 1 {{Cite web}}:publisher,title="Nashville English First" List of windmills in Leicestershire {{cite web}}:publisher,title="Redmile Archive" Mattapany-Sewall Archeological Site {{cite web}}:publisher,title="Maryland Historical Trust" GCE Advanced Level (United Kingdom) {{cite web}}:publisher,website="UK Parliament" Jared Bernstein {{cite web}}:publisher,title="Cnbc.com" Jeff Glixman {{cite web}}:publisher,title="Starcityrecording.com" Yang Kaiqi {{cite web}}:publisher,title="Chessinchina.Net" |
- You can avoid the download with a toolforge account, at least; the dumps are in /public/dumps/public/enwiki/latest/. —Cryptic 17:08, 25 June 2023 (UTC)
List of usernames with a "creation date" column
To add a "creation date" column to a list of usernames, I need to get the creation date from logging WHERE log_type = 'newusers', right? What's the best way to incorporate it into this query? LEFT JOIN? Subquery? Thanks for the help. –Novem Linguae (talk) 03:35, 3 July 2023 (UTC)
- user has a user_registration column, which is accurate AFAIK; that'll be easiest by far.If you really needed to pull it from the logs, the least worst way is to left join logging_logindex on log_namespace=2 and log_title=replace(user_name, ' ', '_') and log_type='newusers'. That's going to be problematic - the user might have been renamed, for example - and if you try joining on log_actor=actor_id instead, you run up against account creators creating other people's accounts, etc. Plus, the newusers log only goes back to September 2005. At least you (usually) won't have to worry about more than one matching row. —Cryptic 03:50, 3 July 2023 (UTC)
- Perfect, thank you. My brain completely missed that "user_registration" field when I was reading the documentation. I was looking for the word "date" in the field name, I guess. –Novem Linguae (talk) 03:59, 3 July 2023 (UTC)
Edit summary search
I 'd like to find all edits made by User:Qwerfjkl (bot) where the edit summary contains ([[Wikipedia:Bots/Requests for approval/Qwerfjkl (bot) 19|Task 19)]]
[sic] and the edit isn't a page creation. Then I'd like just the revid and page title. — Qwerfjkltalk 15:20, 4 July 2023 (UTC)
- @Qwerfjkl: Does Edit summary search provide what you need here? Certes (talk) 15:57, 4 July 2023 (UTC)
- @Certes, no, unfortunately. — Qwerfjkltalk 15:57, 4 July 2023 (UTC)
- @Qwerfjkl: Is quarry:query/74955 what you need? I added the namespace but they all seem to be Talk:. Certes (talk) 16:13, 4 July 2023 (UTC)
- @Certes, yes, that's perfect, thanks a lot. — Qwerfjkltalk 17:41, 4 July 2023 (UTC)
Spelled out Latin characters in Chemistry articles
Hi, I need a query that lists all non-redirecting mainspace content with one or more of "Alpha", "Beta", etc. (case insensitive) in the title, and is a member of Category:Chemistry to any depth. —CX Zoom[he/him] (let's talk • {C•X}) 06:46, 9 July 2023 (UTC)
- "member of Category:Chemistry to any depth" isn't discriminatory. It includes almost all categories eventually - at least 1957095 (where I stopped recursing) out of 2274095. Pick something more reasonable, please. —Cryptic 07:56, 9 July 2023 (UTC)
- Also, there's a whole lot of page titles containing the shorter greek letter names. For example, "pi" (not even case-insensitive) turns up just short of a half million. —Cryptic 08:02, 9 July 2023 (UTC)
- Even at a shallow depth of 5 (13543 categories), and restricting the letter names to distinct words, you get a lot of noise. For example, Yang Xi (volleyball) is in Category:Chemistry>Category:Chemical elements>Category:Gold>Category:Gold medalists at international sports competitions>Category:Asian Games gold medalists>Category:Asian Games gold medalists for China, and Ho Chi Minh is in Category:Chemistry>Category:Nuclear chemistry>Category:Radioactivity>Category:Nuclear warfare>Category:Cold War>Category:People of the Cold War. —Cryptic 08:14, 9 July 2023 (UTC)
- @Cryptic: Thank you very much for pointing this. How about existence of {{chembox}} in the article or existence of {{WikiProject Chemistry}} in the article's talk page. It would be great if there could be a column, which notes which of the two criteria (or both) is met, though this part is not absolutely necessary. Thanks! —CX Zoom[he/him] (let's talk • {C•X}) 08:52, 9 July 2023 (UTC)
- (edit conflict) Meh, quarry:query/75085 has matches to a depth of 7. I don't know that Category:Set index articles on ships, Category:Holography in television, or Category:Tofu dishes are going to be terribly relevant to you, but at least there's relatively few page title matches in them once they're restricted to distinct words.Matching by templates (either on the article or on its talk) is doable and will narrow things down much more quickly (and accurately!) than the category tree. I'll do them for you in a couple hours if nobody beats me to it. —Cryptic 08:57, 9 July 2023 (UTC)
- No results for pages transcluding {{Chembox}}, and only 4 for pages whose talk transcludes {{WikiProject Chemistry}}. These both still restrict the letter names to distinct words, but I can't imagine relaxing that would get any more good results. —Cryptic 10:27, 9 July 2023 (UTC)
- That's weird because at least beta-Hydroxy beta-methylbutyric acid exists. —CX Zoom[he/him] (let's talk • {C•X}) 10:43, 9 July 2023 (UTC)
- Oh wait, it is {{WikiProject Chemicals}} not Chemistry and uses {{Infobox drug}}. My bad. —CX Zoom[he/him] (let's talk • {C•X}) 10:44, 9 July 2023 (UTC)
- quarry:query/75087 now has pages transcluding either chembox or infobox drug, and quarry:query/75088 pages whose talk transcludes either of the wikiproject templates. Both, and quarry:query/75085 for the cat tree results, correctly match non-space word boundaries now, too, so should've had a lot more results even before adding the alternate templates. —Cryptic 11:00, 9 July 2023 (UTC)
- Thank you very much. These are what I needed. —CX Zoom[he/him] (let's talk • {C•X}) 11:07, 9 July 2023 (UTC)
- quarry:query/75087 now has pages transcluding either chembox or infobox drug, and quarry:query/75088 pages whose talk transcludes either of the wikiproject templates. Both, and quarry:query/75085 for the cat tree results, correctly match non-space word boundaries now, too, so should've had a lot more results even before adding the alternate templates. —Cryptic 11:00, 9 July 2023 (UTC)
- Oh wait, it is {{WikiProject Chemicals}} not Chemistry and uses {{Infobox drug}}. My bad. —CX Zoom[he/him] (let's talk • {C•X}) 10:44, 9 July 2023 (UTC)
- That's weird because at least beta-Hydroxy beta-methylbutyric acid exists. —CX Zoom[he/him] (let's talk • {C•X}) 10:43, 9 July 2023 (UTC)
- No results for pages transcluding {{Chembox}}, and only 4 for pages whose talk transcludes {{WikiProject Chemistry}}. These both still restrict the letter names to distinct words, but I can't imagine relaxing that would get any more good results. —Cryptic 10:27, 9 July 2023 (UTC)
- (edit conflict) Meh, quarry:query/75085 has matches to a depth of 7. I don't know that Category:Set index articles on ships, Category:Holography in television, or Category:Tofu dishes are going to be terribly relevant to you, but at least there's relatively few page title matches in them once they're restricted to distinct words.Matching by templates (either on the article or on its talk) is doable and will narrow things down much more quickly (and accurately!) than the category tree. I'll do them for you in a couple hours if nobody beats me to it. —Cryptic 08:57, 9 July 2023 (UTC)
- @Cryptic: Thank you very much for pointing this. How about existence of {{chembox}} in the article or existence of {{WikiProject Chemistry}} in the article's talk page. It would be great if there could be a column, which notes which of the two criteria (or both) is met, though this part is not absolutely necessary. Thanks! —CX Zoom[he/him] (let's talk • {C•X}) 08:52, 9 July 2023 (UTC)
American football players
If someone has the time, I would appreciate a query of Category:Players of American football for entries with less than 2,500 bytes and which is either unsourced or sourced only to databases (e.g. "pro-football-reference.com", "sports-reference.com", "profootballarchives.com", "espn.com" "footballdb.com", or "nfl.com"). Cbl62 (talk) 17:28, 11 July 2023 (UTC)
- The glib answer is "just the subcats", since there's only two non-category pages in that category and both have external links to other sites.For non-redirect, mainspace pages anywhere in that category tree, quarry:query/75148. This will have false positives, since the closest thing to a "reference" that can be queried for is an external link (example: Angus Daniel McDonald). If you want to exclude databases other than your six examples, you're going to have to list them. —Cryptic 23:57, 11 July 2023 (UTC)
- You can be a little more selective about references; if you look at Quarry:query/73984 it looks for all templates under Category:Citation templates and assumes, with a few exceptions, that the presence of those templates mean a reference is included.
- I don't have time now, but I'll try to write a query incorporating that later. BilledMammal (talk) 00:49, 12 July 2023 (UTC)
- I don't see how that could help except for the fifty-odd pages with no external links at all. I mean, you could make assumptions like "{{cite book}} is never going to point to nfl.com", but people misuse those templates all the time. —Cryptic 02:21, 12 July 2023 (UTC)
ReferenceExpander QueryReduxer
Hi! The five of us still toiling away at the cleanup enabled by query 72741 (ReferenceExpander edits from 1 January 2023 to 2 April 2023) are nearing-ish the completion of the first phase of our task. Fewer than 350 edits remain to be checked. We already have a second phase prepared (based on query 72745), but we are going to need results for two more date ranges.
Given the above, I'd like to request a query similar to 72741 and 72745, with the following alterations:
- page_namespace == 0
- timestamp is either between 2020-05-13 and 2022-06-30 OR after 2023-04-01 (this is possible in SQL, right?)
- comment_text matches
using [[en:w:User:BrandonXLF/ReferenceExpander
(need a more specific match than before, since editors active in the cleanup project have been mentioningReferenceExpander
in edit summaries, which will create a very high false positive rate for the post 2023-04-01 period) - retain rev_id as sort key instead of calculated key named "delta"
I don't remember enough SQL to do this, especially regarding string matching and syntax for escaping special characters. Thanks! Folly Mox (talk) 20:15, 14 July 2023 (UTC)
- Oh right I hadn't thought of the computational load of running regexes against every edit summary of every mainspace diff for approximately three years, so if it makes more sense to break this up chronologically into eras, we can definitely work with those results. Folly Mox (talk) 20:52, 14 July 2023 (UTC)
- (Be aware that the location of a page at the time an edit is made is discarded after a revision falls off of Special:Recentchanges in 30 days. You're going to be getting results based on what namespace the page it was made to is in now.)quarry:query/75208 will probably time out, but might not; we'll know in an hour or so. If it does, I'll split it up into shorter date ranges like quarry:query/72740 does.The problem isn't checking the edit summaries - that's fast, once it figures out which edit summaries to check. It's that the query optimizer estimates that it'll get so many results if it uses the best index (rev_timestamp) first, that it thinks it'll be better off looking at every page currently in the main namespace, fetching its revisions, and then filtering by timestamp. That's a lot better than looking at every revision ever like it was doing before - ~29 296 370 results in the first step instead of ~1 157 121 570 - but still probably too many.Normally we could explicitly say which index to use, but the way the security is set up on the toolforge replicas so that every Quarry user can't see deleted or oversighted revisions (or, say, users' email addresses or their password hashes) makes that impossible. Best we can do is tell it which order to look at the tables, but then it falls back on the even-worse plan with 1.1 billion results.Oh, and the columns in Quarry result pages are sortable by clicking at the top, if you don't like the default. Just so long as the data you want to sort on already has a column to itself. —Cryptic 03:00, 15 July 2023 (UTC)
- To my surprise - I'd just started a query with split date ranges - it did finish. No results earlier than 2022-04-18, though, since that's when the script's edit summary changed to your search string above. Not the only time it's changed, either (the preceding revision, for example, and the first revision was just "Expanding bare references."). I'll take a closer look later and account for that. —Cryptic 04:10, 15 July 2023 (UTC)
- quarry:query/75209. Some of the edits on the first page could conceivably be false positives, since the edit summary before 2020-06-10 was just "Expanding bare references.". —Cryptic 05:34, 15 July 2023 (UTC)
- Thank you very much, User:Cryptic! I think the cleanup crew's self-imposed responsibility for due diligence really only applies to articles that are currently in mainspace, so the current namespace is the one we want. Apologies for not thinking of the idea that the default edit summary may have changed, which certainly caused a not insignificant amount of wasted computational load.All those bits of info about how the queries work and self-optimise are helpful as well. I'm grateful to see zero matches in the second date range, after the alarm was raised at AN. 2100 rows total. Well, the good news is that's about how many we've repaired, cleaned up, or checked in the past three and a half months, and since the edits are older there's a greater chance they'll have been cleaned up in the course of normal editing. I believe we can probably finish this within the calendar year. Anyway, thanks again! Folly Mox (talk) 13:58, 15 July 2023 (UTC)
- The first query shouldn't have any results that aren't also in the second, so "just" 1867. Not sure what you mean about no results in the second range; I see about a hundred, visible as the last page and part of the second-last in both queries. —Cryptic 14:56, 15 July 2023 (UTC)
- Thank you very much, User:Cryptic! I think the cleanup crew's self-imposed responsibility for due diligence really only applies to articles that are currently in mainspace, so the current namespace is the one we want. Apologies for not thinking of the idea that the default edit summary may have changed, which certainly caused a not insignificant amount of wasted computational load.All those bits of info about how the queries work and self-optimise are helpful as well. I'm grateful to see zero matches in the second date range, after the alarm was raised at AN. 2100 rows total. Well, the good news is that's about how many we've repaired, cleaned up, or checked in the past three and a half months, and since the edits are older there's a greater chance they'll have been cleaned up in the course of normal editing. I believe we can probably finish this within the calendar year. Anyway, thanks again! Folly Mox (talk) 13:58, 15 July 2023 (UTC)
- quarry:query/75209. Some of the edits on the first page could conceivably be false positives, since the edit summary before 2020-06-10 was just "Expanding bare references.". —Cryptic 05:34, 15 July 2023 (UTC)
- To my surprise - I'd just started a query with split date ranges - it did finish. No results earlier than 2022-04-18, though, since that's when the script's edit summary changed to your search string above. Not the only time it's changed, either (the preceding revision, for example, and the first revision was just "Expanding bare references."). I'll take a closer look later and account for that. —Cryptic 04:10, 15 July 2023 (UTC)
- (Be aware that the location of a page at the time an edit is made is discarded after a revision falls off of Special:Recentchanges in 30 days. You're going to be getting results based on what namespace the page it was made to is in now.)quarry:query/75208 will probably time out, but might not; we'll know in an hour or so. If it does, I'll split it up into shorter date ranges like quarry:query/72740 does.The problem isn't checking the edit summaries - that's fast, once it figures out which edit summaries to check. It's that the query optimizer estimates that it'll get so many results if it uses the best index (rev_timestamp) first, that it thinks it'll be better off looking at every page currently in the main namespace, fetching its revisions, and then filtering by timestamp. That's a lot better than looking at every revision ever like it was doing before - ~29 296 370 results in the first step instead of ~1 157 121 570 - but still probably too many.Normally we could explicitly say which index to use, but the way the security is set up on the toolforge replicas so that every Quarry user can't see deleted or oversighted revisions (or, say, users' email addresses or their password hashes) makes that impossible. Best we can do is tell it which order to look at the tables, but then it falls back on the even-worse plan with 1.1 billion results.Oh, and the columns in Quarry result pages are sortable by clicking at the top, if you don't like the default. Just so long as the data you want to sort on already has a column to itself. —Cryptic 03:00, 15 July 2023 (UTC)
Refined request
The results of my prior request were quite helpful in allowing the creation of Wikipedia:WikiProject National Football League/Football biography cleanup, a promising effort to improve the quality of sub-stub articles on gridiron football players. In an effort to ensure its completeness, would someone we willing to create a simpler list from Category:Gridiron football players and its subcategories for articles with less than 1,500 total bytes? Cbl62 (talk) 20:48, 23 July 2023 (UTC)
- Do you still just want articles without any external links except in the list from before? If so, quarry:query/75399. —Cryptic 23:31, 23 July 2023 (UTC)
- Thanks, but no, I'm trying not to limit it. I am assuming that at < 1500 characters, there's not going to much sourcing, so no need to included any parameter about sources. Cbl62 (talk) 04:44, 24 July 2023 (UTC)
- OK, same link. —Cryptic 06:03, 24 July 2023 (UTC)
- Thank you, Cryptic. Cbl62 (talk) 17:07, 24 July 2023 (UTC)
- OK, same link. —Cryptic 06:03, 24 July 2023 (UTC)
- Thanks, but no, I'm trying not to limit it. I am assuming that at < 1500 characters, there's not going to much sourcing, so no need to included any parameter about sources. Cbl62 (talk) 04:44, 24 July 2023 (UTC)
Wikitext searching query
Can someone generate a list of all pages with this exact wikitext in it? WHILE WITH NO ACTION: -->[[Category:Wikipedia usernames with possible policy issues|
I need it for an approved AWB bot task. Traditional search option doesn't seem to work, thanks to the -->
part. Thanks! —CX Zoom[he/him] (let's talk • {C•X}) 06:01, 27 July 2023 (UTC)
- This is not possible in Quarry because it has no access to wikitext. However, I don't think the --> stops search from working. This search finds what you seek plus a lot of false positives. Do you have an example of a page that actually matches your string, so we can test with that? Certes (talk) 09:53, 27 July 2023 (UTC)
- I do not know if any such page exists currently. If they do, they would need fixing. Any substitutions of {{uw-username}} between 21 Jan 2014 & 17 Sept 2015 should have the issue that needs fixing. Apparently, they have been removed by bot (example), but I was looking for any stray cases left behind. —CX Zoom[he/him] (let's talk • {C•X}) 10:18, 27 July 2023 (UTC)
- @CX Zoom: This search finds my sandbox and nothing else, so there are probably no problems outstanding, unless they are formatted differently. (I checked both with the category as shown above and on its own line.) Certes (talk) 13:05, 27 July 2023 (UTC)
- Thank you very much! I assume the bot had taken care of it all. —CX Zoom[he/him] (let's talk • {C•X}) 13:30, 27 July 2023 (UTC)
- @CX Zoom: This search finds my sandbox and nothing else, so there are probably no problems outstanding, unless they are formatted differently. (I checked both with the category as shown above and on its own line.) Certes (talk) 13:05, 27 July 2023 (UTC)
- I do not know if any such page exists currently. If they do, they would need fixing. Any substitutions of {{uw-username}} between 21 Jan 2014 & 17 Sept 2015 should have the issue that needs fixing. Apparently, they have been removed by bot (example), but I was looking for any stray cases left behind. —CX Zoom[he/him] (let's talk • {C•X}) 10:18, 27 July 2023 (UTC)
Articles not edited by Citation bot
Could I get, based on this list, the subset of articles that were edited by Citation bot from July 20 of this year onwards. Or alternatively, that were not edited by Citation bot from July 20 of this year onwards?
Headbomb {t · c · p · b} 20:56, 27 July 2023 (UTC)
- We can do that, but it'd be a lot easier if it were based on (links from) the current revision of a page. Otherwise, we'd have to paste all of those article names into the query. —Cryptic 21:50, 27 July 2023 (UTC)
- Restored to live version. Headbomb {t · c · p · b} 00:25, 28 July 2023 (UTC)
- quarry:query/75475. (I saw Citation bot had edit summaries referencing the page, so didn't want to edit it myself. ...Didn't occur to me until now that I could paste them into a different sandbox.) —Cryptic 00:52, 28 July 2023 (UTC)
- Restored to live version. Headbomb {t · c · p · b} 00:25, 28 July 2023 (UTC)
MOS:BOLDLINK Issues
I am interested to see how many articles run counter to MOS:BOLDLINK within Category:Green Bay Packers, which in my experience is very rampant across the project. I would expect a query to look like:
- Grab the first sentence of the article
- Is there bolding in the first sentence?
- If no, disregard
- If yes, is there linking within the bolding (i.e.
'''[[Wikilink]]'''
)- If no, disregard
- If yes, then this is our list of offending articles
Thank you « Gonzo fan2007 (talk) @ 18:54, 28 July 2023 (UTC)
- Not possible with Quarry, which doesn't have access to wikitext, but a bot or tool may be able to help. Certes (talk) 21:12, 28 July 2023 (UTC)
- Even with wikitext access, "grab the first sentence" is a lot harder of a problem than you'd think. I wouldn't care to attempt it in sql at all; if I were doing it with a bot or partial dump, I think I'd still be better off parsing the rendered html instead of the wikitext source, instead of trying to make lists of templates which render as infoboxes or hatnotes or shortdescs or images or what have you as opposed to article text. Not to mention the all the uncommon ways besides triple apostrophes and double square brackets to bold and link text. And it still might be faster to just look at all the pages in the category tree manually - there's only 2655 - than to program something that can do this reliably. We can give you a list of articles in the category tree (here you go) if you're so inclined. —Cryptic 23:35, 28 July 2023 (UTC)
- A search like this one may be somewhat helpful. – Jonesey95 (talk) 03:33, 29 July 2023 (UTC)
- Even with wikitext access, "grab the first sentence" is a lot harder of a problem than you'd think. I wouldn't care to attempt it in sql at all; if I were doing it with a bot or partial dump, I think I'd still be better off parsing the rendered html instead of the wikitext source, instead of trying to make lists of templates which render as infoboxes or hatnotes or shortdescs or images or what have you as opposed to article text. Not to mention the all the uncommon ways besides triple apostrophes and double square brackets to bold and link text. And it still might be faster to just look at all the pages in the category tree manually - there's only 2655 - than to program something that can do this reliably. We can give you a list of articles in the category tree (here you go) if you're so inclined. —Cryptic 23:35, 28 July 2023 (UTC)
Contributions by size
Could someone create a query that shows a user's edits ordered by the size of content they add? TIA. ––FormalDude (talk) 22:58, 3 August 2023 (UTC)
- I've done this before, at quarry:query/61982. It considers size the same way the history page does, e.g. if you add 5000 bytes to one part of a page and remove 4000 from another part in the same edit, that's +1000, not Δ9000 - the only data we can see about content size is the number of bytes in a given revision, so if you're really after the 9000 or 5000 numbers, you're out of luck. The query should still work; instructions to fork and rerun a query are here. How to change whose edits are looked at should be obvious; to show all edits instead of just those with a positive change of 5000 or more, remove the second-last line (mentioning COALESCE and >= 5000). —Cryptic 00:42, 4 August 2023 (UTC)
- Thanks Cryptic! ––FormalDude (talk) 04:06, 4 August 2023 (UTC)
Most used templates
Is it possible to find the top 100 most used templates? I need it for bnwiki. Thanks. আফতাবুজ্জামান (talk) 13:54, 17 August 2023 (UTC)
- Special:MostTranscludedPages should work everywhere. * Pppery * it has begun... 13:57, 17 August 2023 (UTC)
- I don't know how i forgot that 🤦. Thanks. আফতাবুজ্জামান (talk) 14:01, 17 August 2023 (UTC)
- So did I... আফতাবুজ্জামান, if you want one that is easier to extract, query/75912. I haven't verified the results, because I don't speak Hindi, but they should be right. BilledMammal (talk) 14:06, 17 August 2023 (UTC)
- The Special: page includes pages transcluded from any namespace, not just Template:, but it's only infrequently updated (twice a year as of 2014). Both it and BilledMammal's query can only see a single usage of a given template on each page - for instance, {{u}} is currently used dozens of times on Wikipedia:Administrators' noticeboard/Incidents, but that only increases its count by one. There isn't any way to see a real total count, if the distinction is important to you. —Cryptic 16:53, 17 August 2023 (UTC)
Evaluating editor script shyness: dominance of top editors of scripts and other pages, to support a VPT question
In order to support a discussion at WP:VPT regarding the lack of a dedicated "Script:" namespace, I am interested in analyses of unique editors in different namespace pages compared to unique editors in user script pages (likely defined as User subpages having a .js
suffix). Perhaps one or more of these might help:
- avg percent of page bytes contributed by the top (#1) contributor of the page, for 1) templates, 2) modules, 3) drafts, and 4) user scripts; (2-col {ns, avgpct}, 4-row table, as union of 4 queries?).
- average ratio of unique editors contributing 5% or more of page bytes to total unique editors, in 1-4 list;
- top 100 pages by number of edits (restrict by date or whatever is needed so it doesn't run too long) with columns for # of unique editors contributing 5% or more of the bytes in each of 1-4; #4 is the most interesting to know, others are gravy {maybe: ns, pagename, toteds, uqeds, uqedpct=toteds/uqeds, toped, topedpct} sort desc uqedpct, or maybe by topedpct?
Because the goal is to discover "how many are sharing the coding" not the documentation, /doc subpages or other non-code subpages of templates, modules, or scripts (usually just the script page name, minus the .js
suffix) should probably be excluded. Draftspace is kind of a control, because coding isn't involved there. (The common.js files aren't really scripts, and not sure if they need to be excluded to avoid skewing the tallies; there are a lot of them.)
What I'm looking for, is some sort of measure of "editor script shyness", defined as a kind of inverse of "editor collaboration", in an attempt to answer the question, "Do editors tend to avoid collaborating on user script pages because they appear to 'belong' to another user, more so than they would collaborating on modules, templates, or drafts, which have their own namespace and thus don't appear to 'belong' to anybody, and if so, by how much?" A large "editor script shyness quotient" might be an argument in support of creating a "Script" namespace; a low ESSQ would argue against it. Not sure what would be the most useful to help shed light on that question, so feel free to vary the suggestions above or toss them out in favor of whatever seems most useful, or doable. Thanks, Mathglot (talk) 02:20, 21 August 2023 (UTC)