User talk:West.andrew.g/Popular pages/Archive 2
This is an archive of past discussions with User:West.andrew.g. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Weird redirect issue for WP:5000
I've just noticed that WP:5000 (or Wikipedia:5000, which behaves identically) now redirects to Wikipedia:List of Wikipedians by number of edits, but it sure used to redirect to User:West.andrew.g/Popular pages. The strange thing is, that when you follow the redirect, and then go back by clicking on (Redirected from Wikipedia:5000), voila... You see the page http://en.wikipedia.org/w/index.php?title=Wikipedia:5000&redirect=no which says that no, the redirect is actually still pointing to User:West.andrew.g/Popular pages. Could anyone fix this incoherence? Or redirect the problem to a proper place? --Kubanczyk (talk) 19:17, 2 August 2013 (UTC)
- I did actually make a recent edit. Maybe that explains it? Best. Biosthmors (talk) 19:20, 2 August 2013 (UTC)
- Is it working for you now Kubanczyk? Best. Biosthmors (talk) 19:40, 2 August 2013 (UTC)
- Done Works perfectly, muchas gracias. --Kubanczyk (talk) 07:19, 3 August 2013 (UTC)
Questions
Hi, I just had two questions:
- What page/article is "[]" (#2 on your list)?
- Why are so many top articles, not actually articles? They are red links, there are no articles so I don't know how nonexistent pages can be counted (or why they would chart so high).
Thanks for any answers you can provide. I used to track trends on Twitter so I'm always curious to see what topics are drawing interest and I'm glad I came across your page. Newjerseyliz (talk) 14:08, 13 August 2013 (UTC)
- I can partially answer #2: It seems there is a Polish bot that is spamming Wikipedia for views, but it's poorly programmed and all the targets contain typographical errors, so they show up as redlinks on the list. As to what to do about it, no clue, really. Serendipodous 15:05, 13 August 2013 (UTC)
- Sorry for the late response. (1) I would bin this with your second question... (2) Whenever something is requested on the server, and whether it exists or not, this fact is recorded in the statistics aggregation. The "[]" case is likely a syntax error in the placement of wikilinks or their routing. The other "non-pages" exist prominently either due to intentional (i.e., topic spamming) or non-intentional (e.g., a mis-configured content scraper that just keeps re-attempting to download a page) reasons. However, we do assume these "red links" are the result of automated/non-human views. In some cases it can be very tricky to determine whether a popularity spike is driven broadly by society, or just a single person with a bot. It is not difficult to game these trends, though it's questionable if that's a worthy use of a miscreant's time. The weekly WP:Top25Report attempts to sort through this cruft and produce a witty list of what is actually trending. Thanks, West.andrew.g (talk) 14:49, 16 August 2013 (UTC)
- Thank you, Serendipod and West.andrew.g, for your answers to my questions. They were very helpful.
- I imagine with a project that is on the scale of Wikipedia that these charts can only be compiled by bot and that it would be laborious to go through and delete articles that are present due to gaming the system. And, while some of the redlinks and "[ ]" seem like errors, this is not always obvious.
- Final question: Is it a habit to go through this list and see what pages people are looking for which don't yet exist and start a request for them at AfC? I'm not sure how much work this would involve but it could be that readers are looking for pages that don't yet exist and which should be created. Thanks again for your prompt and polite replies! NewJerseyLiz Let's Talk 14:13, 17 August 2013 (UTC)
- There may have once been value in doing that, but now? With the sheer number of redlinks flooding in thanks to that Eastern European automaton, searching for a properly trending redlink is almost impossible. And let's be honest here; people are ignorant. If something is trending on Wikipedia, it's not because it's some hidden secret that the cognoscenti had been trading feverishly amongst themselves but that the rest of the internet has suddenly become aware of- it's because it's something that was already popular and well-known, and if it's popular and well-known, there is probably a Wiki page about it. Not a particularly GOOD one necessarily, but a Wiki page nonetheless. Whenever some obscure artist or small business owner suddenly gets 300,000 views, it's safer to chalk it up to spam than to genuine interest. Serendipodous 14:24, 17 August 2013 (UTC)
- User:Newjerseyliz, I turned a few things blue that were red at WP:Topred this past week. Check that list out. And see the active request at the talk page. Yes there is junk there. But I find it useful! You might find it useful to submit AfC requests, but I don't know that process. User:Serendipodous, that was a bit more unhelpful of a reply than it should have been, in my opinion. So Newjerseyliz, don't mind Serendipodous. But to be fair, I do acknowledge that the red links on the WP:5000 lately rarely if ever need creating. Best wishes! Biosthmors (talk) 15:58, 17 August 2013 (UTC)
- Thanks for that link, Biosthmors. As for Serendipodous, he has helped me before and having trudged through tons of data on Twitter trends, coming up with Top 100 lists for years 2009-2011, I know what it's like to filter out the gold from a lot of mud. It's too bad that spambots clutter up the statistics, I'm not sure what they really get out of this behavior but then, hey, I can live without knowing! Thanks to you both! NewJerseyLiz Let's Talk 17:08, 18 August 2013 (UTC)
- I am sorry for coming across as grouchy; I thought you were referring to the redlinks in the top 25. Those are useless. But yes, I agree there is value in WP:TOPRED, though one would have to be careful to judge whether 1000 views a week constituted a genuine desire on the part of the public. Serendipodous 17:16, 18 August 2013 (UTC)
- And now I doubt I struck the best tone! Thanks for your contributions Serendipodous. Biosthmors (talk) 19:43, 18 August 2013 (UTC)
"The third column is the number of page views."
- "for the week" should be spelled out. I wasn't clear and had to check one. Otherwise very useful, thanks. A calculated column with the daily average would be helpful too, as daily views are the typical measure we are used to seeing. Johnbod (talk) 13:49, 8 September 2013 (UTC)
- If changes should be made, be bold in doing so! While the actual statistics are updated automatically, the header section is a static transclusion and therefore changes made by normal editors will not be overwritten at the next update. Thanks, West.andrew.g (talk) 14:37, 10 September 2013 (UTC)
Article class as a sortable field
Andrew, I was thinking this table would be more useful if people could sort by article class:
- have you thought of moving article class to a dedicated, sortable field? Or is there any reason not to? DarTar (talk) 21:41, 27 September 2013 (UTC)
- I have done this and the report has been regenerated. To maximize utility I gave each classification its own column and re-ordered the legend accordingly/respectively. This is a non-trivial sort operation even on my quite powerful machine. West.andrew.g (talk) 17:00, 27 November 2013 (UTC)
- definitely non-trivial -- it's prone to crashing now on smaller machines, unfortunately. -- phoebe / (talk to me) 19:36, 8 December 2013 (UTC)
- is a version of this table available (on labs?) as a simple API? DarTar (talk) 21:41, 27 September 2013 (UTC)
- There is not, but I champion data sharing if there is a use-case and someone wants to facilitate the labs side of things. West.andrew.g (talk) 17:00, 27 November 2013 (UTC)
- there's a small formatting issue with the stats section at the bottom of the page. DarTar (talk) 21:41, 27 September 2013 (UTC)
- Is something wrong, or is it just not pretty? I see that the dividing "equals" signs are being eaten by section parsing. I've made a minor change to report generation. West.andrew.g (talk) 18:41, 27 November 2013 (UTC)
That would be useful! Additionally, I was trying to figure out why some articles that do have class ratings don't have an icon displayed. -- phoebe / (talk to me) 18:27, 26 November 2013 (UTC)
- Can you provide any quick examples? Maybe I need to update the regexps used to detect these memberships. West.andrew.g (talk) 08:48, 27 November 2013 (UTC)
- Well, I had some in hand when I posted this, but it looks like in this batch all of the unrated ones are redirects or truly unrated articles. So I'm not sure -- either I didn't catch that they were all redirects the first time, or something changed. Thanks again for doing this! cheers, -- phoebe / (talk to me) 19:36, 8 December 2013 (UTC)
holiday in [[Canc�n]]
Curious about the odd character that takes the place of "ú", I looked at one of the logs [1] (large file) and found this (first column is the site, where "en" is the English Wikipedia; second column is the page title; third column the number of requests, and fourth column the bytes served):
en Canc%C3%BAn 75 2499038
en Canc%C3%BAn%2C_Mexico 1 30782
en Canc%C3%BAn%2C_Quintana_Roo 2 0
en Canc%C3%BAn,_Quintana_Roo 1 30782
en Canc%C3%BAn_International_Airport 7 219001
en Canc%FAn 689 0
en Canc\xC3\xBAn 4 126709
[...]
en Cancun 2 61560
en Cancun,_Mexico 3 92352
en Cancun_International_Airport 3 369228
en Cancun_Underwater_Museum 2 19440
en Cancun_airport 1 31341
The requests for Canc%FAn seem to be what make this "popular". The wiki only partly supports that encoding; putting it into a URL takes me to the intended article [2] but an attempt at making a wiki-link looks like this: [[Canc%FAn]]. Looking further in the log file, I noticed that there were no requests to other Wikipedias, or for other articles or files, with the word encoded as "Canc%FAn". Requests with the "Canc%C3%BAn" encoding had much more variation:
commons.m File:Aeropuerto_de_Canc%C3%BAn.JPG 1 11518
commons.m File:Canc%C3%BAn,_Quintana_Roo_Collage.jpg 5 79278
commons.m File:Hard_Rock_Cafe_Canc%C3%BAn.JPG 1 9196
commons.m File:Hotel_Bah%C3%ADa_Pr%C3%ADncipe-Chacumal-Estrada_federal_307_Canc%C3%BAn-Chetumal-1.jpg 1 0
de Canc%C3%BAn 3 55545
de UN-Klimakonferenz_in_Canc%C3%BAn 1 46264
en Amante_bandido_-_Miguel_Bos%C3%A9_en_Canc%C3%BAn_(acercamiento_con_binocular) 1 7153
en Aut%C3%B3dromo_de_Canc%C3%BAn 1 18823
en Canc%C3%BAn 75 2499038
en Canc%C3%BAn%2C_Mexico 1 30782
en Canc%C3%BAn%2C_Quintana_Roo 2 0
en Canc%C3%BAn,_Quintana_Roo 1 30782
en Canc%C3%BAn_International_Airport 7 219001
en Category:People_from_Canc%C3%BAn 4 32788
en File:Canc%C3%BAn%2C_Quintana_Roo_Collage.jpg 6 57540
en File:Canc%C3%BAn,_Quintana_Roo_Collage.jpg 5 47950
en Talk:Canc%C3%BAn 1 24111
es Aeropuerto_Internacional_de_Canc%C3%BAn 9 385164
es Canc%C3%BAn 51 3217033
es Estadio_Canc%C3%BAn_86 1 0
es Pioneros_de_Canc%C3%BAn 1 10636
eu Canc%C3%BAn 1 13625
fr Canc%C3%BAn 3 48714
fr Les_Marseillais_%C3%A0_Canc%C3%BAn 7 117465
hr Canc%C3%BAn 1 11223
it Aeroporto_Internazionale_di_Canc%C3%BAn 1 17328
it Canc%C3%BAn 1 17144
ko eu:Canc%C3%BAn 1 20
mr %E0%A4%9A%E0%A4%BF%E0%A4%A4%E0%A5%8D%E0%A4%B0:Sala_embarque_aeropuerto_de_Canc%C3%BAn.JPG 1 12132
pl Canc%C3%BAn 5 186749
pt Canc%C3%BAn 16 396723
tr Canc%C3%BAn 1 12754
I noticed especially that on the Spanish Wikipedia, there were 51 requests for "Canc%C3%BAn" but none for "Canc%FAn" whereas on the English Wikipedia, there were 4 requests for "Canc\xC3\xBAn" and 689 for "Canc%FAn". —rybec 22:01, 25 December 2013 (UTC)
- Interesting: http://stackoverflow.com/questions/9295049/php-urlencode-charset-encoding-issue -- West.andrew.g (talk) 19:26, 26 December 2013 (UTC)
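For anyone wanting to reproduce the difference between the two encodings seen in the log excerpt, here is a minimal Python sketch using the standard library's `urllib.parse.unquote`. The helper name `decode_title` is made up for illustration, and the Latin-1 fallback is only an assumption about what the client sending `Canc%FAn` intended; nothing in the logs confirms it.

```python
from urllib.parse import unquote


def decode_title(raw: str) -> str:
    """Percent-decode a logged title, trying UTF-8 first and Latin-1 second.

    Hypothetical helper: the fallback order is an assumption for this
    sketch, not the behaviour of MediaWiki or of the stats aggregation.
    """
    try:
        return unquote(raw, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        # Byte 0xFA is not valid UTF-8 on its own, but it is "ú" in Latin-1.
        return unquote(raw, encoding="latin-1")


utf8_form = unquote("Canc%C3%BAn")   # unquote assumes UTF-8 by default
latin1_form = unquote("Canc%FAn")    # invalid UTF-8, so the byte becomes U+FFFD

print(utf8_form, latin1_form, decode_title("Canc%FAn"))
```

Decoding `Canc%FAn` as UTF-8 yields the replacement character U+FFFD in place of the accented letter, which would be consistent with the odd character in the section heading.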
similar graphs
The Web site stats.grok.se has graphs of the traffic. For last week's list, I noticed that many of the most-requested articles about food, ecology, politics and geography had similar graphs (for Climatic Research Unit email controversy and two others, the similarity to all the others begins after a drastic increase in traffic).
—rybec 10:32, 29 December 2013 (UTC)
I deliberately exclude the climate change articles' views from my reports, because I assume they are artificially generated; the fact that they follow similar patterns would appear to support that. Serendipodous 11:03, 29 December 2013 (UTC)
(edit conflict) The ones that didn't match were mainly about current events or entertainment (my computer mangled some of the diacritical marks):
I think the traffic to articles in the first list is mostly automated. On 1 November, noticing a massive number of requests for Harlan Watson, I wrote the article. Several of the sources I found call the man Harlan L. Watson. Since the beginning of November, there have been over a million requests for Harlan Watson, but only 24 for Harlan L. Watson. Also striking is the fact that no one else has edited the article or its talk page. Along the same lines, I notice that:
- Lohachara gets massive traffic (541705 requests in Nov. [404]) but Lohachara Island gets much less (255 [405]).
- Hariabhanga river gets massive traffic (542456 requests in Nov. [406]) but Hariabhanga receives very little (0 requests in Nov. [407]).
- Climate change scepticism (a topic associated with the United States) gets massive traffic (541841 requests in Nov. [408]) but the American spelling, Climate change skepticism, gets much less (237 [409]). —rybec 11:45, 29 December 2013 (UTC)
- I don't have a terrible amount to contribute here, but I will add: (a) Don't underestimate the circadian/weekly patterns in your first set of statistics. I don't see much else remarkable going on in those graphs besides that. (b) These weird spellings getting more traffic than the base article are most certainly indicative of bot traffic. This is another reason to bug the analytics team to produce some aggregate metrics for us. I am beginning to suspect that WP:5000 and other sources might be dramatically over-reporting/suggesting direct human traffic to articles. West.andrew.g (talk) 23:01, 29 December 2013 (UTC)
- I had been looking more at the secular trend than at the weekly cycles. On a weekly scale these still look odd: I just looked at the November 2012 graph for "Denmark" [410] and it showed a regular weekly variation. That's gone in this November's [411] (with 3 times the traffic). "Greenhouse effect" shows weekly cycles in both Novembers, with traffic declining by 19%: [412] [413]; "Greenhouse gas" loses its weekly pattern and takes the same pattern as "Denmark" [414] [415], with traffic increasing to 10.5 times what it was.
- Before, I had opened the graphs in different tabs in a browser, then switched between tabs. In the first group, the traffic increased steadily from 10 October until 16 November, then gradually declined through December (except for Copenhagen_treaty, Vegetarian_cuisine and Climatic_Research_Unit_email_controversy, for which the high traffic began on 7 November, then followed the same curve as the others in the eco-food group).
- Some articles in the first group are getting far more requests than they did in 2012. Other, comparable articles don't show such drastic changes.
ratio of November 2013 to November 2012 requests

| Article | Group | Nov 2013 / Nov 2012 requests | Ratio |
|---|---|---|---|
| Main Page | non-eco group | 352080385/271083976 | 1.30 |
| Meat | eco group | 1664329/45763 | 36.4 |
| Beef | non-eco group | 56685/94833 | 0.60 |
| Quinoa | eco group | 792725/258832 | 3.06 |
| Food | eco group | 659638/123008 | 5.36 |
| India | perhaps I shouldn't have included this in the first group | 1228738/976577 | 1.26 |
| Finland | non-eco group | 225774/182604 | 1.24 |
| Denmark | eco group | 735967/244273 | 3.01 |
| San_Francisco | eco group | 741764/237712 | 3.12 |
| Oakland | non-eco group | 3702/5902 | 0.62 |
| Environmentalism | non-eco group | 20921/27438 | 0.76 |
—rybec 04:40, 30 December 2013 (UTC)
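The year-over-year comparison above is easy to script. The following Python sketch recomputes the ratio for a few of the rows and flags the large jumps; the function name `growth_ratio` and the cut-off of 3.0 are assumptions chosen for illustration, not part of anyone's actual tooling.

```python
# (article, November 2013 requests, November 2012 requests),
# copied from a few rows of the figures quoted above.
counts = [
    ("Main Page", 352080385, 271083976),
    ("Meat", 1664329, 45763),
    ("Beef", 56685, 94833),
    ("Quinoa", 792725, 258832),
]

THRESHOLD = 3.0  # arbitrary cut-off, chosen only for this illustration


def growth_ratio(now: int, before: int) -> float:
    """Return requests in the later month divided by requests in the earlier one."""
    return now / before


for title, now, before in counts:
    ratio = growth_ratio(now, before)
    marker = "  <-- unusual growth?" if ratio >= THRESHOLD else ""
    print(f"{title:10s} {ratio:6.2f}{marker}")

# Titles whose traffic grew by at least the threshold factor.
flagged = [t for t, now, before in counts if growth_ratio(now, before) >= THRESHOLD]
```

A high ratio alone does not prove automated traffic; it only narrows down which graphs are worth inspecting by hand.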
Significant errors present in next week's data
FYI, the WMF statistical backend malfunctioned for nearly 35 hours over 1/5 and 1/6. Notice the empty (4k) hourly files in the usual location. I don't know if this is something they can recover, but if not, it will certainly have great bearing on our next WP:5000 and its comparisons to previous editions of this list. West.andrew.g (talk) 15:30, 8 January 2014 (UTC)
Yearly summary in production
@The ed17: @Serendipodous: @Milowent: @Yaris678: -- Code is currently running to spit out a 2013 statistical summary equivalent in format to WP:5000. This is no trivial task, and I expect it to take on the order of a couple days to do the massive database join. Once it is done, I am thinking it will be a valuable and fun resource. I can also spin off a couple of tables for the "biggest hours" or "biggest days" for certain events/articles. Framed with discussion this should make a nice Signpost article, and given the success of our last attempt, I'd again like to see this pushed to Reddit, Slashdot and all the other outlets we can think of. Who is on board?
In related news, I'd like to combine these statistics, our previous discussion/analysis in the Signpost, and some novel processing towards an academic publication (a conference deadline friendly to this topic is coming in late February). I'd like to invite those who I interact with regularly here to be my co-authors in that effort. While the Signpost is great for Wikipedia folks, it would be nice to reach out to the larger web research community and perhaps get others interested in the data. West.andrew.g (talk) 18:34, 2 January 2014 (UTC)
- Hi. Andrew. Um, I think that what you're doing is great; the top 5000 of the year would certainly be worth a look, but the Foundation already published its annual top 100. Hope that doesn't deflate your sails too much. Serendipodous 18:58, 2 January 2014 (UTC)
- I thought I remembered seeing that a week or two back, so it couldn't have truly captured *all* of 2013 (although the impact is probably minimal). Regardless, I'll spin up the top 5000 (and maybe more) and some related charts. If you'd like to do your usual snarky take on the top of the list, that might also be well received. Moreso than the raw stats, I want to get some discussion about catalysts into academically published form. West.andrew.g (talk) 19:16, 2 January 2014 (UTC)
- My draft is already at the Signpost. The big problem with determining catalysts is figuring out which topics are due to human interest, which are due to error, and which are due to automated bots. We simply don't have enough tools yet. Serendipodous 20:37, 2 January 2014 (UTC)
- It would be interesting to see where the nonexistent articles rank, perhaps mixed in with the ones that exist. —rybec 21:18, 2 January 2014 (UTC)
- Yep, that was my intention. Once I've distilled this comprehensive table, it will be a piece of cake (at least in terms of raw code; time might be another matter), along with possibly some other data questions. Thanks, West.andrew.g (talk) 23:21, 2 January 2014 (UTC)
- I'd be interested in pitching in. I am interested to see how one year's data of the WP:5000 has played out. Even if the Foundation has done its top 100 already, we could create something that is much more interesting. Including sublists for top movies, top TV shows, etc., that might be of interest. - it will require human effort to compile, but something like comparing a list of the top 25 movies vs. box office $$ would be interesting. Similar for TV, is Breaking Bad really the most popular TV show in the English-speaking world? Or just among wikipedia readers? And @Serendipodous:, don't get too down about the problem of ferreting out bot-influenced articles, I'd say you've done a good job of finding suspicious entries, e.g., your draft list is sound for removing G-force, these are things we wrestled over early that you now can deal with easily.--Milowent • hasspoken 23:55, 2 January 2014 (UTC)
Is the update late this week? Hope you're not too overloaded. Serendipodous 16:47, 5 January 2014 (UTC)
- Yes, there is a delay. I'm not overloaded, but the CPU that does all this work is. It should appear in the next 10 hours or so I estimate. West.andrew.g (talk) 00:57, 6 January 2014 (UTC)
- And I need to backtrack on something I said to Rybec above. I will not be computing a full redlinks report; the computational struggles surrounding it are why I have not published any report yet. The long tail of non-existent pages is incredibly diverse across time. More than 50 million articles will be requested in a week's table (with only ~4.5 million actually existing). I couldn't quite compute this figure for the year, but let's say it's probably on the order of 250+ million unique articles requested. When you have 250 million of these and need to join the stats data from 52 weekly tables, things quickly blow up on my infrastructure. I plan to bound this to the 4.5 million existing articles to help with this. West.andrew.g (talk) 19:50, 9 January 2014 (UTC)
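To illustrate the idea of bounding the join to existing articles, here is a small in-memory Python sketch. It is not the code that produces the reports: the function name `yearly_totals`, the dictionary-per-week layout, and the sample titles are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, Mapping, Set


def yearly_totals(weekly_tables: Iterable[Mapping[str, int]],
                  existing_titles: Set[str]) -> Counter:
    """Sum per-title views across weekly tables, keeping only existing titles.

    Filtering while summing keeps the accumulator bounded by the number of
    existing articles instead of by every title ever requested.
    """
    totals: Counter = Counter()
    for week in weekly_tables:
        for title, views in week.items():
            if title in existing_titles:  # skip the long tail of non-existent pages
                totals[title] += views
    return totals


# Made-up sample data: two "weeks", two existing titles, two junk requests.
weeks = [
    {"Meat": 10, "Junk_request_1": 500, "Denmark": 7},
    {"Meat": 12, "Junk_request_2": 900},
]
existing = {"Meat", "Denmark"}

totals = yearly_totals(weeks, existing)
print(totals.most_common())
```

The same filter expressed in SQL would be an inner join against a table of existing titles before aggregating, so the never-existing rows are dropped before they inflate the intermediate result.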
Highest traffic events of 2013 (by "article hour")
Below are the busiest "article hours" in 2013. That is, those articles receiving the most traffic in a one hour period. Only the most popular hour for a title is shown, and I've excluded the main page. I've pasted the first 500 entries in raw form. Recall that these dates are in UTC time. If someone would like to wikify and extend this table, perhaps we could try to publicize a bit?
ARTICLE | UTC DATE | VIEWS | REASON
----------------------------------------------------------------------
[[Jorge_Bergoglio]] | March 13, 2013 | 1,460,586 | Papal ascension
[[Shakuntala_Devi]] | November 4, 2013 | 766,256 | Google Doodle
[[Paul_Walker]] | December 1, 2013 | 752,770 | Death
[[Grace_Hopper]] | December 9, 2013 | 621,694 | Google Doodle
[[Nelson_Mandela]] | December 5, 2013 | 484,966 | Death
[[Jodie_Foster]] | January 14, 2013 | 451,270 | Came out at Golden Globes
[[Beyonc%C3%A9_Knowles]] | February 4, 2013 | 378,923 | Super bowl halftime
[[Nicolaus_Copernicus]] | February 19, 2013 | 336,836 | Google Doodle
[[Seth_MacFarlane]] | February 25, 2013 | 320,999 | Hosted the Oscars
[[Daniel_Day-Lewis]] | February 25, 2013 | 318,839 | Oscars
[[Society_of_Jesus]] | March 13, 2013 | 287,568 | Papal ascension
[[Mindy_McCready]] | February 18, 2013 | 282,679 | Death
[[Hermann_Rorschach]] | November 8, 2013 | 276,072 | Google Doodle
[[Edith_Head]] | October 28, 2013 | 263,915 | Google Doodle
[[Raymond_Loewy]] | November 5, 2013 | 258,301 | Google Doodle
[[Margaret_Thatcher]] | April 8, 2013 | 252,906 | Death
[[Pope_Francis]] | March 13, 2013 | 248,753 | Papal ascension
[[Peter_Capaldi]] | August 4, 2013 | 244,667 | Announced as next Dr. Who
Thanks, West.andrew.g (talk) 20:30, 9 January 2014 (UTC)
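The "busiest article hour" selection described above (one row per title, its single highest hour, main page excluded) can be sketched as follows in Python. The row layout, the function name `busiest_hours`, and the sample figures are made up for illustration and are not taken from the real data set or the actual report code.

```python
from typing import Dict, Iterable, List, Tuple

HourlyRow = Tuple[str, str, int]  # (title, UTC hour label, views in that hour)


def busiest_hours(rows: Iterable[HourlyRow],
                  skip: Tuple[str, ...] = ("Main_Page",)) -> List[HourlyRow]:
    """Keep each title's single busiest hour, sorted by views, highest first."""
    best: Dict[str, Tuple[str, int]] = {}
    for title, hour, views in rows:
        if title in skip:
            continue  # the main page is excluded from the ranking
        if title not in best or views > best[title][1]:
            best[title] = (hour, views)
    ranked = [(title, hour, views) for title, (hour, views) in best.items()]
    return sorted(ranked, key=lambda row: row[2], reverse=True)


# Made-up sample rows.
sample = [
    ("Main_Page", "2013-01-01 05:00", 999),
    ("Example_A", "2013-01-01 05:00", 40),
    ("Example_A", "2013-01-02 06:00", 75),
    ("Example_B", "2013-03-03 12:00", 60),
]

top = busiest_hours(sample)
for title, hour, views in top:
    print(f"[[{title}]] | {hour} | {views:,}")
```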
The 10,000 most popular articles in 2013
After many computer cycles, the list has been generated. I did the top 10k with quality annotations. Give it a while to load, as there is a ton of table processing that has to go on for that page to generate:
The top 10,000 for 2013 -- I would appreciate if people could re-post to whatever talk pages or venues might find this interesting. Thanks, West.andrew.g (talk) 17:02, 13 January 2014 (UTC)
Slightly off-topic question about page view stats
I am currently working on organizing pageview stats for my own purposes, although it may prove useful to others as well if things go well. This seemed to be the best (most watched) place to get the attention of multiple "page view gurus"...
Specifically, I am looking to use the logs to analyze how the Olympics drove traffic on athlete articles. However, depending on performance, I may be inspired to expand the project to a longer range of data and make a stats service like grok\wikistats (but more focused on traffic jumps). My first dilemma is how to structure the database - specifically for scalability. My thought was table 1: id (primary key), pagename (indexed). Table 2: id,date,hour (3 col primary key), hits. Initial calculations suggest that will be fine for 1 month of data, but if extended I'm not so sure. Any advice\experiences to share?
Second, any thoughts about combining equivalent hits (example "First_Last" vs. "First%20Last")? Currently I "un-uri-encode" the data and combine identical. This makes import slower, but I think is "correct" as the two requests should resolve the same. Is there any valid reason not to combine?
I will have more questions later on people's preferred way to handle several data handling choices later if I decide to pursue the public stats service idea. --ThaddeusB (talk) 03:09, 3 March 2014 (UTC)
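As a rough illustration of the two-table layout and the "un-uri-encode and combine" step described above, here is a self-contained Python sketch using the standard `sqlite3` module. The table and column names follow the description in the post; the helper names `normalize_title` and `record`, the choice of SQLite, and the rule of turning spaces into underscores are assumptions for the example only.

```python
import sqlite3
from urllib.parse import unquote


def normalize_title(raw: str) -> str:
    """Percent-decode a requested title and use underscores for spaces."""
    return unquote(raw).replace(" ", "_")


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (
    id       INTEGER PRIMARY KEY,
    pagename TEXT NOT NULL UNIQUE          -- UNIQUE also creates the index
);
CREATE TABLE hits (
    page_id INTEGER NOT NULL REFERENCES pages(id),
    date    TEXT    NOT NULL,              -- e.g. '2014-02-10'
    hour    INTEGER NOT NULL,              -- 0..23, UTC
    hits    INTEGER NOT NULL,
    PRIMARY KEY (page_id, date, hour)      -- the three-column key
);
""")


def record(raw_title: str, date: str, hour: int, count: int) -> None:
    """Add `count` hits, merging requests that decode to the same title."""
    name = normalize_title(raw_title)
    conn.execute("INSERT OR IGNORE INTO pages (pagename) VALUES (?)", (name,))
    page_id = conn.execute(
        "SELECT id FROM pages WHERE pagename = ?", (name,)).fetchone()[0]
    updated = conn.execute(
        "UPDATE hits SET hits = hits + ? WHERE page_id = ? AND date = ? AND hour = ?",
        (count, page_id, date, hour))
    if updated.rowcount == 0:
        conn.execute(
            "INSERT INTO hits (page_id, date, hour, hits) VALUES (?, ?, ?, ?)",
            (page_id, date, hour, count))


record("First_Last", "2014-02-10", 3, 5)
record("First%20Last", "2014-02-10", 3, 2)  # same page, different encoding
print(conn.execute("SELECT pagename, hits FROM pages JOIN hits ON id = page_id").fetchall())
```

Whether combining is always "correct" depends on how the server resolves each raw request, so it may be worth keeping the raw title somewhere during import if that question matters later.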
- First off, I posted a version of my code to either the [WikiAnalytics-l] or [WikiResearch-l] mailing list within the past year. See if you can hunt that out.
- If you know which articles' statistics you want (i.e., a category based on Olympic athletes) you have a much smaller problem on your hands. Anything you can hack together should work on a problem this small (I am doing something similar for WP:MED).
- If you're looking to store statistics for all pages, you should be aware there are sometimes as many as 5x more requests for non-existent pages than existent ones. These non-existent requests also tend to be quite sparse, so your table length blows up very quickly and increases index lookup times. I used to store hourly statistics (now I just aggregate and store daily totals). Having a dedicated column for each hour is/was a horrible idea. I ended up creating daily columns as BLOBS and then serializing an object to these fields that better handled sparseness. Then at some point I determined this hourly data wasn't getting used enough to justify its storage footprint.
- Your schema sounds more elegant. I just do ([primary] page, day_1, day_2, day_3, ... day_7). I start a new table every 7 days. This is a hacky way to avoid the long-tail issues of the page-view distribution. I am also writing weekly reports for the WP:5000, so one table wraps one of those nicely.
- I do no logic to combine equivalent hits. I do urge you to be careful when handling character encodings. West.andrew.g (talk) 17:00, 3 March 2014 (UTC)
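The sparse per-day storage mentioned above (one BLOB per day instead of 24 hourly columns) might look something like the Python sketch below. The actual serialization format used is not stated in this discussion; JSON and the function names `pack_hours` and `unpack_hours` are assumptions picked purely for the example.

```python
import json
from typing import Dict, List


def pack_hours(hourly: Dict[int, int]) -> bytes:
    """Serialize only the hours that had traffic into a compact blob."""
    sparse = {str(hour): views for hour, views in hourly.items() if views}
    return json.dumps(sparse, separators=(",", ":")).encode("utf-8")


def unpack_hours(blob: bytes) -> List[int]:
    """Expand a blob back into 24 hourly counts, zero where nothing was stored."""
    sparse = json.loads(blob.decode("utf-8"))
    return [sparse.get(str(hour), 0) for hour in range(24)]


# A rarely requested page: traffic in only two hours of the day.
blob = pack_hours({3: 1, 17: 2, 20: 0})
hours = unpack_hours(blob)
print(blob, sum(hours))
```

For pages requested only a handful of times per day, the blob stays a few bytes long, which is the point of handling sparseness this way; the daily total is then just the sum of the unpacked list.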
- Thanks for the tips, and mailing list suggestions - some interesting conversation there. I did find the link to your source code, but it was 404 (upenn website). If you want, you can provide a new link - I would take a look out of curiosity, but probably not actually use anything from it. I figure it is better to set it up "right" now in case I want to scale it rather than have to redo things later. --ThaddeusB (talk) 01:07, 5 March 2014 (UTC)
- That link should work if you change "www.cis.upenn.edu/~westand/bla" to "www.andrew-g-west.com/bla". West.andrew.g (talk) 06:55, 6 March 2014 (UTC)
- there are sometimes as many as 5x more requests for non-existent pages than existent ones
- This is such an interesting fact. I'm assuming these are topics that readers want to know more about or are they more jokey? Are there ones that keep appearing? I thought your data was for viewed pages, not for Wikipedia searches. Liz Read! Talk! 19:23, 4 March 2014 (UTC)
- It's mostly junk (malformed requests, people accidentally typing things into their browser at the end of a URL they just visited, weird bot activity). Andrew does have a most visited redlinks report. --ThaddeusB (talk) 01:07, 5 March 2014 (UTC)
- Correct, these are attempts to access a page inside the English Wikipedia domain space that did not exist. I have to think bots and content scrapers are a major part of the long tail. West.andrew.g (talk) 06:55, 6 March 2014 (UTC)
Red Links
lots of Red Links on this page
- Mis-configured Polish web scraper. Serendipodous 11:53, 11 June 2014 (UTC)
Reporting error involving colon characters
Following a bug report at User_talk:West.andrew.g#Weird_topics_in_top_5000_list_and_Stats.Grok.Se I have discovered that this aggregation has not been handling colon characters properly. Previous code used colons as an indication that an article was outside of namespace 0 (the "main" or "article" namespace). Therefore article titles that contained colons, such as Call_of_Duty_4:_Modern_Warfare, would have been excluded from this list. That bug has now been fixed. However, this also represents a non-trivial change to the very inner loops of the aggregation routines. Please check for odd behavior at the next update, especially as it pertains to namespaces and titles with colons. Thanks, West.andrew.g (talk) 15:04, 10 June 2014 (UTC)
- Note that the current weekly report is not accurate with respect to reporting pageviews for titles involving colons. I believe I fixed the code some time in the middle of the week, and the process runs nightly, so several days contributing to the weekly total were not back-processed. I expect everything to be correct come next week. Thanks, West.andrew.g (talk) 18:05, 16 June 2014 (UTC)
- Note further that Talk:Main Page should not have made its way into the aggregate reports this week. It is my intention only to process, store, and report on namespace zero article views. This case was an oversight, which has been fixed in code. As with above, some days have already been processed this week, meaning the next set of reports might include "talk" pages and misreport their view counts. Thanks, West.andrew.g (talk) 04:41, 20 June 2014 (UTC)
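For readers following the colon fix described in this thread, here is a minimal sketch of the idea in Python. It is not the aggregation's actual code: the function name `is_article_title` and the set `NON_ARTICLE_PREFIXES` are assumptions made for illustration, and the set is only a partial sample of English Wikipedia namespace prefixes. The point is that a colon alone says nothing about the namespace. Only a recognized prefix before the first colon does.

```python
# Hypothetical, partial sample of non-article namespace prefixes (assumption,
# not an exhaustive or authoritative list).
NON_ARTICLE_PREFIXES = {
    "talk", "user", "user_talk", "wikipedia", "wikipedia_talk",
    "file", "file_talk", "template", "template_talk",
    "help", "help_talk", "category", "category_talk",
    "portal", "special", "mediawiki",
}


def is_article_title(title: str) -> bool:
    """Return True if the title appears to be in namespace 0 (articles)."""
    prefix, sep, _rest = title.partition(":")
    if not sep:
        # No colon at all, so there cannot be a namespace prefix.
        return True
    # Normalize spacing and case before comparing against known prefixes.
    normalized = prefix.strip().replace(" ", "_").lower()
    return normalized not in NON_ARTICLE_PREFIXES


if __name__ == "__main__":
    for t in ["Call_of_Duty_4:_Modern_Warfare", "Talk:Main_Page", "Main_Page"]:
        print(t, is_article_title(t))
```

Under this check a title such as Call_of_Duty_4:_Modern_Warfare is kept, while Talk:Main_Page is filtered out, which matches the behavior the two notes above describe as the goal.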
Update
I noted that the WP:5000 did not get updated on 17 Aug, is it scheduled to occur soon? Cheers.--Milowent • hasspoken 02:57, 20 August 2014 (UTC)
- Andrew hasn't edited since the 15th, so it seems everyone is on Wikibreak at the moment. Fair enough. Serendipodous 16:44, 20 August 2014 (UTC)
- But my platform to comment on Robin Williams will be destroyed!!11 Seriously though, Williams gives me another idea for a side article as well. His article[416] got over 6 million views on August 12. Mandela only got 2.6 million on Dec 6, 2013, the day after his death - Wikipedia:Top_25_Report/December_1_to_7,_2013, AND he got beat on the Top25 by Paul Walker that week, which I must admit caused me to lose some faith in humanity. (Walker got 4.2 million views [417] on Dec 1, 2013 - though look at #6 here[418] and it gives you some insight into how out of sync we may be with the relative popularity of certain celebrities.) Anyway, it would be interesting to create a chart of the most popular wikipedia death articles, probably using the stats for the day after the subject's death, e.g., Michael Jackson got 5.8 million[419] on 26 June 2009. I may start something at User:Milowent/sandbox and invite anyone interested to chime in there.--Milowent • hasspoken 18:48, 20 August 2014 (UTC)
- I'm alive. The report is in progress. The root cause was a couple of days of server downtime, and since each day of data takes a couple of hours to process, catching up won't be immediate. Also be prepared for some drama involving missing/unreported mobile view counts, something I am currently trying to better understand with the WMF folks. The immediate conclusion/fallout of that will take a bit longer to uncover, though. West.andrew.g (talk) 01:33, 21 August 2014 (UTC)
Mobile views
Do these numbers include views to the mobile versions of these pages? Thanks. Biosthmors (talk) pls notify me (i.e. {{U}}) while signing a reply, thx 17:07, 8 August 2014 (UTC)
- @Biosthmors: To the best of my knowledge, yes they do. The raw files are (project,page,views) tuples. Mobile views do not hit different projects or different page versions. I am not privy to what quantity of views are mobile, though. I assume the Foundation is doing something with UserAgent strings in order to distinguish mobile views, but I am not aware of those numbers being publicly available (outside of the massive aggregates on WikiStats/ReportCard). Thanks, West.andrew.g (talk) 15:37, 9 August 2014 (UTC)
- Err... See the post below ("STICKY: On the Non-reporting of Mobile Views"). West.andrew.g (talk) 18:34, 4 September 2014 (UTC)
On the Non-Reporting of Mobile Views
A significant statistical issue has come to my attention. Quite simply, the WMF does not record/report per-article mobile views, and thus they are unavailable for my aggregation. This means the numbers I present significantly under-report the actual number of total views, as the WMF provides only the "desktop" (non-mobile) perspective.
This has been confirmed by WMF staff. They have indicated to me that the processing infrastructure of the WMF is insufficient to handle the workload at this time.
Frankly, this came as a surprise to me. It is a bit perplexing to me why English "mobile" pageviews can't be included in the per-page aggregates for English "desktop" views; they are, after all, the exact same content. This very well could be an artifact of an earlier system design that was not prepared to handle mobile views. I am in no position to comment on that hypothesis.
To say our numbers (limited only to "desktop" views) are under-representing actual views is quite an understatement. The one thing the WMF does monitor in both desktop and mobile formats is project-scale view counts, as can be seen in the 2nd and 3rd graphs at [420]. Based on ~9.5B total en.wp views at the last snapshot, ~3B of which were mobile, the average per-article total under-reports by a factor of 1.38x. We might imagine this factor is even higher on entries found in WP:5000 and WP:Top25Report, whose pop-culture nature might lend itself more to mobile audiences.
If/when the WMF starts reporting per-article mobile views, I'll be quick to integrate them into my reporting infrastructure. Until then? Community awareness of the issue might bring a more rapid solution. Also, should we consider designing a template (with link back to this thread) that points out this fact, and put it atop all of the prior reports? Thanks, West.andrew.g (talk) 18:36, 4 September 2014 (UTC)
- @Jmh649: @The ed17: Thought you may be interested. West.andrew.g (talk) 18:46, 4 September 2014 (UTC)
- Let me make sure I'm understanding this right. You're saying that out of a total view count of 9.5 billion views, the article traffic statistics are missing out on a little less than a third of those views? (I'm confused by that vs. 1.38x, sorry. Math was never a strong suit of mine) Ed [talk] [majestic titan] 19:09, 4 September 2014 (UTC)
- @The ed17: That is correct. If the "Top 5000" were a monthly sorted list of ALL en.wp pages, the sum total of those views would be 6.5B (only the "desktop" portions). In actuality, that sum should be 9.5B. We are given article granularity statistics capturing 68.4% of traffic, or stated another way, 31.6% of views are missing from the provided counts. West.andrew.g (talk) 19:57, 4 September 2014 (UTC)
- This is why I have developed the new graph here to take into account mobile. But yes it is not perfect. Doc James (talk · contribs · email) (if I write on your page reply on mine) 23:51, 4 September 2014 (UTC)
- It would be nice to see the mobile+desktop views, so I asked about this at User talk:Mdennis (WMF). Thanks for the info Andrew. Biosthmors (talk) pls notify me (i.e. {{U}}) while signing a reply, thx 12:49, 16 September 2014 (UTC)
- I've added my response to the question she posed there. Thanks, West.andrew.g (talk) 01:49, 18 September 2014 (UTC)
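The desktop/mobile arithmetic discussed in this thread can be checked with a few lines of Python. This is only a sketch using the approximate figures quoted above (~9.5B total views, ~3B of them mobile); the variable names are illustrative. On these figures the desktop share and missing share match the 68.4% and 31.6% quoted, while the total-to-desktop ratio comes out nearer 1.46 than 1.38 (a 9B total would give about 1.38).

```python
# Approximate project-scale figures quoted in the thread (assumed exact here).
total_views = 9.5e9    # desktop + mobile
mobile_views = 3.0e9   # not reported per article

desktop_views = total_views - mobile_views          # what per-article counts capture
desktop_share = desktop_views / total_views         # fraction of traffic captured
missing_share = mobile_views / total_views          # fraction of traffic missing
under_report_factor = total_views / desktop_views   # multiplier to recover the total

print(f"desktop views:       {desktop_views / 1e9:.1f}B")
print(f"captured share:      {desktop_share:.1%}")
print(f"missing share:       {missing_share:.1%}")
print(f"under-report factor: {under_report_factor:.2f}x")
```

The captured and missing shares are fractions of the total, whereas the under-report factor is the number a desktop-only count would need to be multiplied by to estimate total views, which is why the two ways of stating the gap look different.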