User talk:West.andrew.g/Archive 8
This is an archive of past discussions with User:West.andrew.g. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
STiki -> Wikimedia Labs?
Have you considered moving STiki to an instance @ Wikimedia Labs? One advantage of doing this is that you can recruit other Labs users (stakeholders in the functioning of STiki) to help with maintenance and testing new features. If you're interested, I'd be happy to help out. Given Snuggle's reliance on STiki, I'd also volunteer to help with maintenance. --EpochFail(talk • work) 15:48, 23 May 2013 (UTC)
- +1 I just came to write exactly the same thing 930913(Congratulate) 15:55, 23 May 2013 (UTC)
- I do have an account at Labs and agree there is no real disadvantage to it running there, aside from some possibly non-trivial initial reconfiguration. However, the machine on which STiki runs is critical to several research endeavors of mine (not all wiki related) -- so I have a vested interest to keep it up and running at all times (and perhaps, an interest to keep all this consolidated). This is just a weird transitory period where I am not physically adjacent to the machine due to my move and its need for a static IP address.
- I'll also note that the times STiki has broken due to reasons outside of my own stupidity (e.g., power failures, network loss) are extremely few (maybe one or two a year). Prior to this week we were seeing uptime of 100+ days without a hitch. I blame the computer gods and Aaron for this, since things only break when someone actually needs them. If someone else wanted to lead a transition to Labs, I would not be opposed -- but I think it would be in everyone's best interest to see if things re-stabilize after this string of bad luck. That being said, it is my intention to transition STiki's code onto GitHub to get others involved. Also, I am not opposed to handing out some server credentials to trusted users so they can restart the daemon, etc. (but this doesn't do much good if the server is not reachable). Thanks, West.andrew.g (talk) 18:37, 23 May 2013 (UTC)
- GitHub would be great. Theopolisme (talk) 18:46, 23 May 2013 (UTC)
Popular redlinks with \x
WP:Snuggle is live at snuggle.grouplens.org
Snuggle, the newcomer socialization tool I've been building, is finally ready for general use. All you need to do to get started is point your browser to https://snuggle.grouplens.org. Let me know if you run into any trouble. I'll be watching WT:Snuggle. Or you can also just contact me directly. Thanks for your patience.
See also:
- Documentation
- Bug tracker
- Open source code repository
- My work log describing the last few months of work
--EpochFail(talk • work) 19:46, 14 June 2013 (UTC)
What the heck?
I don't even know where to begin this week. I can only assume that we're being spammed, particularly by a redlink generator with a fondness for Polish. I don't even know if I can comfortably narrow it down. Serendipodous 06:03, 7 July 2013 (UTC)
- It wasn't a rhetorical question. I was genuinely wondering if this was some kind of half-hearted DDoS attack. OK maybe I was overreacting but I had just spent two hours sifting through Reddit threads, news reports and traffic graphs, all the time pondering if I was going to have to do this every subsequent week, and was getting a bit flustered. I appreciate if you, like seemingly everyone else in the last 24 hours, have got angry with me for some reason, but no one else is going to do the Top 25 report, and I can't do it without a second opinion. Serendipodous 05:26, 8 July 2013 (UTC)
- I have an in-depth response to the popularity questions at Wikipedia_talk:Top25Report#OK.2C_open_forum_this_week. It was Independence Day over the (long) weekend for us Americans, which might explain some of the delayed responses (at least on my part). I am not irritated with you in any way. I don't think you need to double/triple post the top-25 bit every week (I watch *all* those pages), but that is small potatoes. Perhaps "hypothetical" was not the best word, but I blanked your post to my talk page because it was (a) a touch dramatic, (b) provided few details or context, and (c) the same point had already been made elsewhere. As I allude to in my longer post, I think we need to start using analytics to find data-driven solutions to these top-25 questions, rather than putting so much burden on human effort. Thanks, West.andrew.g (talk) 14:04, 8 July 2013 (UTC)
- Thank you very much :-) I don't really know if I can be of any more help. I guess we'll find out on Sunday. Serendipodous 20:26, 10 July 2013 (UTC)
- Here is the mailing list link. Be bold! Explain our issues and see what they can provide. We would be especially interested in some type of aggregate or anonymized referrer data. West.andrew.g (talk) 20:48, 10 July 2013 (UTC)
- Well I'm on the list. I can't seem to find your email. Do you think you could email me, just to make sure that the other analytics guys don't get their thoughts posted here if they don't want them to? Serendipodous 11:00, 11 July 2013 (UTC)
- I have not yet posted. I was proposing that you handle the human resources side (i.e., contacting them and clearing any administrative hoops we may have to jump through to get data access), and that I deal with the data analysis and reporting half. Thanks, West.andrew.g (talk) 13:21, 11 July 2013 (UTC)
- Well yeah but I assumed you'd want to know what they were saying. Serendipodous 14:21, 11 July 2013 (UTC)
Sorry, I misunderstood. My email addresses can easily be found at the bottom of my professional homepage. Thanks, West.andrew.g (talk) 14:27, 11 July 2013 (UTC)
- I don't think anyone in a position to do anything about this is really interested in doing anything about it. I think the smartest thing to do at this point is to just exclude redlinks. Sure we may lose that one in ten thousand that is actually important, but that's what WP:TOPRED is for. Serendipodous 09:27, 15 July 2013 (UTC)
It looks like the Wikistats page is down
It hasn't been updated since the 23rd. This is potentially ruinous, as it means we can't check to see if a spike follows a human pattern. Serendipodous 06:17, 27 July 2013 (UTC)
- Sorry for the late response, but the issue now appears fixed. Good work on the Top25 (it gets a good amount of traffic!). Thanks, West.andrew.g (talk) 16:27, 29 July 2013 (UTC)
Topred
I was thinking about getting the weekly WP:Topred – and a potential variant – community traffic. To do this, I'll rehash an old (possibly bad) idea to make things newsworthy for the Signpost. Maybe you could publish one really long list? Maybe it would capture everything down to 10 page views per week? I remember you saying it could take a ridiculous amount of computational time to do something like this, so maybe this could be a once a year or a bi-annual exercise. Maybe an op-ed to the community to encourage the creation of articles (and redirects) that people want could be of interest to the community. Biosthmors (talk) 16:06, 13 August 2013 (UTC)
- I'd be willing to do a one-off lengthy list for the Signpost if you can secure the fact they are actually interested. I feel like that list has become a bit mucked up by the omnipresence of strange character encodings and spam; one has to do a little digging to find the ones that might actually be deserving of page creation or redirections. I think an op-ed would be interesting, but I don't have the sexy statistics and explanations for red links -- so it will be considerably non-technical. Regardless, a lot of computer cycles are spun over this data, so the more people who can make use of this output, the better. West.andrew.g (talk) 14:39, 16 August 2013 (UTC)
- Any particular feelings, User:The ed17, on whether publicizing a long redlinks list (longer than WP:Topred) could have op-ed value? Biosthmors (talk) 10:23, 17 August 2013 (UTC)
- To give background context, do we know an average % of what English Wikipedia page view requests are to red links? Biosthmors (talk) 16:06, 17 August 2013 (UTC)
- I would consider calculating such a statistic if we had an op-ed piece (and eventually I need to compile all these lessons, graphs, and statistics into an academic publication of some type). It is a non-trivial calculation. I assume I could download the entire "pages" table from WP's database and then I would need to cross-check every statistics entry against that. West.andrew.g (talk) 17:01, 17 August 2013 (UTC)
- Re Signpost, I think it'd be a fine op-ed, and I'd be happy to publish it. Ed [talk] [majestic titan] 05:08, 18 August 2013 (UTC)
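The cross-check described in this thread (matching every statistics entry against the set of titles that actually exist) can be sketched as follows. This is a minimal illustration, not the code used for WP:Topred: the function name, the in-memory dictionary of view counts, and the sample data are all assumptions, and a real run would load the existing titles from a database dump rather than a literal set.

```python
def redlink_view_share(view_counts, existing_titles):
    """Return the fraction of page views that went to titles with no page.

    view_counts: dict mapping requested title -> number of views (hypothetical input)
    existing_titles: set of titles that exist, e.g. loaded from a page-table dump
    """
    total = sum(view_counts.values())
    if total == 0:
        return 0.0
    # Sum the views for every requested title that is not an existing page.
    red = sum(n for title, n in view_counts.items()
              if title not in existing_titles)
    return red / total


# Hypothetical sample data, for illustration only.
existing = {"Main_Page", "Python_(programming_language)"}
views = {
    "Main_Page": 900,
    "Python_(programming_language)": 80,
    "Nonexistent_thing": 20,
}
share = redlink_view_share(views, existing)
print(f"{share:.1%} of views went to red links")
```

Holding the existing titles in a set keeps each lookup cheap, which matters when every statistics entry has to be checked.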
How was Wikimania?
Did you get any help from the analytics people? Serendipodous 08:38, 18 August 2013 (UTC)
- Hong Kong was hot and my stay was brief. As to more pertinent matters, I did make some inquiries from some folks. Concerns were expressed w.r.t. user privacy and the efficient aggregation of these statistics. I should probably follow up via mailing list when I find a free moment. Thanks, West.andrew.g (talk) 15:08, 19 August 2013 (UTC)
- No hotter than here in Thailand ;) Sorry I missed you there Andrew. Perhaps another time. Kudpung กุดผึ้ง (talk) 15:32, 19 August 2013 (UTC)
- Yes, the ability to recognize others is quite difficult given the pseudo-anonymous terms under which Wikipedia tends to operate. I suppose some people might know who I am due to lots of speaking and the real-life links. But for most editors, it would take a lot of badge reading to find those you might know! West.andrew.g (talk) 15:36, 19 August 2013 (UTC)
Interview request: Your interactions with new editors
Hey Doc. West, I'm contacting you about a study that I'm running with TheOriginalSoni exploring newcomer mentorship activities in Wikipedia. I'd like to ask you a few questions about your interactions with newcomers and to explore how a tool like WP:Snuggle might make mentoring work easier. The interview and demo session will take 30 minutes to an hour depending on how much time we spend discussing things. If you're interested, let me know.
- Study overview: meta:Research:Peer_mentorship_and_snuggle
- Consent form: meta:Research:Peer_mentorship_and_snuggle/Consent
Thanks for your consideration. --EpochFail (talk • contribs) 15:03, 31 August 2013 (UTC)
- Yes, I am willing to do this some (East Coast, USA) evening. Thanks, West.andrew.g (talk) 13:55, 10 September 2013 (UTC)
- ... and rather than posting to your talk page, I'll just email you directly to work on scheduling. West.andrew.g (talk) 13:56, 10 September 2013 (UTC)
- Great! Thanks for getting back to me. It turns out that I've gotten enough responses for the study, but I'd still like to walk you through the tool and get your feedback. Would you mind if I pinged you again in a couple weeks to set that up? --EpochFail (talk • contribs) 14:12, 10 September 2013 (UTC)
- Yes that would be fine, things are calming down a touch here. West.andrew.g (talk) 14:13, 10 September 2013 (UTC)
Link to flow funding at m:Grants?
I created WP:WMF and then linked to m:Grants, which looks like it could use a link to flow funding? Biosthmors (talk) 10:23, 1 September 2013 (UTC)
- Or if these flow funds are only to support the English Wikipedia, perhaps listing it at WP:WMF is better. Biosthmors (talk) 10:23, 1 September 2013 (UTC)
Never mind. I see that flow funding was a pilot so maybe it's not beneficial to add a link. Meanwhile, I'm curious which Wikipedia space pages get the most hits. Could you publish a weekly 1000, 2000, or 5000 list, perhaps? Best regards. Biosthmors (talk) 11:25, 5 September 2013 (UTC)
- Flow funding was indeed a pilot project that was intended to serve all languages and WMF projects. I think it would be a stretch to call this trial run successful, but in a forward-looking fashion I've recorded some of my thoughts and suggestions about how to improve the process [1]. Whether it will emerge for a second trial run remains to be seen.
- Regarding statistics on articles outside the main namespace, these are not something that I currently store/collect. I receive plenty of requests in my inbox to enable statistical tracking for one thing or another (this language! that language! file namespace! this category of articles across all languages! another project!). From my perspective, it is a very slippery slope. I am resource constrained such that it is impossible for me to simply handle *everything* and currently I am pursuing some subsets that might result in academic publications. I have heard murmurs, however, that the new Wikimedia Labs setup might enable and revive an older tool that does some broader page view processing. West.andrew.g (talk) 14:10, 10 September 2013 (UTC)
BTW please don't think I'm blaming you
in that rant about doing the Top25 alone. It's just that, given the number of viewers, and the number of people roasting me over it, it would be nice if some were there to help instead of rant. Serendipodous 15:38, 11 September 2013 (UTC)
- What form would you like such assistance to take? West.andrew.g (talk) 15:46, 11 September 2013 (UTC)
- Mainly, a second opinion about the validity of various articles I exclude, and maybe a check, rather than a shout, about when my snark goes over the top. Serendipodous 15:48, 11 September 2013 (UTC)
September 2013
Hello. There is currently a discussion at Wikipedia:Administrators' noticeboard/Incidents regarding an issue with which you may have been involved. Thank you. Tariqmudallal (talk) 01:00, 22 September 2013 (UTC)
- I did a CTRL+F on that page for both "STiki" and my user-name, as well as some casual browsing, and could not identify the complaint of interest? I'll just assume my participation isn't of much consequence unless I hear otherwise. Thanks, West.andrew.g (talk) 23:31, 22 September 2013 (UTC)
Dead links page?
I was looking at the list for most revised pages on Wikipedia and found you made 188,938 edits to a single page, West.andrew.g/Dead links ...but it looks like this project was discontinued in January 2013. I'm just curious what this page was for and how you could possibly carry out that many edits, even with the use of a bot.
Thanks for any information you can offer, West.andrew.g. Liz Read! Talk! 16:30, 30 September 2013 (UTC)
- As the page name implies, I was crawling external links to Wikipedia that were added in new revision content (or re-added, as is often the case when a vandal blanks a page and then someone restores the link(s)). I was recording all links that resolved to HTTP 404 or other errors. This was part of a larger study into link spam behaviors whereby I was obtaining and analyzing page content, and I thought this could be a useful service for the community, i.e., "via the 'what links here' functionality one might notice an article had a dead link." This never really panned out. Initially I was adding links one-by-one, later switched the bot into a batch mode, and eventually turned it off altogether. Yes, a self-written bot was making these changes (BAG approval isn't required to do this in one's own userspace). Thanks, West.andrew.g (talk) 16:35, 30 September 2013 (UTC)
- Thanks for the explanation, West.andrew.g...even with a bot, the study only ran a couple of years so it must have been constantly revising that page. Liz Read! Talk! 23:52, 30 September 2013 (UTC)
- It wouldn't be hard to work out the math, but yes, the updates came hard and fast. The larger takeaway here should be the extent of link rot on Wikipedia (and of course, the Internet at large; but casual users can do something about it on wikis). Remember also these were dead links found only in new edits (which include reversions of old content), so it's not as if we were scanning the old dark cobwebs of WP where dead links are probably more common. I remember an older and informal figure that roughly 20% of WP links are dead, so how to best remedy this remains an open question (and archive.org is obviously among the solution contenders). Thanks, West.andrew.g (talk) 00:07, 1 October 2013 (UTC)
- I agree that link rot is a concern. I occasionally glance at an article I worked on just a year or two ago, and I'm surprised at how many links are no longer working. I proposed a solution which would solve the whole problem, I sent it in to Google, but haven't heard back, so fear it got round-filed.--SPhilbrick(Talk) 13:03, 1 October 2013 (UTC)
- Geez, 20%? That's enormous. I usually tag dead links when I find them as I've heard they can still be used as a reference. But I think ultimately, they should all be removed. What's a shame is that news websites, which contain most of this content, don't redirect incoming traffic to the new location of these articles (most often now in archives). Liz Read! Talk! 13:18, 1 October 2013 (UTC)
- Liz (and anyone else) I think you'd be interested in User_talk:Mdennis_(WMF)#WebCite. Best. Biosthmors (talk) pls notify me (i.e. {{U}}) while signing a reply, thx 13:23, 1 October 2013 (UTC)
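For readers curious what "recording all links that resolved to HTTP 404 or other errors" might look like, here is a rough sketch. It is not the bot's actual code: the function names are made up for illustration, treating every 4xx/5xx status (or no response at all) as dead is an assumption, and some servers reject HEAD requests, so a real crawler would likely need a GET fallback.

```python
import urllib.error
import urllib.request


def is_dead(status):
    """Assumed rule: no response, or any 4xx/5xx status, counts as dead."""
    return status is None or status >= 400


def fetch_status(url, timeout=10):
    """Return the HTTP status for url, or None if no response was obtained."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def dead_links(urls, fetch=fetch_status):
    """Return the subset of urls judged dead; fetch is injectable for testing."""
    return [url for url in urls if is_dead(fetch(url))]


# Offline illustration using a fake status table instead of real requests.
fake = {"http://example.org/ok": 200, "http://example.org/gone": 404}
result = dead_links(list(fake) + ["http://example.org/unreachable"], fetch=fake.get)
print(result)
```

Passing the fetch function as a parameter keeps the classification logic testable without touching the network.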
German Wikipedia traffic
Hello West.andrew.g! I'm a big fan of your work on Wikipedia traffic stats (well, who isn't? ;-). Maybe you remember that some time ago I wrote to you a little about German WP article traffic and TV events: the second screen effect. The German journalist who made the second screen/Wikipedia analysis I mentioned has since built a Wikipedia traffic trends website http://wikipedia.trending.eu/de/index.html and he also wrote a blogpost about your traffic analysis.
- Hi Atlasowa, and first off, thanks for your rather thorough and well-sourced message. I do recall the earlier discussion and I am glad to hear that someone is running with these ideas on deWP. I'll preface all subsequent discussion by stating that I don't currently collect statistics on deWP, mainly due to processing time and a limited infrastructure. This fact will make it difficult for me to *independently* perform any of your suggestions below, but this wouldn't prevent some type of data collaboration with interested parties.
BTW, it seems that your Signpost WP traffic report has inspired the Spiegel (Germany's biggest news magazine and online news website) to publish Wikipedia traffic trend reports almost every week since June 2013. See:
- Spiegel Online: Fiktion gegen Realität, 05. Oktober 2013
- Wikipedia-Statistik: Alles über die Merkel-Raute. Samstag, 21. September 2013
- Wikipedia-Statistik: Wissensdurst zu Syrien. Samstag, 31. August 2013
- Wikipedia-Statistik: Die Rückkehr der Monster-Schildkröte. Samstag, 17. August 2013
- Datenlese: Washington Post übernimmt die Wikipedia SPIEGEL ONLINE - Netzwelt - 10.08.2013
- Datenlese: Kate übertrumpft William. Montag, 29. Juli 2013
- Jesus, Hitler, Homöopathie: Darüber streitet die Wikipedia. Sonntag, 21. Juli 2013
- Datenlese: Radprofis, Religion und Reality-Trash SPIEGEL ONLINE - Netzwelt - 14.07.2013
- Datenlese: Lisicki stürmt auch die Wikipedia SPIEGEL ONLINE - Netzwelt - 06.07.2013
- Datenlese: Die Wikipedia-Magneten der Woche SPIEGEL ONLINE - Wissenschaft - 29.06.2013
- Populärste Aufrufe: So hat die Hitzewelle Wikipedia erwischt SPIEGEL ONLINE - Netzwelt - 22.06.2013
(Try Google Translate, that should work OK for German -> English)
- It's my understanding that Der Spiegel is a pretty authoritative German publication. I can't judge how well placed these articles are within that framework, but that is certainly a good quantity of use. I assume the author is using some combination of the "DE trending" and "grok.se" tools in order to do this kind of analysis?
Another story that might interest you is that in September 2012, German Wikipedia put a link to the traffic stats on every single Wikipedia page (including user namespace etc.). It's the link "Abrufstatistik" at the bottom, and it links to the corresponding page views at stat.grok.se. A little later, an SEO presentation appeared online with new tips on how to successfully spam external links on deWP. Actually it's quite hard to get SEO links on deWP, since Wikipedians are very vigilant about spam. And deWP has flagged revisions, so that the added link is only visible to readers if this edit was patrolled by an editor -> edit will likely be reverted, unless the link goes to content that seems legitimate. The SEO how-to therefore recommends using the traffic stats to concentrate the effort on the most effective, popular target article. Work-Flow: identify suitable Wikipedia article with lots of traffic and your topic -> Create content (i.e. translate English WP article parts into German and embellish) and publish on your website -> introduce the information in Wikipedia -> Backlink from Wikipedia.
- I have written much about the usage of popular articles for spam purposes, Google "Link Spamming Wikipedia for Profit" and "Autonomous Prevention of Link Spam in Purely Collaborative Environments" in order to stumble upon some open-access copies of those academic writings. Short story: Without flagged revisions or some type of quarantine, human latency can easily be exploited to an attacker's benefit. Machine-reverts can work to some extent.
Anyway, I recently reread your Wikipedia:Wikipedia_Signpost/2013-02-04/Special_report. You wrote about the power law distribution of article traffic: "The top 25 most viewed pages represent 4% of all total views, and the top 5000 represent 19% of all views. Though the distribution has an extremely long tail, the top 5000 data provides an opportunity to locate popular but poorly written articles that need attention, as opposed to randomly selecting one of the 4.15 million remaining articles on the project." That's a great starting point for quality control. I guess the page view distribution is similar on German Wikipedia? Maybe with less power ;-) ? I wonder if you would be interested in making an analysis of deWP traffic patterns and a comparison of enWP and deWP? I think the TV/second screen effect could also be worth another look to compare deWP/enWP. I suspect the effect is also happening on enWP, only it is diluted very much by the more global audience. What do you think? I could help and translate (also for the German Kurier) if you think this would be something for the Signpost. --Atlasowa (talk) 23:25, 7 October 2013 (UTC)
- Again, I'd be interested to know how the "DE trending" guy is storing and calculating his data. Such distribution plots could take a few lines of code and a couple of minutes if everything has been processed into a nice indexed database. As far as comparison goes, that is something I'd be interested in collaborating on, but I am probably not in a position to take on myself. I am gaining some perspective on these types of matters though, as I am working with WP:MED to look at their articles across multiple languages (granted, this is a much more narrow set than an *entire* language).
- I've taken a new job recently and this has limited my cycles towards wiki* projects. Given that, I am trying to transition away from the low-level coding into a more advisory and data-sharing role so I can continue to best serve the projects. A "Labs^2" project (the link is escaping my hurried searches at the moment) comes to mind, trying to bring together research ideas, young students, and those experienced in consultation and data-sharing. West.andrew.g (talk) 03:55, 9 October 2013 (UTC)
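As a rough illustration of the "few lines of code" remark above, the share of total views captured by the top k pages (the kind of figure quoted from the Signpost report) could be computed as below. The function name and the sample counts are hypothetical; real input would be per-article view counts pulled from an indexed database.

```python
def top_k_share(view_counts, k):
    """Return the fraction of all views captured by the k most-viewed pages."""
    counts = sorted(view_counts, reverse=True)
    total = sum(counts)
    return sum(counts[:k]) / total if total else 0.0


# Hypothetical per-article view counts, for illustration only.
sample = [500, 300, 100, 50, 30, 20]
share = top_k_share(sample, 2)
print(f"top 2 pages: {share:.0%} of views")
```

Running the same function over enWP and deWP counts for several values of k would give a simple side-by-side comparison of how concentrated each project's traffic is.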
How to get article names and article assessments to display side by side?
As I mention there in the last paragraph in italics, how do you get the WP:5000 to display quality assessments adjacent to the names? Thanks. Biosthmors (talk) pls notify me (i.e. {{U}}) while signing a reply, thx 22:27, 9 October 2013 (UTC)
- I am a bit confused about what you are asking. As per Wikipedia:MED1500, it seems the WP:MED folks already know how to do this in some way (or someone is generating the report for you). In my case, there isn't a nice template that does this. When I generate the reports I either (a) make 5000 API queries for article-category memberships and then parse through the categories returned, or (b) iterate through the hugely paginated response for article members of the categories in question, write those to a set, and then check for article membership as needed. I'm happy to share the Java code that does this, but otherwise I have no magic bullet. It is imaginable that the Java code could be easily modified to take in a one-per-line text file of article titles and then spit out some nicely formatted wikitext. West.andrew.g (talk) 19:03, 10 October 2013 (UTC)
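Approach (b) above can be sketched roughly as follows, in Python rather than the Java mentioned. Everything here is illustrative: the function names, the sample category listings, and the output layout are assumptions, and the sketch assumes assessment categories list talk pages (hence the stripped "Talk:" prefix). Fetching the paginated category members from the API is left out; the input is taken to be already collected.

```python
def build_assessment_index(members_by_class):
    """Map article title -> assessment class from per-class category listings.

    members_by_class: dict mapping a class label (e.g. "FA") to the page
    titles found in that class's category (hypothetical, pre-fetched input).
    """
    index = {}
    for assessment, members in members_by_class.items():
        for member in members:
            # Assumption: categories hold talk pages, so drop the prefix.
            title = member[len("Talk:"):] if member.startswith("Talk:") else member
            index[title] = assessment
    return index


def to_wikitext(titles, index):
    """Emit one numbered wikitext line per title with its assessment beside it."""
    return "\n".join(
        f"# [[{title}]] ({index.get(title, 'Unassessed')})" for title in titles
    )


# Hypothetical sample data, for illustration only.
members = {"FA": ["Talk:Influenza"], "B": ["Talk:Asthma"]}
titles = ["Asthma", "Influenza", "Scurvy"]
report = to_wikitext(titles, build_assessment_index(members))
print(report)
```

The titles list could just as well be read from a one-per-line text file, as suggested above.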
Bounce rate stats
Someone over at the Traffic report talk page suggested that bounce rate stats could determine whether website article views are due to mistaken clicks or genuine interest, but I'm not sure what bounce rate stats are. Have you heard of them? Serendipodous 19:58, 12 October 2013 (UTC)
- See Bounce rate ;-) : "Bounce rate is a measure of the effectiveness of a website in encouraging visitors to continue with their visit. It is expressed as a percentage and represents the proportion of visits that end on the first page of the website that the visitor sees." and "As a rule of thumb, a 50 percent bounce rate is average. If you surpass 60 percent, you should be concerned. If you're in excess of 80 percent, you've got a major problem."
- I don't think WMF collects this kind of stats (I hope), but I may be wrong. (They have used tracking cookies since last year and apparently plan to use tracking pixels, see new privacy policy). But there are (global) bounce rates for Wikipedia from third parties (which are not at all reliable):
- Wikipedia.org by Alexa states a bounce rate of 53.90%
- Wikipedia.org by similarweb states a bounce rate of 53.61% (en.wikipedia.org: 52.67%; de.wikipedia.org: 49.63%)
- presentation given by Janette Lehmann at TNETS Satellite, ECCS, Barcelona, September 2013 [2] [3]: we collected 13 months (September 2011 to September 2012) of browsing data from an anonymized sample of approximately 1.3M users. We identified 48 actions such as reading an article, editing, opening an account, donating, visiting a special page. We then built a weighted action network ... We calculated these metrics for the 13 months under consideration and plotted their variations over time. An increase in TotalNodeTraffic means that more users visited Wikipedia. An increase in TotalTrafficRecirculation means that more users performed at least two actions while on Wikipedia, our chosen indicator of high engagement in Wikipedia. We observe that TotalNodeTraffic increased first then became more or less stable. By contrast, TotalTrafficRecirculation mostly decreased, but we see a small peak in January 2011. Two important events happened in our 13-month period. During the donation campaign (November to December 2011) more users visited Wikipedia (higher TotalNodeTraffic value). We speculate that many users became interested in Wikipedia during the campaign. However, because TotalTrafficRecirculation actually decreased for the same period, although more users visited Wikipedia, they did not perform two (or more) actions while visiting Wikipedia; they did not become more engaged with Wikipedia. However, during the SOPA/PIPA protest (January 2012), we see a peak in TotalNodeTraffic and TotalTrafficRecirculation. More users visited Wikipedia and many users became more engaged with Wikipedia; they also read articles, gathered information about the protest, donated money while visiting Wikipedia.
- HTH, take it with a lot of salt! --Atlasowa (talk) 13:04, 13 October 2013 (UTC)
- I think Atlasowa covered things nicely here. However, I'll state that the WMF almost certainly has the capability to calculate such a statistic. Whether they actually take the effort to aggregate the raw logs is a different matter (this could be done via referrer data, which involves a bit more collection, or one could infer IP sessions from simple logs). This is no different than what other sites have, and the WMF may very well have some privacy-aware storage and retention policies, but the capability is there. West.andrew.g (talk) 18:07, 14 October 2013 (UTC)
- Adding to this, site-wide bounce rate is a statistic that the above resources have pinned to a pretty narrow range. What would be more interesting is looking at the click trajectories in the sessions of individual users. What pages are the hot entry points? And which ones tend to be arrived at through wiki browsing and inter-page links? What pages tend to be the terminus of Wikipedia sessions? I would think getting an anonymized sample of this *might* be feasible for research (though if there is no referrer data, one would also need to write a log processor to aggregate session data). Third-party services (e.g., Trend Micro) certainly do this (and share to some extent), but the Wikipedia intersection might be comparatively small. It would be nice to get a representative sample from WP/WMF itself. West.andrew.g (talk) 18:17, 14 October 2013 (UTC)
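The session-level questions raised above (entry points, terminus pages, bounce rate) can be illustrated with a small log processor. This is only a minimal sketch in Python, not anything the WMF or any tool mentioned here is known to run: the record layout `(client, timestamp, page)`, the 30-minute inactivity cutoff, and the function names are all assumptions made for the example.

```python
from collections import Counter, defaultdict

SESSION_GAP = 30 * 60  # assumed inactivity cutoff, in seconds


def build_sessions(records, gap=SESSION_GAP):
    """Group (client, timestamp, page) records into per-client sessions.

    A new session starts whenever a client has been idle longer than `gap`.
    """
    by_client = defaultdict(list)
    for client, ts, page in records:
        by_client[client].append((ts, page))

    sessions = []
    for hits in by_client.values():
        hits.sort()  # order each client's hits by timestamp
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, page in hits[1:]:
            if ts - last_ts > gap:
                sessions.append(current)
                current = []
            current.append(page)
            last_ts = ts
        sessions.append(current)
    return sessions


def summarize(sessions):
    """Return the bounce rate plus counters of entry and terminus pages."""
    entries = Counter(s[0] for s in sessions)
    exits = Counter(s[-1] for s in sessions)
    bounces = sum(1 for s in sessions if len(s) == 1)
    bounce_rate = bounces / len(sessions) if sessions else 0.0
    return bounce_rate, entries, exits
```

Here a "bounce" is taken to mean a session with exactly one page view; a client identifier would have to be an anonymized token rather than a raw IP address for a sample like this to be shareable.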
Just checking in
Update's a bit late this week, making sure everything's OK at your end. Serendipodous 10:52, 20 October 2013 (UTC)
- Done -- Just for record keeping, I'll report that the reports did generate automatically, a few hours later than normal. West.andrew.g (talk) 14:32, 21 October 2013 (UTC)
Thank you!
debugging
Thank you for the "popular redlinks" list and for fixing the "+" bug in it. —rybec 23:10, 4 November 2013 (UTC)
User permissions
Hi Andrew. We are having a discussion on how to implement a local user permission system for a script that is used at WP:AfC. How do you do this with STiki, and is it independent of MediaWiki? Forgive my ignorance, but I am not a programmer. Regards, Kudpung กุดผึ้ง (talk) 05:50, 31 October 2013 (UTC)
- I'll assume we are talking about PHP scripting here? It's not my area of expertise, but I know that is how a lot of things get done around here. I use Java to query the API to (1) get information about user permissions, and (2) calculate edit count. I run an explicit and separate database from WP/MW to store usernames that have explicit permission outside of these checks. However, one could imagine having a heavily protected page (admins only?) which contains usernames of permissioned users, and the script could somehow obtain and parse that. West.andrew.g (talk) 14:02, 31 October 2013 (UTC)
- Andrew, the current AfC helper is a Javascript gadget; folks doing AfC reviews *tend* to use it (although it is also possible to 'manually' bypass the helper-app ... which is of some concern because of recent difficulties with WP:COI and WP:SPIP and in at least one case WP:SPA being the motivation for people signing up to 'help' with the AfC queue-backlog). The idea, as I understand it, of Kudpung's current RfC on this matter is to define a minimum set of secondary-criteria (you can think of it as a minimum-edit-count basically although it will prolly be more complex than that) which will permit a good-sized subset of known-good-egg wikipedians to automatically be added to a database-table ... or even better, as you mention, a regular wikipedia page that is admin-only-locked ... that prevents anybody *not* listed in the whitelist from utilizing the AfC gadget. There is also supposed to be a blacklist, which prevents UIDs && IPs banned-from-AfC-duty (e.g. those involved with abusing AfC power in the past, or those involved with WP:PUPPET scandals, or whatever) from accidentally being added to the whitelist.
- If somebody installs the gadget, their UID-or-IP-or-both is checked against the whitelist and blacklist, presumably using javascript calls to API.php -- what happens next depends on decisions that have yet to be hammered out, but basically if you are on the blacklist you cannot approve AfC submissions, and if you are on the whitelist you can. If you are on neither list, maybe your AfC-review-decision will be submitted with a 'needs further eyeballs' flag set, and then some whitelisted editor can verify correctness. If we store the whitelist-n-blacklist in a wikipedia page, that might make API.php-based lookups more complex... so maybe we should have a backend script, which periodically parses the lists, and stores the info into a SQL table, if that is easier for the javascript-gadget to do the lookups in?
- Note that the way in which the whitelists-and-blacklists are maintained/updated/modified/humanViewed, can be distinct from the way in which the whitelists-and-blacklists are used for read-only-auto-lookup from within the AfC-helper-gadget-javascript. Security concerns: ideally, there would be ways for spam-fighters to automagically detect if somebody was tampering with things in the AfC queue manually... or indeed, creating a new article (or dramatically rewriting an existing article) in mainspace, *without* going through the AfC process at all. Some bots like this exist, but not sure how they work, and not sure they can be integrated with the new AfC-secure-dingus we are discussing here. Also ideally, there may be determined folks that will seek to bypass the whitelist-n-blacklist infrastructure; because the gadget is javascript, run in the untrusted execution environment of the browser, it *will* be possible for a savvy adversary to get the AfC script, modify it to bypass the whitelist, and then submit their "pre-approved" AfC-candidate-reviews. There needs to be a post-submission-verification check, presumably written in PHP, that double-checks to make sure the UID-or-IP which submitted the stuff via the AfC helper-gadget is actually in fact approved. (One could argue that *only* the backend PHP script is necessary... but that would make the AfC process very unfriendly, because users would not get a security-error until the very end, after they had gone through all the review-work. Better to show problems up front, even though this means code-functionality-duplication, and associated complexity risks, methinks.)
- Wikipedia:AFCR, instructions for budding AfC reviewer-candidates (general)
- Wikipedia:AFCH, instructions for budding AfC reviewer-candidates (the helper-gadget)
- Wikipedia:WikiProject_Articles_for_creation/Helper_script, helpdocs for the gadget
- MediaWiki:Gadget-afchelper-beta.js, source code for the gadget
- Hope this helps clarify what Kudpung is asking you. I will also point User:Mabdul and User:Theopolisme here, they are heavily involved in the AfC gadget and can prolly help. 74.192.84.101 (talk) 14:38, 6 November 2013 (UTC)
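The whitelist/blacklist decision discussed in this thread can be sketched as follows. This is a hypothetical illustration in Python, not the AfC helper's or STiki's actual code: the list-page format (one `* Username` bullet per line), the function names, and the three outcome labels are assumptions chosen for the example.

```python
def parse_user_list(page_text):
    """Extract usernames from the text of a protected list page.

    Assumes one '* Username' bullet per line; other lines are ignored.
    """
    users = set()
    for line in page_text.splitlines():
        line = line.strip()
        if line.startswith("*"):
            name = line.lstrip("*").strip()
            if name:
                users.add(name)
    return users


def review_access(user, whitelist, blacklist):
    """Decide what a reviewer may do; the blacklist always wins."""
    if user in blacklist:
        return "denied"
    if user in whitelist:
        return "allowed"
    # On neither list: accept the review but flag it for a second look.
    return "needs_review"
```

Because a browser-side gadget can be modified by its user, the same `review_access` check would also have to run in a trusted place after submission; the client-side call only serves to show the problem to the reviewer up front.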
Are you changing the release day?
I need to know because it will upend my schedule. Serendipodous 17:59, 27 November 2013 (UTC)
- (FYI for talk page stalkers: This is in reference to the changed format at WP:5000). No, this should be the exact same data that came out over the weekend, I am just testing matters relating to formatting. New report generation will continue as always. West.andrew.g (talk) 18:02, 27 November 2013 (UTC)
Red links per language
Hoi, I was told of your popular redlinks. I want to know a few things about the processing that you do. Is this something that can be done for every language? If so, could this become a tool that is available in the labs environment??
FYI this is the kind of information that really shows people what to concentrate on when they want to make a difference in the service we provide.
Thanks, GerardM (talk) 16:30, 28 November 2013 (UTC)
- Yes, this can be done for every language. It is an aggregation of the very large raw statistics data made available by the WMF. At present, the English language redlinks (and popularity statistics like WP:5000) are computed on a personal machine. I do this because I prefer to have a local copy available for efficient processing and research endeavors; and my primary focus is English. This machine would probably choke (storage + bandwidth) if it had to do statistics/redlinks for every language (it's also the backend for WP:STiki). However, I am more than willing to share the source code that does this for English WP. Modifying it for other languages/projects would be trivial. This could be easily run on Labs to produce reports for all languages/projects. I am not overly familiar with the Labs infrastructure, but if someone on that side of things were interested, I am pretty sure we could get something working fairly quickly. West.andrew.g (talk) 19:09, 28 November 2013 (UTC)
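To show why extending the aggregation to other languages is mostly a matter of filtering, here is a minimal sketch in Python. It is not the Java/MySQL code described above. It assumes each raw line has four whitespace-separated fields, `project title count bytes` (for example `en Main_Page 10 1000`); the function names are made up for the example.

```python
from collections import Counter, defaultdict


def aggregate_views(lines, projects=None):
    """Sum per-title view counts, grouped by project code.

    `projects` optionally restricts the work to a set such as {"en", "he"}.
    Malformed lines are skipped rather than treated as errors.
    """
    totals = defaultdict(Counter)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        project, title, count, _size = parts
        if projects is not None and project not in projects:
            continue
        try:
            views = int(count)
        except ValueError:
            continue
        totals[project][title] += views
    return totals


def top_pages(totals, project, n=10):
    """Return the n most viewed (title, views) pairs for one project."""
    return totals[project].most_common(n)
```

Feeding every hourly file for a week through `aggregate_views` and calling `top_pages(totals, "he", 5000)` would give a Hebrew counterpart to WP:5000, storage and bandwidth permitting.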
Improbable redirects at WP:AN
A discussion concerning the creation of improbable redirects, related to a page or pages you created, has started at Wikipedia:Administrators' noticeboard/Archive257#Mass creation of very improbable redirects. Fram (talk) 11:18, 29 November 2013 (UTC)
- Thank you, I have provided two lengthy responses there: (1) I show that \x encoding is a redlink issue (despite non-technical consensus going otherwise). As I have demonstrated several times previously, the stats.grok.se page view tool has issues and is not the ground-truth everyone makes it out to be. (2) I note that I do not support the mass creation of redirects in order to patch the errors of misconfigured external software. West.andrew.g (talk) 18:31, 29 November 2013 (UTC)
- Thanks. I hope that it is clear that, even if I recommend the shutdown of the redlinks page as a possible solution, this is not a complaint about you or your work. The page is not the cause of the sometimes thoughtless creations, and even less of the bot (or whatever) pageviews, but not having the page may contribute in not having the unwanted redirects. We'll see how the discussion goes... Fram (talk) 09:42, 2 December 2013 (UTC)
Your popular pages list contains underscores?
It's interesting to look at your popular pages list, but I'm wondering why there are underscores in the list. It didn't use to be that way. Could this issue be fixed? It felt much easier to read without them. UsefulWikipedia (talk) 04:16, 5 December 2013 (UTC)
- Done - The next generated report (this weekend) will fix this issue. Thanks, West.andrew.g (talk) 16:46, 6 December 2013 (UTC)
Year in review
I'm thinking about doing a year end review for the Signpost. Is it possible to generate a list for all of 2013? Serendipodous 09:52, 2 December 2013 (UTC)
- Yes, I think this should be straightforward. Please ping me again on or just after the New Year date and I will get that generated for you. Thanks, West.andrew.g (talk) 15:17, 2 December 2013 (UTC)
- Needless to say, perhaps, but that report and commentary is likely to get wide coverage!--Milowent • hasspoken 21:49, 9 December 2013 (UTC)
Popular pages
Hey, Andrew,
I was wondering if you had an archive of the popular pages chart...I looked into your subpages listing and I didn't see anything but I thought I'd ask. Also, do you do a year-end type of list, for all of 2013? Or is this something that Wikipedia issues itself? I'm sure there is interest in it. Thanks for all of your work! Liz Read! Talk! 12:45, 9 December 2013 (UTC)
- I don't think an explicit archive is necessary, as the "history" function should serve that need quite well. The bot should be the only editor touching that page, once a week. I do expect an end of year summary to be posted in the Signpost, per the above request for it to be generated. Thanks, West.andrew.g (talk) 13:20, 9 December 2013 (UTC)
- Thanks for the answer, West.andrew.g. I didn't see the above request. I'm looking forward to it! Liz Read! Talk! 01:14, 10 December 2013 (UTC)
Popular pages
Hi West.andrew.g.
I'm an admin at the Hebrew Wikivoyage and I am very interested in creating a list of the most popular articles in the Hebrew Wikipedia (I am hoping such a list would help our small community better decide which articles we should expand ASAP based on their popularity in the Hebrew Wikipedia). I have understood that you are not interested in creating any such lists for other language editions of Wikipedia BUT that you are willing to share your processing code. I am very well interested in trying to run it (I am hoping it won't be too complicated). ויקיג'אנקי (talk) 06:30, 16 December 2013 (UTC)
- Greetings. Shoot me your email address either via Mediawiki or traditional email [4]. I just prepared a package for someone who wanted to do something similar with WP:TOPRED, which shares a code base (I don't think it's well documented enough to throw it open for widespread public consumption just yet). It's written primarily in Java with bits of BASH shell script. It does make use of a MySQL server for persistent statistical storage, so it isn't quite plug-and-play. If you don't know a bit of coding, you might want to see if there is anyone in your community who might be available to help out.
- FWIW, it's not that I am unwilling or "uninterested" in helping people out, it's just that my storage isn't sufficient to help *everyone* out, and the talk page traffic for en.wp WP:5000, WP:TOPRED, and other projects is already keeping me plenty busy. Thanks, West.andrew.g (talk) 03:12, 17 December 2013 (UTC)
What sort of help?
Recently, on the STiki talk page you said, "Between STiki and ClueBotNG, we've got large parts of the problem space covered using some pretty intelligent machinery. That being said, both of these projects are going to appreciate any volunteer assistance they can receive." Apart from using STiki, I am interested to understand both processes better, and maybe contribute more. Do you have anything in mind? --Greenmaven (talk) 01:31, 20 December 2013 (UTC)
- STiki receives simple assistance from folks when they take care of the nightly milestones/barnstars report and do talk page stalking. If someone were to familiarize themselves with my (primarily Java) code base, they could make *significant* contributions via GitHub to handle feature requests and bug reports. I know CBNG has a "dataset review interface" where they seek individuals towards creating a representative corpus for training purposes. I also think WP:CVUA is a good related project. We have lots of individuals whose special access to STiki must be rejected. It would be nice to get them mentors and experience that can fast-track their responsible use of tools like STiki that can efficiently make use of their labor. Thanks, West.andrew.g (talk) 19:48, 20 December 2013 (UTC)
- I thought that by using STiki one was adding to the discriminating ability of ClueBotNG? --Greenmaven (talk) 05:18, 22 December 2013 (UTC)
@Jack Greenmaven: -- It's a more complicated set of interdependencies. The autonomous work of Cluebot feeds into the reputations of the "metadata" algorithm. The cumulative body of human effort (regardless of source queue) is used to retrain that "metadata" algorithm/queue. I have provided this set of human classifications to the CBNG folks (as a one-time dump) so they could use it in a similar fashion as they see appropriate. To what extent this is done, if at all, I am unsure. I know those folks aren't fond of the fact the STiki-classified set is non-representative of its bot workload -- a fair argument -- and why they have sought to create a representative corpus offline. West.andrew.g (talk) 04:46, 24 December 2013 (UTC)
Two cool statistical tables for 2013
Greetings everyone. I've recently posted over at User_talk:West.andrew.g/Popular_pages, but I realize some of my watchlisters might not follow both pages. After much processing has been brought to bear, I've aggregated all the page view statistics for 2013. I thought these would likely be of general interest, and I'd appreciate if others would re-post to relevant discussion pages and forums (on or off wiki; Reddit and some others picked up on my last effort in this vein).
- The 10k most popular pages of 2013 (be patient with load time!)
ARTICLE            | VIEWS
--------------------------------------
[[Main_Page]]      | 3,895,581,597
[[Facebook]]       | 30,608,777
[[Deaths_in_2013]] | 21,246,624
[[Breaking_Bad]]   | 17,389,161
[[Google]]         | 16,759,294
[[World_War_II]]   | 16,676,636
[[Wiki]]           | 16,285,560
[[YouTube]]        | 15,938,076

ARTICLE                  | UTC DATE          | VIEWS     | REASON
----------------------------------------------------------------------
[[Jorge_Bergoglio]]      | March 13, 2013    | 1,460,586 | Papal ascension
[[Shakuntala_Devi]]      | November 4, 2013  | 766,256   | Google Doodle
[[Paul_Walker]]          | December 1, 2013  | 752,770   | Death
[[Grace_Hopper]]         | December 9, 2013  | 621,694   | Google Doodle
[[Nelson_Mandela]]       | December 5, 2013  | 484,966   | Death
[[Jodie_Foster]]         | January 14, 2013  | 451,270   | Came out at Golden Globes
[[Beyonc%C3%A9_Knowles]] | February 4, 2013  | 378,923   | Super bowl halftime
[[Nicolaus_Copernicus]]  | February 19, 2013 | 336,836   | Google Doodle
[[Seth_MacFarlane]]      | February 25, 2013 | 320,999   | Hosted the Oscars
[[Daniel_Day-Lewis]]     | February 25, 2013 | 318,839   | Oscars
[[Society_of_Jesus]]     | March 13, 2013    | 287,568   | Papal ascension
[[Mindy_McCready]]       | February 18, 2013 | 282,679   | Death
[[Hermann_Rorschach]]    | November 8, 2013  | 276,072   | Google Doodle
[[Edith_Head]]           | October 28, 2013  | 263,915   | Google Doodle
[[Raymond_Loewy]]        | November 5, 2013  | 258,301   | Google Doodle
[[Margaret_Thatcher]]    | April 8, 2013     | 252,906   | Death
[[Pope_Francis]]         | March 13, 2013    | 248,753   | Papal ascension
[[Peter_Capaldi]]        | August 4, 2013    | 244,667   | Announced as next Dr. Who
Thanks everyone. West.andrew.g (talk) 17:15, 13 January 2014 (UTC)
5000
Hi there, looks like #360 on your list should be linking to Cancún but the accented character has caused the link to fail. I have seen such boxes before with Czech diacritics, I don't know how to fix it but hoping you may! Thanks, C679 19:22, 3 February 2014 (UTC)
- @Cloudz679: There is no error here. You'll notice that there are many article titles in WP:5000 where accents are correctly handled (e.g., #31, Eugène Viollet-le-Duc). People are actually landing (trying to land?) at a page for Cancun that has incorrectly encoded the special character. This is quite common at WP:TOPRED. The real question here is what bot/hyperlink/etc. is the root cause of people landing at this (incorrect) page. WP:5000 does nothing except aggregate low-level statistics. West.andrew.g (talk) 19:30, 3 February 2014 (UTC)
- Wow, that is quite an issue. I also saw it at WP:CUPSUGGEST last month, strange stuff. Thanks, C679 19:46, 3 February 2014 (UTC)
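The kind of mis-encoded request described above can be reproduced with a short Python sketch. One plausible cause, assumed here purely for illustration, is a client that percent-encodes the title as Latin-1 instead of UTF-8; nothing in this thread establishes that this is what actually happened for Cancún. The helper name `decode_title` is made up for the example.

```python
from urllib.parse import unquote_to_bytes


def decode_title(raw_title):
    """Decode a percent-encoded page title from a request log.

    Returns (title, ok). `ok` is False when the bytes are not valid UTF-8,
    in which case the title is decoded as Latin-1 as a best guess.
    """
    data = unquote_to_bytes(raw_title)
    try:
        return data.decode("utf-8").replace("_", " "), True
    except UnicodeDecodeError:
        return data.decode("latin-1").replace("_", " "), False
```

With this helper, `Canc%C3%BAn` (UTF-8) and `Canc%FAn` (Latin-1) both come back as the same readable title, but only the first is flagged as correctly encoded. A raw-statistics aggregator that does no such decoding simply counts them as two different pages, which is how the second form can surface in a list like WP:5000 or WP:TOPRED.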
WikiAudit can not use on Windows7
WikiAudit can not use on Windows7 9shi (talk) 07:10, 4 February 2014 (UTC)(on zh wiki 9shi)
- Ummm... I am going to need a bit more detail than that. Also, it's not a "double-click" GUI application, it must be run from the command-line. Thanks, West.andrew.g (talk) 01:23, 5 February 2014 (UTC)
Font size in the new version of Stiki
It's great! All us sight-deprived people thank you. Coretheapple (talk) 17:14, 13 February 2014 (UTC)
Thanks
... for fixing the silly barnstar mistake on Flyer22's talk page. Further proof that I need new reading glasses. Widr (talk) 18:43, 17 February 2014 (UTC)
Washington Post
First paragraph. :-) I'd be very interested if you'd like to write another analysis of the trends. Ed [talk] [majestic titan] 23:33, 28 January 2014 (UTC)
- @The ed17: @Jmh649: -- Very exciting to see that it is still getting some attention. As much as I love the page views we (i.e., the Signpost) are getting now, I am also eager to convert this statistical work into a format that can be academically cited. User:Jmh649 and I are currently collaborating on some inter-language statistical stuff for WP:MED. Our intention is to submit our work for journal publication. If that happens, perhaps we could write a summary of our findings as a special Signpost piece (to be released concurrent with the article); that format would also be more ripe for media consumption and Internet linking. West.andrew.g (talk) 01:12, 29 January 2014 (UTC)
- Will get working on it more soon. Doc James (talk · contribs · email) (if I write on your page reply on mine) 02:28, 29 January 2014 (UTC)
- (ping User:Jmh649 too) Sorry to resurrect this, but I remembered the conversation and that I forgot to reply. Whatever you two come out with, I'll be happy to run it! :-) Thanks to you both, Ed [talk] [majestic titan] 01:39, 24 February 2014 (UTC)
User:Fraggle has crossed over 250000!
See at WP:STiki/milestones -Ugog Nizdast (talk) 07:27, 23 March 2014 (UTC)
- Specially handled -- Done -- West.andrew.g (talk) 15:16, 23 March 2014 (UTC)
Duplication of names on the Stiki leaderboard
Hello,
Thanks for the welcome message. I have actually used Stiki before but under old account names. I just noticed on the Stiki leaderboard that those two account names are listed separately: "Gold Standard" and "Athleek123". Could you put these two together (and three once the leaderboard updates my latest uses with my current username)?
Thanks,
TheCascadian 04:08, 25 April 2014 (UTC)
- Done -- @TheCascadian: -- Remapped for a total of 340 edits under the current account. Thanks, West.andrew.g (talk) 04:21, 25 April 2014 (UTC)
Redlinks for wiktionary
Hello! I think that the tool WP:TOPRED might be particularly helpful for wiktionaries... Can you please include data from the wiktionaries in your top list? Is your algorithm for extracting redlinks open source? Can I see it somewhere? Thank you. --Grenadine (talk) 20:27, 29 April 2014 (UTC)
- Unfortunately, for storage reasons I do not calculate statistics outside the main namespace of English Wikipedia. English WP alone has produced several hundred GB of data that I have lying around on external drives. Thus, I cannot entertain the many requests I receive for statistical work on different projects/languages.
- I am willing to share my code. However, be forewarned that I collect statistical data for purposes outside the TOP5000 and TOPRED. If one just wants to produce these reports, my methodology of using nightly aggregation scripts, a backend SQL server, and infinite data retention is a bit overkill for the average user. My code is primarily written in Java, and one would need to be somewhat familiar with that language to make the needed adjustments. If all this sounds like English to you, and you still want the code, please contact me directly via email -- it will take some attention on my part to make sure I've blanked out all the server credentials and other hardcoded sensitive bits. Thanks, West.andrew.g (talk) 21:00, 29 April 2014 (UTC)
Request for comment
Hello there, a proposal regarding pre-adminship review has been raised at Village pump by Anna Frodesiak. Your comments here are very much appreciated. Many thanks. Jim Carter through MediaWiki message delivery (talk) 06:47, 28 May 2014 (UTC)
Top 5000 delayed?
Just checking. :-) Serendipodous 09:43, 1 June 2014 (UTC)
- @Serendipodous: -- Report should come in the next couple of hours. More detailed post mortem to follow. West.andrew.g (talk) 21:06, 2 June 2014 (UTC)
- The reports generated as stated above. My logging cannot find a great explanation for why this happened; no errors or anything. My one thought is that there was a network snag or the WMF was a little slow to post the data. When this failed on day #1, it would have been picked up the next day, but my script got a little confused because we hit a month boundary (May->June). My handling of month switches (critical because it changes the file path) was a little inelegant, and has been improved. Thanks, West.andrew.g (talk) 13:27, 4 June 2014 (UTC)
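The month-boundary problem described above comes from the month appearing in the file path. A minimal Python sketch of one way to avoid it is to derive every path component from a real date object, so a catch-up run that crosses May->June needs no special case. This is not the script discussed above; the directory layout, the file-name pattern, and the names `hourly_paths` and `pending_days` are assumptions made for the example.

```python
from datetime import date, timedelta

BASE = "pagecounts"  # hypothetical root directory of the raw files


def hourly_paths(day, base=BASE):
    """Return the 24 hourly file paths for one day.

    Assumed layout: <base>/<YYYY>/<YYYY-MM>/pagecounts-<YYYYMMDD>-<HH>0000.gz
    """
    prefix = f"{base}/{day:%Y}/{day:%Y-%m}"
    return [
        f"{prefix}/pagecounts-{day:%Y%m%d}-{hour:02d}0000.gz"
        for hour in range(24)
    ]


def pending_days(last_done, today):
    """List the days after `last_done` and strictly before `today`."""
    days = []
    current = last_done + timedelta(days=1)
    while current < today:
        days.append(current)
        current += timedelta(days=1)
    return days
```

If a run on the last day of May fails, `pending_days(date(2014, 5, 30), date(2014, 6, 2))` yields both May 31 and June 1, and `hourly_paths` places each in its own month directory.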
Weird topics in top 5000 list and Stats.Grok.Se
Hello Andrew, in the top 5000 articles list and Stats.grok.se, I'm seeing weird articles I don't expect to see in that list. For example, Le Cordon Bleu College of Culinary Arts Atlanta, Alexandria, Virginia, foods and other topics aren't that popular, they've been placed incorrectly. Is the counter counting pageviews correctly? I searched this issue and it says it includes non-human views. In the Wikipedia article traffic statistics, I'm seeing some glitchy page names with nonsense characters. Why has this been happening for several months, and can this be fixed? It didn't use to be that way. I noticed before 2013 this did not happen.
And also, does your list support articles with colons in their name, such as Call of Duty: Advanced Warfare? I'm sure that topic will be up there. A Great Catholic Person (talk) 06:41, 8 June 2014 (UTC)
- These topics have been covered extensively in the talk pages archives here, as well as at Wikipedia_talk:5000 and Wikipedia_talk:Top_25_Report. You should also read Wikipedia:Wikipedia_Signpost/2013-02-04/Special_report. I am quite confident we are aggregating the numbers accurately, to the extent that the raw data provided to me is collected accurately. We are aware of several bugs in Stats.grok.se and tend to consider our numbers slightly more accurate. Glitchy names are a result of poor encoding on the client side making the request. Nothing is broken. The archives are your friend.
- I will investigate the potential colon issue in greater depth. West.andrew.g (talk) 18:24, 8 June 2014 (UTC)
- It said something about DOS attacks, bots and stuff like that. I'd love to use the archives, I remember seeing December 2010 data, there were some articles that are unusual, but much less than today's. And there were a few funny things I noticed with 12/2010 data. Could whatever is causing that be stopped? A Great Catholic Person (talk) 20:40, 8 June 2014 (UTC)
- Sure... the Internet is a pretty dynamic place. We're doing the best we can do right now with the data provided. Presumably the WMF could go one level deeper by collecting referrer data and source IP addresses, but it is unlikely researchers will get their hands on that, as it would be considered PII. I've poked them to see if some aggregates might be available, but that didn't gain much traction. West.andrew.g (talk) 17:55, 9 June 2014 (UTC)
Alright, I understand that now. I'll check this weekend. By the way, can the stats.grok.se and stats-classic.grok.se (old version) of Wikipedia stats be reverted to December 2010 data, rather than the 2014 data? Visitors are going to wonder what those weird articles and glitches are, for now until the DoS attacks get fixed. I don't mind how outdated it is. I can't ask Henrik because he is not responding. He once said it can't because it needs code changes, but I don't mind a code change one last time. I don't mind outdated data, and I forgot to check 2010 rankings of articles I wanted to check.
- DoS attacks are never going to be "fixed", neither are misconfigured content scrapers, or the unusual effects of the Google Doodle on page view spikes. Those of us who work with these statistics regularly are quite comfortable with this fact. I have no means of contacting Henrik about his tool. I do know that the raw data is available all the way back to 2007. West.andrew.g (talk) 17:33, 11 June 2014 (UTC)
And... is it possible to generate the list of 2013 popular pages, but with the colons? Also, could every report you've generated from October 2012 to now be recreated but with colons? I don't want to wait until 2015 for data without colons, and I want to see what some titles with colons ranked in the top 10,000 last year.
- Is it possible? Yes. Will I be doing it? No. I think many fail to realize how big the hourly page view files are. It takes an hour or two of compute time to handle a single day's statistics. If I had to reprocess a year of reports, it could easily take over 1000 hours of machine time. The value just isn't there for this kind of work. There are a few other "top lists" out there that may not be affected by this error. West.andrew.g (talk) 17:33, 11 June 2014 (UTC)
And, also, because technology is growing, can this start counting pageviews from mobile devices and other machines? A Great Catholic Person (talk) 02:58, 11 June 2014 (UTC)
- I am not privy to that data. I know some reports/charts along these lines have been posted to [5]. West.andrew.g (talk) 17:33, 11 June 2014 (UTC)
Alright, I read all that, but both Henrik and Killiondude (another one who has an FAQ page) are both down. I don't care about how outdated the stats-classic.grok.se is, I want the old December 2010 data back. Plus, Killiondude's FAQ has a link to October 2009 data for Michael Jackson. It won't work. Just one code change is good enough, the design should have been "phased out" already with 2010. The link comes up with 0 pageviews, plus I prefer the old design for viewing data from any time. The top 1000 list date I can take will be from December 2009 (it's fine!) to December 2011. I also forgot to check at least a lot of articles' rankings in the 2010 version, and I want to know them badly. I'm very curious because the classic has 2014 data. I don't need an updated one. The old one is okay. Should I wait until Henrik is back? Because it will take so long. A Great Catholic Person (talk) 22:23, 21 June 2014 (UTC)
- I can't speak to how other services operate, what data they have, and what interface they choose to use. Personally, I'm sorry to say I won't be producing such a report, for reasons discussed above. I don't think Henrik is too active (if at all), so I wouldn't bank on that opportunity. West.andrew.g (talk) 04:20, 22 June 2014 (UTC)
I don't mean generating those reports, I just want the stats classic version's top 1000 list to change from January 2014 to December 2010. A Great Catholic Person (talk) 16:59, 22 June 2014 (UTC)
- I have no knowledge or control over Henrik's tool. I cannot help you. WP:VPT may serve you better. West.andrew.g (talk) 17:56, 24 June 2014 (UTC)
I'm at Wikimania in London
Apologies for being late in posting here (there are about 1.24 days left in the conference), but I am at Wikimania in London. User:Jmh649 gave a presentation about WP:MED and utilized some of the recent statistical work I've done for him (and an academic paper is in progress). Besides that, I've largely been hanging out in the research/analytical/"social machines" tracks. If you're a user/supporter/fan of WP:STiki, WP:5000, or anything else I've done -- I'd love to meet you, so don't hesitate to reach out. West.andrew.g (talk) 15:27, 9 August 2014 (UTC)
After over a month, I have to contact you again!
I know all that you said above, but how about you try to tune your popular pages lists to not include the popular redlinks and non-human pageviews? Also, black out articles (Lycos, Ddd, Alexandria, Virginia, etc. and others being attacked) in your lists, especially for a 2014 top list. I understand, but they are okay to be on other lists. I like seeing what the top articles of the week are, but now they are replaced with BS articles. If this continues, I can't trust your data anymore. A Great Catholic Person (talk) 01:00, 4 August 2014 (UTC)
- We are not re-hashing this. The data is correct, you just want to isolate some of the phenomena within it. West.andrew.g (talk) 15:30, 9 August 2014 (UTC)
Are you down?
I haven't seen the top 5000 or any other of your reports updating since August 14. A Great Catholic Person (talk) 16:15, 17 August 2014 (UTC)
- I'm quite alive: User_talk:West.andrew.g/Popular_pages#Update. Thanks for your concern. West.andrew.g (talk) 02:38, 23 August 2014 (UTC)
- Another Saturday evening it's still not updating. A Great Catholic Person (talk) 15:46, 24 August 2014 (UTC)
- These things are generally better covered at User_talk:West.andrew.g/Popular_pages. Thanks, West.andrew.g (talk) 19:54, 2 September 2014 (UTC)
PSA: On the Non-Reporting of Mobile Views
A significant statistical issue has come to my attention. Quite simply, the WMF does not record/report per-article mobile views, and thus they are unavailable for my aggregation....
The complete write-up is at User_talk:West.andrew.g/Popular_pages#STICKY:_On_the_Non-Reporting_of_Mobile_Views.
Please consolidate all discussion at that location. Thanks, West.andrew.g (talk) 18:42, 4 September 2014 (UTC)
You've got mail!
Message added 16:24, 9 September 2014 (UTC). It may take a few minutes from the time the email is sent for it to show up in your inbox. You can remove this notice at any time by removing the {{You've got mail}} or {{ygm}} template.
Ed [talk] [majestic titan] 16:24, 9 September 2014 (UTC)
- Done -- Acknowledged and responding. West.andrew.g (talk) 02:59, 10 September 2014 (UTC)