User:Apolo234


Hey there, it's Apolo234. I'm a Spanish student of Software Engineering.

Tools I am working on

I am working right now on a tool to serve as a complement to "What links here". It works similarly to the MassViews Toolforge app, but using backlink counts instead of pageviews.

There are already some tools that report how many links point to a page, but I wanted to add the ability to look up the backlink counts of many articles at once, sort them in descending order, and exclude links that are merely transcluded from templates into the article, which tends to be problematic for some use cases. (Template-only links are excluded by combining the search keywords linksto: and insource:, so a page is only counted if its own wikitext literally contains the link.)

The source code looks like this right now:

import requests
from ratelimit import limits, sleep_and_retry

# Article titles to analyse, one per line (see the Instructions section below).
listoffiles = """"""
myarray = listoffiles.split("\n")
myarrayprocessed = listoffiles.replace(" ", "_").split("\n")
result = {}

# Persistent cache: one file stores the processed titles, the other the counts,
# line by line and in the same order.
with open('cacheforallwikipediabacklinks', 'r') as f:
    cache = f.read().split("\n")
with open('cacheforallwikipediabacklinks2', 'r') as f:
    cache2 = f.read().split("\n")

headers = {'Accept-Encoding': 'gzip',
           'User-Agent': 'Final What Links To Tool/0.0 (https://en.wikipedia.org/wiki/User:Apolo234) Using Requests (Python)'}

# Rate-limit the individual API request (at most one call every three seconds),
# so uncached titles are fetched at a relaxed pace.
@sleep_and_retry
@limits(calls=1, period=3)
def querybacklinks(processedtitle, title):
    api_url = ('https://en.wikipedia.org/w/api.php?action=query&format=json&srlimit=1'
               '&list=search&utf8=1&maxlag=10&srsearch=linksto:' + processedtitle +
               ' insource:"[[' + title + ']]"&srinfo=totalhits&srprop=')
    return requests.get(api_url, headers=headers).json()

def accesswikipedia(myarray, myarray2, mycache, mycache2, counter):
    for id, i in enumerate(myarray):
        if myarray2[id] in mycache:
            # Cache hit: reuse the stored backlink count.
            print("Found " + i + " in cache!")
            print("")
            result[i] = int(mycache2[mycache.index(myarray2[id])])
        else:
            apiresult = querybacklinks(myarray2[id], i)
            print("Result obtained " + str(counter + 1) + "/" + str(len(myarray)) + ": " + str(apiresult))
            print("")
            result[i] = apiresult["query"]["searchinfo"]["totalhits"]
            # Append the new title and its count to the cache files.
            with open('cacheforallwikipediabacklinks', 'a') as f:
                f.write(myarray2[id] + "\n")
            with open('cacheforallwikipediabacklinks2', 'a') as f:
                f.write(str(result[i]) + "\n")
        counter = counter + 1

accesswikipedia(myarray, myarrayprocessed, cache, cache2, 0)

# Sort by backlink count, highest first, and print the final list.
result = dict(sorted(result.items(), key=lambda item: item[1], reverse=True))
for i in result.keys():
    print(str(i) + ": " + str(result[i]))

Instructions

1. You will need to install the modules ratelimit and Requests, preferably in Python 3.10.x (Tested in Python 3.10.4).
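For example, assuming pip is available for your Python installation, both modules can be installed from a terminal with:

pip install requests ratelimit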

2. Copy the source code and paste it into a new Python file, then create two empty files in the same folder as the Python file called 'cacheforallwikipediabacklinks' and 'cacheforallwikipediabacklinks2' (without extensions or quotation marks).
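On Linux or macOS (assuming a standard shell), the two empty cache files can be created with, for example:

touch cacheforallwikipediabacklinks cacheforallwikipediabacklinks2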

3. To input multiple article titles, modify the 'listoffiles' variable like this:

listoffiles = """Pikachu
Raichu
Charmander"""

4. Start the Python script. It should print the current article it is requesting, at a relaxed pace (it purposefully does not fire off requests in a batch, as it is rate limited except when it finds the title in the cache, so as not to overload the Wikipedia servers). It will append the information it obtains to the 'cacheforallwikipediabacklinks' and 'cacheforallwikipediabacklinks2' files you created.
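For example, assuming you saved the script as backlinkcounter.py (a file name chosen here just for illustration), you would start it from that folder with:

python3 backlinkcounter.py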

5. When it finishes, it will print a list of all the titles sorted from most to fewest backlinks, together with the number of links to each one that it found in other articles; it does not count links coming solely from a template.
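With the example 'listoffiles' shown above, each output line will have the form 'Title: count', e.g. 'Pikachu: <backlink count>'; the actual numbers depend on the current state of the wiki.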

Enjoy any uses you can give it!

Patches

If you want the titles in this script's output to be machine-readable (with spaces replaced by underscores, as in page URLs), replace

print(str(i) + ": " + str(result[i]))

with

print(str(i.replace(" ","_")) + ": " + str(result[i]))

Wikilink extracting script excluding templates

It is simple to use: copy the wikitext of an article, put it in a text file, and use the following command:

 grep -oP '\[\K[^]]*'  <name of text file here> | cut -c 2- | sed 's/^\(.\)/\U\1/'| sed -e 's/|.*//'| sed -e 's/#.*//' | sed -e 's/Category:.*//' | sed -e 's/Ttp.*//'| sed -e 's/File:*//' 

It should return the title of every wikilink in the article (it does not include template links, which is the main reason to use this instead of some other, more established tool; it also has problems with some special characters, such as returning 'C' instead of 'C++').
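For instance, if the text file contained only the line below, the command should print 'Pikachu' and 'Raichu', one per line, with the pipe label and the section anchor stripped:

[[Pikachu|the Pokémon]] and [[Raichu#History|Raichu]]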

Counting links between a list of articles

First, obtain the list of links you want to analyze.

Second, go to the Special:Export tool on Wikipedia and paste the links there. It will generate an XML file, which you save on your computer. Use the Wikilink extracting script excluding templates (above) on the XML file and save its output in some other text file; we will use it in the later steps.

Third, save the titles in a text file and use the following script to transform them into a search pattern that will be used as input in the next command:

sed -e :a -e '$!N; s/\n/\$|\^/; ta' <name of text file here> | sed 's/^/\^/' | sed 's/$/\$/' | sed -re 's/[(]/[\\(]/g' | sed -re 's/\[//g' | sed -re 's/\]//g' | sed -re 's/[)]/[\\)]/g' | sed -re 's/\[//g' | sed -re 's/\]//g' | sed 's/"/\\"/g' | sed 's/!/\\!/g' | sed 's/+/\\+/g' | sed 's/*/\\*/g'
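For instance, a text file containing the three titles Pikachu, Raichu and Charmander (one per line) is transformed into the single pattern

^Pikachu$|^Raichu$|^Charmander$

which is an extended regular expression matching exactly those titles when each appears on its own line.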

In case you wish to also get the links from the Main, Main article, See also, Further, Broader, VT and AP templates, use this variant of the extracting command instead (not fully tested):

grep -oP '\[\K[^]]*'  <name of text file here> | cut -c 2- | sed 's/^\(.\)/\U\1/'| sed -e 's/|.*//'| sed -e 's/#.*//' | sed -e 's/Category:.*//' | sed -e 's/Ttp.*//'| sed -e 's/File:*//' && grep -oP '\{\{[Mm]ain article\|[^\}]*' <name of text file here> | sed 's/{{[Mm]ain article|//g' | sed 's/#.*//g' | sed -e 's/|.*//' | sed -e 's/^ //g' | sed -e 's/.*\= *//g' | sed 's/^\(.\)/\U\1/' && grep -oP '\{\{[Ss]ee also\|[^\}]*' <name of text file here> | sed 's/{{[Ss]ee also|//g' | sed 's/#.*//g' | sed -e 's/|/\n/g' | sed -e 's/^ //g' | sed -e 's/.*\= *//g' | sed 's/^\(.\)/\U\1/' && grep -oP '\{\{[Mm]ain\|[^\}]*' <name of text file here> | sed 's/{{[Mm]ain|//g' | sed 's/#.*//g' | sed -e 's/|/\n/g' | sed -e 's/^ //g' | sed -e 's/.*\= *//g' | sed 's/^\(.\)/\U\1/' && grep -oP '\{\{[Ff]urther\|[^\}]*' <name of text file here> | sed 's/{{[Ff]urther|//g' | sed 's/#.*//g' | sed -e 's/|/\n/g' | sed -e 's/^ //g' | sed -e 's/.*\= *//g' | sed 's/^\(.\)/\U\1/' && grep -oP '\{\{[Bb]roader\|[^\}]*' <name of text file here> | sed 's/{{[Bb]roader|//g' | sed 's/#.*//g' | sed -e 's/|/\n/g' | sed -e 's/^ //g' | sed -e 's/.*\= *//g' | sed 's/^\(.\)/\U\1/' && grep -oP '\{\{VT\|[^\}]*' <name of text file here> | sed 's/{{VT|//g' | sed 's/#.*//g' | sed -e 's/|/\n/g' | sed -e 's/^ //g' | sed -e 's/.*\= *//g' | sed 's/^\(.\)/\U\1/' && grep -oP '\{\{AP\|[^\}]*' <name of text file here> | sed 's/{{AP|//g' | sed 's/#.*//g' | sed -e 's/|/\n/g' | sed -e 's/^ //g' | sed -e 's/.*\= *//g' | sed 's/^\(.\)/\U\1/'
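For instance, a line such as {{Main|Pikachu}} in the wikitext should additionally yield 'Pikachu' in the output of this variant.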

Copy the output it gives.

Fourth, use the following command to obtain the list of Wikilinks that are linked by other articles on the list, along with the number of times each one was found linked in the file we generated in the second step:

 grep -EIho "<pattern of titles generated in the third step>" <text file obtained in the second step> | sort | uniq -c | sort -n -r 

Fifth, it should output a list similar to this:

   104 Aaron
    84 Peter
    72 Robert
    72 Paul
    58 Thomas
    48 Steve

You are done!

Now, these scripts have a problem: they do not list the titles that are linked exactly 0 times by the articles in the list. If you need this information, do the following:

First, take the text file with the list of titles you used in the second step (the one you pasted into Special:Export), sort it using the command below, and save the result into another text file:

sort <text file with the list of titles used in the second step> > <name of text file where you output the sorting> 

Then, run this command

grep -EIho <pattern> <text file obtained in the second step> | sort | uniq | diff - <name of text file where you output the sorting> | grep -EIho "> .*" | sed 's/>/0/g' | sed 's/^/      /g' | sort -n -r
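For instance, if 'Pikachu' is on your sorted list of titles but never appears in the file from the second step, diff reports it as '> Pikachu'; the final sed commands turn that into an indented line reading '0 Pikachu', matching the format of the list from the fourth step.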

You should now have every title that is linked exactly 0 times by the articles in the list.

In case the last step gave an error, it is usually due to a special character (such as '!' or '?'). If so, go back to the list of titles you had before transforming them, escape any special non-alphanumeric characters with '\' (for example, 'Is this real\?'), and repeat the process from the third step onwards; it should run smoothly now.