User:Visviva/Bash

From Wikipedia, the free encyclopedia

I'm fairly new to Bash, but if these scripts are of any use to you please feel free to use & adapt them.

If you think you can improve anything on this page, please share your ideas either here or on the Talk page.

Uncat.sh

I find that this only processes about 300,000 lines per hour on my desktop machine. It would therefore take about 400 hours to process the entire text of Wikipedia.

#!/bin/bash

#This is a bash script for extracting the titles of uncategorized pages from an EN Wikipedia XML dump.
#This script takes one argument, the name of the file it will process.
#If you know a way to make this script faster, please share.

#Open the input file on file descriptor 3

exec 3< "$1"

in=0
cat=0

#Start

while read -r line <&3; do

#Scan for categories
    if [ "$in" -eq "1" ]
    then
        case $line in
            *\[\[Category:* | *\[\[category:* | *REDIRECT* | *redirect* | *disambig* | *dis}}* | *CC}}* | *Disambig* | *Redirect* )
                in=0;;
        esac
   fi
#Scan for title -- also tells us if the last page is over
    title=$(printf '%s\n' "$line" | grep '<title>')
    if [ -n "$title" ]
    then
       oldtitle=$PAGE_TITLE
       title=$(printf '%s\n' "$title" | sed -e 's@<title>\(.*\)</title>@\1@')
       export PAGE_TITLE=$title
       if [ "$in" -eq "1" ]
       then
               echo "*[[$oldtitle]]"
       fi
       in=1
       case $title in
           *deletion* | *Deletion* )
               in=0;;
       esac
    fi
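done

#The last page has no following <title> line, so report it here if it is still uncategorized
if [ "$in" -eq "1" ]
then
    echo "*[[$PAGE_TITLE]]"
fi

#Close the file descriptor
exec 3<&-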
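
On the speed question raised in the script's comments: the loop above starts new processes (a subshell plus grep, and sometimes sed) for every input line, and that is likely where most of the time goes. The following is a minimal sketch of the same logic using only shell builtins (`case` and parameter expansion) inside the loop. It assumes each `<title>...</title>` element sits on a single line, as in the dumps this script targets. The function name `list_uncategorized` and the variable `pending` are made up for this illustration, and the sketch has not been benchmarked against a real dump. One small behavioral difference: a `<title>` line is never scanned for the exclusion keywords, so a page is no longer dropped just because the next page's title contains "redirect".

```shell
#!/bin/bash
# Sketch only: same filtering rules as Uncat.sh, without grep/sed per line.
# list_uncategorized is a hypothetical helper name; it reads the dump on stdin.

list_uncategorized() {
    local line title="" pending=0
    while read -r line; do
        case $line in
            *"<title>"*"</title>"*)
                # A new page starts: report the previous one if nothing excluded it
                if [ "$pending" -eq 1 ]; then
                    echo "*[[$title]]"
                fi
                # Keep only the text between <title> and </title>
                title=${line#*"<title>"}
                title=${title%%"</title>"*}
                pending=1
                # Skip deletion-related pages, as the original script does
                case $title in
                    *deletion* | *Deletion*) pending=0 ;;
                esac
                ;;
            *"[[Category:"* | *"[[category:"* | *REDIRECT* | *redirect* | *disambig* | *"dis}}"* | *"CC}}"* | *Disambig* | *Redirect*)
                # The current page is categorized, a redirect, or a disambiguation page
                pending=0
                ;;
        esac
    done
    # The final page has no following <title> line, so flush it here
    if [ "$pending" -eq 1 ]; then
        echo "*[[$title]]"
    fi
}

# Run only when a file name is given, so the function can also be sourced
if [ "$#" -gt 0 ]; then
    list_uncategorized < "$1"
fi
```

Saved under a name of your choosing (say `uncat_fast.sh`, a hypothetical file name), it would be called the same way as the original, with the dump file as its only argument and the list of titles written to standard output.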