Friday, July 31, 2009

Getting copies of your stuff off the internet using wget

So I think it's a bit iffy putting stuff out there on other people's computers -- like, well, like this is. So what do you do? Well, if it's a page or two -- like a blog with only a few entries -- you can just save the page. But what if it's a whole bunch of pages, like, say, a wiki? What you need is wget.

Wget



Wget is a program you can use to download webpages. But it can also read the pages it downloads and then download the pages that the first page linked to. Then it can download the pages that those pages link to, and so on -- if you see what I mean. The only drawback is that it doesn't have a GUI -- you have to type in all the commands. But once you get the hang of that it's usually easier and quicker than clicking on things and dragging other things.
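To make the idea concrete, here's a little sketch -- the URL and all the option choices are made-up examples, not anything from a real backup. The script just assembles the command and prints it, so you can look it over before actually running the wget line yourself:

```shell
#!/bin/sh
# Assemble a basic recursive-download command and print it for review.
URL="http://example.com/"                # placeholder site, not a real target
OPTS="-r -l 2 --convert-links --wait=3"  # recurse 2 levels, fix links, be polite
CMD="wget $OPTS $URL"
echo "$CMD"
```

Printing first and running second is a handy habit with wget, since a typo in a recursive download can fetch a lot of stuff you didn't want.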

If your wiki has a password you can even log into it first using wget and then download the wiki.

Just for the laugh, here are the directions I wrote when I was looking after a wiki on pbwiki (now pbworks), showing how to download an entire (pretend example, called "*example*") wiki from pbworks:

Backup instructions:



These instructions will create a copy of pretty much the entire wiki which will run on your computer. It puts everything in a folder with the title of the wiki -- it will have hundreds of files in it; scroll down to and click "index.html" and you'll be using your local copy of the wiki.

wget is run from the command prompt (Start -- All Programs -- Accessories -- Command Prompt). You type each command in as a single line and press Enter, and it will do its job without any further input from you. As it's doing its work the command prompt window will tell you what's happening -- this will not be very interesting to watch, but if it fails you'll probably be able to work out what's gone wrong. The paths in the commands below assume that you have installed wget with the defaults from the sourceforge page above. If you're using Cygwin or another form of wget you probably know what to do.

The command to do the downloading will probably take a long time (as in an hour or so) -- this is deliberate, so as not to annoy the computers hosting the wikis. (It mostly takes so long because I can't work out how to fine-tune it to only download the current version of each page and not every single version in its history.)
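For what it's worth, later wget releases (1.14 and up, well after this was written) grew a `--reject-regex` option that could in principle skip the old-revision pages -- assuming the revisions share a recognisable URL pattern. The `rev=` query string below is purely a guess at such a pattern, not pbworks' actual scheme, so check what the revision links really look like first. As above, this sketch only builds and prints the command:

```shell
#!/bin/sh
# Hypothetical: skip any URL matching a guessed revision-history pattern.
# "rev=" is an assumption -- inspect real revision links before relying on it.
REV_PATTERN='rev='
CMD="wget -r --wait=3 --reject-regex=$REV_PATTERN http://example.pbworks.com/"
echo "$CMD"
```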

When wiki does not require a password:

If the wiki doesn't require a password to read, for example http://*example*.pbworks.com, things are easier and it can be downloaded with a single command. This works on the assumption that there are no uploaded documents. If there are, then a Backup page should be created which has a copy of each file, and the URL in the command altered appropriately.

"Program Files\GnuWin32\bin\wget.exe" --no-check-certificate -r --wait=3 --random-wait -e robots=off --reject=php --exclude-directories /session/,/user/ --convert-links --directory-prefix=c: http://*example*.pbworks.com

When wiki requires a password:

(Prerequisite) To get this to work you'll need to set up a student account called "backup", with the password "backup_password", which only has reader permissions for the wiki. I did this to make a clear distinction between other logins and backup passwords, and also so nothing could accidentally happen by doing this with a permission level which is allowed to edit or delete things.

wget will need to be run twice --

1 -- To log in to the wiki as backup and save the appropriate cookies

2 -- To download all the files on this page as well as all the wiki pages which can be accessed through the links on the normal pages and the sidebar. wget is unable to follow the links within the "files and pages" tab on the top right, which is why they've been copied onto this backup page prior to backing up the wiki.

The first command, to get the appropriate cookies, is:

"Program Files\GnuWin32\bin\wget.exe" --no-check-certificate --save-cookies cookies.txt --keep-session-cookies --post-data="return=http://*example*.pbworks.com/FrontPage&u_email=backup&u_password=backup_password&u_remember=checked&wiki=*example*&redir=note&submit_submit=Log in" https://my.pbworks.com/

The command to download the wiki is as follows:

"Program Files\GnuWin32\bin\wget.exe" --no-check-certificate --convert-links --directory-prefix=c: --load-cookies cookies.txt -r --wait=3 --random-wait -e robots=off --reject=php --exclude-directories /session/,/user/ http://*example*.pbworks.com/Backup

Explanation of commands:



The first command will log in to the wiki and save the session cookie on your computer in a file called "cookies.txt". The second command will then download pretty much the entire wiki (including all past revisions -- sorry, I couldn't work out how to stop it doing that). It will take a long time, as it includes the instruction to wait a few seconds between each file request so as not to annoy the nice people hosting our wiki by bashing their server.

The options mean as follows:

--no-check-certificate : don't check the site's security certificates. We know the site is okay, and the certificates it offers cause trouble for wget

--save-cookies cookies.txt --keep-session-cookies : save the cookies in a file called "cookies.txt" and then save that file so it can be used again

--post-data="return=http://*example*.pbworks.com/FrontPage&u_email=backup&u_password=backup_password&u_remember=checked&wiki=*example*&redir=note&submit_submit=Log in" https://my.pbworks.com/

This sends the necessary information to the server to log in. The appropriate information was obtained using the Web Developer add-on for Firefox. The final https://my.pbworks.com/ is the address necessary to complete the logging in.

--convert-links : changes all the links in downloaded files to make them refer to the other downloaded files. If a particular file isn't downloaded the link stays as it is.

--load-cookies cookies.txt : loads the cookies file created previously

-r : downloads recursively, i.e. follows links. wget won't go outside of *example*.pbworks.com, so you don't have to worry about it downloading the whole internet

--wait=3 : waits for 3 seconds between each request. This is considered polite behaviour when doing this type of thing, so as not to annoy the computer hosting the site. You can make the wait delay shorter if you're in a hurry, but it's probably not really that urgent.

--random-wait : makes the wait not be exactly 3 seconds.

-e robots=off : makes wget ignore the robots.txt file, otherwise it will only download a single page.

--reject=php : don't download any php files -- this avoids creating new pages while trying to download the old ones (which shouldn't happen anyway, because the login doesn't have the permissions, but just in case...). Adding xml to the list (--reject=php,xml) would also avoid downloading the RSS feed

--exclude-directories /session/,/user/ : exclude those two directories (mainly because going to session/logout logs you out, at which point the download will stop)

--directory-prefix=c: : this makes all the files download to a folder located at c:. This just makes it easier to transfer between computers -- the default would include all the path bits leading to wget
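Put together, the two runs look like the sketch below. It only builds and prints the commands rather than running them -- the wiki name and credentials are the pretend ones from above, and the post-data is abbreviated to a `<login fields>` placeholder -- so you can eyeball each line before pasting it into the command prompt yourself:

```shell
#!/bin/sh
# Dry-run sketch of the two-step backup: build both commands and print them.
# Nothing is fetched here; the URLs and the cookie-file name are examples.
COOKIES="cookies.txt"
WIKI="http://example.pbworks.com"   # pretend wiki, as in the post
STEP1="wget --no-check-certificate --save-cookies $COOKIES --keep-session-cookies --post-data=<login fields> https://my.pbworks.com/"
STEP2="wget --no-check-certificate --load-cookies $COOKIES -r --wait=3 --random-wait -e robots=off --convert-links $WIKI/Backup"
echo "1: $STEP1"
echo "2: $STEP2"
```

The important thing the sketch shows is the ordering: the cookie file saved by the first command is what lets the second, recursive command see the wiki as a logged-in reader.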

************



Those commands are mostly so complicated because of the logging in -- the information you have to send to log in is a bit tricky to work out and then send. But remember that at the end of this you'll have a complete copy of the wiki on your local disk. You might never need to look at it again, but at least you'll now have the option, and it won't disappear if the company hosting the wiki should happen to.

Thursday, July 30, 2009

#2 Wikis, online applications and suchlike

I've had a look at a good load of the online word processors, but they're not really useful to me as I prefer to write using LaTeX and as a result don't really like WYSIWYG applications. Of the ones I have looked at I think Zoho is by far the best (and you can actually add LaTeX equations to your documents -- but you can't do the whole article).

Actually, I've just looked around and it seems MonkeyTex is an online TeX processor. Unfortunately there are no details attached as to who runs it or anything, so I don't think I'll give it my email, and there's no guest thingy so I can't look at it.

I used Zoho Writer a bit while I was at college and it was useful for collaborative papers -- well, the one I did with one other person. But on the whole I think the risk of just leaving your stuff `out there' in the aether is troublesome... mind you, everything I worked with either had another copy on another computer or a USB thing, or I just wasn't that fussed if it disappeared.


Wikis


Obviously there's Wikipedia, which I'm trying to avoid as much as possible lately -- to the point of editing my default Google search in Firefox to remove anything from Wikipedia. I find Wikipedia is really useful for anything technical and increasingly rubbish for anything after that. I find myself getting more and more dubious about it as time goes by -- the major turning point was when I started habitually checking the talk page of every page I looked at. In the absence of any strong explicit authority there is instead the sneaky authority which is the lowest common denominator of essentially whoever can stick out an argument for the longest. Jason Scott has a few good talks about it somewhere on http://ascii.textfiles.com -- actually, in general his blog is one of the more interesting library-related things I've seen on the internet, and it's not even about libraries but instead about recondite bits of computer history. I think the talks are in the `about' pages or something like that.

As for the general use of wikis in libraries: it's probably a good idea, as long as there is a single source of authority for the wiki. Without a single person or group of people clearly guiding the direction of the project it's likely to stagnate. With a clear goal it can be incredibly useful. My two favourites are the Library Success wiki and the Digital Research Tools wiki. Both of them have a clear goal and provide a useful service.

Tuesday, July 28, 2009

# Wk 1: blogging

I don't really get the appropriate tone of voice for blogging. As the writing is just going off into the aether it's tricky to direct the speech towards an imagined audience. As I already know what I think it's tricky to write "for myself". Ah well... I'll just type away.

Thursday, July 23, 2009

Blogs I like to read.

I don't read a whole lot of individual specific blogs -- when I find a new one that might be interesting I tend to bookmark it with Firefox's `live bookmarks' and then delete it pretty soon after if I don't look at it regularly.

Mostly I like to read already-assembled blog collections or collaborative blogs like MetaFilter or Slashdot. But I also like Yahoo Pipes, as searching there for a subject usually turns up a collection of blogs which someone has already assembled. For example, I was reading one or two GIS blogs until I found a Yahoo Pipe which combines a bunch of them in one convenient feed:

Geospatial Blog Headlines

And I also like these two which collect blogs covering the intersection of library things and technology

All things technical for libraries

and

Libraries and technology

I find these pipes are a much better way of keeping an eye on a load of different blogs without having to wade through the mass of posts. I'm sure at some stage I'll get around to building one of them pipe things myself, but for the moment these work fine. I've also totally forgotten the details I used when I last signed up for a Yahoo account.

First post

This is my first post to my blog, which is being done as part of my library's learning 2.0 course.