Categories
Technology

Now that’s Semantic Web(?)

Danny pointed out SemaView’s new calendar-based product, Sherpa, congratulating them on a “…winning application of SemWeb technologies.”

The company is using the iCal RDF Schema to create a Windows-based application to manage and share event information through an interconnected calendaring system. My first reaction when I saw “Windows-based application” was to wince at applying the term semantic web to what sounded like another Groove-like product that just happens to use RDF/XML for the data. Or does it?

According to the developer documentation, though the company’s application generates the RDF/XML data, it’s not hidden in the bowels of an application only accessible through arcane, proprietary rituals or other perversions of openness. (And yes, I’m including web services in this because to me, open means open — wide out there baby, just like this web page is.)

There are web services available, but more importantly to me, being a person who believes that the semantic web is about data rather than applications, the product produces lovely RDF/XML files. Crawlable, open, plain-view, accessible RDF/XML files.
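To make that concrete, here’s a minimal sketch of what such a file might contain. This is my own invented example, not Semaview’s actual output; the namespace and property names follow the W3C iCal RDF vocabulary as best I understand it, and the event details are made up:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ical="http://www.w3.org/2002/12/cal/ical#">
  <!-- one calendar event, sitting in plain view for any RDF-aware bot -->
  <ical:Vevent rdf:about="http://example.com/events/booksale">
    <ical:summary>Library book sale</ical:summary>
    <ical:dtstart>2003-10-04T10:00:00</ical:dtstart>
    <ical:location>Main branch library, St. Louis</ical:location>
  </ical:Vevent>
</rdf:RDF>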

Better, it gets better. Not only does the company produce the RDF/XML, it allows organizations that use the product to register their calendars in a global search directory called SherpaFind. Now you can search for events based on a set of parameters, view the calendar, download it, or best of all, directly access the RDF/XML for the calendar.

This is open. This is data within context, though Tim Berners-Lee hates that word. This is data that’s saying: excuse me little bots, sirs, kind sirs, but this data you’re slurping up isn’t just a mess of words waiting to be globally gulped and spit out in a bizarre search based on weights and links; it’s data that has some meaning to it. This data is calendaring data, and once you know that, you know a lot.

Having said this, though, some of what I read leads me to think this isn’t as open as I thought at first glance. First, if I read this correctly, the Sherpa calendar information is centralized on the Sherpa servers. I’m assuming by this, again with just a first glance, that Semaview is providing the P2P cloud through which all of the clients interact in a manner extremely similar to how Groove works. If this is true, I’ve said it before and will again — any hint of centralization within a distributed application is a point of weakness and vulnerability, the iron mountain hidden within the cloud.

Second, I can’t find the calendar RDF/XML out at the sites that use the product. There are no buttons at these sites that give me the RDF/XML directly. Additionally, trying variations of calendar.rdf isn’t returning anything either. Again, this is a fast preliminary read and I’ll correct my assumptions if I’m wrong — but is the only way to access the RDF/XML calendar information through SherpaFind? How do bots find this data?

Let’s compare Sherpa with that other popular use of RDF/XML: RSS. I generate an RSS 1.0 file that’s updated any time my weblog pages are updated. You can find it using multiple techniques, including searching for index.rdf files, following a link on my page or using RSS autodiscovery. You can find my site in the first place because I ping a central server such as blo.gs. However, most of us find each other because we follow a link from another weblog. If we like what we read, we then subscribe to each other and use aggregators to keep up with updates. The golden gateway in this distributed application is through the links, rather than through an organization’s P2P cloud.

This is almost a pure P2P distributed application, enabled by a common vocabulary (RSS 1.0), serialized using a common syntax (RDF/XML), defined using a common data model (RDF). Since it is dependent on the Internet and DNS, there’s an atom of iron in this cloud, but we can’t all be perfect. The only way to break this connection between the points is to take my site down (micro break), in which case there is no data anyway; or to take the Internet down (macro break).

When you have a centralized cloud, like Groove’s, then you’re dependent on an organization to always and consistently provide this service. For Groove the product to work, Groove the company must continue to exist. If Groove no longer exists and the Groove cloud is no longer being maintained, hundreds, thousands, of connections to each other are lost.

The SemaView site mentions Sherpa Calendar in the context of Napster, as regards its functionality, except that calendaring information is shared rather than music. (We also have to assume the RIAA isn’t out to sue your butt if you use the application.) But Napster is based on the data being stored on the nodes — the end computers, not on the web. (Well, not directly on the wide open Web.) Is it, then, that the calendar data is stored on the individual PCs, only accessible through the Sherpa cloud? If this is so, then ingenious use of RDF/XML or not — this isn’t an application of the Semantic Web. This is just another application of web services.

(Though Tim B-L believes that the Semantic Web is based on functionality such as web services rather than data in context, I don’t agree. And many in the semantic web community wouldn’t, either.)

Without a closer look at how the product works, the documentation only tells me so much, so my estimation of how this product functions overall is somewhat guesswork at the moment. When I have access to the product, I’ll do an update.

Page and comments are archived in the Wayback Machine

Categories
Technology

The Ten Basic Commands of Unix

Copy found at Wayback Machine Archive.

Once upon a time Unix used to be for geeks only — the platform of choice for godlike SysAdmins and obsessed hackers who muttered strange phrases and giggled over inside jokes, as they swigged gallon after gallon of Mountain Dew. Unix neophytes were faced with a blank screen and an uncompromising command line along with dire warnings about what not to do … or else. Extending the basic computer, adding in such esoteric devices as printers or modems, required recompilation of the kernel, ominous-sounding words intimidating enough to send all but the most brave, or foolish, running for the safety of Windows.

Then a strange thing happened: Unix started to get friendlier. First, commercial versions of Linux such as Red Hat came along with easier installation instructions, integrated device support, and lovely graphical desktops (not to mention a host of fun and free games). Open source Unix developers started drinking microbrews and fancy cocktails instead of caffeine and realized that they had to make their software easier to install and well documented in addition to being powerful and freely available. Alternatives to powerhouse commercial applications, such as OpenOffice’s challenge to Microsoft’s Office, minimized the cost of switching to desktop Unix platforms. Finally, that bastion of the Hide the Moving Parts Club, Apple, broke all tradition and built a lovely and sophisticated operating system, Mac OS X, on top of a Unix platform.

Today’s Unix: slicker, safer, smaller, better…but push aside the fancy graphics and built-in functionality and simple installation, and you’re still going to be faced, at one time or another, with a simple command line and dire warnings about what not to do. Before you contemplate drinking the Code Red kool-aid, take a deep breath, relax, and familiarize yourself with the Ten Basic Commands of Unix.

First Command: List the Contents

You have a brand new Unix site to host your weblog. You’re given shell access, which means that you can actually log into the operating system directly, rather than access the site contents through a browser or via FTP. You’ll access the site through SSH, or Secure Shell, because you’ve been told that it’s more secure. To do so, you’ll install an SSH application recommended by your friends, or use one provided by your hosting service. Up to this point, you’re in familiar territory — start an application and provide your username and password. Simple.

However, once you log on to the operating system, you’re faced with a cryptic bit of writing on the left side of the screen, such as “host%” or some variation thereof, with the cursor located just to the right, waiting to reflect whatever you type. At this point, your mouse, which has been your friend and companion, sits idle, useless, because you’re now in the Unix command line interface, and you haven’t the foggiest what to do next.

Your direction at this point depends on what you hope to accomplish, but chances are, you’re going to be interested in knowing what’s installed in the space you’ve just been given. To do this, you use the Unix List directory contents command, ‘ls’ as it’s abbreviated, to list the contents of the current directory. You can issue the command by typing the letters ‘ls’ followed by pressing the Enter key:

host% ls

What results is a listing of all the files and directories located directly in your current location, which is likely to be the topmost directory of your space on the machine. Depending on the host and what you have installed, this listing could include a directory for all CGI applications, cgi-bin. If your site is web-enabled, it could also include web pages, such as an index.html or index.php file, depending on what you’re using for web pages. If you have an email box attached to your account, you might also see a directory labeled “mail”, or another labeled “mbox”.
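For instance, a fresh listing on a new hosted account might look something like this (the directory and file names here are illustrative, not a promise of what you’ll see):

host% ls
cgi-bin  index.html  mail  mbox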

This one simple command is highly useful, but there are parameters you can pass to the list command to see more detailed information. For instance, you can see the owner, permissions, and size of files by passing the -l parameter to the command:

host% ls -l

The results you’ll get back can vary slightly based on the version of Unix, but the following from my forpoets directory is comparable to what you’ll see:

drwxr-xr-x 3 shelleyp shelleyp 4096 Jul 20 18:09 flavours
-rw-r--r-- 1 shelleyp shelleyp 5255 Aug 16 16:28 forpoets.css
-rw-r--r-- 1 shelleyp shelleyp 6064 Aug 10 15:14 index.php
-rw-r--r-- 1 shelleyp shelleyp 1319 Aug 10 15:00 index.rdf
-rw-r--r-- 1 shelleyp shelleyp 789 Aug 10 15:00 index.xml
drwxr-xr-x 10 shelleyp shelleyp 4096 Sep 25 16:21 internet
-rw-r--r-- 1 shelleyp shelleyp 27638 Jul 23 00:06 jaggedrocksml.jpg
drwxr-xr-x 9 shelleyp shelleyp 4096 Sep 25 16:23 linux

In this output, the first set of characters shows the permissions for the files and directories; the owner and group associated with each is ‘shelleyp’; the size is listed after the group name, followed by the date, and so on. If the permission string begins with the character ‘d’, the object is another directory. Easy.

Of course, at this point you might be saying to yourself that I find Unix easy because I’m aware of what the commands are and what all the different parameters mean and do, as well as how to read the results. I’m a geek. I’ve visited the caffeine fountains and drunk deep; I’ve wandered the halls and muttered arcane curses and behold, there is light but not smoke from the tiny little boxes. But how can you, the creative master behind the sagas recorded on the web pages and the color captured in the images and the sounds recorded in the song files, learn these mystical secrets without having to apprentice yourself to the SysAdmin?

That leads us to the second command, whereby you, the seeker, find the Alexandrian Library embedded within the heart of most Unix installations.

Second Command: Seek Knowledge

Cryptic as Unix is, there is an amazing amount of documentation installed within the operating system, accessible if you use the right magic word. Originally, this word was man, for manual pages; more recently it has been joined by info, and most Unix systems provide support for both.

Want to discover what all the parameters are for the list command? Type in the word man, followed by the command name:

host% man ls

What returns is a wealth of information: a more detailed description of the command itself, as well as a listing of optional parameters and how each impacts the behavior of the Unix command. Additionally, documentation for some commands may actually contain examples of how to use the command.

Nice, but what if you don’t know what a command is in the first place? After all, Unix is a rich environment; we can assume that one does more than just list directory contents.

To provide a more introductory approach to Unix, the info command and the associated Info documents for the Unix system provide detailed information about specific commands, and can be used in a manner similar to man:

host% info ls

What follows is less cryptic information about the command, written more in the nature of true user documentation rather than arising from the ‘less is more’ school of annotation. Still, you have to know about the command first to use the system. Or do you?

If you type ‘info’ without a command, you’ll be dropped at the Info system’s top-level node, which provides a listing of commands and utilities and a brief description of each. Pressing the space bar allows you to scroll through this list until you find a utility or built-in Unix command that seems to provide what you need. At this point, you can usually type ‘m’ to enter menu item mode, and then type the command name to get more detailed information. For instance, if I’m looking for a way to list directory contents, scrolling through Info on my server, the first command that seems to match what I want is ‘dir’, not ‘ls’. By typing ‘m’ while still in Info, and then ‘dir’, I find out that ‘dir’ is a shortcut, an alias for a specific use of ‘ls’ that provides certain parameters by default:

`dir’ (also installed as `d’) is equivalent to `ls -C -b’; that is,
by default files are listed in columns, sorted vertically, and special
characters are represented by backslash escape sequences.

Suddenly, Unix doesn’t seem as cryptic or as mysterious as you originally thought. Still, it helps to know some basic commands before diving into it headfirst, and we’ll continue with our basic Commands of Unix by exploring how to traverse directories, next.

Third Command: Move About

Unix systems, as with most operating systems including Windows, are based on a hierarchy of directories descending from some topmost directory represented by a bare slash ‘/’. However, unlike a Windows-like environment where you click the directory name to open it and continue your exploration, in a command line environment you have to traverse the directories via command. The command you use is the Unix ‘Change directory’ command, or ‘cd’.

For instance, if you have a directory called cgi-bin located in your current directory, you can change to this directory by using the following:

host% cd cgi-bin

Typing the ‘ls’ command displays the contents of the cgi-bin directory, if any.

To return to the directory you started from, you can use the ‘..’ value, which tells the cd command to move up one directory:

host% cd ..

You can chain your movement requests to move up several directories with one command by using the slash character between the ‘..’ values. The following moves up two levels in the directory hierarchy:

host% cd ../..

Additionally, you can move down many levels by typing the names of directories you want to traverse, again separated by the slash:

host% cd shelleyp/forpoets/cgi-bin

Of course, you have to be directly in the directory path of a target directory to be able to use these shortcuts; and you have to know where you’re at relative to the target directory. However, what if you want to access a directory directly without messing with relative locations? Let’s say you’re in ‘/home/username/forpoets/cgi-bin’ (assuming your home directory is /home/username) and you want to move to ‘/home/username/web/weblog/logs’. The key to directly accessing a directory no matter where you are is to specify the complete directory path, including the beginning slash:

host% cd /home/username/web/weblog/logs

Once you’ve discovered the power of directory traversal, you’ll go crazy, winging your way among directories, yours and others, exploring your environment, and generally snooping about. At some point, you’ll get lost and wonder where you are. You’re at X. Now, what is X?

Fourth Command: Find yourself

In Unix, to paraphrase Buckaroo Banzai, no matter where you go, there you are. To find your location within the Unix filesystem of your machine, just type in the Unix Print Working Directory command, ‘pwd’:

host% pwd

Your current directory location will be revealed, and then you can continue your search for truth, and for that damn graphic you need for your new page, the one you put somewhere and now can’t remember where.

Of course, to traverse to a directory in order to place a graphic in it, the location of which you’ll then promptly forget, you have to create the directory first.

Fifth Command: Grow your Space

Directories are wondrous things, a way of managing your resources in such a way that you can easily find one JPEG file without having to search through 1000 different items. With this simple hierarchical, labeled system, you can put your images in a directory labeled ‘image’, or put your weblog pages in a directory labeled ‘weblog’, and add an ‘archives’ directory underneath that for archive pages.

You can go mad, insane, with the impulse to organize — organizing your pages by topic, and then by month, and then by date, and then…well, the limits of your creativity will most likely be exhausted before the system’s ability to support your passionate embrace of your own geekness.

Making a new directory is quite simple using the Make Directory command, ‘mkdir’. At the command line, you specify the command followed by the name of the directory:

host% mkdir image

When next you list the contents of the current directory, you’ll now see the new directory, ready for you to traverse and fill with all your bits of wisdom and art. Of course, there is a caveat. This is Unix — there is always a caveat.

Before you can create a directory, or even move a file to an existing directory, you have to own the directory and/or have permissions to write to it. It wouldn’t be feasible, in fact it would be downright rude, if you could create a directory in someone else’s space, or worse, in the core operating system directories.

We’re assuming for the nonce that you’re the owner of your domain, as far as your eye can see (as allowed by the operating system) and that you can create things as needed. But what if you want to magnanimously change the permissions of files or directories to allow others to run applications, access pages, or create their own directories?

Sixth Command: Grant Divine Rights

Earlier when playing around with the ‘ls’ command, we looked at more detailed output from the command that showed a set of permissions for the directory contents. The output looked similar to:

-rw-r--r-- 1 shelleyp shelleyp 789 Aug 10 15:00 index.xml
drwxr-xr-x 10 shelleyp shelleyp 4096 Sep 25 16:21 internet

In the leftmost portion of the output, following the first character, which specifies whether the object is a directory or not, the remaining values specify the permissions for each object listed by owner of the object (the first set of triple characters), the group the owner belongs to (the second set of triples), and basically the world. Each triple permission states whether the person accessing the object has read access, write access, or can execute (run) the object — or all three.

In the first line, I as owner had read and write access to the file, but not execute, because the file was not an executable. Any member of the group I belong to (the same name as my user name in this example, though on most systems this is usually a different name) would have read access only. The same applies to the world, not surprising since this is a web accessible XML file. For the second line, the primary difference is that all three entities — myself, group, and the world — have execute permission for the object, in this case a directory.
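Broken out character by character, the first line’s permission string reads like this (my own annotation, not part of the ls output):

- rw- r-- r--
| |   |   +---- world: read only
| |   +-------- group: read only
| +------------ owner: read and write
+-------------- '-' for a plain file, 'd' for a directory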

What if you want to change this, though? In particular, for weblog use, you’ll most likely need to change permissions for directories to allow weblogging tools to work properly. To change permissions for a file or a directory, you’ll use the Change Mode command, ‘chmod’.

There are actually two ways you can use the chmod command. One uses an octal value to specify the permission for owner, group, and world. For instance, to give the owner all permissions, but only read and execute permission to the group and the world, you would use:

host% chmod 755 somefile

The first value sets the permissions for the owner. In this case, the value of ‘7’ states that the owner has read, write, and execute permission for the object, somefile:

-rwxr-xr-x 1 shelleyp shelleyp 122 Sep 27 17:48 somefile

If I wanted to grant read and write permission, but not execute, to owner, group, and world, I would use ‘chmod 666 somefile’. To grant all permissions to owner, read and write to group, and read only to world, I would use ‘chmod 764 somefile’.

To recap the numbers used in these examples, each digit is the sum of the permissions it grants: read counts as 4, write as 2, and execute as 1:

4 – read only
5 – read and execute only (4 + 1)
6 – read and write only (4 + 2)
7 – read, write, and execute (4 + 2 + 1)

The first number is for the owner, the second for the group, the final for the world.

An approach that’s a bit more explicit and a little less mystical than working with octal values is to use a version of chmod that associates permission with a specific group or member, without having to provide permissions for all three entities. In this case, the plus sign (‘+’) adds a permission, the minus sign (‘-’) removes it. The groups are identified by ‘u’ for user (owner), ‘g’ for group, and ‘o’ for others. To apply a permission to all three, use ‘a’, which is what’s assumed when no entity is specified.

This sounds suspiciously similar to that simple-to-put-together table you bought at the cheap furniture place, but all’s clear when you see an example. To change a file’s permission to read, write, and execute for the owner, read and execute for the group, and execute for the world, use the following:

host% chmod u+rwx,g+rx,o+x somefile

In this example, the owner’s permissions are set first, followed by the permissions for the group and then ‘others’, or the rest of the world.

To remove permission, such as removing write capability for owner, use the following:

host% chmod u-w somefile

Though a bit more complex and less abbreviated than using the octal values, the latter method for chmod is actually more precise and controlled and should be the method you use generally.

(Of course, there’s a lot more to permissions and chmod than explained in this essay, but we’ll leave this for a future Linux for Poets writing.)

Once you’ve created your lovely new directory, and made sure the permissions are set accordingly, the next thing you’ll want to do is fill it up.

Seventh Command: Be fruitful, copy

One way you’ll add content to your directories is to create new files, or to FTP files from another server. However, if you’re in the midst of reorganizing your directories, you’ll most likely be copying files from an existing directory to a new one. The command to copy files is, as you’ve probably guessed by now, Copy, or ‘cp’.

To copy a file from a current directory to another, use the following:

host% cp somefile /home/shelleyp/forpoets

With this, the source file, somefile, is copied to the new destination, in this case the directory at /home/shelleyp/forpoets. Instead of copying the file to another location, you can copy it in the same directory, but under a different name:

host% cp somefile newfile

Now you have two files where before there was one, both with identical content.

You can copy directories as well as files by using optional parameters such as -a, -r, or -R. For the most part, and for most uses, you’ll use -R when you copy a directory. The -R option instructs the operating system to recursively enter the directory, and each directory within it, copying the contents as it goes, and to preserve the nature of certain special files such as symbolic links and device files (though for the most part you shouldn’t have these types of files in your space unless you’ve come over to the geek side of the force):

host% cp -R olddir newdir

The -a option instructs the operating system to copy the files and directories as near as possible to the state of the existing objects, and the -r option is recursive but can fail and hang with special files.

(Before using any of the optional flags with copy, it’s a good idea to use the previously mentioned ‘info’ command to see exactly what each flag does, and does not do.)

When you’re reorganizing your site, copying is a safe approach to take, but eventually you might want to commit to your new structure, and that’s when you make your move. Literally.

Eighth Command: Be Conservative, Commit

Instead of copying files or directories, you can move them using the Unix Move command, abbreviated as ‘mv’.

To move a file to a new location, use the command as follows:

host% mv filename /home/shelleyp/forpoets

Just as with copy, the first parameter in this example is the source object, the second the new destination or new object name — you can rename a file or directory by using the ‘mv’ command with a new name rather than a destination. You can also move a directory, but unlike ‘cp’, you don’t have to specify an optional parameter, or flag, to instruct the command to move all the contents:

host% mv olddir newdirlocation
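Renaming follows the same pattern; the second parameter is simply a new name rather than a destination (the file names here are illustrative):

host% mv oldname.html newname.html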

Up to this point, you’ve created, and you’ve copied, and you’ve moved, and over time you’re going to find your space becoming cluttered, like Aunt Minnie’s old Victorian house filled with dusty lace doilies and oddities like Avon bottles, forming canyons of brightly colored glass for the 20 or so cats wandering about.

It’s then that you realize: something’s got to go.

Ninth Command: Behold, the Destroyer

There is a rite of passage for those who seek to enter geekhood. It’s not being able to sit at a keyboard and counter the efforts of someone trying to crack your system; it’s not being able to create a new user or manage multiple devices. The rite of passage for geek candidates is the following:

host% rm *

Most geeks, at one time or another, have unintentionally typed this simple, innocuous phrase in a location that will cause them some discomfort. It’s through this experience that the geek receives a demonstration of the greatest risk to most Unix systems…ourselves.

The simple ‘rm’ is the Unix Remove command, used to remove a file or directory from the filesystem. It’s essential for keeping a directory free of no-longer-wanted files and directories; without it, eventually you’ll use up all your space and not be able to add new and more exciting material. However, it is also the command that most people use incorrectly at some point, much to their consternation.

To remove a specific file, type ‘rm’ with the filename following:

host% rm filename

To remove an entire directory, use the following, the -r flag signaling to recursively traverse the directories, removing the contents of each:

host% rm -r directoryname

When removing an entire directory, you may be prompted for each item to remove, depending on how rm is set up on your system; this prompt can be suppressed using the -f option, as in:

host% rm -rf directoryname

So far, the use of remove is fairly innocuous, as long as you’re sure you want to remove the file or directory contents. It’s when remove is combined with Unix wildcards that warnings of ‘Ware, there be dragons here’ should be entering your thoughts.

For instance, to remove all JPEG files from a directory, instead of removing each individually, you can use a wildcard:

host% rm *.jpg

This command will remove any file in a directory that has a .jpg extension. Any file. Simple enough, and as long as that’s your intent, no harm.

However, it’s a known axiom that people work on their web sites in the dead of night, when they’re exhausted or have had one too many microbrews. Our minds are befuddled and confused and tired and not very alert. We’re impatient and want to just finish so we can go to bed. So we enter the following to remove all JPEG files from a directory:

host% rm * .jpg

A simple little space, the result of a slight twitch of the thumb, and not seen because we’re tired — but the result is every file in that directory is removed, not just the JPEG files. And the only way to recover is to access your backups, or seek the nearest Unix geek and ask them to please, pretty please, help you recover files you accidentally removed.

And they’ll look at you with a knowing eye and say, “You used rm with a wildcard, didn’t you?”
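One habit that can spare you this rite of passage: let ‘ls’ show you what a wildcard matches before you hand it to rm. A small sketch:

host% ls *.jpg
host% rm *.jpg

If the ls output lists more than you expect, or everything you own, stop and look closely at what you typed before reaching for rm.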

Which leads us to our last Command, and the most important…

Tenth Command: Do Nothing

You can’t hurt anything if you don’t touch it. If you’re unsure of what a command will do, read more about it first, don’t type it and hope for the best. If you’re tired and you’re removing files, wait until you’re more rested. If something isn’t broken, don’t fix it. If your site is running beautifully, don’t tweak it. If you’re trying something new, back your files up first.

Unless you’re a SysAdmin and need to maintain a system, in which case you don’t need this advice anyway, you can’t hurt yourself in Unix unless you do something, so if all else fails, Do Nothing.

The easiest mistake to recover from in Unix is the one that’s not made.

Categories
Connecting Internet

Fool you fool you

I never thought I would get to the point of welcoming emails offering to enlarge either my penis or breasts, to set me up with a single in my area, to show me girls with big boobies, or my friends from Nigeria with wondrous opportunities.

Email after email, alternating:

Thanks!
Details
My Details
Your Details
Your Application
Wicked Screensaver
That Movie
Thank You!

You access your account and you see that you have 10, 50, 125, 600 emails waiting and you think of the notes from friends that might be included, or perhaps an interesting comment or two on your writing, but no, all you get is email after email with:

Thanks!
Details
My Details
Your Details
Your Application
Wicked Screensaver
That Movie
Thank You!

You hunt carefully among the dross but no glimmer of gold; or if it’s there, you can’t see it because your mind numbs from email after email with:

Thanks!
Details
My Details
Your Details
Your Application
Wicked Screensaver
That Movie
Thank You!

You hope, but the messages chant out “Fool you!” “Fool you!” At the end of the day, oddly, you feel more lonely than if you hadn’t received any email at all.

Categories
Web

Putting Hotlinks on Ice

Recovered from the Wayback Machine.

Hotlinks — what a perfect word for the practice of directly linking to a photograph or other high bandwidth item on someone else’s server. Hot with its implication of hot goods and thieves passing in the cybernight. The proper term is “direct linking”, and while more technically accurate, the latter term lacks panache. Hotlinking is a particularly warm subject for me because of my extensive use of photography with my writing.

I’m not really sure what led me to start posting photographs with my essays and other writing. Probably the same impulse that leads me to mix poetry with technology, a combination leading Don Parks to write “They are rather verbose and poetic…” of my Permalinks for Poets essays. Well, get comfortable with your favorite drink, because we’re about to embark on another poetic, verbose, adventure into the mysteries of technology. Most fortunate for you, this one’s a murder mystery because we’re going to put hotlinks on ice.

This is a photograph of me

It was taken some time ago.
At first it seems to be
a smeared
print: blurred lines and grey flecks
blended with the paper;

then, as you scan
it, you see in the left-hand corner
a thing that is like a branch: part of a tree
(balsam or spruce) emerging
and, to the right, halfway up
what ought to be a gentle
slope, a small frame house.

In the background there is a lake,
and beyond that, some low hills.

(The photograph was taken
the day after I drowned.

I am in the lake, in the center
of the picture, just under the surface.

It is difficult to say where
precisely, or to say
how large or small I am:
the effect of water
on light is a distortion

but if you look long enough,
eventually
you will be able to see me.)

Margaret Atwood

 

Hotlinking is the practice of embedding a photograph or other multimedia directly in a web page, but linked to the resource on someone else’s server. The bandwidth bandit gets the benefit of the photograph, but the owner of the photograph has to pay for the bandwidth. If enough photographs or movies or songs are hotlinked, the bandwidth use adds up.

Recently I noticed that several photographs from FOAF, Flocking, and the Semantics of Starlings were being accessed from various other weblogs, including Adam Curry’s weblog. The reason this was happening is that some folks copied part of the essay, including the links to the photographs. The photograph accesses started appearing from one weblog, then another, then another.

The problem was then compounded when each of these sites published RSS that included all their content rather than excerpts — including these same direct links to the photographs. In fact, it was through RSS that photographs appeared in Adam Curry’s online aggregator — along with several very interesting pornography photos.

I’ve had photographs hotlinked in the past and haven’t taken any steps to prevent it because the bandwidth use wasn’t excessive. In addition, some people who are weblogging within a hosted environment don’t have a physical location for photographs, and I’ve hesitated about ‘cutting them off’. Besides, I was flattered when people posted my photographs, being a pushover when it comes to my pics.

However, with this last incident, I knew that not only was my bandwidth being consumed from external links, those who share space and other resources on the weblogging co-op I’m a part of are also losing bandwidth through our shared line. Time to close the door on the links.

To restrict access to images, I’ll need to add some conditions to my existing .htaccess file. If you’ve not worked with .htaccess before, it’s a text file located in your directory that provides special instructions to the web server for files in your directories. In this particular case, the restrictions I’ll add will be dependent on a special module, mod_rewrite, being compiled into your server’s installation of Apache. You’ll need to check with your ISP to see if you have it installed.

(If you have IIS, you’ll use ISAPI filters, instead. See the IIS documentation for specifics.)

Restrictions for image access are made to the top-level .htaccess file shared by all my sites. By putting the restrictions into the top-level file, they’ll be applied to all sub-directories unless specifically overridden.

Three mod_rewrite instructions are used within the .htaccess file:

RewriteEngine On — turns on the rewrite engine
RewriteCond — specifies a condition determining if a rewrite rule is implemented
RewriteRule — the rewrite rule

When the web server accesses the .htaccess file and sees these directives, three things happen: the rewrite engine is turned on, the rewrite conditions are used against the incoming request to see if a match is found, and the rewrite rule is applied.

The rewrite conditions and rules make use of regular expressions to determine if an incoming request matches a specific pattern. I don’t want to get into regular expressions in this essay, but know that regular expressions are basically pattern matching, using special characters to form part of the pattern. The examples later make use of the following regular expression characters, each listed with its specific behavior:

! used to specify non-matching patterns
^ start of line anchor
$ end of line anchor
. match any single character
? zero or one of preceding text
* zero or more of the preceding text
\char Escape character — treat char as text, not special character
(chars) grouping of text

There are other characters, but these are the only ones I’m using — the mod_rewrite Apache documentation describes the entire set.

Within .htaccess I add a line to turn on the rewrite engine, and add my first condition — match a HTTP request from any domain that is not part of the burningbird.net domain:

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http://(.*\.)?burningbird.net/.*$ [NC]

The condition checks the HTTP referrer (HTTP_REFERER) to see if it matches the pattern, in this case anything that is not from burningbird.net. This includes domains other than paths.burningbird.net, rdf.burningbird.net, www.burningbird.net, and burningbird.net directly. The qualifier at the end of the line, [NC], tells the rewrite engine to disregard case.

I’m looking for domains other than my own because I want to apply the rules to the external domains — let my own pass through unchecked. Since I have more than one domain, though, I need to add a line for each domain and modify the file accordingly:

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^http://(.*\.)?burningbird.net/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(.*\.)?forpoets.org/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yasd.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?dynamicearth.com/.*$ [NC]

Once all of the conditions are added to .htaccess, when the web server accesses a file within my directories, the conditions are combined, adding up to a pattern match for any domain other than a variation on burningbird.net, forpoets.org, yasd.com, and dynamicearth.com.

One last pattern needs to be allowed through, unchecked — I need to allow access to the images when the referrer has been stripped, such as with local access or access through a proxy. To do this, I add a line that matches a blank referrer, with no domain or pattern. The file then becomes:

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(.*\.)?burningbird.net/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(.*\.)?forpoets.org/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yasd.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?dynamicearth.com/.*$ [NC]

Once I have the rewrite conditions set, time for the rule. This is where all of this can get interesting, depending on how clever you are, or how devious.

In my .htaccess file, when a referrer from a domain other than one of my own accesses one of my photos, I forbid the request. The rule I use is:

RewriteRule \.(gif|jpg|png)$ - [F]

What this rule says is that any request for a JPG, GIF, or PNG file, coming from a domain that doesn’t match the conditions set earlier, is given no substitution (the ‘-’ character means the URL is left as is). Instead, the [F] qualifier at the end of the line tells the web server to return a Forbidden status, so the browser can’t fetch this particular file.

Depending on the browser accessing the web page that contains the hotlinked photo, rather than the image the page will show either a broken-image symbol or the name of the image file.

Now, my approach just prohibits others from hotlinking to my images. Other people will redirect the image request to another image — perhaps one saying something along the lines of “Excuse me, but you’ve borrowed my bandwidth, and I want it back.” In actuality, people can be particularly clever, and downright mean, with the image redirection.

If this is the approach you want, then you would use a line similar to:

RewriteRule \.(gif|jpg|png)$ http://burningbird.net/baddoodoo.jpg [R,L]

In this case, the image request is redirected to another image, baddoodoo.jpg, and a redirect status is returned (the ‘R’). The ‘L’ qualifier states that this is the last rewrite rule to apply, to prevent an infinite lookup from occurring (accessing that redirected image, triggering the rule, that accesses that image, that triggers…you get the idea). Don’t forget to terminate the rule with the ‘L’ qualifier or you’ll quickly see how your web server deals with runaway processes.
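One caution with the redirect approach: if the browser sends the same external referrer when it requests the replacement image, that request matches the conditions and the rule all over again, and you can still end up looping. A common safeguard, my own addition rather than part of the rules above, is to explicitly exclude the replacement image from the rewrite:

# don't rewrite requests for the replacement image itself
RewriteCond %{REQUEST_URI} !baddoodoo\.jpg$
RewriteRule \.(gif|jpg|png)$ http://burningbird.net/baddoodoo.jpg [R,L]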

(Does anyone smell smoke?)

It’s up to you if you want to forbid the image access, or redirect to another file — note, though, that you shouldn’t assume that people who are hotlinking are doing so maliciously. Most do so because they don’t know there’s anything wrong with it. Including most webloggers.

My complete code is:

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(.*\.)?burningbird.net/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(.*\.)?forpoets.org/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yasd.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?dynamicearth.com/.*$ [NC]

RewriteRule \.(gif|jpg|png)$ - [F]

 

update

Some browsers strip the trailing slash from a request, which can cause access problems, as noted in comments at Burningbird. I’ve modified the .htaccess file to the following to allow for this:

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(.*\.)?burningbird.net(/.*)?$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(.*\.)?forpoets.org(/.*)?$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yasd.com(/.*)?$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?dynamicearth.com(/.*)?$ [NC]
RewriteRule \.(gif|jpg|png)$ - [F]

I tested to ensure this worked using the curl utility, which allows me to access a file and pass in a referrer:

curl -e http://burningbird.net http://weblog.burningbird.net/mm/blank.gif
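To see the rule fire, you can do the reverse: pass a referrer that isn’t in the allowed list and ask for just the response headers. (example.com here is a stand-in for any external domain; if the rules are working, you’ll get back a 403 Forbidden rather than the image.)

curl -e http://example.com/somepage.html -I http://weblog.burningbird.net/mm/blank.gif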

The .htaccess file should now work with browsers that strip the trailing slash.

With the rule now in place, if someone tries to link directly to the following photograph, all they’ll get is a broken link.

boats5.jpg

You can see the effect of me putting hotlinks on ice by accessing this site with the domain mirrorself.com. This is one of my domains, but I haven’t added it to the .htaccess file — yet. If you access the main burningbird weblog page through mirrorself.com, using http://www.mirrorself.com/weblog/, you’ll be able to see the broken link results. Look quickly, though, I’ll be adding mirrorself.com in the next week.

When I mentioned writing this essay, Steve Himmer made a comment that the rules he added to .htaccess didn’t stop Googlebot and other bots from accessing images. To restrict access to images from webbots such as Googlebot, you’ll want to use another file, robots.txt.

My photos are usually placed in a sub-directory called /photos, directly under my main site. To prevent well-behaved webbots such as Googlebot from accessing any photo, add the following to your robots.txt file, located in the top-level directory:

User-agent: *
Disallow: /photos/

This will work with Googlebot, which is a well-behaved bot; it will also work with other well-behaved bots. However, if you’re getting misbehaving ones, then the next step is banning access from specific IPs — but that’s an essay for another day, because it’s past my bedtime, and …to sleep, perchance to dream.

bwboats.jpg

Categories
Web

Weblogging for Poets series published

Recovered from the Wayback Machine.

I just published the final three parts of the Weblog Link Series, on permalinking and archives:

Part 1 – The Impermanence of Permalinks

Part 2 – Re-weaving the Broken Web

Part 3 – Architectural Changes for Friendly Permalinking

Part 4 – Sweeping out the webs