Utah Valley University Herbarium

How to Build Your Own Virtual Herbarium

Intro

This is a description of how we built our virtual herbarium. I'll try to explain why we did things the way we did, how we came to certain decisions, and then describe the code we wrote that drives the virtual herbarium.

Static and Dynamic Pages

Websites consist of static and dynamic pages. Static pages look the same to every visitor. They are simply HTML-formatted documents visitors download and view in their web browser (the browser handles the download and display). Once the page is created it is static until it is again manually edited.

Dynamic pages are very different. Each dynamic page is generated on the web server when it is requested. This is necessary because dynamic pages allow each user to select different things to view. For example, a virtual herbarium might have a keyword search form, where a visitor can enter, say, the common name of a plant hoping that you have a picture of that plant. Since you can't anticipate what search term the visitor will enter, you can't pre-create that page (even if you wanted to have a static page created for every search term that might be used you'd never have enough time to create them all; if you want to do it that way stop reading now and get to work).

Dynamic pages are created by a server-side script. These are simple programs built into the web page (along with the HTML formatting) that run on the server. They take input from the visitor (in our example, the keyword search term) and search a database that contains your specimen records. When the proper records are returned the script formats the data in HTML and returns it to the visitor. Writing the script is not easy: you'll need to know how your database information is organized, how the search logic works, and how to return the information to the visitor in a form that both looks good and is functional.
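
To make the idea concrete, here is a minimal sketch of such a script written in classic ASP/VBScript (the platform we use, and the one used for all the examples in this document). The file, table, and field names are made up for the illustration, not taken from our actual schema:

<%
' keyword.asp -- a minimal illustration only; the file, table, and field
' names are hypothetical, not the UVU schema.
Dim conn, rs, term, sql
term = Replace(Request.QueryString("q"), "'", "''")   ' crude quoting, good enough for a sketch

Set conn = Server.CreateObject("ADODB.Connection")
conn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & Server.MapPath("data.mdb")

sql = "SELECT Accession, ScientificName FROM Specimens " & _
      "WHERE CommonName LIKE '%" & term & "%';"
Set rs = conn.Execute(sql)

Response.Write "<ul>"
Do While Not rs.EOF
    Response.Write "<li>" & rs("Accession") & " - " & rs("ScientificName") & "</li>"
    rs.MoveNext
Loop
Response.Write "</ul>"

rs.Close : conn.Close
%>

Every search page on our site is, at bottom, an elaboration of this pattern: read the visitor's input, build an SQL statement, and write the results back as HTML.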

Here we describe the options and choices we made in setting up a database-driven virtual herbarium website in hopes that other herbaria directors can benefit from our experience.

Planning

In the planning stage you'll need to consider several things, and hope you get it right, because it's difficult to change partway through the process:

  1. How do you want visitors to interact with the website?
  2. How will you store your specimen information for easiest entry and return?
  3. What scripting language will you use to make the initial programming job easier and make the scripting understandable to whoever replaces your programmer?
  4. How will your website look?
  5. Once the data and images are on a hard drive, how will you keep the data safe from damage or loss?

For those of you just starting to examine the possibility of your own virtual herbarium, consider the options:

  1. Buy a turn-key virtual herbarium system. There are several companies that offer herbarium management/virtual herbarium software packages. Expensive, but they include programs to convert your data over to the new server and get your virtual herbarium (sans images) on the internet very quickly. They generally have good support, too. Nice if you’re well funded. Examples: KE EMu, BG-BASE, BIOTA.
  2. Get a free management/virtual herbarium system. Index kentuckiensis has a very powerful management package, and they include a virtual herbarium package. More difficult to install, a lot fewer tools, and no handholding. You’ll need a web nerd to help get things running. Another option is the UC Davis management system. They have for download a fairly rudimentary but very functional virtual herbarium. No option for images, though that can be added without too much trouble. BRAHMS and SPECIFY are two very good packages. BRAHMS has a single online server for all users of the package, so adopters needn't put their own server online. Examples: Index kentuckiensis, UC Davis Herbarium Management System, BRAHMS, SPECIFY, NYBG VH.
  3. Get the online data presentation package with Oracle. Oracle has an add-on package for presenting data on the internet. You’d need to modify the code created by the web tools to get it to look different than the built-in templates, but it handles getting the data from the database and to the client automatically.
  4. Do it yourself. This was the option we chose. We had a web programmer interested in dynamic websites, plants of Utah, digital imagery, and geospatial information systems (GIS) programming. I’d had a lot of experience putting up database-driven sites, even a database-driven image website. So I undertook the project. Most of the code was written in about 70 hours of planning, programming, and meeting time. We went with this option because we had the idea that we could put up a virtual herbarium that did many things not available in any other package: rulers on images, the way users navigate around the site, presenting and interpreting GIS data, and especially moving around the sheet images. Have a look around the UVU herbarium. We think we worked it all in rather nicely, and hope you agree.

Let's assume now that you are going to create your own website. Those who will buy a turnkey solution need to speak with the vendors and make comparative lists of pricing, functionality (both in the herbarium and online), ongoing costs, ease of branding and customization, etc., and compare them with the free solutions linked above. Other examples and a more complete list (some no longer in development) are listed here.

Logic

You need to know just what you want your visitors to first see and do when they enter your website. Do they see a keyword search box, a list of families or genera, or do they see a complex search form listing all searchable fields and how they are associated with each other? We've seen sites designed all these ways. Some have a few example search links to familiarize visitors with what they can do with searches. Some sites seem designed by database programmers, and require a thorough understanding of how to combine AND and OR searches to get just the data needed. Some sites are designed to provide a very specific fit to the data on hand, others are programmed generically to allow additional information to be added and searched later.

Once the initial search is entered, what do you want your visitor to see? All the information you have on each specimen, or just a summary with links to the details? Do you want them to see a thumbnail of any images you have, and if you have multiple images for a specimen, which one do you want them to see? When visitors move to a details or image page, do you want to have links there to get back to the original search, or should they re-enter the search term to view the list again?

How much of your data do you want to give your visitor? All you have, in the best research-quality form, or do you want your visitors to just see what you have and then contact you for research-quality information?

There is perhaps no better way to get this information than by looking at online herbaria and making notes on what you liked and didn't like. Your web programmer needs these notes to do the job right. If you don't have these notes, and if your programmer isn't a botanist, you'll likely get a website that appeals primarily to web programmers, not your target audience.

Database

The most important decision to make is how to get your sheet label data over to the web server. In our case, we were already using the University of California Davis Herbarium Management code (http://herbarium.ucdavis.edu/database.html), so we stuck with it. This system was adopted by the UVU Herbarium staff some years back, and has all the label data (some 11,000 records) already in the database. It uses a Microsoft Access front-end/backend configuration, with the front-end .mdb file holding the (static) data entry and printing forms, and the backend .mdb files holding all the (dynamic) label and customer data. Having all label data in a single .mdb file suited us as we could transfer the current version of the data to the server when there have been significant changes.

If your data is already in a database, good. That's the first and most important step. If your database is ready for the web, then your herbarium staff can continue using it as they are, and you can go directly to getting your website built. But many herbarium databases, as they are set up, are not ready for web use and must be moved to a database server. Spreadsheets, or herbarium management programs that store data in either a text file or a proprietary database, must be moved to a server or a file-based database like Access or FoxPro.

Entering information into a database is a time-consuming, error-prone process. Most herbarium management applications have a data table with label information. As the data is entered, some of the fields will come from look-up tables (family, genera, species, state, county, etc.) which will have the correct spelling and formatting. Other data is unique to the sheet (collector, location, descriptions, etc.) and must be typed in by hand, meaning that some of it will be corrupted on entry. Few herbaria have the budget to afford double-entry and cross-checking of the label data to ensure accuracy. But careful planning of the way data is entered can make the process of catching errors a little easier. For example, we can do a search on county and plot the geographic locations on a map. It's very easy to see accessions that are located outside the county and check those to see what went wrong (Index kentuckiensis can do this also). A very good programmer can use the coordinates of the county lines to do the same check. Alphabetized lists of all unique data in a particular field (for example, the names of collectors) make it easier to spot spelling and alternative-name errors.
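
That last check costs almost nothing to build. Here is a sketch of a page that lists every distinct collector name in alphabetical order so misspellings and name variants stand out; the file and field names follow the UC Davis labels table as we use it, so adjust them to your own schema:

<%
' collectors.asp -- illustration only. List every distinct collector name
' alphabetically so misspellings and name variants stand out. The file and
' field names follow the UC Davis labels table as we use it.
Dim conn, rs
Set conn = Server.CreateObject("ADODB.Connection")
conn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & Server.MapPath("labelsdata.mdb")

Set rs = conn.Execute("SELECT DISTINCT Collector FROM [Herbarium Labels] ORDER BY Collector;")
Do While Not rs.EOF
    Response.Write Server.HTMLEncode(rs("Collector") & "") & "<br>"
    rs.MoveNext
Loop
rs.Close : conn.Close
%>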

The real purpose of good planning of the entire process, from database entry to website maintenance, is to make data entry and handling as easy and painless for your staff as possible. Tired people make mistakes, and botanists have a lot more fun working with plants than with a computer. Data entry, sheet scanning, and website maintenance must be made as straightforward, obvious, and responsive to your staff as possible. Otherwise it will be done slowly, and with error. Your staff must be involved in the process of planning how all tasks should be done. Don't let your programmer make these decisions. He won't be doing the work, and he probably doesn't know what it is you do, exactly. But make sure your programmer is versatile, because maybe you aren't sure either, and he'll occasionally have to throw away a week's worth of great programming because your staff discovers a better way of doing it, or you changed your mind. And just so you know, it's okay to throw out programming if it doesn't work; you don't publish a first draft, and neither do good programmers.

Images

Most virtual herbarium packages have a problem with image presentation. It’s a natural result of the fact that a herbarium holds not images but sheet-mounted dried specimens; herbarium management software just doesn't need images to keep track of things. Clearly the presentation of a sheet on a website requires a high-quality image, preferably one of sufficient resolution that it can be examined as closely as you might examine the real sheet. It can be done in one of two ways:

  A. One very high-resolution scan of the entire sheet (giving an image of 20 million pixels [MPix] or higher) of sufficient quality that when viewed in its entirety a great deal of detail can be found.
  B. One image of the entire sheet, with additional selected close-up shots of specific parts of the sheet.

Method A requires a huge file (a minimum 2 MByte file that consumes 60 MB of computer memory when decompressed). Method B requires some logic in the script to send the proper image to the client, and the fervent hope the client will not want to see detail in any part of the image you haven’t photographed.

Method A is much preferred, but has the significant drawback of sending too big a file to the client, and lacking any client-side image manipulation software common to all clients. One operating system or browser might be able to scroll around a very large image, another might not. And a 2MB image file is just too big for modem users.

The solution to using method A is to send only part of the image to the client at a time. Server-side image serving is a major preoccupation of the GIS industry, where 20 to 200 MB files are common. The de facto standard for these files is JPEG2000. This is an image compression format much like the very common JPEG used in all website images, but with a few enhancements: JPEG2000 has a higher compression ratio than JPEG, and it has the ability to decompress only a portion of the image. Unfortunately the open/free software community has not embraced the JPEG2000 format yet, and the only programs for handling these files are commercial. MrSID is the most popular of these programs, and it comes in two parts: the server, which sends a requested part of a large image to the client, and the client-side viewer to scroll and zoom the image sent. They work together, so when you scroll the image the newly revealed part of the image is requested from the server automatically and appears in a few moments. The other JPEG2000 programs work similarly.

We chose not to use this method of image presentation for two reasons: cost and quality. MrSID charges for every image compressed, and if you compress enough GBytes of files you need to buy more GBytes from the makers. We didn’t look into the licensing of any similar programs. The quality of the delivered images was lower than we wanted for primarily one reason: when a digital image is resized it needs to be sharpened. I’ve learned this from long experience creating landscape images for my photo website. The JPEG2000 programs just don’t do any resizing, presumably because they were developed to deliver large maps and aerial photography for further treatment by the client. So we chose to make our own server-side image delivery system. The whole thing is the bit of script shown in the Scripting section below. We have the viewer.asp script calculate the crop window and image size, then deliver that information to the sheet.asp script, which loads the large image file into the server memory, crops and resizes the image, then sharpens it for delivery to the client browser. We can adjust the sharpening parameter to suit our taste, and deliver exactly the image we want. And with a reasonably fast server (the standard is a dual-processor XEON server running at around 3 GHz) a 35 MByte image can be delivered in 4 seconds. Not an unreasonable time, and about double the delivery time of a JPEG2000 image.
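
The crop-window arithmetic in viewer.asp is not complicated. Here is an illustrative sketch of the kind of calculation it performs (not our production code; the zoom scheme, in which each level halves the crop width, and all the numbers are assumptions made for the example):

<%
' Illustrative sketch only -- not our production viewer.asp, just the sort of
' arithmetic it performs. The zoom scheme (each level halves the crop width)
' is an assumption for this example.
Dim imgW, imgH, dispW, dispH, z, cx, cy, cropW, cropH, x, y, url
imgW = 5000 : imgH = 7000      ' pixel size of the full scan (example values)
dispW = 600 : dispH = 800      ' size of the image the browser will display
z = 2                          ' zoom level: 0 shows the whole sheet
cx = 2600 : cy = 3500          ' requested centre of the view, in full-image pixels

cropW = Int(imgW / (2 ^ z))            ' crop width shrinks as the visitor zooms in
cropH = Int(cropW * dispH / dispW)     ' keep the crop in the display window's aspect ratio
x = cx - Int(cropW / 2)
y = cy - Int(cropH / 2)
If x < 0 Then x = 0
If y < 0 Then y = 0
If x + cropW > imgW Then x = imgW - cropW   ' keep the crop box on the sheet
If y + cropH > imgH Then y = imgH - cropH

' These numbers become the query string viewer.asp builds for sheet.asp:
url = "sheet.asp?f=09490s2.jpg&x=" & x & "&y=" & y & "&w=" & cropW & _
      "&h=" & cropH & "&dw=" & dispW & "&dh=" & dispH
Response.Write url
%>

The resulting query string is exactly the form shown in the Scripting section, where sheet.asp picks the numbers up and does the crop, resize, and sharpen.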

Image sharpening is a tricky thing. You must never sharpen an image twice. Sharpening an image means enhancing the contrast around edges by lightening the light side and darkening the dark side. When overdone, or done twice in a row, sharpening will leave a bright halo around every dark object. The effect is not visible in low-contrast images (such as one sees on most herbarium sheets), but very visible around text. You can see this in all our sheet images when you zoom in on the label. We’ve chosen to sharpen the specimen to what we feel is an optimal level, a level that routinely oversharpens the text.

Our method of image processing does have one major drawback: it lacks scalability. If only one client is visiting the site at a time there is no problem; that one client has all the server hardware resources available to process the image. But what happens if 5 clients are visiting and they all request image pages at the same time? Some will have to wait until a CPU is free. We programmed our image code in a way that will not scale beyond the two CPUs available in the server. If we need to host more traffic we'll need more CPUs. Most of the CPU time is spent decompressing the entire image into memory. This is a poor situation to be in. The JPEG2000 applications scale better. Decompressing only the needed portion of the image into memory requires fewer CPU and memory resources, so multiple clients should be able to click simultaneously without degrading the server response time. We hope that the delay isn't too noticeable to our visitors, and when it is we hope they'll find the better image quality is worth the wait.

Scanning

We chose to do large, 42-48 MPixel scans, which result in 0.5 to 3 MB .jpg files, and process the images into presentable magnifications and dimensions on the server.

Making a 44 Mpixel image is no easy trick. There are three methods available:

  1. Flatbed scanning

  2. Large megapixel SLR camera or medium-format digital backs

  3. Large-format scanning back

Flatbed scanners, in our tests, produce very good-looking images. Most scanners have optical systems that compensate for about 0.5 inches of subject depth, have very flat and white illuminating lamps, and can easily produce 280 Mpixel images. But they do have the significant drawback of specimen handling damage. The sheet holding the specimen must be placed against the flatbed scanning surface. Damage is easily confirmed by measuring the amount of specimen detritus left behind on the scan bed. Some attempts have been made to invert the scanner over the sheet, but they have, in our reading of anecdotal reports, been largely unsuccessful in eliminating specimen damage when the scanner is lowered over the sheet. One interesting solution may be in the offing with thin, see-through scanners that are meant to be set atop the sample. However, current scanners (at 8.5 x 11 inches) are too small for full-sheet scans.

There are no SLR cameras (up to 1.5 inch sensor) or medium format backs (up to 2.5 inch sensor) capable of the resolution we wanted. The best we could find were several 22 Mpixel cameras, but even those remarkable cameras introduced aliasing problems we didn't care to solve. Aliasing is the appearance of artifacts when the red, green and blue sensors for a particular pixel are located in different positions on the sensor. When a contrasting line crosses the pixel area, it may be that the blue sensor is on the white side of the line while red and green are on the black side, so the pixel will appear blue instead of grey. Camera manufacturers solve this difficulty by adding an anti-aliasing filter to the surface of the sensor, which has the effect of blurring the image slightly. This blurring is then corrected in-camera with a sharpening step. To get a 44 Mpixel image we'd need to upsize the image, which would result in the sharpening artifacts becoming more visible. These cameras do have the advantage of speed. Since the image is recorded at once, instead of being scanned in, they are much faster in use than either a flatbed or focal-plane scanner.

We chose to use a large-format scanning back. This is used in a camera meant to accommodate 4x5 inch film. We purchased a used monorail bellows camera to hold the Better Light Model 6000E-HS 48 Mpixel scanning back, and an enlarger lens (we didn't use a photographic lens as we had no need of a shutter, and enlarger lenses, like so-called "process lenses", have flat focal planes). This back has a 3.5 inch scanning width that traverses 4 inches of the camera focal plane in about 90 seconds. Since the R, G, and B pixel sensors directly follow each other there are no aliasing problems. The scanning back is controlled by a program on an associated computer, but the files themselves are stored in 16-bit-per-channel format (and are thus 250 MBytes each) on the interface/hard disk that controls the back itself.

The camera is mounted on an enlarger stand above the top opening of a Coloreal eBox light box. This is a metal box with 5000 K full-spectrum fluorescent lights standing vertically in each corner behind diffusers to provide very flat and even lighting on the sheet lying on the bottom of the box. We had to enlarge the top opening a bit to accommodate the 135 mm lens angle of view. At f/16 our calculations show a depth of field of about one inch, enough to accommodate most mounted specimens. Very rarely do we need to bring the focal plane up to keep thicker samples in focus.

[specimen image: 09490s2.jpg]

The scanner is calibrated in a rudimentary fashion by keeping the RGB values of the B, M, and A zones of a Kodak Q13 grey scale within 2 units of each other. We use the default color profile for our scanner and adjust the overall R, G, and B channels to match grey only. By our eyes the colors of the reference sheet match the colors displayed on a Pantone Spyder-calibrated monitor. The reference sheet we use has the Kodak Q13 targets and many focus targets. Before a series of scans the reference sheet is scanned, the color calibration is checked, and the corner and central focus is established (checking focus in the four corners is also a check that the back is completely parallel to the sheet). The Better Light scanner control program includes handy functions to check focus and color calibration. We have seen the exact color of the fluorescent bulbs (and plastic diffusion covers) change slightly over time, so we recheck color calibration every session.

Sheets are fitted into an "L" frame taped to the bottom of the light box for scanning. Rulers are placed in an empty area, and the scan is made. One thing we do need to watch carefully is that the width of the scan exactly matches the width of the sheet; we use the sheet width and image width to calibrate the displayed rulers in viewer.asp. After a day's scanning all the images are processed together. Image processing is as follows:

  1. Scan all images using our accession number as the file name. This is the only data-entry step of the image-acquisition-to-website process.

  2. Move all images from the scanner interface/hard drive to the computer and delete them from the interface.

  3. Rotate all images to the proper orientation and save them as mildly-compressed .jpg files. Note that we do not sharpen or color correct the image during this stage of processing; color correction is done during the scan, and sharpening is done only when the image is being sent to a visitor.

  4. Move all .jpg files to a temporary directory on the web server.

  5. Log into the web server and process the images in admin.asp. This will add the image to the images table (using the filename/accession number to keep images associated with the proper specimen) of the herbarium.mdb database and create the thumbnail and reference images in the proper subfolder. (A simplified sketch of this step follows the list.)

  6. Archive the .jpg files on the scanner computer. We don't archive the original 16-bit-per-channel scans nor the 8-bit .TIFF files generated from them.
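
For readers curious about step 5, here is a much-simplified sketch of the per-file work admin.asp does. It is not our production code: the images-table columns are abbreviated, and the SaveBitmap call is our assumption of how the thumbnail would be written to disk (the other GflAx calls are the same ones used in sheet.asp):

<%
' Simplified sketch of the per-file work in admin.asp (step 5 above). Not
' production code: the images-table columns are abbreviated, and SaveBitmap
' is our assumption for writing the thumbnail to disk.
Dim fname, accession, gfl, conn
fname = "09490s2.jpg"                         ' a file sitting in the temporary directory
accession = Left(fname, 5)                    ' the file name carries the accession number

' Make the thumbnail from the full-sheet scan
Set gfl = Server.CreateObject("GflAx.GflAx")
gfl.LoadBitmap Server.MapPath("temp/" & fname)
gfl.Resize 120, Int(120 * gfl.Height / gfl.Width)
gfl.SaveFormat = 1                            ' JPEG
gfl.SaveJPEGQuality = 75
gfl.SaveBitmap Server.MapPath("thumbs/" & fname)

' Record the image so search.asp and details.asp can find it
Set conn = Server.CreateObject("ADODB.Connection")
conn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & Server.MapPath("herbarium.mdb")
conn.Execute "INSERT INTO images (accession, filename) VALUES ('" & accession & "', '" & fname & "');"
conn.Close
Set gfl = Nothing
%>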

A document describing our procedures can be found on the downloads tab.

GIS

Most virtual herbaria are attempting to incorporate geographic information into their websites. Geographic information has many forms (streets and rivers, political boundaries, underground utility locations, etc.), but we are most interested in climate data. This is data collected by the US Government (the National Climatic Data Center, a branch of NOAA) and provided to the public as shapefiles for a very low price.

Geospatial Information Systems (GIS) comprise a data structure (in the form of a shapefile) and software to manipulate and display the data. Shapefiles were defined by the ESRI company and come in many varieties (point, line, polygon, etc.), but climate data use the polygon format almost exclusively. The shapefiles take the form of ranges of temperature, precipitation, dates of first freeze or last thaw, wind, sunlight, etc., organized into shapes, the perimeters of which are described as entries in the shapefile. Here is an example of the data and maps that are found in a shapefile:

Range     n (%)
<32F      (0.0%)
32-40F    (0.0%)
40-45F    (0.0%)
45-50F    1 (7.1%)
50-55F    2 (14.3%)
55-60F    2 (14.3%)
60-65F    7 (50.0%)

(The "Relative" column of the original table is a percentage bar, drawn on the website by bar.asp.)


Each colored zone represents the annual average precipitation collected over a period of 30 years by the NCDC. The maps generated from GIS data are typically presented in an interactive window on the website, where users can zoom in and pan around the map. This method of presentation has the immediate advantage of making the site more interesting to the tech-savvy user, but has the drawback of requiring installation of viewing software on the client computer, and presenting maps at a somewhat slow pace. The temperature or precipitation values for a point on the map can be read by clicking that point on the map after selecting an 'info' button.

We have chosen a very different approach. Since we are working with collections of species, genera and families, to get useful information from the GIS data we must collect and average the GIS values for many locations. Using the typical GIS web display method, a user would take a great deal of time getting every value off the map. The reason this is so is the unfortunate fact that shapefiles are built for one-way data usage. Shapefiles are very fast when used to draw a map from many polygons. But they are notoriously slow at finding the value of a polygon from a single location. The only way to use the file is to go through every polygon and test to see if the point is in that particular polygon. On a fast computer this typically takes 3 to 9 seconds. Multiply that by the number of specimens in a search and you could be waiting a minute or so for the average and range.

So we looked for a rapid method of returning this information. We found that using the maps themselves is the fastest way of determining the values from a dataset. The method is simple: make a map from the GIS data of sufficient resolution that not too many errors are generated when detailed shapes are presented in a somewhat compressed map. We don't think the errors are any worse than the errors introduced by NCDC in interpolating data over areas not directly measured, then placing these values into ranges and representing those areas as (relatively) low-vertex polygons. To generate our data we simply open the image, then measure the color of the pixel representing the location of the specimen in question. We can examine hundreds of locations in a second, making these maps fast enough for web delivery without undue tax on the server hardware.
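
Here is a sketch of the coordinate arithmetic involved. The map bounds, the image size, and the pixel-read helper named in the comments are illustrative assumptions, not our production code:

<%
' Illustrative sketch of the map-pixel method. The map bounds, image size,
' and the PixelColorAt/LookupRange helpers are assumptions for the example;
' how you read a single pixel depends on your image component.
Dim mapW, mapH, west, east, north, south, lat, lon, px, py
mapW = 600  : mapH = 660            ' pixel size of the pre-rendered Utah layer map
west = -114.05 : east = -109.04     ' geographic bounds of that map (decimal degrees)
north = 42.0   : south = 37.0

lat = 40.3 : lon = -111.7           ' collection location of one accession

' Simple linear mapping of lat/long onto the image grid
px = Int((lon - west) / (east - west) * mapW)
py = Int((north - lat) / (north - south) * mapH)

' The color at (px, py) would then be looked up in a color-to-range table
' built from the map legend, e.g.  value = LookupRange(PixelColorAt(px, py))
Response.Write "pixel coordinates: " & px & ", " & py
%>

Reading a pixel is a constant-time operation, which is why hundreds of accessions can be evaluated in a second once the map is in memory.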

We have a second approach to the problem. We are currently extracting value information from the shapefiles into a separate database. We will in the future use this data to generate monthly temperature and precipitation profiles for every specimen for which we have a location.

The Future

GIS analysis can be a very informative process. When supplied with data of sufficiently high quality, a good GIS analyst can produce answers to questions like

This isn't a comprehensive list by any means, but wouldn't you like to know the answers to some of these questions?

For GIS analysis like this to be truly useful, we need to create websites that collect data from many sources and "expose" the data in a way that lets even the lay user see what the answer might be. Here is where we think the future might go:

All natural collections (plants, insects, animals) need to get their data into a database, and to georeference (determine the latitude and longitude of, for example) all collected items. This can be a daunting task for collections numbering in the millions. Small collections, like ours, can be leaders in this area, because our problem is so much more tractable. It is possible to have all small collections (25,000 specimens or fewer) completely georeferenced and in a database by 2010. Having the collections imaged is not so important for the sort of analysis described in this section. There is a movement among herbaria managers to have this done by 2020 for even the largest herbarium.

All databased collections must combine their data into a single network. Currently the DiGIR protocol is emerging as the front runner in the US, but other data-sharing protocols exist and are favored on other continents. The DiGIR protocol is a method of XML data sharing using definitions common to the sort of data being shared. For herbaria the definition is the DarwinCore v1.5. The DiGIR protocol seems to be most actively developed currently by the GBIF organization. In our estimation, the closest thing we now have to a data-sharing standard for herbaria is the free GBIF package, which can 'put online' a database in an almost turnkey installation. But other protocols may prove to be more efficient in the future, especially for GIS analysis.

There has been almost no development of GIS analysis of population data. The UVU Herbarium has more online GIS analysis than any other we've found, and even our work is rudimentary. The forefront of GIS analysis for natural collections is found in the DIVA-GIS application, but it has no online component. DIVA-GIS allows the user to easily make biodiversity maps from any amount of data, to make climate niche predictions (we use the DIVA-GIS program to make our online predictions), and to find climate outliers in a population as a means of identifying errors in the data. As yet, however, there are no applications developed for natural scientists for doing GIS analysis on multiple species. The current method is to use a program like ArcGIS to formulate and program the analysis yourself. Unfortunately, the learning curve to do this is quite steep, and the time required makes even one analysis a difficult project, and impossible to do online. During the last years of our web development funding (2005-2007), our web programmer, Dr. Bruce Wilson, spent most of his effort developing tools for online GIS analysis.

While a small herbarium can share its data, and even provide an online herbarium site without too much difficulty, until there is an easy-to-install GIS analysis web package it is rather difficult to provide online GIS analysis. The development of national and regional websites will bring resources enough to make GIS analysis possible. Currently there are 16 regions in the US (southeast, southwest, long-term ecological, northwest, Texas & Oklahoma, California, great plains, north central, rockies, and the intermountain west) that either have put up regional websites or plan to. We think a likely outcome for these sites is to, with the help of GBIF, use data shared over DiGIR to do advanced GIS analysis to expose the data to the sort of questions mentioned above, especially those questions involving multiple species. For example, a website user might be able to 'save' search results on the web server, then compare the results of two different searches in several ways (displaying coincidence maps, calculating overlap figures, perhaps even looking at distributions over time, and determining how much bias exists in the data for the calculation of uncertainties). The availability of as much data as exists is necessary to make the analysis meaningful, and data sharing is as essential to this project as the GIS tools used for the analysis.

This may not be the actual future, but it's probably ours.

Site Layout

Here is a map of how visitors move around the site, starting with the Online Specimen Catalog:

Let's run through the pages we need to run the virtual herbarium, and what functionality each page must include:

  1. default.asp
    This is the home page, which presents both the search forms: simple and advanced. The simple search form has a single box that is searched against geographic names, family and scientific names, collector names, etc. The advanced search form also has coordinate searching, elevation, locality and ecology searches, etc. The last part of the default page is a list of all families in the database. If a visitor clicks on the name of a family then a secondary list of the genera of that family is presented. AdvSearch.asp is a simple HTML form with all the combined search terms shown.
  2. search.asp
    Receives the information from the search forms and presents a list of all accessions that match. The list is paginated, and can be sorted on a few fields. The columns presented can be altered. At the bottom of the search page is a selectable map of Utah with the locations of all accessions with coordinates shown. The information shown on this map (Utah towns and major roads, for example) can be selected by the visitor. For each accession line of the table, if a sheet image is available a thumbnail of that image is shown. Clicking it takes the visitor to viewer.asp. Clicking a "full details" link takes the visitor to details.asp. This page also returns XML data in two forms, one resembling the UC Davis Herbarium Labels table format, the other in Darwin Core v1.3 format.
     
  3. details.asp
    Lists all herbarium label information present in the database, as well as images, a location map, nearby accessions and named land features, and links to other online resources, all presented in a tabbed format. Linked thumbnail images of that accession are shown, as well as any images of other accessions of the same species. Next comes the map of Utah showing the accession location, a link to the Utah State University plant distribution in Utah for that species, and a series of links to other herbaria where images and information might be found. Then a drop-down box is presented listing all accessions collected near the current accession, along with a short list of all named geographic and geological features found in the USGS GNIS database.
     
  4. viewer.asp
    This is for viewing the full sheet image, or other images that may be available. There are three parts to this page: the main image window (selectable size and zoom level); the pan window, which shows the entire sheet and indicates what part of the sheet is seen in the main window; and a set of zoom-level controls to assist the visitor. The main image also has horizontal and vertical centimeter rulers scaled to the current zoom level. Both the main image and the pan window are clickable. Clicking the main window centers the image on the click at a zoom level one step closer than the current level. Clicking the pan window centers the main window image at that point at the zoom level indicated.

  5. stats.asp
    This page presents combined GIS (Geographic Information Systems) statistics. GIS information (rainfall, ecological communities, altitude, etc.) is shown with a short summary of how many accessions are associated with each region shown on the GIS map, all in a tabbed page format. We've taken the unusual route of using the image representing the GIS layer, not the layer data itself, to determine the GIS layer attribute (for example, inches of precipitation or date of last thaw) by simply reading the pixel color at the place the specimen was collected. This method very likely introduces errors of several types:
    1. mis-determined coordinates of the pixel representing the specimen (due to poor conversion of lat/long to x/y coordinates)
    2. poorly-rendered attribute areas in the relatively small image
    3. the way we handled pixels that fell on a boundary line (we searched nearby pixels looking for the first non-border pixel color)
    4. errors in the projection used to create the Utah map (we assume the Mercator projection is linear, but it may not be at the pixel level).

    We have not examined the cumulative error on the statistical distributions, but can argue that the cumulative errors of "binned" attributes (for example, precipitation given in some rather broad range categories rather than measured or estimated values, or the method used to determine predominant plant communities) and errors in the estimation of layer attributes are probably larger.
     

  6. admin.asp (not shown on the diagram above because it stands outside normal usage)
    This page handles administrative functions, things only a herbarium or website administrator would do. Primarily, they are
    1. adding images to the website  (and the creation of the thumbnail- and reference-sized images)
    2. deleting images from the images table and directory
    3. deleting comments from the comments table
    4. turning on the private function, which lets us (and only us) view accessions marked as "private", which are accessions used for internal or pre-publication study, or images of accessions we borrowed from other herbaria.
    Admin.asp provides editing only of those tables in herbarium.mdb, which resides only and always on the server. Adding or editing accessions, like checking the private or endangered boxes, can only be done in the labelsdata.mdb file itself, via the UC Davis management forms; that file is then FTP'd to the server. To get into admin.asp you must have a server username and password (using Windows integrated security login).

Auxiliary pages serve image content. Note that while these are .asp pages, which typically return HTML-formatted text (using the text/html header), they can also return any other content needed. Here is a list of the auxiliary pages and what they return:

  1. bar.asp
    This page returns a 1 x 1 pixel .gif image of the color specified in the query string, which is itself the color of a particular pixel on the GIS map image. The 1x1 gif image is used in the percentage bars in the tables on the stats.asp page. Called only by stats.asp.
     
  2. sheet.asp
    This page returns a .jpg image of the original full-sheet .jpg image cropped, resized, and selectively sharpened. Sharpening increases with the magnification factor. Called only by viewer.asp.
     
  3. map.asp
    Returns a .gif image of a map of Utah, be it the state map or a GIS layer map, sometimes with dots representing collections. The collections data (only) is sent to the page as an n x 7 array stored as a server variable. Called by search.asp, stats.asp, details.asp.
     
  4. reference.asp
    This returns a .png image of the reference image with green lines drawn to represent the view seen in the main image window. Called only by viewer.asp.
     
  5. ruler.asp
    Creates a ruler based on a magnification value, length, and orientation. The script has the ability to change the colors but we haven't used that (yet). Called only by viewer.asp.
     
  6. google.asp (not shown on the diagram above as it interacts with Google Earth, not the browser)
    Creates and sends .kml data (Keyhole Markup Language, a form of XML used by Google Earth) representing sets of accessions returned in a search. Used in all "download to Google Earth" links in search.asp, details.asp and stats.asp. It is also called directly by Google Earth clients when they check the "view all accessions" box in Google Earth. This is a Google Earth function in which the bounding box of the viewed portion of the Earth is sent to google.asp, which then returns all .kml-formatted accessions inside that box. I am told NASA World Wind can do the same thing, but we haven't added that functionality yet.
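
To show what google.asp's output looks like on the wire, here is a bare-bones sketch of a .kml response containing a single hard-coded placemark. The accession number and coordinates are examples only, and the namespace shown is the Google Earth 2.0 one in use as of this writing:

<%
' Bare-bones sketch of a google.asp-style response: one hard-coded placemark.
' The real script loops over the accessions returned by a search.
Response.ContentType = "application/vnd.google-earth.kml+xml"
Response.Write "<?xml version=""1.0"" encoding=""UTF-8""?>" & vbCrLf
Response.Write "<kml xmlns=""http://earth.google.com/kml/2.0"">" & vbCrLf
Response.Write "  <Document>" & vbCrLf
Response.Write "    <Placemark>" & vbCrLf
Response.Write "      <name>UVU accession 00284</name>" & vbCrLf
Response.Write "      <description>Example placemark only</description>" & vbCrLf
Response.Write "      <Point><coordinates>-111.7,40.3,0</coordinates></Point>" & vbCrLf
Response.Write "    </Placemark>" & vbCrLf
Response.Write "  </Document>" & vbCrLf
Response.Write "</kml>"
%>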

Downloads

Download our .ASP code here (73 KB)

Download our Access database files Full (104 MB) or All but GNIS (1.8 MB)

Our scanning and processing procedures document, v1.2, usernames and passwords removed (80 KB .zip file).

These files are released under the GNU public license. They may be used without royalty or remuneration. If any part of these files, including modifications of the code that you make, are included in any other distributed form, that form must be distributed under this same license and the source code must be made freely available. We are not responsible in any way for the manner in which you use the files or descriptions we gave you, nor for the damage you manage to do with them.

 

Credits


This website was developed by the Utah Valley University Herbarium
under the direction of Dr. Jim Harris

by

Dr. Bruce Wilson (UVU Department of Chemistry) - database programming & server-side scripting
Vadra Rowley (former College of Science and Health web programmer) - HTML formatting & look and feel
with much useful feedback from
Donna Barnes (former Herbarium Curator) & Dr. Renee Van Buren
 

The development of the UVU Virtual Herbarium
is supported by a grant (DBI-0447301) from the National Science Foundation.


We thank the following for contributions of code to our project:

Mike Shaffer - lat/long to distance/bearing code
Jim Deutch of Syracuse, New York - arctan code
Carlos Baptiste and others - XML-generation code
4umi.com - tabbed-display code
Google for their free and easy-to-use Google Earth viewer
Marty Wefald for his township, range, section to lat/long conversion program
Ross Pickard for the ArcViewShapeFileDLL
and
Pierre E. Gougelet - for the utterly handy GFLAx image component
and
DIVA-GIS.org for their avid.exe Ecological Niche Prediction software
and especially
The UC Davis Herbarium - for making its Management Software freely available.


Last edited 20 May 2006.

Beginning programmers' info

Database

Let’s talk about the database server options. I’ve mentioned one, Access .mdb file-based data. Let’s complete the list, from most rudimentary to the most sophisticated:

  1. Text file (also known as a flat-file database).
    Flat-file database-driven websites were popular in the early days of the Internet, when Perl was the predominant server-side scripting language. Applications built on flat-file DB’s are slow and have very limited record-updating capabilities. And they cannot do the one thing that is absolutely essential in database usage: relate one table to another. More on that later.
     
  2. MS Access
    Actually I quite like programming a website using an Access database. There are distinct advantages for the non-professional programmer in using some of the hand-holding capabilities of MS Access, most particularly building preformed SQL searches and testing your queries to see that they get the correct data formatted the way you need. Programmatically they are really quite sophisticated, lacking only a few of the features (most of which are not needed in casual database programming) present in the fully-functional DB’s like MS SQL and Oracle. The main drawback, as I’ve mentioned above, is the limited number of connections you can have to the database before queries start coming back slowly. And there is the additional step of updating the data by uploading the current .mdb file to the server. This isn’t too big a problem unless you anticipate a growth spurt in your accessions. Another limitation that shouldn’t affect anyone reading this is that the .mdb file can never be larger than 2 GBytes. If you keep only data in the database (no images stored in digital form) you’ll have a hard time getting your database larger than a couple dozen megabytes.
     
  3. Microsoft SQL Server 2000 Desktop Engine (MSDE 2000) or  Microsoft SQL Express
    This is the lite version of Microsoft’s SQL Server. Database servers have one nice advantage over file-based data: both you and the server are using the same data. You can access the data from your desktop/lab computer, and the server can grab that same data, so no special updates are needed. MSDE, like Access, has a connection limitation (5 max.) and data size limit (2 GBytes), so it should serve many herbaria very well. It’s free, and has most of the power of the full MS SQL (more than Access, less than MS SQL). One big problem with MSDE is that it comes with no database management tools. To connect to the data you need a user account, and accounts are hard to set up or modify without the DB management tools that come with MS SQL. Moving data to the server can be tricky, and the first-time user should find some DB nerd to help get it right. Upscaling from Access to MSDE is particularly headache-free.
     
  4. Open-source Databases (MySQL/PostgreSQL/SQL Anywhere/etc.)
    MySQL (pronounced “ess que ell” rather than the standard “sequel”) is a free and fast database server. Not as powerful as MSDE, it will do most of the things you’ll need to serve labels data. The thing it doesn’t do, and that you’ll need to work around, is stored procedures (Access calls them “queries”), though they will become available in the slower version 5. Stored procedures are handy during programming if you have complicated query commands you don’t want to keep copying. PostgreSQL is another free database server in this category (more functional than MySQL, but slower).
     
  5. MS SQL
    One of the big DB’s. Expensive, but fairly fast and very robust. Lots of CPU is needed to run it, but you get good DB tools.
     
  6. Oracle
    As functional as MS SQL, and about as expensive.
     
  7. Others
    There are many other file-based and server-based database systems, many of them free, that could serve data about as well as these can.

If you have a database, you can continue using it as is and adapt your virtual herbarium to use it, or you might consider upgrading your current database to a server-based system. Access can handle upgrading  (“upsizing” as Microsoft calls it) very nicely, especially to MSDE, MS SQL Express, or MS SQL.

Make sure the database you have chosen is compatible with the scripting language you want to use. Most scripting languages or database servers will come with “database connectors”, small programs that translate commands from the scripting language code to the binary format used by the database. Check that you don’t get stuck with management software, scripting languages, and database systems that won’t talk to each other.
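
For what it's worth, the "connector" usually amounts to nothing more than a connection string handed to ADO. A sketch, with placeholder paths and server names, showing an Access file next to a SQL Server/MSDE database:

<%
' Example ADO connection strings (paths and server names are placeholders).
' The same scripting code works against either database once the connection is open.
Dim conn
Set conn = Server.CreateObject("ADODB.Connection")

' File-based Access .mdb sitting on the web server:
conn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & Server.MapPath("herbarium.mdb")
conn.Close

' MS SQL Server or MSDE running on a database server:
conn.Open "Provider=SQLOLEDB;Data Source=DBSERVER;Initial Catalog=herbarium;Integrated Security=SSPI"
conn.Close
%>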

By way of explanation, we’re using an MS Access front-end/back-end database structure. We need to store some information that is used by the server only (image data, GIS information, and accession comments) that must be kept on the server and not overwritten when the labelsdata.mdb file is uploaded. So we have several database files. Herbarium.mdb is our main database. It holds the images and comments tables. It also links to the labelsdata.mdb and GNIS.mdb tables. We had to use the front-end/back-end arrangement rather than completely separate database files so we could combine the results from tables in each file in a single query. Readers with MS Access can download the database tables for closer examination here.

Our database is organized as follows:

herbarium.mdb:

  Tables:
    comments
    images
    maps
    StaticTally
    UtahCounties

  Linked Tables:
    labelsdata.mdb:
      Family
      Genera
      Herbarium Labels
      tblUTM
    GNIS.mdb:
      GNIS

  Queries (stored procedures):
    AllSearch
    AllSearchP
    CommentsCount
    ImagesCount
    Counties
    ElevationStats
    Families
    Genus
    Geo
    Tally (and many associated t* queries for generating counts)

The herbarium.mdb file resides permanently on the server. It has the images and comments tables, both of which change on a pretty regular basis. The images table contains one record for every image added to the database. The comments table contains one record for each comment made. Herbarium.mdb is linked to tables in labelsdata.mdb, the primary data file in the UC Davis Management System. The table Herbarium Labels has one record for each accession. Family and Genera are tables of all families and genera represented by the collection (plus some not represented). The tblUTM table holds UTM coordinates of some accessions. The labelsdata.mdb file is uploaded to the server from the primary herbarium record-keeping computer whenever there are significant additions or changes.

The GNIS.mdb table is a version of the GNIS database downloadable from the USGS for the United States and territories.

SQL

Database systems use Structured Query Language (SQL, sometimes pronounced "sequel") to communicate with the database. All database systems (except flat-file DB’s) use it, and except for some small differences in formatting, they use identical commands. An SQL statement in MS Access looks something like this:

SELECT *
FROM AllSearch
WHERE accession = '00284';

In this command we are asking the database for all fields from the table (or, in this case, stored procedure already in the database) with the accession field matching the text value '00284'. Or you can get SQL commands that have complicated search terms:

SELECT *
FROM AllSearchP
WHERE (((((Elevation) BETWEEN 5000 AND 6000) AND ((ElevationUnits)='ft.')) OR (((Elevation) BETWEEN 1524 AND 1829) AND ((ElevationUnits)='m.'))) AND ((GeoTertiaryDivision) LIKE '%utah%') AND ((Collector) LIKE '%wilson%'));

Note that in Access all text is entered between single quotes, and percent signs are used as text wildcards. Other DB systems will have slightly different formats. And finally, very complex queries can be constructed that combine the data in several tables in very specific ways. This is the SQL query stored in our Access database as “AllSearchP” (AllSearch is the same except we add the WHERE Private=FALSE condition):

SELECT L.*, G.Genus, F.Family, C.Comments, I.Images, U.Datum, U.UTMZone, U.easting, U.northing
FROM tblUTM AS U
RIGHT JOIN (ImagesCount AS I
  RIGHT JOIN (CommentsCount AS C
    RIGHT JOIN (Genera AS G
      RIGHT JOIN (Family AS F
       RIGHT JOIN [Herbarium Labels] AS L
       ON F.ID = L.FamilySource)
      ON G.GenusNumber = L.GeneraSource)
    ON C.ID = L.ID)
  ON I.ID = L.ID)
ON U.Label_ID = L.ID;

Here the ImagesCount and CommentsCount are also stored queries, and the joins are how we relate tables to each other. Note that these queries can be written in code instead of having them stored in the database. Here are the two stored queries mentioned above:

SELECT [Herbarium Labels].ID, Count(comments.ID) AS Comments
FROM comments
INNER JOIN [Herbarium Labels] ON comments.accession = [Herbarium Labels].Accession
GROUP BY [Herbarium Labels].ID;

SELECT [Herbarium Labels].ID, Count(image.ID) AS Images
FROM [Herbarium Labels]
INNER JOIN [image] ON [Herbarium Labels].Accession = image.accession
GROUP BY [Herbarium Labels].ID;
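
As noted above, a stored query can just as easily be issued straight from script. Here is a sketch of the CommentsCount query run directly through ADO; the connection string is a placeholder for wherever your .mdb actually lives:

<%
' The CommentsCount query issued directly from script rather than stored in
' the .mdb. A sketch only; the connection string is a placeholder.
Dim conn, rs, sql
Set conn = Server.CreateObject("ADODB.Connection")
conn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & Server.MapPath("herbarium.mdb")

sql = "SELECT [Herbarium Labels].ID, Count(comments.ID) AS Comments " & _
      "FROM comments INNER JOIN [Herbarium Labels] " & _
      "ON comments.accession = [Herbarium Labels].Accession " & _
      "GROUP BY [Herbarium Labels].ID;"

Set rs = conn.Execute(sql)
Do While Not rs.EOF
    Response.Write rs("ID") & ": " & rs("Comments") & "<br>"
    rs.MoveNext
Loop
rs.Close : conn.Close
%>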

The SQL terms RIGHT JOIN, LEFT JOIN, INNER JOIN, GROUP BY, COUNT(), etc, all have very specific meanings, meanings that are difficult to learn and nearly impossible to master. Fortunately, if you have your data in Access (or some other query building program), you can define the relationships between the data tables, build the query using a (fairly-easy-to-understand) graphical method, and test the query; then Access will create the actual SQL statement. Remember what I said about Access having advantages?

The reader should consult the herbarium.mdb file and the search.asp & details.asp files for more examples of the SQL queries we use.

Scripting

A server-side script is a web page, written in a combination of HTML (which is a formatting language) and a scripting language (which runs like a program every time a page is requested).

Here is a small example of a server-side script written in VBScript, Microsoft’s scripting version of the Visual BASIC computer language:

sheet.asp:

<%@ EnableSessionState = FALSE %>
<!--#include file="config.asp"-->
<%
set GflAx = Server.CreateObject("GflAx.GflAx")
GflAx.LoadBitmap request.querystring("f")
x = request.querystring("x") 'OriginX
y = request.querystring("y")
w = clng(request.querystring("w")) 'WindowWidth
h = clng(Request.querystring("h"))
dh = request.querystring("dh") 'DisplayHeight
dw = request.querystring("dw")
if (w<GflAx.width) OR (h<GflAx.height) then GflAx.Crop x, y, w, h
GflAx.Resize dw, dh
GflAx.Sharpen 40
GflAx.SaveFormat = 1 'JPEG
GflAx.SaveJPEGProgressive = TRUE
GflAx.SaveJPEGQuality= 75
Response.ContentType = "image/jpeg"
Response.BinaryWrite GflAx.SendBinary
set GflAx = nothing
%>

Note: the highlight colors of the scripts are the default .ASP colors in SourceEdit, http://www.brixoft.net/default.asp, the best script editor available, and it too, is free. This script uses the GFLAx image component.

Several things to point out in this script: everything that appears between the script delimiters, “<%” and “%>”, is not returned directly to the client; instead the commands are “run” on the server, and change the information returned to the client. This is the code to modify the image of the sheet and return it as an image. It is not sending any text to the client. So instead of the browser asking for an image called “sheet.jpg” it asks for

“sheet.asp?f=sheet234.jpg&x=234&y=435&w=1000&h=1400&dw=600&dh=800”.

This request includes a “query string”, variables that are passed to the script which define the crop parameters and resize dimensions of the image file called “sheet234.jpg”. The script executes on the server to crop and resize the image, then sharpens the image before returning it to the client, exactly as if it were an image file. The client doesn’t know it’s a modified file being returned. Probably doesn’t care either. But the client sees nicely customized images. This script is used to display the sheet image at various magnifications (f is the full image name, x & y tell where to start the crop box, w & h tell the dimensions of the box, and dw & dh tell the final size of the image).

There are several predominant scripting languages. You’ll need to choose one and stick with it, as translating one language to another is rather difficult.

  1. VBScript/Jscript
    These run on Microsoft servers. VBScript is based on Visual Basic, and JScript is Microsoft’s version of JavaScript (a very portable language [it runs within many operating systems and browsers] with syntax derived from the older C language). VBScript is known for readability (it’s the script I’ll use in all my examples), JScript is known for its power and brevity of code. Microsoft’s server includes by default a few “objects”, subprograms you can use to add more advanced functionality. You can see an example of GflAx above, being used to handle the image data. GflAx is an external object I downloaded for free and installed on the server. “Request” and “response” are also server objects, used to get data from the client and return data back again. Both these languages communicate nicely with Access, MSDE, and MS SQL.
     
  2. ASP.NET
    The newest framework from Microsoft. It offers a lot of variety in how you write the code (you can use C#, Visual Basic, or JScript), and has some nifty compilation features that speed content delivery and shorten code-writing time.
     
  3. PHP
    PHP runs on linux and Windows servers, is installed by default in linux, and is very powerful. It includes a vast library of objects the programmer can use to great advantage. Most server-side scripting found on the web at large is done using PHP. Works very well with MySQL.
     
  4. Perl
    Originally only a text-processing language, Perl has grown up quite a bit. Still not as powerful as the languages mentioned above, but easy to use.
     
  5. Others
    There are several other scripting languages in use, the most powerful (and portable) being good old JavaScript, useful if you don’t know what sort of server you’ll have, or if you’ll be moving servers in the future. ColdFusion, Python, Ruby, Tcl, and many others can be used, but have less support from the open- and free-software makers.

We chose Active Server Pages as our scripting language, mostly because that's the one I knew best. Please download and examine our annotated code for a better idea of just how we do our server-side .ASP scripting.

The Do It Yourself option brings with it many decisions that must be made before you start writing code. The primary decision is how to keep track of where the client is in the website. This is called "state". State refers to what the server knows about particular clients. For example, you visit a well-designed site like Amazon.com and if you have ever logged in in the past, you will be greeted by a page that says "Welcome Joe. Here are some book recommendations for you." How did the server know it was you visiting and not someone else? How does the website know what books you've looked at?

State

State, the keeping track of what a particular visitor has done during a visit, means that the script is aware of the search terms used and what settings the visitor has made, and it helps in building navigation links. This is done by storing information as variables in some place that persists between clicks. It can be done in one of four ways:

  A. Cookies
  B. Server-maintained state variables
  C. The querystring
  D. Form variables

A. In the early days of the Internet state did not exist. All pages were static files, and all visitors saw exactly the same information. When web designers wanted to customize a page, they needed a method of identifying which client was which. At first they used the client IP address (the number that identifies your particular computer to the internet, as in 128.025.186.205) to keep track, but this method was confounded by corporate firewalls and proxies, which make all clients in the corporation appear to have the same IP address. So they invented "cookies". Cookies are small text files placed on the client hard drive which contain information generated by the web server. For example, the Amazon.com cookie looks like this:

ubid-main=430-8538451-385471
x-main=MDh0sbFgeWALIt0Vg?K3kYMDMm
session-id=102-9498384-4474568

The Amazon.com website generates these keys and values and stores them in their database. When you visit Amazon.com again, your web browser automatically sends the Amazon.com cookie back to the website with every click. The script on the website reads the cookie information, looks up the particulars in the database, and instantly knows who you are, what you like, what books you've looked at, etc. This method could be used to store variables needed by the server to keep track of which pages have been seen, or what the last search terms were, and it could do so without the visitor knowing that it was being done, were it not for two things: some visitors will have cookies turned off, and the back button messes everything up.
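
Here is a sketch of what cookie handling looks like in classic ASP; the cookie name and value are examples only, not anything our site actually sets:

<%
' Sketch of cookie-based state in classic ASP; the cookie name and value
' are examples only, not something our site sets.
Response.Cookies("lastSearch") = "penstemon"
Response.Cookies("lastSearch").Expires = Date + 30   ' keep it for 30 days

' On a later request the browser sends the cookie back automatically:
Dim term
term = Request.Cookies("lastSearch")
If term <> "" Then Response.Write "Welcome back; your last search was " & term
%>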

B. Scripting engines (VBScript, PHP, etc.) automatically keep track of state. Most server-side-scripted sites hosted on a Windows machine will generate a cookie with data that looks something like:

ASPSESSIONIDQCCDSATA=DEGFAMCCFJELIFNEMEOFHGCO

It's a random string of characters that identifies you uniquely. Every time you click a link on that site, that information is sent to the server. The server can then (it doesn't have to, depending on what the programmer wanted) use state to keep track of a lot of internal variables. For example, when the advanced search is used there might be a lot of very specific search information entered (name, altitude, lat/long, collector, etc.) to generate the search results. How does the server know what the search terms were when you click the link to go to page 2? The programmer can create a bunch of server variables, variables that exist only in the server memory and are associated only with the client that returns the matching session string (above), to hold all that information, so when you click the link to "page 2" the script can find all the search information you entered.

Server-maintained state has one big disadvantage: timeout. Since it is not reasonable to keep track of every visitor's state variables over the course of weeks or months, state will time out on a server and delete from memory all the variables it had stored. Most servers use a 20 minute default timeout. If you wait more than 20 minutes between clicks, the server won't remember where you've been or what you've entered unless it has saved your data in a file or database, and no one does that except Amazon.com and a few sites like it. Wait 20 minutes, click on a link, and you'll either get an error page or be sent back to the home page to start over. This is a major disadvantage for a virtual herbarium, where clients can reasonably be expected to spend quite some time studying a particular image or retrieving a specimen from the lab for comparison. It also means that URLs cannot be saved that will retrieve the same information from the database.

C. Another method is to pass all the variables you need in the query string of each URL (web link). Here is an example URL from a search using the combination-of-terms form:

http://research.uvsc.edu/herbarium/virtual/search.asp?p=2&s=form&n=&
o=Family,G.Genus,SpecificEpithet&l=(((((Elevation) BETWEEN 6000 AND 10000) AND ((ElevationUnits)='ft.')) OR (((Elevation) BETWEEN 1829 AND  3048) AND ((ElevationUnits)='m.'))) AND ((GeoTertiaryDivision) LIKE '~utah~') AND ((MoreCollectors) LIKE '~harris~'))&t=&cc=y&ce=&cf=y&cg=y&cl=&m=utahsmall

Here we are sending quite a bit of information: p=page number, s=type of search made, o=the ORDER BY clause used in the SQL query, l=the WHERE clause of the SQL query, all the cx variables indicate which columns of data to display on the search results table, and m=which map to display. There are also variables for the page title (t) and shortcuts to particular information in the database (n) that aren't being used here. The advantages of using this method of passing state: no personalization is involved (like client names), so no personal information is exposed in the URL, and a client can click on the link at any time and get the same information back. The disadvantage is obvious: you can't type that in. You'd need to email the link to your friend or yourself to use it on a different computer. But you can save the URL of a particular search result as a favorite link.

D. The last method of maintaining state: use hidden variables in a form. This requires that all links be part of a web form (not that big a deal), but sacrifices the ability to save the URL or send it to a friend.

In the end we chose to use primarily the query string to maintain state. We thought the ability to save URLs as favorites, and to come back at any time and get the same page, suited the needs of our clients better than their need for clean URLs. We do, however, use a form to return advanced search requests to the server, and use a short-lived server variable (an array) to keep track of accession positions on the maps.
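
Here is a sketch of how query-string state works in practice: each script reads its state back out of the query string on every request and writes it into the links it generates. The parameter names follow the search.asp example above; everything else is illustrative:

<%
' Sketch of query-string state: everything the next page will need is written
' back into the link itself. Parameter names follow the search.asp example
' above; the rest is illustrative.
Dim p, s, m, nextLink
p = Request.QueryString("p")             ' current page number
If Not IsNumeric(p) Then p = 1
p = CLng(p)
s = Request.QueryString("s")             ' type of search made
m = Request.QueryString("m")             ' which map to display

nextLink = "search.asp?p=" & (p + 1) & _
           "&s=" & Server.URLEncode(s) & _
           "&m=" & Server.URLEncode(m)
Response.Write "<a href=""" & nextLink & """>next page</a>"
%>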