ht://Check 1.2.4 README
Website: http://htcheck.sourceforge.net/
Copyright (c) 1999-2006 Comune di Prato - Prato - Italy
Some Portions Copyright (c) 1995-2003 The ht://Dig Group <www.htdig.org>
Author: Gabriele Bartolini - Prato - Italy <angusgb@users.sourceforge.net>
$Id: README,v 1.20 2006/07/04 12:08:46 angusgb Exp $

ht://Check is distributed under the GNU General Public License (GPL).
See the COPYING file for license information.

===========================================================================

ht://Check is more than a link checker.  It is a console application
written for Linux systems in C++ and derived from ht://Dig.

It can retrieve information through HTTP/1.1 and store it
in a MySQL database, and it is particularly suitable for small
Internet domains or intranets.

Its purpose is to help a webmaster manage one or more
related sites: after a "crawl", ht://Check gives back useful
summaries and reports, including broken links, anchors not found,
and summaries of content types and HTTP status codes.

From version 1.2.3, ht://Check also performs accessibility checks
in accordance with the principles of the University of Toronto's
Open Accessibility Checks (OAC) project, allowing users to discover
site-wide barriers like images without proper alternatives,
missing titles, etc.

ht://Check can also be used for Web structure analysis,
as it stores information regarding links between HTML documents.

===========================================================================

ht://Check - FEATURES
=====================

ht://Check is made up of two logical parts: a "spider" which starts checking URLs
from a specific one or from a list of them; and an "analyser" which takes the
results of the first part and shows summaries (this part can be done via console
or by using the PHP interface through a web server).

The "Spider" or "Crawler"
-------------------------

- HTTP/1.1 compliant with persistent connections and cookies support (pre-loading too)
- HTTP Basic authentication supported
- HTTP Proxy support (with basic authentication too)
- Crawl customisable through many configuration attributes which let the user
limit the crawl by URL pattern matching and by distance ("hops") from the first URL
- MySQL databases directly created by the spider
- MySQL connections through user or general option files as defined by the
database system (/etc/my.cnf or ~/.my.cnf)
- Accessibility checks performed on HTML documents

JavaScript is not supported, nor are other protocols such as HTTPS, FTP, NNTP
and local files.
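
To illustrate what limiting a crawl by pattern and by "hops" means, here is a
minimal Python sketch. It is not the ht://Check implementation: the function
name 'plan_crawl', its parameters and the in-memory link graph are all
assumptions made for the example, and the real spider is driven by its own
configuration attributes.

```python
from collections import deque
import re


def plan_crawl(start_url, links, max_hops, exclude=None):
    """Breadth-first walk over a link graph, recording each URL's hop count.

    links    -- dict mapping a URL to the URLs it links to (a stand-in for
                actually fetching and parsing pages; hypothetical)
    max_hops -- largest distance from start_url that is still scheduled
    exclude  -- optional regular expression; matching URLs are skipped
    """
    exclude_re = re.compile(exclude) if exclude else None
    hops = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hops:
            continue  # pages at the limit are kept, but their links are not followed
        for dest in links.get(url, []):
            if dest in hops:
                continue  # already scheduled with a shorter or equal distance
            if exclude_re and exclude_re.search(dest):
                continue  # rejected by the URL pattern
            hops[dest] = hops[url] + 1
            queue.append(dest)
    return hops
```

With a graph such as {"a": ["b", "c"], "b": ["d"], "d": ["e"]} and max_hops=2,
the URL "e" is never scheduled because it is three hops away from "a".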

The "Analyser"
--------------

Just a preface: since all of the data gathered during a crawl are stored in a
MySQL database, it is pretty easy to get the info you want by querying the
database. The spider is included in the 'htcheck' application, which shows a
small text report of its own at the end of the crawl. Afterwards, you can always
retrieve info from that database by building your own interface (in PHP or Perl,
for instance) or by just using the default one written in PHP.
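
As an example of building your own interface, the sketch below lists broken
links with a single SQL query. It is written in Python and, so that it can run
anywhere, uses an in-memory SQLite database with simplified tables as a
stand-in for MySQL; the sample rows are invented. It also assumes that the
Schedule table holds one row for every IDUrl referenced by Link.

```python
import sqlite3


def broken_links(conn):
    """Return (source URL, destination URL) pairs for links marked 'Broken'."""
    sql = """
        SELECT src.Url, dest.Url
        FROM Link
        JOIN Schedule AS src  ON src.IDUrl  = Link.IDUrlSrc
        JOIN Schedule AS dest ON dest.IDUrl = Link.IDUrlDest
        WHERE Link.LinkResult = 'Broken'
        ORDER BY src.Url, dest.Url
    """
    return conn.execute(sql).fetchall()


def demo_connection():
    """Build a tiny stand-in database (simplified columns, invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Schedule (IDUrl INTEGER PRIMARY KEY, Url TEXT);
        CREATE TABLE Link (IDUrlSrc INTEGER, IDUrlDest INTEGER,
                           TagPosition INTEGER, AttrPosition INTEGER,
                           LinkResult TEXT);
        INSERT INTO Schedule VALUES (1, 'http://www.foo.com/');
        INSERT INTO Schedule VALUES (2, 'http://www.foo.com/missing.html');
        INSERT INTO Schedule VALUES (3, 'http://www.foo.com/ok.html');
        INSERT INTO Link VALUES (1, 2, 2, 1, 'Broken');
        INSERT INTO Link VALUES (1, 3, 3, 1, 'OK');
    """)
    return conn
```

Against a real ht://Check database the same SELECT statement should be usable
through any MySQL client or driver.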

'htcheck' (the console application) gives you a summary of broken links, broken
anchors, servers seen and content-types encountered.

The PHP interface lets you perform:
- Queries regarding URLs, by choosing among many discriminating criteria such as
pattern matching, status code, content-type, size.
- Queries regarding links, with pattern matching on both source and destination
URLs (also with regular expressions), the results (broken, ok, anchor not found,
redirected) and their type (normal, direct, redirected).
- Info regarding specific URLs (outgoing and incoming links, last modification
date and time, etc.)
- Info regarding specific links (broken or ok) and the HTML instruction that
issued it
- Statistics on documents retrieved
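
A query on URLs of the kind listed above can be sketched in a few lines. The
Python function below is not part of the PHP interface; 'find_urls' and its
parameters are hypothetical, the tables are simplified, and an in-memory SQLite
database with invented rows stands in for MySQL. Only criteria that are
actually given end up in the WHERE clause.

```python
import sqlite3


def find_urls(conn, pattern=None, status_code=None, content_type=None,
              min_size=None):
    """Select rows of Url matching every criterion that is not None."""
    clauses, params = [], []
    if pattern is not None:
        clauses.append("Url LIKE ?")        # SQL pattern, e.g. '%.html'
        params.append(pattern)
    if status_code is not None:
        clauses.append("StatusCode = ?")
        params.append(status_code)
    if content_type is not None:
        clauses.append("ContentType = ?")
        params.append(content_type)
    if min_size is not None:
        clauses.append("Size >= ?")
        params.append(min_size)
    sql = "SELECT Url, StatusCode, ContentType, Size FROM Url"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY Url"
    return conn.execute(sql, params).fetchall()


def demo_connection():
    """Build a tiny stand-in database (simplified columns, invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Url (IDUrl INTEGER PRIMARY KEY, Url TEXT, "
                 "StatusCode INTEGER, ContentType TEXT, Size INTEGER)")
    conn.executemany("INSERT INTO Url VALUES (?, ?, ?, ?, ?)", [
        (1, "http://www.foo.com/", 200, "text/html", 1200),
        (2, "http://www.foo.com/logo.png", 200, "image/png", 50000),
        (3, "http://www.foo.com/old.html", 404, "text/html", 300),
    ])
    return conn
```

Values are always passed as query parameters rather than pasted into the SQL
text. Note that the '?' placeholder is the SQLite style; MySQL drivers for
Python usually expect a different placeholder.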



ht://Check - DATABASE TABLES EXPLANATION
========================================

1) Link
-------

This table contains info on all the links that ht://Check found during the
crawl. Each link is identified by 4 fields, which make up the primary key:
the source URL, the destination URL, the tag position in the document and
the attribute position in the tag definition.

For example, let's suppose our first URL visited is http://www.foo.com/ and
it is as shown below (so simple, I hope you never write it this way):
<HTML>
<A href="http://htcheck.sourceforge.net/">
</HTML>

So, http://www.foo.com/ is identified as URL number 1 (IDUrl=1) whereas
http://htcheck.sourceforge.net/ has IDUrl=2.

The A tag has position number 2 in the document, and the attribute which
creates a link is the 'href' with position 1 in the tag.

So our link record's primary key is:
IDUrlSrc=1, IDUrlDest=2, TagPosition=2, AttrPosition=1.
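
The two positions of the example can be reproduced with a short Python sketch
based on the standard html.parser module. This is not ht://Check's own parser:
the class and function names are invented, and counting only opening tags
(and looking only at 'href' and 'src' attributes) is an assumption that
happens to agree with the example above.

```python
from html.parser import HTMLParser


class LinkPositions(HTMLParser):
    """Collect (TagPosition, AttrPosition, attribute, value) for link attributes."""

    LINK_ATTRIBUTES = ("href", "src")   # assumption: a small illustrative set

    def __init__(self):
        super().__init__()
        self.tag_position = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Every opening tag advances the position counter, starting from 1.
        self.tag_position += 1
        for attr_position, (name, value) in enumerate(attrs, start=1):
            if name in self.LINK_ATTRIBUTES and value:
                self.links.append(
                    (self.tag_position, attr_position, name, value))


def find_links(html):
    parser = LinkPositions()
    parser.feed(html)
    parser.close()
    return parser.links
```

Feeding it the three-line document shown above yields one link with
TagPosition=2 and AttrPosition=1, matching the primary key of the example.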

Sometimes, when referencing a URL, we use so-called anchors, by specifying
them after a '#' in the URL. If one has been set, the Anchor field of the
table contains that value.
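
Splitting a reference into its URL and its anchor can be shown with Python's
standard library. This is only an illustration of the idea; 'split_anchor' is
a made-up helper, not an ht://Check function.

```python
from urllib.parse import urldefrag


def split_anchor(url):
    """Return (url_without_anchor, anchor); anchor is '' when there is none."""
    base, anchor = urldefrag(url)
    return base, anchor
```

The empty string returned for a URL without '#' corresponds to the default
value of the Anchor field.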

The most interesting fields of the table are LinkType and LinkResult, which are
both enumeration fields. The LinkResult field is set only at the end of the
crawl, after all the URLs have been retrieved.

The LinkType field can take one of these values:
- 'Normal' (a normal link, like the 'href' ones: this means you have to click
  before accessing it);
- 'Direct' (a direct link is a link that is downloaded, usually, automatically
  by agents, like for example images called by the <IMG src> HTML tag);
- 'Redirection': this is a special case, an unusual link, because it's
  an instance of HTTP redirection, performed by the server (3xx status
  codes). Records of this kind don't have a TagPosition and an AttrPosition
  properly set (obviously, there's no HTML statement issuing this link).
 
The LinkResult field can take one of these values:
- 'NotChecked': this is the default value, assigned when every Link record
  is created; the field can be set properly only at the end of the crawl loop;
- 'NotRetrieved': the destination URL of the link has not been retrieved;
- 'OK': the link works perfectly, and so does the anchor, if present (the
  anchor is verified only if the document has been retrieved, and not merely
  checked for existence);
- 'Broken': the link is broken. The destination URL has not been found;
- 'Redirected': the destination URL has been redirected by the HTTP server;
- 'AnchorNotFound': the destination URL has been found and parsed, but
  the link anchor doesn't exist in it.
- 'NotAuthorized': you must have rights to access this URL, that is to say
  a valid user and a password for authentication (see 'authorization'
  attribute).
- 'EMail': that's an e-mail address reference.
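
Since both fields are plain enumerations, a quick overview of a crawl can be
obtained by counting links per LinkType and LinkResult. The following Python
sketch shows such a query; the table is simplified, the rows are invented, and
an in-memory SQLite database stands in for MySQL.

```python
import sqlite3


def link_summary(conn):
    """Count links for every (LinkType, LinkResult) combination."""
    sql = """
        SELECT LinkType, LinkResult, COUNT(*)
        FROM Link
        GROUP BY LinkType, LinkResult
        ORDER BY LinkType, LinkResult
    """
    return conn.execute(sql).fetchall()


def demo_connection():
    """Build a tiny stand-in database (simplified columns, invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Link (IDUrlSrc INTEGER, IDUrlDest INTEGER, "
                 "TagPosition INTEGER, AttrPosition INTEGER, "
                 "LinkType TEXT, LinkResult TEXT)")
    conn.executemany("INSERT INTO Link VALUES (?, ?, ?, ?, ?, ?)", [
        (1, 2, 2, 1, "Normal", "OK"),
        (1, 3, 3, 1, "Normal", "Broken"),
        (1, 4, 4, 1, "Direct", "OK"),
        (1, 5, 5, 1, "Normal", "OK"),
    ])
    return conn
```

In this stand-in the values are text and therefore sort alphabetically; MySQL
sorts enumeration columns by their position in the enum definition, so the
row order may differ on a real database.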



The tables as dumped by the 'mysqldump' program
===============================================

Here follows the structure of the tables of a typical ht://Check database,
as output by the mysqldump program. Please refer to the MySQL
documentation for further information. And if you have any useful
advice or suggestions to give me regarding the database (and of course
everything else), please send me an e-mail! :-)

--
-- Table structure for table `Accessibility`
--

CREATE TABLE Accessibility (
  IDCheck mediumint(8) unsigned NOT NULL default '0',
  IDUrl mediumint(8) unsigned NOT NULL default '0',
  TagPosition smallint(5) unsigned default '0',
  AttrPosition tinyint(3) unsigned default '0',
  Code tinyint(3) unsigned default '0',
  PRIMARY KEY  (IDCheck),
  KEY IDUrl (IDUrl,TagPosition,AttrPosition),
  KEY Code (Code,IDUrl,TagPosition)
);

--
-- Table structure for table `Cookies`
--

CREATE TABLE Cookies (
  IDCookie mediumint(8) unsigned NOT NULL default '0',
  Name varchar(255) NOT NULL default '',
  Value text NOT NULL,
  Path varchar(255) NOT NULL default '',
  Domain varchar(255) NOT NULL default '',
  MaxAge mediumint(9) NOT NULL default '-1',
  Version tinyint(4) NOT NULL default '0',
  SrcUrl varchar(255) NOT NULL default '',
  Expires datetime NOT NULL default '0000-00-00 00:00:00',
  Secure tinyint(4) NOT NULL default '0',
  DomainValid tinyint(4) NOT NULL default '0',
  PRIMARY KEY  (IDCookie)
);

--
-- Table structure for table `HtmlAttribute`
--

CREATE TABLE HtmlAttribute (
  IDUrl mediumint(8) unsigned NOT NULL default '0',
  TagPosition smallint(5) unsigned NOT NULL default '0',
  AttrPosition tinyint(3) unsigned NOT NULL default '0',
  Attribute varchar(32) NOT NULL default '',
  Content varchar(255) NOT NULL default '',
  PRIMARY KEY  (IDUrl,TagPosition,AttrPosition),
  KEY Idx_Attribute (Attribute(8)),
  KEY Idx_Content (Content(8))
);

--
-- Table structure for table `HtmlStatement`
--

CREATE TABLE HtmlStatement (
  IDUrl mediumint(8) unsigned NOT NULL default '0',
  TagPosition smallint(5) unsigned NOT NULL default '0',
  Row mediumint(8) unsigned NOT NULL default '0',
  Tag varchar(32) NOT NULL default '',
  Statement varchar(255) default NULL,
  LinkTagPosition smallint(5) unsigned default NULL,
  LinkDescription varchar(255) default NULL,
  PRIMARY KEY  (IDUrl,TagPosition),
  KEY Idx_Tag (Tag(4)),
  KEY Idx_Statement (Tag(8))
);

--
-- Table structure for table `Link`
--

CREATE TABLE Link (
  IDUrlSrc mediumint(8) unsigned NOT NULL default '0',
  IDUrlDest mediumint(8) unsigned NOT NULL default '0',
  TagPosition smallint(5) unsigned NOT NULL default '0',
  AttrPosition tinyint(3) unsigned NOT NULL default '0',
  Anchor varchar(255) binary NOT NULL default '',
  LinkType enum('Normal','Direct','Redirection') NOT NULL default 'Normal',
  LinkResult enum('NotChecked','NotRetrieved','OK','Broken','AnchorNotFound','Redirected','NotAuthorized','EMail','Javascript','BadEncoded') NOT NULL default 'NotChecked',
  LinkDomain enum('SameServer','Internal','External') default NULL,
  PRIMARY KEY  (IDUrlSrc,IDUrlDest,TagPosition,AttrPosition),
  KEY Idx_IDUrlDest (IDUrlDest),
  KEY Idx_Anchor (Anchor(8)),
  KEY Idx_LinkType (LinkType),
  KEY Idx_LinkResult (LinkResult)
);
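
Link and HtmlStatement share the document id and the tag position, so a link
can be matched with the HTML statement that issued it. The Python sketch below
assumes exactly that relationship (Link.IDUrlSrc = HtmlStatement.IDUrl, same
TagPosition); the tables are simplified, the sample rows (including the
content of the Statement column) are invented, and SQLite stands in for MySQL.

```python
import sqlite3


def links_with_statements(conn, id_url_src):
    """List the links of one document together with their HTML statements."""
    sql = """
        SELECT Link.TagPosition, Link.LinkResult, HtmlStatement.Statement
        FROM Link
        JOIN HtmlStatement
          ON HtmlStatement.IDUrl = Link.IDUrlSrc
         AND HtmlStatement.TagPosition = Link.TagPosition
        WHERE Link.IDUrlSrc = ?
        ORDER BY Link.TagPosition
    """
    return conn.execute(sql, (id_url_src,)).fetchall()


def demo_connection():
    """Build a tiny stand-in database (simplified columns, invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE HtmlStatement (IDUrl INTEGER, "
                 "TagPosition INTEGER, Tag TEXT, Statement TEXT)")
    conn.execute("CREATE TABLE Link (IDUrlSrc INTEGER, IDUrlDest INTEGER, "
                 "TagPosition INTEGER, AttrPosition INTEGER, LinkResult TEXT)")
    conn.executemany("INSERT INTO HtmlStatement VALUES (?, ?, ?, ?)", [
        (1, 1, "HTML", "<HTML>"),
        (1, 2, "A", '<A href="http://htcheck.sourceforge.net/">'),
    ])
    conn.execute("INSERT INTO Link VALUES (1, 2, 2, 1, 'OK')")
    return conn
```

Redirection links have no HTML statement behind them, so an inner join like
this one leaves them out.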

--
-- Table structure for table `Schedule`
--

CREATE TABLE Schedule (
  IDUrl mediumint(8) unsigned NOT NULL default '0',
  IDServer smallint(5) unsigned NOT NULL default '0',
  Url varchar(255) binary NOT NULL default '',
  Status enum('ToBeRetrieved','Retrieved','CheckIfExists','Checked','BadQueryString','BadExtension','MaxHopCount','FileProtocol','EMail','Javascript','NotValidService','Malformed') NOT NULL default 'ToBeRetrieved',
  Domain enum('Internal','External') default NULL,
  CreationTime datetime NOT NULL default '0000-00-00 00:00:00',
  IDReferer mediumint(8) unsigned NOT NULL default '0',
  HopCount tinyint(3) unsigned NOT NULL default '0',
  PRIMARY KEY  (IDUrl),
  KEY Idx_IDServer (IDServer),
  KEY Idx_Url (Url(64)),
  KEY Idx_Status (Status)
);

--
-- Table structure for table `Server`
--

CREATE TABLE Server (
  IDServer smallint(5) unsigned NOT NULL default '0',
  Server varchar(255) NOT NULL default '',
  IPAddress varchar(15) default NULL,
  Port smallint(5) unsigned NOT NULL default '0',
  HttpServer varchar(255) NOT NULL default '',
  HttpVersion varchar(255) NOT NULL default '',
  PersistentConnection tinyint(1) unsigned NOT NULL default '0',
  Requests smallint(5) unsigned NOT NULL default '0',
  PRIMARY KEY  (IDServer),
  KEY Idx_Server (Server(24)),
  KEY Idx_Requests (Requests)
);

--
-- Table structure for table `Url`
--

CREATE TABLE Url (
  IDUrl mediumint(8) unsigned NOT NULL default '0',
  IDServer smallint(5) unsigned NOT NULL default '0',
  Url varchar(255) binary NOT NULL default '',
  ContentType varchar(32) NOT NULL default '',
  ConnStatus enum('OK','NoHeader','NoHost','NoPort','NoConnection','ConnectionDown','ServiceNotValid','OtherError','ServerError') NOT NULL default 'OK',
  ContentLanguage varchar(16) NOT NULL default '',
  TransferEncoding varchar(32) NOT NULL default '',
  LastModified datetime NOT NULL default '0000-00-00 00:00:00',
  LastAccess datetime NOT NULL default '0000-00-00 00:00:00',
  Size int(11) NOT NULL default '0',
  StatusCode smallint(6) NOT NULL default '0',
  ReasonPhrase varchar(32) NOT NULL default '',
  Location varchar(255) binary NOT NULL default '',
  Title varchar(255) NOT NULL default '',
  Contents mediumtext,
  DocType enum('not-public','not-html','xhtml-11','xhtml-10','xhtml-10-transitional','xhtml-10-frameset','html-401','html-401-transitional','html-401-frameset','html-40','html-40-transitional','html-40-frameset','html-32','html-20','html-20-level2','html-20-level1','html-20-strict','html-20-strict-level1','html-iso-iec-15445-2000','unknown') default NULL,
  Charset enum('windows-1258','iso-8859-1','iso-8859-2','iso-8859-3','iso-8859-4','iso-8859-5','iso-8859-6','iso-8859-7','iso-8859-8','iso-8859-9','utf-8','koi8-r','koi8-u','iso-8859-10','iso-8859-13','iso-8859-14','iso-8859-15','windows-1250','windows-1251','windows-1252','windows-1253','windows-1254','windows-874','windows-1255','windows-1256','windows-1257','unknown') default NULL,
  Description varchar(255) default NULL,
  Keywords varchar(255) default NULL,
  SizeAdd int(11) NOT NULL default '0',
  PRIMARY KEY  (IDUrl),
  KEY Idx_IDServer (IDServer),
  KEY Idx_Url (Url(64)),
  KEY Idx_ContentType (ContentType(16)),
  KEY Idx_StatusCode (StatusCode),
  KEY Idx_Charset (Charset)
);
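
A summary of HTTP status codes can be read straight from the Url table. The
Python sketch below is only an illustration, not the report produced by
'htcheck' itself: the table is simplified, the rows are invented, and an
in-memory SQLite database stands in for MySQL.

```python
import sqlite3


def status_code_summary(conn):
    """Count the documents for every HTTP status code."""
    sql = """
        SELECT StatusCode, ReasonPhrase, COUNT(*)
        FROM Url
        GROUP BY StatusCode, ReasonPhrase
        ORDER BY StatusCode
    """
    return conn.execute(sql).fetchall()


def demo_connection():
    """Build a tiny stand-in database (simplified columns, invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Url (IDUrl INTEGER PRIMARY KEY, Url TEXT, "
                 "StatusCode INTEGER, ReasonPhrase TEXT)")
    conn.executemany("INSERT INTO Url VALUES (?, ?, ?, ?)", [
        (1, "http://www.foo.com/", 200, "OK"),
        (2, "http://www.foo.com/a.html", 200, "OK"),
        (3, "http://www.foo.com/old.html", 404, "Not Found"),
    ])
    return conn
```

Replacing StatusCode and ReasonPhrase with ContentType gives the analogous
summary of content types.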

--
-- Table structure for table `htCheck`
--

CREATE TABLE htCheck (
  StartTime datetime NOT NULL default '0000-00-00 00:00:00',
  EndTime datetime NOT NULL default '0000-00-00 00:00:00',
  ScheduledUrls mediumint(8) unsigned NOT NULL default '0',
  TotUrls mediumint(8) unsigned NOT NULL default '0',
  RetrievedUrls mediumint(8) unsigned NOT NULL default '0',
  TCPConnections mediumint(8) unsigned NOT NULL default '0',
  ServerChanges mediumint(8) unsigned NOT NULL default '0',
  HTTPRequests mediumint(8) unsigned NOT NULL default '0',
  HTTPSeconds mediumint(8) unsigned NOT NULL default '0',
  HTTPBytes bigint(20) unsigned NOT NULL default '0',
  AccessibilityChecks tinyint(3) unsigned NOT NULL default '1',
  User varchar(255) NOT NULL default '',
  PRIMARY KEY  (StartTime,EndTime)
);
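
The htCheck table keeps the overall figures of a crawl, from which a few
simple ratios can be derived. The Python sketch below takes the values of the
StartTime, EndTime, HTTPRequests and HTTPBytes fields as arguments; the
function name and the choice of ratios are assumptions made for the example.

```python
from datetime import datetime


def crawl_stats(start_time, end_time, http_requests, http_bytes):
    """Derive simple ratios from one row of the htCheck table.

    start_time, end_time -- strings in the 'YYYY-MM-DD HH:MM:SS' form
                            used by MySQL datetime fields
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    elapsed = (datetime.strptime(end_time, fmt)
               - datetime.strptime(start_time, fmt)).total_seconds()
    return {
        "elapsed_seconds": elapsed,
        # Guard against empty crawls to avoid dividing by zero.
        "bytes_per_request": http_bytes / http_requests if http_requests else 0.0,
        "requests_per_second": http_requests / elapsed if elapsed else 0.0,
    }
```

For instance, 200 requests and 1000000 bytes in a crawl lasting 100 seconds
give 5000 bytes per request and 2 requests per second.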

