$TRANSMACHINA$'s web mining tool is able to "crawl" collections of web sites, collecting linguistic data. Much as Google collects data for searching, NaturalCrawler collects information about word and phrase usage, terminology and context. Data collected by NaturalCrawler is used to complement $TRANSMACHINA$'s $SPRAWK$ service and can also be used to perform "language audits" on corporate web sites, reporting on the consistency of communication to your online audience.
The tool is a web "crawler" or "spider" that can explore a given web site or URL and extract text (and sentences) from the site into a database table.
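As a rough illustration of the sentence-extraction step, the sketch below splits already-extracted page text into sentences using the JDK's java.text.BreakIterator. The class name SentenceExtractor and its extract method are hypothetical and are not part of the actual spider libraries; fetching pages, stripping markup and writing to the database table are assumed to happen elsewhere.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not the spider's real API: splits plain text into sentences.
class SentenceExtractor {

    /** Returns the trimmed, non-empty sentences found in the given text. */
    static List<String> extract(String text, Locale locale) {
        List<String> sentences = new ArrayList<>();
        // Locale-aware sentence boundary detection from the standard library.
        BreakIterator boundaries = BreakIterator.getSentenceInstance(locale);
        boundaries.setText(text);

        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

Each returned string could then be stored as one row, keyed by the URL of the page it came from. Passing the locale explicitly matters because sentence boundary rules differ between languages.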
The text will later be used for:
The spider is able to track its progress:
Furthermore, it should be possible to define filters on a per-URL basis to remove unwanted data. For example, on the www.dn.se site, a filter should be pre-applied that removes the page header and menus prior to sentence extraction, so that the words "news", "sports", "weather" etc. aren't included in the frequency counts for every single page.
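One way such per-site filters could be organised is a registry keyed by host name, consulted before sentence extraction. The sketch below is only an assumption about how this might look: the PageFilters class, its register and apply methods, and the dropMenuLines filter are all hypothetical names, and the filter works on extracted text lines rather than on the page markup.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical registry of per-host text filters, not the spider's real API.
class PageFilters {
    private final Map<String, UnaryOperator<String>> byHost = new HashMap<>();

    /** Associates a filter with a host name such as "www.dn.se". */
    void register(String host, UnaryOperator<String> filter) {
        byHost.put(host.toLowerCase(Locale.ROOT), filter);
    }

    /** Applies the filter registered for the URL's host, or returns the text unchanged. */
    String apply(String url, String pageText) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return pageText;
        }
        UnaryOperator<String> filter = byHost.get(host.toLowerCase(Locale.ROOT));
        return filter == null ? pageText : filter.apply(pageText);
    }

    /** Illustrative filter: drops lines consisting solely of one of the given (lowercase) menu words. */
    static UnaryOperator<String> dropMenuLines(Set<String> menuWords) {
        return text -> Arrays.stream(text.split("\\R"))
                .filter(line -> !menuWords.contains(line.trim().toLowerCase(Locale.ROOT)))
                .collect(Collectors.joining("\n"));
    }
}
```

With a filter registered for www.dn.se that drops "news", "sports" and "weather", those menu lines are removed before counting, while pages from hosts without a registered filter pass through untouched.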
The spider is designed to be run in a server environment, since it will require significant CPU and bandwidth. The spider is a set of Java libraries that can be called from other Java programs or via dynamic web pages such as JSP/servlets.
When each page is visited, the following details are recorded:
The interface allows: