Sprawk Crawler

transmachina's web mining tool can "crawl" collections of web sites to collect linguistic data. Much as Google collects data for searching, NaturalCrawler collects information about word and phrase usage, terminology and context. Data collected by NaturalCrawler complements transmachina's sprawk service and can also be used to perform "language audits" on corporate web sites, reporting on the consistency of communication to your online audience.

Summary

A web "crawler" or "spider" that can explore a given web site or URL and extract text (and sentences) from the site into a database table.
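As a rough illustration of the sentence extraction step, the sketch below splits plain text into sentences using the JDK's `java.text.BreakIterator`. The class name `SentenceSplitter` and its method are hypothetical and are not taken from the actual crawler libraries; the sketch also assumes the HTML markup has already been removed from the text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the real crawler API.
final class SentenceSplitter {

    // Splits plain text into trimmed, non-empty sentences using the
    // JDK's locale-aware sentence boundary rules.
    static List<String> split(String text, Locale locale) {
        List<String> sentences = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getSentenceInstance(locale);
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

Each returned sentence could then be stored as one row of the database table, keyed by the page it came from.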

The text will later be used for:

  • calculating word frequencies
  • determining common phrases and word combinations
  • determining words that co-occur
  • context processing and detection
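
The first of these uses, calculating word frequencies, can be sketched as follows. This is a minimal example under stated assumptions, not the tool's actual implementation: the class name `WordFrequency` is hypothetical, words are lower-cased, and a word is taken to be any run of Unicode letters or digits.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of the real crawler API.
final class WordFrequency {

    // Counts how often each lower-cased word occurs across all sentences.
    static Map<String, Integer> count(Iterable<String> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            // Split on runs of characters that are neither letters nor digits.
            String[] tokens = sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
            for (String token : tokens) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Phrase and co-occurrence counts could follow the same pattern, with pairs of adjacent words or pairs of words within one sentence used as the map key.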

The spider is able to track its progress:

  • know which pages it has already processed
  • know which pages are in its "todo-list"
This allows it to be started and stopped without losing information, and to avoid double-processing (where the same page is processed more than once, skewing the frequency data).
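
A minimal sketch of this bookkeeping is shown below. The class `CrawlState` and its methods are hypothetical names. The sketch keeps its state in memory only, so to survive a stop and restart the same data would have to be stored in the database. It also assumes URLs are compared as plain strings, with no normalisation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory model of the crawl progress; illustration only.
final class CrawlState {
    private final Set<String> seen = new HashSet<>();       // every URL ever queued
    private final Set<String> processed = new HashSet<>();  // URLs already processed
    private final Deque<String> todo = new ArrayDeque<>();  // pending URLs, oldest first

    // Adds a URL to the todo-list unless it has been queued before.
    // Returns true if the URL was newly added.
    boolean enqueue(String url) {
        if (!seen.add(url)) {
            return false; // already queued or processed: avoids double-processing
        }
        todo.addLast(url);
        return true;
    }

    // Removes and returns the next pending URL, or null if none remain.
    String next() {
        return todo.pollFirst();
    }

    // Records that a page has been fully processed.
    void markProcessed(String url) {
        processed.add(url);
    }

    boolean isProcessed(String url) {
        return processed.contains(url);
    }

    int pendingCount() {
        return todo.size();
    }
}
```

Because `enqueue` rejects any URL it has seen before, a link that appears on many pages enters the todo-list only once.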

Furthermore, it should be possible to define filters on a per-URL basis to remove unwanted data. For example, on the www.dn.se site, a filter should be pre-applied that removes the page header and menus prior to sentence extraction, so that the words "news", "sports", "weather" etc. aren't included in the frequency counts for every single page.
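
One possible shape for such filters is sketched below, assuming filters are registered per host name and expressed as regular expressions that are deleted from the raw HTML before sentence extraction. The class `UrlFilters`, its methods and the `<nav>` markup in the usage note are all hypothetical; the real markup of a site such as www.dn.se would need to be inspected, and an HTML parser would be more robust than regular expressions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical per-host filter registry; illustration only.
final class UrlFilters {
    private final Map<String, List<Pattern>> byHost = new HashMap<>();

    // Registers a regular expression whose matches are removed from
    // every page fetched from the given host.
    void register(String host, String regex) {
        byHost.computeIfAbsent(host, h -> new ArrayList<>())
              .add(Pattern.compile(regex, Pattern.DOTALL | Pattern.CASE_INSENSITIVE));
    }

    // Returns the HTML with all regions matched by the host's filters removed.
    // Pages from hosts without filters are returned unchanged.
    String apply(String host, String html) {
        List<Pattern> patterns = byHost.get(host);
        if (patterns == null) {
            return html;
        }
        String result = html;
        for (Pattern pattern : patterns) {
            result = pattern.matcher(result).replaceAll("");
        }
        return result;
    }
}
```

For instance, after `register("www.example.com", "<nav>.*?</nav>")`, applying the filters to a page from that host would strip any `<nav>` block and leave the article text for sentence extraction.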

The spider is designed to be run in a server environment, since it will require significant CPU and bandwidth. The spider is a set of Java libraries that can be called from other Java programs or via dynamic web pages such as JSP/servlets.

When each page is visited, the following details are recorded:

  • the url
  • the date/time visited
  • the title of the page
  • the language of the page (blank if unknown)
  • the paragraphs/sentences extracted from the page
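
The details listed above could be held in a simple immutable value class such as the one below. This is only a sketch: the class name `PageVisit` and its field names are assumptions, and the actual database schema is not described here.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical record of one page visit; field names are illustrative.
final class PageVisit {
    final String url;
    final Instant visitedAt;      // date/time the page was visited
    final String title;
    final String language;        // empty string if the language is unknown
    final List<String> sentences; // paragraphs/sentences extracted from the page

    PageVisit(String url, Instant visitedAt, String title,
              String language, List<String> sentences) {
        this.url = url;
        this.visitedAt = visitedAt;
        this.title = title;
        this.language = (language == null) ? "" : language;
        // Defensive copy so later changes to the caller's list do not leak in.
        this.sentences = Collections.unmodifiableList(new ArrayList<>(sentences));
    }
}
```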

Online Interface

The interface allows:

  • the creation of new crawls
  • viewing the progress of current crawls
  • viewing and searching the list of pages processed so far
  • viewing and searching the list of pending pages
  • viewing sentences extracted from certain pages (linked from lists above)