The following are edited questions from a VOSON user, submitted via email:
-Is there a limit on the number of seed sites that we can use?
The number of seed sites does not by itself set a limit on the crawl. Crawl size varies considerably with the seed sites you select and the design of the crawl (depth of crawl, maximum number of inlinks and outlinks, etc.). The actual accounting measure for "crawl size" is VAU (VOSON Activity Units), an estimate that combines system resource use and crawl parameters.
As a rough guide, we do not recommend using a standard account to crawl more than about 500 seeds.
-What are 'inbound' links? Are these other sites that link to the seed? Do they only include inbound links from other seed sites?
Yes, inbound links are hyperlinks that point to the seed. Depending on how you set the webmining parameters and what the crawler finds, these may include links between seeds as well as external URLs linking to a given seed. You also have the option to crawl internal inbound links, i.e. in-links to all internal pages of a given seed.
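The distinction between seed-to-seed and external inbound links can be illustrated with a minimal Python sketch. It is not VOSON code: the function name `inbound_links`, the `include_internal_pages` flag and the example URLs are all hypothetical, and the sketch simply assumes a crawl result is available as a list of (source, target) URL pairs.

```python
from urllib.parse import urlparse

def parts(url):
    """Split a URL into (lowercase host, path without trailing slash)."""
    p = urlparse(url)
    return p.netloc.lower(), p.path.rstrip("/")

def inbound_links(links, seeds, include_internal_pages=False):
    """Group inbound links to each seed by where they come from.

    links: iterable of (source_url, target_url) pairs (assumed input format).
    include_internal_pages: hypothetical switch; when True, links pointing
    to any page on a seed's host count, not only links to the seed URL.
    """
    seed_pages = {parts(s) for s in seeds}
    seed_hosts = {h for h, _ in seed_pages}
    found = {h: {"from_seeds": [], "from_external": []} for h in seed_hosts}
    for source, target in links:
        src_host, _ = parts(source)
        tgt_host, tgt_path = parts(target)
        # Skip links that do not point at a seed, and links within one site.
        if tgt_host not in seed_hosts or src_host == tgt_host:
            continue
        # Unless requested, only count links to the seed URL itself.
        if not include_internal_pages and (tgt_host, tgt_path) not in seed_pages:
            continue
        key = "from_seeds" if src_host in seed_hosts else "from_external"
        found[tgt_host][key].append(source)
    return found

# Hypothetical example data.
seeds = ["http://a.example/", "http://b.example/"]
links = [
    ("http://b.example/page", "http://a.example/"),       # seed -> seed
    ("http://c.example/", "http://a.example/"),           # external -> seed
    ("http://c.example/blog", "http://a.example/about"),  # external -> internal page
    ("http://a.example/about", "http://a.example/"),      # same site, skipped
    ("http://a.example/", "http://c.example/"),           # not inbound to a seed, skipped
]
result = inbound_links(links, seeds)
result_internal = inbound_links(links, seeds, include_internal_pages=True)
```

With `include_internal_pages=True`, the link to `http://a.example/about` is counted as well, which mirrors the idea of collecting in-links to internal pages of a seed.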
-Text analysis features
Text analysis features include:
- Crosstabs->text tool provides frequency analysis of words, meta keywords, word-pairs (co-located words) [if available].
- Maps->concept tool provides maps showing how concepts/words cluster (using multidimensional scaling).
Text data collection:
- by default, meta keywords and the title are collected (for seed sites) and are available in database fields.
- by default, words are extracted from the pages that are crawled and are available in a field.
- if selected in the crawler options (this requires more VAU), co-located words are extracted using Link:Parser software. This only works for English.
You may also consider using VOSON as a tool for text data collection, then downloading the dataset and analysing the text data with tools such as R.
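As an illustration of the kind of analysis that can be run on a downloaded dataset, here is a minimal sketch of word and word-pair frequency counts. It uses Python rather than R purely for illustration, and it makes several assumptions: the page text is already loaded into a dictionary keyed by URL (the actual export format is not shown here), the function names are hypothetical, and adjacent word pairs stand in for co-located words, which is much simpler than parser-based extraction.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(pages):
    """Count how often each word occurs across all pages."""
    counts = Counter()
    for text in pages.values():
        counts.update(tokenize(text))
    return counts

def word_pair_frequencies(pages):
    """Count adjacent word pairs within each page (a crude proxy for
    co-located words; pairs never span two pages)."""
    counts = Counter()
    for text in pages.values():
        words = tokenize(text)
        counts.update(zip(words, words[1:]))
    return counts

# Hypothetical example data standing in for a downloaded dataset.
pages = {
    "http://a.example/": "Social network analysis of web links",
    "http://b.example/": "Web links and social network research",
}
word_counts = word_frequencies(pages)
pair_counts = word_pair_frequencies(pages)
```

Calling `word_counts.most_common(10)` or `pair_counts.most_common(10)` then lists the most frequent words or pairs. In practice you would probably also remove stop words before counting.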