Automation Action: Web Spider

Crawls a web site and returns a list of all URLs found.

Built-In Action

Crawls (spiders) a URL and returns a list of all URLs found. The list can be returned either as text with one URL per line, or as CSV or JSON containing each URL, Title, Description and Keywords.

The Web Spider Action only crawls the site at the specified URL. It does not crawl outbound links.

Specify the URL to spider.

Specify any Avoid Patterns (separated by semicolons). These are wildcard patterns that prevent matching URLs from being spidered. For example, if "*/assets/*" is added, then any URL containing "/assets/" is not spidered. The "*" character matches zero or more of any character.
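The wildcard rule described above can be illustrated with a short Python sketch. This is not the Action's own implementation: the function names are hypothetical, and it assumes that a pattern must match the whole URL and that matching is case-sensitive, neither of which is stated here.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate an avoid pattern in which '*' matches zero or more characters."""
    # Escape the literal pieces, then join them with '.*' where each '*' was.
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts))


def is_avoided(url: str, avoid_patterns: str) -> bool:
    """Return True if the URL matches any pattern in a semicolon-separated list."""
    for raw in avoid_patterns.split(";"):
        pattern = raw.strip()
        # Assumption: the pattern must match the entire URL (fullmatch).
        if pattern and pattern_to_regex(pattern).fullmatch(url):
            return True
    return False


if __name__ == "__main__":
    patterns = "*.js;*/assets/*"
    print(is_avoided("https://www.testsite.com/assets/logo.png", patterns))  # True
    print(is_avoided("https://www.testsite.com/page2.htm", patterns))        # False
```

Because "*" can match zero characters, a pattern such as "*/assets/*" also matches a URL that ends in "/assets/".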

Set the Maximum URLs that you want to spider for the site.

Enable the Chop Querystrings option to remove the ?query portion from any URLs. This can help avoid spidering auto-generated content.
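As a rough illustration of what chopping a query string means, the following Python sketch removes the ?query portion using the standard library. The function name is hypothetical, and keeping any #fragment is an assumption; the Action's exact behaviour may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def chop_querystring(url: str) -> str:
    """Return the URL with its ?query portion removed."""
    parts = urlsplit(url)
    # Assumption: only the query is dropped; any #fragment is kept as-is.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


if __name__ == "__main__":
    found = [
        "https://www.testsite.com/list.htm?page=1",
        "https://www.testsite.com/list.htm?page=2",
        "https://www.testsite.com/page2.htm",
    ]
    # URLs that differ only by query string collapse into a single entry.
    unique = list(dict.fromkeys(chop_querystring(u) for u in found))
    print("\n".join(unique))
```

This shows why the option helps with auto-generated pages: paginated or filtered views of the same page reduce to one URL.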

The Web Spider Action will check any robots.txt file. It will not download pages denied by robots.txt.
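To see how a robots.txt check of this kind works in general, here is a minimal Python sketch using the standard library's robots.txt parser. The robots.txt content, the "/private/" path and the function name are made up for illustration, and the user agent the Action actually sends is not documented here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(urls, robots_txt, user_agent="*"):
    """Return only the URLs that the given robots.txt rules permit."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    candidates = [
        "https://www.testsite.com/page2.htm",
        "https://www.testsite.com/private/report.htm",
    ]
    for url in allowed_urls(candidates, ROBOTS_TXT):
        print(url)
```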

The Return As option can be set to:

URLs one per line

For example:


https://www.testsite.com/
https://www.testsite.com/page2.htm              

CSV Containing URL, Title, Description, Keywords

For example:


URL,Title,Description,Keywords
https://www.testsite.com/,Title 1,Test Description 1,"keyword1,keyword2"
https://www.testsite.com/page2.htm,Title 2,Test Description 2,"keyword1,keyword2"

JSON Array Containing URL, Title, Description, Keywords

For example:


[
  {
    "URL": "https://www.testsite.com/",
    "Title": "Title 1",
    "Description": "Test Description 1",
    "Keywords": "keyword1,keyword2"
  },
  {
    "URL": "https://www.testsite.com/page2.htm",
    "Title": "Title 2",
    "Description": "Test Description 2",
    "Keywords": "keyword1,keyword2"
  }
]              
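If the results are later processed outside the automation, the CSV and JSON forms can be read with standard parsers. The Python sketch below uses sample data modelled on the examples above; the function names are hypothetical. Note that the Keywords field is quoted in the CSV because it contains commas, so a real CSV parser is safer than splitting each line on commas.

```python
import csv
import io
import json

# Sample data modelled on the documented output formats.
CSV_RESULT = '''URL,Title,Description,Keywords
https://www.testsite.com/,Title 1,Test Description 1,"keyword1,keyword2"
https://www.testsite.com/page2.htm,Title 2,Test Description 2,"keyword1,keyword2"
'''

JSON_RESULT = '''[
  {"URL": "https://www.testsite.com/", "Title": "Title 1",
   "Description": "Test Description 1", "Keywords": "keyword1,keyword2"},
  {"URL": "https://www.testsite.com/page2.htm", "Title": "Title 2",
   "Description": "Test Description 2", "Keywords": "keyword1,keyword2"}
]'''


def pages_from_csv(text: str) -> list:
    """Parse the CSV form into a list of dicts keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def pages_from_json(text: str) -> list:
    """Parse the JSON array form into a list of dicts."""
    return json.loads(text)


def keywords(page: dict) -> list:
    """Split the comma-separated Keywords field into a list."""
    return [k.strip() for k in page["Keywords"].split(",") if k.strip()]


if __name__ == "__main__":
    for page in pages_from_csv(CSV_RESULT):
        print(page["URL"], page["Title"], keywords(page))
```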

Select the variable to receive the results from the Assign To list.

You can also assign a list of outbound links found across all URLs spidered. Select the variable to receive outbound links from the Assign Outbound Links to list. Outbound links are returned as a text string with one link per line.

This Action is useful when you need to load content for an entire site, for example when adding a site to a Knowledge Store. You could first spider the site and then use the For..Each.. Line In action to loop through the URLs, adding each page's content to a Knowledge Store Collection and using the page title as the article title. For example:


// add site to knowledge store
URL =
URLS =
Title =
Content =
URLS = Web Spider URL https://www.mysite.com Avoid *.js;*/assets/*
For Each Line In %URLS% [Assign To URL]
   Content = HTTP Get From %URL% Convert To Markdown [Assign Title To: Title]
   If %Title% Is Not Blank Then
      Embedded Knowledge Store MyKnowledgeBase Update Title = %Title% %Content%
   End If
Next Loop

Note: This action may take several minutes for large sites.