Website

How to Complete the Website Scraping Form

To complete the Website loader form, fill in the fields below in order:

Step-by-Step Instructions:

1. Name (Required)

  • Enter a name for your data source in the text box

  • This helps you identify your website content later

  • Example: "Company Blog," "Product Documentation," or "News Articles"

  • The placeholder shows "My website" as an example

  • ⚠️ This field is required (red warning shown)

2. Website URL (Required)

  • Enter the main URL of the website you want to scrape

  • Example: https://example.com

  • Make sure to include https:// or http://

  • The system will extract content from multiple pages of this website

  • ⚠️ This field is required (red warning shown)
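
If you prepare URLs in a script before pasting them into the form, a quick check for the `https://` or `http://` prefix can catch mistakes early. The sketch below is only an illustration; the function name `is_valid_website_url` is hypothetical, and the form's own validation rules may differ.

```python
from urllib.parse import urlparse


def is_valid_website_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host name."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_valid_website_url("https://example.com"))  # True
print(is_valid_website_url("example.com"))          # False: the scheme is missing
```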

3. URLs to Exclude (Optional)

  • List any specific URLs you want to skip during scraping

  • Format: One URL per line, or separate multiple URLs with commas

  • Examples of URLs to exclude:

    • https://example.com/login

    • https://example.com/admin

    • https://example.com/contact

  • Use this to avoid scraping pages like login forms, admin panels, or irrelevant sections
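
To show how a list in this format could be interpreted, here is a minimal sketch that accepts one URL per line or comma-separated URLs. The helper names and the exact-match rule (ignoring a trailing slash) are assumptions made for illustration, not a description of how the loader matches URLs internally.

```python
import re


def parse_excluded_urls(raw: str) -> set[str]:
    """Split the field's text on newlines or commas and drop empty entries."""
    entries = re.split(r"[,\n]", raw)
    return {entry.strip().rstrip("/") for entry in entries if entry.strip()}


def is_excluded(url: str, excluded: set[str]) -> bool:
    """Exact-match check that ignores a trailing slash (an assumption)."""
    return url.strip().rstrip("/") in excluded


raw = "https://example.com/login, https://example.com/admin\nhttps://example.com/contact"
excluded = parse_excluded_urls(raw)
print(is_excluded("https://example.com/admin/", excluded))  # True
print(is_excluded("https://example.com/blog", excluded))    # False
```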

4. Chunk Size

  • Set how many tokens or characters should be in each chunk

  • Default value is 1024 (recommended for most cases)

  • You can change this number if needed

  • This controls how the website content is divided for processing
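
To get a feel for what a chunk size of 1024 means, the sketch below cuts text into fixed-size pieces. It assumes the size is counted in characters and that chunks do not overlap; the loader may count tokens or split differently, so treat this as an illustration only.

```python
def split_into_chunks(text: str, chunk_size: int = 1024) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = split_into_chunks("a" * 2500)
print([len(chunk) for chunk in chunks])  # [1024, 1024, 452]
```

A smaller chunk size produces more, shorter chunks; a larger one produces fewer, longer chunks.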

5. Metadata (Optional)

  • Add extra information about the website content

  • This helps the AI better understand the scraped content

  • Example: "Technical documentation from 2024" or "Marketing blog posts"

  • You can leave this empty if you don't have specific metadata

6. Text Splitter (Optional)

  • Default is set to "markdown"

  • This controls how the website text will be divided for processing

  • Usually, you don't need to change this setting
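
As a rough picture of what a markdown-style splitter does, the sketch below divides text at heading lines so that each section stays together. This is an assumed, simplified behavior (it does not handle `#` lines inside code blocks, for instance), and the function name is hypothetical; the loader's actual splitter may work differently.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[str]:
    """Split Markdown text into sections, each starting at a heading line."""
    # Split just before any line that begins with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [section.strip() for section in sections if section.strip()]


doc = "# Title\nIntro text\n## Details\nMore text"
for section in split_markdown_by_heading(doc):
    print(repr(section))
# '# Title\nIntro text'
# '## Details\nMore text'
```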

7. Select Embedding Model

  • Default "OpenAI - Text Embedding 3 Small" is already selected

  • You can change this by clicking the dropdown if needed

  • The embedding model converts your website content into numeric vectors so that it can be searched by meaning

8. Review Cost Warning

  • ⚠️ Important: Check the yellow warning box

  • Data import costs 4.580 tokens per page

  • This is higher than other loaders because it processes multiple web pages

  • Consider the size of the website you're scraping
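
Because the cost is charged per page, a quick estimate before importing is simply the page count multiplied by the per-page rate. In the sketch below, the function name is hypothetical and the rate of 10 is a placeholder; substitute the figure shown in the yellow warning box.

```python
def estimate_import_cost(page_count: int, tokens_per_page: float) -> float:
    """Estimate the total import cost as pages multiplied by the per-page rate."""
    if page_count < 0 or tokens_per_page < 0:
        raise ValueError("page_count and tokens_per_page must not be negative")
    return page_count * tokens_per_page


# Placeholder rate of 10 tokens per page; use the value from the warning box.
print(estimate_import_cost(page_count=50, tokens_per_page=10))  # 500
```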

9. Final Step

  • Click the "Save" button at the bottom

  • The system will crawl the website and extract content from multiple pages

  • The extracted text will then be ready for the AI to work with
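
For readers curious about what crawling multiple pages involves, here is a minimal breadth-first sketch. Everything in it is an assumption for illustration: the `crawl` function, the `get_links` callback (which stands in for fetching a page and reading its links), the same-host rule, and the `max_pages` limit are not taken from the loader itself.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urlparse


def crawl(start_url: str,
          get_links: Callable[[str], Iterable[str]],
          excluded: Iterable[str] = (),
          max_pages: int = 100) -> list[str]:
    """Breadth-first walk over same-host pages, skipping excluded URLs."""
    host = urlparse(start_url).netloc
    skip = set(excluded)
    visited: list[str] = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link in seen or link in skip:
                continue
            if urlparse(link).netloc != host:
                continue  # stay on the starting website
            seen.add(link)
            queue.append(link)
    return visited


# An in-memory link map stands in for real HTTP requests.
site = {
    "https://example.com": ["https://example.com/blog",
                            "https://example.com/login",
                            "https://other.org/"],
    "https://example.com/blog": ["https://example.com",
                                 "https://example.com/blog/post-1"],
}
pages = crawl("https://example.com", lambda url: site.get(url, []),
              excluded=["https://example.com/login"])
print(pages)
# ['https://example.com', 'https://example.com/blog', 'https://example.com/blog/post-1']
```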

Key Features:

  • Multi-Page Scraping: Extracts content from multiple pages of a website

  • Smart Filtering: Can exclude specific URLs you don't want

  • Clean Text Extraction: Removes HTML formatting and keeps readable content
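
The clean text extraction feature can be pictured with Python's standard `html.parser` module. The sketch below keeps visible text and skips `<script>` and `<style>` contents; the class and function names are hypothetical, and the loader's real extraction logic is not documented here, so this shows only the general idea.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        # Text nodes are joined with single spaces.
        return " ".join(self._parts)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = "<style>p{color:red}</style><h1>Title</h1><p>Hello world.</p><script>var x = 1;</script>"
print(html_to_text(page))  # Title Hello world.
```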

What Gets Scraped:

  • Page titles and headings

  • Main text content

  • Article content

  • Blog posts

  • Product descriptions

  • Any readable text on web pages

Perfect For:

  • Company websites and blogs

  • Documentation sites

  • News websites

  • Product catalogs

  • Knowledge bases

  • Educational content

  • Public information websites

Tips:

  • Start with smaller websites to test

  • Use URL exclusions to avoid unnecessary pages

  • Be mindful of the cost per page

  • Respect website terms of service

  • Some websites may block automated scraping

Simple Summary:

  1. Give your website scraping project a name

  2. Enter the main website URL

  3. Add URLs to exclude (optional, but recommended)

  4. Adjust settings if needed (or keep defaults)

  5. Add metadata about the content (optional)

  6. Click Save

  7. Done!

The system will crawl the website, extract text from multiple pages, and convert it into searchable text that the AI can use for analysis, question answering, and insights.
