Website
How to Complete the Website Scraping Form?
To complete the Website loader form, work through the fields below:

Step-by-Step Instructions:
1. Name (Required)
Enter a name for your data source in the text box
This helps you identify your website content later
Example: "Company Blog," "Product Documentation," or "News Articles"
The placeholder shows "My website" as an example
⚠️ This field is required (red warning shown)
2. Website URL (Required)
Enter the main URL of the website you want to scrape
Example: https://example.com
Make sure to include https:// or http://
The system will extract content from multiple pages of this website
⚠️ This field is required (red warning shown)
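If you want to check a URL before pasting it into the form, the rule above (the address must start with https:// or http://) can be sketched in a few lines of Python. The function name is_valid_website_url is hypothetical, and this is only an illustration of the rule, not the check the form itself runs.

```python
from urllib.parse import urlparse


def is_valid_website_url(url: str) -> bool:
    """Return True if the URL has an http/https scheme and a host name."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_valid_website_url("https://example.com"))  # True
print(is_valid_website_url("example.com"))          # False: scheme is missing
```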
3. URLs to Exclude (Optional)
List any specific URLs you want to skip during scraping
Format: One URL per line, or separate multiple URLs with commas
Examples of URLs to exclude:
https://example.com/login
https://example.com/admin
https://example.com/contact
Use this to avoid scraping pages like login forms, admin panels, or irrelevant sections
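The accepted format (one URL per line, or comma-separated) can be illustrated with a short Python sketch. The helper names parse_excluded_urls and should_skip are hypothetical, and the exact-match comparison is an assumption: the loader may match exclusions differently, for example by prefix.

```python
import re


def parse_excluded_urls(raw: str) -> set[str]:
    """Split the field on commas or newlines and drop blank entries."""
    parts = re.split(r"[,\n]+", raw)
    # Trailing slashes are removed so /login and /login/ compare equal.
    return {p.strip().rstrip("/") for p in parts if p.strip()}


def should_skip(url: str, excluded: set[str]) -> bool:
    """Assumed rule: skip a page only when its URL is listed exactly."""
    return url.rstrip("/") in excluded


field_value = """https://example.com/login
https://example.com/admin, https://example.com/contact"""
excluded = parse_excluded_urls(field_value)
print(should_skip("https://example.com/login", excluded))  # True
print(should_skip("https://example.com/blog", excluded))   # False
```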
4. Chunk Size
Set how many tokens or characters should be in each chunk
Default value is 1024 (recommended for most cases)
You can change this number if needed
This controls how the website content is divided for processing
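To get a feel for what the chunk size does, here is a minimal Python sketch that cuts text into fixed-size pieces. It assumes the size is counted in characters and that chunks do not overlap. Both are assumptions, since the form may count tokens instead, and the function name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1024) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = chunk_text("a" * 2500)
print([len(c) for c in chunks])  # [1024, 1024, 452]
```

A smaller chunk size produces more, shorter chunks; a larger one produces fewer, longer chunks.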
5. Metadata (Optional)
Add extra information about the website content
This helps the AI better understand the scraped content
Example: "Technical documentation from 2024" or "Marketing blog posts"
You can leave this empty if you don't have specific metadata
6. Text Splitter (Optional)
Default is set to "markdown"
This controls how the website text will be divided for processing
Usually, you don't need to change this setting
7. Select Embedding Model
Default "OpenAI - Text Embedding 3 Small" is already selected
You can change this by clicking the dropdown if needed
This model converts your website content into embeddings so the AI can search and understand it
8. Review Cost Warning
⚠️ Important: Check the yellow warning box
Data import costs 4.580 tokens per page
This is higher than other loaders because it processes multiple web pages
Consider the size of the website you're scraping
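Because the cost is charged per page, the total scales with the number of pages crawled. The Python sketch below shows that arithmetic. The function name estimate_import_cost is hypothetical, and the numbers in the usage line are placeholders: use the per-page figure shown in the yellow warning box and your own page count.

```python
def estimate_import_cost(page_count: int, tokens_per_page: float) -> float:
    """Estimate total import cost as pages crawled times cost per page."""
    if page_count < 0 or tokens_per_page < 0:
        raise ValueError("page_count and tokens_per_page must be non-negative")
    return page_count * tokens_per_page


# Placeholder values for illustration only.
print(estimate_import_cost(page_count=10, tokens_per_page=100))  # 1000
```

Excluding URLs you do not need lowers the page count and therefore the estimate.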
9. Final Step
Click the "Save" button at the bottom
The system will crawl the website and extract content from multiple pages
The text content will be ready for AI to work with
Key Features:
Multi-Page Scraping: Extracts content from multiple pages of a website
Smart Filtering: Can exclude specific URLs you don't want
Clean Text Extraction: Removes HTML formatting and keeps readable content
What Gets Scraped:
Page titles and headings
Main text content
Article content
Blog posts
Product descriptions
Any readable text on web pages
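As a rough illustration of clean text extraction, the Python sketch below strips HTML tags and keeps only readable text, using the standard library's html.parser. This is an assumption about the general technique, not the loader's actual implementation, and the names TextExtractor and extract_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped tag.
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_text(html: str) -> str:
    """Return the readable text of an HTML document, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><head><title>Blog</title><style>p{color:red}</style></head>"
    "<body><h1>Hello</h1><p>First post.</p><script>var x = 1;</script></body></html>"
)
print(extract_text(page))
```

The page title, heading, and paragraph text are kept, while the style rules and script code are dropped.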
Perfect For:
Company websites and blogs
Documentation sites
News websites
Product catalogs
Knowledge bases
Educational content
Public information websites
Tips:
Start with smaller websites to test
Use URL exclusions to avoid unnecessary pages
Be mindful of the cost per page
Respect website terms of service
Some websites may block automated scraping
Simple Summary:
Give your website scraping project a name
Enter the main website URL
Add URLs to exclude (optional, but recommended)
Adjust settings if needed (or keep defaults)
Add metadata about the content (optional)
Click Save
Done!
The system crawls the website, extracts text from multiple pages, and converts it into searchable text that the AI can use for analysis, questions, and insights.