Webpage

How to Complete the Webpage Scraping Form?

Looking at this Webpage loader form, here's what you need to do to complete it:

Step-by-Step Instructions:

1. Name (Required)

  • Enter a name for your data source in the text box

  • This helps you identify your webpage content later

  • Example: "Product Pages," "Article Collection," or "Documentation Pages"

  • The placeholder shows "Give a name" as an example

2. URLs (Required)

  • Enter one or more URLs separated by commas

  • Example: https://example.com, https://another-example.com

  • You can add multiple specific webpage URLs

  • ⚠️ This field is required (red warning shown)
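Since the field takes a comma-separated list, it helps to see how such input is typically split and validated. The sketch below is illustrative only (the loader's actual parsing is not documented here); `parse_url_field` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def parse_url_field(raw: str) -> list[str]:
    """Split a comma-separated URL field and keep only well-formed http(s) URLs."""
    urls = []
    for part in raw.split(","):
        candidate = part.strip()
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls

print(parse_url_field("https://example.com, https://another-example.com"))
# ['https://example.com', 'https://another-example.com']
```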

3. Fetch Content from Page Sub-URLs?

  • Toggle switch: Turn ON or OFF

  • When ON: Also scrapes linked pages found on the main pages

  • When OFF: Only scrapes the exact URLs you provided

  • Useful for getting related content automatically
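To make the sub-URL behavior concrete: fetching sub-URLs means collecting the links (`<a href="...">`) found on each main page and scraping those too. A minimal sketch of link collection using only the Python standard library, assuming nothing about how the loader itself does it:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute sub-URLs from the anchor tags on a page."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

collector = LinkCollector("https://example.com")
collector.feed('<a href="/docs">Docs</a> <a href="https://other.com/page">Other</a>')
print(collector.links)
# ['https://example.com/docs', 'https://other.com/page']
```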

4. Enable Pagination

  • Toggle switch: Turn ON or OFF

  • When ON: Follows pagination links (page 1, 2, 3, etc.)

  • When OFF: Only scrapes the specific pages you listed

  • Note: Only works with pagination systems where page numbers are in the URL as query parameters

  • Example: https://example.com/articles?page=1

5. Page Key (For Pagination)

  • The parameter name used for pagination

  • Common examples: page, p, pagenum

  • Example: If the URL is https://site.com/articles?page=2, enter page

  • Only needed if pagination is enabled

6. Start Pagination Page

  • The page number at which to start fetching

  • Usually starts at 1 or 0

  • Enter the first page number you want to scrape

7. End Pagination Page

  • The last page number to fetch or process

  • Enter the final page number you want to scrape

  • Avoid setting this too high, since every extra page adds to the import cost
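Steps 4-7 combine into a simple URL pattern: one URL per page, with the page key set as a query parameter. A sketch of the resulting URL list (the loader's internals are assumed, not documented; `paginated_urls` is a hypothetical helper):

```python
from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl

def paginated_urls(base_url: str, page_key: str, start: int, end: int) -> list[str]:
    """Build one URL per page by setting the pagination query parameter."""
    parts = urlparse(base_url)
    urls = []
    for page in range(start, end + 1):
        query = dict(parse_qsl(parts.query))  # keep any existing parameters
        query[page_key] = str(page)
        urls.append(urlunparse(parts._replace(query=urlencode(query))))
    return urls

print(paginated_urls("https://blog.example.com/articles", "page", 1, 3))
# ['https://blog.example.com/articles?page=1',
#  'https://blog.example.com/articles?page=2',
#  'https://blog.example.com/articles?page=3']
```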

8. Chunk Size

  • Set how many tokens or characters should be in each chunk

  • Default value is 1024 (recommended for most cases)

  • This controls how the webpage content is divided for processing
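To illustrate what chunk size controls, here is a deliberately simplified character-based chunker. Real loaders usually count tokens rather than characters and respect word or sentence boundaries, so treat this only as a sketch of the idea:

```python
def chunk_text(text: str, chunk_size: int = 1024) -> list[str]:
    """Split text into fixed-size character chunks (simplified:
    no token counting, no word-boundary handling)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = chunk_text("x" * 2500, chunk_size=1024)
print([len(c) for c in chunks])  # [1024, 1024, 452]
```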

9. Metadata (Optional)

  • Add extra information about the webpages

  • This helps the AI better understand the scraped content

  • Example: "Product documentation from Q4 2024"

  • You can leave this empty if you don't have specific metadata

10. Text Splitter (Optional)

  • Default is set to "markdown"

  • This controls how the webpage text will be divided for processing

  • Usually, you don't need to change this setting
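A "markdown" splitter typically breaks text at headings rather than at arbitrary character positions, keeping each section intact. The form doesn't document the exact behavior, so the following is just one plausible interpretation:

```python
import re

def split_markdown(text: str) -> list[str]:
    """Split markdown text at top-level '#' headings -- one simple
    interpretation of a heading-aware 'markdown' splitter."""
    sections = re.split(r"(?m)^(?=# )", text)
    return [s for s in sections if s.strip()]

doc = "# Intro\nSome text.\n# Usage\nMore text."
print(len(split_markdown(doc)))  # 2
```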

11. Select Embedding Model

  • Default "OpenAI - Text Embedding 3 Small" is already selected

  • You can change this by clicking the dropdown if needed

  • This model processes and understands your webpage content

12. Review Cost Warning

  • ⚠️ Important: Check the yellow warning box

  • Data import costs 0.540 tokens per page

  • Consider the number of pages you're planning to scrape
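Since cost scales with page count, it is worth doing the multiplication before clicking Save. A quick estimate using the rate from the warning box (the helper name and shape are ours, not the platform's):

```python
COST_PER_PAGE = 0.540  # token cost per page, from the form's warning box

def estimate_cost(num_pages: int) -> float:
    """Rough import cost: pages scraped times the per-page token rate."""
    return num_pages * COST_PER_PAGE

# e.g. pagination from page 1 to page 5 of a single URL = 5 pages
print(estimate_cost(5))  # about 2.7 tokens
```

Remember that sub-URL fetching can multiply the page count well beyond the URLs you typed in.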

13. Final Step

  • Click the "Save" button at the bottom

  • The system will scrape the specified webpages

  • The content will be ready for AI to work with

Key Features:

  • Multiple URLs: Can scrape several specific webpages at once

  • Sub-URL Fetching: Option to follow links found on the main pages

  • Pagination Support: Can automatically follow paginated content

  • Flexible Configuration: Control exactly what gets scraped

Pagination Example:

If you want to scrape a blog with pagination:

  • URLs: https://blog.example.com/articles

  • Enable pagination: ON

  • Page key: page

  • Start page: 1

  • End page: 5

This will scrape: https://blog.example.com/articles?page=1, ?page=2, etc.

Simple Summary:

  1. Give your webpage collection a name

  2. Enter the URLs you want to scrape (separated by commas)

  3. Choose if you want to fetch sub-URLs (optional)

  4. Configure pagination if needed (optional)

  5. Adjust other settings if needed (or keep defaults)

  6. Add metadata (optional)

  7. Click Save

  8. Done!

The system will scrape the specified webpages and convert the content into searchable text that AI can understand and work with!
