Web-Page Screenshot Automatic Segmentation Tool

Background

Long screenshots are a practical way to share or analyze web content because they capture an entire page in a single image. The difficulty lies in keeping them intact and readable while still making them convenient for later processing. For example, as of early 2024, mainstream AI image models still cannot handle very large, complex pictures: if a long screenshot is fed to a model as-is, the output is poor because many details go unrecognized. To solve this problem, I developed a tool based on OpenCV that simplifies the handling of long screenshots while preserving the integrity and readability of their content.

This project is open-sourced on my GitHub: https://github.com/Tim-Saijun/Web-page-Screenshot-Segmentation

Unlike many existing tools and methods, Web-page Screenshot Segmentation uses OpenCV to identify the natural separation lines in web content and automatically pick the most suitable segmentation points. As a result, titles, paragraphs, and charts stay whole in the segmented images, with no content cut in half or left out.

Using Web-page Screenshot Segmentation is simple: give it a long screenshot, and the tool analyzes the image and decides where to cut. You end up with a series of complete, well-structured images that are convenient to share and process further.

Introduction

This project aims to segment a webpage’s long screenshot into several parts based on the height of the text. The main idea is to find low-variation areas in the image, and then to locate division lines within those areas.
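To make the idea concrete, here is a minimal sketch of "find low-variation bands, then cut inside them". It is not the tool's actual implementation: the function names, the thresholds, and the choice to cut in the middle of each band are all assumptions made for illustration, and the image is a toy grayscale grid of nested lists rather than an OpenCV array.

```python
from statistics import pstdev


def low_variation_bands(rows, variation_threshold=0.5, height_threshold=3):
    """Return (top, bottom) row ranges whose per-row pixel spread stays low.

    rows: grayscale image as a list of rows, each a list of pixel values.
    Both thresholds are illustrative values, not the tool's defaults.
    """
    bands, start = [], None
    for y, row in enumerate(rows):
        flat = pstdev(row) <= variation_threshold
        if flat and start is None:
            start = y  # a flat band begins here
        elif not flat and start is not None:
            if y - start >= height_threshold:
                bands.append((start, y))  # keep only bands that are tall enough
            start = None
    # Close a band that runs to the bottom edge of the image.
    if start is not None and len(rows) - start >= height_threshold:
        bands.append((start, len(rows)))
    return bands


def division_lines(rows, **kwargs):
    """Place one division line in the middle of each low-variation band (an assumption)."""
    return [(top + bottom) // 2 for top, bottom in low_variation_bands(rows, **kwargs)]


# Toy image: 4 "text" rows, 4 blank rows, 3 "text" rows, 5 blank rows.
text, blank = [0, 255, 0, 255], [255, 255, 255, 255]
image = [text] * 4 + [blank] * 4 + [text] * 3 + [blank] * 5
lines = division_lines(image)
print(lines)  # [6, 13]
```

On a real screenshot the same per-row statistic would normally be computed on a NumPy array loaded through OpenCV, which is far faster than looping over Python lists.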


The output is a set of smaller, self-contained images of the webpage, which can be used to generate webpages with Screen-to-code or to train models. More results can be found in the images directory.

Getting Started

Installation

pip install Web-page-Screenshot-Segmentation

Usage in Command Line

To get the heights of the segmentation lines for an image:

python -m Web_page_Screenshot_Segmentation.master -f "path/to/img"

The output should be a list: [6, 868, 1912, 2672, 3568, 4444, 5124, 6036, 7698]. These are the heights of the image division lines. If you want to display these lines in the picture, you can add the -s True parameter:

python -m Web_page_Screenshot_Segmentation.master -f "path/to/img" -s True

Drawing Division Lines in the Image

python -m Web_page_Screenshot_Segmentation.drawer --image_file path/to/image.jpg --hl "[100,200]" --color "(0,255,0)"

Segmenting the Image

python -m Web_page_Screenshot_Segmentation.spliter --f path/to/image.jpg -ht "[233,456]"

The segmented images are saved in the path returned by the command.
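To show what cutting at a list of heights amounts to, the sketch below turns division-line heights into (top, bottom) crop ranges. The helper name segments_from_heights and the image height of 7700 pixels are hypothetical, and whether the tool itself keeps the thin slices at the very top and bottom is not something this sketch claims to know.

```python
import ast


def segments_from_heights(heights, image_height):
    """Turn division-line heights into (top, bottom) crop ranges.

    Heights outside the image (or on its edges) are ignored, and duplicates
    are collapsed, so every returned range has a positive height.
    """
    cuts = sorted({h for h in heights if 0 < h < image_height})
    edges = [0] + cuts + [image_height]
    return list(zip(edges, edges[1:]))


# The list printed by the master command earlier, parsed from its text form.
printed = "[6, 868, 1912, 2672, 3568, 4444, 5124, 6036, 7698]"
heights = ast.literal_eval(printed)

# 7700 is an assumed image height, used only for this illustration.
ranges = segments_from_heights(heights, image_height=7700)
for top, bottom in ranges:
    print(f"rows {top}..{bottom} ({bottom - top} px tall)")
```

With an image loaded as an array, each range would correspond to a row slice such as image[top:bottom].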

For more usage explanations, refer to the help:

python master.py --help
python spliter.py --help

Usage From Source Code

split_heights Function

The split_heights function is used to segment the image into several parts based on various thresholds. It accepts the following parameters:

  • file_path: The path to the image file.
  • split: A boolean indicating whether to segment the image.
  • height_threshold: The height threshold for low-variation areas.
  • variation_threshold: The variation threshold for low-variation areas.
  • color_threshold: The threshold for color differences.
  • color_variation_threshold: The threshold for changes in color differences.
  • merge_threshold: The threshold for the minimum distance between two lines.

If split is False, the function returns a list of division line heights; if split is True, it returns the path to the divided images.
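Of these parameters, merge_threshold is perhaps the easiest to picture. The following sketch shows one plausible reading of "minimum distance between two lines": walk the candidate lines from top to bottom and drop any line that sits too close to the last one kept. The helper merge_close_lines and the candidate heights are hypothetical; the library may merge lines differently (for example, by averaging neighbors).

```python
def merge_close_lines(heights, merge_threshold):
    """Keep a line only if it is at least merge_threshold below the last kept line."""
    kept = []
    for h in sorted(heights):
        if not kept or h - kept[-1] >= merge_threshold:
            kept.append(h)
    return kept


# Made-up candidate lines, some of them only a few dozen pixels apart.
candidates = [6, 120, 868, 900, 1912, 2100, 2672]
merged = merge_close_lines(candidates, merge_threshold=350)
print(merged)  # [6, 868, 1912, 2672]
```

A larger merge_threshold therefore yields fewer, taller segments, while a smaller one keeps more of the candidate lines.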

Example Usage

import Web_page_Screenshot_Segmentation
from Web_page_Screenshot_Segmentation.master import split_heights

# Splitting the image at 'path/to/image.jpg' into several parts
split_image_path = split_heights(
    file_path='path/to/image.jpg',
    split=True,
    height_threshold=102,
    variation_threshold=0.5,
    color_threshold=100,
    color_variation_threshold=15,
    merge_threshold=350
)

print(f"The split image is saved at {split_image_path}")

In this example, the image at ‘path/to/image.jpg’ is split into several parts based on the provided thresholds. The split images are saved in the path returned by the function.

draw_line_from_file Function

The draw_line_from_file function is used to draw lines at specified heights on an image. It accepts the following parameters:

  • image_file: The path to the image file.
  • heights: A list of heights at which to draw the lines.
  • color: The color of the lines. The default color is red (0, 0, 255); note that OpenCV orders channels as BGR, not RGB.

This function reads an image from the provided file path, draws lines at the specified heights, and then saves the modified image to a new file. The new file is saved in the result directory, with the same name as the original file, but with ‘result’ added before the file extension.

If the function encounters an error while reading the image file (for instance, if the file path includes ‘.’ or Chinese characters), it will raise an exception.
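The drawing step itself is simple, and the sketch below mimics it on a toy image so the BGR color order is easy to see. It is not the library's code: draw_lines and its thickness parameter are hypothetical, and the image is a nested list of (B, G, R) tuples instead of an OpenCV array, where a call such as cv2.line would typically do this work.

```python
def draw_lines(image, heights, color=(0, 0, 255), thickness=1):
    """Return a copy of image with horizontal lines painted at the given heights.

    image: list of rows, each a list of (B, G, R) tuples, mirroring OpenCV's
    channel order. Heights that fall outside the image are skipped.
    """
    out = [list(row) for row in image]  # copy so the input stays untouched
    for h in heights:
        for y in range(max(h, 0), min(h + thickness, len(out))):
            out[y] = [color] * len(out[y])
    return out


white = (255, 255, 255)
image = [[white] * 4 for _ in range(6)]

# (0, 255, 0) is green in both BGR and RGB; the default (0, 0, 255) is red only in BGR.
marked = draw_lines(image, [2, 4], color=(0, 255, 0))
```

Returning a copy instead of painting in place keeps the original pixels available, which matches the documented behavior of writing the result to a new file.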

Sample Usage

import Web_page_Screenshot_Segmentation
from Web_page_Screenshot_Segmentation.spliter import draw_line_from_file

# Drawing lines at heights 100 and 200 on the image at 'path/to/image.jpg'
result_image_path = draw_line_from_file(
    image_file='path/to/image.jpg',
    heights=[100, 200],
    color=(0, 255, 0)  # Drawing the lines in green
)

print(f"The modified image is saved at {result_image_path}")

In this example, the image at ‘path/to/image.jpg’ is modified by drawing green lines at heights 100 and 200. The modified image is saved in the path returned by the function.
