Automatic Segmentation Tool for Long Webpage Screenshots

1 Background

Long screenshots are a practical way to share or analyze web content because they capture the whole page. Processing them while keeping the information intact and readable, in a form that suits later steps, has long been a challenge. For example, as of early 2024, mainstream AI image models still cannot handle very large, complex images: feeding a long screenshot directly into a model degrades its performance, and many details go unrecognized. To address this, I developed an OpenCV-based tool that simplifies the processing of long screenshots while preserving their content integrity and readability.

This project is open source on my GitHub: https://github.com/Ryaang/Web-page-Screenshot-Segmentation

Unlike many existing tools or methods, Web-page-Screenshot-Segmentation uses OpenCV to identify and follow the natural dividing lines of web content and to pick the most suitable segmentation points automatically. Whether the content is a title, a paragraph, or a chart, it stays intact in the segmented images, with nothing cut in half or left out.

Using Web-page-Screenshot-Segmentation is very simple. You only need to prepare a long screenshot, and the tool will automatically analyze the image content and intelligently decide the segmentation points. The result will be a series of complete and well-structured images, convenient for sharing and further processing.

2 Introduction

This project is used to segment long screenshots of web pages into several parts based on the height of the text. The main idea is to find the low-variation areas of the image and then find the segmentation lines in these areas.
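The idea above can be illustrated with a minimal sketch. This is not the project's actual implementation: it uses plain Python lists instead of OpenCV arrays, and the helper names `low_variation_bands` and `candidate_lines`, as well as the choice of the band midpoint as the cut, are assumptions made for illustration only.

```python
from statistics import pvariance


def low_variation_bands(rows, variation_threshold=0.5, height_threshold=3):
    """Return (start, end) row ranges whose per-row pixel variance stays low.

    rows is a grayscale image given as a list of rows of pixel values.
    end is exclusive. Bands shorter than height_threshold are ignored.
    """
    bands = []
    start = None
    for y, row in enumerate(rows):
        flat = pvariance(row) <= variation_threshold
        if flat and start is None:
            start = y  # a low-variation band begins here
        elif not flat and start is not None:
            if y - start >= height_threshold:
                bands.append((start, y))
            start = None
    # Close a band that runs to the bottom edge of the image
    if start is not None and len(rows) - start >= height_threshold:
        bands.append((start, len(rows)))
    return bands


def candidate_lines(bands):
    """Use the middle row of each band as a candidate cut (an assumption)."""
    return [(start + end) // 2 for start, end in bands]


# Toy image: 4 "text" rows, 4 blank rows, 3 "text" rows, 2 blank rows
text_row = [0, 255, 0, 255, 0, 255]
blank_row = [255] * 6
image = [text_row] * 4 + [blank_row] * 4 + [text_row] * 3 + [blank_row] * 2

bands = low_variation_bands(image)
print(bands)                   # [(4, 8)]
print(candidate_lines(bands))  # [6]
```

The trailing two blank rows are shorter than `height_threshold`, so they produce no band. A real screenshot would also need the color-based checks that the tool's parameters suggest.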


The output is a set of small but complete images of the webpage, which can be used to generate webpages or to train models using Screen-to-code. More results can be found in the images directory.

3 Getting Started

3.1 Installation

pip install Web-page-Screenshot-Segmentation

4 Using in Command Line

Get the heights of the segmentation lines of the image:

python -m Web_page_Screenshot_Segmentation.master -f "path/to/img"

The output should be a list such as [6, 868, 1912, 2672, 3568, 4444, 5124, 6036, 7698]. It is a list of the heights of the image's segmentation lines. To draw these segmentation lines on the image, add the -s True parameter:
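To make the meaning of such a list concrete, here is a small sketch that turns cut heights into (top, bottom) ranges. The helper `segments_from_lines` and the image height of 8000 are hypothetical, and whether the tool itself treats the top and bottom edges as segment boundaries is an assumption.

```python
def segments_from_lines(lines, image_height):
    """Turn cut heights into (top, bottom) pixel ranges, bottom exclusive."""
    # Keep only cuts that fall strictly inside the image, without duplicates
    inner = sorted({y for y in lines if 0 < y < image_height})
    bounds = [0] + inner + [image_height]
    return list(zip(bounds[:-1], bounds[1:]))


lines = [6, 868, 1912, 2672, 3568, 4444, 5124, 6036, 7698]
image_height = 8000  # hypothetical; the real value comes from the image

for top, bottom in segments_from_lines(lines, image_height):
    print(f"rows {top}..{bottom} ({bottom - top} px tall)")
```

With an image loaded as an array, each range would correspond to a slice such as `img[top:bottom]`.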

python -m Web_page_Screenshot_Segmentation.master -f "path/to/img" -s True

4.1 Draw Segmentation Lines in the Image

python -m Web_page_Screenshot_Segmentation.drawer --image_file path/to/image.jpg --hl "[100,200]" --color "(0,255,0)"

4.2 Split Image

python -m Web_page_Screenshot_Segmentation.spliter --f path/to/image.jpg -ht "[233,456]"

You will get the segmented images saved in the path returned by the command.

For more usage explanations, please refer to the help:

python master.py --help
python spliter.py --help

5 Using from Source Code

5.1 split_heights Function

The split_heights function is used to segment the image into several parts based on various thresholds. It accepts the following parameters:

  • file_path: The path of the image file.
  • split: A boolean indicating whether to split the image.
  • height_threshold: The height threshold of the low-variation area.
  • variation_threshold: The variation threshold of the low-variation area.
  • color_threshold: The color difference threshold.
  • color_variation_threshold: The color difference variation threshold.
  • merge_threshold: The minimum distance threshold between two lines.

If split is False, the function returns a list of the heights of the segmentation lines; if split is True, it returns the path of the segmented images.
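As an illustration of what a minimum-distance rule like merge_threshold can do, the sketch below drops any line that falls too close to the previously kept one. The helper `merge_close_lines` is hypothetical, and the tool may resolve close lines differently (for example by keeping a different line of the pair).

```python
def merge_close_lines(lines, merge_threshold=350):
    """Keep a line only if it is at least merge_threshold below the last kept one."""
    kept = []
    for y in sorted(lines):
        if not kept or y - kept[-1] >= merge_threshold:
            kept.append(y)
    return kept


print(merge_close_lines([6, 120, 868, 900, 1912]))  # [6, 868, 1912]
```

Here 120 and 900 are discarded because they lie within 350 pixels of 6 and 868 respectively.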

5.1.1 Example Usage

import Web_page_Screenshot_Segmentation
from Web_page_Screenshot_Segmentation.master import split_heights

# Split the image at 'path/to/image.jpg' into several parts
split_image_path = split_heights(
    file_path='path/to/image.jpg',
    split=True,
    height_threshold=102,
    variation_threshold=0.5,
    color_threshold=100,
    color_variation_threshold=15,
    merge_threshold=350
)

print(f"The segmented images are saved in {split_image_path}")

In this example, the image at ‘path/to/image.jpg’ is segmented into several parts based on the provided thresholds. The segmented images are saved in the path returned by the function.

5.2 draw_line_from_file Function

The draw_line_from_file function is used to draw lines on the image at specified heights. It accepts the following parameters:

  • image_file: The path of the image file.
  • heights: A list of heights at which to draw lines.
  • color: The color of the lines, in OpenCV's BGR channel order. The default color is red (0, 0, 255).

The function reads the image from the provided file path, draws lines at the specified heights, and then saves the modified image to a new file. The new file is saved in the result directory, with the same name as the original file but with ‘result’ added before the file extension.
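The naming rule described above can be sketched with pathlib. This is only an illustration: the helper `result_path` is hypothetical, and both the exact form of the inserted text and the location of the result directory (taken here as relative to the current working directory) are assumptions.

```python
from pathlib import Path


def result_path(image_file, out_dir="result", tag="result"):
    """Build an output path with tag inserted before the file extension."""
    src = Path(image_file)
    return Path(out_dir) / f"{src.stem}{tag}{src.suffix}"


print(result_path("path/to/image.jpg").as_posix())  # result/imageresult.jpg
```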

If the function cannot read the image file (e.g., if the file path contains ‘.’ or Chinese characters), it raises an exception.

5.2.1 Example Usage

import Web_page_Screenshot_Segmentation
from Web_page_Screenshot_Segmentation.spliter import draw_line_from_file

# Draw lines at heights 100 and 200 on the image at 'path/to/image.jpg'
result_image_path = draw_line_from_file(
    image_file='path/to/image.jpg',
    heights=[100, 200],
    color=(0, 255, 0)  # Draw lines in green
)

print(f"The modified image is saved in {result_image_path}")

In this example, the image at ‘path/to/image.jpg’ is modified to draw green lines at heights 100 and 200. The modified image is saved in the path returned by the function.
