Crawling Some Images From Websites

(Chinese Doc: https://deepghs.github.io/waifuc/main/tutorials-CN/crawl_images/index.html)

How to Crawl Data From Websites?

In fact, waifuc can crawl from many websites, not just Danbooru. But before we get started, allow me to formally introduce another of my waifus, Amiya, a cute bunny girl. (You may ask why I have so many waifus. Well, as you know, anime lovers have an infinite number of waifus; they are all my honeys and angels 😍)

../../_images/amiya1.png

Zerochan

Zerochan is a website with many high-quality images, and we can crawl from it in a straightforward way. Since it hosts a large number of images, we will only grab the first 50; the following code achieves this:

from waifuc.export import SaveExporter
from waifuc.source import ZerochanSource

if __name__ == '__main__':
    s = ZerochanSource('Amiya')
    # the [:50] slice means we only need the first 50 images;
    # if you need all images from Zerochan,
    # just use s.export(...) without the slice
    s[:50].export(
        SaveExporter('/data/amiya_zerochan')
    )

Please note that we use SaveExporter here instead of the TextualInversionExporter used previously. Its function will be explained in the following sections. The data crawled here will be stored locally in the /data/amiya_zerochan directory, as shown below:

../../_images/zerochan_simple_501.png

However, we’ve noticed an issue—Zerochan has many member-only images, requiring login to access. To address this, we can use our username and password for authentication to obtain more and higher-quality images:

from waifuc.export import SaveExporter
from waifuc.source import ZerochanSource

if __name__ == '__main__':
    s = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
    )
    s[:50].export(
        SaveExporter('/data/amiya_zerochan')
    )

Indeed, we successfully obtained many member-only images, as shown below:

../../_images/zerochan_login_501.png

However, many of these images have relatively low resolutions, with few exceeding 1000 pixels in either dimension. This is because Zerochan defaults to using the large size to speed up downloads. If you need larger images, you can modify the size selection like this:

from waifuc.export import SaveExporter
from waifuc.source import ZerochanSource

if __name__ == '__main__':
    s = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
        select='full',
    )
    s[:50].export(
        SaveExporter('/data/amiya_zerochan')
    )

After crawling, all the images will be in full size.

However, there’s still an issue—many high-quality images are official promotional art, and since Amiya is a main character, she often appears in group art. We actually need images that only feature her. No problem, just set the search mode to strict:

from waifuc.export import SaveExporter
from waifuc.source import ZerochanSource

if __name__ == '__main__':
    s = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
        select='full',
        strict=True,
    )
    s[:50].export(
        SaveExporter('/data/amiya_zerochan')
    )

Now we have high-quality images of Amiya alone from Zerochan, as shown below:

../../_images/zerochan_login_50_full_strict1.png

Danbooru

Clearly, Danbooru can also be crawled easily:

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource

if __name__ == '__main__':
    s = DanbooruSource(['amiya_(arknights)'])
    s[:50].export(
        SaveExporter('/data/amiya_danbooru')
    )

Moreover, on Danbooru and many similar sites, you can directly collect solo images by adding the solo tag, like this:

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource

if __name__ == '__main__':
    s = DanbooruSource(['amiya_(arknights)', 'solo'])
    s[:50].export(
        SaveExporter('/data/amiya_solo_danbooru')
    )

Pixiv

waifuc also supports crawling from Pixiv, including keyword searches, artist-specific crawls, and ranking-based crawls.

We can use PixivSearchSource to crawl images based on keywords, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import PixivSearchSource

if __name__ == '__main__':
    s = PixivSearchSource(
        'アークナイツ (amiya OR アーミヤ OR 阿米娅)',
        refresh_token='your_pixiv_refresh_token',
    )
    s[:50].export(
        SaveExporter('/data/amiya_pixiv')
    )

We can also use PixivUserSource to crawl images from a specific artist, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import PixivUserSource

if __name__ == '__main__':
    s = PixivUserSource(
        2864095,  # pixiv user 2864095
        refresh_token='your_pixiv_refresh_token',
    )
    s[:50].export(
        SaveExporter('/data/pixiv_user_misaka_12003')
    )

We can use PixivRankingSource to crawl images from the ranking, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import PixivRankingSource

if __name__ == '__main__':
    s = PixivRankingSource(
        mode='day',  # daily ranking list
        refresh_token='your_pixiv_refresh_token',
    )
    s[:50].export(
        SaveExporter('/data/pixiv_daily_best')
    )

Anime-Pictures

Anime-Pictures is a site with fewer images, but generally high quality. waifuc also supports crawling from it, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import AnimePicturesSource

if __name__ == '__main__':
    s = AnimePicturesSource(['amiya (arknights)'])
    s[:50].export(
        SaveExporter('/data/amiya_animepictures')
    )

Sankaku

Sankaku is a site with a large number of images of various types, and waifuc also supports it, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import SankakuSource

if __name__ == '__main__':
    s = SankakuSource(
        ['amiya_(arknights)'],
        username='your_username',
        password='your_password',
    )
    s[:50].export(
        SaveExporter('/data/amiya_sankaku')
    )

Gelbooru

waifuc also supports crawling from Gelbooru, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import GelbooruSource

if __name__ == '__main__':
    s = GelbooruSource(['amiya_(arknights)'])
    s[:50].export(
        SaveExporter('/data/amiya_gelbooru')
    )

Duitang

In response to a request from a mysterious user on civitai, waifuc has added support for Duitang. Duitang is a Chinese website that contains many high-quality anime images. The crawling code is as follows:

from waifuc.export import SaveExporter
from waifuc.source import DuitangSource

if __name__ == '__main__':
    s = DuitangSource('阿米娅')
    s[:50].export(
        SaveExporter('/data/amiya_duitang')
    )

Other Supported Sites

In addition to the above-mentioned websites, we also support a large number of other image websites. All supported websites are listed below:

  1. ATFBooruSource (Website: https://booru.allthefallen.moe)

  2. AnimePicturesSource (Website: https://anime-pictures.net)

  3. DanbooruSource (Website: https://danbooru.donmai.us)

  4. DerpibooruSource (Website: https://derpibooru.org)

  5. DuitangSource (Website: https://www.duitang.com)

  6. E621Source (Website: https://e621.net)

  7. E926Source (Website: https://e926.net)

  8. FurbooruSource (Website: https://furbooru.com)

  9. GelbooruSource (Website: https://gelbooru.com)

  10. Huashi6Source (Website: https://www.huashi6.com)

  11. HypnoHubSource (Website: https://hypnohub.net)

  12. KonachanNetSource (Website: https://konachan.net)

  13. KonachanSource (Website: https://konachan.com)

  14. LolibooruSource (Website: https://lolibooru.moe)

  15. PahealSource (Website: https://rule34.paheal.net)

  16. PixivRankingSource (Website: https://pixiv.net)

  17. PixivSearchSource (Website: https://pixiv.net)

  18. PixivUserSource (Website: https://pixiv.net)

  19. Rule34Source (Website: https://rule34.xxx)

  20. SafebooruOrgSource (Website: https://safebooru.org)

  21. SafebooruSource (Website: https://safebooru.donmai.us)

  22. SankakuSource (Website: https://chan.sankakucomplex.com)

  23. TBIBSource (Website: https://tbib.org)

  24. WallHavenSource (Website: https://wallhaven.cc)

  25. XbooruSource (Website: https://xbooru.com)

  26. YandeSource (Website: https://yande.re)

  27. ZerochanSource (Website: https://www.zerochan.net)

For more information and details on using these data sources, refer to the official waifuc source code.
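Most booru-style sources in the list above follow the same usage pattern as the DanbooruSource and GelbooruSource examples earlier: construct the source with a list of tags and export it. The following is only a hedged sketch of that pattern; it assumes SafebooruOrgSource accepts a tag list like the other booru sources, so please check the source code if your version differs:

from waifuc.export import SaveExporter
from waifuc.source import SafebooruOrgSource

if __name__ == '__main__':
    # Assumption: SafebooruOrgSource takes a tag list, like DanbooruSource/GelbooruSource
    s = SafebooruOrgSource(['amiya_(arknights)', 'solo'])
    s[:50].export(
        SaveExporter('/data/amiya_safebooru_org')
    )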

Crawling Data from Multiple Websites

In reality, there are often cases where we want to retrieve image data from multiple websites. For example, we might need 30 images from Danbooru and another 30 from Zerochan.

To address this situation, waifuc provides concatenation and union operations for data sources. In simple terms, you can integrate multiple data sources using concatenation (+) and union (|). For example, to fulfill the mentioned requirement, we can concatenate the data sources of Danbooru and Zerochan, creating a new data source, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource, ZerochanSource

if __name__ == '__main__':
    # First 30 images from Danbooru
    s_db = DanbooruSource(
        ['amiya_(arknights)', 'solo'],
        min_size=10000,
    )[:30]

    # First 30 images from Zerochan
    s_zerochan = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
        select='full',
        strict=True,
    )[:30]

    # Concatenate these 2 data sources
    s = s_db + s_zerochan
    s.export(
        SaveExporter('/data/amiya_2datasources')
    )

The code above first crawls 30 images from Danbooru and then another 30 from Zerochan. Consequently, we get a dataset like this:

../../_images/source_concat1.png

Moreover, in some cases we might not know in advance how many images each data source contains; instead, we may simply want to collect as many images as possible from different sources, up to a specific total quantity. In such cases, we can use the union operation, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource, ZerochanSource

if __name__ == '__main__':
    # Images from Danbooru
    s_db = DanbooruSource(
        ['amiya_(arknights)', 'solo'],
        min_size=10000,
    )

    # Images from Zerochan
    s_zerochan = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
        select='full',
        strict=True,
    )

    # We need 60 images from these 2 sites
    s = (s_db | s_zerochan)[:60]
    s.export(
        SaveExporter('/data/amiya_zerochan')
    )

In this example, it randomly crawls one image at a time from either of the two websites until it collects 60 images. Thus, the final dataset is not fixed, and the following dataset is just an example:

../../_images/source_union1.png

In fact, all waifuc data sources support such concatenation and union operations. You can even perform complex nested operations to construct a sophisticated data source:

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource, ZerochanSource, PixivSearchSource

if __name__ == '__main__':
    # Images from Danbooru
    s_db = DanbooruSource(
        ['amiya_(arknights)', 'solo'],
        min_size=10000,
    )

    # Images from Zerochan
    s_zerochan = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
        select='full',
        strict=True,
    )

    # Images from Pixiv
    s_pixiv = PixivSearchSource(
        'アークナイツ (amiya OR アーミヤ OR 阿米娅)',
        refresh_token='your_pixiv_refresh_token',
    )

    # 50 images from Zerochan, plus up to 50 more from Danbooru and Pixiv combined
    s = s_zerochan[:50] + (s_db | s_pixiv)[:50]
    s.export(
        SaveExporter('/data/amiya_zerochan')
    )

Here, a complex data source s = s_zerochan[:50] + (s_db | s_pixiv)[:50] is created, which effectively means:

  1. First, crawl 50 images from Zerochan.

  2. Then, randomly crawl a total of 50 images from Danbooru and Pixiv.

This results in obtaining a maximum of 100 images.

Moreover, concatenation and union operations can also be applied after the attach syntax, meaning you can preprocess a data source and then concatenate or union it, as in the following example:

from waifuc.action import BackgroundRemovalAction, FileExtAction
from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource, ZerochanSource

if __name__ == '__main__':
    # Images from Danbooru
    s_db = DanbooruSource(
        ['amiya_(arknights)', 'solo'],
        min_size=10000,
    )

    # Images from Zerochan
    s_zerochan = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
        select='full',
        strict=True,
    )
    # Remove background for Zerochan images
    s_zerochan = s_zerochan.attach(
        BackgroundRemovalAction()
    )

    # We need 60 images from these 2 sites
    s = (s_zerochan | s_db)[:60]
    s.attach(
        FileExtAction('.png'),  # Use PNG format to save
    ).export(
        SaveExporter('/data/amiya_zerochan')
    )

The above code crawls images from both Zerochan and Danbooru, removes the background for images from Zerochan, and saves a total of 60 images. The results you get might look similar to the following, where images from Zerochan have their backgrounds removed:

../../_images/source_complex_attach1.png

Concatenation and union are essential features of waifuc data sources, and using them wisely makes the configuration of data sources very flexible and versatile.

Why Are There So Many JSON Files?

If you’ve read up to this point, you may have noticed something — every image in the datasets saved using SaveExporter is accompanied by a JSON file named after it. You must be curious about the purpose of these JSON files, and this section will provide an explanation.

Firstly, let’s open the file .danbooru_6814120_meta.json and take a look at its content. The JSON data looks something like this:

A Sample Meta-Information JSON
{
  "danbooru": {
    "id": 6814120,
    "created_at": "2023-10-26T08:20:14.250-04:00",
    "uploader_id": 499293,
    "score": 16,
    "source": "https://i.pximg.net/img-original/img/2023/10/26/20/46/42/112868915_p0.jpg",
    "md5": "0ce585ccbbe7c79f0970466ef7e464ee",
    "last_comment_bumped_at": null,
    "rating": "s",
    "image_width": 4586,
    "image_height": 3758,
    "tag_string": "1girl absurdres alternate_costume amiya_(arknights) animal_ears arknights backpack bag black_bag black_footwear black_jacket black_shorts blue_eyes brown_hair commentary_request flag ganet_p highres holding holding_flag holding_megaphone jacket jewelry long_hair long_sleeves looking_at_viewer mask mask_pull megaphone midriff mouth_mask multiple_rings open_clothes open_jacket outdoors rabbit_ears ring shadow shirt shorts sleeveless sleeveless_shirt socks solo tied_shirt very_long_hair white_socks",
    "fav_count": 13,
    "file_ext": "jpg",
    "last_noted_at": null,
    "parent_id": null,
    "has_children": false,
    "approver_id": null,
    "tag_count_general": 41,
    "tag_count_artist": 1,
    "tag_count_character": 1,
    "tag_count_copyright": 1,
    "file_size": 4320312,
    "up_score": 16,
    "down_score": 0,
    "is_pending": false,
    "is_flagged": false,
    "is_deleted": false,
    "tag_count": 47,
    "updated_at": "2023-10-26T08:30:55.486-04:00",
    "is_banned": false,
    "pixiv_id": 112868915,
    "last_commented_at": null,
    "has_active_children": false,
    "bit_flags": 0,
    "tag_count_meta": 3,
    "has_large": true,
    "has_visible_children": false,
    "media_asset": {
      "id": 15651277,
      "created_at": "2023-10-26T08:19:17.692-04:00",
      "updated_at": "2023-10-26T08:19:23.071-04:00",
      "md5": "0ce585ccbbe7c79f0970466ef7e464ee",
      "file_ext": "jpg",
      "file_size": 4320312,
      "image_width": 4586,
      "image_height": 3758,
      "duration": null,
      "status": "active",
      "file_key": "vMcpfhxKG",
      "is_public": true,
      "pixel_hash": "1e09de6748794e2fb1138a4a30ec905a",
      "variants": [
        {
          "type": "180x180",
          "url": "https://cdn.donmai.us/180x180/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.jpg",
          "width": 180,
          "height": 148,
          "file_ext": "jpg"
        },
        {
          "type": "360x360",
          "url": "https://cdn.donmai.us/360x360/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.jpg",
          "width": 360,
          "height": 295,
          "file_ext": "jpg"
        },
        {
          "type": "720x720",
          "url": "https://cdn.donmai.us/720x720/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.webp",
          "width": 720,
          "height": 590,
          "file_ext": "webp"
        },
        {
          "type": "sample",
          "url": "https://cdn.donmai.us/sample/0c/e5/sample-0ce585ccbbe7c79f0970466ef7e464ee.jpg",
          "width": 850,
          "height": 697,
          "file_ext": "jpg"
        },
        {
          "type": "original",
          "url": "https://cdn.donmai.us/original/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.jpg",
          "width": 4586,
          "height": 3758,
          "file_ext": "jpg"
        }
      ]
    },
    "tag_string_general": "1girl alternate_costume animal_ears backpack bag black_bag black_footwear black_jacket black_shorts blue_eyes brown_hair flag holding holding_flag holding_megaphone jacket jewelry long_hair long_sleeves looking_at_viewer mask mask_pull megaphone midriff mouth_mask multiple_rings open_clothes open_jacket outdoors rabbit_ears ring shadow shirt shorts sleeveless sleeveless_shirt socks solo tied_shirt very_long_hair white_socks",
    "tag_string_character": "amiya_(arknights)",
    "tag_string_copyright": "arknights",
    "tag_string_artist": "ganet_p",
    "tag_string_meta": "absurdres commentary_request highres",
    "file_url": "https://cdn.donmai.us/original/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.jpg",
    "large_file_url": "https://cdn.donmai.us/sample/0c/e5/sample-0ce585ccbbe7c79f0970466ef7e464ee.jpg",
    "preview_file_url": "https://cdn.donmai.us/180x180/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.jpg"
  },
  "group_id": "danbooru_6814120",
  "filename": "danbooru_6814120.png",
  "tags": {
    "1girl": 1,
    "absurdres": 1,
    "alternate_costume": 1,
    "amiya_(arknights)": 1,
    "animal_ears": 1,
    "arknights": 1,
    "backpack": 1,
    "bag": 1,
    "black_bag": 1,
    "black_footwear": 1,
    "black_jacket": 1,
    "black_shorts": 1,
    "blue_eyes": 1,
    "brown_hair": 1,
    "commentary_request": 1,
    "flag": 1,
    "ganet_p": 1,
    "highres": 1,
    "holding": 1,
    "holding_flag": 1,
    "holding_megaphone": 1,
    "jacket": 1,
    "jewelry": 1,
    "long_hair": 1,
    "long_sleeves": 1,
    "looking_at_viewer": 1,
    "mask": 1,
    "mask_pull": 1,
    "megaphone": 1,
    "midriff": 1,
    "mouth_mask": 1,
    "multiple_rings": 1,
    "open_clothes": 1,
    "open_jacket": 1,
    "outdoors": 1,
    "rabbit_ears": 1,
    "ring": 1,
    "shadow": 1,
    "shirt": 1,
    "shorts": 1,
    "sleeveless": 1,
    "sleeveless_shirt": 1,
    "socks": 1,
    "solo": 1,
    "tied_shirt": 1,
    "very_long_hair": 1,
    "white_socks": 1
  },
  "url": "https://cdn.donmai.us/original/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.jpg"
}

In essence, this is a data file that stores metadata about the image. It contains the following information:

  1. Information about the image from the Danbooru website, including tags, dimensions, ID, upload time, etc.

  2. URL information about the image, i.e., where the image was downloaded from.

  3. Naming information for the image, i.e., what filename the image will be saved as.

  4. Tag information for the image, i.e., what tags will be included when generating a training dataset.

These pieces of information each play a role in different processing steps. For example, tag information must be crawled from the website or generated with a tagger, and it is written out when the LoRA training dataset is generated. Therefore, waifuc maintains this metadata throughout the pipeline, and SaveExporter preserves it on disk.
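If you want to peek at this metadata yourself, you can simply read one of these JSON files with the standard library. This is just an inspection sketch; the path below assumes the sample file shown above was saved under /data/amiya_danbooru:

import json

if __name__ == '__main__':
    # Hypothetical path: the sample meta file shown above, inside its dataset directory
    with open('/data/amiya_danbooru/.danbooru_6814120_meta.json', 'r', encoding='utf-8') as f:
        meta = json.load(f)

    print(meta['filename'])            # the filename the image will be saved as
    print(len(meta['tags']))           # number of tags recorded for the image
    print(meta['danbooru']['rating'])  # site-specific information sits under its own key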

To make the most of this information, you can reload the dataset saved to your local machine using `LocalSource`. You can achieve this using the following code:

from waifuc.export import TextualInversionExporter
from waifuc.source import LocalSource

if __name__ == '__main__':
    # Images from your disk
    s = LocalSource('/data/amiya_zerochan')
    s.export(
        TextualInversionExporter('/data/amiya_zerochan_save')
    )

The code above reloads the dataset containing metadata that was previously saved and re-saves it in the format of the LoRA training dataset. The resulting files will look like the following:

../../_images/local_to_dataset1.png

Of course, it’s worth noting that LocalSource can be used not only for paths containing metadata files but also for paths containing only images. However, it will not include the initial metadata in the pipeline, meaning that information such as tags will need to be regenerated.
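For example, here is a minimal sketch of loading an image-only folder and regenerating tags before exporting; it assumes waifuc's TaggingAction is available in waifuc.action, and the folder path is just an example:

from waifuc.action import TaggingAction  # assumed tagging action from waifuc.action
from waifuc.export import TextualInversionExporter
from waifuc.source import LocalSource

if __name__ == '__main__':
    # A directory containing only image files, without any *_meta.json
    s = LocalSource('/data/amiya_images_only')
    s.attach(
        TaggingAction(),  # regenerate tags, since no metadata was loaded from disk
    ).export(
        TextualInversionExporter('/data/amiya_images_only_dataset')
    )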

Additionally, like other data sources, `LocalSource` also supports concatenation and union operations. Leveraging this feature, you can use data from both the internet and local sources to build a dataset.
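For instance, here is a minimal sketch that mixes a previously saved local dataset with fresh images from Danbooru (the paths and counts are just examples):

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource, LocalSource

if __name__ == '__main__':
    # Previously saved dataset (with metadata) on disk
    s_local = LocalSource('/data/amiya_zerochan')
    # 30 fresh images from Danbooru
    s_db = DanbooruSource(['amiya_(arknights)', 'solo'])[:30]

    # Concatenate the local and online sources into one dataset
    (s_local + s_db).export(
        SaveExporter('/data/amiya_mixed')
    )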

When you are sure that you only need the images and don’t need any metadata for later reloading, you can use the no_meta parameter of SaveExporter to achieve this:

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource

if __name__ == '__main__':
    s = DanbooruSource(
        ['amiya_(arknights)', 'solo'],
        min_size=10000,
    )[:50]
    s.export(
        SaveExporter('/data/amiya_danbooru_nometa', no_meta=True)
    )

This code will not save any metadata files, and you will precisely get 50 images, as shown below:

../../_images/save_no_meta1.png

I Don’t Want to Save Images to Disk, How Can I Use Them Directly?

In another scenario, you might not want to save image files to the hard disk but rather use them directly in memory to save time. waifuc also supports this usage, as shown below:

from waifuc.source import DanbooruSource

if __name__ == '__main__':
    s = DanbooruSource(
        ['amiya_(arknights)', 'solo'],
        min_size=10000,
    )[:50]
    for item in s:
        print(item)

Yes, data sources (including those that have used the attach method) can be iterated over. The type of each item is defined as follows:

from dataclasses import dataclass

from PIL import Image


@dataclass
class ImageItem:
    image: Image.Image
    meta: dict

As you can see, the structure of each item is straightforward, containing an image object of type PIL.Image and a meta item for storing metadata.

Once you have the item, you can customize the operations you need using its image object and metadata.
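As a minimal sketch of such custom processing (the output directory is just an example), you can read each item’s image and metadata directly:

import os

from waifuc.source import DanbooruSource

if __name__ == '__main__':
    s = DanbooruSource(['amiya_(arknights)', 'solo'])[:10]

    os.makedirs('/data/amiya_manual', exist_ok=True)
    for i, item in enumerate(s):
        # item.image is a PIL.Image.Image, item.meta is a plain dict
        print(item.image.size, list(item.meta.keys()))
        item.image.save(f'/data/amiya_manual/{i}.png')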