Crawling Some Images From Websites

(Chinese Doc: https://deepghs.github.io/waifuc/main/tutorials-CN/crawl_images/index.html)

How to Crawl Data From Websites?

In fact, waifuc can crawl from many websites, not just Danbooru. But before we get started, allow me to formally introduce another of my waifus, Amiya, a cute bunny girl. (You may ask why I have so many waifus. Well, as you know, anime lovers have an infinite number of waifus; they are all my honeys and angels 😍)

../../_images/amiya1.png

Zerochan

Zerochan is a website with many high-quality images, and we can crawl from it in a straightforward way. Since it hosts a large number of images, we will only grab the first 50; the following code achieves this:

from waifuc.export import SaveExporter
from waifuc.source import ZerochanSource

if __name__ == '__main__':
    s = ZerochanSource('Amiya')
    # the [:50] slice means we only need the first 50 images;
    # if you need all images from Zerochan,
    # just use s.export(...) without the slice
    s[:50].export(
        SaveExporter('/data/amiya_zerochan')
    )

Please note that we use SaveExporter here instead of the TextualInversionExporter used previously. Its function will be explained in the following sections. The data crawled here will be stored locally in the /data/amiya_zerochan directory, as shown below:

../../_images/zerochan_simple_501.png

However, we’ve noticed an issue—Zerochan has many member-only images, requiring login to access. To address this, we can use our username and password for authentication to obtain more and higher-quality images:

from waifuc.export import SaveExporter
from waifuc.source import ZerochanSource

if __name__ == '__main__':
    s = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
    )
    s[:50].export(
        SaveExporter('/data/amiya_zerochan')
    )

Indeed, we successfully obtained many member-only images, as shown below:

../../_images/zerochan_login_501.png

However, many of these images have relatively low resolutions, with few exceeding 1000 pixels in either dimension. This is because Zerochan defaults to using the large size to speed up downloads. If you need larger images, you can modify the size selection like this:

from waifuc.export import SaveExporter
from waifuc.source import ZerochanSource

if __name__ == '__main__':
    s = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
        select='full',
    )
    s[:50].export(
        SaveExporter('/data/amiya_zerochan')
    )

After crawling, all the images will be in full size.

However, there’s still an issue—many high-quality images are official promotional art, and since Amiya is a main character, she often appears in group art. We actually need images that only feature her. No problem, just set the search mode to strict:

from waifuc.export import SaveExporter
from waifuc.source import ZerochanSource

if __name__ == '__main__':
    s = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
        select='full',
        strict=True,
    )
    s[:50].export(
        SaveExporter('/data/amiya_zerochan')
    )

Now we have high-quality images of Amiya alone from Zerochan, as shown below:

../../_images/zerochan_login_50_full_strict1.png

Danbooru

Clearly, Danbooru can also be crawled easily:

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource

if __name__ == '__main__':
    s = DanbooruSource(['amiya_(arknights)'])
    s[:50].export(
        SaveExporter('/data/amiya_danbooru')
    )

Moreover, on Danbooru and many similar sites, you can directly collect solo images by adding the solo tag, like this:

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource

if __name__ == '__main__':
    s = DanbooruSource(['amiya_(arknights)', 'solo'])
    s[:50].export(
        SaveExporter('/data/amiya_solo_danbooru')
    )

Pixiv

waifuc also supports crawling from Pixiv, including keyword searches, artist-specific crawls, and ranking-based crawls.

We can use PixivSearchSource to crawl images based on keywords, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import PixivSearchSource

if __name__ == '__main__':
    s = PixivSearchSource(
        'アークナイツ (amiya OR アーミヤ OR 阿米娅)',
        refresh_token='your_pixiv_refresh_token',
    )
    s[:50].export(
        SaveExporter('/data/amiya_pixiv')
    )

We can also use PixivUserSource to crawl images from a specific artist, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import PixivUserSource

if __name__ == '__main__':
    s = PixivUserSource(
        2864095,  # pixiv user 2864095
        refresh_token='your_pixiv_refresh_token',
    )
    s[:50].export(
        SaveExporter('/data/pixiv_user_misaka_12003')
    )

We can use PixivRankingSource to crawl images from the ranking, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import PixivRankingSource

if __name__ == '__main__':
    s = PixivRankingSource(
        mode='day',  # daily ranking list
        refresh_token='your_pixiv_refresh_token',
    )
    s[:50].export(
        SaveExporter('/data/pixiv_daily_best')
    )

Anime-Pictures

Anime-Pictures is a site with fewer images, but generally high quality. waifuc also supports crawling from it, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import AnimePicturesSource

if __name__ == '__main__':
    s = AnimePicturesSource(['amiya (arknights)'])
    s[:50].export(
        SaveExporter('/data/amiya_animepictures')
    )

Sankaku

Sankaku is a site with a large number of images of various types, and waifuc also supports it, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import SankakuSource

if __name__ == '__main__':
    s = SankakuSource(
        ['amiya_(arknights)'],
        username='your_username',
        password='your_password',
    )
    s[:50].export(
        SaveExporter('/data/amiya_sankaku')
    )

Gelbooru

waifuc also supports crawling from Gelbooru, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import GelbooruSource

if __name__ == '__main__':
    s = GelbooruSource(['amiya_(arknights)'])
    s[:50].export(
        SaveExporter('/data/amiya_gelbooru')
    )

Duitang

In response to a request from a mysterious user on civitai, waifuc has added support for Duitang. Duitang is a Chinese website that contains many high-quality anime images. The crawling code is as follows:

from waifuc.export import SaveExporter
from waifuc.source import DuitangSource

if __name__ == '__main__':
    s = DuitangSource('阿米娅')
    s[:50].export(
        SaveExporter('/data/amiya_duitang')
    )

Other Supported Sites

In addition to the above-mentioned websites, we also support a large number of other image websites. All supported websites are listed below:

  1. ATFBooruSource (Website: https://booru.allthefallen.moe)

  2. AnimePicturesSource (Website: https://anime-pictures.net)

  3. DanbooruSource (Website: https://danbooru.donmai.us)

  4. DerpibooruSource (Website: https://derpibooru.org)

  5. DuitangSource (Website: https://www.duitang.com)

  6. E621Source (Website: https://e621.net)

  7. E926Source (Website: https://e926.net)

  8. FurbooruSource (Website: https://furbooru.com)

  9. GelbooruSource (Website: https://gelbooru.com)

  10. Huashi6Source (Website: https://www.huashi6.com)

  11. HypnoHubSource (Website: https://hypnohub.net)

  12. KonachanNetSource (Website: https://konachan.net)

  13. KonachanSource (Website: https://konachan.com)

  14. LolibooruSource (Website: https://lolibooru.moe)

  15. PahealSource (Website: https://rule34.paheal.net)

  16. PixivRankingSource (Website: https://pixiv.net)

  17. PixivSearchSource (Website: https://pixiv.net)

  18. PixivUserSource (Website: https://pixiv.net)

  19. Rule34Source (Website: https://rule34.xxx)

  20. SafebooruOrgSource (Website: https://safebooru.org)

  21. SafebooruSource (Website: https://safebooru.donmai.us)

  22. SankakuSource (Website: https://chan.sankakucomplex.com)

  23. TBIBSource (Website: https://tbib.org)

  24. WallHavenSource (Website: https://wallhaven.cc)

  25. XbooruSource (Website: https://xbooru.com)

  26. YandeSource (Website: https://yande.re)

  27. ZerochanSource (Website: https://www.zerochan.net)

For more information and details on using these data sources, refer to the official waifuc source code.
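Most booru-style sources in the list above follow the same usage pattern as the DanbooruSource and GelbooruSource examples earlier: construct the source with a list of tags and export it. The following is only a hedged sketch of that pattern; it assumes SafebooruOrgSource accepts a tag list like the other booru sources, so please check the source code if your version differs:

from waifuc.export import SaveExporter
from waifuc.source import SafebooruOrgSource

if __name__ == '__main__':
    # Assumption: SafebooruOrgSource takes a tag list, like DanbooruSource/GelbooruSource
    s = SafebooruOrgSource(['amiya_(arknights)', 'solo'])
    s[:50].export(
        SaveExporter('/data/amiya_safebooru_org')
    )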

Crawling Data from Multiple Websites

In reality, there are often cases where we want to retrieve image data from multiple websites. For example, we might need 30 images from Danbooru and another 30 from Zerochan.

To address this situation, waifuc provides concatenation and union operations for data sources. In simple terms, you can integrate multiple data sources using concatenation (+) and union (|). For example, to fulfill the mentioned requirement, we can concatenate the data sources of Danbooru and Zerochan, creating a new data source, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource, ZerochanSource

if __name__ == '__main__':
    # First 30 images from Danbooru
    s_db = DanbooruSource(
        ['amiya_(arknights)', 'solo'],
        min_size=10000,
    )[:30]

    # First 30 images from Zerochan
    s_zerochan = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
        select='full',
        strict=True,
    )[:30]

    # Concatenate these 2 data sources
    s = s_db + s_zerochan
    s.export(
        SaveExporter('/data/amiya_2datasources')
    )

The code above first crawls 30 images from Danbooru and then another 30 from Zerochan. Consequently, we get a dataset like this:

../../_images/source_concat1.png

Moreover, in some cases we might not know in advance how many images each data source contains; instead, we may simply want to collect as many images as possible from different sources, up to a specific total quantity. In such cases, we can use the union operation, as shown below:

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource, ZerochanSource

if __name__ == '__main__':
    # Images from Danbooru
    s_db = DanbooruSource(
        ['amiya_(arknights)', 'solo'],
        min_size=10000,
    )

    # Images from Zerochan
    s_zerochan = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
        select='full',
        strict=True,
    )

    # We need 60 images from these 2 sites
    s = (s_db | s_zerochan)[:60]
    s.export(
        SaveExporter('/data/amiya_zerochan')
    )

In this example, it randomly crawls one image at a time from either of the two websites until it collects 60 images. Thus, the final dataset is not fixed, and the following dataset is just an example:

../../_images/source_union1.png

In fact, all waifuc data sources support such concatenation and union operations. You can even perform complex nested operations to construct a sophisticated data source:

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource, ZerochanSource, PixivSearchSource

if __name__ == '__main__':
    # Images from Danbooru
    s_db = DanbooruSource(
        ['amiya_(arknights)', 'solo'],
        min_size=10000,
    )

    # Images from Zerochan
    s_zerochan = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
        select='full',
        strict=True,
    )

    # Images from Pixiv
    s_pixiv = PixivSearchSource(
        'アークナイツ (amiya OR アーミヤ OR 阿米娅)',
        refresh_token='your_pixiv_refresh_token',
    )

    # 50 images from Zerochan, plus up to 50 more from Danbooru and Pixiv combined
    s = s_zerochan[:50] + (s_db | s_pixiv)[:50]
    s.export(
        SaveExporter('/data/amiya_zerochan')
    )

Here, a complex data source s = s_zerochan[:50] + (s_db | s_pixiv)[:50] is created, which effectively means:

  1. First, crawl 50 images from Zerochan.

  2. Then, randomly crawl a total of 50 images from Danbooru and Pixiv.

This results in obtaining a maximum of 100 images.

Moreover, concatenation and union operations can also be applied after the attach syntax, meaning you can preprocess a data source and then concatenate or union it, as in the following example:

from waifuc.action import BackgroundRemovalAction, FileExtAction
from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource, ZerochanSource

if __name__ == '__main__':
    # Images from Danbooru
    s_db = DanbooruSource(
        ['amiya_(arknights)', 'solo'],
        min_size=10000,
    )

    # Images from Zerochan
    s_zerochan = ZerochanSource(
        'Amiya',
        username='your_username',
        password='your_password',
        select='full',
        strict=True,
    )
    # Remove background for Zerochan images
    s_zerochan = s_zerochan.attach(
        BackgroundRemovalAction()
    )

    # We need 60 images from these 2 sites
    s = (s_zerochan | s_db)[:60]
    s.attach(
        FileExtAction('.png'),  # Use PNG format to save
    ).export(
        SaveExporter('/data/amiya_zerochan')
    )

The above code crawls images from both Zerochan and Danbooru, removes the background for images from Zerochan, and saves a total of 60 images. The results you get might look similar to the following, where images from Zerochan have their backgrounds removed:

../../_images/source_complex_attach1.png

Concatenation and union are essential features of waifuc data sources, and using them wisely makes the configuration of data sources very flexible and versatile.

Why Are There So Many JSON Files?

If you’ve read up to this point, you may have noticed something — every image in the datasets saved using SaveExporter is accompanied by a JSON file named after it. You must be curious about the purpose of these JSON files, and this section will provide an explanation.

Firstly, let’s open the file .danbooru_6814120_meta.json and take a look at its content. The JSON data looks something like this:

A Sample Meta-Information JSON
{
  "danbooru": {
    "id": 6814120,
    "created_at": "2023-10-26T08:20:14.250-04:00",
    "uploader_id": 499293,
    "score": 16,
    "source": "https://i.pximg.net/img-original/img/2023/10/26/20/46/42/112868915_p0.jpg",
    "md5": "0ce585ccbbe7c79f0970466ef7e464ee",
    "last_comment_bumped_at": null,
    "rating": "s",
    "image_width": 4586,
    "image_height": 3758,
    "tag_string": "1girl absurdres alternate_costume amiya_(arknights) animal_ears arknights backpack bag black_bag black_footwear black_jacket black_shorts blue_eyes brown_hair commentary_request flag ganet_p highres holding holding_flag holding_megaphone jacket jewelry long_hair long_sleeves looking_at_viewer mask mask_pull megaphone midriff mouth_mask multiple_rings open_clothes open_jacket outdoors rabbit_ears ring shadow shirt shorts sleeveless sleeveless_shirt socks solo tied_shirt very_long_hair white_socks",
    "fav_count": 13,
    "file_ext": "jpg",
    "last_noted_at": null,
    "parent_id": null,
    "has_children": false,
    "approver_id": null,
    "tag_count_general": 41,
    "tag_count_artist": 1,
    "tag_count_character": 1,
    "tag_count_copyright": 1,
    "file_size": 4320312,
    "up_score": 16,
    "down_score": 0,
    "is_pending": false,
    "is_flagged": false,
    "is_deleted": false,
    "tag_count": 47,
    "updated_at": "2023-10-26T08:30:55.486-04:00",
    "is_banned": false,
    "pixiv_id": 112868915,
    "last_commented_at": null,
    "has_active_children": false,
    "bit_flags": 0,
    "tag_count_meta": 3,
    "has_large": true,
    "has_visible_children": false,
    "media_asset": {
      "id": 15651277,
      "created_at": "2023-10-26T08:19:17.692-04:00",
      "updated_at": "2023-10-26T08:19:23.071-04:00",
      "md5": "0ce585ccbbe7c79f0970466ef7e464ee",
      "file_ext": "jpg",
      "file_size": 4320312,
      "image_width": 4586,
      "image_height": 3758,
      "duration": null,
      "status": "active",
      "file_key": "vMcpfhxKG",
      "is_public": true,
      "pixel_hash": "1e09de6748794e2fb1138a4a30ec905a",
      "variants": [
        {
          "type": "180x180",
          "url": "https://cdn.donmai.us/180x180/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.jpg",
          "width": 180,
          "height": 148,
          "file_ext": "jpg"
        },
        {
          "type": "360x360",
          "url": "https://cdn.donmai.us/360x360/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.jpg",
          "width": 360,
          "height": 295,
          "file_ext": "jpg"
        },
        {
          "type": "720x720",
          "url": "https://cdn.donmai.us/720x720/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.webp",
          "width": 720,
          "height": 590,
          "file_ext": "webp"
        },
        {
          "type": "sample",
          "url": "https://cdn.donmai.us/sample/0c/e5/sample-0ce585ccbbe7c79f0970466ef7e464ee.jpg",
          "width": 850,
          "height": 697,
          "file_ext": "jpg"
        },
        {
          "type": "original",
          "url": "https://cdn.donmai.us/original/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.jpg",
          "width": 4586,
          "height": 3758,
          "file_ext": "jpg"
        }
      ]
    },
    "tag_string_general": "1girl alternate_costume animal_ears backpack bag black_bag black_footwear black_jacket black_shorts blue_eyes brown_hair flag holding holding_flag holding_megaphone jacket jewelry long_hair long_sleeves looking_at_viewer mask mask_pull megaphone midriff mouth_mask multiple_rings open_clothes open_jacket outdoors rabbit_ears ring shadow shirt shorts sleeveless sleeveless_shirt socks solo tied_shirt very_long_hair white_socks",
    "tag_string_character": "amiya_(arknights)",
    "tag_string_copyright": "arknights",
    "tag_string_artist": "ganet_p",
    "tag_string_meta": "absurdres commentary_request highres",
    "file_url": "https://cdn.donmai.us/original/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.jpg",
    "large_file_url": "https://cdn.donmai.us/sample/0c/e5/sample-0ce585ccbbe7c79f0970466ef7e464ee.jpg",
    "preview_file_url": "https://cdn.donmai.us/180x180/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.jpg"
  },
  "group_id": "danbooru_6814120",
  "filename": "danbooru_6814120.png",
  "tags": {
    "1girl": 1,
    "absurdres": 1,
    "alternate_costume": 1,
    "amiya_(arknights)": 1,
    "animal_ears": 1,
    "arknights": 1,
    "backpack": 1,
    "bag": 1,
    "black_bag": 1,
    "black_footwear": 1,
    "black_jacket": 1,
    "black_shorts": 1,
    "blue_eyes": 1,
    "brown_hair": 1,
    "commentary_request": 1,
    "flag": 1,
    "ganet_p": 1,
    "highres": 1,
    "holding": 1,
    "holding_flag": 1,
    "holding_megaphone": 1,
    "jacket": 1,
    "jewelry": 1,
    "long_hair": 1,
    "long_sleeves": 1,
    "looking_at_viewer": 1,
    "mask": 1,
    "mask_pull": 1,
    "megaphone": 1,
    "midriff": 1,
    "mouth_mask": 1,
    "multiple_rings": 1,
    "open_clothes": 1,
    "open_jacket": 1,
    "outdoors": 1,
    "rabbit_ears": 1,
    "ring": 1,
    "shadow": 1,
    "shirt": 1,
    "shorts": 1,
    "sleeveless": 1,
    "sleeveless_shirt": 1,
    "socks": 1,
    "solo": 1,
    "tied_shirt": 1,
    "very_long_hair": 1,
    "white_socks": 1
  },
  "url": "https://cdn.donmai.us/original/0c/e5/0ce585ccbbe7c79f0970466ef7e464ee.jpg"
}

In essence, this is a data file that stores metadata about the image. It contains the following information:

  1. Information about the image from the Danbooru website, including tags, dimensions, ID, upload time, etc.

  2. URL information about the image, i.e., where the image was downloaded from.

  3. Naming information for the image, i.e., what filename the image will be saved as.

  4. Tag information for the image, i.e., what tags will be included when generating a training dataset.

These pieces of information each play a role in different processing steps. For example, tag information must be crawled from the website or generated with a tagger, and it is written out when the LoRA training dataset is generated. Therefore, waifuc maintains this metadata throughout the pipeline, and SaveExporter preserves it on disk.
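If you want to peek at this metadata yourself, you can simply read one of these JSON files with the standard library. This is just an inspection sketch; the path below assumes the sample file shown above was saved under /data/amiya_danbooru:

import json

if __name__ == '__main__':
    # Hypothetical path: the sample meta file shown above, inside its dataset directory
    with open('/data/amiya_danbooru/.danbooru_6814120_meta.json', 'r', encoding='utf-8') as f:
        meta = json.load(f)

    print(meta['filename'])            # the filename the image will be saved as
    print(len(meta['tags']))           # number of tags recorded for the image
    print(meta['danbooru']['rating'])  # site-specific information sits under its own key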

To make the most of this information, you can reload the dataset saved to your local machine using `LocalSource`. You can achieve this using the following code:

from waifuc.export import TextualInversionExporter
from waifuc.source import LocalSource

if __name__ == '__main__':
    # Images from your disk
    s = LocalSource('/data/amiya_zerochan')
    s.export(
        TextualInversionExporter('/data/amiya_zerochan_save')
    )

The code above reloads the dataset containing metadata that was previously saved and re-saves it in the format of the LoRA training dataset. The resulting files will look like the following:

../../_images/local_to_dataset1.png

Of course, it’s worth noting that LocalSource can be used not only for paths containing metadata files but also for paths containing only images. However, it will not include the initial metadata in the pipeline, meaning that information such as tags will need to be regenerated.
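For example, here is a minimal sketch of loading an image-only folder and regenerating tags before exporting; it assumes waifuc's TaggingAction is available in waifuc.action, and the folder path is just an example:

from waifuc.action import TaggingAction  # assumed tagging action from waifuc.action
from waifuc.export import TextualInversionExporter
from waifuc.source import LocalSource

if __name__ == '__main__':
    # A directory containing only image files, without any *_meta.json
    s = LocalSource('/data/amiya_images_only')
    s.attach(
        TaggingAction(),  # regenerate tags, since no metadata was loaded from disk
    ).export(
        TextualInversionExporter('/data/amiya_images_only_dataset')
    )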

Additionally, like other data sources, `LocalSource` also supports concatenation and union operations. Leveraging this feature, you can use data from both the internet and local sources to build a dataset.
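For instance, here is a minimal sketch that mixes a previously saved local dataset with fresh images from Danbooru (the paths and counts are just examples):

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource, LocalSource

if __name__ == '__main__':
    # Previously saved dataset (with metadata) on disk
    s_local = LocalSource('/data/amiya_zerochan')
    # 30 fresh images from Danbooru
    s_db = DanbooruSource(['amiya_(arknights)', 'solo'])[:30]

    # Concatenate the local and online sources into one dataset
    (s_local + s_db).export(
        SaveExporter('/data/amiya_mixed')
    )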

When you are sure that you only need the images and don’t need any metadata for later reloading, you can use the no_meta parameter of SaveExporter to achieve this:

from waifuc.export import SaveExporter
from waifuc.source import DanbooruSource

if __name__ == '__main__':
    s = DanbooruSource(
        ['amiya_(arknights)', 'solo'],
        min_size=10000,
    )[:50]
    s.export(
        SaveExporter('/data/amiya_danbooru_nometa', no_meta=True)
    )

This code will not save any metadata files, and you will precisely get 50 images, as shown below:

../../_images/save_no_meta1.png

I Don’t Want to Save Images to Disk, How Can I Use Them Directly?

In another scenario, you might not want to save image files to the hard disk but rather use them directly in memory to save time. waifuc also supports this usage, as shown below:

from waifuc.source import DanbooruSource

if __name__ == '__main__':
    s = DanbooruSource(
        ['amiya_(arknights)', 'solo'],
        min_size=10000,
    )[:50]
    for item in s:
        print(item)

Yes, data sources (including those that have used the attach method) can be iterated over. The type of each item is defined as follows:

from dataclasses import dataclass

from PIL import Image


@dataclass
class ImageItem:
    image: Image.Image
    meta: dict

As you can see, the structure of each item is straightforward, containing an image object of type PIL.Image and a meta item for storing metadata.

Once you have the item, you can customize the operations you need using its image object and metadata.
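As a minimal sketch of such custom processing (the output directory is just an example), you can read each item’s image and metadata directly:

import os

from waifuc.source import DanbooruSource

if __name__ == '__main__':
    s = DanbooruSource(['amiya_(arknights)', 'solo'])[:10]

    os.makedirs('/data/amiya_manual', exist_ok=True)
    for i, item in enumerate(s):
        # item.image is a PIL.Image.Image, item.meta is a plain dict
        print(item.image.size, list(item.meta.keys()))
        item.image.save(f'/data/amiya_manual/{i}.png')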