/t/ - Technology

Discussion of Technology


(43.05 KB 618x656 ChannelChangerLogo_Avatar.png)
(133.69 KB 1510x924 ChannelChangerLogo.png)
ChannelChanger Development & Support Anonymous 09/06/2020 (Sun) 18:37:48 No. 1257
This is the official development and support thread for ChannelChanger. Please request help, post bugs, or offer suggestions here.

What is ChannelChanger?
A cross-platform, multi-site scraper and importer. It allows anyone to back up a board and then import it to their own website.
https://gitgud.io/Codexx/channel_changer

What do I need to run this?
Python 3.8+ and most of the dependencies listed in requirements.txt. A basic set-up guide is provided in the readme. This software was developed and tested exclusively on Linux. I intend to support both OSX and Windows; if you use either of these platforms and encounter any issues, please let me know.

Can I scrape a board from [site] with this?
Probably. There is explicit support for LynxChan, Vichan, and JSChan websites. Some Vichan sites may have issues with thumbnails because their APIs do not expose thumbnail extensions; I have added an override, but you may need to run two scrapes of boards on some sites to get all of the thumbnails. Vichan's API matches 4chan's with some extensions, so the scraper might work on other sites which clone the 4chan API, but this is untested. Many Vichan sites have customized frontends, such as OpenIB, Lainchan, or Kissue. I've tested and confirmed these work, but I can't always guarantee full compatibility with each of them, especially if they decide to alter the API or where files are stored. LynxChan sites should work fine, since the direct paths for both the thumbnail and the file are in the JSON. JSChan works, but its API is presumably unstable; if it changes, please alert me and I will make the necessary tweaks.

Can I import these boards to my own website?
Sure, but for the moment only importing LynxChan boards from LynxChan or Vichan sites has any support. Importing is currently undergoing a heavy refactor. Once it is done, it will be possible to import any board to a LynxChan website. Imports to other imageboard engines are planned.

Can I view the board offline?
Easily? No, but I am looking into an option to do this. You will have a local copy of the threads and files, but the data is not modified for local viewing.

I will continue to iterate and refactor. The code is a bit of a mess at the moment, but I plan to simplify it and make it PEP8-compliant soon. It's very likely there are still some big kinks to work out. Your feedback is incredibly valuable!
(43.59 KB 467x413 nice desu ne (2).jpg)
>>1257 Can the average anon scrape boards, or do they need to own the target board/site in order to scrape everything? How much strain does this put on the target site?
>>1259 It just scrapes what is publicly available. Anyone can do it on any supported site. It should get everything; let me know if it chokes or misses anything. No more of a strain than a single anon clicking on and reading every single thread on the board and expanding all images. That's not much of a hit. It also only scrapes files that are missing, so that will minimize scrape time and server bandwidth.
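The "only scrapes files that are missing" behavior described above can be sketched as a plain existence check before each download. This is a hypothetical helper (`missing_files` is not from the actual codebase), assuming files keep their server-side names on disk:

```python
import os

def missing_files(filenames, local_dir):
    """Return only the files from a remote listing that are not on disk yet.

    A sketch of the skip-if-present logic: files already present from a
    previous scrape are never re-downloaded, which keeps both scrape time
    and server bandwidth low on repeat runs.
    """
    return [name for name in filenames
            if not os.path.exists(os.path.join(local_dir, name))]
```

On a re-scrape, only the names returned by a check like this would be fetched; everything else is served from the earlier run.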
>channel Christ
>There is explicit support for JSChan
Ok, this is based. Blacked.moe is ultra gay but you're cool, codexx.
>>1257 So basically it lets you copy and paste boards?
>>1260
>That's not much of a hit
Care to elaborate? You keep comparing this tool with various 4chan scrapers, but 4chan's server capacity is dozens of times greater than any other imageboard's, and it probably has features like load balancing to reduce the strain. If configured wrong, a tool like your scraper could take down a small site by accident. Another difference between this tool and things like 4chan's archives is that they use 4chan's dedicated API, which limits the amount of content they scrape (just the text and images in the posts, instead of requesting the entire page every time something changes), while your tool requests and downloads everything (at least judging by your description). There are related ethical problems, like bad actors using your tool to clone a site for malicious purposes (stealing users, scamming advertisers, bloating the target's bandwidth, etc.), but that's to be expected with tools of this kind.
>>1263
>site performance problems
Assuming a good-faith actor, the worst case would be transferring a copy of every file on the server and a copy of each thread. That sounds like a lot, and it is, but even small sites (such as this one and the webring) handle that kind of data transfer regularly. Vanwa removed a lot of their performance and traffic statistics, but this site fulfills hundreds of requests a second even during off-hours. Even our $5 VPS test server handles being scraped just fine, although it's not also handling other traffic. In terms of bad actors, yes, you could just have this tool constantly make requests. But a DDoS tool would be far more effective and use up less of your own bandwidth in the process. I just don't think it's well-suited for that purpose. In short, I don't think it can take down sites by accident unless used maliciously, and someone with malicious intent has better tools to accomplish the same thing.
>4chan dedicated API
I actually don't know what 4chan scrapers are out there and haven't compared this one to them. I only know Vichan's API is a superset of 4chan's. Everything I grab is from the public, dedicated JSON API these sites provide, and then I also grab the HTML pages on top of that for the sake of posterity (and I think I might need them when I write the Vichan importer). It will re-request a page on re-scrape, but the size of a JSON or HTML request is peanuts, and it will not re-scrape files.
>related ethical problems... bad actors... clone sites for malicious purposes
Yes, I am concerned about this, too. Primarily, angry users forking boards because the mods deleted their post or pissed them off in some way. But my tool doesn't do anything these people couldn't do themselves; it only lowers the barrier to entry a bit. The alternative is to keep the source closed and just advertise it to migrating boards as an option, but I do believe in free software, and I also believe that, should I get hit by a bus tomorrow, other anons should have the ability to restore boards and sites that get deplatformed. This tool gives them the capability to do that. It's up to anons not to be "stolen" by other websites just because the posts are cloned. Ultimately, it's how you use it. But I think it would be unethical to keep this tool private.
>>1262
Yes.
>>1264 Thanks for the clarification. I was a bit worried about how this tool could backfire and cause more damage than good. One last thing, is there a risk of accidentally triggering Vanwa/Cloudflare's DDoS protection and getting your IP banned by them for using this tool?
>>1265 Potentially, but there's some mitigation. The user-agent is spoofed to Firefox and if people run into issues I could implement user-agent randomization on a per-request basis, which would probably throw this off. Cloudflare's bot-check page would likely be an issue if enabled. Nothing to be done about that, really. I ran into throttling with 8kun very early on, but since implementing the user-agent spoofing it hasn't been an issue, and I scraped two of their largest boards back-to-back multiple times.
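The user-agent spoofing, and the per-request randomization floated above as a possible mitigation, could look roughly like this. The header values and the `request_headers` helper are illustrative assumptions, not the tool's actual code:

```python
import random

# A fixed Firefox user-agent string (version chosen for illustration only).
FIREFOX_UA = ("Mozilla/5.0 (X11; Linux x86_64; rv:78.0) "
              "Gecko/20100101 Firefox/78.0")

# Pool of plausible browser UAs to rotate through per request.
UA_POOL = [
    FIREFOX_UA,
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0",
]

def request_headers(randomize=False):
    """Headers for a scrape request, optionally with a randomized User-Agent.

    With randomize=True, each call picks a fresh UA, which would defeat
    naive throttling keyed on a single repeated user-agent string.
    """
    ua = random.choice(UA_POOL) if randomize else FIREFOX_UA
    return {"User-Agent": ua}
```

Per-request rotation only helps against UA-based throttling; it does nothing against IP-based rate limits or a Cloudflare bot-check page.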
>>1266 >Cloudflare's bot-check page would likely be an issue if enabled. Nothing to be done about that, really. There are two bot check pages. The first is a simple matter of waiting and executing some javascript. This only stops the lowest effort automation. The second involves a captcha, where your options are to either solve it or try again with a different IP address.
>>1259 Ask any of the 4chan archivers for details.
As a turbo CL linux brainlet, someone explain what I'm doing wrong.

python change-channel.py -s 8chan.moe -b test -o test
  File "change-channel.py", line 92
    print('Unable to download ' + ('thumbnail' if thumbnail else 'image') + f' {url.split("/")[-1]}. Skipping...')
                                                                             ^
SyntaxError: invalid syntax
>>1271
Sometimes invoking python on some GNU/Linux distros grabs Python 2 rather than Python 3. You can invoke python3 explicitly, and double-check the version with the '-V' flag (python3 -V on its own, not python3 change-channel.py -s 8chan.moe -b test -o test -V).
>>1272 Looks like my nigger distro doesn't have python 3.8, probably the issue
>>1271 >>1272 >>1273 That's a syntax error on an f-string, so it does indeed look like the issue is that Python 3.8 isn't installed, or at least isn't what your python3 alias calls. Debian and Ubuntu can use the Deadsnakes PPA, but they only package 3.9, because you can technically get 3.8 with a dist-upgrade. You can also compile from scratch; it's a straightforward process. Just make note of where the binary is installed. Bear in mind you'll need to install pip and dependencies for each version of Python; using a virtual environment helps a lot with managing this. Please let me know if you need any further help.
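A pre-flight version guard would turn the cryptic SyntaxError above into a clear message. A minimal sketch with a hypothetical `version_ok` helper (not anything in the repo); note the guard only helps if the file it lives in avoids syntax that older interpreters can't even parse, such as f-strings:

```python
import sys

MIN_VERSION = (3, 8)

def version_ok(version=None, minimum=MIN_VERSION):
    """Return True if the running (or given) Python version meets the minimum."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version) >= tuple(minimum)

if __name__ == "__main__":
    if not version_ok():
        # %-formatting on purpose: an f-string here would itself fail to
        # parse on Python 2, masking this message with a SyntaxError.
        sys.exit("Python %d.%d+ required." % MIN_VERSION)
```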
>>1274 I compiled Python 3.8 and, after getting each module it bitched at me about, I got it to work. Probably went about this all the wrong way, but in the end it worked. Works on some boards/sites fine but getting errors for others on zzzchan, like this:

python3.8 change-channel.py -s zzzchan.xyz -b b -o b
Threads |██▊⚠ | (!) 3/42 [7%] in 1.8s (1.68/s)
Traceback (most recent call last):
  File "change-channel.py", line 600, in <module>
    scrapeJschanBoard(args, json_catalog, output_root)
  File "change-channel.py", line 233, in scrapeJschanBoard
    for result in results:
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 865, in next
    raise value
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "change-channel.py", line 194, in scrapeThread
    thumb_loc = f'/file/thumb-{file_data["hash"]}{file_data["thumbextension"]}'
KeyError: 'thumbextension'
>>1276 I thought I fixed that; turned out I forgot to apply the fix to threads and not just posts. I've pushed an update which should rectify the issue. It should now just fail to download thumbnails for those files. Not very elegant, and I'd like to revisit it eventually, but I was able to complete an entire scrape with no errors. I show 2,343 images in the src/ folder for z/v/, compared with the original list of 2,396. If you can find anything missing that is available on the site, let me know and I will investigate. Thank you for the bug report!
>>1279 Seems to just hang on lynxchan.net boards and doesn't do anything.
>>1281 Their site uses the www subdomain and redirects requests without it, but their certificate only covers URLs with the subdomain included. I actually encountered a similar issue with some of lainchan's alternate domains, which the certificate is not properly configured for. I almost removed the validation check entirely, but the requests library screams bloody murder about insecure requests, so I reverted. I am able to scrape as long as I include the www in the site argument. For example:

./change-channel -s www.lynxchan.net -b lynx -o lynx -j 8

Most hangs are a failure to resolve the URL, although some are occasionally caused by multithreaded scrapes failing to release locks. I'll see if I can't add an explicit error when this happens, though.
>>1281 Figured that was it but I guess I unnecessarily added the https shit too which threw me for a loop.
>>1257 What about using it through Tor?
>inb4 scraping through Tor bad
Not anymore; Tor now handles many connections easily.
>>1286 I haven't tested it. But assuming you're routing all traffic through Tor, I don't see why a request would fail. As long as your network can resolve an address and handle requests/responses, it should work. Give it a try and if there are any issues you can report them.
JSChan importing when?
>>1465 I'm focusing on the next LynxChan upgrade right now. I'll pick up development of ChannelChanger once that is done.
>>1475 Okay thank you lain tranny
>>1475 Is there a rough estimate for when this will be done? Not trying to pester just need to account for potential board migrations and when/how they might occur. I wouldn't expect you to prioritize off-site migrations, or care much at all about them.
>>1538 LynxChan 2.5 RC1 launches October 17th. It will be no sooner than that. With some luck, the update will happen at the end of the month. Is this regarding migrating /r9k/ from the zchan database to zzzchan? If so, I was planning to reach out to Sturgeon about that once my plate was clear. If this is regarding another move, or if you're the admin of a different JSChan site, I'd encourage you to reach out to me via e-mail; I will need a few active JSChan users to lend me their eyes for bug hunting.
>>1539 >Is this regarding migrating /r9k/ from the zchan database to zzzchan? It is about migrating /r9k/, but from lynxchan.net to zzzchan. Unless I'm mistaken sturgeon does not have access to the original zchan database. A month from now is a better estimate than I was expecting. If you think it'll be possible in November it may be worth waiting.
>>1540 A month from now is when I start working on ChannelChanger actively again. If you want the posts from zchan and the admin is willing to either share the backup or host it long enough to grab a scrape then you could also import those. Merging the zchan and lynxchan boards wouldn't be straightforward, but would be possible. Either way, I'll try to get it done by the end of November. Can't promise anything, though. I'd be willing to help with the migration when it does happen. I'd still recommend you contact me via e-mail; board owners have a good sense for when something on their board is broken, and since I haven't imported to JSChan before I think having you look over everything beforehand would be best.
>>1542 Alright, I'll have a think on it for awhile.
I've pushed an update which allows users to ignore invalid certificates. This will solve both the invalid-certificate-after-redirect issue and sites being inaccessible due to expired certificates.
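With the stdlib, "ignore invalid certificates" boils down to an SSL context with verification disabled. A urllib-based sketch for illustration, not the tool's actual requests-based implementation:

```python
import ssl
import urllib.request

def insecure_context():
    """An SSL context that skips hostname and certificate checks.

    Only sensible for scraping sites with misconfigured, mismatched,
    or expired certificates; it removes all TLS identity guarantees.
    """
    ctx = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx

def fetch(url, verify=True):
    """Fetch a URL, optionally tolerating invalid certificates."""
    ctx = None if verify else insecure_context()
    with urllib.request.urlopen(url, context=ctx) as resp:
        return resp.read()
```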
(86.47 KB 1024x1024 1586634477331.jpg)
Yet another project that seems interesting, written in that fucking retarded faggot shit language. Fuck.
>>1597 Using a mediocre language to get shit done is still miles better than getting nothing done even while endlessly trying to put others down as they fail to meet your impossible standards, standards nobody cares for outside of you.
>>1597 What other better language could be used to do something like this with ease?
>>1598
>impossible standards
Choosing a non- or less-retarded and/or gay language for a project is not impossible.
>trying to put others down
Not trying to put anyone else down, mate, just lamenting the fact.
>nobody cares for outside of you
Only a few do care, and that's the main reason why things are how they are.
>>1600
Ruby or Perl could be used to do something like this with ease. The former is less pozzed, the latter virtually AIDS-free.
>>1617 This is possible on LynxChan sites because of the "transfer threads" feature which can move threads between boards and dynamically re-numbers them. Building this into the tool itself for use on any website would be a pain, but it's theoretically doable.
>The 9th Circuit has defended the right to scrape publicly-accessible data
What utter faggotry. It's up to the server admin to permit that.
>>1542 >>1545 I'm going to be opting to migrate r9k to zzzchan sooner rather than later. So don't concern yourself over time frames for development or whatever, you won't have me waiting on you.
The political posts are off-topic and have been moved to >>1631.
(378.66 KB 400x358 sonic the hedgehog.gif)
I have thus far failed to install Python on my Linux machine such that it will allow the LynxChan board export script to function. Is a more portable version of that utility upcoming, or should I continue hacking away at it?
Can you explain your problem exactly? How have you tried to install it, and what issues are you having with it? I do plan to make a standalone executable version eventually, but I want it to be feature-complete first, and importing still needs a revamp. Also, I'm going to move this thread to the ChannelChanger general on >>>/t/ after you reply again.

Sonic Jam was great
Your gif is actually a fucking webp. You just named it .gif, retard.
Merged the thread from >>>/site/ into here. >>1995 If you continue to have issues with installing Python then I can help if you provide more information.
>>1257 So is the project dead? I was waiting for the importing to Vichan feature, that would allow to archive any Vichan board and make a read-only version.
Pushed an update to rectify an oversight in DB versioning. File hashes were updated, but one of the identifier fields was not.
>>2099
No, but between the holidays, the main site, the streaming site, and some other projects, my hands have been full. I released this when it hit a minimum viable state. I will need to learn the layout of the Vichan database, possibly accounting for the most popular forks. If you are familiar with administrating Vichan servers, please drop me a line; you may be able to expedite the process.
>>2125
Are you facing difficulty scraping from Cloudflare-protected sites? If so, let me know which sites are causing a problem. Even on a VPN, I've been able to take test scrapes from every webring site.
UnicodeEncodeError: 'charmap' codec can't encode character '\u2b24' in position 36197: character maps to <undefined>
>>2380 Your terminal does not support Unicode. Most likely, it is set to a basic ASCII mode. Try changing your locale to en_US.UTF-8.
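Besides fixing the locale, a script can sidestep the terminal's codec entirely when writing files by forcing UTF-8 explicitly. A generic sketch (`write_text` is a hypothetical helper, not the tool's actual fix):

```python
def write_text(path, text):
    """Write text as UTF-8 regardless of the terminal/locale encoding.

    Without an explicit encoding, open() falls back to the locale's
    preferred codec, and an ASCII-ish locale raises UnicodeEncodeError
    on characters like '\u2b24' (BLACK LARGE CIRCLE).
    """
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
```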
Would it be possible to use this with onion site boards?
>>2598 Yes. If your terminal is making requests through Tor then it should resolve the .onion. You can either enable Tor and tunnel all traffic through it or look at tools like Torify that claim to proxy a single command through the network.
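Alongside torify-style wrappers, a Python scraper can be pointed at the local Tor daemon's SOCKS port directly. A sketch assuming Tor's default port 9050 and a requests-style proxies mapping (actually using such a mapping requires SOCKS support, e.g. the optional requests[socks] extra):

```python
def tor_proxies(host="127.0.0.1", port=9050):
    """Proxy mapping for routing HTTP(S) through a local Tor SOCKS port.

    'socks5h' (rather than 'socks5') makes DNS resolution happen inside
    Tor, which is what allows .onion addresses to resolve at all.
    """
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}
```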
Is it possible to scrape more specifically? Like only scrape a single thread, or even only files from a thread rather than also the post comments?
>>2642 There is both a blacklist and a whitelist option; check help for syntax. Just whitelist the thread you want the images from. It will still grab the HTML and JSON, but the image and thumbnail folders should only have files from the associated thread. I could add a file-only option if that would help.
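The whitelist/blacklist behavior described above reduces to membership tests on thread IDs. A hypothetical sketch (`filter_threads` is not the tool's function name):

```python
def filter_threads(thread_ids, whitelist=None, blacklist=None):
    """Select which threads to scrape.

    With a whitelist, only listed threads are kept; blacklisted
    threads are then dropped from whatever remains.
    """
    kept = [t for t in thread_ids
            if whitelist is None or t in set(whitelist)]
    return [t for t in kept
            if not (blacklist and t in set(blacklist))]
```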
>>2653 >Check help for syntax Oh I see, I'm a retard and didn't know you could do that. Was looking in the readme.
Does your own tool not work on your own website? I'm able to download JSChan boards fine, but I get failure-to-establish-connection errors when attempting it with boards here. Is this because of the anti-media off-site linking autism?
>>2692
>Does your own tool not work on your own website?
>Is this because of the anti-media off-site linking autism?
Well, that's just non-thorough testing. I forgot the new session manager doesn't hang in situations like this. I've pushed a fix to master.
>>2708 Seems to work now. I probably should have caught it sooner, just been lazy and haven't bothered to backup any 8moe boards yet.
>>2709 >>2708 Actually it doesn't work for specific boards. >Request for https://8chan.moe/delicious/catalog returned code 404. The resource could not be located. >Request for https://8chan.moe/sm/catalog returned code 404. The resource could not be located. I assume this is because they're NSFW boards or something? Can this be remedied?
>>2710 Turns out the earlier fix was partial. Patched again. Should solve the problem for both this site and any others that decide to implement the same thing.
Is there a way to know the total size of a board before downloading it?
>>2725 Not currently, but I could add that in. Note that it would be a lowball estimate, because thumbnails would not be included in the count.
>>2728 A rough estimate is fine.
>>2729 I pushed an update. The script will now halt after gathering thread data, display the file size, and ask if you want to proceed. I've also added an override option of -y or --yes to skip this prompt.
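The halt-and-confirm flow described above might look like this. The helper names and the JSON field names (`posts`, `files`, `size`) are assumptions for illustration; actual field names vary per imageboard engine:

```python
def estimate_bytes(threads):
    """Sum the reported file sizes in scraped thread JSON.

    Thumbnails are not included in the listings, so the real total
    will be somewhat higher than this estimate.
    """
    total = 0
    for thread in threads:
        for post in thread.get("posts", []):
            for f in post.get("files", []):
                total += f.get("size", 0)
    return total

def confirm(total_bytes, assume_yes=False):
    """Halt for confirmation unless a -y/--yes style override is set."""
    if assume_yes:
        return True
    answer = input("About %.1f MiB to download (thumbnails excluded). "
                   "Proceed? [y/N] " % (total_bytes / 2**20))
    return answer.strip().lower() == "y"
```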
>>2730
>OSError: Not enough disk space for all files.
It spit out an error confirming there isn't enough disk space, which is good, but it doesn't say how much space is needed when there isn't enough.
>>2731 Good catch. I've added the space requirements and a notice about thumbnails not being included to the error.
>>2745 It appears to be erroring out because my home partition doesn't have enough space but the drive I'm pointing the download to does have enough space.
>>2752 Forgot to pass it the working directory. Should be good now.
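The working-directory fix matters because the free-space check has to run against the download target's partition, not whatever partition the script happens to be launched from. A stdlib sketch of such a check (hypothetical helper, not the tool's code):

```python
import shutil

def check_space(output_dir, needed_bytes):
    """Raise OSError with the shortfall if output_dir's partition lacks room.

    Checking output_dir rather than the current directory is what lets a
    download pointed at a roomier drive pass even when the home partition
    is nearly full.
    """
    free = shutil.disk_usage(output_dir).free
    if free < needed_bytes:
        raise OSError(
            "Not enough disk space: need %d bytes (thumbnails excluded), "
            "only %d free on %s" % (needed_bytes, free, output_dir)
        )
    return free
```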
If I download a board to the same folder more than once will it add new content while keeping the remaining content intact? What if posts or threads have been deleted since last backing up the board?
>>2837 I'd also wonder if the src and thumbnails could be organized into folders based on their thread number or potentially thread subject if available? I know the intent of the tool was for importing it into other sites databases and the like but its main use for me is archiving.
>>2837
If you have a copy of a thread on disk, and you scrape it again, the JSON and HTML retrieved will entirely overwrite the last ones, so a post deleted between scrapes will go missing. Expired/deleted threads will remain as they were, since they will not be overwritten. If you kept scraping to the same directory over a long period of time, you'd effectively build up a catalog of thousands of threads and all their media; more than would fit on a live board. It effectively accumulates. You're really only at risk of losing deleted posts inside of threads. If you have a great need to retain those, too, my recommendation would be to put the JSON and HTML into source control.
>>2838
I'll see what I can do, but the current system is designed to reduce redundant downloads. The directory structure mimics a Vichan server, which stores media per board. That said, I would like the tool to be useful as a general-purpose scraper. Ideally I'd like a way to do it without making archives unusable for import later. Let me know what else you'd need for archival purposes and I'll keep it in mind when I revisit scraper functionality.
>>2850
>If you have a copy of a thread on disk, and you scrape it again, the json and html retrieved will entirely overwrite the last one. So a post deleted between scrapes will go missing.
Interesting, that's workable information. I assume the files/media don't get deleted alongside the posts then?
>That said, I would like the tool to be useful as a general-purpose scraper. Ideally I'd like a way to do it without making archives unusable for import later.
>Let me know what else you'd need for archival purposes and I'll keep it in mind when I revisit scraper functionality.
It's basically completely functional for archiving; it's just a matter of convenience in how files are organized for offline/personal viewing. I'd agree that you shouldn't impede ease of importing to achieve this, though, since being able to re-import the content as-is down the line is just as important for archival.
>>2879
>I assume the files/media don't get deleted alongside the posts then?
Exactly. It always checks whether a conflicting file exists, and if so, it will skip the download. It does this for both files and thumbnails. I believe it relies on the filename for this, which is a SHA-256 hash on LynxChan, JSChan, and some Vichan forks, so outside of some Vichan instances you're guaranteed files are unique. The upside is you can download a huge board and then, once finished, grab an updated copy and only retrieve missing files.
>it's just a matter of convenience in how files are organized for offline/personal viewing.
It's possible I don't actually need the HTML for anything at the moment (it's mostly there to mirror Vichan's folders), so once I finish up importing I'll look at an option to modify it for local viewing. The JSON is far more important for imports.
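The hash-based filenames mentioned above are what make the exists-check double as deduplication: identical content always produces the same name, so a filename collision means the file is already downloaded. A sketch of the naming scheme (the helper name is illustrative):

```python
import hashlib

def content_name(data, extension):
    """Filename derived from the file's SHA-256 digest plus its extension.

    Identical bytes always map to the same name, so checking for the
    name on disk is equivalent to checking for the content itself.
    """
    return hashlib.sha256(data).hexdigest() + extension
```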
this would be like 10 lines in bash if you used wget
>The 9th Circuit has defended the right to scrape publicly-accessible data (Archive). In the same ruling, they explicitly disallow measures which discriminate against scrapers, holding that they have the right to access any data a user with a web browser does.
Who the fuck is the 9th Circuit, and why should anyone care? And also you're wrong, faggot: you don't have the right to rape web servers with automated requests, because that stupid article rests on the fallacy that bot activity = "a user with a web browser", which is clearly wrong unless you limit your tool to 20KB/s, which I doubt is the case. So fuck off.

