This is more like Part 3 but I'll catch you up, lol.
My first choice for the capstone project was Recursion Cellular Image Classification, which aims to disentangle experimental 'noise' from biological signals. This would help researchers understand how drugs interact with human cells, which translates to lower treatment costs and less time to bring new treatments to market. My computer couldn't handle the dataset. There were over 85 GB worth of images and I shouldn't have attempted to download it, lol. It took a day to get the zipped file onto my computer and partially unzipped before I ran out of room. Then it took another day to get the files off my computer.
My second choice, and my current project, is an image recognition application to identify metastatic tissue in histopathologic scans of lymph node sections. The dataset is 6 GB (220,006 image files), and my computer is struggling to do anything with it. Right now it needs to parse a list and separate the 'cancer' images from the 'not cancer' images, which were not-so-conveniently itemized in a CSV file.
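For anyone curious what that sorting step looks like, here is a minimal sketch in Python. It is not my exact script. It assumes the CSV has an `id` column holding the image name without its extension and a `label` column where `1` means cancer, and that the images are `.tif` files. Those names, the function `split_by_label`, and the folder names are all placeholders for illustration, so adjust them to match your own files.

```python
import csv
import shutil
from pathlib import Path


def split_by_label(csv_path, image_dir, out_dir,
                   id_col="id", label_col="label", ext=".tif"):
    """Move each image listed in the CSV into out_dir/cancer or out_dir/not_cancer.

    Column names, the "1 means cancer" convention, and the file extension
    are assumptions; change them to match the real dataset.
    """
    image_dir, out_dir = Path(image_dir), Path(out_dir)
    counts = {"cancer": 0, "not_cancer": 0, "missing": 0}

    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            src = image_dir / (row[id_col] + ext)
            if not src.exists():
                # Listed in the CSV but not on disk: count it and keep going.
                counts["missing"] += 1
                continue
            folder = "cancer" if row[label_col] == "1" else "not_cancer"
            dest = out_dir / folder
            dest.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest / src.name))
            counts[folder] += 1

    return counts
```

Calling something like `split_by_label("labels.csv", "images", "sorted")` (hypothetical paths) returns a small tally of how many files landed in each folder, which is a handy sanity check against the CSV row count.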
I spent the morning separating those files into 14 different folders in the hope that my computer finds searching through fewer than 20k files per folder more manageable than the full dataset. Of course I'm using automation scripts, but my computer is a dinosaur so it's still struggling to get by. But I just need to get the data separated and into Azure. After that I can get these images off my computer and handle everything else from the cloud.
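The folder split described above can be sketched roughly like this. Again, this is an illustration under assumptions, not the script I actually ran: the function name `split_into_batches`, the `batch_01`-style folder names, and the `*.tif` pattern are all made up for the example. With 220,006 files and 14 batches, the batch size works out to 15,715 files, so every folder stays comfortably under 20k.

```python
import math
import shutil
from pathlib import Path


def split_into_batches(src_dir, out_dir, n_batches=14, pattern="*.tif"):
    """Move files from src_dir into at most n_batches numbered subfolders.

    Folder names and the glob pattern are placeholders. Returns the number
    of files moved into each batch folder, in order.
    """
    files = sorted(Path(src_dir).glob(pattern))
    if not files:
        return []

    # Round up so that no more than n_batches folders are ever created.
    size = math.ceil(len(files) / n_batches)
    sizes = []

    for start in range(0, len(files), size):
        batch_dir = Path(out_dir) / f"batch_{start // size + 1:02d}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        chunk = files[start:start + size]
        for f in chunk:
            shutil.move(str(f), str(batch_dir / f.name))
        sizes.append(len(chunk))

    return sizes
```

Smaller folders also mean each batch can be processed and uploaded on its own, so a crash partway through only costs one batch instead of the whole run.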
The first file is still running as I post this. Remember, I have 14 to get through, and once I do all that, I need to upload them to Azure.
I am thankful the cloud exists so I can work with datasets like this! I know I need to upgrade my computer. It's on my list, especially because I REALLY want to attempt the Recursion Cellular Image problem once I have enough time to do it properly.