The filter takes in a listing id and photo database to retrieve the photo URLs. A data frame of each image URL, image array, height, and width is generated from these photo URLs. A dictionary containing the listing id, image URL and reasons for passing or failing the verdict of whether the listing is of quality.
The distribution of heights and widths across the sample photos is studied first, before finding out sample means and standard deviations of heights and widths.
Photos with widths and heights below 2 standard deviations below the mean are filtered out as "poor quality".
Since the plot shows the distribution to be more skewed towards the left, the heights and widths are normalized before finding out their respective sample means and standard deviations and then filtering out those below 2 standard deviations.
We plot the sample photos after filtering to determine the correctness of the filter.
Link to jupyter notebook: photo-size.ipynb
One attribute of a poor quality photo is one with human faces. OpenCV already has a trained model to use to detect a human face. The photo is flagged as 'crappy' if a human face is detected.
With Pytesseract, we can also filter poor quality photos with words on them.
Link to jupyter notebook: face-word-detector.ipynb
Low quality photos also includes those which are blurry or have low resolution.
Laplacian kernel scores filters out the blurry photos when set at 600. Improvements to text detection, especially white and coloured texts have yet to be explored.
Link to jupyter notebook: photo-analyses-v2.ipynb