Improving amazon data quality

Data Quality Analysis and Reporting Data Record Science Derek Pappas

Transcript of Improving amazon data quality

Page 1: Improving amazon data quality

Data Quality Analysis and Reporting

Data Record Science, Derek Pappas

Page 2: Improving amazon data quality

Detecting data quality problems

❖ amazon.com product data quality can be improved

❖ DRS software will detect the problems

Page 3: Improving amazon data quality

Filtering

There are so many opportunities to improve the user experience on amazon.com by identifying and fixing or filtering out the problematic data.

Page 4: Improving amazon data quality

Suggestions for Fixing Data

❖ Product matching

❖ Variant elimination

❖ Identify bad data

❖ Identify duplicate products with different names from the same vendor

❖ Identify missing data

❖ Suggest fixes for data

❖ Identify over/underpriced items at third-party stores (in my opinion, significantly overpriced items on amazon.com make Amazon look bad)

❖ Find bad/correct product classification

❖ Wrong product images

❖ Wrong specifications

❖ Google SEO violations

Page 5: Improving amazon data quality

Data Processing Pipeline

❖ Our pipeline is built on Hadoop MapReduce, which scales. The pipeline processed 200 million records last week. It can process billions.
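
To illustrate the map/reduce pattern named above, here is a minimal single-process sketch in plain Python. It is not the DRS pipeline and does not use Hadoop; the field names in `REQUIRED_FIELDS` and the sample records are assumptions made up for the example. The map step emits a count for each missing field, and the reduce step sums the counts per field.

```python
from collections import defaultdict

# Hypothetical schema: the fields every product record is assumed to need.
REQUIRED_FIELDS = ("title", "price", "image_url")


def map_record(record):
    """Map step: emit (field, 1) for each required field that is missing or empty.

    Note: any falsy value (None, "", 0) is treated as missing in this sketch.
    """
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            yield field, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Made-up sample records for illustration only.
records = [
    {"title": "Massage ball", "price": 9.99, "image_url": "a.jpg"},
    {"title": "", "price": 12.50, "image_url": None},
    {"title": "Ab straps", "image_url": "b.jpg"},
]

pairs = [pair for record in records for pair in map_record(record)]
missing = reduce_counts(pairs)
```

In a real Hadoop job the map and reduce functions would run on separate nodes over partitioned input, which is what lets the approach scale.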

Page 6: Improving amazon data quality

Detecting problems

The following are just a few examples of problems that the DRS pipeline can detect.

Page 7: Improving amazon data quality

Overpricing

See the attached image of the massage balls. We can group those product variants and we can identify the overpricing.
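
One possible way to flag overpricing once variants are grouped is to compare each price with the group median. The sketch below assumes the grouping has already been done and is stored in a hypothetical `group_id` field; the `ratio` threshold, the minimum group size, and the sample prices are likewise assumptions, not values from the DRS software.

```python
from collections import defaultdict
from statistics import median


def find_overpriced(listings, ratio=2.0, min_group_size=3):
    """Return listings priced above `ratio` times the median of their variant group."""
    groups = defaultdict(list)
    for item in listings:
        groups[item["group_id"]].append(item)

    flagged = []
    for items in groups.values():
        if len(items) < min_group_size:
            continue  # too few offers to estimate a typical price
        typical = median(item["price"] for item in items)
        flagged.extend(item for item in items if item["price"] > ratio * typical)
    return flagged


# Made-up sample data for illustration only.
listings = [
    {"group_id": "massage-ball", "seller": "A", "price": 8.0},
    {"group_id": "massage-ball", "seller": "B", "price": 9.0},
    {"group_id": "massage-ball", "seller": "C", "price": 10.0},
    {"group_id": "massage-ball", "seller": "D", "price": 45.0},
]

overpriced = find_overpriced(listings)
```

The median is used rather than the mean so that a single extreme price does not shift the baseline it is compared against.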

Page 8: Improving amazon data quality

Overpricing Example

Page 9: Improving amazon data quality

"Jamming" the Amazon Index

❖ The link below shows the same product over and over with different product names; these are not variants. The vendor is "jamming" the Amazon index so that their product shows up under different search terms. Google algorithmically reduces the number of a site's links in the Google index when the site is "spammy", and it manually excludes a site from the index, or reduces the number of its links there, when the site uses black hat SEO tactics. See the image below.

❖ https://www.amazon.com/s/ref=sr_st_price-asc-rank?keywords=ab+straps+hanging&rh=i%3Aaps%2Ck%3Aab+straps+hanging&qid=1480277091&sort=price-asc-rank
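
A rough sketch of how this kind of jamming might be detected: group listings by vendor and by some fingerprint of the product, then flag groups that carry many distinct titles. The `image_hash` field used as the fingerprint, the `min_titles` threshold, and the sample listings are all assumptions for the example; a production system would likely need a more robust product match.

```python
from collections import defaultdict


def find_jammed_listings(listings, min_titles=3):
    """Return {(vendor, image_hash): titles} for products listed under many names."""
    titles_by_key = defaultdict(set)
    for item in listings:
        key = (item["vendor"], item["image_hash"])
        # Normalize case and whitespace so trivial differences are not counted.
        titles_by_key[key].add(item["title"].strip().lower())
    return {
        key: sorted(titles)
        for key, titles in titles_by_key.items()
        if len(titles) >= min_titles
    }


# Made-up sample data for illustration only.
listings = [
    {"vendor": "V1", "image_hash": "h1", "title": "Ab Straps Hanging"},
    {"vendor": "V1", "image_hash": "h1", "title": "Hanging Ab Slings"},
    {"vendor": "V1", "image_hash": "h1", "title": "Pull Up Bar Straps"},
    {"vendor": "V2", "image_hash": "h2", "title": "Ab Straps"},
]

jammed = find_jammed_listings(listings)
```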

Page 10: Improving amazon data quality

“Jamming” the Amazon Index

Page 11: Improving amazon data quality

Bad Classification

❖ In other instances on amazon.com I see misclassified items. In most cases we can identify the classification problems now.

Page 12: Improving amazon data quality

Bad Classification

There are biking and racing helmets mixed together.

https://www.amazon.com/s/ref=sr_nr_p_36_2?srs=2592626011&fst=as%3Aoff&rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A706814011%2Cn%3A3403201%2Cn%3A6389202011%2Cn%3A3404571%2Ck%3ARACING%2Cp_36%3A1253557011&bbn=3404571&sort=price-asc-rank&keywords=RACING&ie=UTF8&qid=1480301345&rnid=386589011

Page 13: Improving amazon data quality

Wrong Product Image

❖ Amazon does not know who the manufacturer is. Searching for "racing" inside of Giro returns Fox and Bell at the top of the search results.

Page 14: Improving amazon data quality

Wrong Product Image (Socks)

Page 15: Improving amazon data quality

Bad Specifications

❖ Name/value pairs do not match
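
As a sketch of one way to catch a specification value that does not fit its name, each attribute name can be checked against a pattern for the values it is expected to hold. The attribute names and regular expressions in `VALIDATORS` are assumptions chosen for the example, not the rules the DRS software uses.

```python
import re

# Hypothetical per-attribute patterns; attributes without a pattern are not checked.
VALIDATORS = {
    "weight": re.compile(r"^\d+(\.\d+)?\s*(oz|lb|lbs|g|kg)$", re.IGNORECASE),
    "color": re.compile(r"^[A-Za-z][A-Za-z /-]*$"),
}


def mismatched_specs(specs):
    """Return the attribute names whose value does not fit the expected pattern."""
    bad = []
    for name, value in specs.items():
        pattern = VALIDATORS.get(name.lower())
        if pattern and not pattern.match(value.strip()):
            bad.append(name)
    return bad


# Made-up sample specification for illustration only.
specs = {"Weight": "Black", "Color": "Red", "Material": "Cotton"}
problems = mismatched_specs(specs)
```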

Page 16: Improving amazon data quality

Bad Specifications

Page 17: Improving amazon data quality

Mining Reviews

❖ Product Quality Issues (including Amazon basics)

❖ Store customer service issues

❖ Graph ratings vs. number of reviews (validity: is one 5-star review better than fifty 4-star reviews?)
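
One common way to weigh a rating against the number of reviews behind it is a Bayesian average, which pulls products with few reviews toward a prior mean. This technique is offered here as an illustration and is not stated in the slides; the prior mean of 3.5 and prior weight of 10 are assumed values.

```python
def bayesian_average(mean_rating, num_reviews, prior_mean=3.5, prior_weight=10):
    """Shrink a product's mean rating toward `prior_mean` when it has few reviews.

    `prior_weight` acts like a number of imaginary reviews at the prior mean.
    """
    total = prior_weight * prior_mean + num_reviews * mean_rating
    return total / (prior_weight + num_reviews)


one_five_star = bayesian_average(5.0, 1)
fifty_four_star = bayesian_average(4.0, 50)
```

Under these assumed parameters, fifty 4-star reviews score higher than a single 5-star review, because the single review barely moves the estimate away from the prior.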

Page 18: Improving amazon data quality

Sort by Price Does Not Include Shipping
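
A minimal sketch of the fix this slide implies: sort offers by item price plus shipping instead of item price alone. The `price` and `shipping` field names and the sample offers are assumptions for the example.

```python
def sort_by_total_price(offers):
    """Sort offers by item price plus shipping; missing shipping counts as free."""
    return sorted(offers, key=lambda offer: offer["price"] + offer.get("shipping", 0.0))


# Made-up sample offers for illustration only.
offers = [
    {"seller": "A", "price": 5.00, "shipping": 6.99},
    {"seller": "B", "price": 9.99, "shipping": 0.0},
]

by_item_price = sorted(offers, key=lambda offer: offer["price"])
by_total_price = sort_by_total_price(offers)
```

Here seller A appears cheapest when sorted by item price, but seller B is cheapest once shipping is included.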

Page 20: Improving amazon data quality

Reporting and Analysis

❖ Our data analysis and reporting can find the good/bad records and the good/bad/missing fields/images.

❖ Moreover, our software can often suggest fixes on the data analysis website.