Pinterest moved 100TB of data from MySQL databases to S3 and Hadoop continuously using a new data pipeline. The pipeline uses Kafka to stream database change events in real-time. It incorporates periodic compaction to merge snapshots and deltas into a compact format with 15 minute latency. The new system provides reliability, scalability, and enables features like real-time search and recommendations.
49. S3Nuances
● Eventual Consistency
● Read-after-write is OK, but not PUT followed
by LIST
● Directory listing is slow
● Shorter SLA —> More smaller files
● In early iterations, directly listing >> file content
reading
● Rate Limit:
● Launching thousands of mappers would
quickly hit S3 rate limit