Finding Similar Files in Large Document Repositories
KDD’05, August 21-24, 2005, Chicago, Illinois, USA. Copyright 2005 ACM
George Forman, Hewlett-Packard Labs
Kave Eshghi, Hewlett-Packard Labs
Stephane Chiocchetti, Hewlett-Packard France
Agenda
Introduction
Method
Hashing background
Chunking
Chunking and file similarity
File similarity algorithm
File similarity algorithm (cont.)
File similarity algorithm (cont.)
File similarity algorithm (cont.)
Handling identical files
Handling identical files (cont.)
Complexity analysis
Results
Related work
Conclusions

Finding Similar Files in Large Repositories Using Content-Based Chunking
