TransWikia.com

Parsing and storing a large amount of HTML data

Data Science Asked by nainometer on January 12, 2021

I have a chunk of data (~30k) made up of HTML pages and PNGs of websites, saved in folders. The folders are named with randomly generated hashes. My supervisor wants me to crunch through this data, extract some attributes from each HTML page, and store them in a DB for future use. The attributes to extract are the page title and the copyright section of the HTML.
As I understand it, this data is unstructured because there is no relation, per se, within the folder data for now. There is some inherent structure, namely that of HTML, but each page is essentially disjoint from the rest, which I think qualifies it as unstructured. Please correct me if I am wrong here.
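The extraction step described above could be sketched roughly as follows. This is only a minimal illustration using Python's standard `html.parser`; the names `PageAttributeParser` and `extract_attributes` are made up for this example, and the rule "the first text fragment containing ©, (c) or the word copyright" is an assumption about what the copyright section looks like, not something the data guarantees.

```python
import re
from html.parser import HTMLParser


class PageAttributeParser(HTMLParser):
    """Collects the <title> text and the visible text fragments of one page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth:
            self.text_parts.append(data)


# Assumption: a copyright notice contains "©", "(c)" or the word "copyright".
COPYRIGHT_RE = re.compile(r"©|\(c\)|copyright", re.IGNORECASE)


def extract_attributes(html):
    """Return the page title and the first copyright-like text fragment."""
    parser = PageAttributeParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace in the title.
    title = " ".join("".join(parser.title_parts).split())
    copyright_text = None
    for fragment in parser.text_parts:
        cleaned = " ".join(fragment.split())
        if cleaned and COPYRIGHT_RE.search(cleaned):
            copyright_text = cleaned
            break
    return {"title": title or None, "copyright": copyright_text}
```

Real pages will likely need a looser or stricter copyright rule than this one, so the regular expression is the part to tune against a sample of the data.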

My manager wants the data stored in an ELK stack. He is still unclear about what exactly should be stored, but so far he wants the whole HTML file, the title, and the copyright extracted and stored for every single HTML file. This brings me to my first concern, which I need help with.

  • Is it a good idea to store the whole HTML file in the DB? I am of the
    opinion that we should place the HTML files in centralized storage on
    some kind of FS and store the absolute path of each file against its
    entry in the DB (we are already doing the same thing for PNGs, btw).

I haven’t worked with the ELK stack and I thought it would be a good learning opportunity. While going through online tutorials I have learned that it is essentially meant for parsing logs from different application servers, then storing and visualizing them in a presentable and searchable manner.

  • If anyone can comment on whether ELK will work in my case, that
    would be very helpful.

So far the end objective is to crunch through this data, store the attributes, and later search through them and use them as needed. For example, if a specific copyright text comes up very frequently, get that copyright text and use it to classify a certain pattern, which brings me to my third and last question.

  • Will it help to store it in a non-relational database and then query
    accordingly? In my opinion an RDBMS like MySQL is a better contender
    because it will be easy to search through the tables for a specific
    type of title and then use it accordingly. The end goal is not
    visualization, but to have the data at hand to use whenever required.
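To make the relational option concrete, here is a rough sketch of the two searches mentioned in the question: looking up pages by title and finding copyright texts that occur frequently. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL so that it runs without a server. The table name `pages`, its columns, and the sample rows and paths are all hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pages (
        folder_hash TEXT PRIMARY KEY,  -- name of the hashed folder
        html_path   TEXT NOT NULL,     -- absolute path of the HTML file on the FS
        title       TEXT,
        copyright   TEXT
    )
    """
)

# Hypothetical sample rows; real rows would come from the extraction step.
rows = [
    ("a1f3", "/data/sites/a1f3/index.html", "Login - Example", "© 2020 Example Corp."),
    ("b2e4", "/data/sites/b2e4/index.html", "Example Shop", "© 2020 Example Corp."),
    ("c3d5", "/data/sites/c3d5/index.html", "Login page", "© 2019 Other Ltd."),
]
conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", rows)


def find_by_title(conn, needle):
    """Return (folder_hash, title) for pages whose title contains `needle`."""
    cur = conn.execute(
        "SELECT folder_hash, title FROM pages "
        "WHERE title LIKE ? ORDER BY folder_hash",
        (f"%{needle}%",),
    )
    return cur.fetchall()


def frequent_copyrights(conn, min_count=2):
    """Return (copyright, count) for texts seen at least `min_count` times."""
    cur = conn.execute(
        "SELECT copyright, COUNT(*) AS n FROM pages "
        "WHERE copyright IS NOT NULL "
        "GROUP BY copyright HAVING COUNT(*) >= ? "
        "ORDER BY n DESC, copyright",
        (min_count,),
    )
    return cur.fetchall()
```

With the sample rows, `find_by_title(conn, "login")` returns the two login pages and `frequent_copyrights(conn)` returns the one copyright text shared by two pages. Exact-match grouping only catches identical strings, so near-duplicate notices would need normalization first.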

One Answer

The terms "structured data" or "unstructured data" are not defined in such a way that a given dataset is always either one or the other. There are gray areas and I think this is one example. Since you cannot rely on the structure in your data, I would categorize this as unstructured.

To understand if it's a good idea to store the whole HTML in the DB (and the same question for the PNGs), you need to weigh the pros and cons. The argument for storing everything in the DB is simplicity: you don't have separate places where data is stored, so if you take a snapshot of the DB at some point in time and restore it, you restore the entire state as it was at that time. You do not need to worry about restoring your disk storage to the same state separately. The argument against storing everything in the DB is the amount of data. Can the DB handle it, or does performance suffer too much? Think about retrieving the data, searching/querying it, storing data, and making backups. This will depend on your choice of database.
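One way to put a number on "the amount of data" before choosing is to measure what is actually on disk. The following is a small sketch, assuming all the hashed folders live under a single root directory; the function name `size_by_extension` is made up for this example.

```python
from collections import defaultdict
from pathlib import Path


def size_by_extension(root):
    """Return {extension: (file_count, total_bytes)} for all files under root."""
    totals = defaultdict(lambda: [0, 0])
    for path in Path(root).rglob("*"):
        if path.is_file():
            entry = totals[path.suffix.lower()]  # ".HTML" and ".html" count together
            entry[0] += 1
            entry[1] += path.stat().st_size
    return {ext: (count, size) for ext, (count, size) in totals.items()}
```

Comparing the totals for `.html` and `.png` against what the candidate database comfortably handles gives a first indication of whether storing everything in one place is realistic.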

The same question applies to ELK and MySQL: what are the pros and cons of each? MySQL is simpler to install, and that's always good. MySQL gives you a relational data model (tables can be related using foreign keys). Is that an advantage? MySQL gives you transactions. Is that helpful? ELK mainly gives you scalability, meaning that it would probably allow everything to be stored in the DB and still meet your performance needs.
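For comparison, on the ELK side each page would become one JSON document in an Elasticsearch index. The sketch below only builds the newline-delimited JSON body that the Elasticsearch bulk API accepts (an action line followed by the document itself); it does not talk to a cluster. The index name `pages`, the field names, and the function `to_bulk_ndjson` are assumptions for illustration.

```python
import json


def to_bulk_ndjson(records, index_name="pages"):
    """Build a bulk request body: one action line plus one document per record."""
    lines = []
    for record in records:
        # Use the folder hash as the document id so re-imports overwrite.
        action = {"index": {"_index": index_name, "_id": record["folder_hash"]}}
        source = {k: v for k, v in record.items() if k != "folder_hash"}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"
```

Whether the full HTML goes into such a document as an extra field or stays on disk behind a stored path is the same trade-off discussed above.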

If you can't store everything in MySQL (HTML and PNGs), then before choosing to store part of the data somewhere else, my first option would be to change DB technology to something that can store everything, rather than to start storing things in different places. So in that case ELK might be a good option, but store the PNGs there, too.

Answered by Paul on January 12, 2021
