simon987/dataarchivist.net

mirror of https://github.com/simon987/dataarchivist.net.git synced 2025-04-24 11:15:50 +00:00

simon987 f16e743150 CSS tweaks

2019-05-14 10:42:43 -04:00

---
title: "Web Scraping 101: E-Reader app"
date: 2019-05-13T10:46:34-04:00
tags: ["scraping"]
draft: false
---

Let's say you bought a textbook and it comes with an online code that lets you read its online version.

Of course, that online version is tied to your account, expires in 6 months and is not compatible with your tablet's browser. So what are you gonna do? Hack together a script that takes screenshots of the pages? That's not a bad idea, but first let's see if we can get through the e-reader's DRM.

After logging into the app, I immediately open the dev tools and this is what I see:

So, individual PDF pages are fetched from this getpdfpage endpoint, rendered with a JavaScript library and displayed in your browser every time you flip a page in the app.

This is what is sent to the endpoint:

{{<highlight text>}}
globalbookid: ""
pdfpage: ".pdf"
iscover: "N"
authkey: ""
hsid: ""
{{</highlight>}}

Obviously, globalbookid is the unique ID of the book I am looking at. pdfpage is the ID of the page; there is probably a way to get a list of those from another endpoint. iscover and authkey are self-explanatory. So what exactly is that hsid parameter? From what I can see, it is different for every request.

Looking further, I find the getpagedetails endpoint, which does exactly what the name suggests:

{{< figure src="/scrape/dev_tools2.png" title="/getpagedetails">}}
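To get a feel for that response, here is a minimal sketch that lists the pages from a saved copy of it. The JSON shape (a list whose first element holds `pdfPlayerPageInfoTOList`, with `pageOrder`, `bookPageNumber` and `pdfPath` on each page) is the one the download script later in this post relies on. The sample values and the `list_pages` helper are made up for illustration.

```python
import json

# Hypothetical, trimmed-down stand-in for a saved /getpagedetails response.
# Only the key names come from the real response; the values are invented.
SAMPLE_RESPONSE = """
[
  {
    "pdfPlayerPageInfoTOList": [
      {"pageOrder": 2, "bookPageNumber": "ii", "pdfPath": "page_0002.pdf"},
      {"pageOrder": 1, "bookPageNumber": "i", "pdfPath": "page_0001.pdf"}
    ]
  }
]
"""


def list_pages(raw_json):
    """Return (pageOrder, bookPageNumber, pdfPath) tuples sorted by pageOrder."""
    pages = json.loads(raw_json)[0]["pdfPlayerPageInfoTOList"]
    return sorted(
        (p["pageOrder"], p["bookPageNumber"], p["pdfPath"]) for p in pages
    )


for order, number, path in list_pages(SAMPLE_RESPONSE):
    print(order, number, path)
```

Sorting by `pageOrder` is a precaution; the real response may well be ordered already.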

Okay, so we have our authkey, the list of pdfpages, and we know the globalbookid. Let's dig into the minified JavaScript code to find out how the getpdfpage endpoint is called.

{{<highlight javascript "linenos=table,linenostart=370931">}}
var o = "".concat(e.serverDetails, "/ebook/pdfplayer/getpdfpage?globalbookid=") + "".concat(e.globalBookId, "&pdfpage=").concat(t.pdfPath, "&iscover=N&authkey=").concat(r),
    i = o.replace("https", "http"),
    c = Object(s.c)(l.b.MD5_SECRET_KEY + i);
o = "".concat(o, "&hsid=").concat(c), n.pdfPath = o, a.bookPagesInfo.pages.push(n)
{{</highlight>}}

Interesting... The query URL is built by concatenating the different parameters together, as you would expect. Then that URL (everything but the mysterious hsid parameter, with https swapped for http) is prefixed with a secret key and put into a hash function, and the result becomes the value of the hsid parameter.

Without even looking at the s.c function, it is becoming more and more obvious that the value of hsid is an MD5 hash of the query URL (minus hsid itself), with l.b.MD5_SECRET_KEY prepended as the salt.
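As a rough sketch of that signing step in Python: this assumes `s.c` returns a lowercase hex MD5 digest of the UTF-8 string, which the minified snippet alone does not prove. The `sign_url` helper, the host name, the secret and the parameter values are all placeholders invented for the example.

```python
import hashlib


def sign_url(secret, url):
    """Append an hsid computed as md5(secret + http form of url)."""
    # JavaScript's String.replace with a string pattern only swaps the
    # first match, hence the count of 1 here.
    hashed_part = url.replace("https", "http", 1)
    hsid = hashlib.md5((secret + hashed_part).encode()).hexdigest()
    return "%s&hsid=%s" % (url, hsid)


# Every value below is a made-up placeholder, not taken from the real app.
url = ("https://reader.example.com/ebook/pdfplayer/getpdfpage"
       "?globalbookid=BOOK&pdfpage=page_0001.pdf&iscover=N&authkey=KEY")
signed = sign_url("not-the-real-secret", url)
print(signed)
```

Note that only the hashed copy of the URL is downgraded to http; the request itself still goes to the original URL.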

{{< figure src="/scrape/dev_tools3.png" title="MD5_SECRET_KEY hidden in plain sight">}}

The secret code was hidden only a few keystrokes into the source. Now that we have all the puzzle pieces, let's hack together a simple Python script to automate the download process:

{{<highlight python>}}
import hashlib
import json

import requests

# Fill these in with the values recovered above (left blank here on purpose)
MD5_SECRET = ""
URL = ""
BOOKID = ""
AUTHKEY = ""


def get_page(page):
    query = "%s/ebook/pdfplayer/getpdfpage?globalbookid=%s&pdfpage=%s&iscover=N&authkey=%s" % (
        URL, BOOKID, page["pdfPath"], AUTHKEY)

    # Generate the 'hsid' verification hash. Like the JS, hash the http form of the URL
    verification = hashlib.md5((MD5_SECRET + query.replace("https", "http", 1)).encode()).hexdigest()

    r = requests.get("%s&hsid=%s" % (query, verification))
    print(r.status_code)

    # Write the raw pdf response to a file
    with open(BOOKID + "_" + str(page["pageOrder"]) + "_" + page["bookPageNumber"] + ".pdf", "wb") as out:
        out.write(r.content)


# To save time, I manually saved the content of /getpagedetails to file
with open("book.json") as f:
    for page in json.load(f)[0]["pdfPlayerPageInfoTOList"]:
        get_page(page)
{{</highlight>}}

To stitch the pages together, I used pdfunite:

{{<highlight bash>}}
pdfunite $(ls -v) output.pdf
{{</highlight>}}
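The `ls -v` part matters because the file names embed the page order, and a plain alphabetical sort would put page 10 before page 2. If you would rather drive this step from Python, here is a sketch of a natural sort that approximates `ls -v` for names like these. The `natural_key` and `unite` helpers and the sample file names are hypothetical; `unite` only calls `pdfunite` when it is found on the PATH.

```python
import re
import shutil
import subprocess
from pathlib import Path


def natural_key(name):
    """Split a name into text and number chunks so that 2 sorts before 10."""
    # re.split with a capturing group alternates text (even index)
    # and digit runs (odd index), so the chunk types always line up.
    chunks = re.split(r"(\d+)", name)
    return [int(c) if i % 2 else c for i, c in enumerate(chunks)]


def unite(directory, output="output.pdf"):
    """Merge every PDF in `directory` (except `output`) in natural order."""
    pages = sorted(
        (p.name for p in Path(directory).glob("*.pdf") if p.name != output),
        key=natural_key,
    )
    if pages and shutil.which("pdfunite"):
        subprocess.run(["pdfunite", *pages, output], cwd=directory, check=True)
    return pages


# Made-up file names following the BOOKID_pageOrder_bookPageNumber.pdf pattern
names = ["BOOK_10_8.pdf", "BOOK_2_ii.pdf", "BOOK_1_i.pdf"]
print(sorted(names, key=natural_key))
```

Unlike the one-liner, `unite` skips `output.pdf` itself, so running it a second time in the same directory does not feed the merged file back in.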

Now you have a digital version of that book in a quality you couldn't buy even if you wanted to.
