Tuesday Tip: Capturing Web Content

As is often the case with DEVONthink, you have a number of options for capturing web content:

  1. Drag the URL to save only the bookmark
  2. ‘Print’ to PDF
  3. Save the page as a web archive
  4. Save only the page’s HTML code
  5. Copy text and images and save the clipping as rich text

    All these options have their advantages and disadvantages:

    Saving only the page’s HTML code is usually interesting only if you are a web developer. It keeps the page but not the linked images, so as soon as the images become unavailable the page layout looks ugly.
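To see what an HTML-only capture depends on, the following minimal sketch lists the images a page links to. It uses only Python's standard library; the class name, the function name, and the example.com addresses are illustrative assumptions, and it says nothing about how DEVONthink itself stores HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkCollector(HTMLParser):
    """Collect the addresses of all <img> elements in a page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page's own address.
                self.image_urls.append(urljoin(self.base_url, src))


def linked_images(html, base_url):
    """Return the absolute URLs of every image the HTML refers to."""
    collector = ImageLinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.image_urls


# Made-up sample page: none of these images are part of the HTML itself.
sample = '<p>Hello</p><img src="/img/logo.png"><img src="https://cdn.example.com/a.jpg">'
for url in linked_images(sample, "https://www.example.com/tips/page.html"):
    print(url)
```

Every address printed is a file that lives outside the saved HTML, which is why the page degrades once those files disappear.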

    Web archives are better, as they also keep all the linked materials and so preserve the look and feel of the original page. But web archives are saved by the Safari engine in a file format proprietary to Apple, so they may become unreadable in the future should Apple decide to drop support for it. In the last 20 years we have seen many file formats come and go. PDF keeps the layout and the content, and it’s relatively future-proof as it is also an ISO standard.

    One thing all these options have in common is that they keep the original layout, but they also store unnecessary elements such as navigation, advertising, or other text not related to the core information you want to keep, even though DEVONthink tries to strip them where possible. This does, of course, decrease the effectiveness of the AI engine.

    If you are only interested in the actual information, the best option may be to select text and images and drag them from Safari or DEVONagent to your database. This saves the selection, including images, as a rich text document, which should be relatively future-proof (as it is a widely used standard), and it saves only the data you are interested in, not n copies of the words ‘Home’ and ‘Back’ 🙂

    So, depending on your needs, saving only the interesting parts of a web page can be more efficient than saving the whole page as a web archive or PDF. If you are interested in the original look of the page, PDF is a future-proof, standards-based option.
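The idea of keeping only the core text can be illustrated with a small sketch. It is not how DEVONthink strips page elements; it is a simplified stand-in using Python's standard library, and the set of skipped tags and the names `SKIPPED_TAGS`, `CoreTextExtractor`, and `core_text` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements assumed to hold page furniture rather than content (an illustrative choice).
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class CoreTextExtractor(HTMLParser):
    """Keep text that sits outside the skipped elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def core_text(html):
    """Return the page text with navigation-like elements left out."""
    extractor = CoreTextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)


# Made-up sample page with the usual 'Home' and 'Back' links around the content.
sample = (
    '<nav><a href="/">Home</a> <a href="#">Back</a></nav>'
    "<article><h1>Tip</h1><p>Save only what you need.</p></article>"
    "<footer>Home</footer>"
)
print(core_text(sample))
```

Real pages rarely mark their navigation this cleanly, which is one reason selecting the text yourself remains the most reliable way to keep only what matters.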

    12 Responses to “Tuesday Tip: Capturing Web Content”

    1. The print to PDF option creates PDFs without text (unlike WebSnapper’s vector PDF solution). Likewise, using a tool such as SiteCapture creates image files, without text (perhaps this should be option 6 above).

      Would it be possible to use DevonThink to have these files searchable? I.e. through the OCR feature or similar?

    2. Eric says:

      The Print to PDF option should produce PDFs with text. But images can be run through OCR, of course.

    3. derek says:

      It would be great if text copied/dragged from a web page would retain its URL when pasted into DT. Maybe this is possible now with some command/script unknown to me?

    4. Eric says:

      This depends solely on the application from which the drag originates. From Safari the URL is retained; from other browsers this is not necessarily the case.

    5. Marty says:

      Need to find software equivalent to Adobe Acrobat Distiller 6, which I used on an old Windows98 computer. Within the limits of time and storage space, and depending how complicated the web pages, Distiller could convert an entire website into a PDF, with links intact. Each link in the final PDF would either still jump to the correct internet page, or jump to some point inside the PDF file, if that particular page had been downloaded. Download could be from one URL to all links, to a specific number of levels. Alternatively, the user could click on links in the PDF, one at a time, and see them downloaded and merged into the PDF. Anyone know of Mac OSX software that can do this? Sometimes, you need to save as much as you can of a web site, before it changes or disappears. (I recently changed from Win98 to Mac OSX.) Might DevonThink possibly learn how to do this html to pdf conversion?

    6. Eric says:

      @Marty: You could use DEVONthink Pro’s download manager to download a web page tree with link following. This gives you only HTML so you would need to create an AppleScript script to actually convert them into PDFs. Still, this is way more cumbersome than what Distiller 6 did in its days.

    7. Adam says:

      Whenever I try to drag a clipping into DEVONthink or make a note from the clipboard, an html file is created instead of a rich text file. Is there any way to fix this?

    8. Adam says:

      It should be noted that I am using Google Chrome. Is all of DEVONnote’s web functionality centered around safari? When will better support arrive for popular alternatives?

    9. Eric says:

      @ Adam: DEVONthink takes the best choice of what the source application gives it. In the case of Chrome it seems that what arrives on DEVONthink’s side is an HTML file. Please ask Google to improve this behavior and, e.g., also deliver rich text in a drag-and-drop operation or via the clipboard. Alternatively, use a different browser such as Safari.

    10. Ralph Bauer says:

      New user here. Reading the above, I tried to save the text from this page by selecting it, then clicking Print, then “save selection only”, then “save to pdf”, then saving to DEVONthink. This was my first save. I was presented with the option to save it once, in “documents”, but navigated to DEVONthink and saved it there.

    11. Ralph Bauer says:

      whoops! Hit send too early. When I looked at what I had saved, the file was blank. So this method needs some clarification- but in theory ought to be both memory- & time- efficient…


    12. eboehnisch says:

      @ Ralph: You don’t need to do all of them. The options are not steps but single actions 🙂 Just e.g. select some text and drag that to DEVONthink’s icon in the Dock. Alternative: use the Clip to DEVONthink extension that DEVONthink can install for you (via DEVONthink > Install Add-ons).