Automating the boring stuff: Data scraping from Outlook, IMAP & Two Factor Authentication

Prerequisites:

  • Python 2.x

This article is a short summary of what I did for a friend to automate his data scraping needs. The task was to read IDs/keywords from an Excel sheet and populate data against those IDs by searching for matching emails and scraping data from them. I won’t cover every step, but I will cover the challenges I faced and how I solved them.

The first thing that comes to mind for such a task is imaplib, the IMAP module in Python’s standard library. So, I started my tmux session, fired up Vim and started working on the script. With the help of imaplib, I was able to access the Outlook/Office 365 mailbox. Here is a small snippet for login:

import imaplib
import sys

import config  # local module holding imap_server and imap_port


def outlook_login(username, password):
    """Signs in to a mailbox using IMAP.

    :param username:
        Username of the mailbox.
    :param password:
        Password of the mailbox.
    :return:
        IMAP object on successful login else exits script.
    """
    try:
        imap = imaplib.IMAP4_SSL(
                config.imap_server,
                config.imap_port
                )
        # login() returns a (status, data) pair; data[0] is the server's message
        r, d = imap.login(username, password)
        print d[0]
        return imap
    except Exception as e:
        print e, " Aborting..."
        sys.exit(1)

def main():
    imap = outlook_login('gourav.chawla@domain.tld', 'password')

Challenge 1: Two factor authentication

The script I was writing didn’t take 2FA into account. After a bit of searching, though, I found a simple solution.

Solution: App passwords. An app password allows access to the Office 365 account from client applications like Outlook, Word, etc. In my case, I just had to replace the account password with the app password and I was good to go.
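
Since the app password simply takes the place of the account password, one way to keep it out of the script is to read it from the environment. This is only a sketch: the variable name OUTLOOK_APP_PASSWORD and the helper load_app_password are made up for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
import os


def load_app_password(env_var="OUTLOOK_APP_PASSWORD", environ=None):
    """Return the app password stored in an environment variable.

    `env_var` is a hypothetical variable name chosen for this sketch.
    `environ` defaults to os.environ and exists only to make testing easy.
    """
    environ = os.environ if environ is None else environ
    password = environ.get(env_var)
    if not password:
        raise RuntimeError("Set %s to your Office 365 app password" % env_var)
    return password
```

The login call would then look like outlook_login(username, load_app_password()) instead of passing a hard-coded string.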

After logging in, I quickly realized that the mailbox I had logged in to was not the one I wanted to access. I wanted to access the shared mailboxes provided by Office 365.

Challenge 2: Sign in to a shared mailbox

Solution: This one was also easy: put the shared mailbox’s email/alias after the username, separated by a \ like this (placeholder address): username@domain.tld\alias-name

In my case, it was something like (placeholder addresses again): username@domain.tld\shared-mailbox@domain.tld
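
In Python source, the backslash in that login name has to be escaped (or written in a raw string), which is easy to get wrong. A tiny helper, sketched here with the hypothetical name shared_mailbox_login, keeps that in one place:

```python
def shared_mailbox_login(user_email, shared_mailbox):
    """Build the IMAP login name for a shared mailbox.

    The result is the user's own address, a single backslash, and the
    shared mailbox's email or alias.
    """
    # "\\" is one literal backslash character in the resulting string.
    return user_email + "\\" + shared_mailbox
```

The returned string is passed as the username to the login function, together with the user’s own (app) password.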

Once that was taken care of and I could access the correct mailbox, I had to find emails by searching for a keyword in the subject. Here is a snippet for searching:

def main():
    # ...Code removed for brevity.
    # Select the Inbox for further operations
    imap.select("Inbox")
    print "Selected Inbox..."
    # Search for a keyword in the subject of all emails
    keyword = 'Delivery'
    typ, data = imap.search(
            None, 'ALL',
            '(SUBJECT "'+keyword+'")')

The above search, if successful, returns a list holding a single string of space-separated message IDs, like this: ['142 123 111']
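
To make the shape of that response concrete, here is a minimal sketch of two small helpers. Both names (subject_criterion and message_ids) are hypothetical and not part of imaplib. The first builds the SUBJECT criterion and escapes quotes and backslashes in the keyword; the second turns the response into a list of IDs. It is written to run on Python 2.7 and Python 3 (where imaplib returns bytes rather than str).

```python
def subject_criterion(keyword):
    """Build an IMAP SUBJECT search criterion for `keyword`.

    Backslashes and double quotes are escaped so that the keyword cannot
    break out of the quoted string.
    """
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return '(SUBJECT "%s")' % escaped


def message_ids(data):
    """Split the data part of a search response into individual IDs.

    An empty response (nothing matched) yields an empty list.
    """
    if not data or not data[0]:
        return []
    return data[0].split()
```

With these, the search call would read imap.search(None, 'ALL', subject_criterion(keyword)), and message_ids(data) gives the IDs to loop over.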

Now, using these IDs, I can fetch each email from the server and scrape the content I need. Here is the snippet for fetching an email and then parsing it:

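import email
from email.header import decode_header

# parser.parse below is assumed to come from the third-party python-dateutil package
from dateutil import parser


def getEmail(imap, id):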
    """Fetches email from the server.

    :param imap:
        IMAP class instance.
    :param id:
        The id of the email fulfilling the search criteria.
    :return:
        Email message.
    """
    r, d = imap.fetch(id, "(RFC822)")
    raw_email = d[0][1]
    email_message = email.message_from_string(raw_email)
    return email_message


def mailbody(email_message):
    """Extracts mailbody from Email message.

    :param email_message:
        Email message returned by getEmail function.
    :return:
        Email body.
    """
    # A single-part message has only one payload, so return it as is.
    if not email_message.is_multipart():
        return email_message.get_payload(decode=True)
    body = None
    # walk() already recurses into nested multipart containers,
    # so one loop visits every part. The last text/html part wins.
    for part in email_message.walk():
        if part.get_content_type() == 'text/html':
            body = part.get_payload(decode=True)
    return body


def parse(imap, id):
    """Fetches email from server and returns required data.

    :param imap:
        IMAP class instance.
    :param id:
        Id of email matched by search criteria.
    :return:
        Dictionary containing required data.
    """
    result = {}
    email_message = getEmail(imap, id)
    result['From'] = email_message['From']
    result['To'] = email_message['To']
    result['Date'] = parser.parse(email_message['Date'])
    result['Subject'] = (decode_header(email_message['Subject'])[0][0])
    result['Body'] = mailbody(email_message)
    return result

def main():
    # ...Code removed for brevity
    # Sends empty string at 0th index of data if nothing is found
    if data[0] == '':
        print "Could not find the email for keyword: %s" % keyword
    else:
        for id in data[0].split():
            result = parse(imap, id)
            print "\n", result['To'], result['Subject'], "\n", result['Body']

The above code fetches the email and stores its fields in a dictionary called ‘result’. Whatever information you need can then be written to an Excel sheet using ‘openpyxl’.
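
As a rough illustration of that last step, the sketch below writes a list of such dictionaries into a workbook, one row per email. The column selection and the names COLUMNS, as_text and results_to_workbook are assumptions made for this example, not the original script. Values are stored as text, which also sidesteps the fact that Excel cells cannot hold timezone-aware dates.

```python
from openpyxl import Workbook

# Illustrative choice of columns; pick whichever keys you need.
COLUMNS = ["From", "To", "Date", "Subject"]


def as_text(value):
    """Render a value as text, using an empty string for missing data."""
    return u"" if value is None else u"%s" % (value,)


def results_to_workbook(results):
    """Return a workbook with a header row and one row per result dict."""
    wb = Workbook()
    ws = wb.active
    ws.append(COLUMNS)
    for result in results:
        ws.append([as_text(result.get(col)) for col in COLUMNS])
    return wb
```

Calling results_to_workbook(results).save("output.xlsx") would then write the file; the filename here is just a placeholder.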

In the mailbody function above, you might have noticed that I’m extracting the HTML content of the email body. This is because there was a table in the body that I had to scrape. I used BeautifulSoup to do that.
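
The table layout in my emails was specific to that mailbox, so the following is only a generic sketch of the idea: take the HTML body, find the first table, and return its rows as lists of cell text. The function name table_rows is hypothetical, and nested tables are not handled.

```python
from bs4 import BeautifulSoup


def table_rows(html):
    """Return the first table in `html` as a list of rows of cell text."""
    # "html.parser" is the parser bundled with Python, so no extra install.
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows
```

Passing result['Body'] to table_rows gives plain lists that are easy to match against the IDs from the Excel sheet.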

If you have any questions, feel free to ask them in the comments.
